Commit graph

10 commits

Author SHA1 Message Date
Andre Manoel
61cdeefb17
feat: make async engine the default execution path (#592)
Some checks failed
CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Coverage Check (Python 3.11) (push) Waiting to run
CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
Publish devnotes / deploy (push) Has been cancelled
* feat: make async engine the default execution path

The async engine has been hardening as opt-in for several releases. Make it
the default and address the prerequisites flagged for the flip.

Default flip
- DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites
- Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out
- allow_resize=True still falls back to sync with a DeprecationWarning

Python 3.10 support
- Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with
  gather-with-explicit-cancel; semantics preserved because _run_task already
  swallows its own exceptions and uses _shutdown_event for sibling cancellation
- Remove the sys.version_info < (3, 11) runtime guard
- Remove the matching pytest skipif so the executor tests run on 3.10 too

Derived timeouts (replaces two hardcoded 300s constants)
- ThrottleManager.acquire_sync/async default to timeout=None (no deadline)
  instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds
  actual work, queue waits scale with provider speed and AIMD
- _AsyncBridgedModelFacade derives the sync->async bridge timeout from the
  model's inference_parameters.timeout and the call's max_correction_steps;
  one knob (per-model timeout) drives both deadlines, no new config surface
- Add ModelFacade.request_timeout property so the bridge can read the
  effective timeout the client is configured with

Root-cause surfacing
- AsyncTaskScheduler captures the first non-retryable error and exposes it
  via first_non_retryable_error
- Interface threads it through DataDesignerGenerationError when 0 records
  are produced without early-shutdown, so deterministic failures (e.g. bad
  seed sources) surface their original message instead of a wrapped
  FileNotFoundError on the parquet path

Tests
- New: throttle no-deadline default behavior (sync+async), parametrized
  derived bridge timeout, restored async_concurrency tests on 3.10
- Updated: test_dataset_builder.py uses an autouse fixture to pin its
  Mock-based tests to the sync engine they cover; existing bridge tests
  set facade.request_timeout for the new derivation

Docs
- Replace the stale LiteLLM security notice in README with a short
  async-default heads-up and link to the migration guide
- Add docs/migration-async-default.md covering per-model timeouts,
  custom-column thread safety, mocking model calls, run outcomes, and
  the opt-out
- Append a short Update section to the async-all-the-way-down dev note

* test: extract _compute_bridge_timeout helper for direct testing

The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result``
to capture the timeout the bridge passed in. That reaches into stdlib internals
(DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument
on the parametrize was missing.

Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test
now calls the helper directly with no mocking, and the parametrize gets readable ids.
Behavior is unchanged.

* test(e2e): align demo plugins with async engine contracts

The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two
of them were written against sync-engine semantics that the async engine
restricts:

- DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no
  required_columns. The async engine routes ``no-upstream`` columns
  through the from-scratch path, which passes an empty DataFrame to
  generators that aren't FromScratchColumnGenerator subclasses. The
  generator then produces 0 rows and the scheduler raises
  ``update_batch received 0 values``. Switching the plugin to
  FromScratchColumnGenerator with generate_from_scratch(num_records)
  matches what the plugin actually does (produces a constant column
  without input) and works on both engines.

- RegexFilterProcessor implemented process_before_batch with row-count
  changes. The async engine enforces row-count invariance in pre- and
  post-batch processor stages by design. Moving the filter to
  process_after_generation preserves the plugin's purpose (regex-based
  row filtering) at a stage that supports row-count changes on both
  engines. Test assertions check the final dataset, so the stage shift
  is transparent.

Both changes are demo-plugin updates only; no production code change.

* fix: address Codex review findings on async-default flip

Three bugs and two test-quality concerns surfaced by an independent review of
the prior commits. Each was real and worth fixing in the flip PR.

Bug fixes
- Sync-fallback path was creating async-only model clients. The default flip
  meant ``client_concurrency_mode = ASYNC`` for every default run, but the
  ``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()``
  calls then hit ``SyncClientUnavailableError``. The resolution decision now
  lives at the DataDesigner interface level via
  ``_resolve_client_concurrency_mode``: it considers both the env var and the
  config (allow_resize forces sync clients) and is passed explicitly to
  ``create_resource_provider``. Direct callers of the factory still get the
  env-var default.

- Sync→async bridge timeout ignored the per-call ``timeout=`` override. A
  custom column calling ``model.generate(timeout=600)`` against a slow endpoint
  was being cancelled at the model-config default, not 600s. The bridge now
  prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``.

- Bridge timeout formula missed ``max_conversation_restarts``. One logical
  generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)``
  HTTP requests; the formula now multiplies both, matching the worst-case
  attempt budget.

Engine routing fix (also surfaced by failing e2e plugin tests)
- ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch
  generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn``
  with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now
  passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the
  sync engine's FULL_COLUMN contract.
- The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``)
  is reverted; the engine fix subsumes it. The processor-plugin fix
  (``process_after_generation`` for the regex filter) stays — pre-batch
  row-count change is intentionally rejected by the async engine.

Test improvements
- Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)``
  and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct
  from any finite default. Sync and async counterparts mirror.
- New regression tests for ``first_non_retryable_error`` surfacing covering
  both load-raises and load-returns-empty paths, asserting the original
  exception is chained via ``__cause__`` and that the typed
  ``DataDesignerEarlyShutdownError`` doesn't fire in this branch.
- New parametrized regression test for ``_resolve_client_concurrency_mode``
  covering all four (env × allow_resize) combinations.
- New parametrized test for the per-call ``timeout=`` override flowing into
  the bridge timeout calculation.
- Bridge formula tests extended with ``max_conversation_restarts`` cases.

* test: trim redundant parametrize cases in async-default tests

Three parametrize cases were duplicating coverage already provided by
existing standalone tests:

- ``test_acquire_*_timeout_branches`` parametrized over ``(0.0, raises)``
  and ``(None, waits)``. The ``raises`` half duplicates
  ``test_acquire_*_raises_timeout_when_at_capacity``. Replaced with two
  focused ``..._default_no_deadline_waits_for_release`` tests covering
  only the no-deadline branch.

- ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four
  cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the
  env var alone forces it; the allow_resize input is moot. Dropped.

- ``test_async_bridge_honors_per_call_timeout`` had three cases. The
  "override below floor" case cross-products the per-call override flow
  with the floor-clamping behavior already covered by
  ``test_compute_bridge_timeout``. Dropped.

Net: -25 lines of test code with no loss of essential coverage.

* docs: fold migration page into existing concept docs

The standalone ``Migrating to the async default`` page didn't fit the
existing docs style — present tense, behavior over comparisons, content
in the natural concept home. Folding it in:

- ``architecture-and-performance.md`` gets a new ``Async Engine`` section
  covering per-model timeouts, run outcomes (partial completion +
  ``DataDesignerEarlyShutdownError``), and the transitional opt-out.
  Three stale ``async engine is landing soon`` callouts updated to
  reflect the flip.
- ``custom_columns.md`` gets two short notes: a thread-safety callout
  near Generation Strategies, and a mocking-with-spec note in
  Development Testing.
- ``async-all-the-way-down.md`` Update section now points at the new
  arch-and-perf section.
- README heads-up links to the same anchor.
- ``migration-async-default.md`` removed; mkdocs.yml entry dropped.

* docs: frame Execution Model as sync-engine specifics

Small targeted edits to make the user-facing concept docs consistent
with the post-flip state. No restructuring.

- ``architecture-and-performance.md``: the ``Execution Model`` callout
  now opens with two engines, links to the new ``Async Engine`` section,
  and frames the existing column-at-a-time description as sync-engine
  semantics. The ``Step 2: Process columns sequentially`` paragraph notes
  the async engine relaxes this. The ``Key Concepts`` table differentiates
  per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells``
  is the same on both.
- ``processors.md``: added a warning callout about the async engine's
  row-count invariance in pre- and post-batch stages, with the guidance
  to use ``process_after_generation()`` for row-filtering or expansion.

* fix: address review nits from PR #592 (Nabin)

Four targeted fixes from the review.

Worth-addressing (warning):
- ``test_acquire_async_default_no_deadline_waits_for_release`` was
  spawning the release task without holding a strong reference. The
  loop's weak-ref bookkeeping could GC it before the inner ``await``
  observes the release, producing a CI flake. Hold the task and
  ``await`` it in ``finally``.

Take-it-or-leave-it (applied):
- Root-cause error surfacing now includes the exception type name:
  ``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see
  ``ValueError: ...`` instead of just the message string. The
  ``__cause__`` chain is preserved either way.
- Drop the defensive ``getattr(c, "allow_resize", False)`` in
  ``_resolve_client_concurrency_mode`` — every member of
  ``ColumnConfigT`` inherits ``allow_resize: bool = False`` from
  ``SingleColumnConfig``.
- One-line comment near the root-cause surfacing branch noting that
  ``actual_num_records == 0`` is async-only (sync runs leave it at
  ``-1``), so the branch is async-only by construction.

Not addressed in this PR (filing as follow-ups):
- ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in
  ``column_generators/generators/base.py:_run_coroutine_sync``. That
  bridge has no model-facade context to derive a timeout from, so the
  fix is a structural refactor outside this PR's scope.
- First-error capture loses subsequent-error context. The "first wins"
  heuristic is documented; richer aggregation is a follow-up.

* fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync

This was the third hardcoded 300s timeout (Nabin flagged it on PR #592).
The path is the generic sync→async bridge in ``ColumnGenerator.generate()``:
when a subclass overrides only ``agenerate()``, the sync entry point runs
the coroutine in a background thread.

Same philosophy we applied to the throttle queue wait elsewhere in the
PR: a defensive deadline on top of work that's already bounded by the
HTTP timeout doesn't add safety, it just produces spurious failures on
slow self-hosted endpoints. Drop the constant, the timeout exception
handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)``
becomes the simple cleanup.

Tests in ``test_async_generators.py`` exercise the happy path only and
don't depend on the timeout firing.

* Revert "fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync"

This reverts commit 7a0b77d44c.

* docs+feat: deprecate the sync-engine opt-out path

Nabin asked whether the docs should adopt explicit "deprecation" language
on the opt-out path. Doing both:

- Doc: ``architecture-and-performance.md``'s ``Opting out`` section now
  uses an ``!!! warning "Deprecated"`` admonition that names the env var
  as a deprecated escape hatch and notes the run-time warning.
- Code: ``DataDesigner._resolve_client_concurrency_mode`` emits a
  ``DeprecationWarning`` when ``DATA_DESIGNER_ASYNC_ENGINE=0`` is detected.
  Same precedent as the existing ``allow_resize=True`` warning. Auto-fallback
  via ``allow_resize`` does not double-warn here; the builder layer emits
  its own warning later.
- Test: parametrized regression now asserts ``pytest.warns(DeprecationWarning)``
  on the opt-out branch and treats any warning on the async-on branches as
  a failure (``simplefilter("error")`` inside the ``catch_warnings`` block).

* fix: emit logger.warning alongside DeprecationWarning on env-var opt-out

Parity fix from Nabin's re-review of PR #592. The ``allow_resize=True``
auto-fallback path in ``_resolve_async_compatibility`` emits both a
``logger.warning("⚠️ ...")`` and a ``DeprecationWarning``. The new
``DATA_DESIGNER_ASYNC_ENGINE=0`` opt-out path was only emitting the
``DeprecationWarning``, leaving users who run with default warning
filters silenced and inconsistent with the established precedent.

Match the pattern: same message body, both signals, same stacklevel.

* docs: breadcrumb explaining why SYNC_BRIDGE_TIMEOUT survives PR #592

Nabin's re-review pointed out that ``base.py`` is the lone place where
the 300s pattern survives, while ``custom.py`` and ``throttle_manager.py``
both retired theirs. Without a comment, a future reader (or a lint sweep)
could mistake this for an oversight and "consistency-fix" it the wrong way.

Add a short note at the constant naming the two retired siblings, the
reason this one stayed (no ``ModelFacade`` context to derive from), and
the fact that it's tracked for a structural follow-up.
2026-05-04 16:22:13 -03:00
Eric W. Tramel
28c8345909
feat: add built-in filesystem seed readers (#421) 2026-03-16 17:40:27 -04:00
Andre Manoel
982ce79ca9
feat: add processor plugin support (#299)
* feat: add processor plugin support

Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).

- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation

* test: add processor plugin registration test

Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.

* test: simplify processor plugin registration test

* move ProcessorConfig to base and convert demo to e2e test

- Move ProcessorConfig from processors.py to config.base to guard
  against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/

* move plan to plans/299/
2026-02-25 16:40:01 -03:00
Johnny Greco
1439bbea7e
chore: Improve CLI startup with lazy heavy import cleanup (#330)
* perf: defer heavy imports to improve CLI startup time

Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.

Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations

Reduces CLI import-time from ~1.67s to ~0.46s.

* perf: defer pandas/numpy in io_helpers and add config_list benchmark

- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers
  with module-level __getattr__ (for backwards-compatible external
  access / test mocks) and function-level imports in the 3 functions
  that actually use them (read_parquet_dataset, smart_load_dataframe,
  _convert_to_serializable). Importing io_helpers no longer triggers
  pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function
  bodies to avoid loading repositories, Rich, and prompt_toolkit at
  module import time.
- Add `config_list` (data-designer config list) measurement to the
  CLI startup benchmark with isolated cold measurement in a separate
  venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.

* Refine lazy import usage and TYPE_CHECKING cleanup

* Run license header updater on PR-touched files

* fix: update sqlfluff mock target for lazy imports in test_sql

* perf: cache globals() in lazy __getattr__ to avoid repeated lookups

Add globals() caching and explanatory comment to all three lazy
__getattr__ implementations (lazy_heavy_imports, config/__init__,
interface/__init__) so subsequent attribute accesses bypass __getattr__.

* perf: lazy CLI command loading and deferred heavy import evaluations

- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files

- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes

- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks

- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use

- Update test mock targets to patch at usage-site for module-level imports

* refactor: use direct pandas import in seed_source_dataframe

Drop lazy-loading for pandas in DataFrameSeedSource; use direct import
for simplicity.

* update lazy import pattern

* update tests to use lazy import namespace

Switch test modules to import data_designer.lazy_heavy_imports as lazy
and reference heavy libraries through that namespace. This keeps heavy
imports deferred during module import and aligns tests with the new
lazy-import usage pattern.

* tighten import perf test thresholds

Document recent baseline timings and lower the allowed average
import time and timeout so regressions are detected sooner.

* document pandas import requirement

Clarify that Pydantic needs DataFrame resolved at module load and
that keeping the direct import preserves IDE typing support.

* increase timeout time

* use lazy pandas imports in visualization tests

- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted

* fix lazy pandas runtime usage and preview mocks

Switch sample-record handling to lazy pandas types so runtime paths no longer
depend on TYPE_CHECKING imports. Align preview controller tests to patch the
module-local DataDesigner symbol, preventing real engine invocation in save
results scenarios.
2026-02-18 16:24:15 -05:00
Johnny Greco
87119a545b
refactor: move SingleColumnConfig to config.base module (#287)
* create top-level base file

* add note

* update license header

* move exportable config and move base to config module

* update references in docs

* do not include single column config in init

* add inverse import order e2e test
2026-02-03 14:04:04 -05:00
Eric W. Tramel
e6e58e692e
feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248) 2026-02-02 09:41:58 -05:00
Nabin Mulepati
eb5ef279ab
fix litellm issue with lazy load (#228) 2026-01-16 17:41:06 -07:00
Johnny Greco
1ee37bc317
refactor: update single column base class (#206)
* make properties abstract

* add private column emoji attribute

* update e2e plugin tests

* throw error if not default string

* add unit tests

* make emoji a static method

* dont need that docstring

* update unit test
2026-01-15 14:44:58 -05:00
Johnny Greco
3d9f5185d7
refactor: remove task metadata property (#216)
* remove metadata

* docs and tests

* don't need that test

* use static method for generation strategy

* update docs

* add docstring
2026-01-15 14:12:11 -05:00
Johnny Greco
367de1a063
rename (#214) 2026-01-14 15:26:46 -05:00