mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
10 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
61cdeefb17
|
feat: make async engine the default execution path (#592)
Some checks failed
CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Coverage Check (Python 3.11) (push) Waiting to run
CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
Publish devnotes / deploy (push) Has been cancelled
* feat: make async engine the default execution path
The async engine has been hardening as opt-in for several releases. Make it
the default and address the prerequisites flagged for the flip.
Default flip
- DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites
- Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out
- allow_resize=True still falls back to sync with a DeprecationWarning
Python 3.10 support
- Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with
gather-with-explicit-cancel; semantics preserved because _run_task already
swallows its own exceptions and uses _shutdown_event for sibling cancellation
- Remove the sys.version_info < (3, 11) runtime guard
- Remove the matching pytest skipif so the executor tests run on 3.10 too
Derived timeouts (replaces two hardcoded 300s constants)
- ThrottleManager.acquire_sync/async default to timeout=None (no deadline)
instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds
actual work, queue waits scale with provider speed and AIMD
- _AsyncBridgedModelFacade derives the sync->async bridge timeout from the
model's inference_parameters.timeout and the call's max_correction_steps;
one knob (per-model timeout) drives both deadlines, no new config surface
- Add ModelFacade.request_timeout property so the bridge can read the
effective timeout the client is configured with
Root-cause surfacing
- AsyncTaskScheduler captures the first non-retryable error and exposes it
via first_non_retryable_error
- Interface threads it through DataDesignerGenerationError when 0 records
are produced without early-shutdown, so deterministic failures (e.g. bad
seed sources) surface their original message instead of a wrapped
FileNotFoundError on the parquet path
Tests
- New: throttle no-deadline default behavior (sync+async), parametrized
derived bridge timeout, restored async_concurrency tests on 3.10
- Updated: test_dataset_builder.py uses an autouse fixture to pin its
Mock-based tests to the sync engine they cover; existing bridge tests
set facade.request_timeout for the new derivation
Docs
- Replace the stale LiteLLM security notice in README with a short
async-default heads-up and link to the migration guide
- Add docs/migration-async-default.md covering per-model timeouts,
custom-column thread safety, mocking model calls, run outcomes, and
the opt-out
- Append a short Update section to the async-all-the-way-down dev note
* test: extract _compute_bridge_timeout helper for direct testing
The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result``
to capture the timeout the bridge passed in. That reaches into stdlib internals
(DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument
on the parametrize was missing.
Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test
now calls the helper directly with no mocking, and the parametrize gets readable ids.
Behavior is unchanged.
* test(e2e): align demo plugins with async engine contracts
The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two
of them were written against sync-engine semantics that the async engine
restricts:
- DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no
required_columns. The async engine routes ``no-upstream`` columns
through the from-scratch path, which passes an empty DataFrame to
generators that aren't FromScratchColumnGenerator subclasses. The
generator then produces 0 rows and the scheduler raises
``update_batch received 0 values``. Switching the plugin to
FromScratchColumnGenerator with generate_from_scratch(num_records)
matches what the plugin actually does (produces a constant column
without input) and works on both engines.
- RegexFilterProcessor implemented process_before_batch with row-count
changes. The async engine enforces row-count invariance in pre- and
post-batch processor stages by design. Moving the filter to
process_after_generation preserves the plugin's purpose (regex-based
row filtering) at a stage that supports row-count changes on both
engines. Test assertions check the final dataset, so the stage shift
is transparent.
Both changes are demo-plugin updates only; no production code change.
* fix: address Codex review findings on async-default flip
Three bugs and two test-quality concerns surfaced by an independent review of
the prior commits. Each was real and worth fixing in the flip PR.
Bug fixes
- Sync-fallback path was creating async-only model clients. The default flip
meant ``client_concurrency_mode = ASYNC`` for every default run, but the
``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()``
calls then hit ``SyncClientUnavailableError``. The resolution decision now
lives at the DataDesigner interface level via
``_resolve_client_concurrency_mode``: it considers both the env var and the
config (allow_resize forces sync clients) and is passed explicitly to
``create_resource_provider``. Direct callers of the factory still get the
env-var default.
- Sync→async bridge timeout ignored the per-call ``timeout=`` override. A
custom column calling ``model.generate(timeout=600)`` against a slow endpoint
was being cancelled at the model-config default, not 600s. The bridge now
prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``.
- Bridge timeout formula missed ``max_conversation_restarts``. One logical
generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)``
HTTP requests; the formula now multiplies both, matching the worst-case
attempt budget.
Engine routing fix (also surfaced by failing e2e plugin tests)
- ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch
generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn``
with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now
passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the
sync engine's FULL_COLUMN contract.
- The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``)
is reverted; the engine fix subsumes it. The processor-plugin fix
(``process_after_generation`` for the regex filter) stays — pre-batch
row-count change is intentionally rejected by the async engine.
Test improvements
- Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)``
and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct
from any finite default. Sync and async counterparts mirror.
- New regression tests for ``first_non_retryable_error`` surfacing covering
both load-raises and load-returns-empty paths, asserting the original
exception is chained via ``__cause__`` and that the typed
``DataDesignerEarlyShutdownError`` doesn't fire in this branch.
- New parametrized regression test for ``_resolve_client_concurrency_mode``
covering all four (env × allow_resize) combinations.
- New parametrized test for the per-call ``timeout=`` override flowing into
the bridge timeout calculation.
- Bridge formula tests extended with ``max_conversation_restarts`` cases.
* test: trim redundant parametrize cases in async-default tests
Three parametrize cases were duplicating coverage already provided by
existing standalone tests:
- ``test_acquire_*_timeout_branches`` parametrized over ``(0.0, raises)``
and ``(None, waits)``. The ``raises`` half duplicates
``test_acquire_*_raises_timeout_when_at_capacity``. Replaced with two
focused ``..._default_no_deadline_waits_for_release`` tests covering
only the no-deadline branch.
- ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four
cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the
env var alone forces it; the allow_resize input is moot. Dropped.
- ``test_async_bridge_honors_per_call_timeout`` had three cases. The
"override below floor" case cross-products the per-call override flow
with the floor-clamping behavior already covered by
``test_compute_bridge_timeout``. Dropped.
Net: -25 lines of test code with no loss of essential coverage.
* docs: fold migration page into existing concept docs
The standalone ``Migrating to the async default`` page didn't fit the
existing docs style — present tense, behavior over comparisons, content
in the natural concept home. Folding it in:
- ``architecture-and-performance.md`` gets a new ``Async Engine`` section
covering per-model timeouts, run outcomes (partial completion +
``DataDesignerEarlyShutdownError``), and the transitional opt-out.
Three stale ``async engine is landing soon`` callouts updated to
reflect the flip.
- ``custom_columns.md`` gets two short notes: a thread-safety callout
near Generation Strategies, and a mocking-with-spec note in
Development Testing.
- ``async-all-the-way-down.md`` Update section now points at the new
arch-and-perf section.
- README heads-up links to the same anchor.
- ``migration-async-default.md`` removed; mkdocs.yml entry dropped.
* docs: frame Execution Model as sync-engine specifics
Small targeted edits to make the user-facing concept docs consistent
with the post-flip state. No restructuring.
- ``architecture-and-performance.md``: the ``Execution Model`` callout
now opens with two engines, links to the new ``Async Engine`` section,
and frames the existing column-at-a-time description as sync-engine
semantics. The ``Step 2: Process columns sequentially`` paragraph notes
the async engine relaxes this. The ``Key Concepts`` table differentiates
per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells``
is the same on both.
- ``processors.md``: added a warning callout about the async engine's
row-count invariance in pre- and post-batch stages, with the guidance
to use ``process_after_generation()`` for row-filtering or expansion.
* fix: address review nits from PR #592 (Nabin)
Four targeted fixes from the review.
Worth-addressing (warning):
- ``test_acquire_async_default_no_deadline_waits_for_release`` was
spawning the release task without holding a strong reference. The
loop's weak-ref bookkeeping could GC it before the inner ``await``
observes the release, producing a CI flake. Hold the task and
``await`` it in ``finally``.
Take-it-or-leave-it (applied):
- Root-cause error surfacing now includes the exception type name:
``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see
``ValueError: ...`` instead of just the message string. The
``__cause__`` chain is preserved either way.
- Drop the defensive ``getattr(c, "allow_resize", False)`` in
``_resolve_client_concurrency_mode`` — every member of
``ColumnConfigT`` inherits ``allow_resize: bool = False`` from
``SingleColumnConfig``.
- One-line comment near the root-cause surfacing branch noting that
``actual_num_records == 0`` is async-only (sync runs leave it at
``-1``), so the branch is async-only by construction.
Not addressed in this PR (filing as follow-ups):
- ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in
``column_generators/generators/base.py:_run_coroutine_sync``. That
bridge has no model-facade context to derive a timeout from, so the
fix is a structural refactor outside this PR's scope.
- First-error capture loses subsequent-error context. The "first wins"
heuristic is documented; richer aggregation is a follow-up.
* fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync
This was the third hardcoded 300s timeout (Nabin flagged it on PR #592).
The path is the generic sync→async bridge in ``ColumnGenerator.generate()``:
when a subclass overrides only ``agenerate()``, the sync entry point runs
the coroutine in a background thread.
Same philosophy we applied to the throttle queue wait elsewhere in the
PR: a defensive deadline on top of work that's already bounded by the
HTTP timeout doesn't add safety, it just produces spurious failures on
slow self-hosted endpoints. Drop the constant, the timeout exception
handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)``
becomes the simple cleanup.
Tests in ``test_async_generators.py`` exercise the happy path only and
don't depend on the timeout firing.
* Revert "fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync"
This reverts commit
|
||
|
|
28c8345909
|
feat: add built-in filesystem seed readers (#421) | ||
|
|
982ce79ca9
|
feat: add processor plugin support (#299)
* feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/ |
||
|
|
1439bbea7e
|
chore: Improve CLI startup with lazy heavy import cleanup (#330)
* perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations. * Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. * increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios. |
||
|
|
87119a545b
|
refactor: move SingleColumnConfig to config.base module (#287)
* create top-level base file * add note * update license header * move exportable config and move base to config module * update references in docs * do not include single column config in init * add inverse import order e2e test |
||
|
|
e6e58e692e
|
feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248) | ||
|
|
eb5ef279ab
|
fix litellm issue with lazy load (#228) | ||
|
|
1ee37bc317
|
refactor: update single column base class (#206)
* make properties abstract * add private column emoji attribute * update e2e plugin tests * throw error if not default string * add unit tests * make emoji a static method * dont need that docstring * update unit test |
||
|
|
3d9f5185d7
|
refactor: remove task metadata property (#216)
* remove metadata * docs and tests * don't need that test * use static method for generation strategy * update docs * add docstring |
||
|
|
367de1a063
|
rename (#214) |