DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Andre Manoel	61cdeefb17	feat: make async engine the default execution path (#592 ) Some checks failed CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Coverage Check (Python 3.11) (push) Waiting to run Details CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details Publish devnotes / deploy (push) Has been cancelled Details * feat: make async engine the default execution path The async engine has been hardening as opt-in for several releases. Make it the default and address the prerequisites flagged for the flip. Default flip - DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites - Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out - allow_resize=True still falls back to sync with a DeprecationWarning Python 3.10 support - Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with gather-with-explicit-cancel; semantics preserved because _run_task already swallows its own exceptions and uses _shutdown_event for sibling cancellation - Remove the sys.version_info < (3, 11) runtime guard - Remove the matching pytest skipif so the executor tests run on 3.10 too Derived timeouts (replaces two hardcoded 300s constants) - ThrottleManager.acquire_sync/async default to timeout=None (no deadline) instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds actual work, queue waits scale with provider speed and AIMD - _AsyncBridgedModelFacade derives the sync->async bridge timeout from the model's inference_parameters.timeout and the call's max_correction_steps; one knob (per-model timeout) drives both deadlines, no new config surface - Add ModelFacade.request_timeout property so the bridge can read the effective timeout the client is configured with Root-cause surfacing - AsyncTaskScheduler captures the first non-retryable error and exposes it via first_non_retryable_error - Interface threads it through DataDesignerGenerationError when 0 records are produced without early-shutdown, so deterministic failures (e.g. bad seed sources) surface their original message instead of a wrapped FileNotFoundError on the parquet path Tests - New: throttle no-deadline default behavior (sync+async), parametrized derived bridge timeout, restored async_concurrency tests on 3.10 - Updated: test_dataset_builder.py uses an autouse fixture to pin its Mock-based tests to the sync engine they cover; existing bridge tests set facade.request_timeout for the new derivation Docs - Replace the stale LiteLLM security notice in README with a short async-default heads-up and link to the migration guide - Add docs/migration-async-default.md covering per-model timeouts, custom-column thread safety, mocking model calls, run outcomes, and the opt-out - Append a short Update section to the async-all-the-way-down dev note * test: extract _compute_bridge_timeout helper for direct testing The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result`` to capture the timeout the bridge passed in. That reaches into stdlib internals (DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument on the parametrize was missing. Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test now calls the helper directly with no mocking, and the parametrize gets readable ids. Behavior is unchanged. * test(e2e): align demo plugins with async engine contracts The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two of them were written against sync-engine semantics that the async engine restricts: - DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no required_columns. The async engine routes ``no-upstream`` columns through the from-scratch path, which passes an empty DataFrame to generators that aren't FromScratchColumnGenerator subclasses. The generator then produces 0 rows and the scheduler raises ``update_batch received 0 values``. Switching the plugin to FromScratchColumnGenerator with generate_from_scratch(num_records) matches what the plugin actually does (produces a constant column without input) and works on both engines. - RegexFilterProcessor implemented process_before_batch with row-count changes. The async engine enforces row-count invariance in pre- and post-batch processor stages by design. Moving the filter to process_after_generation preserves the plugin's purpose (regex-based row filtering) at a stage that supports row-count changes on both engines. Test assertions check the final dataset, so the stage shift is transparent. Both changes are demo-plugin updates only; no production code change. * fix: address Codex review findings on async-default flip Three bugs and two test-quality concerns surfaced by an independent review of the prior commits. Each was real and worth fixing in the flip PR. Bug fixes - Sync-fallback path was creating async-only model clients. The default flip meant ``client_concurrency_mode = ASYNC`` for every default run, but the ``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()`` calls then hit ``SyncClientUnavailableError``. The resolution decision now lives at the DataDesigner interface level via ``_resolve_client_concurrency_mode``: it considers both the env var and the config (allow_resize forces sync clients) and is passed explicitly to ``create_resource_provider``. Direct callers of the factory still get the env-var default. - Sync→async bridge timeout ignored the per-call ``timeout=`` override. A custom column calling ``model.generate(timeout=600)`` against a slow endpoint was being cancelled at the model-config default, not 600s. The bridge now prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``. - Bridge timeout formula missed ``max_conversation_restarts``. One logical generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)`` HTTP requests; the formula now multiplies both, matching the worst-case attempt budget. Engine routing fix (also surfaced by failing e2e plugin tests) - ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn`` with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the sync engine's FULL_COLUMN contract. - The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``) is reverted; the engine fix subsumes it. The processor-plugin fix (``process_after_generation`` for the regex filter) stays — pre-batch row-count change is intentionally rejected by the async engine. Test improvements - Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)`` and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct from any finite default. Sync and async counterparts mirror. - New regression tests for ``first_non_retryable_error`` surfacing covering both load-raises and load-returns-empty paths, asserting the original exception is chained via ``__cause__`` and that the typed ``DataDesignerEarlyShutdownError`` doesn't fire in this branch. - New parametrized regression test for ``_resolve_client_concurrency_mode`` covering all four (env × allow_resize) combinations. - New parametrized test for the per-call ``timeout=`` override flowing into the bridge timeout calculation. - Bridge formula tests extended with ``max_conversation_restarts`` cases. * test: trim redundant parametrize cases in async-default tests Three parametrize cases were duplicating coverage already provided by existing standalone tests: - ``test_acquire__timeout_branches`` parametrized over ``(0.0, raises)`` and ``(None, waits)``. The ``raises`` half duplicates ``test_acquire__raises_timeout_when_at_capacity``. Replaced with two focused ``..._default_no_deadline_waits_for_release`` tests covering only the no-deadline branch. - ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the env var alone forces it; the allow_resize input is moot. Dropped. - ``test_async_bridge_honors_per_call_timeout`` had three cases. The "override below floor" case cross-products the per-call override flow with the floor-clamping behavior already covered by ``test_compute_bridge_timeout``. Dropped. Net: -25 lines of test code with no loss of essential coverage. * docs: fold migration page into existing concept docs The standalone ``Migrating to the async default`` page didn't fit the existing docs style — present tense, behavior over comparisons, content in the natural concept home. Folding it in: - ``architecture-and-performance.md`` gets a new ``Async Engine`` section covering per-model timeouts, run outcomes (partial completion + ``DataDesignerEarlyShutdownError``), and the transitional opt-out. Three stale ``async engine is landing soon`` callouts updated to reflect the flip. - ``custom_columns.md`` gets two short notes: a thread-safety callout near Generation Strategies, and a mocking-with-spec note in Development Testing. - ``async-all-the-way-down.md`` Update section now points at the new arch-and-perf section. - README heads-up links to the same anchor. - ``migration-async-default.md`` removed; mkdocs.yml entry dropped. * docs: frame Execution Model as sync-engine specifics Small targeted edits to make the user-facing concept docs consistent with the post-flip state. No restructuring. - ``architecture-and-performance.md``: the ``Execution Model`` callout now opens with two engines, links to the new ``Async Engine`` section, and frames the existing column-at-a-time description as sync-engine semantics. The ``Step 2: Process columns sequentially`` paragraph notes the async engine relaxes this. The ``Key Concepts`` table differentiates per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells`` is the same on both. - ``processors.md``: added a warning callout about the async engine's row-count invariance in pre- and post-batch stages, with the guidance to use ``process_after_generation()`` for row-filtering or expansion. * fix: address review nits from PR #592 (Nabin) Four targeted fixes from the review. Worth-addressing (warning): - ``test_acquire_async_default_no_deadline_waits_for_release`` was spawning the release task without holding a strong reference. The loop's weak-ref bookkeeping could GC it before the inner ``await`` observes the release, producing a CI flake. Hold the task and ``await`` it in ``finally``. Take-it-or-leave-it (applied): - Root-cause error surfacing now includes the exception type name: ``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see ``ValueError: ...`` instead of just the message string. The ``__cause__`` chain is preserved either way. - Drop the defensive ``getattr(c, "allow_resize", False)`` in ``_resolve_client_concurrency_mode`` — every member of ``ColumnConfigT`` inherits ``allow_resize: bool = False`` from ``SingleColumnConfig``. - One-line comment near the root-cause surfacing branch noting that ``actual_num_records == 0`` is async-only (sync runs leave it at ``-1``), so the branch is async-only by construction. Not addressed in this PR (filing as follow-ups): - ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in ``column_generators/generators/base.py:_run_coroutine_sync``. That bridge has no model-facade context to derive a timeout from, so the fix is a structural refactor outside this PR's scope. - First-error capture loses subsequent-error context. The "first wins" heuristic is documented; richer aggregation is a follow-up. * fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync This was the third hardcoded 300s timeout (Nabin flagged it on PR #592). The path is the generic sync→async bridge in ``ColumnGenerator.generate()``: when a subclass overrides only ``agenerate()``, the sync entry point runs the coroutine in a background thread. Same philosophy we applied to the throttle queue wait elsewhere in the PR: a defensive deadline on top of work that's already bounded by the HTTP timeout doesn't add safety, it just produces spurious failures on slow self-hosted endpoints. Drop the constant, the timeout exception handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)`` becomes the simple cleanup. Tests in ``test_async_generators.py`` exercise the happy path only and don't depend on the timeout firing. * Revert "fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync" This reverts commit `7a0b77d44c`. * docs+feat: deprecate the sync-engine opt-out path Nabin asked whether the docs should adopt explicit "deprecation" language on the opt-out path. Doing both: - Doc: ``architecture-and-performance.md``'s ``Opting out`` section now uses an ``!!! warning "Deprecated"`` admonition that names the env var as a deprecated escape hatch and notes the run-time warning. - Code: ``DataDesigner._resolve_client_concurrency_mode`` emits a ``DeprecationWarning`` when ``DATA_DESIGNER_ASYNC_ENGINE=0`` is detected. Same precedent as the existing ``allow_resize=True`` warning. Auto-fallback via ``allow_resize`` does not double-warn here; the builder layer emits its own warning later. - Test: parametrized regression now asserts ``pytest.warns(DeprecationWarning)`` on the opt-out branch and treats any warning on the async-on branches as a failure (``simplefilter("error")`` inside the ``catch_warnings`` block). * fix: emit logger.warning alongside DeprecationWarning on env-var opt-out Parity fix from Nabin's re-review of PR #592. The ``allow_resize=True`` auto-fallback path in ``_resolve_async_compatibility`` emits both a ``logger.warning("⚠️ ...")`` and a ``DeprecationWarning``. The new ``DATA_DESIGNER_ASYNC_ENGINE=0`` opt-out path was only emitting the ``DeprecationWarning``, leaving users who run with default warning filters silenced and inconsistent with the established precedent. Match the pattern: same message body, both signals, same stacklevel. * docs: breadcrumb explaining why SYNC_BRIDGE_TIMEOUT survives PR #592 Nabin's re-review pointed out that ``base.py`` is the lone place where the 300s pattern survives, while ``custom.py`` and ``throttle_manager.py`` both retired theirs. Without a comment, a future reader (or a lint sweep) could mistake this for an oversight and "consistency-fix" it the wrong way. Add a short note at the constant naming the two retired siblings, the reason this one stayed (no ``ModelFacade`` context to derive from), and the fact that it's tracked for a structural follow-up.	2026-05-04 16:22:13 -03:00
Eric W. Tramel	28c8345909	feat: add built-in filesystem seed readers (#421 )	2026-03-16 17:40:27 -04:00
Andre Manoel	982ce79ca9	feat: add processor plugin support (#299 ) * feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/	2026-02-25 16:40:01 -03:00
Johnny Greco	1439bbea7e	chore: Improve CLI startup with lazy heavy import cleanup (#330 ) * perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations. * Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. * increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.	2026-02-18 16:24:15 -05:00
Johnny Greco	87119a545b	refactor: move SingleColumnConfig to config.base module (#287 ) * create top-level base file * add note * update license header * move exportable config and move base to config module * update references in docs * do not include single column config in init * add inverse import order e2e test	2026-02-03 14:04:04 -05:00
Eric W. Tramel	e6e58e692e	feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248 )	2026-02-02 09:41:58 -05:00
Nabin Mulepati	eb5ef279ab	fix litellm issue with lazy load (#228 )	2026-01-16 17:41:06 -07:00
Johnny Greco	1ee37bc317	refactor: update single column base class (#206 ) * make properties abstract * add private column emoji attribute * update e2e plugin tests * throw error if not default string * add unit tests * make emoji a static method * dont need that docstring * update unit test	2026-01-15 14:44:58 -05:00
Johnny Greco	3d9f5185d7	refactor: remove task metadata property (#216 ) * remove metadata * docs and tests * don't need that test * use static method for generation strategy * update docs * add docstring	2026-01-15 14:12:11 -05:00
Johnny Greco	367de1a063	rename (#214 )	2026-01-14 15:26:46 -05:00

10 commits