DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Eric W. Tramel	c0a4dcbb85	feat: implement async scheduling admission control (#661)	2026-05-20 20:58:05 -04:00
Andre Manoel	61cdeefb17	feat: make async engine the default execution path (#592) * feat: make async engine the default execution path The async engine has been hardening as opt-in for several releases. Make it the default and address the prerequisites flagged for the flip. Default flip - DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites - Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out - allow_resize=True still falls back to sync with a DeprecationWarning Python 3.10 support - Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with gather-with-explicit-cancel; semantics preserved because _run_task already swallows its own exceptions and uses _shutdown_event for sibling cancellation - Remove the sys.version_info < (3, 11) runtime guard - Remove the matching pytest skipif so the executor tests run on 3.10 too Derived timeouts (replaces two hardcoded 300s constants) - ThrottleManager.acquire_sync/async default to timeout=None (no deadline) instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds actual work, queue waits scale with provider speed and AIMD - _AsyncBridgedModelFacade derives the sync->async bridge timeout from the model's inference_parameters.timeout and the call's max_correction_steps; one knob (per-model timeout) drives both deadlines, no new config surface - Add ModelFacade.request_timeout property so the bridge can read the effective timeout the client is configured with Root-cause surfacing - AsyncTaskScheduler captures the first non-retryable error and 
exposes it via first_non_retryable_error - Interface threads it through DataDesignerGenerationError when 0 records are produced without early-shutdown, so deterministic failures (e.g. bad seed sources) surface their original message instead of a wrapped FileNotFoundError on the parquet path Tests - New: throttle no-deadline default behavior (sync+async), parametrized derived bridge timeout, restored async_concurrency tests on 3.10 - Updated: test_dataset_builder.py uses an autouse fixture to pin its Mock-based tests to the sync engine they cover; existing bridge tests set facade.request_timeout for the new derivation Docs - Replace the stale LiteLLM security notice in README with a short async-default heads-up and link to the migration guide - Add docs/migration-async-default.md covering per-model timeouts, custom-column thread safety, mocking model calls, run outcomes, and the opt-out - Append a short Update section to the async-all-the-way-down dev note * test: extract _compute_bridge_timeout helper for direct testing The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result`` to capture the timeout the bridge passed in. That reaches into stdlib internals (DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument on the parametrize was missing. Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test now calls the helper directly with no mocking, and the parametrize gets readable ids. Behavior is unchanged. * test(e2e): align demo plugins with async engine contracts The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two of them were written against sync-engine semantics that the async engine restricts: - DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no required_columns. The async engine routes ``no-upstream`` columns through the from-scratch path, which passes an empty DataFrame to generators that aren't FromScratchColumnGenerator subclasses. 
The generator then produces 0 rows and the scheduler raises ``update_batch received 0 values``. Switching the plugin to FromScratchColumnGenerator with generate_from_scratch(num_records) matches what the plugin actually does (produces a constant column without input) and works on both engines. - RegexFilterProcessor implemented process_before_batch with row-count changes. The async engine enforces row-count invariance in pre- and post-batch processor stages by design. Moving the filter to process_after_generation preserves the plugin's purpose (regex-based row filtering) at a stage that supports row-count changes on both engines. Test assertions check the final dataset, so the stage shift is transparent. Both changes are demo-plugin updates only; no production code change. * fix: address Codex review findings on async-default flip Three bugs and two test-quality concerns surfaced by an independent review of the prior commits. Each was real and worth fixing in the flip PR. Bug fixes - Sync-fallback path was creating async-only model clients. The default flip meant ``client_concurrency_mode = ASYNC`` for every default run, but the ``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()`` calls then hit ``SyncClientUnavailableError``. The resolution decision now lives at the DataDesigner interface level via ``_resolve_client_concurrency_mode``: it considers both the env var and the config (allow_resize forces sync clients) and is passed explicitly to ``create_resource_provider``. Direct callers of the factory still get the env-var default. - Sync→async bridge timeout ignored the per-call ``timeout=`` override. A custom column calling ``model.generate(timeout=600)`` against a slow endpoint was being cancelled at the model-config default, not 600s. The bridge now prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``. - Bridge timeout formula missed ``max_conversation_restarts``. 
One logical generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)`` HTTP requests; the formula now multiplies both, matching the worst-case attempt budget. Engine routing fix (also surfaced by failing e2e plugin tests) - ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn`` with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the sync engine's FULL_COLUMN contract. - The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``) is reverted; the engine fix subsumes it. The processor-plugin fix (``process_after_generation`` for the regex filter) stays — pre-batch row-count change is intentionally rejected by the async engine. Test improvements - Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)`` and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct from any finite default. Sync and async counterparts mirror. - New regression tests for ``first_non_retryable_error`` surfacing covering both load-raises and load-returns-empty paths, asserting the original exception is chained via ``__cause__`` and that the typed ``DataDesignerEarlyShutdownError`` doesn't fire in this branch. - New parametrized regression test for ``_resolve_client_concurrency_mode`` covering all four (env × allow_resize) combinations. - New parametrized test for the per-call ``timeout=`` override flowing into the bridge timeout calculation. - Bridge formula tests extended with ``max_conversation_restarts`` cases. * test: trim redundant parametrize cases in async-default tests Three parametrize cases were duplicating coverage already provided by existing standalone tests: - ``test_acquire__timeout_branches`` parametrized over ``(0.0, raises)`` and ``(None, waits)``. 
The ``raises`` half duplicates ``test_acquire__raises_timeout_when_at_capacity``. Replaced with two focused ``..._default_no_deadline_waits_for_release`` tests covering only the no-deadline branch. - ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the env var alone forces it; the allow_resize input is moot. Dropped. - ``test_async_bridge_honors_per_call_timeout`` had three cases. The "override below floor" case cross-products the per-call override flow with the floor-clamping behavior already covered by ``test_compute_bridge_timeout``. Dropped. Net: -25 lines of test code with no loss of essential coverage. * docs: fold migration page into existing concept docs The standalone ``Migrating to the async default`` page didn't fit the existing docs style — present tense, behavior over comparisons, content in the natural concept home. Folding it in: - ``architecture-and-performance.md`` gets a new ``Async Engine`` section covering per-model timeouts, run outcomes (partial completion + ``DataDesignerEarlyShutdownError``), and the transitional opt-out. Three stale ``async engine is landing soon`` callouts updated to reflect the flip. - ``custom_columns.md`` gets two short notes: a thread-safety callout near Generation Strategies, and a mocking-with-spec note in Development Testing. - ``async-all-the-way-down.md`` Update section now points at the new arch-and-perf section. - README heads-up links to the same anchor. - ``migration-async-default.md`` removed; mkdocs.yml entry dropped. * docs: frame Execution Model as sync-engine specifics Small targeted edits to make the user-facing concept docs consistent with the post-flip state. No restructuring. - ``architecture-and-performance.md``: the ``Execution Model`` callout now opens with two engines, links to the new ``Async Engine`` section, and frames the existing column-at-a-time description as sync-engine semantics. 
The ``Step 2: Process columns sequentially`` paragraph notes the async engine relaxes this. The ``Key Concepts`` table differentiates per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells`` is the same on both. - ``processors.md``: added a warning callout about the async engine's row-count invariance in pre- and post-batch stages, with the guidance to use ``process_after_generation()`` for row-filtering or expansion. * fix: address review nits from PR #592 (Nabin) Four targeted fixes from the review. Worth-addressing (warning): - ``test_acquire_async_default_no_deadline_waits_for_release`` was spawning the release task without holding a strong reference. The loop's weak-ref bookkeeping could GC it before the inner ``await`` observes the release, producing a CI flake. Hold the task and ``await`` it in ``finally``. Take-it-or-leave-it (applied): - Root-cause error surfacing now includes the exception type name: ``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see ``ValueError: ...`` instead of just the message string. The ``__cause__`` chain is preserved either way. - Drop the defensive ``getattr(c, "allow_resize", False)`` in ``_resolve_client_concurrency_mode`` — every member of ``ColumnConfigT`` inherits ``allow_resize: bool = False`` from ``SingleColumnConfig``. - One-line comment near the root-cause surfacing branch noting that ``actual_num_records == 0`` is async-only (sync runs leave it at ``-1``), so the branch is async-only by construction. Not addressed in this PR (filing as follow-ups): - ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in ``column_generators/generators/base.py:_run_coroutine_sync``. That bridge has no model-facade context to derive a timeout from, so the fix is a structural refactor outside this PR's scope. - First-error capture loses subsequent-error context. The "first wins" heuristic is documented; richer aggregation is a follow-up. 
* fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync This was the third hardcoded 300s timeout (Nabin flagged it on PR #592). The path is the generic sync→async bridge in ``ColumnGenerator.generate()``: when a subclass overrides only ``agenerate()``, the sync entry point runs the coroutine in a background thread. Same philosophy we applied to the throttle queue wait elsewhere in the PR: a defensive deadline on top of work that's already bounded by the HTTP timeout doesn't add safety, it just produces spurious failures on slow self-hosted endpoints. Drop the constant, the timeout exception handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)`` becomes the simple cleanup. Tests in ``test_async_generators.py`` exercise the happy path only and don't depend on the timeout firing. * Revert "fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync" This reverts commit `7a0b77d44c`. * docs+feat: deprecate the sync-engine opt-out path Nabin asked whether the docs should adopt explicit "deprecation" language on the opt-out path. Doing both: - Doc: ``architecture-and-performance.md``'s ``Opting out`` section now uses an ``!!! warning "Deprecated"`` admonition that names the env var as a deprecated escape hatch and notes the run-time warning. - Code: ``DataDesigner._resolve_client_concurrency_mode`` emits a ``DeprecationWarning`` when ``DATA_DESIGNER_ASYNC_ENGINE=0`` is detected. Same precedent as the existing ``allow_resize=True`` warning. Auto-fallback via ``allow_resize`` does not double-warn here; the builder layer emits its own warning later. - Test: parametrized regression now asserts ``pytest.warns(DeprecationWarning)`` on the opt-out branch and treats any warning on the async-on branches as a failure (``simplefilter("error")`` inside the ``catch_warnings`` block). * fix: emit logger.warning alongside DeprecationWarning on env-var opt-out Parity fix from Nabin's re-review of PR #592. 
The ``allow_resize=True`` auto-fallback path in ``_resolve_async_compatibility`` emits both a ``logger.warning("⚠️ ...")`` and a ``DeprecationWarning``. The new ``DATA_DESIGNER_ASYNC_ENGINE=0`` opt-out path was only emitting the ``DeprecationWarning``, leaving users who run with default warning filters silenced and inconsistent with the established precedent. Match the pattern: same message body, both signals, same stacklevel. * docs: breadcrumb explaining why SYNC_BRIDGE_TIMEOUT survives PR #592 Nabin's re-review pointed out that ``base.py`` is the lone place where the 300s pattern survives, while ``custom.py`` and ``throttle_manager.py`` both retired theirs. Without a comment, a future reader (or a lint sweep) could mistake this for an oversight and "consistency-fix" it the wrong way. Add a short note at the constant naming the two retired siblings, the reason this one stayed (no ``ModelFacade`` context to derive from), and the fact that it's tracked for a structural follow-up.	2026-05-04 16:22:13 -03:00
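Note on the bridge timeout in 61cdeefb17: the message says one logical generation can issue up to ``(1 + max_conversation_restarts) × (1 + max_correction_steps)`` HTTP requests, that a per-call ``timeout=`` override is preferred over ``facade.request_timeout``, and that the result is clamped to a floor. The sketch below only illustrates that arithmetic. The function name, its signature, and the floor value are assumptions made for this note, not the repository's ``_compute_bridge_timeout``.

```python
from typing import Optional


def sketch_bridge_timeout(
    request_timeout: float,
    max_correction_steps: int = 0,
    max_conversation_restarts: int = 0,
    per_call_timeout: Optional[float] = None,
    floor: float = 1.0,  # hypothetical floor; the real value is not stated in the log
) -> float:
    """Illustrative worst-case deadline for one logical generation."""
    # A per-call ``timeout=`` override wins over the model-config default.
    per_request = per_call_timeout if per_call_timeout is not None else request_timeout
    # Worst-case attempt budget: every restart may use every correction step.
    attempts = (1 + max_conversation_restarts) * (1 + max_correction_steps)
    return max(floor, per_request * attempts)
```

With a 60 s per-model timeout, two correction steps, and one conversation restart, the budget is six requests, so the sketch returns 360 s.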
Johnny Greco	2528741eb2	fix: bump pytest, aiohttp, and cryptography for security CVEs (#535) * fix: bump pytest, aiohttp, and cryptography for security CVEs - pytest 9.0.2 → 9.0.3 (CVE-2025-71176, High; RCE via symlink TOCTOU) - aiohttp 3.13.3 → 3.13.5 (10 Medium CVEs; DoS, CRLF injection, credential theft, request smuggling) - cryptography 46.0.6 → 46.0.7 (CVE-2026-39892, Medium; buffer overflow on Python >3.11) Add constraint-dependencies for transitive deps (aiohttp, cryptography) to enforce minimum safe versions across both workspace and e2e lockfiles. * style: fix indentation in tests_e2e/pyproject.toml Match the 2-space indentation used throughout the file.	2026-04-13 10:23:13 -04:00
Andre Manoel	7e812630cf	feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder (#429) * feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder * chore: add async benchmark notebook and demo scripts * fix: address all PR review comments on async builder integration - Wire on_batch_complete through on_row_group_complete callback - Mark trailing slots as dropped in replace_dataframe when processor filters rows - Ensure checkpoint still runs when on_before_checkpoint raises - Gate non-seed task dispatch on pre-batch completion - Add public run_pre_batch_on_df to ProcessorRunner (replaces private _run_stage call) - Add public is_column_complete_for_rg to CompletionTracker (replaces private _completed access) - Type task_traces as list[TaskTrace] in results.py - Add async_trace docstring to RunConfig - Move module-level log into _build_async - Add replace_dataframe unit tests (same-size, dropped rows, fewer rows) - Assert on public outcomes in scheduler tests instead of private attributes - Parametrize allow_resize validation tests - Cache seed_cols before main loop - Remove redundant disable_early_shutdown from AsyncTaskScheduler * style: fix ruff format for lambda expression * fix: address open review issues on async scheduler - Flush completed row groups before breaking on early shutdown (data loss) - Change error rate check from >= to > so disable_early_shutdown sentinel (1.0) never triggers at 100% failure rate - Extract seeds-complete check into helper and call it in salvage rounds via _drain_frontier, with pre-batch gating, so pre-batch processor runs even when seed tasks succeed only after retry - Fix is_column_complete_for_rg to check _batch_complete first, then verify all non-dropped rows for CELL_BY_CELL columns - Replace O(|in-flight|) scan in _in_flight_for_rg with per-RG counter * fix: sync pre-batch row drops to CompletionTracker and restore stderr safely Pre-batch processors that filter rows marked them as dropped in 
RowGroupBufferManager but not in CompletionTracker, causing unnecessary LLM calls for rows that would be discarded at checkpoint time. Also wrap the benchmark warmup stderr redirect in try/finally so stderr is restored if _run_once raises. * fix: prune _admitted_rg_ids on row group checkpoint Prevents unbounded growth of the admission set across large runs. * chore: remove demo/async files from PR Dev-time benchmarks and manual test scripts - kept locally, not needed in the PR. * fix: wire disable_early_shutdown into AsyncTaskScheduler RunConfig.disable_early_shutdown was forwarded to the sync executor but silently ignored in the async path. Now passed through to the scheduler's _check_error_rate. * test: add e2e test for async engine concurrency Verifies the async scheduler dispatches independent LLM columns concurrently by checking for overlapping task trace intervals. Uses a wide DAG (sampler -> 2 parallel LLM columns) with 2 records. Requires NVIDIA_API_KEY. * fix: drop row group on on_before_checkpoint failure instead of writing unprocessed data Matches on_seeds_complete failure behavior and avoids silently checkpointing unfiltered rows when a post-batch processor fails. * fix: skip on_before_checkpoint when no POST_BATCH processors configured Avoids unnecessary DataFrame round-trip for every row group in the common case where no post-batch processors exist. 
* fix: address remaining review nits from nabinchha and greptile summary - Gate on_seeds_complete on PRE_BATCH processors (matches on_before_checkpoint pattern) - Cache seed_cols as instance attr instead of recomputing in _dispatch_seeds - Iterate list(self._active_rgs) snapshot in _run_seeds_complete_check - Add logger.debug to telemetry except block - Add design comment on on_before_checkpoint failure drop behavior - Rename row_group param to row_group_index in is_column_complete_for_rg - Document rg_id as current_batch_number equivalence - Use mock.patch.object in e2e test instead of direct mutation - Add max(0, ...) floor guard on _in_flight_counts decrement - Rename _ensure_async_engine_loop to public ensure_async_engine_loop - Move AsyncTaskScheduler import to module level in integration tests * fix: preserve async callback contract and e2e setup * fix: prune _seeds_dispatched_rgs and _pre_batch_done_rgs on checkpoint * refactor: consolidate per-RG state into _RowGroupState dataclass Replace 5 independent collections (_active_rgs, _admitted_rg_ids, _seeds_dispatched_rgs, _pre_batch_done_rgs, _in_flight_counts) with a single _rg_states dict keyed by row group ID. Cleanup is now a single `del` instead of N separate discards, eliminating the class of bugs where one collection is missed during row group teardown. * fix: skip checkpoint and callbacks when on_before_checkpoint fails When on_before_checkpoint raises and all rows are dropped, the code previously fell through to checkpoint_row_group and on_row_group_complete, sending a spurious progress notification for a batch with zero records. Now gates both on a `dropped` flag so they are skipped after failure. * fix: snapshot dropped rows before await in _run_batch and sync tracker on checkpoint failure Two fixes: - _run_batch: snapshot dropped rows before `await agenerate` so the row-count expectation matches batch_df. 
Concurrent tasks can drop rows during the await, causing a spurious ValueError that would drop the entire row group. Write-back now re-checks is_dropped to skip rows dropped mid-flight. - _checkpoint_completed_row_groups: add tracker.drop_row alongside buffer_manager.drop_row when on_before_checkpoint fails, keeping both in sync. * feat: sliding window error rate and out-of-order row group completion test Replace cumulative error counters with a deque-based sliding window so that early transient failures do not permanently inflate the error rate in long-running jobs. Add tests for the sliding window recovery path and for deterministic out-of-order row group checkpoint ordering. * fix: use real time delays in out-of-order completion test asyncio.sleep(0) interleaving is not deterministic across Python versions. Switch to asyncio.sleep(num_records * 0.02) so the smaller row group genuinely finishes seeds first regardless of event loop scheduling. * fix: prevent ZeroDivisionError when shutdown_error_window is 0 Change RunConfig.shutdown_error_window constraint from ge=0 to ge=1 so the sliding window denominator is never zero. * fix: address Greptile review nits in async_scheduler - Move del _rg_states inside try/finally so semaphore is always released - Add exc_info=True to pre-batch failure log for consistent tracebacks - Short-circuit _check_error_rate when _early_shutdown already set * fix: address Greptile summary findings - Remove duplicate async engine log in build() (kept in _build_async) - Guard _run_seeds_complete_check with has_pre_batch at both call sites - Change error rate comparison from > to >= to match sync path semantics	2026-03-20 13:05:09 -03:00
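Note on the sliding-window error rate in 7e812630cf: the message describes replacing cumulative error counters with a deque-based window so that early transient failures age out, constraining the window size to at least 1, and comparing the rate with ``>=``. A minimal sketch of that idea follows. The class name, the choice to wait for a full window before tripping, and the use of the current window length as the denominator are assumptions made for this note, not the scheduler's actual code.

```python
from collections import deque


class SlidingErrorWindow:
    """Illustrative error-rate tracker over the most recent task outcomes."""

    def __init__(self, window: int, threshold: float) -> None:
        if window < 1:
            # Mirrors the ge=1 constraint mentioned in the log.
            raise ValueError("window must be >= 1")
        # deque(maxlen=...) silently evicts the oldest outcome when full.
        self._outcomes = deque(maxlen=window)
        self._threshold = threshold

    def record(self, failed: bool) -> None:
        self._outcomes.append(failed)

    @property
    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_shut_down(self) -> bool:
        # Assumption: only trip once the window is full, so one early
        # failure cannot trigger shutdown on its own.
        full = len(self._outcomes) == self._outcomes.maxlen
        return full and self.error_rate >= self._threshold
```

A burst of failures trips the check, and the same tracker recovers once enough successes push the failures out of the window, which a cumulative counter would not do.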
Eric W. Tramel	28c8345909	feat: add built-in filesystem seed readers (#421)	2026-03-16 17:40:27 -04:00
Andre Manoel	982ce79ca9	feat: add processor plugin support (#299) * feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/	2026-02-25 16:40:01 -03:00
Johnny Greco	1439bbea7e	chore: Improve CLI startup with lazy heavy import cleanup (#330) * perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations. 
* Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. 
* increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.	2026-02-18 16:24:15 -05:00
Johnny Greco	2e413d31ce	bump pytest, nbconvert, and pyjwt for vulnerability fixes (#312) - pytest: 8.x -> 9.0.2 (with pytest-asyncio 1.3.0, pytest-httpx 0.36.0) - nbconvert: 7.16.6 -> 7.17.0 - pyjwt: 2.10.1 -> 2.11.0	2026-02-09 10:02:36 -05:00
Johnny Greco	87119a545b	refactor: move SingleColumnConfig to config.base module (#287) * create top-level base file * add note * update license header * move exportable config and move base to config module * update references in docs * do not include single column config in init * add inverse import order e2e test	2026-02-03 14:04:04 -05:00
Eric W. Tramel	5430bcbe99	Remove `debug_trace_override` (#290)	2026-02-03 12:09:30 -05:00
Eric W. Tramel	510761107b	feat: Add TraceType enum for granular trace control (#284)	2026-02-02 19:43:51 -05:00
Eric W. Tramel	e6e58e692e	feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248)	2026-02-02 09:41:58 -05:00
Johnny Greco	ae0665fa16	refactor: slim package refactor into three subpackages (#240) * remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protection rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main	2026-01-27 13:53:20 -05:00
Nabin Mulepati	eb5ef279ab	fix litellm issue with lazy load (#228)	2026-01-16 17:41:06 -07:00
Johnny Greco	1ee37bc317	refactor: update single column base class (#206) * make properties abstract * add private column emoji attribute * update e2e plugin tests * throw error if not default string * add unit tests * make emoji a static method * dont need that docstring * update unit test	2026-01-15 14:44:58 -05:00
Johnny Greco	3d9f5185d7	refactor: remove task metadata property (#216) * remove metadata * docs and tests * don't need that test * use static method for generation strategy * update docs * add docstring	2026-01-15 14:12:11 -05:00
Johnny Greco	367de1a063	rename (#214)	2026-01-14 15:26:46 -05:00

17 commits