mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
17 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c0a4dcbb85
|
feat: implement async scheduling admission control (#661)
Some checks are pending
CI / End to end test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / End to end test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
CI / Coverage Check (Python 3.11) (push) Blocked by required conditions
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
|
||
|
|
61cdeefb17
|
feat: make async engine the default execution path (#592)
Some checks failed
CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Coverage Check (Python 3.11) (push) Waiting to run
CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
Publish devnotes / deploy (push) Has been cancelled
* feat: make async engine the default execution path
The async engine has been hardening as opt-in for several releases. Make it
the default and address the prerequisites flagged for the flip.
Default flip
- DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites
- Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out
- allow_resize=True still falls back to sync with a DeprecationWarning
Python 3.10 support
- Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with
gather-with-explicit-cancel; semantics preserved because _run_task already
swallows its own exceptions and uses _shutdown_event for sibling cancellation
- Remove the sys.version_info < (3, 11) runtime guard
- Remove the matching pytest skipif so the executor tests run on 3.10 too
Derived timeouts (replaces two hardcoded 300s constants)
- ThrottleManager.acquire_sync/async default to timeout=None (no deadline)
instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds
actual work, queue waits scale with provider speed and AIMD
- _AsyncBridgedModelFacade derives the sync->async bridge timeout from the
model's inference_parameters.timeout and the call's max_correction_steps;
one knob (per-model timeout) drives both deadlines, no new config surface
- Add ModelFacade.request_timeout property so the bridge can read the
effective timeout the client is configured with
Root-cause surfacing
- AsyncTaskScheduler captures the first non-retryable error and exposes it
via first_non_retryable_error
- Interface threads it through DataDesignerGenerationError when 0 records
are produced without early-shutdown, so deterministic failures (e.g. bad
seed sources) surface their original message instead of a wrapped
FileNotFoundError on the parquet path
Tests
- New: throttle no-deadline default behavior (sync+async), parametrized
derived bridge timeout, restored async_concurrency tests on 3.10
- Updated: test_dataset_builder.py uses an autouse fixture to pin its
Mock-based tests to the sync engine they cover; existing bridge tests
set facade.request_timeout for the new derivation
Docs
- Replace the stale LiteLLM security notice in README with a short
async-default heads-up and link to the migration guide
- Add docs/migration-async-default.md covering per-model timeouts,
custom-column thread safety, mocking model calls, run outcomes, and
the opt-out
- Append a short Update section to the async-all-the-way-down dev note
* test: extract _compute_bridge_timeout helper for direct testing
The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result``
to capture the timeout the bridge passed in. That reaches into stdlib internals
(DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument
on the parametrize was missing.
Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test
now calls the helper directly with no mocking, and the parametrize gets readable ids.
Behavior is unchanged.
* test(e2e): align demo plugins with async engine contracts
The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two
of them were written against sync-engine semantics that the async engine
restricts:
- DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no
required_columns. The async engine routes ``no-upstream`` columns
through the from-scratch path, which passes an empty DataFrame to
generators that aren't FromScratchColumnGenerator subclasses. The
generator then produces 0 rows and the scheduler raises
``update_batch received 0 values``. Switching the plugin to
FromScratchColumnGenerator with generate_from_scratch(num_records)
matches what the plugin actually does (produces a constant column
without input) and works on both engines.
- RegexFilterProcessor implemented process_before_batch with row-count
changes. The async engine enforces row-count invariance in pre- and
post-batch processor stages by design. Moving the filter to
process_after_generation preserves the plugin's purpose (regex-based
row filtering) at a stage that supports row-count changes on both
engines. Test assertions check the final dataset, so the stage shift
is transparent.
Both changes are demo-plugin updates only; no production code change.
* fix: address Codex review findings on async-default flip
Three bugs and two test-quality concerns surfaced by an independent review of
the prior commits. Each was real and worth fixing in the flip PR.
Bug fixes
- Sync-fallback path was creating async-only model clients. The default flip
meant ``client_concurrency_mode = ASYNC`` for every default run, but the
``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()``
calls then hit ``SyncClientUnavailableError``. The resolution decision now
lives at the DataDesigner interface level via
``_resolve_client_concurrency_mode``: it considers both the env var and the
config (allow_resize forces sync clients) and is passed explicitly to
``create_resource_provider``. Direct callers of the factory still get the
env-var default.
- Sync→async bridge timeout ignored the per-call ``timeout=`` override. A
custom column calling ``model.generate(timeout=600)`` against a slow endpoint
was being cancelled at the model-config default, not 600s. The bridge now
prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``.
- Bridge timeout formula missed ``max_conversation_restarts``. One logical
generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)``
HTTP requests; the formula now multiplies both, matching the worst-case
attempt budget.
Engine routing fix (also surfaced by failing e2e plugin tests)
- ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch
generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn``
with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now
passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the
sync engine's FULL_COLUMN contract.
- The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``)
is reverted; the engine fix subsumes it. The processor-plugin fix
(``process_after_generation`` for the regex filter) stays — pre-batch
row-count change is intentionally rejected by the async engine.
Test improvements
- Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)``
and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct
from any finite default. Sync and async counterparts mirror.
- New regression tests for ``first_non_retryable_error`` surfacing covering
both load-raises and load-returns-empty paths, asserting the original
exception is chained via ``__cause__`` and that the typed
``DataDesignerEarlyShutdownError`` doesn't fire in this branch.
- New parametrized regression test for ``_resolve_client_concurrency_mode``
covering all four (env × allow_resize) combinations.
- New parametrized test for the per-call ``timeout=`` override flowing into
the bridge timeout calculation.
- Bridge formula tests extended with ``max_conversation_restarts`` cases.
* test: trim redundant parametrize cases in async-default tests
Three parametrize cases were duplicating coverage already provided by
existing standalone tests:
- ``test_acquire_*_timeout_branches`` parametrized over ``(0.0, raises)``
and ``(None, waits)``. The ``raises`` half duplicates
``test_acquire_*_raises_timeout_when_at_capacity``. Replaced with two
focused ``..._default_no_deadline_waits_for_release`` tests covering
only the no-deadline branch.
- ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four
cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the
env var alone forces it; the allow_resize input is moot. Dropped.
- ``test_async_bridge_honors_per_call_timeout`` had three cases. The
"override below floor" case cross-products the per-call override flow
with the floor-clamping behavior already covered by
``test_compute_bridge_timeout``. Dropped.
Net: -25 lines of test code with no loss of essential coverage.
* docs: fold migration page into existing concept docs
The standalone ``Migrating to the async default`` page didn't fit the
existing docs style — present tense, behavior over comparisons, content
in the natural concept home. Folding it in:
- ``architecture-and-performance.md`` gets a new ``Async Engine`` section
covering per-model timeouts, run outcomes (partial completion +
``DataDesignerEarlyShutdownError``), and the transitional opt-out.
Three stale ``async engine is landing soon`` callouts updated to
reflect the flip.
- ``custom_columns.md`` gets two short notes: a thread-safety callout
near Generation Strategies, and a mocking-with-spec note in
Development Testing.
- ``async-all-the-way-down.md`` Update section now points at the new
arch-and-perf section.
- README heads-up links to the same anchor.
- ``migration-async-default.md`` removed; mkdocs.yml entry dropped.
* docs: frame Execution Model as sync-engine specifics
Small targeted edits to make the user-facing concept docs consistent
with the post-flip state. No restructuring.
- ``architecture-and-performance.md``: the ``Execution Model`` callout
now opens with two engines, links to the new ``Async Engine`` section,
and frames the existing column-at-a-time description as sync-engine
semantics. The ``Step 2: Process columns sequentially`` paragraph notes
the async engine relaxes this. The ``Key Concepts`` table differentiates
per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells``
is the same on both.
- ``processors.md``: added a warning callout about the async engine's
row-count invariance in pre- and post-batch stages, with the guidance
to use ``process_after_generation()`` for row-filtering or expansion.
* fix: address review nits from PR #592 (Nabin)
Four targeted fixes from the review.
Worth-addressing (warning):
- ``test_acquire_async_default_no_deadline_waits_for_release`` was
spawning the release task without holding a strong reference. The
loop's weak-ref bookkeeping could GC it before the inner ``await``
observes the release, producing a CI flake. Hold the task and
``await`` it in ``finally``.
Take-it-or-leave-it (applied):
- Root-cause error surfacing now includes the exception type name:
``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see
``ValueError: ...`` instead of just the message string. The
``__cause__`` chain is preserved either way.
- Drop the defensive ``getattr(c, "allow_resize", False)`` in
``_resolve_client_concurrency_mode`` — every member of
``ColumnConfigT`` inherits ``allow_resize: bool = False`` from
``SingleColumnConfig``.
- One-line comment near the root-cause surfacing branch noting that
``actual_num_records == 0`` is async-only (sync runs leave it at
``-1``), so the branch is async-only by construction.
Not addressed in this PR (filing as follow-ups):
- ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in
``column_generators/generators/base.py:_run_coroutine_sync``. That
bridge has no model-facade context to derive a timeout from, so the
fix is a structural refactor outside this PR's scope.
- First-error capture loses subsequent-error context. The "first wins"
heuristic is documented; richer aggregation is a follow-up.
* fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync
This was the third hardcoded 300s timeout (Nabin flagged it on PR #592).
The path is the generic sync→async bridge in ``ColumnGenerator.generate()``:
when a subclass overrides only ``agenerate()``, the sync entry point runs
the coroutine in a background thread.
Same philosophy we applied to the throttle queue wait elsewhere in the
PR: a defensive deadline on top of work that's already bounded by the
HTTP timeout doesn't add safety, it just produces spurious failures on
slow self-hosted endpoints. Drop the constant, the timeout exception
handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)``
becomes the simple cleanup.
Tests in ``test_async_generators.py`` exercise the happy path only and
don't depend on the timeout firing.
* Revert "fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync"
This reverts commit
|
||
|
|
2528741eb2
|
fix: bump pytest, aiohttp, and cryptography for security CVEs (#535)
* fix: bump pytest, aiohttp, and cryptography for security CVEs - pytest 9.0.2 → 9.0.3 (CVE-2025-71176, High — RCE via symlink TOCTOU) - aiohttp 3.13.3 → 3.13.5 (10 Medium CVEs — DoS, CRLF injection, credential theft, request smuggling) - cryptography 46.0.6 → 46.0.7 (CVE-2026-39892, Medium — buffer overflow on Python >3.11) Add constraint-dependencies for transitive deps (aiohttp, cryptography) to enforce minimum safe versions across both workspace and e2e lockfiles. * style: fix indentation in tests_e2e/pyproject.toml Match the 2-space indentation used throughout the file. |
||
|
|
7e812630cf
|
feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder (#429)
* feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder * chore: add async benchmark notebook and demo scripts * fix: address all PR review comments on async builder integration - Wire on_batch_complete through on_row_group_complete callback - Mark trailing slots as dropped in replace_dataframe when processor filters rows - Ensure checkpoint still runs when on_before_checkpoint raises - Gate non-seed task dispatch on pre-batch completion - Add public run_pre_batch_on_df to ProcessorRunner (replaces private _run_stage call) - Add public is_column_complete_for_rg to CompletionTracker (replaces private _completed access) - Type task_traces as list[TaskTrace] in results.py - Add async_trace docstring to RunConfig - Move module-level log into _build_async - Add replace_dataframe unit tests (same-size, dropped rows, fewer rows) - Assert on public outcomes in scheduler tests instead of private attributes - Parametrize allow_resize validation tests - Cache seed_cols before main loop - Remove redundant disable_early_shutdown from AsyncTaskScheduler * style: fix ruff format for lambda expression * fix: address open review issues on async scheduler - Flush completed row groups before breaking on early shutdown (data loss) - Change error rate check from >= to > so disable_early_shutdown sentinel (1.0) never triggers at 100% failure rate - Extract seeds-complete check into helper and call it in salvage rounds via _drain_frontier, with pre-batch gating, so pre-batch processor runs even when seed tasks succeed only after retry - Fix is_column_complete_for_rg to check _batch_complete first, then verify all non-dropped rows for CELL_BY_CELL columns - Replace O(|in-flight|) scan in _in_flight_for_rg with per-RG counter * fix: sync pre-batch row drops to CompletionTracker and restore stderr safely Pre-batch processors that filter rows marked them as dropped in RowGroupBufferManager but not in CompletionTracker, causing unnecessary LLM calls for rows that would be discarded at checkpoint time. Also wrap the benchmark warmup stderr redirect in try/finally so stderr is restored if _run_once raises. * fix: prune _admitted_rg_ids on row group checkpoint Prevents unbounded growth of the admission set across large runs. * chore: remove demo/async files from PR Dev-time benchmarks and manual test scripts - kept locally, not needed in the PR. * fix: wire disable_early_shutdown into AsyncTaskScheduler RunConfig.disable_early_shutdown was forwarded to the sync executor but silently ignored in the async path. Now passed through to the scheduler's _check_error_rate. * test: add e2e test for async engine concurrency Verifies the async scheduler dispatches independent LLM columns concurrently by checking for overlapping task trace intervals. Uses a wide DAG (sampler -> 2 parallel LLM columns) with 2 records. Requires NVIDIA_API_KEY. * fix: drop row group on on_before_checkpoint failure instead of writing unprocessed data Matches on_seeds_complete failure behavior and avoids silently checkpointing unfiltered rows when a post-batch processor fails. * fix: skip on_before_checkpoint when no POST_BATCH processors configured Avoids unnecessary DataFrame round-trip for every row group in the common case where no post-batch processors exist. * fix: address remaining review nits from nabinchha and greptile summary - Gate on_seeds_complete on PRE_BATCH processors (matches on_before_checkpoint pattern) - Cache seed_cols as instance attr instead of recomputing in _dispatch_seeds - Iterate list(self._active_rgs) snapshot in _run_seeds_complete_check - Add logger.debug to telemetry except block - Add design comment on on_before_checkpoint failure drop behavior - Rename row_group param to row_group_index in is_column_complete_for_rg - Document rg_id as current_batch_number equivalence - Use mock.patch.object in e2e test instead of direct mutation - Add max(0, ...) floor guard on _in_flight_counts decrement - Rename _ensure_async_engine_loop to public ensure_async_engine_loop - Move AsyncTaskScheduler import to module level in integration tests * fix: preserve async callback contract and e2e setup * fix: prune _seeds_dispatched_rgs and _pre_batch_done_rgs on checkpoint * refactor: consolidate per-RG state into _RowGroupState dataclass Replace 5 independent collections (_active_rgs, _admitted_rg_ids, _seeds_dispatched_rgs, _pre_batch_done_rgs, _in_flight_counts) with a single _rg_states dict keyed by row group ID. Cleanup is now a single `del` instead of N separate discards, eliminating the class of bugs where one collection is missed during row group teardown. * fix: skip checkpoint and callbacks when on_before_checkpoint fails When on_before_checkpoint raises and all rows are dropped, the code previously fell through to checkpoint_row_group and on_row_group_complete, sending a spurious progress notification for a batch with zero records. Now gates both on a `dropped` flag so they are skipped after failure. * fix: snapshot dropped rows before await in _run_batch and sync tracker on checkpoint failure Two fixes: - _run_batch: snapshot dropped rows before `await agenerate` so the row-count expectation matches batch_df. Concurrent tasks can drop rows during the await, causing a spurious ValueError that would drop the entire row group. Write-back now re-checks is_dropped to skip rows dropped mid-flight. - _checkpoint_completed_row_groups: add tracker.drop_row alongside buffer_manager.drop_row when on_before_checkpoint fails, keeping both in sync. * feat: sliding window error rate and out-of-order row group completion test Replace cumulative error counters with a deque-based sliding window so that early transient failures do not permanently inflate the error rate in long-running jobs. Add tests for the sliding window recovery path and for deterministic out-of-order row group checkpoint ordering. * fix: use real time delays in out-of-order completion test asyncio.sleep(0) interleaving is not deterministic across Python versions. Switch to asyncio.sleep(num_records * 0.02) so the smaller row group genuinely finishes seeds first regardless of event loop scheduling. * fix: prevent ZeroDivisionError when shutdown_error_window is 0 Change RunConfig.shutdown_error_window constraint from ge=0 to ge=1 so the sliding window denominator is never zero. * fix: address Greptile review nits in async_scheduler - Move del _rg_states inside try/finally so semaphore is always released - Add exc_info=True to pre-batch failure log for consistent tracebacks - Short-circuit _check_error_rate when _early_shutdown already set * fix: address Greptile summary findings - Remove duplicate async engine log in build() (kept in _build_async) - Guard _run_seeds_complete_check with has_pre_batch at both call sites - Change error rate comparison from > to >= to match sync path semantics |
||
|
|
28c8345909
|
feat: add built-in filesystem seed readers (#421) | ||
|
|
982ce79ca9
|
feat: add processor plugin support (#299)
* feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/ |
||
|
|
1439bbea7e
|
chore: Improve CLI startup with lazy heavy import cleanup (#330)
* perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations. * Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. * increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios. |
||
|
|
2e413d31ce
|
bump pytest, nbconvert, and pyjwt for vulnerability fixes (#312)
- pytest: 8.x -> 9.0.2 (with pytest-asyncio 1.3.0, pytest-httpx 0.36.0) - nbconvert: 7.16.6 -> 7.17.0 - pyjwt: 2.10.1 -> 2.11.0 |
||
|
|
87119a545b
|
refactor: move SingleColumnConfig to config.base module (#287)
* create top-level base file * add note * update license header * move exportable config and move base to config module * update references in docs * do not include single column config in init * add inverse import order e2e test |
||
|
|
5430bcbe99
|
Remove debug_trace_override (#290)
|
||
|
|
510761107b
|
feat: Add TraceType enum for granular trace control (#284) | ||
|
|
e6e58e692e
|
feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248) | ||
|
|
ae0665fa16
|
refactor: slim package refactor into three subpackages (#240)
* remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note to about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protetion rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main |
||
|
|
eb5ef279ab
|
fix litellm issue with lazy load (#228) | ||
|
|
1ee37bc317
|
refactor: update single column base class (#206)
* make properties abstract * add private column emoji attribute * update e2e plugin tests * throw error if not default string * add unit tests * make emoji a static method * dont need that docstring * update unit test |
||
|
|
3d9f5185d7
|
refactor: remove task metadata property (#216)
* remove metadata * docs and tests * don't need that test * use static method for generation strategy * update docs * add docstring |
||
|
|
367de1a063
|
rename (#214) |