mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
|
Some checks failed
CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Lint and Format Check (push) Waiting to run
CI / Check License Headers (push) Waiting to run
CI / Test Config (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Config (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Config (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Config (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Config (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Config (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Config (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Coverage Check (Python 3.11) (push) Waiting to run
CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
Publish devnotes / deploy (push) Has been cancelled
* feat: make async engine the default execution path
The async engine has been hardening as opt-in for several releases. Make it
the default and address the prerequisites flagged for the flip.
Default flip
- DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites
- Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out
- allow_resize=True still falls back to sync with a DeprecationWarning
Python 3.10 support
- Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with
gather-with-explicit-cancel; semantics preserved because _run_task already
swallows its own exceptions and uses _shutdown_event for sibling cancellation
- Remove the sys.version_info < (3, 11) runtime guard
- Remove the matching pytest skipif so the executor tests run on 3.10 too
Derived timeouts (replaces two hardcoded 300s constants)
- ThrottleManager.acquire_sync/async default to timeout=None (no deadline)
instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds
actual work, queue waits scale with provider speed and AIMD
- _AsyncBridgedModelFacade derives the sync->async bridge timeout from the
model's inference_parameters.timeout and the call's max_correction_steps;
one knob (per-model timeout) drives both deadlines, no new config surface
- Add ModelFacade.request_timeout property so the bridge can read the
effective timeout the client is configured with
Root-cause surfacing
- AsyncTaskScheduler captures the first non-retryable error and exposes it
via first_non_retryable_error
- Interface threads it through DataDesignerGenerationError when 0 records
are produced without early-shutdown, so deterministic failures (e.g. bad
seed sources) surface their original message instead of a wrapped
FileNotFoundError on the parquet path
Tests
- New: throttle no-deadline default behavior (sync+async), parametrized
derived bridge timeout, restored async_concurrency tests on 3.10
- Updated: test_dataset_builder.py uses an autouse fixture to pin its
Mock-based tests to the sync engine they cover; existing bridge tests
set facade.request_timeout for the new derivation
Docs
- Replace the stale LiteLLM security notice in README with a short
async-default heads-up and link to the migration guide
- Add docs/migration-async-default.md covering per-model timeouts,
custom-column thread safety, mocking model calls, run outcomes, and
the opt-out
- Append a short Update section to the async-all-the-way-down dev note
* test: extract _compute_bridge_timeout helper for direct testing
The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result``
to capture the timeout the bridge passed in. That reaches into stdlib internals
(DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument
on the parametrize was missing.
Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test
now calls the helper directly with no mocking, and the parametrize gets readable ids.
Behavior is unchanged.
* test(e2e): align demo plugins with async engine contracts
The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two
of them were written against sync-engine semantics that the async engine
restricts:
- DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no
required_columns. The async engine routes ``no-upstream`` columns
through the from-scratch path, which passes an empty DataFrame to
generators that aren't FromScratchColumnGenerator subclasses. The
generator then produces 0 rows and the scheduler raises
``update_batch received 0 values``. Switching the plugin to
FromScratchColumnGenerator with generate_from_scratch(num_records)
matches what the plugin actually does (produces a constant column
without input) and works on both engines.
- RegexFilterProcessor implemented process_before_batch with row-count
changes. The async engine enforces row-count invariance in pre- and
post-batch processor stages by design. Moving the filter to
process_after_generation preserves the plugin's purpose (regex-based
row filtering) at a stage that supports row-count changes on both
engines. Test assertions check the final dataset, so the stage shift
is transparent.
Both changes are demo-plugin updates only; no production code change.
* fix: address Codex review findings on async-default flip
Three bugs and two test-quality concerns surfaced by an independent review of
the prior commits. Each was real and worth fixing in the flip PR.
Bug fixes
- Sync-fallback path was creating async-only model clients. The default flip
meant ``client_concurrency_mode = ASYNC`` for every default run, but the
``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()``
calls then hit ``SyncClientUnavailableError``. The resolution decision now
lives at the DataDesigner interface level via
``_resolve_client_concurrency_mode``: it considers both the env var and the
config (allow_resize forces sync clients) and is passed explicitly to
``create_resource_provider``. Direct callers of the factory still get the
env-var default.
- Sync→async bridge timeout ignored the per-call ``timeout=`` override. A
custom column calling ``model.generate(timeout=600)`` against a slow endpoint
was being cancelled at the model-config default, not 600s. The bridge now
prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``.
- Bridge timeout formula missed ``max_conversation_restarts``. One logical
generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)``
HTTP requests; the formula now multiplies both, matching the worst-case
attempt budget.
Engine routing fix (also surfaced by failing e2e plugin tests)
- ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch
generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn``
with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now
passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the
sync engine's FULL_COLUMN contract.
- The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``)
is reverted; the engine fix subsumes it. The processor-plugin fix
(``process_after_generation`` for the regex filter) stays — pre-batch
row-count change is intentionally rejected by the async engine.
Test improvements
- Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)``
and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct
from any finite default. Sync and async counterparts mirror.
- New regression tests for ``first_non_retryable_error`` surfacing covering
both load-raises and load-returns-empty paths, asserting the original
exception is chained via ``__cause__`` and that the typed
``DataDesignerEarlyShutdownError`` doesn't fire in this branch.
- New parametrized regression test for ``_resolve_client_concurrency_mode``
covering all four (env × allow_resize) combinations.
- New parametrized test for the per-call ``timeout=`` override flowing into
the bridge timeout calculation.
- Bridge formula tests extended with ``max_conversation_restarts`` cases.
* test: trim redundant parametrize cases in async-default tests
Three parametrize cases were duplicating coverage already provided by
existing standalone tests:
- ``test_acquire_*_timeout_branches`` parametrized over ``(0.0, raises)``
and ``(None, waits)``. The ``raises`` half duplicates
``test_acquire_*_raises_timeout_when_at_capacity``. Replaced with two
focused ``..._default_no_deadline_waits_for_release`` tests covering
only the no-deadline branch.
- ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four
cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the
env var alone forces it; the allow_resize input is moot. Dropped.
- ``test_async_bridge_honors_per_call_timeout`` had three cases. The
"override below floor" case cross-products the per-call override flow
with the floor-clamping behavior already covered by
``test_compute_bridge_timeout``. Dropped.
Net: -25 lines of test code with no loss of essential coverage.
* docs: fold migration page into existing concept docs
The standalone ``Migrating to the async default`` page didn't fit the
existing docs style — present tense, behavior over comparisons, content
in the natural concept home. Folding it in:
- ``architecture-and-performance.md`` gets a new ``Async Engine`` section
covering per-model timeouts, run outcomes (partial completion +
``DataDesignerEarlyShutdownError``), and the transitional opt-out.
Three stale ``async engine is landing soon`` callouts updated to
reflect the flip.
- ``custom_columns.md`` gets two short notes: a thread-safety callout
near Generation Strategies, and a mocking-with-spec note in
Development Testing.
- ``async-all-the-way-down.md`` Update section now points at the new
arch-and-perf section.
- README heads-up links to the same anchor.
- ``migration-async-default.md`` removed; mkdocs.yml entry dropped.
* docs: frame Execution Model as sync-engine specifics
Small targeted edits to make the user-facing concept docs consistent
with the post-flip state. No restructuring.
- ``architecture-and-performance.md``: the ``Execution Model`` callout
now opens with two engines, links to the new ``Async Engine`` section,
and frames the existing column-at-a-time description as sync-engine
semantics. The ``Step 2: Process columns sequentially`` paragraph notes
the async engine relaxes this. The ``Key Concepts`` table differentiates
per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells``
is the same on both.
- ``processors.md``: added a warning callout about the async engine's
row-count invariance in pre- and post-batch stages, with the guidance
to use ``process_after_generation()`` for row-filtering or expansion.
* fix: address review nits from PR #592 (Nabin)
Four targeted fixes from the review.
Worth-addressing (warning):
- ``test_acquire_async_default_no_deadline_waits_for_release`` was
spawning the release task without holding a strong reference. The
loop's weak-ref bookkeeping could GC it before the inner ``await``
observes the release, producing a CI flake. Hold the task and
``await`` it in ``finally``.
Take-it-or-leave-it (applied):
- Root-cause error surfacing now includes the exception type name:
``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see
``ValueError: ...`` instead of just the message string. The
``__cause__`` chain is preserved either way.
- Drop the defensive ``getattr(c, "allow_resize", False)`` in
``_resolve_client_concurrency_mode`` — every member of
``ColumnConfigT`` inherits ``allow_resize: bool = False`` from
``SingleColumnConfig``.
- One-line comment near the root-cause surfacing branch noting that
``actual_num_records == 0`` is async-only (sync runs leave it at
``-1``), so the branch is async-only by construction.
Not addressed in this PR (filing as follow-ups):
- ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in
``column_generators/generators/base.py:_run_coroutine_sync``. That
bridge has no model-facade context to derive a timeout from, so the
fix is a structural refactor outside this PR's scope.
- First-error capture loses subsequent-error context. The "first wins"
heuristic is documented; richer aggregation is a follow-up.
* fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync
This was the third hardcoded 300s timeout (Nabin flagged it on PR #592).
The path is the generic sync→async bridge in ``ColumnGenerator.generate()``:
when a subclass overrides only ``agenerate()``, the sync entry point runs
the coroutine in a background thread.
Same philosophy we applied to the throttle queue wait elsewhere in the
PR: a defensive deadline on top of work that's already bounded by the
HTTP timeout doesn't add safety, it just produces spurious failures on
slow self-hosted endpoints. Drop the constant, the timeout exception
handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)``
becomes the simple cleanup.
Tests in ``test_async_generators.py`` exercise the happy path only and
don't depend on the timeout firing.
* Revert "fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync"
This reverts commit
|
||
|---|---|---|
| .. | ||
| data_designer_e2e_tests | ||