DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Nabin Mulepati	2a487cdc5c	feat: add dropped column preservation toggle (#691 ) * feat: add dropped column preservation toggle Closes #690 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com> * fix: reject dropped column policy resume mismatch Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com> --------- Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>	2026-05-21 13:19:20 -06:00
Sai Asish Y	000fc09f94	fix(interface): reject duplicate names within output_processors (#675 ) (#697 ) Signed-off-by: SAY-5 <say.apm35@gmail.com>	2026-05-21 12:13:51 -06:00
Eric W. Tramel	c0a4dcbb85	feat: implement async scheduling admission control (#661 ) Some checks are pending CI / End to end test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test Config (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Config (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Config (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / End to end test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details CI / Coverage Check (Python 3.11) (push) Blocked by required conditions Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details	2026-05-20 20:58:05 -04:00
Mike Knepper	498e627d49	feat: Expose on_batch_complete via create method (#663 ) Some checks are pending CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Coverage Check (Python 3.11) (push) Waiting to run Details CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Lint and Format Check (push) Waiting to run Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Check License Headers (push) Waiting to run Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details	2026-05-19 09:38:18 -05:00
Andre Manoel	6055290136	feat: add workflow chaining (#636 ) Some checks are pending CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Coverage Check (Python 3.11) (push) Waiting to run Details CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Lint and Format Check (push) Waiting to run Details CI / Check License Headers (push) Waiting to run Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details Publish Fern devnotes / deploy (push) Waiting to run Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details * feat: add workflow chaining * test: tidy workflow chaining coverage * fix: harden workflow chaining concurrency * docs: update workflow chaining plan * feat: add workflow stage postprocessors * feat: expose workflow stage outputs * fix: align workflow selected output export * fix: address workflow chaining review issues * fix: align workflow parquet export selection Signed-off-by: Andre Manoel <amanoel@nvidia.com> * fix: preserve generated columns in drop validation Signed-off-by: Andre Manoel <amanoel@nvidia.com> * fix: clarify workflow output processors Signed-off-by: Andre Manoel <amanoel@nvidia.com> * docs: add workflow chaining page * docs: align workflow chaining warning * fix: address workflow review nits --------- Signed-off-by: Andre Manoel <amanoel@nvidia.com>	2026-05-18 20:15:47 -03:00
Nabin Mulepati	3c8394e783	fix(interface): silence registry-default deprecation when library auto-fills it (#655 ) The ``ModelProviderRegistry.default is deprecated`` warning added in #594 fires for every fresh-install ``DataDesigner()`` construction, even when the user wrote ``default=`` nowhere — neither in YAML, nor in Python, nor in any ``ModelConfig``. Root cause: ``resolve_model_provider_registry`` synthesises ``default=providers[0].name`` for the multi-provider case to satisfy ``check_implicit_default``. The auto-seeded ``~/.data-designer/model_providers.yaml`` ships three providers and no ``default:`` key, so this path is hit on every bare ``DataDesigner()`` call. ``_warn_on_explicit_default`` then attributes the warning to the user's ``DataDesigner()`` line, with a remediation message ("Specify provider= explicitly on each ModelConfig") that doesn't even apply when the user hasn't built a ``ModelConfig`` (e.g. a UUID-only sampler config with the GitHub plugin). Fix: broaden the existing warning suppression in ``DataDesigner.__init__`` to also cover the ``model_providers is None`` case. The user is opting into all defaults — the library is the one filling ``default=``, so the deprecation nudge is misdirected. Users who hand-construct a multi-provider list in Python still see the warning (they wrote the multi-provider intent themselves), and direct ``ModelProviderRegistry(default="x")`` always warns — those are the entry points #589 actually targets. New regression test pins the bare-``DataDesigner()`` quiet path so a future tightening of the suppression can't silently re-introduce the spurious warning. Refs #589, follow-up to #594. Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>	2026-05-13 14:15:10 -06:00
Przemysław Boruta	0afe287a5f	feat(results): add export() method and --output-format CLI flag (#540 ) * feat(results): add export() method and --output-format CLI flag Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet. The CLI create command gains --output-format / -f which writes dataset.<format> alongside the parquet batch files. * fix(cli): validate output_format before dataset generation * fix(cli): remove top-level results import from create.py to preserve lazy loading * fix(results): address andreatgretel review — error types, UX ordering, import hygiene - Derive SUPPORTED_EXPORT_FORMATS from get_args(ExportFormat) so the two can't drift apart - Replace ValueError with InvalidFileFormatError in export() — consistent with project error conventions - Add date_format="iso" to to_json() for consistent datetime serialization across formats - Add click.Choice(SUPPORTED_EXPORT_FORMATS) to --output-format CLI option for parse-time validation, better --help output, and tab completion - Fix double load_dataset() in run_create: inline len() so the DataFrame ref dies before export - Move success message after the export block to avoid "Dataset created" followed by "Export failed" - Move imports to module level in test_results.py (json, Path, lazy already imported) - Add controller-level tests for output_format happy path, bad format rejection, and export failure * fix(results): correct Raises docstring — ValueError -> InvalidFileFormatError * feat(results): stream batch files in export() to avoid OOM on large datasets - Rewrite export() to read batch parquet files one at a time instead of materialising the full dataset via load_dataset(); peak memory is now proportional to a single batch regardless of dataset size - Infer output format from file extension by default; format= parameter kept as an explicit override (e.g. writing .txt as JSONL) - _export_parquet unifies schemas across batches (pa.unify_schemas) to handle type drift (e.g. int64 vs float64 in the same column) - Drop format= from the controller's export() call — path already carries the correct extension - Rewrite export tests around real batch parquet files (stub_batch_dir fixture); add tests for multi-batch output, schema unification, unknown extension, empty batch directory, and explicit format override * fix(results): address nabinchha review — memory safety, error wrapping, UX - Replace load_dataset() with count_records() in CLI to avoid OOM on large datasets; add count_records() method using pq.read_metadata (reads file metadata only, no data pages loaded) - Remove redundant format validation in controller — click.Choice in create.py already rejects invalid values at parse time; dead code removed along with corresponding test - Wrap pa.unify_schemas / table.cast ArrowInvalid as InvalidFileFormatError to normalize third-party exceptions at module boundaries per AGENTS.md - Lowercase file extension before format lookup so .JSONL/.CSV/.PARQUET are accepted without error - Add clarifying comment to trailing-newline guard in _export_jsonl - Add tests: count_records(), uppercase extension, incompatible schemas * fix(results): fix parquet export schema unification and controller path bug - Use promote_options="permissive" in pa.unify_schemas so minor numeric type drift (int64 vs float64) is handled by promotion instead of raising - Also catch ArrowTypeError from unify_schemas and ValueError from table.cast() — the actual exception types thrown by pyarrow for these cases (ArrowInvalid alone is not sufficient) - Wrap base_dataset_path in Path() in generation_controller.run_create to guard against callers that return a str (mock returns str, Path does not support / with str operands) - Update test_export_parquet_incompatible_schemas_raises to match the new error source: with permissive unification, different-column-name batches fail at cast() not at unify_schemas(), so the match string changes from "Cannot unify batch schemas" to "Cannot cast batch" * fix(results,cli): address nabinchha review round 2 - Use public pa.ArrowInvalid/ArrowTypeError instead of pa.lib.* in _export_parquet - Drop dead trailing-newline guard in _export_jsonl; skip empty batches with `if content` - Rename num_records→actual_record_count after count_records() call to avoid shadowing - Unlink partial export file before re-raising on export failure in run_create - Export filename now uses dataset_name (<dataset-name>.<format>) instead of literal "dataset" - Update help text and tests to match new export filename convention --------- Co-authored-by: Andre Manoel <165937436+andreatgretel@users.noreply.github.com>	2026-05-06 17:13:57 -06:00
Nabin Mulepati	f73da1975c	feat(models): deprecate implicit default provider routing (#594 ) Some checks failed CI / Test (Python 3.10 on macos-latest) (push) Has been cancelled Details CI / Test (Python 3.11 on macos-latest) (push) Has been cancelled Details CI / Test (Python 3.12 on macos-latest) (push) Has been cancelled Details CI / Test (Python 3.13 on macos-latest) (push) Has been cancelled Details CI / Test (Python 3.10 on ubuntu-latest) (push) Has been cancelled Details CI / Test (Python 3.11 on ubuntu-latest) (push) Has been cancelled Details CI / Test (Python 3.12 on ubuntu-latest) (push) Has been cancelled Details CI / Test (Python 3.13 on ubuntu-latest) (push) Has been cancelled Details CI / Test Engine (Python 3.13 on macos-latest) (push) Has been cancelled Details CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Has been cancelled Details CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Has been cancelled Details CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Has been cancelled Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.10 on macos-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.11 on macos-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.12 on macos-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.13 on macos-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Has been cancelled Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Has been cancelled Details CI / Coverage Check (Python 3.11) (push) Has been cancelled Details CI / End to end test (Python 3.10 on macos-latest) (push) Has been cancelled Details CI / End to end test (Python 3.11 on macos-latest) (push) Has been cancelled Details CI / End to end test (Python 3.12 on macos-latest) (push) Has been cancelled Details CI / End to end test (Python 3.13 on macos-latest) (push) Has been cancelled Details CI / End to end test (Python 3.10 on ubuntu-latest) (push) Has been cancelled Details CI / End to end test (Python 3.11 on ubuntu-latest) (push) Has been cancelled Details CI / End to end test (Python 3.12 on ubuntu-latest) (push) Has been cancelled Details CI / End to end test (Python 3.13 on ubuntu-latest) (push) Has been cancelled Details * feat(models): deprecate implicit default provider routing Emit DeprecationWarning whenever the legacy "implicit default provider" path is exercised: `ModelConfig.provider=None`, the registry-level `ModelProviderRegistry.default`, the YAML `default:` key in `~/.data-designer/model_providers.yaml`, and the CLI's "Change default provider" workflow. `resolve_model_provider_registry` skips passing `default=` in the single-provider case so the common construction path stays quiet. Multi-provider registries still pass `default` (per `check_implicit_default`) and warn accordingly. Update docs, the package README, and test fixtures to specify `provider=` explicitly on every `ModelConfig`. New tests cover each warning entry point and pin the post-deprecation happy paths. Refs #589 Made-with: Cursor * fix(models): address PR #594 review feedback Greptile P1: ProviderRepository.load emitted its DeprecationWarning inside a `try/except Exception` block. Under `filterwarnings("error", DeprecationWarning)` the warn would raise, the except would swallow it, and `load()` would silently return None (losing the registry). Move the warn outside the catch-all so the strict-warning path no longer drops valid configs. Greptile P2 / johnnygreco: `_warn_on_implicit_provider` and `_warn_on_explicit_default` use `stacklevel=2`, which lands inside pydantic v2's validator dispatch rather than at the user's `ModelConfig(...)` / `ModelProviderRegistry(...)` call. That broke both attribution (the source line was unhelpful) and Python's once-per-location dedup (every call collapsed to the same pydantic-internal key, suppressing all but the first warning). Introduce `data_designer.config.utils.warning_helpers.warn_at_caller`, which walks past the helper, validator, and any pydantic frames to find the user's call site and emits via `warnings.warn_explicit` with the user frame's `__warningregistry__`. Keeps attribution accurate and dedup keyed on the user's (filename, lineno). johnnygreco: align the `provider_repository.py` warning copy with the sibling site in `default_model_settings.py` ("specify provider= explicitly on each ModelConfig instead") so both YAML-default warning sites give the same migration instruction. The previous wording pointed users at "ModelConfig entries" inside `model_providers.yaml`, where ModelConfig entries don't actually live. johnnygreco: dedup the cascade in `DataDesigner.__init__`. With `model_providers=None` and a YAML `default:`, the user previously saw two DeprecationWarnings for the same root cause — `get_default_provider_name()` warns about the YAML key, then `resolve_model_provider_registry(...)` re-warns from `_warn_on_explicit_default`. Suppress the registry-level duplicate in the YAML-fallback branch via `warnings.catch_warnings()` so users see exactly one warning per user action. johnnygreco: tighten `_warn_on_explicit_default` to fire only when `default is not None`. Passing `default=None` explicitly is semantically equivalent to omitting it (caller is opting out of a registry-level default), and shouldn't trigger the deprecation nudge. johnnygreco: add a `model_validate({...})` regression test for `ModelConfig` so the deserialization path (legacy on-disk configs) is pinned alongside the construction path. Tests: - Update `test_load_exists` and `test_save` to omit `default=` so the roundtrip stops exercising the deprecated YAML-default path unguarded (Greptile note). - Wrap `test_resolve_model_provider_registry_with_explicit_default`, `test_get_provider`, and `test_init_user_supplied_providers_preserve_first_wins_over_yaml_default` in `pytest.warns` so the suite stays green under `-W error::DeprecationWarning` (Greptile note). - Add `test_explicit_default_none_does_not_emit_deprecation_warning` to pin the tightened predicate. - Add `test_init_yaml_default_emits_single_deprecation_warning` to pin the cascade-dedup behavior. Refs #589 Made-with: Cursor * fix(models): make deprecation warnings visible under default filters andreatgretel (PR #594): the YAML-default warning in `get_default_provider_name` and the registry-default warning emitted from inside DataDesigner helpers were attributing to data_designer library frames, not user code. Python's default filter chain includes `ignore::DeprecationWarning`, so library-attributed entries are silenced — meaning a normal `DataDesigner()` call with a YAML `default:` set showed nothing, and `resolve_model_provider_registry` warnings were similarly invisible. Two related changes: 1. `warn_at_caller`: extend the default skip-list from `("pydantic",)` to `("pydantic", "pydantic_core", "data_designer")` so the walk escapes both pydantic's validator-dispatch frames and data_designer helper frames before attributing. Also tighten the prefix predicate to exact-or-dotted-prefix matching (`name == p or name.startswith(p + ".")`) so e.g. `pydantic_helpers` is not falsely matched as part of `pydantic` (johnnygreco nit). Allow callers to pass a custom `skip_prefixes` for flexibility. Drop the "skip frame 0+1 unconditionally" guard now that prefix matching covers it. 2. `get_default_provider_name`: switch from `warnings.warn(stacklevel=2)` to `warn_at_caller`. The previous stacklevel pointed into `default_model_settings.py`, which is a library file → silenced under default filters. Verified the fix empirically with `python -W default`: warning is now attributed to the user's call site and rendered. johnnygreco (PR #594): add the missing `test_explicit_default_none_does_not_emit_deprecation_warning` regression for the `self.default is not None` predicate landed in the prior round. Tests: - New `test_warning_helpers.py` pins prefix-matching precision (rejects `pydantic_helpers` / `data_designer_other`), default skip-list contents, attribution past skip-prefix frames, and per-call-site dedup behavior. - `test_get_default_provider_name_warning_attributes_to_user_frame` pins andreatgretel's repro for the YAML-default site. - `test_explicit_default_warning_attributes_to_user_frame` pins the multi-frame case: construction goes through `resolve_model_provider_registry`, so the walk has to escape both pydantic and data_designer before landing on the test file. - `test_explicit_default_none_does_not_emit_deprecation_warning` pins johnnygreco's predicate-tightening regression. 3,124 tests pass (540 config + 1,923 engine + 653 interface; +10 net from this round). Refs #589 Made-with: Cursor * fix(models): apply warn_at_caller to remaining deprecation sites greptile-apps (PR #594, r3189904028): `ProviderRepository.load`'s YAML-default `DeprecationWarning` was using `warnings.warn(stacklevel=2)`, which attributes to whichever data_designer frame called `load()` — controllers, services, list/reset commands, agent introspection. Every real call path lands on `data_designer.cli.*`, which falls under Python's default `ignore::DeprecationWarning` filter and is silenced. Audit found two more sites with the same problem: - `DatasetBuilder._resolve_async_compatibility` (`allow_resize` / issue #552) — was using `stacklevel=4` to walk past `_resolve_async_compatibility -> build/build_preview -> interface -> user`. Brittle: any added frame (decorator, async wrapping, the `try/except DeprecationWarning: raise` boundary) shifts attribution silently. The existing test passed only because it used `simplefilter("always") + record=True`, which records warnings regardless of attribution. - `ProviderController._handle_change_default` — was using `stacklevel=2`, which lands on the menu dispatcher in the same controller module. `print_warning` already shows the message visually, but programmatic observers (`pytest.warns`, `filterwarnings("error", ...)`) saw a library-attributed entry that default filters silenced. All three migrated to `warn_at_caller` (the helper from `247fa30`) so attribution lands on the user's call site regardless of internal chain shape. `data_designer` is already in `DEFAULT_INTERNAL_PREFIXES`, so the walk escapes the entire library in one pass. Added attribution regression tests at each site asserting `warning.filename == __file__`. A future regression to `warnings.warn(stacklevel=N)` now fails CI instead of silently silencing the user-facing nudge: - `test_load_with_yaml_default_attributes_warning_to_caller` (test_provider_repository.py) - `test_resolve_async_compatibility` extended with the same assertion - `test_handle_change_default_emits_deprecation_warning` rewritten from `pytest.warns(...)` to a `catch_warnings(record=True)` block that filters for the message and asserts `filename == __file__` (`pytest.warns` does not check attribution, so the rewrite is required to actually catch the regression). 3,125 tests pass (548 config + 1,923 engine + 654 interface). Refs #589	2026-05-05 13:39:12 -06:00
Andre Manoel	61cdeefb17	feat: make async engine the default execution path (#592 ) Some checks failed CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Coverage Check (Python 3.11) (push) Waiting to run Details CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details Publish devnotes / deploy (push) Has been cancelled Details * feat: make async engine the default execution path The async engine has been hardening as opt-in for several releases. Make it the default and address the prerequisites flagged for the flip. Default flip - DATA_DESIGNER_ASYNC_ENGINE defaults to "1" at both consumption sites - Set DATA_DESIGNER_ASYNC_ENGINE=0 for one transitional release to opt out - allow_resize=True still falls back to sync with a DeprecationWarning Python 3.10 support - Replace asyncio.TaskGroup (3.11+) in async_concurrency.py with gather-with-explicit-cancel; semantics preserved because _run_task already swallows its own exceptions and uses _shutdown_event for sibling cancellation - Remove the sys.version_info < (3, 11) runtime guard - Remove the matching pytest skipif so the executor tests run on 3.10 too Derived timeouts (replaces two hardcoded 300s constants) - ThrottleManager.acquire_sync/async default to timeout=None (no deadline) instead of DEFAULT_ACQUIRE_TIMEOUT=300; HTTP request timeout already bounds actual work, queue waits scale with provider speed and AIMD - _AsyncBridgedModelFacade derives the sync->async bridge timeout from the model's inference_parameters.timeout and the call's max_correction_steps; one knob (per-model timeout) drives both deadlines, no new config surface - Add ModelFacade.request_timeout property so the bridge can read the effective timeout the client is configured with Root-cause surfacing - AsyncTaskScheduler captures the first non-retryable error and exposes it via first_non_retryable_error - Interface threads it through DataDesignerGenerationError when 0 records are produced without early-shutdown, so deterministic failures (e.g. bad seed sources) surface their original message instead of a wrapped FileNotFoundError on the parquet path Tests - New: throttle no-deadline default behavior (sync+async), parametrized derived bridge timeout, restored async_concurrency tests on 3.10 - Updated: test_dataset_builder.py uses an autouse fixture to pin its Mock-based tests to the sync engine they cover; existing bridge tests set facade.request_timeout for the new derivation Docs - Replace the stale LiteLLM security notice in README with a short async-default heads-up and link to the migration guide - Add docs/migration-async-default.md covering per-model timeouts, custom-column thread safety, mocking model calls, run outcomes, and the opt-out - Append a short Update section to the async-all-the-way-down dev note * test: extract _compute_bridge_timeout helper for direct testing The parametrized bridge-timeout test was patching ``concurrent.futures.Future.result`` to capture the timeout the bridge passed in. That reaches into stdlib internals (DEVELOPMENT.md "Mock at boundaries: Keep mocking shallow") and the ``ids=`` argument on the parametrize was missing. Extracts the formula into a module-level ``_compute_bridge_timeout`` helper. The test now calls the helper directly with no mocking, and the parametrize gets readable ids. Behavior is unchanged. * test(e2e): align demo plugins with async engine contracts The e2e demo plugins exercise plugin discovery and full DD lifecycle. Two of them were written against sync-engine semantics that the async engine restricts: - DemoColumnGeneratorImpl was a ColumnGeneratorFullColumn with no required_columns. The async engine routes ``no-upstream`` columns through the from-scratch path, which passes an empty DataFrame to generators that aren't FromScratchColumnGenerator subclasses. The generator then produces 0 rows and the scheduler raises ``update_batch received 0 values``. Switching the plugin to FromScratchColumnGenerator with generate_from_scratch(num_records) matches what the plugin actually does (produces a constant column without input) and works on both engines. - RegexFilterProcessor implemented process_before_batch with row-count changes. The async engine enforces row-count invariance in pre- and post-batch processor stages by design. Moving the filter to process_after_generation preserves the plugin's purpose (regex-based row filtering) at a stage that supports row-count changes on both engines. Test assertions check the final dataset, so the stage shift is transparent. Both changes are demo-plugin updates only; no production code change. * fix: address Codex review findings on async-default flip Three bugs and two test-quality concerns surfaced by an independent review of the prior commits. Each was real and worth fixing in the flip PR. Bug fixes - Sync-fallback path was creating async-only model clients. The default flip meant ``client_concurrency_mode = ASYNC`` for every default run, but the ``allow_resize=True`` path falls back to the sync engine — sync ``model.generate()`` calls then hit ``SyncClientUnavailableError``. The resolution decision now lives at the DataDesigner interface level via ``_resolve_client_concurrency_mode``: it considers both the env var and the config (allow_resize forces sync clients) and is passed explicitly to ``create_resource_provider``. Direct callers of the factory still get the env-var default. - Sync→async bridge timeout ignored the per-call ``timeout=`` override. A custom column calling ``model.generate(timeout=600)`` against a slow endpoint was being cancelled at the model-config default, not 600s. The bridge now prefers ``kwargs.get("timeout")`` over ``facade.request_timeout``. - Bridge timeout formula missed ``max_conversation_restarts``. One logical generation can do ``(1 + max_conversation_restarts) × (1 + max_correction_steps)`` HTTP requests; the formula now multiplies both, matching the worst-case attempt budget. Engine routing fix (also surfaced by failing e2e plugin tests) - ``_run_from_scratch`` else-branch passed an empty DataFrame to non-FromScratch generators classified as seeds (no upstream columns), so ``ColumnGeneratorFullColumn`` with no required_columns produced 0 rows for an ``rg_size``-row buffer. Now passes an ``rg_size``-row snapshot of the row-group buffer, mirroring the sync engine's FULL_COLUMN contract. - The earlier ``DemoColumnGeneratorImpl`` workaround (rewrite as ``FromScratchColumnGenerator``) is reverted; the engine fix subsumes it. The processor-plugin fix (``process_after_generation`` for the regex filter) stays — pre-batch row-count change is intentionally rejected by the async engine. Test improvements - Throttle no-deadline tests are parametrized over ``(timeout=0.0, raises)`` and ``(timeout=None, waits)``, pinning that ``None`` is genuinely distinct from any finite default. Sync and async counterparts mirror. - New regression tests for ``first_non_retryable_error`` surfacing covering both load-raises and load-returns-empty paths, asserting the original exception is chained via ``__cause__`` and that the typed ``DataDesignerEarlyShutdownError`` doesn't fire in this branch. - New parametrized regression test for ``_resolve_client_concurrency_mode`` covering all four (env × allow_resize) combinations. - New parametrized test for the per-call ``timeout=`` override flowing into the bridge timeout calculation. - Bridge formula tests extended with ``max_conversation_restarts`` cases. * test: trim redundant parametrize cases in async-default tests Three parametrize cases were duplicating coverage already provided by existing standalone tests: - ``test_acquire__timeout_branches`` parametrized over ``(0.0, raises)`` and ``(None, waits)``. The ``raises`` half duplicates ``test_acquire__raises_timeout_when_at_capacity``. Replaced with two focused ``..._default_no_deadline_waits_for_release`` tests covering only the no-deadline branch. - ``test_resolve_client_concurrency_mode_matches_engine_choice`` had four cases. The ``async-off + allow-resize`` case asserts ``SYNC`` because the env var alone forces it; the allow_resize input is moot. Dropped. - ``test_async_bridge_honors_per_call_timeout`` had three cases. The "override below floor" case cross-products the per-call override flow with the floor-clamping behavior already covered by ``test_compute_bridge_timeout``. Dropped. Net: -25 lines of test code with no loss of essential coverage. * docs: fold migration page into existing concept docs The standalone ``Migrating to the async default`` page didn't fit the existing docs style — present tense, behavior over comparisons, content in the natural concept home. Folding it in: - ``architecture-and-performance.md`` gets a new ``Async Engine`` section covering per-model timeouts, run outcomes (partial completion + ``DataDesignerEarlyShutdownError``), and the transitional opt-out. Three stale ``async engine is landing soon`` callouts updated to reflect the flip. - ``custom_columns.md`` gets two short notes: a thread-safety callout near Generation Strategies, and a mocking-with-spec note in Development Testing. - ``async-all-the-way-down.md`` Update section now points at the new arch-and-perf section. - README heads-up links to the same anchor. - ``migration-async-default.md`` removed; mkdocs.yml entry dropped. * docs: frame Execution Model as sync-engine specifics Small targeted edits to make the user-facing concept docs consistent with the post-flip state. No restructuring. - ``architecture-and-performance.md``: the ``Execution Model`` callout now opens with two engines, links to the new ``Async Engine`` section, and frames the existing column-at-a-time description as sync-engine semantics. The ``Step 2: Process columns sequentially`` paragraph notes the async engine relaxes this. The ``Key Concepts`` table differentiates per-engine for ``Batching`` and ``Sequential columns``; ``Parallel cells`` is the same on both. - ``processors.md``: added a warning callout about the async engine's row-count invariance in pre- and post-batch stages, with the guidance to use ``process_after_generation()`` for row-filtering or expansion. * fix: address review nits from PR #592 (Nabin) Four targeted fixes from the review. Worth-addressing (warning): - ``test_acquire_async_default_no_deadline_waits_for_release`` was spawning the release task without holding a strong reference. The loop's weak-ref bookkeeping could GC it before the inner ``await`` observes the release, producing a CI flake. Hold the task and ``await`` it in ``finally``. Take-it-or-leave-it (applied): - Root-cause error surfacing now includes the exception type name: ``f"🛑 {type(root_cause).__name__}: {root_cause}"`` so users see ``ValueError: ...`` instead of just the message string. The ``__cause__`` chain is preserved either way. - Drop the defensive ``getattr(c, "allow_resize", False)`` in ``_resolve_client_concurrency_mode`` — every member of ``ColumnConfigT`` inherits ``allow_resize: bool = False`` from ``SingleColumnConfig``. - One-line comment near the root-cause surfacing branch noting that ``actual_num_records == 0`` is async-only (sync runs leave it at ``-1``), so the branch is async-only by construction. Not addressed in this PR (filing as follow-ups): - ``SYNC_BRIDGE_TIMEOUT = 300`` still hardcoded in ``column_generators/generators/base.py:_run_coroutine_sync``. That bridge has no model-facade context to derive a timeout from, so the fix is a structural refactor outside this PR's scope. - First-error capture loses subsequent-error context. The "first wins" heuristic is documented; richer aggregation is a follow-up. * fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync This was the third hardcoded 300s timeout (Nabin flagged it on PR #592). The path is the generic sync→async bridge in ``ColumnGenerator.generate()``: when a subclass overrides only ``agenerate()``, the sync entry point runs the coroutine in a background thread. Same philosophy we applied to the throttle queue wait elsewhere in the PR: a defensive deadline on top of work that's already bounded by the HTTP timeout doesn't add safety, it just produces spurious failures on slow self-hosted endpoints. Drop the constant, the timeout exception handling, and the ``timed_out`` bookkeeping. ``pool.shutdown(wait=True)`` becomes the simple cleanup. Tests in ``test_async_generators.py`` exercise the happy path only and don't depend on the timeout firing. * Revert "fix: drop SYNC_BRIDGE_TIMEOUT in _run_coroutine_sync" This reverts commit `7a0b77d44c`. * docs+feat: deprecate the sync-engine opt-out path Nabin asked whether the docs should adopt explicit "deprecation" language on the opt-out path. Doing both: - Doc: ``architecture-and-performance.md``'s ``Opting out`` section now uses an ``!!! warning "Deprecated"`` admonition that names the env var as a deprecated escape hatch and notes the run-time warning. - Code: ``DataDesigner._resolve_client_concurrency_mode`` emits a ``DeprecationWarning`` when ``DATA_DESIGNER_ASYNC_ENGINE=0`` is detected. Same precedent as the existing ``allow_resize=True`` warning. Auto-fallback via ``allow_resize`` does not double-warn here; the builder layer emits its own warning later. - Test: parametrized regression now asserts ``pytest.warns(DeprecationWarning)`` on the opt-out branch and treats any warning on the async-on branches as a failure (``simplefilter("error")`` inside the ``catch_warnings`` block). * fix: emit logger.warning alongside DeprecationWarning on env-var opt-out Parity fix from Nabin's re-review of PR #592. The ``allow_resize=True`` auto-fallback path in ``_resolve_async_compatibility`` emits both a ``logger.warning("⚠️ ...")`` and a ``DeprecationWarning``. The new ``DATA_DESIGNER_ASYNC_ENGINE=0`` opt-out path was only emitting the ``DeprecationWarning``, leaving users who run with default warning filters silenced and inconsistent with the established precedent. Match the pattern: same message body, both signals, same stacklevel. * docs: breadcrumb explaining why SYNC_BRIDGE_TIMEOUT survives PR #592 Nabin's re-review pointed out that ``base.py`` is the lone place where the 300s pattern survives, while ``custom.py`` and ``throttle_manager.py`` both retired theirs. Without a comment, a future reader (or a lint sweep) could mistake this for an oversight and "consistency-fix" it the wrong way. Add a short note at the constant naming the two retired siblings, the reason this one stayed (no ``ModelFacade`` context to derive from), and the fact that it's tracked for a structural follow-up.	2026-05-04 16:22:13 -03:00
Nabin Mulepati	5ec0e89972	fix(interface): don't leak YAML default provider into user-supplied list (#591 ) Some checks are pending CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Coverage Check (Python 3.11) (push) Waiting to run Details CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Lint and Format Check (push) Waiting to run Details CI / Check License Headers (push) Waiting to run Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details * fix(interface): don't leak YAML default provider into user-supplied list When a caller passes ``model_providers`` to ``DataDesigner.__init__``, the YAML's ``default:`` key from ``~/.data-designer/model_providers.yaml`` was still being applied to the resulting ``ModelProviderRegistry``. This caused two related problems: 1. Hard failure: if the YAML default named a provider absent from the user-supplied list, construction raised ``ValidationError: Specified default 'X' not found in providers list``. 2. Silent override: if the YAML default matched a non-first user-supplied provider, the documented "first wins" behavior was silently overridden. Gate the YAML lookup on ``model_providers is None`` so that user-supplied lists own their own default. Also expose ``model_provider_registry`` and ``run_config`` as public read-only properties on ``DataDesigner``, paired with the existing ``secret_resolver`` property and ``set_run_config`` setter; tests now use these instead of the underscore-prefixed attributes. Closes #588. Made-with: Cursor * fix(interface): tighten model_provider_registry docstring; pin YAML-fallback path Address review feedback on PR #591: - Clarify ``model_provider_registry`` docstring so it reflects the full fallback chain: user-supplied first → YAML default (when set) → first provider in the YAML list. - Add ``test_init_no_user_providers_uses_yaml_default`` to lock the YAML-fallback contract that the #588 fix preserved but didn't pin. Made-with: Cursor	2026-04-30 15:50:26 -06:00
Andre Manoel	47c72b3d87	fix(async): pack of fixes for async engine under degraded providers (#585 ) Some checks are pending CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Coverage Check (Python 3.11) (push) Waiting to run Details CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Lint and Format Check (push) Waiting to run Details CI / Check License Headers (push) Waiting to run Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details * fix(async): exclude all retryable errors from early-shutdown gate The gate previously only excluded `ModelRateLimitError`, leaving `ModelTimeoutError`, `ModelInternalServerError`, and `ModelAPIConnectionError` to count toward the sliding-window error rate. Under provider degradation these errors cluster in time (concurrent in-flight requests time out together), so 5/10 in a row is easy and trips the gate even when salvage could recover the rows. Refs #575. * feat(async): WARN log when provider showing degraded performance Diagnostic A/Bs against build.nvidia.com showed runs failing silently under provider degradation - no log indication that retryable errors were piling up until the early-shutdown gate fired (or, post-fix, until salvage exhaustion). Surfacing this earlier helps users distinguish "DataDesigner is broken" from "the upstream provider is slow today." Tracks a separate sliding window over retryable-vs-not for every task outcome (independent of the early-shutdown gate's window) and emits a throttled WARN when the rolling fraction crosses the threshold. Refs #575. * fix(async): salvage partial row groups on early shutdown Before: when the early-shutdown gate fired, any row group still in flight stayed in `_rg_states` un-checkpointed. The buffer manager later raised `FileNotFoundError` when the builder tried to read the finalized parquet. User-visible result: `0 records produced`. After: a new `_finalize_after_shutdown` step runs in `run()`'s finally block, after `_cancel_workers` has drained in-flight tasks (Codex caveat: in-flight `from_scratch`/`batch` tasks must not be allowed to write into a buffer that's being finalized). For each remaining row group it drops rows that aren't fully complete, then delegates to the existing `_checkpoint_completed_row_groups` so the buffer manager's zero-survivor handling (skip empty parquet, free buffer) kicks in unchanged. Also surfaces partial completion as a structured signal: scheduler exposes `early_shutdown: bool` and `partial_row_groups: tuple[int, ...]` properties so callers can detect partial completion programmatically rather than parsing log lines. Builder uses this to emit a more specific WARN distinguishing early shutdown from non-shutdown drops. Refs #575. * fix(throttle): reset consecutive_429s on non-rate-limit failure In `release_failure`, the cascade counter wasn't reset, so a sequence like 429 → 500 → 429 was treated as 2 consecutive 429s. The cascade counter feeds AIMD's reduce-once-per-cascade logic; the second 429 should start a fresh cascade and trigger another concurrency reduction, but currently doesn't. Standalone bug surfaced during #575 investigation; not on the failure path that drives the gate-trip outcome but worth fixing while we're in this code. * fix(custom): preserve retryability through CustomColumnGenerator wrap A real-workload run of #575 showed the early-shutdown gate still trips even with the gate-exclusion fix in place: the trigger is 10 timeouts inside Anonymizer's QA-repair custom columns, all wrapped in CustomColumnGenerationError (non-retryable) by the catch-all in CustomColumnGenerator. Two fixes here: 1. Re-raise RETRYABLE_MODEL_ERRORS unchanged before the wrap so the scheduler's _is_retryable correctly classifies them. 2. Surface _AsyncBridgedModelFacade timeouts as ModelTimeoutError instead of stdlib TimeoutError. Without this the sync bridge times out as the wrong exception type and is still classified non-retryable even after fix #1. Also moves _RETRYABLE_MODEL_ERRORS from async_scheduler to models/errors as the public RETRYABLE_MODEL_ERRORS tuple - both the scheduler and the wrap site need it, and models/errors is the appropriate home alongside the error class definitions. Refs #575. * feat(interface): typed DataDesignerEarlyShutdownError on zero-record runs When the async scheduler hits early shutdown and produces zero records, the buffer manager skips writing parquet (correctly), so ArtifactStorage.load_dataset_with_dropped_columns() raises FileNotFoundError. Previously this surfaced as a generic DataDesignerGenerationError wrapping the FileNotFoundError, which is ambiguous (could be missing files for any reason). This commit: - Adds DataDesignerEarlyShutdownError as a subclass of DataDesignerGenerationError so existing handlers still match while callers that want to react programmatically (retry on different alias, surface a degraded-provider message, etc.) can catch the specific type. - Plumbs the scheduler's structured signals (early_shutdown, partial_row_groups) up through the builder so they're available at data_designer.create() time without re-introspecting the scheduler. - create() raises the typed error in both failure modes (load fails or empty DataFrame returned) when builder.early_shutdown is True. Refs #575. * fix(async): emit first degraded-provider WARN regardless of clock state Initialize _last_degraded_warn_at to -inf so the first WARN is always emitted. The previous initialization to 0.0 suppressed the first WARN on fresh CI runners where time.monotonic() returns a small value (system boot uptime), making the throttle interval check (now - 0.0 < interval) true on the first attempt. * fix(async): address review findings on early-shutdown salvage PR Five real correctness issues caught in review of the original PR, plus a few smaller cleanups and test simplifications. Throttle - cascade reset (regression of existing AIMD invariant): release_failure() now resets consecutive_429s only when in_flight == 0. Resetting unconditionally broke "reduce once per cascade" when 429/500/429 arrived interleaved within a single in-flight burst - the second 429 was treated as a new cascade and the limit got halved twice for what was effectively one rate-limit event. Interface - typed-error gating: DataDesignerEarlyShutdownError now fires only when early_shutdown is true AND actual_num_records == 0. Without this, a partial-salvage run that fails to load for unrelated reasons (corrupt parquet, schema drift, disk hiccup) was misdiagnosed as "zero records produced," hiding the real cause. Async - WARN window scope: the degraded-provider warning was fed by every task outcome, including samplers and non-LLM customs. In realistic pipelines (one model column, several non-model columns) the rate stayed under threshold even when every model call was failing, silencing the WARN exactly when it mattered. Now gated on is_llm. Async/builder - signal preservation across raises: scheduler.early_shutdown and partial_row_groups are captured in a try/finally around future.result(), so a processor failure during the salvage path doesn't drop the structured signal. Both build() and build_preview() now reset per-run state at the start so reused builders don't leak prior-run flags. Async - dead code: dispatch_error capture in run() was unread (the post- finally check is unreachable on the exception path). Removed. Smaller cleanups: - early-shutdown WARN says "non-retryable error rate exceeded threshold" - bridge timeout WARN demoted to debug (ModelTimeoutError already surfaces it; the throttled degraded-provider WARN is the user-facing signal) - TODO note for threading degraded_warn_* through RunConfig - doc note in _finalize_after_shutdown clarifying that pre-batch processor isn't re-run on partial-salvage row groups Tests: - new regression tests for the cascade burst case, partial-salvage error gating, and LLM-only WARN window - direct unit test for _reset_run_state - dedup via _make_storage / _seed_plus_cell_setup helpers - WARN emission cases parametrized into a single test - shared parametrize lists hoisted to module-level constants - redundant cascade test dropped in favor of the more thorough drain variant; redundant healthy-baseline test folded into the zero-survivor test * chore(async): address Nabin's review comments Style cleanups, parametrization, docstring polish, and one consistency fix in the typed-error path. All non-blocking ("Ship it (with nits)"). interface/data_designer.py: - preview() now raises DataDesignerEarlyShutdownError when shutdown produced zero records (parity with create()), and also gates on actual_num_records == 0 so partial-salvage runs that fail to load don't get misdiagnosed - create()'s defensive empty-DF guard mirrors the load-failure guard with the same actual_num_records == 0 check async_scheduler.py: - _record_retryable_outcome docstring clarifies that the call site filters by is_llm; the function alone reads as if every outcome feeds the window dataset_builder.py: - moved _reset_run_state() down past the public methods to match the project's public-before-private convention test_custom.py: - flattened TestAsyncBridgedModelFacade class into module-level test functions (matches the rest of the file) - hoisted inline imports (asyncio, threading, patch, _AsyncBridgedModelFacade, SyncClientUnavailableError) to top of file - driven retryable-error parametrize off RETRYABLE_MODEL_ERRORS directly instead of the hand-rolled factory list, so new retryable types pick up coverage automatically - dropped the redundant "Sanity" block in test_async_bridge_timeout_raises_ model_timeout_error - pytest.raises already enforces the type, the duplicate block was running the same slow scenario twice test_async_scheduler.py: - parametrize over RETRYABLE_MODEL_ERRORS directly (same as above) test_data_designer.py: - added preview-path tests for the typed-error and partial-salvage fall-through cases - updated the existing empty-DF test to also patch actual_num_records=0 (otherwise the new gating in the empty-DF guard skips the typed error) * test(interface): consolidate create() error-dispatch tests into a matrix Five separate tests (two existing, three new from earlier in this PR) all probed the same dispatch logic in create(): "given a load outcome and a builder state, which error type should fire?" Pulled them into a single parametrized matrix indexed by (load_side_effect, early_shutdown, actual_num_records). Net result: 5 named tests → 1 parametrized test with 6 cells, and the previously-missing empty_df + shutdown + partial salvage cell is now covered. Test names retain readable IDs (load_fails_shutdown_zero_records etc.) so failures still pinpoint the exact case in pytest output.	2026-04-30 14:43:35 -03:00
Eric W. Tramel	8be4ff787f	feat: add RunConfig jinja rendering engine (#557 )	2026-04-17 15:06:27 -04:00
Johnny Greco	4a2813654a	fix: always return ISO-8601 from datetime postproc (#484 ) (#512 ) * fix: always return ISO-8601 from datetime postproc (#484) The DatetimeFormatMixin.postproc heuristics inferred output format from value distribution, silently stripping date/time components for small datasets or narrow date ranges. Replace with deterministic ISO-8601 output via vectorized strftime. Users who need custom formats can still set convert_to on the SamplerColumnConfig. * docs: update convert_to docstring and add DatetimeFormatMixin docstring The SamplerColumnConfig.convert_to docstring incorrectly stated that only "float", "int", or "str" are accepted. Datetime/timedelta samplers accept strftime format strings. Also document the ISO-8601 default. * test: add regression test for #484 via DataDesigner.preview API Captures the exact reproducer from the issue: a single-record datetime preview through the public DataDesigner.preview() interface must return a full ISO-8601 timestamp, not a bare year string. * test: trim redundant datetime tests, align reproducer with issue #484 - Remove postproc_same_day_records (subsumed by same_month + no_convert_to) - Remove postproc_always_parseable (subsumed by stdlib_fromisoformat) - Remove all_same_month integration test (subsumed by narrow_range_single_day) - Update single_record test to use unit="h" matching the issue reproducer * fix: address review nits — move datetime import to module scope, drop redundant isinstance	2026-04-09 12:50:40 -04:00
Eric W. Tramel	7891dd53cb	feat: add Hermes Agent rollout support (#500 )	2026-04-07 12:39:49 -04:00
Eric W. Tramel	58870bb83f	feat: add ATIF rollout ingestion (#495 )	2026-04-06 11:06:14 -04:00
Andre Manoel	c146af5146	chore: async engine follow-up - rename, preview, lifecycle, progress (#456 ) * chore: rename ColumnWiseDatasetBuilder, wire async preview, unify row-group lifecycle - Rename ColumnWiseDatasetBuilder to DatasetBuilder and column_wise_builder.py to dataset_builder.py, update all references - Extract _prepare_async_run() factory shared by build and preview paths - Add _build_async_preview() for async preview with no disk checkpoints - Replace on_row_group_complete/on_checkpoint_complete with single on_finalize_row_group callback; caller handles checkpointing - Add free_row_group() on RowGroupBufferManager for discard-without-write - Free fully-dropped row groups instead of finalizing them - Add consolidated AsyncProgressReporter for async generation logging Closes #437, closes #442, closes #444 * feat: add consolidated async progress reporting with row group context - Add AsyncProgressReporter: groups per-column progress into a single log block emitted at configurable intervals (default 5s) - Add quiet mode to ProgressTracker to suppress per-tracker logging when used with the consolidated reporter - Add ContextVar-based row group tagging (RG1, RG2, ...) for log messages emitted inside async tasks (samplers, expressions, seeds) - Add progress_interval to RunConfig for user-configurable reporting - Remove log_start_async from ProgressTracker (superseded by reporter) Closes #443 * fix async preview and progress reporting * feat: add opt-in sticky ANSI progress bars for generation Add RunConfig.progress_bar setting that replaces periodic log-line progress with sticky terminal bars that stay at the bottom while logs scroll above. Pure ANSI escape codes, no new dependencies. Disabled by default - existing log-based output unchanged. * docs: add missing progress_interval docstring in RunConfig * fix: update progress bar on every completion in async path Skip the time gate when the progress bar is active so the bar redraws on every record instead of every progress_interval seconds. * fix: resolve row-group semaphore deadlock when all tasks are deferred When all tasks for admitted row groups fail with transient errors, the row-group semaphore never releases, blocking admission of new row groups. Fix by salvaging stalled row groups inline - retrying deferred tasks immediately so row groups can checkpoint and free their semaphore slots. Also updates row group log format to (x/X) with leading zeros. * fix: eagerly salvage stalled row groups to avoid wasting semaphore slots Run inline salvage after every checkpoint pass instead of only when globally stalled. Row groups with 0 in-flight and only deferred tasks are salvaged immediately, freeing their semaphore slot for new work. * fix: address review findings from greptile and codex - Use `with` statement for progress bar context (safe __exit__ on error) - Check bar.is_active instead of bar is not None (non-TTY fallback) - Record failures (not skips) for tasks that exhaust salvage retries - Record skipped tasks when pre-batch filtering drops rows * fix: stable progress bar width and accurate failure counts - Pre-compute fixed stats width at bar creation to prevent bar resizing when failed count appears - Cap displayed completed at total to avoid >100% on retries - Exclude already-failed columns from skip recording to prevent double-counting in progress reporter * fix: address Nabin's review - exclude_columns, dead code, docstring - Add exclude_columns={task.column} on non-retryable batch/from_scratch drop path to prevent double-counting (same pattern as cell path) - Simplify salvage drop to per-task exclude (is_dropped guard handles multi-column case) - Remove dead _in_flight_for_rg method - Fix context.py docstring to match actual (x/X) format * fix: _drain_frontier exits before dispatching ready salvage tasks After salvage discards a cell task from dispatched (making it available in the frontier), _drain_frontier broke immediately because nothing was in-flight yet. The task and its downstream were never re-dispatched, leaving the row group incomplete. Fix: only break when both ready and in-flight are empty. * fix: salvage edge cases found by code review - _drain_frontier: only break when both ready and in-flight are empty - _salvage_rounds: re-mark sibling columns as dispatched after from_scratch retry to prevent duplicate dispatch - _salvage_stalled_row_groups: separate exhausted tasks from new drain failures to avoid treating non-stalled tasks as permanent - _checkpoint_completed_row_groups: clean up deferred tasks for checkpointed row groups - Early shutdown: salvage stalled row groups before exiting * fix: skip record_failure for already-dropped rows in salvage When multiple columns fail for the same row, the first drop records a skip for the other column. Without this guard, record_failure fires again for the second column, double-counting it.	2026-03-25 14:50:48 -03:00
Eric W. Tramel	a0fb04ee07	feat: agent rollout trace ingestion (#399 )	2026-03-20 09:17:35 -04:00
Eric W. Tramel	c88508790f	test: trim FileSystemSeedReader follow-up coverage (#432 )	2026-03-19 09:47:55 -04:00
Eric W. Tramel	3d33680b7d	feat: support 1-to-many FileSystemSeedReader hydration (#424 )	2026-03-17 20:56:19 -04:00
Eric W. Tramel	28c8345909	feat: add built-in filesystem seed readers (#421 )	2026-03-16 17:40:27 -04:00
Nabin Mulepati	340087f4cf	fix: raise clear error when all records are dropped during generation (#383 ) * fix: raise clear error when all records are dropped during generation (#382) When every record fails during generation (e.g. LLM or image model errors), the empty dataset previously reached the profiler, which raised a misleading DatasetProfilerConfigurationError about missing columns. Now both preview() and create() check for an empty dataset before profiling and raise a DataDesignerGenerationError with an actionable message instead. Made-with: Cursor * fix: keep load_dataset_with_dropped_columns inside profiling try/except Moves the call back inside the try/except block so ArtifactStorageError is caught and re-raised as DataDesignerProfilingError, preserving the documented API contract. Also reduces test num_records to 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address Andre's feedback * more small feedback --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-10 10:18:16 -06:00
Johnny Greco	1439bbea7e	chore: Improve CLI startup with lazy heavy import cleanup (#330 ) * perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations. * Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. * increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.	2026-02-18 16:24:15 -05:00
Nabin Mulepati	1f172eaa28	chore: move ArtifactStorage to engine/storage/ module (#321 )	2026-02-12 14:58:12 -07:00
Andre Manoel	429b558588	refactor: callback-based processor design (#294 )	2026-02-11 21:32:24 -03:00
Johnny Greco	c19f35639f	chore: add publish script and update license headers (#253 )	2026-01-28 08:47:34 -05:00
Johnny Greco	ae0665fa16	refactor: slim package refactor into three subpackages (#240 ) * remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note to about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protetion rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main	2026-01-27 13:53:20 -05:00

26 commits