DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Nabin Mulepati bbcd7d3995 fix: harden resume checkpoint handling (#624)

* fix: harden resume checkpoint handling

  Persist config identity in metadata, make checkpoints atomic, and reject unsafe resume states so interrupted runs do not mix incompatible or post-processed data.

* fix: close resume edge cases

  Let IF_POSSIBLE start fresh for resize configs and mark after-generation processing before mutation so interrupted processors cannot be resumed unsafely.

* refactor: drop dataset directory lock

  Single-user CLI/notebook flows don't race on the artifact directory, and the timestamped-directory fallback already handles the "ran it twice" case. The lock added complexity (re-entrancy, stale cleanup, the cached-property trap where IF_POSSIBLE→NEVER moves writes to a timestamped directory while the lock stays pinned to the original) for no real protection. Atomic metadata writes still cover the actual hazard (crash mid-write).

  Also fix a pre-existing test bug in test_initial_actual_num_records_uses_actual_parquet_rows_for_partial_row_group where the mocked scheduler hit the partial-completion path with unconfigured Mock attributes.

* fix: address Greptile review on resume edge cases

  * Drop the unreachable ResumeMode.IF_POSSIBLE branch in _post_generation_processed_resume_result. By the time this helper runs, build() has normalised IF_POSSIBLE to ALWAYS or NEVER, so the guard now matches reality. Tighten the docstring to document the three outcomes (no-op return / fall through / raise).
  * Split the post-processed extension/raise into two cases. When num_records < prior_target the user just asked for fewer records than already exist; the previous "would mix pre- and post-processor records" message only describes the extension case. Mirror the wording used by _load_resume_state and add a regression test.
  * Remove the dead _find_completed_row_group_ids wrapper now that _build_async uses _find_completed_row_groups directly. Rename the related test to match.

* refactor: unify sync + async resume around filesystem-derived progress

  Both engines now derive `num_completed_batches` and `actual_num_records` from `parquet-files/batch_.parquet` via `_recover_progress_from_disk`. `metadata.json` keeps describing the run configuration (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint), while the filesystem is the source of truth for progress.

  This closes the sync engine's race window between `move_partial_result_to_final_file_path` and the metadata write that follows it, matching the crash-recovery the async engine already had. The sync engine additionally rejects non-contiguous batch IDs (a hole can only mean external mutation or a directory written by an incompatible engine); the async engine continues to tolerate gaps from out-of-order completion via `allow_holes=True`.

  Existing sync resume tests now seed parquet files alongside metadata, and two new tests cover the unified behaviour: filesystem progress wins when metadata lags, and sync rejects non-contiguous IDs.

* docs: clarify DatasetCreationResults observability scope on resume

  `load_dataset`, `count_records`, `load_analysis`, `export`, and `push_to_hub` all read from the artifact directory, so they reflect the cumulative dataset (original + resume rows). `task_traces`, model-usage logs, and telemetry events are scoped to the current invocation only because the original run's in-memory state is not persisted. Document this in the class docstring, the architecture note, and the Fern resume guide.

* docs: explain DeprecationWarning re-raise in create()/preview()

  Future readers were puzzled by the ``except DeprecationWarning: raise`` short-circuits before the generic generation-error wrappers. Add a comment in ``create()`` (with a back-reference from ``preview()``) to record that strict warning filters (``pytest.warns``, ``-W error::DeprecationWarning``) turn the engine's ``warnings.warn(..., DeprecationWarning)`` calls, most notably the ``allow_resize=True`` deprecation in ``_resolve_async_compatibility``, into raised exceptions, and we want them to surface untouched instead of being swallowed by ``DataDesignerGenerationError``.

* fix: close after-generation crash window and tighten metadata typing on resume

  Address review feedback on resume hardening:

  * Run after-generation processors unconditionally on the on-disk dataset rather than gating on the generation return value. The previous gate silently skipped after-generation when resume saw every row group already on disk, leaving a crash window between the final parquet write and the ``post_generation_state="started"`` marker write: in that window the dataset is complete but after-generation never ran, and the on-disk parquet files are still clean. The "started" short-circuit still rejects the other direction (crashed mid-rewrite, ambiguous state), so resume only re-runs after-generation when it is safe to do so.
  * Raise ``DatasetGenerationError`` (instead of letting a raw ``TypeError`` leak out of ``num_records < prior_target``) when a post-processed dataset's metadata is missing ``target_num_records``. Mirrors the wording used by ``_load_resume_state``.
  * Document the new behaviour in ``architecture/dataset-builders.md`` and the Fern resume invariants.

  Tests:

  * ``test_build_resume_complete_dataset_runs_after_generation_when_no_marker`` covers the closed crash window via the public ``set_processor_runner`` API.
  * ``test_build_resume_post_generation_processed_missing_target_raises_clearly`` covers the typed-error gap.

2026-05-11 11:44:46 -06:00
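The two mechanisms the commit message leans on, atomic metadata writes and progress derived from the files on disk, can be sketched roughly as below. This is an illustration under stated assumptions, not the repository's code: `write_json_atomically`, `recover_completed_batch_ids`, the `batch_<id>.parquet` naming pattern, and the assumption that IDs start at 0 are all hypothetical. Only the `allow_holes` flag and the contiguity rule come from the message itself. Counting rows inside each parquet file is omitted.

```python
import json
import os
import re
import tempfile
from pathlib import Path

# Hypothetical naming pattern; the real file-name scheme is not shown in the commit message.
BATCH_FILE_PATTERN = re.compile(r"^batch_(\d+)\.parquet$")


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then swap it into place.

    A crash mid-write leaves either the old file or the new one, never a torn file.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def recover_completed_batch_ids(parquet_dir: Path, allow_holes: bool = False) -> list[int]:
    """Return the sorted batch IDs found on disk, ignoring unrelated files.

    allow_holes=False (sync-style) requires IDs to be exactly 0..n-1 (assumed base).
    allow_holes=True (async-style) tolerates gaps from out-of-order completion.
    """
    if not parquet_dir.is_dir():
        return []
    ids = sorted(
        int(match.group(1))
        for path in parquet_dir.iterdir()
        if (match := BATCH_FILE_PATTERN.match(path.name))
    )
    if not allow_holes and ids != list(range(len(ids))):
        raise ValueError(f"Non-contiguous batch IDs on disk: {ids}")
    return ids
```

In this sketch the metadata file only needs to describe the run configuration, so a metadata write that lags behind the last batch file does not lose progress: the next resume recounts from the directory contents.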
analysis	chore: Improve CLI startup with lazy heavy import cleanup (#330)	2026-02-18 16:24:15 -05:00
column_generators	feat: let column configs declare all model aliases for the startup health check (#626)	2026-05-11 11:33:50 -06:00
dataset_builders	fix: harden resume checkpoint handling (#624)	2026-05-11 11:44:46 -06:00
mcp	refactor: Decouple ModelFacade from LiteLLM via ModelClient adapter (#373)	2026-03-11 14:30:40 -06:00
models	feat: make async engine the default execution path (#592)	2026-05-04 16:22:13 -03:00
processing	feat: add RunConfig jinja rendering engine (#557)	2026-04-17 15:06:27 -04:00
registry	chore: add publish script and update license headers (#253)	2026-01-28 08:47:34 -05:00
resources	fix: normalize rollout timestamps before deriving started_at/ended_at (#556)	2026-05-07 14:13:10 -04:00
sampling_gen	feat: add RunConfig jinja rendering engine (#557)	2026-04-17 15:06:27 -04:00
storage	fix: harden resume checkpoint handling (#624)	2026-05-11 11:44:46 -06:00
testing	fix(engine): validate processor plugin impls (#609)	2026-05-06 14:31:12 -04:00
validators	chore: Improve CLI startup with lazy heavy import cleanup (#330)	2026-02-18 16:24:15 -05:00
__init__.py	feat: Native Anthropic adapter with shared HTTP client infrastructure (#426)	2026-03-19 11:18:40 -06:00
conftest.py	feat: Refactor person data reading for client ddb connection control (#393)	2026-03-19 09:34:57 -05:00
test_compiler.py	refactor: slim package refactor into three subpackages (#240)	2026-01-27 13:53:20 -05:00
test_configurable_task.py	chore: Improve CLI startup with lazy heavy import cleanup (#330)	2026-02-18 16:24:15 -05:00
test_dataset_metadata.py	refactor: slim package refactor into three subpackages (#240)	2026-01-27 13:53:20 -05:00
test_engine_errors.py	chore: add publish script and update license headers (#253)	2026-01-28 08:47:34 -05:00
test_model_provider.py	feat(models): deprecate implicit default provider routing (#594)	2026-05-05 13:39:12 -06:00
test_secret_resolver.py	chore: add publish script and update license headers (#253)	2026-01-28 08:47:34 -05:00
test_validation.py	feat: add skip.when conditional column generation (#502)	2026-04-15 09:31:50 -06:00