DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

History

Nabin Mulepati bbcd7d3995 Some checks are pending CI / Test Config (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Config (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions Details CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / Coverage Check (Python 3.11) (push) Waiting to run Details CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Lint and Format Check (push) Waiting to run Details CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run Details CI / Check License Headers (push) Waiting to run Details CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions Details CI / Test Config (Python 3.13 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run Details CI / Test Config (Python 3.10 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run Details CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run Details CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run Details CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions Details CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions Details CI / Test Config (Python 3.11 on macos-latest) (push) Waiting to run Details CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions Details CI / Test Config (Python 3.12 on macos-latest) (push) Waiting to run Details CI / Test Config (Python 3.10 on ubuntu-latest) (push) Waiting to run Details CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions Details fix: harden resume checkpoint handling (#624 ) * fix: harden resume checkpoint handling Persist config identity in metadata, make checkpoints atomic, and reject unsafe resume states so interrupted runs do not mix incompatible or post-processed data. * fix: close resume edge cases Let IF_POSSIBLE start fresh for resize configs and mark after-generation processing before mutation so interrupted processors cannot be resumed unsafely. * refactor: drop dataset directory lock Single-user CLI/notebook flows don't race on the artifact directory, and the timestamped-directory fallback already handles the "ran it twice" case. The lock added complexity (re-entrancy, stale cleanup, the cached-property trap where IF_POSSIBLE→NEVER moves writes to a timestamped directory while the lock stays pinned to the original) for no real protection. Atomic metadata writes still cover the actual hazard (crash mid-write). Also fix a pre-existing test bug in test_initial_actual_num_records_uses_actual_parquet_rows_for_partial_row_group where the mocked scheduler hit the partial-completion path with unconfigured Mock attributes. * fix: address Greptile review on resume edge cases * Drop the unreachable ResumeMode.IF_POSSIBLE branch in _post_generation_processed_resume_result. By the time this helper runs, build() has normalised IF_POSSIBLE to ALWAYS or NEVER, so the guard now matches reality. Tighten the docstring to document the three outcomes (no-op return / fall through / raise). * Split the post-processed extension/raise into two cases. When num_records < prior_target the user just asked for fewer records than already exist; the previous "would mix pre- and post-processor records" message only describes the extension case. Mirror the wording used by _load_resume_state and add a regression test. * Remove the dead _find_completed_row_group_ids wrapper now that _build_async uses _find_completed_row_groups directly. Rename the related test to match. * refactor: unify sync + async resume around filesystem-derived progress Both engines now derive `num_completed_batches` and `actual_num_records` from `parquet-files/batch_.parquet` via `_recover_progress_from_disk`. `metadata.json` keeps describing the run configuration* (`buffer_size`, `target_num_records`, `original_target_num_records`, config fingerprint), while the filesystem is the source of truth for progress. This closes the sync engine's race window between `move_partial_result_to_final_file_path` and the metadata write that follows it, matching the crash-recovery the async engine already had. The sync engine additionally rejects non-contiguous batch IDs (a hole can only mean external mutation or a directory written by an incompatible engine); the async engine continues to tolerate gaps from out-of-order completion via `allow_holes=True`. Existing sync resume tests now seed parquet files alongside metadata, and two new tests cover the unified behaviour: filesystem progress wins when metadata lags, and sync rejects non-contiguous IDs. * docs: clarify DatasetCreationResults observability scope on resume `load_dataset`, `count_records`, `load_analysis`, `export`, and `push_to_hub` all read from the artifact directory, so they reflect the cumulative dataset (original + resume rows). `task_traces`, model-usage logs, and telemetry events are scoped to the current invocation only because the original run's in-memory state is not persisted. Document this in the class docstring, the architecture note, and the Fern resume guide. * docs: explain DeprecationWarning re-raise in create()/preview() Future readers were puzzled by the ``except DeprecationWarning: raise`` short-circuits before the generic generation-error wrappers. Add a comment in ``create()`` (with a back-reference from ``preview()``) to record that strict warning filters (``pytest.warns``, ``-W error::DeprecationWarning``) turn the engine's ``warnings.warn(..., DeprecationWarning)`` calls — most notably the ``allow_resize=True`` deprecation in ``_resolve_async_compatibility`` — into raised exceptions, and we want them to surface untouched instead of being swallowed by ``DataDesignerGenerationError``. * fix: close after-generation crash window and tighten metadata typing on resume Address review feedback on resume hardening: * Run after-generation processors unconditionally on the on-disk dataset rather than gating on the generation return value. The previous gate silently skipped after-generation when resume saw every row group already on disk, leaving a crash window between the final parquet write and the ``post_generation_state="started"`` marker write: in that window the dataset is complete but after-generation never ran, and the on-disk parquet files are still clean. The "started" short-circuit still rejects the other direction (crashed mid-rewrite, ambiguous state), so resume only re-runs after-generation when it is safe to do so. * Raise ``DatasetGenerationError`` (instead of letting a raw ``TypeError`` leak out of ``num_records < prior_target``) when a post-processed dataset's metadata is missing ``target_num_records``. Mirrors the wording used by ``_load_resume_state``. * Document the new behaviour in ``architecture/dataset-builders.md`` and the Fern resume invariants. Tests: * ``test_build_resume_complete_dataset_runs_after_generation_when_no_marker`` covers the closed crash window via the public ``set_processor_runner`` API. * ``test_build_resume_post_generation_processed_missing_target_raises_clearly`` covers the typed-error gap.		2026-05-11 11:44:46 -06:00
..
__init__.py	feat: add image generation support with multi-modal context (#317 )	2026-02-12 14:00:28 -07:00
test_artifact_storage.py	fix: harden resume checkpoint handling (#624 )	2026-05-11 11:44:46 -06:00
test_media_storage.py	chore: Improve CLI startup with lazy heavy import cleanup (#330 )	2026-02-18 16:24:15 -05:00