mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
6 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6055290136
|
feat: add workflow chaining (#636)
Some checks are pending
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Coverage Check (Python 3.11) (push) Waiting to run
CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Lint and Format Check (push) Waiting to run
CI / Check License Headers (push) Waiting to run
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
Publish Fern devnotes / deploy (push) Waiting to run
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
* feat: add workflow chaining * test: tidy workflow chaining coverage * fix: harden workflow chaining concurrency * docs: update workflow chaining plan * feat: add workflow stage postprocessors * feat: expose workflow stage outputs * fix: align workflow selected output export * fix: address workflow chaining review issues * fix: align workflow parquet export selection Signed-off-by: Andre Manoel <amanoel@nvidia.com> * fix: preserve generated columns in drop validation Signed-off-by: Andre Manoel <amanoel@nvidia.com> * fix: clarify workflow output processors Signed-off-by: Andre Manoel <amanoel@nvidia.com> * docs: add workflow chaining page * docs: align workflow chaining warning * fix: address workflow review nits --------- Signed-off-by: Andre Manoel <amanoel@nvidia.com> |
||
|
|
a9af365e8e
|
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in
|
||
|
|
438aa4986d
|
fix: make DropColumnsProcessorConfig idempotent and support reasoning columns (#334)
* fix: make DropColumnsProcessorConfig idempotent and support reasoning columns - add_processor now uses upsert semantics: re-adding a processor with the same name replaces the old one and reverts its drop=True side-effects, making notebook cells safely re-runnable. - validate_drop_columns_processor now includes side-effect columns (reasoning_content, trace) so reasoning columns can be dropped. Fixes #332 * test: reduce duplication in drop-columns tests - Use parametrize for reasoning column validation cases - Extract _add_sampler helper to avoid repeated SamplerColumnConfig setup - Move validate_drop_columns_processor import to top of file * feat: support glob patterns in DropColumnsProcessorConfig column_names Patterns like "*__reasoning_content" or "col_*" are now expanded against available columns at validation time and at runtime. Validation emits a warning when a glob pattern matches no columns. * fix: preserve drop flag when column is referenced by other processors When removing a DropColumnsProcessor, only revert drop=True on columns that are not also dropped by another processor. * fix: deduplicate resolved column names from overlapping glob patterns Prevents duplicates when column_names contains both a literal and a matching glob (e.g. ["col_a", "col_*"]), which would cause a KeyError at runtime when dropping the same column twice. * fix: restrict glob detection to * only Avoids false positives from column names containing [ or ? characters. * refactor: simplify resolve helpers and flatten test class - Use dict-as-ordered-set pattern instead of seen + list for dedup - Flatten TestAddProcessorIdempotent into top-level test functions * test: use fixture and parametrize for add_processor tests * fix: remove redundant quotes around repr in validation message |
||
|
|
429b558588
|
refactor: callback-based processor design (#294) | ||
|
|
c19f35639f
|
chore: add publish script and update license headers (#253) | ||
|
|
ae0665fa16
|
refactor: slim package refactor into three subpackages (#240)
* remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note to about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protetion rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main |
Renamed from tests/engine/test_validation.py (Browse further)