DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Andre Manoel	982ce79ca9	feat: add processor plugin support (#299) * feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/	2026-02-25 16:40:01 -03:00
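The "Use RLock in PluginRegistry" item above can be illustrated with a minimal sketch. It is not the project's implementation: `DemoPluginRegistry`, `register`, `discover`, and the loader mapping are hypothetical names chosen for illustration. The point is only that discovery holds the lock while calling another method that takes the same lock, which a plain `threading.Lock` would deadlock on.

```python
import threading


class DemoPluginRegistry:
    """Hypothetical registry; shows why a re-entrant lock helps during discovery."""

    def __init__(self):
        # RLock lets the thread that already holds the lock acquire it again.
        self._lock = threading.RLock()
        self._plugins = {}

    def register(self, name, plugin):
        with self._lock:
            self._plugins[name] = plugin

    def discover(self, loaders):
        # Discovery holds the lock for the whole scan and calls register(),
        # which acquires the same lock again on the same thread.
        with self._lock:
            for name, load in loaders.items():
                self.register(name, load())

    def names(self):
        with self._lock:
            return sorted(self._plugins)


registry = DemoPluginRegistry()
registry.discover({"regex_filter": lambda: "plugin-object"})
print(registry.names())
```

With `threading.Lock` in place of `threading.RLock`, the nested `register()` call inside `discover()` would block forever.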
Andre Manoel	438aa4986d	fix: make DropColumnsProcessorConfig idempotent and support reasoning columns (#334) * fix: make DropColumnsProcessorConfig idempotent and support reasoning columns - add_processor now uses upsert semantics: re-adding a processor with the same name replaces the old one and reverts its drop=True side-effects, making notebook cells safely re-runnable. - validate_drop_columns_processor now includes side-effect columns (reasoning_content, trace) so reasoning columns can be dropped. Fixes #332 * test: reduce duplication in drop-columns tests - Use parametrize for reasoning column validation cases - Extract _add_sampler helper to avoid repeated SamplerColumnConfig setup - Move validate_drop_columns_processor import to top of file * feat: support glob patterns in DropColumnsProcessorConfig column_names Patterns like "*__reasoning_content" or "col_*" are now expanded against available columns at validation time and at runtime. Validation emits a warning when a glob pattern matches no columns. * fix: preserve drop flag when column is referenced by other processors When removing a DropColumnsProcessor, only revert drop=True on columns that are not also dropped by another processor. * fix: deduplicate resolved column names from overlapping glob patterns Prevents duplicates when column_names contains both a literal and a matching glob (e.g. ["col_a", "col_*"]), which would cause a KeyError at runtime when dropping the same column twice. * fix: restrict glob detection to * only Avoids false positives from column names containing [ or ? characters. * refactor: simplify resolve helpers and flatten test class - Use dict-as-ordered-set pattern instead of seen + list for dedup - Flatten TestAddProcessorIdempotent into top-level test functions * test: use fixture and parametrize for add_processor tests * fix: remove redundant quotes around repr in validation message	2026-02-19 17:21:42 -03:00
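The glob expansion and deduplication described in this commit can be sketched as follows. This is an illustrative reconstruction, not the repository's code: `resolve_column_names` and `_glob_to_regex` are hypothetical helpers. It assumes that only `*` acts as a wildcard (so `[` and `?` stay literal) and uses a dict as an ordered set to drop duplicates.

```python
import re


def _glob_to_regex(pattern):
    # Only "*" is a wildcard; every other character (including "[" and "?")
    # is matched literally via re.escape.
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))


def resolve_column_names(column_names, available_columns):
    """Expand "*" patterns against available columns, keeping first-seen order."""
    resolved = {}  # dict used as an ordered set
    for name in column_names:
        if "*" in name:
            regex = _glob_to_regex(name)
            for column in available_columns:
                if regex.fullmatch(column):
                    resolved[column] = None
        else:
            resolved[name] = None
    return list(resolved)


columns = ["col_a", "col_b", "answer__reasoning_content"]
print(resolve_column_names(["col_a", "col_*"], columns))
```

Because the literal `col_a` and the pattern `col_*` both resolve to `col_a`, the ordered-set step is what keeps the same column from being dropped twice.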
Johnny Greco	1439bbea7e	chore: Improve CLI startup with lazy heavy import cleanup (#330) * perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations. * Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. * increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.	2026-02-18 16:24:15 -05:00
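The lazy `__getattr__` with caching that this commit describes follows the module-level `__getattr__` mechanism from PEP 562. The sketch below is a hedged illustration rather than the project's `lazy_heavy_imports` module: `make_lazy_namespace` and the alias mapping are hypothetical, and stdlib modules stand in for pandas/numpy so the example runs anywhere.

```python
import importlib
import types


def make_lazy_namespace(name, lazy_modules):
    """Build a module whose attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr in lazy_modules:
            value = importlib.import_module(lazy_modules[attr])
            # Cache in the module's namespace so later accesses are plain
            # dictionary lookups and never reach __getattr__ again.
            module.__dict__[attr] = value
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    module.__getattr__ = __getattr__
    return module


# Aliases are illustrative; "js" and "dec" stand in for names like pd and np.
lazy = make_lazy_namespace("lazy_demo", {"js": "json", "dec": "decimal"})
print("js" in vars(lazy))   # not yet imported into the namespace
print(lazy.js.dumps([1, 2]))
print("js" in vars(lazy))   # cached after first access
```

In a real module file the same effect comes from defining `__getattr__` at module level and writing the resolved object into `globals()`.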
Nabin Mulepati	1f172eaa28	chore: move ArtifactStorage to engine/storage/ module (#321)	2026-02-12 14:58:12 -07:00
Andre Manoel	429b558588	refactor: callback-based processor design (#294)	2026-02-11 21:32:24 -03:00
Andre Manoel	406928d83a	fix: escape special characters in SchemaTransformProcessor JSON templates (#250) Fixes GitHub issue #227 where SchemaTransformProcessor fails with JSONDecodeError when LLM-generated content contains quotes, backslashes, newlines, or other special characters that break JSON parsing. The fix properly escapes all string values before template rendering using json.dumps to handle all JSON-special characters.	2026-01-28 20:54:02 -03:00
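The escaping approach named in this fix can be sketched with the standard library alone. This is an assumption-laden illustration, not SchemaTransformProcessor's code: `render_json_template` is a hypothetical helper, and `string.Template` stands in for whatever template engine the project actually uses.

```python
import json
from string import Template


def render_json_template(template, record):
    """Substitute record values into a JSON template and parse the result."""
    escaped = {}
    for key, value in record.items():
        if isinstance(value, str):
            # json.dumps escapes quotes, backslashes, and newlines; the outer
            # quotes are stripped because the template supplies its own.
            escaped[key] = json.dumps(value)[1:-1]
        else:
            # Non-string values are emitted as JSON literals.
            escaped[key] = json.dumps(value)
    return json.loads(Template(template).substitute(escaped))


record = {"text": 'She said "hi"\\ then left\nNext line', "n": 3}
result = render_json_template('{"question": "$text", "count": $n}', record)
print(result["question"] == record["text"])
```

Substituting the raw string directly would produce invalid JSON here, which is the `JSONDecodeError` failure mode the commit message describes.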
Johnny Greco	c19f35639f	chore: add publish script and update license headers (#253)	2026-01-28 08:47:34 -05:00
Johnny Greco	ae0665fa16	refactor: slim package refactor into three subpackages (#240) * remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protection rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main	2026-01-27 13:53:20 -05:00

8 commits