* feat: add processor plugin support
Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).
- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation
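The RLock change above can be illustrated with a minimal sketch. It assumes discovery holds the registry lock while calling a registration method that takes the same lock; `SketchRegistry`, `register`, and `discover` are hypothetical names, not the real PluginRegistry API.

```python
import threading


class SketchRegistry:
    """Hypothetical stand-in for a plugin registry (not the real PluginRegistry)."""

    def __init__(self):
        # RLock lets the thread that already holds the lock acquire it again.
        # A plain threading.Lock would block forever in discover() below.
        self._lock = threading.RLock()
        self._plugins = {}

    def register(self, name, plugin):
        with self._lock:
            self._plugins[name] = plugin

    def discover(self, found):
        # Discovery holds the lock and re-enters it through register().
        with self._lock:
            for name, plugin in found.items():
                self.register(name, plugin)
        return sorted(self._plugins)


registry = SketchRegistry()
names = registry.discover({"regex_filter": object(), "semantic_dedup": object()})
```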
* test: add processor plugin registration test
Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.
* test: simplify processor plugin registration test
* move ProcessorConfig to base and convert demo to e2e test
- Move ProcessorConfig from processors.py to config.base to guard
against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/
* move plan to plans/299/
* fix: make DropColumnsProcessorConfig idempotent and support reasoning columns
- add_processor now uses upsert semantics: re-adding a processor with the
same name replaces the old one and reverts its drop=True side-effects,
making notebook cells safely re-runnable.
- validate_drop_columns_processor now includes side-effect columns
(reasoning_content, trace) so reasoning columns can be dropped.
Fixes #332
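A rough sketch of the upsert behavior described above, assuming the builder tracks a per-column drop flag and one column list per processor name. `SketchBuilder` and its fields are hypothetical and only model the semantics, not the actual config builder.

```python
from dataclasses import dataclass, field


@dataclass
class SketchBuilder:
    """Hypothetical model of upsert semantics; not the real config builder."""

    drop_flags: dict = field(default_factory=dict)   # column name -> drop flag
    processors: dict = field(default_factory=dict)   # processor name -> columns

    def add_processor(self, name, column_names):
        # Upsert: if a processor with this name exists, undo its side effects first.
        for col in self.processors.pop(name, []):
            self.drop_flags[col] = False
        self.processors[name] = list(column_names)
        for col in column_names:
            self.drop_flags[col] = True


builder = SketchBuilder(drop_flags={"a": False, "b": False})
builder.add_processor("cleanup", ["a"])
builder.add_processor("cleanup", ["b"])  # re-running the cell replaces the old entry
```

After the second call only `b` is flagged for dropping, which is what makes re-running a notebook cell safe in this model.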
* test: reduce duplication in drop-columns tests
- Use parametrize for reasoning column validation cases
- Extract _add_sampler helper to avoid repeated SamplerColumnConfig setup
- Move validate_drop_columns_processor import to top of file
* feat: support glob patterns in DropColumnsProcessorConfig column_names
Patterns like "*__reasoning_content" or "col_*" are now expanded against
available columns at validation time and at runtime. Validation emits a
warning when a glob pattern matches no columns.
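One way the expansion could look, as a sketch only: it assumes matching via the standard library's `fnmatch.fnmatchcase` and a plain `warnings.warn`; the helper name `expand_column_names` is hypothetical.

```python
import fnmatch
import warnings


def expand_column_names(column_names, available):
    """Expand glob patterns against the available columns (illustrative helper)."""
    resolved = []
    for name in column_names:
        if "*" in name:
            matches = [col for col in available if fnmatch.fnmatchcase(col, name)]
            if not matches:
                # Mirrors the validation-time warning for patterns that match nothing.
                warnings.warn(f"Glob pattern {name!r} matched no columns")
            resolved.extend(matches)
        else:
            resolved.append(name)
    return resolved


available = ["question", "answer__reasoning_content", "col_a", "col_b"]
resolved = expand_column_names(["*__reasoning_content", "col_*"], available)
```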
* fix: preserve drop flag when column is referenced by other processors
When removing a DropColumnsProcessor, only revert drop=True on columns
that are not also dropped by another processor.
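The check above might reduce to something like the following sketch, assuming each remaining processor exposes the list of columns it drops. `columns_to_revert` is a hypothetical helper name.

```python
def columns_to_revert(removed_columns, remaining_processors):
    """Return the columns whose drop flag can be reset (illustrative helper).

    remaining_processors maps processor name -> columns it still drops.
    """
    still_dropped = {
        col for cols in remaining_processors.values() for col in cols
    }
    return [col for col in removed_columns if col not in still_dropped]


# "b" is also dropped by another processor, so only "a" is reverted.
to_revert = columns_to_revert(["a", "b"], {"other": ["b", "c"]})
```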
* fix: deduplicate resolved column names from overlapping glob patterns
Prevents duplicates when column_names contains both a literal and a
matching glob (e.g. ["col_a", "col_*"]), which would cause a KeyError
at runtime when dropping the same column twice.
* fix: restrict glob detection to * only
Avoids false positives from column names containing [ or ? characters.
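A sketch of what restricting detection to `*` can mean in practice. It assumes a small regex-based matcher in which every character other than `*` is literal; `is_glob` and `matches` are hypothetical helpers, not the project's code.

```python
import re


def is_glob(name):
    # Only "*" marks a pattern; "[" and "?" are treated as ordinary characters.
    return "*" in name


def matches(pattern, column):
    # Escape every literal segment and join the segments with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, column) is not None
```

With this approach a column literally named `score[0]` or `is_valid?` is looked up by exact name rather than being interpreted as a pattern.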
* refactor: simplify resolve helpers and flatten test class
- Use dict-as-ordered-set pattern instead of seen + list for dedup
- Flatten TestAddProcessorIdempotent into top-level test functions
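For reference, the dict-as-ordered-set pattern mentioned above relies on dicts preserving insertion order (guaranteed since Python 3.7). A minimal sketch, with a hypothetical helper name:

```python
def dedupe_preserving_order(names):
    """Drop duplicates while keeping the first occurrence of each name."""
    return list(dict.fromkeys(names))


# A literal plus an overlapping glob can resolve to the same column twice.
resolved = ["col_a", "col_a", "col_b"]
unique = dedupe_preserving_order(resolved)
```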
* test: use fixture and parametrize for add_processor tests
* fix: remove redundant quotes around repr in validation message
* perf: defer heavy imports to improve CLI startup time
Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.
Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations
Reduces CLI import time from ~1.67s to ~0.46s.
* perf: defer pandas/numpy in io_helpers and add config_list benchmark
- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers
with module-level __getattr__ (for backwards-compatible external
access / test mocks) and function-level imports in the 3 functions
that actually use them (read_parquet_dataset, smart_load_dataframe,
_convert_to_serializable). Importing io_helpers no longer triggers
pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function
bodies to avoid loading repositories, Rich, and prompt_toolkit at
module import time.
- Add `config_list` (data-designer config list) measurement to the
CLI startup benchmark with isolated cold measurement in a separate
venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.
* Refine lazy import usage and TYPE_CHECKING cleanup
* Run license header updater on PR-touched files
* fix: update sqlfluff mock target for lazy imports in test_sql
* perf: cache globals() in lazy __getattr__ to avoid repeated lookups
Add globals() caching and explanatory comment to all three lazy
__getattr__ implementations (lazy_heavy_imports, config/__init__,
interface/__init__) so subsequent attribute accesses bypass __getattr__.
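The caching idea can be sketched with a module-level `__getattr__` (PEP 562). To keep the example self-contained it builds a throwaway module object and uses stdlib modules as stand-ins for the heavy imports; `lazy_sketch`, `_LAZY`, and `calls` are hypothetical names. Inside a real module file the cache write would be `globals()[name] = module`.

```python
import importlib
import types

lazy = types.ModuleType("lazy_sketch")
_LAZY = {"json": "json", "decimal": "decimal"}  # attribute name -> module path
calls = []  # records how often the fallback hook runs


def _module_getattr(name):
    if name in _LAZY:
        calls.append(name)
        module = importlib.import_module(_LAZY[name])
        # Cache on the module namespace so later accesses are found by normal
        # attribute lookup and never reach __getattr__ again.
        lazy.__dict__[name] = module
        return module
    raise AttributeError(f"module {lazy.__name__!r} has no attribute {name!r}")


lazy.__getattr__ = _module_getattr

first = lazy.json   # triggers the hook and imports the module
second = lazy.json  # served from the module namespace
```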
* perf: lazy CLI command loading and deferred heavy import evaluations
- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files
- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes
- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks
- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use
- Update test mock targets to patch at usage-site for module-level imports
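The lru_cache wrapping of module-level constants might look like this sketch. The schema content and the names `get_validator_schema` and `load_count` are illustrative assumptions; the point is that the work runs on first call instead of at import time.

```python
import functools

load_count = 0  # counts how often the expensive body actually runs


@functools.lru_cache(maxsize=1)
def get_validator_schema():
    """Build the value on first use instead of at module import time."""
    global load_count
    load_count += 1
    import json  # stands in for a heavy import deferred to first use

    return json.loads('{"type": "object"}')


first = get_validator_schema()
second = get_validator_schema()  # cached; the body does not run again
```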
* refactor: use direct pandas import in seed_source_dataframe
Drop lazy-loading for pandas in DataFrameSeedSource; use direct import
for simplicity.
* update lazy import pattern
* update tests to use lazy import namespace
Switch test modules to import data_designer.lazy_heavy_imports as lazy
and reference heavy libraries through that namespace. This keeps heavy
imports deferred during module import and aligns tests with the new
lazy-import usage pattern.
* tighten import perf test thresholds
Document recent baseline timings and lower the allowed average
import time and timeout so regressions are detected sooner.
* document pandas import requirement
Clarify that Pydantic needs DataFrame resolved at module load and
that keeping the direct import preserves IDE typing support.
* increase import perf test timeout
* use lazy pandas imports in visualization tests
- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted
* fix lazy pandas runtime usage and preview mocks
Switch sample-record handling to lazy pandas types so runtime paths no longer
depend on TYPE_CHECKING imports. Align preview controller tests to patch the
module-local DataDesigner symbol, preventing real engine invocation in
save-results scenarios.
Fixes GitHub issue #227 where SchemaTransformProcessor fails with
JSONDecodeError when LLM-generated content contains quotes, backslashes,
newlines, or other special characters that break JSON parsing.
The fix escapes every string value with json.dumps before template
rendering, so all JSON-special characters are handled.
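A minimal sketch of the escaping approach, assuming values are substituted into a JSON-shaped template and parsed afterwards. It uses `string.Template` as a stand-in for the real template engine, and `escape_for_json` is a hypothetical helper name.

```python
import json
from string import Template


def escape_for_json(value):
    # json.dumps escapes quotes, backslashes, newlines, and control characters;
    # the surrounding quotes are stripped because the template supplies them.
    return json.dumps(value)[1:-1]


template = Template('{"summary": "$text"}')
raw = 'She said "hi"\nC:\\temp'

rendered = template.substitute(text=escape_for_json(raw))
parsed = json.loads(rendered)  # would raise JSONDecodeError without escaping
```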