Commit graph

10 commits

Author SHA1 Message Date
Nabin Mulepati
2a487cdc5c
feat: add dropped column preservation toggle (#691)
* feat: add dropped column preservation toggle

Closes #690

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>

* fix: reject dropped column policy resume mismatch

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>

---------

Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
2026-05-21 13:19:20 -06:00
Andre Manoel
61fa0150f7
fix: support nested field access in schema transform templates (#435)
* feat: support nested field access in schema transform templates

Enable {{ result.quality.score }} style dot notation in schema
transform Jinja2 templates, where result is a deserialized JSON column.

Previously, _json_escape_record flattened all dict values to escaped
JSON strings before Jinja2 saw them. This made the rendered output
valid JSON but prevented nested access since Jinja2 only saw strings.

The fix introduces TemplateValue, a wrapper that defers the choice
between "drill into nested dict" and "render as escaped string" to
template evaluation time. Jinja2 resolves dot notation via __getattr__
(returning a new TemplateValue for the nested value), and converts to
string via __str__ (delegating to a caller-provided str_fn). This is
necessary because plain dicts render as Python repr ({'key': 'val'})
which is invalid JSON - we need to control __str__ to produce properly
escaped JSON, and that requires a wrapper object.

Other Jinja2 consumers (prompt templates, expression columns) don't
need this - Jinja2 natively supports dot access on plain dicts via
getattr-to-getitem fallback, and plain str() is fine for text output.
Schema transform is unique because its output must be valid JSON.
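
A minimal sketch of the wrapper idea described in this commit message, using only the standard library. The class name `TemplateValueSketch` and the helper `to_json_fragment` are hypothetical stand-ins, not the project's actual code, and a later commit in this same PR replaces the wrapper with a Jinja2 finalize hook.

```python
import json
from typing import Any, Callable


def to_json_fragment(value: Any) -> str:
    """Hypothetical str_fn: render a value so it can sit inside a JSON string."""
    if isinstance(value, str):
        return json.dumps(value)[1:-1]  # escape, then drop the outer quotes
    return json.dumps(value)


class TemplateValueSketch:
    """Defers 'drill into nested dict' vs 'render as string' to access time."""

    def __init__(self, value: Any, str_fn: Callable[[Any], str]) -> None:
        self._value = value
        self._str_fn = str_fn

    def __getattr__(self, name: str) -> "TemplateValueSketch":
        # Only reached when normal lookup fails, i.e. for data keys.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return TemplateValueSketch(self._value[name], self._str_fn)
        except (KeyError, IndexError, TypeError) as exc:
            raise AttributeError(name) from exc

    def __str__(self) -> str:
        return self._str_fn(self._value)


result = TemplateValueSketch(
    {"quality": {"score": 9, "note": 'say "hi"'}}, to_json_fragment
)
score_text = str(result.quality.score)  # dot access, then string conversion
note_text = str(result.quality.note)    # quotes come out backslash-escaped
plain_repr = str({"key": "val"})        # Python repr, not valid JSON
```

The point of the wrapper is that `str()` is under the caller's control, so a nested dict can be rendered as JSON rather than as a Python repr.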

* Address PR review comments

- Fix boolean serialization: add bool check before str in _escape_value_for_json
  to produce JSON 'true'/'false' instead of Python 'True'/'False'
- Add class-level _record_str_fn annotation to WithJinja2UserTemplateRendering
- Rename skip_record_sanitization to _skip_record_sanitization (underscore prefix)
  to signal internal-only usage, and document it in safe_render docstring
- Add defensive error handling in TemplateValue.__getitem__ and __iter__
- Promote test input data to parametrize column, removing brittle string scan

* Address second round of PR review comments

- Add __eq__ and __hash__ to TemplateValue so Jinja2 equality
  conditionals (e.g. {% if result.label == "excellent" %}) work
- Add inline comment explaining deliberate double-encode in
  _escape_value_for_json for dict/list values
- Default _record_str_fn to None at class level so accessing it
  before prepare_jinja2_template_renderer doesn't mask the real error

* refactor: replace TemplateValue with Jinja2 finalize hook

Use Jinja2's built-in finalize hook for value-to-string conversion
and a getattr override for dict-key-priority lookup, eliminating the
custom TemplateValue wrapper class entirely.

* docs: add missing param docs to prepare_jinja2_template_renderer
2026-03-18 15:28:18 -03:00
Andre Manoel
982ce79ca9
feat: add processor plugin support (#299)
* feat: add processor plugin support

Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).

- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation
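
A sketch of why a reentrant lock helps in the discovery path mentioned in the list above. `RegistrySketch` and the plugin name `regex-filter` are hypothetical; the real PluginRegistry may be structured differently.

```python
import threading


class RegistrySketch:
    """Registry whose discovery step calls back into locked methods."""

    def __init__(self) -> None:
        # RLock lets the thread that already holds the lock acquire it again.
        # With a plain threading.Lock, names() -> discover() -> register()
        # would block forever on the second acquire.
        self._lock = threading.RLock()
        self._plugins: dict = {}
        self._discovered = False

    def register(self, name: str, plugin: object) -> None:
        with self._lock:
            self._plugins[name] = plugin

    def discover(self) -> None:
        with self._lock:
            if self._discovered:
                return
            self._discovered = True
            # Stand-in for entry-point loading, which registers plugins.
            self.register("regex-filter", object())

    def names(self) -> list:
        with self._lock:
            self.discover()
            return sorted(self._plugins)


plugin_names = RegistrySketch().names()
```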

* test: add processor plugin registration test

Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.

* test: simplify processor plugin registration test

* move ProcessorConfig to base and convert demo to e2e test

- Move ProcessorConfig from processors.py to config.base to guard
  against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/

* move plan to plans/299/
2026-02-25 16:40:01 -03:00
Andre Manoel
438aa4986d
fix: make DropColumnsProcessorConfig idempotent and support reasoning columns (#334)
* fix: make DropColumnsProcessorConfig idempotent and support reasoning columns

- add_processor now uses upsert semantics: re-adding a processor with the
  same name replaces the old one and reverts its drop=True side-effects,
  making notebook cells safely re-runnable.
- validate_drop_columns_processor now includes side-effect columns
  (reasoning_content, trace) so reasoning columns can be dropped.
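
The upsert behaviour in the first bullet above could look roughly like the following. `BuilderSketch` and `ColumnSketch` are hypothetical types introduced for illustration and do not reflect the real config builder API.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSketch:
    name: str
    drop: bool = False


@dataclass
class BuilderSketch:
    columns: dict = field(default_factory=dict)      # name -> ColumnSketch
    processors: dict = field(default_factory=dict)   # name -> dropped columns

    def add_processor(self, name: str, column_names: list) -> None:
        # Upsert: undo the side effects of a previous processor with the
        # same name before applying the new one.
        for col in self.processors.pop(name, []):
            self.columns[col].drop = False
        self.processors[name] = list(column_names)
        for col in column_names:
            self.columns[col].drop = True


builder = BuilderSketch(columns={"a": ColumnSketch("a"), "b": ColumnSketch("b")})
builder.add_processor("cleanup", ["a"])
builder.add_processor("cleanup", ["b"])  # re-running the cell replaces, not appends
```

After the second call only one processor named `cleanup` exists, and column `a` is no longer flagged for dropping.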

Fixes #332

* test: reduce duplication in drop-columns tests

- Use parametrize for reasoning column validation cases
- Extract _add_sampler helper to avoid repeated SamplerColumnConfig setup
- Move validate_drop_columns_processor import to top of file

* feat: support glob patterns in DropColumnsProcessorConfig column_names

Patterns like "*__reasoning_content" or "col_*" are now expanded against
available columns at validation time and at runtime. Validation emits a
warning when a glob pattern matches no columns.

* fix: preserve drop flag when column is referenced by other processors

When removing a DropColumnsProcessor, only revert drop=True on columns
that are not also dropped by another processor.
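
A small sketch of the rule stated above, assuming processors are tracked as a mapping from processor name to the columns each one drops. The function name `columns_to_restore` is hypothetical.

```python
def columns_to_restore(removed: list, remaining: dict) -> list:
    """Columns whose drop flag may be reverted after removing a processor."""
    # Anything still dropped by another processor must keep drop=True.
    still_dropped = {col for cols in remaining.values() for col in cols}
    return [col for col in removed if col not in still_dropped]


restorable = columns_to_restore(
    removed=["col_a", "col_b"],
    remaining={"other": ["col_b", "col_c"]},
)
```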

* fix: deduplicate resolved column names from overlapping glob patterns

Prevents duplicates when column_names contains both a literal and a
matching glob (e.g. ["col_a", "col_*"]), which would cause a KeyError
at runtime when dropping the same column twice.

* fix: restrict glob detection to * only

Avoids false positives from column names containing [ or ? characters.
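
The glob handling described in these commits (expansion against available columns, de-duplication, and treating only `*` as a wildcard) might be sketched as follows. `resolve_column_names` is a hypothetical helper, not the project's function, and it passes literal names through without validating them.

```python
import re


def resolve_column_names(patterns: list, available: list) -> list:
    """Expand '*' globs against available columns, preserving first-seen order."""
    resolved: dict = {}  # dict used as an ordered set
    for pattern in patterns:
        if "*" in pattern:
            # Only '*' is special; '[' and '?' are escaped and match literally.
            regex = re.compile(".*".join(re.escape(p) for p in pattern.split("*")))
            matches = [col for col in available if regex.fullmatch(col)]
        else:
            matches = [pattern]
        resolved.update(dict.fromkeys(matches))
    return list(resolved)


available_columns = ["col_a", "col_b", "answer__reasoning_content", "is_ok?"]
overlapping = resolve_column_names(["col_a", "col_*"], available_columns)
reasoning = resolve_column_names(["*__reasoning_content"], available_columns)
literal = resolve_column_names(["is_ok?"], available_columns)
```

Because results are collected in a dict, a column matched by both a literal and a glob appears once, which avoids dropping the same column twice.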

* refactor: simplify resolve helpers and flatten test class

- Use dict-as-ordered-set pattern instead of seen + list for dedup
- Flatten TestAddProcessorIdempotent into top-level test functions

* test: use fixture and parametrize for add_processor tests

* fix: remove redundant quotes around repr in validation message
2026-02-19 17:21:42 -03:00
Johnny Greco
1439bbea7e
chore: Improve CLI startup with lazy heavy import cleanup (#330)
* perf: defer heavy imports to improve CLI startup time

Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.

Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations

Reduces CLI import time from ~1.67s to ~0.46s.

* perf: defer pandas/numpy in io_helpers and add config_list benchmark

- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers
  with module-level __getattr__ (for backwards-compatible external
  access / test mocks) and function-level imports in the 3 functions
  that actually use them (read_parquet_dataset, smart_load_dataframe,
  _convert_to_serializable). Importing io_helpers no longer triggers
  pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function
  bodies to avoid loading repositories, Rich, and prompt_toolkit at
  module import time.
- Add `config_list` (data-designer config list) measurement to the
  CLI startup benchmark with isolated cold measurement in a separate
  venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.

* Refine lazy import usage and TYPE_CHECKING cleanup

* Run license header updater on PR-touched files

* fix: update sqlfluff mock target for lazy imports in test_sql

* perf: cache globals() in lazy __getattr__ to avoid repeated lookups

Add globals() caching and explanatory comment to all three lazy
__getattr__ implementations (lazy_heavy_imports, config/__init__,
interface/__init__) so subsequent attribute accesses bypass __getattr__.
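
A self-contained sketch of the lazy `__getattr__` pattern with caching (PEP 562). A throwaway module object stands in for a real package module, standard-library modules stand in for pandas and numpy, and all names here are hypothetical.

```python
import importlib
import types

# Hypothetical alias -> module mapping; stdlib modules stand in for heavy ones.
_LAZY_MODULES = {"json": "json", "decimal": "decimal"}

lazy = types.ModuleType("lazy_imports_sketch")


def _module_getattr(name: str):
    """Import on first access, then cache so later lookups skip this hook."""
    if name not in _LAZY_MODULES:
        raise AttributeError(f"module {lazy.__name__!r} has no attribute {name!r}")
    module = importlib.import_module(_LAZY_MODULES[name])
    vars(lazy)[name] = module  # plays the role of globals()[name] = module
    return module


# PEP 562: a module-level __getattr__ is consulted only when normal
# attribute lookup on the module fails.
lazy.__getattr__ = _module_getattr

cached_before = "json" in vars(lazy)
payload = lazy.json.dumps({"ok": True})  # first access triggers the import
cached_after = "json" in vars(lazy)
```

Once the attribute is stored in the module namespace, normal lookup succeeds and the hook is no longer called for that name.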

* perf: lazy CLI command loading and deferred heavy import evaluations

- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files

- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes

- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks

- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use

- Update test mock targets to patch at usage-site for module-level imports
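
Wrapping a module-level constant in `lru_cache`, as one of the items above describes, can be sketched like this. The function `get_area_codes` and its inline data are hypothetical stand-ins for the parquet-backed data in the commit.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=1)
def get_area_codes() -> dict:
    """Load the lookup table on first call instead of at import time."""
    # Stand-in for file I/O or a heavy import that used to run at module load.
    return json.loads('{"303": "CO", "415": "CA"}')


first = get_area_codes()
second = get_area_codes()  # served from the cache, no second load
stats = get_area_codes.cache_info()
```

Callers replace a reference to the old constant with a call to the getter, so importing the module no longer pays for the load.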

* refactor: use direct pandas import in seed_source_dataframe

Drop lazy-loading for pandas in DataFrameSeedSource; use direct import
for simplicity.

* update lazy import pattern

* update tests to use lazy import namespace

Switch test modules to import data_designer.lazy_heavy_imports as lazy
and reference heavy libraries through that namespace. This keeps heavy
imports deferred during module import and aligns tests with the new
lazy-import usage pattern.

* tighten import perf test thresholds

Document recent baseline timings and lower the allowed average
import time and timeout so regressions are detected sooner.

* document pandas import requirement

Clarify that Pydantic needs DataFrame resolved at module load and
that keeping the direct import preserves IDE typing support.

* increase timeout time

* use lazy pandas imports in visualization tests

- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted

* fix lazy pandas runtime usage and preview mocks

Switch sample-record handling to lazy pandas types so runtime paths no longer
depend on TYPE_CHECKING imports. Align preview controller tests to patch the
module-local DataDesigner symbol, preventing real engine invocation in save
results scenarios.
2026-02-18 16:24:15 -05:00
Nabin Mulepati
1f172eaa28
chore: move ArtifactStorage to engine/storage/ module (#321)
2026-02-12 14:58:12 -07:00
Andre Manoel
429b558588
refactor: callback-based processor design (#294)
2026-02-11 21:32:24 -03:00
Andre Manoel
406928d83a
fix: escape special characters in SchemaTransformProcessor JSON templates (#250)
Fixes GitHub issue #227 where SchemaTransformProcessor fails with
JSONDecodeError when LLM-generated content contains quotes, backslashes,
newlines, or other special characters that break JSON parsing.

The fix properly escapes all string values before template rendering
using json.dumps to handle all JSON-special characters.
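
A sketch of the escaping approach described above, using plain string replacement as a stand-in for Jinja2 rendering. The placeholder syntax and the helper name `escape_for_json_string` are assumptions made for illustration.

```python
import json


def escape_for_json_string(value: str) -> str:
    """Escape quotes, backslashes and control characters via json.dumps."""
    return json.dumps(value)[1:-1]  # keep the escapes, drop the outer quotes


template = '{"answer": "<<answer>>"}'  # hypothetical placeholder syntax
raw = 'She said "hi"\nC:\\temp'        # quotes, newline and backslash

try:
    json.loads(template.replace("<<answer>>", raw))
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False  # unescaped content breaks the JSON document

rendered = template.replace("<<answer>>", escape_for_json_string(raw))
parsed = json.loads(rendered)
```

The escaped value round-trips: parsing the rendered template returns the original string unchanged.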
2026-01-28 20:54:02 -03:00
Johnny Greco
c19f35639f
chore: add publish script and update license headers (#253)
2026-01-28 08:47:34 -05:00
Johnny Greco
ae0665fa16
refactor: slim package refactor into three subpackages (#240)
* remove old structure

* major shuffle

* streamline project configs

* update make commands

* updates to make commands

* remove essentials

* initialize logger in interface

* uv lock

* ignore notepad

* update workflows

* fix e2e project config

* generate colab notebooks

* resolve default model settings in interface

* fix build commands

* update perf import make command

* cleaning up some slop

* update recipes

* move conftest files to tests/

* update subpackage readmes

* streamline config_logging

* use exports

* update perf import usage pattern

* update for IDE behavior with ruff

* remove engine's fixtures file

* add note about lazy imports

* update dependencies

* update docs

* doc fixes

* uv lock

* updates to catch up with main

* clean up makefile

* remove package gitignores

* define deps only once

* isolate tests

* add test for protection rule

* create temp dirs for isolated tests

* catch up to main

* update headers

* reapply changes

* better result summaries for isolated tests

* move exports into top-level init

* fix client importlib version syntax

* catch up with main
2026-01-27 13:53:20 -05:00