DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Johnny Greco	03b3d6c726	chore: address Andre's feedback on --save-results and CLI preview (#335)	2026-02-18 20:17:03 -05:00
* fix: suppress stdout when saving report and sample records to file
  Console(record=True) still prints to stdout by default. Use file=io.StringIO() to redirect output so save-path calls only write to disk.
* refactor: --save-results skips terminal display
  When --save-results is used, records and the analysis report are no longer printed to the terminal. Extracted save logic into a dedicated _save_preview_results method and updated option help text accordingly.
* feat: wrap-around navigation in sample records browser
  Prev/next buttons and arrow keys now cycle back to the beginning/end instead of clamping at boundaries.
* test: reuse record_series fixture in visualization tests
* feat: thread --theme through to sample records pager
  The pager shell was hardcoded dark, so --theme light produced light records inside a dark frame. Extract CSS variables into dark/light constants and pass the theme from the controller.
* fix: cap terminal display width at display_width
  The module-level Console() had no width limit, so tables with expand=True stretched to the full terminal width. Cap terminal output at min(terminal_width, display_width) and thread the display_width parameter through the controller's display methods.
* docs: update --display-width and --theme help text
  Remove "Only applies when --save-results is used" from --display-width since it now also affects terminal output.
* fix: update generation controller tests to match display_width and save_results behavior
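The wrap-around navigation mentioned in this commit can be sketched with modular arithmetic. This is only an illustration under assumptions: `next_index` and `prev_index` are hypothetical names, and the project's pager may implement the same idea differently.

```python
def next_index(current: int, total: int) -> int:
    """Advance one record, cycling from the last record back to the first."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (current + 1) % total


def prev_index(current: int, total: int) -> int:
    """Step back one record, cycling from the first record to the last."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Python's % returns a non-negative result for a positive modulus,
    # so (0 - 1) % total lands on the last index instead of -1.
    return (current - 1) % total
```

Compared with clamping (`min(current + 1, total - 1)`), the modulo form needs no boundary branches, which is why it is a common choice for cyclic browsing.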
Johnny Greco	1439bbea7e	chore: Improve CLI startup with lazy heavy import cleanup (#330)	2026-02-18 16:24:15 -05:00
* perf: defer heavy imports to improve CLI startup time
  Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes:
  - Defer controller imports to inside command functions
  - Remove eager re-export chains from CLI package __init__ files
  - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
  - Add lazy __getattr__ exports in interface/__init__.py
  - Replace module-level tokenizer init with cached lazy getter
  - Fix ModelProvider import to use config layer instead of engine
  - Update test mock paths to match new import locations
  Reduces CLI import-time from ~1.67s to ~0.46s.
* perf: defer pandas/numpy in io_helpers and add config_list benchmark
  - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading.
  - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time.
  - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag.
  - Update test mock paths to match new import locations.
* Refine lazy import usage and TYPE_CHECKING cleanup
* Run license header updater on PR-touched files
* fix: update sqlfluff mock target for lazy imports in test_sql
* perf: cache globals() in lazy __getattr__ to avoid repeated lookups
  Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__.
* perf: lazy CLI command loading and deferred heavy import evaluations
  - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files
  - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes
  - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks
  - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use
  - Update test mock targets to patch at usage-site for module-level imports
* refactor: use direct pandas import in seed_source_dataframe
  Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity.
* update lazy import pattern
* update tests to use lazy import namespace
  Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern.
* tighten import perf test thresholds
  Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner.
* document pandas import requirement
  Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support.
* increase timeout time
* use lazy pandas imports in visualization tests
  - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
  - add TYPE_CHECKING pandas import and keep CLI controller imports sorted
* fix lazy pandas runtime usage and preview mocks
  Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.
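The lazy `__getattr__` pattern with caching that this commit describes can be sketched roughly as follows. This is not the project's code: `make_lazy_namespace` is a hypothetical helper, the namespace is built on `types.ModuleType` only so the example is self-contained, and the standard-library `json` and `decimal` modules stand in for heavy dependencies such as pandas or numpy.

```python
import importlib
import types


def make_lazy_namespace(name: str, lazy_modules: dict) -> types.ModuleType:
    """Build a module whose attributes are imported on first access.

    lazy_modules maps an exported attribute name to the module to import.
    """
    namespace = types.ModuleType(name)
    load_counts = {}  # how many times each attribute went through __getattr__

    def __getattr__(attr: str):
        # Only called when normal attribute lookup on the module fails (PEP 562).
        if attr not in lazy_modules:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        load_counts[attr] = load_counts.get(attr, 0) + 1
        module = importlib.import_module(lazy_modules[attr])
        # Cache in the module dict so later accesses bypass __getattr__ entirely.
        namespace.__dict__[attr] = module
        return module

    namespace.__getattr__ = __getattr__
    namespace.load_counts = load_counts
    return namespace


lazy = make_lazy_namespace("lazy_demo", {"json": "json", "dec": "decimal"})
encoded = lazy.json.dumps({"a": 1})  # first access triggers the import
encoded_again = lazy.json.dumps({"a": 1})  # served from the cached attribute
```

In a real package the same function would usually be defined at module level as `__getattr__` and would write to `globals()`, which is the caching step the commit message refers to.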
Johnny Greco	f2a1657870	feat: add --save-results option to preview command (#333)	2026-02-18 15:58:35 -05:00
* feat: add --save-report option to preview command
* feat: add save_path option to display_sample_record
  Allow saving rendered sample records as HTML or SVG files via an optional save_path parameter on both the standalone function and the WithRecordSamplerMixin method.
* feat: replace --save-report with --save-results on preview command
  Replace the single-file --save-report option with --save-results, which saves all preview artifacts (dataset parquet, analysis report HTML, and per-record sample HTMLs) into a timestamped directory under the artifact path. Add error handling around the save block, improve timestamp precision to microseconds, and expand test coverage for the new behavior.
* feat: add sample records pager with theme toggle, postMessage bridge, and UI polish
* feat: add dataset metadata subtitle to pager and clean up toolbar layout
* fix: address review findings for preview save-results feature
  - Split try/except in generation_controller so report display errors don't produce misleading "failed to save" messages when not saving
  - Add browser HTML path to save success output for discoverability
  - Remove 5 unused CSS variables from pager theme constants
  - Add "N of M" record counter to pager toolbar
  - Add theme/display_width assertions to all preview_command tests
  - Add dedicated test for custom theme and display_width passthrough
  - Add tests for record counter and CSS variable cleanup
* fix: address code review findings and simplify pager
  - Fix critical bug: analysis report now displays to console even when --save-results is active (was silently dropped via pass statement)
  - Fix latent UnboundLocalError in display_sample_record when index is out of bounds (num_records computed before try block)
  - Eliminate duplicated dark CSS between constant and theme listener script
  - Simplify sample_records_pager: remove dual-theme system, postMessage bridge, and responsive media queries; restore GitHub link; reorder toolbar to put prev/next buttons on the far left
  - Narrow except Exception to except OSError in save-results path
  - Use case-insensitive extension check and lambda-based re.sub
  - Collapse redundant preview command delegation tests into parametrize
  - Add missing type annotations and remove tautological assertions
* style: move record counter to far right of pager toolbar
* refactor: remove dead theme-listener script and inline CSS constant
  _THEME_LISTENER_SCRIPT and _SAMPLE_RECORD_DARK_CSS_INLINE became orphaned after the pager simplification removed the postMessage bridge. This removes both constants, drops the injection line, switches the idempotency guard to the viewport meta tag, and cleans up related test assertions.
* fix: move Path import out of TYPE_CHECKING block in test_visualization
* fix: rename _logger to logger to match codebase convention
* fix: remove unnecessary cast in preview command theme parameter
* refactor: extract DEFAULT_DISPLAY_WIDTH constant and make apply_html_post_processing public
* Update packages/data-designer-config/tests/config/utils/test_visualization.py
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
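Two small details from this commit, the microsecond-precision timestamped directory and the case-insensitive extension check, might look something like the sketch below. The function names and the `preview_results_` directory prefix are assumptions made for illustration; the commit message does not state the actual naming scheme.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def results_dir_name(now: Optional[datetime] = None) -> str:
    """Return a directory name with microsecond precision.

    %f expands to six zero-padded digits, so two saves within the same
    second still get distinct names. The prefix is hypothetical.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S_%f")
    return f"preview_results_{stamp}"


def has_extension(path: Path, extension: str) -> bool:
    """Compare file extensions case-insensitively (e.g. '.HTML' matches '.html')."""
    return path.suffix.lower() == extension.lower()


example_name = results_dir_name(datetime(2026, 2, 18, 15, 58, 35, 42))
```

Passing `now` explicitly keeps the naming logic deterministic in tests, while callers can omit it to use the current time.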
Johnny Greco	d3c4de76da	feat: add preview, create, and validate CLI commands (#313)	2026-02-11 14:06:06 -05:00
* feat: add preview, create, and validate CLI commands
  Add three new top-level CLI commands for the data-designer workflow:
  - `data-designer preview` - generate preview datasets for fast iteration
  - `data-designer create` - create full datasets and save to disk
  - `data-designer validate` - validate configuration files
  Also includes:
  - Move wait_for_navigation_key() UI primitive from preview.py to ui.py
  - Add KeyPressEvent type annotations to all key binding handlers in ui.py
  - Refactor cli/utils.py into cli/utils/ package with config_loader module
  - Comprehensive test coverage for all new commands
* fix: update pythonjsonlogger import and clean up dev dependencies
  - Update pythonjsonlogger import to use newer JsonFormatter API
  - Consolidate dev-dependencies into [dependency-groups] dev section
  - Remove unnecessary test cli/utils __init__.py
* small E
* address greptile feedback
* organize CLI commands into rich help panels
  Group top-level commands under "Generation" and "Setup" panels for clearer help output.
* refactor config loader to parse files directly and auto-detect config format
  - Parse YAML/JSON files into dicts before passing to from_config, providing format-specific error messages for parse failures
  - Auto-detect DataDesignerConfig format (columns at top level) and wrap it into BuilderConfig so users can provide either format
  - Clean up Python module loading with try/except/finally for reliable sys.modules and sys.path cleanup
  - Add comprehensive tests for parsing, validation, and auto-wrapping
* fix sys.path cleanup in config loader and simplify tests
  - Use pop(0) instead of remove() to precisely undo the insert(0, ...) and avoid accidentally removing a different matching path entry
  - Replace MagicMock with real DataDesignerConfigBuilder in tests
* move config format auto-detection into from_config
  Centralize the shorthand DataDesignerConfig detection (columns at top level without a data_designer wrapper) in DataDesignerConfigBuilder.from_config so all callers benefit, not just the CLI config loader. Simplify config_loader to delegate file parsing and format normalization entirely to from_config.
* extract GenerationController from CLI commands
  Move shared generation logic (preview, validate, create) out of the individual Typer command functions into a dedicated GenerationController, matching the existing controller pattern (DownloadController, etc.). The command functions now delegate to the controller, keeping them as thin entry points. Tests updated accordingly: command tests verify delegation while controller tests cover the full behavior.
* harden sys.path cleanup and add explanatory comments
  Use sys.path.remove() instead of checking sys.path[0] so cleanup succeeds even when exec_module inserts entries at index 0. Drop unnecessary spec=DataDesignerConfigBuilder from test mocks.
* check stdout TTY in preview interactive mode detection
  Previously only stdin was checked, so piping stdout (e.g. `dd preview cfg.yaml | head`) would still attempt interactive browsing. Now both stdin and stdout must be a TTY.
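The interactive-mode check described in the last item of this commit, requiring both stdin and stdout to be a TTY, could be sketched as below. `can_browse_interactively` and `FakeTTY` are hypothetical names introduced for this example; the project's own detection code is not shown in the log.

```python
import io
import sys
from typing import Optional, TextIO


def can_browse_interactively(
    stdin: Optional[TextIO] = None,
    stdout: Optional[TextIO] = None,
) -> bool:
    """Return True only when both streams are attached to a terminal.

    Checking stdin alone is not enough: with `cmd | head`, stdin is still
    a TTY but stdout is a pipe, so interactive browsing would misbehave.
    """
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    return stdin.isatty() and stdout.isatty()


class FakeTTY(io.StringIO):
    """Test double that reports itself as a terminal."""

    def isatty(self) -> bool:
        return True


piped_stdout = can_browse_interactively(FakeTTY(), io.StringIO())
both_ttys = can_browse_interactively(FakeTTY(), FakeTTY())
```

Accepting the streams as parameters, with the real `sys.stdin` and `sys.stdout` as defaults, makes the check easy to exercise in tests without a real terminal.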

4 commits