DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Johnny Greco	164db0aeb4	refactor: simplify agent CLI to context, types, and state (#418) (#420) * refactor: simplify agent CLI to context, types, and state subcommands - Remove schema and builder subcommands and all supporting code - Add description column (docstring first paragraph) to types table - Add config_file per family (relative to data_designer package) - Add config_package_path and library_version to context output - Clean section hierarchy: ## for sections, ### for family sub-tables - Add docstrings to ScalarInequalityConstraint and ColumnInequalityConstraint * cleanup: remove dead code and fix redundant type discovery - Remove unused get_import_path (only used by deleted schema/builder) - Remove unused class_name from catalog dicts - Fix N+1: get_family_source_file uses get_args directly instead of rediscovering all types via discover_family_types * docs: update DropColumnsProcessorConfig docstring to prefer drop=True * fix: address Greptile review feedback - Add parameters:/params: to _SECTION_HEADERS for docstring parsing - Fix config_package_path to return parent of data_designer package so Path(base) / relative_file resolves correctly - Use last occurrence of data_designer in _get_source_file to handle nested paths (e.g.
dev checkouts) - Return list of deduplicated files per family (get_family_source_files) instead of assuming all types live in one file - Add config_builder_file to context output * fix: resolve config_builder_file dynamically and fix fragile test - Use _get_source_file(DataDesignerConfigBuilder) instead of hardcoded string for config_builder_file, consistent with family file resolution - Fix test assertion that assumed "config" in path (only true in dev) * fix: return empty string for unresolvable source paths - _get_source_file returns "" instead of absolute path when data_designer is not in the path, consistent with error branch - Add Config Module section to context output pointing agent to the config module as the only part of the codebase to work with - Rename config_package_path to config_module_path (returns config dir) * refactor: remove ConfigBase.schema_text() and supporting helpers Schema rendering is no longer needed in the config layer — the agent CLI now provides file paths so agents can read source files directly. * Improve agent context output and processor discoverability - Redeclare `name: str` in DropColumnsProcessorConfig and SchemaTransformProcessorConfig so agents see the required field without reading the base class - Add base config file path to agent context output - Optimize agent context formatting: strip redundant path prefixes, remove family count summary, separate usable/unusable model aliases, rename sections for clarity * fix: restore emoji literal in get_column_emoji * fix: revert unnecessary name redeclarations and use posix paths - Remove bare name: str redeclarations in processor configs that silently dropped the parent Field(description=...) 
- Use Path.as_posix() in _get_source_file for consistent forward slashes * docs: standardize config docstrings with (required) markers and Inherited Attributes - Add (required) to all required parameters in Attributes sections - Add Inherited Attributes section to all config subclasses listing fields from parent classes (SingleColumnConfig, ProcessorConfig, Constraint) - Fix stale with_trace descriptions in LLM subclass inherited sections - Remove discriminator fields from Attributes sections - Remove redundant name: str redeclaration from ExpressionColumnConfig * fix: address Greptile feedback on model aliases and test paths - Show per-alias reason for unusable models instead of blanket "missing API keys" label - Surface model_config_present: tell agent when no config file exists - Fix test fixtures to use realistic data_designer/config/ paths that exercise _strip_config_prefix * test: add coverage for model_config_present=false branch * docs: put required attributes first in Inherited Attributes docstrings Move `name (required)` to the top of the Inherited Attributes section in LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig so required fields appear before optional ones. * fix: improve agent CLI output for clarity and agent comprehension - Use {config_root}/file.py path syntax across all agent output - Add config_root preamble to standalone `agent types` output - Replace type_name (discriminator) with type (class name) in tables - Show only usable model aliases; warn agent to surface config issues - Add directive scoping agents to the config module only - Reword import hint and config module description for directness * fix: fall back to absolute path for plugin source files _get_source_file() returned "" for types outside the data_designer package (e.g., plugin configs). Now returns the absolute path so the agent still gets a readable file reference. 
* fix: remove unreachable model_config_present branch from formatter main() calls ensure_cli_default_model_settings() before any agent command, so model config is always seeded. The model_config_present=False branch was dead code. * test: add coverage for no-usable-model-aliases warning Covers the remaining branch in _format_model_aliases_context where all aliases are unusable and the agent gets a warning to surface to the user. * fix: add inherited attributes to section headers and use posix paths Address two Greptile review comments: - Add "inherited attributes:" to _SECTION_HEADERS so docstring parsing stops before that section even without a preceding blank line. - Use .as_posix() in get_config_module_path() for consistent forward-slash paths across platforms.	2026-03-17 09:30:06 -07:00
Johnny Greco	4c19dba74b	feat: agent CLI introspection (simplified) (#415) * feat: add agent introspection cli * refactor: remove agent cli schema version * refactor: omit missing builder docstrings from context * refactor: tighten agent cli contract * feat: add schema_text() to ConfigBase for human-readable field summaries ConfigBase.schema_text() returns a concise text representation including the class docstring summary, field names, types, defaults, and descriptions. Field descriptions added to column config types to surface through this method. * refactor: flatten agent CLI into plain functions with text output mode Delete AgentController class and agent_command_defs module. Move all logic into agent_introspection (data) and agent_text_formatter (display) as plain functions. Add --json flag so commands default to human-readable text using schema_text(), with JSON as opt-in. Unify _emit helper, remove include_docstrings parameter, deduplicate catalog calls, and fix N+1 discover_family_types in get_family_schemas. * fix: port stale controller tests and consolidate command descriptions Port test_agent_controller.py to use plain functions instead of deleted AgentController. Extract AGENT_COMMANDS constant as single source for operation descriptions, syncing with main.py help strings. * style: fix ruff formatting in agent_introspection * refactor: centralize agent command definitions Extract AGENT_COMMANDS into agent_command_defs.py so main.py and agent_introspection.py share a single source for command names, help text, and metadata. The new module has no heavy dependencies, keeping --help latency unaffected. * fix: handle default_factory and empty providers in schema_text and introspection - schema_text() now detects default_factory fields and renders e.g.
"list()" instead of leaking PydanticUndefined - Guard against IndexError when provider registry has an empty providers list - Add 15 edge-case tests for schema_text covering default_factory, enum defaults, None defaults, scalar defaults, descriptions, and docstrings * refactor: remove JSON output mode from agent CLI commands Text-only output simplifies the interface. Structured output can be added back trivially since the functions already return dicts. * docs: update schema_text docstring to reflect agent focus * fix: include builder section and import_path in agent text output - format_context_text now renders a ## Builder section - format_types_text now includes import_path column in tables * refactor: drop import_path from types tables All config objects are imported via dd.<ClassName>, so the full import path is redundant noise in agent output. * docs: add family definition and import hint to context output * refactor: rename Types section to Families, drop redundant "types" from sub-headers * fix: coerce None to empty string in table cells row.get(col, '') returns None when the key exists with value None, causing str(None) to render "None" in the output. Use `or ''` instead. * refactor: move agent controller tests to utils as introspection integration tests There is no controller layer — these tests exercise functions in agent_introspection.py, so they belong in tests/cli/utils/. * fix: only coerce None to empty string in table cells, not False The previous `or ''` pattern treated all falsy values (including False) as empty. Use an explicit None check so booleans render correctly. 
* style: address review nits from nabin - Add explicit parentheses to and/or precedence in _build_agent_lazy_group - Rename loop variable l to line in test_schema_text - Move get_family_schema import to module level in test_agent_text_formatter * fix: improve schema_text Literal display, builder signature quotes, and docstring parsing - _format_annotation now renders Literal['value'] instead of bare Literal - _format_signature strips quotes from stringified annotations caused by `from __future__ import annotations` - _get_docstring_summary stops at any Google-style section header, not just Attributes:	2026-03-13 18:26:00 -04:00
Johnny Greco	b94b88b7a4	feat(cli): bootstrap default configs on CLI startup (#401) * feat(cli): bootstrap default configs on command run * fix(cli): use active interpreter in bootstrap warning * refactor(cli): simplify bootstrap warning flow * refactor(cli): bootstrap defaults in main entrypoint * refactor(cli): keep bootstrap ownership in main * test(cli): cover lazy dispatch and runtime failure flag * refactor(cli): remove redundant bootstrap state * test(cli): assert bootstrap warning includes error * test: address cli bootstrap review feedback	2026-03-12 15:42:41 -04:00
Johnny Greco	03b3d6c726	chore: address Andre's feedback on --save-results and CLI preview (#335) * fix: suppress stdout when saving report and sample records to file Console(record=True) still prints to stdout by default. Use file=io.StringIO() to redirect output so save-path calls only write to disk. * refactor: --save-results skips terminal display When --save-results is used, records and the analysis report are no longer printed to the terminal. Extracted save logic into a dedicated _save_preview_results method and updated option help text accordingly. * feat: wrap-around navigation in sample records browser Prev/next buttons and arrow keys now cycle back to the beginning/end instead of clamping at boundaries. * test: reuse record_series fixture in visualization tests * feat: thread --theme through to sample records pager The pager shell was hardcoded dark, so --theme light produced light records inside a dark frame. Extract CSS variables into dark/light constants and pass the theme from the controller. * fix: cap terminal display width at display_width The module-level Console() had no width limit, so tables with expand=True stretched to the full terminal width. Cap terminal output at min(terminal_width, display_width) and thread the display_width parameter through the controller's display methods. * docs: update --display-width and --theme help text Remove "Only applies when --save-results is used" from --display-width since it now also affects terminal output. * fix: update generation controller tests to match display_width and save_results behavior	2026-02-18 20:17:03 -05:00
Johnny Greco	1439bbea7e	chore: Improve CLI startup with lazy heavy import cleanup (#330) * perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations.
* Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. 
* increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.	2026-02-18 16:24:15 -05:00
Johnny Greco	f2a1657870	feat: add --save-results option to preview command (#333) * feat: add --save-report option to preview command * feat: add save_path option to display_sample_record Allow saving rendered sample records as HTML or SVG files via an optional save_path parameter on both the standalone function and the WithRecordSamplerMixin method. * feat: replace --save-report with --save-results on preview command Replace the single-file --save-report option with --save-results, which saves all preview artifacts (dataset parquet, analysis report HTML, and per-record sample HTMLs) into a timestamped directory under the artifact path. Add error handling around the save block, improve timestamp precision to microseconds, and expand test coverage for the new behavior. * feat: add sample records pager with theme toggle, postMessage bridge, and UI polish * feat: add dataset metadata subtitle to pager and clean up toolbar layout * fix: address review findings for preview save-results feature - Split try/except in generation_controller so report display errors don't produce misleading "failed to save" messages when not saving - Add browser HTML path to save success output for discoverability - Remove 5 unused CSS variables from pager theme constants - Add "N of M" record counter to pager toolbar - Add theme/display_width assertions to all preview_command tests - Add dedicated test for custom theme and display_width passthrough - Add tests for record counter and CSS variable cleanup * fix: address code review findings and simplify pager - Fix critical bug: analysis report now displays to console even when --save-results is active (was silently dropped via pass statement) - Fix latent UnboundLocalError in display_sample_record when index is out of bounds (num_records computed before try block) - Eliminate duplicated dark CSS between constant and theme listener script - Simplify sample_records_pager: remove dual-theme system, postMessage bridge, and responsive media
queries; restore GitHub link; reorder toolbar to put prev/next buttons on the far left - Narrow except Exception to except OSError in save-results path - Use case-insensitive extension check and lambda-based re.sub - Collapse redundant preview command delegation tests into parametrize - Add missing type annotations and remove tautological assertions * style: move record counter to far right of pager toolbar * refactor: remove dead theme-listener script and inline CSS constant _THEME_LISTENER_SCRIPT and _SAMPLE_RECORD_DARK_CSS_INLINE became orphaned after the pager simplification removed the postMessage bridge. This removes both constants, drops the injection line, switches the idempotency guard to the viewport meta tag, and cleans up related test assertions. * fix: move Path import out of TYPE_CHECKING block in test_visualization * fix: rename _logger to logger to match codebase convention * fix: remove unnecessary cast in preview command theme parameter * refactor: extract DEFAULT_DISPLAY_WIDTH constant and make apply_html_post_processing public * Update packages/data-designer-config/tests/config/utils/test_visualization.py --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>	2026-02-18 15:58:35 -05:00
Johnny Greco	1514720596	feat: support loading config files from HTTP(S) URLs (#323) * support loading config files from http urls - allow config builder and CLI loader to load YAML/JSON configs from HTTP(S) URLs - reject unsupported URL extensions and remote Python module URLs - update CLI help text and add tests for URL success/failure paths * harden remote config loading and deduplicate URL validation - Add size limit (10 MB) when fetching configs from URLs - Validate parsed YAML is a dict before returning - Make is_http_url public and reuse it in CLI validate_url - Replace local CONFIG_FILE_EXTENSIONS with shared constant - Add tests for is_http_url, URL-with-no-extension edge cases * use requests for remote config loading - replace urllib URL fetching with requests and status checks - parse remote payloads via smart_load_yaml for consistent validation - expand tests for HTTP errors, size limits, and non-dict payloads * lower remote config size limit to 1 MB * improve config URL HTTP error reporting Add granular 401/403/404 and generic HTTP status errors for remote config fetching to make failures actionable. Clarify that authenticated config URL loading is not currently supported and update tests for status-aware behavior. * rewrite github blob URLs for remote loading Handle GitHub blob links by rewriting them to raw content URLs for config and dataframe HTTP loaders, preserving query params but avoiding query token leaks in logs. This also fixes extension detection for URLs with query strings and adds coverage for rewrite behavior. * remove validate_url wrapper in favor of is_http_url The validate_url function in cli/utils was just a thin wrapper around is_http_url from io_helpers. Remove it and have callers use is_http_url directly for clarity and reduced indirection.
* fix optional type for artifact_path CLI option * fix URL recursion in smart_load_yaml - avoid treating remote payload strings as new URL inputs - add regression test for URL string payloads from remote config * rewrite huggingface blob URLs for remote loading	2026-02-11 15:12:52 -05:00
Johnny Greco	d3c4de76da	feat: add preview, create, and validate CLI commands (#313) * feat: add preview, create, and validate CLI commands Add three new top-level CLI commands for the data-designer workflow: - `data-designer preview` - generate preview datasets for fast iteration - `data-designer create` - create full datasets and save to disk - `data-designer validate` - validate configuration files Also includes: - Move wait_for_navigation_key() UI primitive from preview.py to ui.py - Add KeyPressEvent type annotations to all key binding handlers in ui.py - Refactor cli/utils.py into cli/utils/ package with config_loader module - Comprehensive test coverage for all new commands * fix: update pythonjsonlogger import and clean up dev dependencies - Update pythonjsonlogger import to use newer JsonFormatter API - Consolidate dev-dependencies into [dependency-groups] dev section - Remove unnecessary test cli/utils __init__.py * small E * address greptile feedback * organize CLI commands into rich help panels Group top-level commands under "Generation" and "Setup" panels for clearer help output. * refactor config loader to parse files directly and auto-detect config format - Parse YAML/JSON files into dicts before passing to from_config, providing format-specific error messages for parse failures - Auto-detect DataDesignerConfig format (columns at top level) and wrap it into BuilderConfig so users can provide either format - Clean up Python module loading with try/except/finally for reliable sys.modules and sys.path cleanup - Add comprehensive tests for parsing, validation, and auto-wrapping * fix sys.path cleanup in config loader and simplify tests - Use pop(0) instead of remove() to precisely undo the insert(0, ...)
and avoid accidentally removing a different matching path entry - Replace MagicMock with real DataDesignerConfigBuilder in tests * move config format auto-detection into from_config Centralize the shorthand DataDesignerConfig detection (columns at top level without a data_designer wrapper) in DataDesignerConfigBuilder.from_config so all callers benefit, not just the CLI config loader. Simplify config_loader to delegate file parsing and format normalization entirely to from_config. * extract GenerationController from CLI commands Move shared generation logic (preview, validate, create) out of the individual Typer command functions into a dedicated GenerationController, matching the existing controller pattern (DownloadController, etc.). The command functions now delegate to the controller, keeping them as thin entry points. Tests updated accordingly; command tests verify delegation while controller tests cover the full behavior. * harden sys.path cleanup and add explanatory comments Use sys.path.remove() instead of checking sys.path[0] so cleanup succeeds even when exec_module inserts entries at index 0. Drop unnecessary spec=DataDesignerConfigBuilder from test mocks. * check stdout TTY in preview interactive mode detection Previously only stdin was checked, so piping stdout (e.g. `dd preview cfg.yaml | head`) would still attempt interactive browsing. Now both stdin and stdout must be a TTY.	2026-02-11 14:06:06 -05:00
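
The last fix in d3c4de76da requires both stdin and stdout to be a TTY before interactive browsing is attempted. A minimal sketch of such a check follows; the function name `is_interactive` and its optional stream parameters are hypothetical and are not taken from the repository.

```python
import io
import sys
from typing import Optional, TextIO


def is_interactive(
    stdin: Optional[TextIO] = None,
    stdout: Optional[TextIO] = None,
) -> bool:
    """Return True only when both streams are attached to a terminal.

    Hypothetical helper: checking stdin alone would still report an
    interactive session when stdout is piped (e.g. into `head`).
    """
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    return stdin.isatty() and stdout.isatty()


class FakeTTY(io.StringIO):
    """Test double whose isatty() reports a terminal."""

    def isatty(self) -> bool:
        return True
```

Passing the streams as parameters is only a convenience for testing; with no arguments the check falls back to `sys.stdin` and `sys.stdout`.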

8 commits
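
Commit 4c19dba74b describes two successive fixes to table-cell rendering: `row.get(col, '')` leaked `"None"` when the key existed with value `None`, and the follow-up `or ''` pattern blanked out `False` as well. The difference can be sketched as below. The function names are hypothetical illustrations, not the repository's code.

```python
from typing import Any, Dict


def cell_with_default(row: Dict[str, Any], col: str) -> str:
    # The default only applies when the key is missing, so an explicit
    # None value is rendered as the string "None".
    return str(row.get(col, ""))


def cell_with_or(row: Dict[str, Any], col: str) -> str:
    # `or ""` replaces every falsy value, so False (and 0) also vanish.
    return str(row.get(col) or "")


def cell_with_none_check(row: Dict[str, Any], col: str) -> str:
    # Explicit None check: only None becomes empty; booleans still render.
    value = row.get(col)
    return "" if value is None else str(value)
```

Only the explicit `is None` check keeps a boolean column readable while still hiding missing values.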