Commit graph

8 commits

Author SHA1 Message Date
Johnny Greco
164db0aeb4
refactor: simplify agent CLI to context, types, and state (#418) (#420)
* refactor: simplify agent CLI to context, types, and state subcommands

- Remove schema and builder subcommands and all supporting code
- Add description column (docstring first paragraph) to types table
- Add config_file per family (relative to data_designer package)
- Add config_package_path and library_version to context output
- Clean section hierarchy: ## for sections, ### for family sub-tables
- Add docstrings to ScalarInequalityConstraint and ColumnInequalityConstraint

* cleanup: remove dead code and fix redundant type discovery

- Remove unused get_import_path (only used by deleted schema/builder)
- Remove unused class_name from catalog dicts
- Fix N+1: get_family_source_file uses get_args directly instead of
  rediscovering all types via discover_family_types

* docs: update DropColumnsProcessorConfig docstring to prefer drop=True

* fix: address Greptile review feedback

- Add parameters:/params: to _SECTION_HEADERS for docstring parsing
- Fix config_package_path to return parent of data_designer package so
  Path(base) / relative_file resolves correctly
- Use last occurrence of data_designer in _get_source_file to handle
  nested paths (e.g. dev checkouts)
- Return list of deduplicated files per family (get_family_source_files)
  instead of assuming all types live in one file
- Add config_builder_file to context output
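
The "last occurrence" resolution and per-family deduplication described above might look roughly like the sketch below. The names `relative_source_file` and `dedupe_files` and the sample paths are hypothetical; the real `_get_source_file` and `get_family_source_files` may be implemented differently.

```python
from pathlib import PurePosixPath


def relative_source_file(abs_path: str, package: str = "data_designer") -> str:
    """Return the path starting at the LAST `package` component.

    Assumption for this sketch: if `package` is not in the path, the input is
    returned unchanged.
    """
    parts = PurePosixPath(abs_path).parts
    # Search from the right so that a checkout directory that is also called
    # `data_designer` does not shadow the real package directory.
    for i in range(len(parts) - 1, -1, -1):
        if parts[i] == package:
            return PurePosixPath(*parts[i:]).as_posix()
    return abs_path


def dedupe_files(paths: list[str]) -> list[str]:
    """Deduplicate file paths while preserving first-seen order."""
    return list(dict.fromkeys(paths))
```

With a dev checkout such as `/home/u/data_designer/src/data_designer/config/x.py`, searching from the right yields `data_designer/config/x.py` rather than a path rooted at the checkout directory.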

* fix: resolve config_builder_file dynamically and fix fragile test

- Use _get_source_file(DataDesignerConfigBuilder) instead of hardcoded
  string for config_builder_file, consistent with family file resolution
- Fix test assertion that assumed "config" in path (only true in dev)

* fix: return empty string for unresolvable source paths

- _get_source_file returns "" instead of absolute path when
  data_designer is not in the path, consistent with error branch
- Add Config Module section to context output pointing agent to
  the config module as the only part of the codebase to work with
- Rename config_package_path to config_module_path (returns config dir)

* refactor: remove ConfigBase.schema_text() and supporting helpers

Schema rendering is no longer needed in the config layer — the agent
CLI now provides file paths so agents can read source files directly.

* Improve agent context output and processor discoverability

- Redeclare `name: str` in DropColumnsProcessorConfig and
  SchemaTransformProcessorConfig so agents see the required field
  without reading the base class
- Add base config file path to agent context output
- Optimize agent context formatting: strip redundant path prefixes,
  remove family count summary, separate usable/unusable model aliases,
  rename sections for clarity

* fix: restore emoji literal in get_column_emoji

* fix: revert unnecessary name redeclarations and use posix paths

- Remove bare name: str redeclarations in processor configs that
  silently dropped the parent Field(description=...)
- Use Path.as_posix() in _get_source_file for consistent forward slashes

* docs: standardize config docstrings with (required) markers and Inherited Attributes

- Add (required) to all required parameters in Attributes sections
- Add Inherited Attributes section to all config subclasses listing
  fields from parent classes (SingleColumnConfig, ProcessorConfig, Constraint)
- Fix stale with_trace descriptions in LLM subclass inherited sections
- Remove discriminator fields from Attributes sections
- Remove redundant name: str redeclaration from ExpressionColumnConfig

* fix: address Greptile feedback on model aliases and test paths

- Show per-alias reason for unusable models instead of blanket
  "missing API keys" label
- Surface model_config_present: tell agent when no config file exists
- Fix test fixtures to use realistic data_designer/config/ paths that
  exercise _strip_config_prefix

* test: add coverage for model_config_present=false branch

* docs: put required attributes first in Inherited Attributes docstrings

Move `name (required)` to the top of the Inherited Attributes section
in LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig
so required fields appear before optional ones.

* fix: improve agent CLI output for clarity and agent comprehension

- Use {config_root}/file.py path syntax across all agent output
- Add config_root preamble to standalone `agent types` output
- Replace type_name (discriminator) with type (class name) in tables
- Show only usable model aliases; warn agent to surface config issues
- Add directive scoping agents to the config module only
- Reword import hint and config module description for directness

* fix: fall back to absolute path for plugin source files

_get_source_file() returned "" for types outside the data_designer
package (e.g., plugin configs). Now returns the absolute path so
the agent still gets a readable file reference.

* fix: remove unreachable model_config_present branch from formatter

main() calls ensure_cli_default_model_settings() before any agent
command, so model config is always seeded. The model_config_present=False
branch was dead code.

* test: add coverage for no-usable-model-aliases warning

Covers the remaining branch in _format_model_aliases_context where
all aliases are unusable and the agent gets a warning to surface to
the user.

* fix: add inherited attributes to section headers and use posix paths

Address two Greptile review comments:
- Add "inherited attributes:" to _SECTION_HEADERS so docstring parsing
  stops before that section even without a preceding blank line.
- Use .as_posix() in get_config_module_path() for consistent
  forward-slash paths across platforms.
2026-03-17 09:30:06 -07:00
Johnny Greco
4c19dba74b
feat: agent CLI introspection (simplified) (#415)
* feat: add agent introspection cli

* refactor: remove agent cli schema version

* refactor: omit missing builder docstrings from context

* refactor: tighten agent cli contract

* feat: add schema_text() to ConfigBase for human-readable field summaries

ConfigBase.schema_text() returns a concise text representation including
the class docstring summary, field names, types, defaults, and
descriptions. Field descriptions added to column config types to
surface through this method.

* refactor: flatten agent CLI into plain functions with text output mode

Delete AgentController class and agent_command_defs module. Move all
logic into agent_introspection (data) and agent_text_formatter (display)
as plain functions. Add --json flag so commands default to human-readable
text using schema_text(), with JSON as opt-in. Unify _emit helper,
remove include_docstrings parameter, deduplicate catalog calls, and fix
N+1 discover_family_types in get_family_schemas.

* fix: port stale controller tests and consolidate command descriptions

Port test_agent_controller.py to use plain functions instead of deleted
AgentController. Extract AGENT_COMMANDS constant as single source for
operation descriptions, syncing with main.py help strings.

* style: fix ruff formatting in agent_introspection

* refactor: centralize agent command definitions

Extract AGENT_COMMANDS into agent_command_defs.py so main.py and
agent_introspection.py share a single source for command names,
help text, and metadata. The new module has no heavy dependencies,
keeping --help latency unaffected.

* fix: handle default_factory and empty providers in schema_text and introspection

- schema_text() now detects default_factory fields and renders e.g. "list()"
  instead of leaking PydanticUndefined
- Guard against IndexError when provider registry has an empty providers list
- Add 15 edge-case tests for schema_text covering default_factory, enum
  defaults, None defaults, scalar defaults, descriptions, and docstrings

* refactor: remove JSON output mode from agent CLI commands

Text-only output simplifies the interface. Structured output can be
added back trivially since the functions already return dicts.

* docs: update schema_text docstring to reflect agent focus

* fix: include builder section and import_path in agent text output

- format_context_text now renders a ## Builder section
- format_types_text now includes import_path column in tables

* refactor: drop import_path from types tables

All config objects are imported via dd.<ClassName>, so the full import
path is redundant noise in agent output.

* docs: add family definition and import hint to context output

* refactor: rename Types section to Families, drop redundant "types" from sub-headers

* fix: coerce None to empty string in table cells

row.get(col, '') returns None when the key exists with value None,
causing str(None) to render "None" in the output. Use `or ''` instead.

* refactor: move agent controller tests to utils as introspection integration tests

There is no controller layer — these tests exercise functions in
agent_introspection.py, so they belong in tests/cli/utils/.

* fix: only coerce None to empty string in table cells, not False

The previous `or ''` pattern treated all falsy values (including False)
as empty. Use an explicit None check so booleans render correctly.
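
The two table-cell fixes in this PR come down to three ways of reading a cell. A small sketch, using hypothetical helper names and a made-up row, shows how they differ:

```python
def cell_with_default(row: dict, col: str) -> str:
    """Original behavior: the default only applies when the key is missing."""
    return str(row.get(col, ""))


def cell_with_or(row: dict, col: str) -> str:
    """First fix: `or ''` also swallows False, 0, and other falsy values."""
    return str(row.get(col) or "")


def cell_explicit(row: dict, col: str) -> str:
    """Second fix: only None (or a missing key) becomes an empty cell."""
    value = row.get(col)
    return "" if value is None else str(value)


# Hypothetical table row used only for illustration.
row = {"name": "temperature", "default": None, "required": False}
```

Here `cell_with_default(row, "default")` renders the literal text `None`, `cell_with_or(row, "required")` renders an empty cell, and only `cell_explicit` handles both columns as intended.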

* style: address review nits from Nabin

- Add explicit parentheses to and/or precedence in _build_agent_lazy_group
- Rename loop variable l to line in test_schema_text
- Move get_family_schema import to module level in test_agent_text_formatter

* fix: improve schema_text Literal display, builder signature quotes, and docstring parsing

- _format_annotation now renders Literal['value'] instead of bare Literal
- _format_signature strips quotes from stringified annotations caused by
  `from __future__ import annotations`
- _get_docstring_summary stops at any Google-style section header, not
  just Attributes:
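
A rough sketch of a summary extractor that stops at any Google-style section header follows. The header set and the function name `docstring_summary` are assumptions for illustration, not the contents of the real `_get_docstring_summary`.

```python
import inspect
from typing import Optional

# Hypothetical subset of Google-style section headers (compared lowercased).
SECTION_HEADERS = {"args:", "attributes:", "returns:", "raises:", "example:", "examples:", "note:"}


def docstring_summary(doc: Optional[str]) -> str:
    """Return the first paragraph of a docstring as a single line."""
    if not doc:
        return ""
    collected = []
    for line in inspect.cleandoc(doc).splitlines():
        stripped = line.strip()
        # Stop at the first blank line or at any known section header, so a
        # header that directly follows the summary is not pulled into it.
        if not stripped or stripped.lower() in SECTION_HEADERS:
            break
        collected.append(stripped)
    return " ".join(collected)
```

Checking against a set of headers rather than only `Attributes:` means a docstring whose summary is immediately followed by `Args:` still yields a clean one-line description.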
2026-03-13 18:26:00 -04:00
Johnny Greco
b94b88b7a4
feat(cli): bootstrap default configs on CLI startup (#401)
* feat(cli): bootstrap default configs on command run

* fix(cli): use active interpreter in bootstrap warning

* refactor(cli): simplify bootstrap warning flow

* refactor(cli): bootstrap defaults in main entrypoint

* refactor(cli): keep bootstrap ownership in main

* test(cli): cover lazy dispatch and runtime failure flag

* refactor(cli): remove redundant bootstrap state

* test(cli): assert bootstrap warning includes error

* test: address cli bootstrap review feedback
2026-03-12 15:42:41 -04:00
Johnny Greco
03b3d6c726
chore: address Andre's feedback on --save-results and CLI preview (#335)
* fix: suppress stdout when saving report and sample records to file

Console(record=True) still prints to stdout by default. Use
file=io.StringIO() to redirect output so save-path calls only
write to disk.

* refactor: --save-results skips terminal display

When --save-results is used, records and the analysis report are no
longer printed to the terminal. Extracted save logic into a dedicated
_save_preview_results method and updated option help text accordingly.

* feat: wrap-around navigation in sample records browser

Prev/next buttons and arrow keys now cycle back to the beginning/end
instead of clamping at boundaries.
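
The wrap-around behavior can be expressed with modular arithmetic. This is a sketch in Python with a hypothetical function name; the pager itself may implement the same idea in browser-side code.

```python
def step_index(current: int, delta: int, total: int) -> int:
    """Move `delta` records from `current`, wrapping around at both ends."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Python's % returns a non-negative result for a positive modulus,
    # so stepping back from index 0 lands on the last record.
    return (current + delta) % total
```

In JavaScript the `%` operator can return a negative value for a negative left operand, so the equivalent expression there is usually written as `((current + delta) % total + total) % total`.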

* test: reuse record_series fixture in visualization tests

* feat: thread --theme through to sample records pager

The pager shell was hardcoded dark, so --theme light produced
light records inside a dark frame. Extract CSS variables into
dark/light constants and pass the theme from the controller.

* fix: cap terminal display width at display_width

The module-level Console() had no width limit, so tables with
expand=True stretched to the full terminal width. Cap terminal
output at min(terminal_width, display_width) and thread the
display_width parameter through the controller's display methods.
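
The width cap amounts to taking the smaller of the two values. A minimal sketch, with a hypothetical helper name and an assumed 80-column fallback:

```python
import shutil
from typing import Optional


def effective_width(display_width: int, terminal_width: Optional[int] = None) -> int:
    """Return min(terminal_width, display_width)."""
    if terminal_width is None:
        # Fall back to 80 columns when the terminal size cannot be determined.
        terminal_width = shutil.get_terminal_size(fallback=(80, 24)).columns
    return min(terminal_width, display_width)
```

The result could then be passed as the `width` argument when constructing a Rich `Console`, so tables with `expand=True` stop at that width instead of filling the terminal.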

* docs: update --display-width and --theme help text

Remove "Only applies when --save-results is used" from
--display-width since it now also affects terminal output.

* fix: update generation controller tests to match display_width and save_results behavior
2026-02-18 20:17:03 -05:00
Johnny Greco
1439bbea7e
chore: Improve CLI startup with lazy heavy import cleanup (#330)
* perf: defer heavy imports to improve CLI startup time

Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost.

Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations

Reduces CLI import time from ~1.67s to ~0.46s.

* perf: defer pandas/numpy in io_helpers and add config_list benchmark

- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers
  with module-level __getattr__ (for backwards-compatible external
  access / test mocks) and function-level imports in the 3 functions
  that actually use them (read_parquet_dataset, smart_load_dataframe,
  _convert_to_serializable). Importing io_helpers no longer triggers
  pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function
  bodies to avoid loading repositories, Rich, and prompt_toolkit at
  module import time.
- Add `config_list` (data-designer config list) measurement to the
  CLI startup benchmark with isolated cold measurement in a separate
  venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.

* Refine lazy import usage and TYPE_CHECKING cleanup

* Run license header updater on PR-touched files

* fix: update sqlfluff mock target for lazy imports in test_sql

* perf: cache globals() in lazy __getattr__ to avoid repeated lookups

Add globals() caching and explanatory comment to all three lazy
__getattr__ implementations (lazy_heavy_imports, config/__init__,
interface/__init__) so subsequent attribute accesses bypass __getattr__.
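
As an illustration of the caching pattern, the sketch below builds a throwaway module object with a module-level `__getattr__` (PEP 562). The alias map and the name `build_lazy_module` are hypothetical, and stdlib modules stand in for heavy dependencies such as pandas.

```python
import importlib
import types

# Hypothetical alias -> module mapping; stdlib modules stand in for heavy ones.
LAZY_IMPORTS = {"js": "json", "dec": "decimal"}


def build_lazy_module(name: str) -> types.ModuleType:
    module = types.ModuleType(name)
    module.load_count = 0  # instrumentation for this demo only

    def __getattr__(attr: str):
        # Called only when normal attribute lookup on the module fails.
        if attr not in LAZY_IMPORTS:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        module.load_count += 1
        value = importlib.import_module(LAZY_IMPORTS[attr])
        # Equivalent of `globals()[attr] = value` inside a real module:
        # later lookups find the cached attribute and skip __getattr__.
        module.__dict__[attr] = value
        return value

    module.__getattr__ = __getattr__
    return module


lazy = build_lazy_module("lazy_demo")
first = lazy.js.dumps({"a": 1})   # triggers the import
second = lazy.js.dumps({"b": 2})  # served from the module namespace
```

After both calls `lazy.load_count` is 1, which is the effect the commit describes: the hook runs once per name and ordinary attribute lookup handles every later access.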

* perf: lazy CLI command loading and deferred heavy import evaluations

- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files

- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes

- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks

- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use

- Update test mock targets to patch at usage-site for module-level imports
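
Wrapping a module-level constant in `lru_cache` might look like the sketch below. The loader name and its data are invented for illustration; the real code loads parquet data and a jsonschema validator.

```python
import functools
import json

LOAD_CALLS = 0  # counts how often the expensive body actually runs


@functools.lru_cache(maxsize=1)
def get_area_codes() -> tuple[str, ...]:
    """Hypothetical stand-in for an expensive load done on first use."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return tuple(json.loads('["212", "415", "617"]'))


# Nothing is loaded at import time; the first call does the work and the
# second call returns the same cached object.
codes_a = get_area_codes()
codes_b = get_area_codes()
```

Compared with a plain module-level assignment, importing the module no longer pays for the I/O, and callers that never need the data never trigger it.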

* refactor: use direct pandas import in seed_source_dataframe

Drop lazy-loading for pandas in DataFrameSeedSource; use direct import
for simplicity.

* update lazy import pattern

* update tests to use lazy import namespace

Switch test modules to import data_designer.lazy_heavy_imports as lazy
and reference heavy libraries through that namespace. This keeps heavy
imports deferred during module import and aligns tests with the new
lazy-import usage pattern.

* tighten import perf test thresholds

Document recent baseline timings and lower the allowed average
import time and timeout so regressions are detected sooner.

* document pandas import requirement

Clarify that Pydantic needs DataFrame resolved at module load and
that keeping the direct import preserves IDE typing support.

* increase timeout

* use lazy pandas imports in visualization tests

- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted

* fix lazy pandas runtime usage and preview mocks

Switch sample-record handling to lazy pandas types so runtime paths no longer
depend on TYPE_CHECKING imports. Align preview controller tests to patch the
module-local DataDesigner symbol, preventing real engine invocation in save
results scenarios.
2026-02-18 16:24:15 -05:00
Johnny Greco
f2a1657870
feat: add --save-results option to preview command (#333)
* feat: add --save-report option to preview command

* feat: add save_path option to display_sample_record

Allow saving rendered sample records as HTML or SVG files via an
optional save_path parameter on both the standalone function and
the WithRecordSamplerMixin method.

* feat: replace --save-report with --save-results on preview command

Replace the single-file --save-report option with --save-results, which saves all preview artifacts (dataset parquet, analysis report HTML, and per-record sample HTMLs) into a timestamped directory under the artifact path. Add error handling around the save block, improve timestamp precision to microseconds, and expand test coverage for the new behavior.
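
A timestamped results directory with microsecond precision could be derived as in this sketch. The `preview_results_` prefix, the timestamp layout, and the function name are assumptions; the commit does not state the actual naming scheme.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def results_dir(artifact_path: str, now: Optional[datetime] = None) -> Path:
    """Build a per-run directory path under `artifact_path`."""
    now = now or datetime.now()
    # %f appends microseconds, so two runs started within the same second
    # still get distinct directory names.
    stamp = now.strftime("%Y%m%d_%H%M%S_%f")
    return Path(artifact_path) / f"preview_results_{stamp}"
```

Passing `now` explicitly keeps the function deterministic in tests.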

* feat: add sample records pager with theme toggle, postMessage bridge, and UI polish

* feat: add dataset metadata subtitle to pager and clean up toolbar layout

* fix: address review findings for preview save-results feature

- Split try/except in generation_controller so report display errors
  don't produce misleading "failed to save" messages when not saving
- Add browser HTML path to save success output for discoverability
- Remove 5 unused CSS variables from pager theme constants
- Add "N of M" record counter to pager toolbar
- Add theme/display_width assertions to all preview_command tests
- Add dedicated test for custom theme and display_width passthrough
- Add tests for record counter and CSS variable cleanup

* fix: address code review findings and simplify pager

- Fix critical bug: analysis report now displays to console even when
  --save-results is active (was silently dropped via pass statement)
- Fix latent UnboundLocalError in display_sample_record when index is
  out of bounds (num_records computed before try block)
- Eliminate duplicated dark CSS between constant and theme listener script
- Simplify sample_records_pager: remove dual-theme system, postMessage
  bridge, and responsive media queries; restore GitHub link; reorder
  toolbar to put prev/next buttons on the far left
- Narrow except Exception to except OSError in save-results path
- Use case-insensitive extension check and lambda-based re.sub
- Collapse redundant preview command delegation tests into parametrize
- Add missing type annotations and remove tautological assertions
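
Two of the smaller items above, the case-insensitive extension check and the lambda-based `re.sub`, can be sketched as follows. The helper names and the `{{BODY}}` placeholder are hypothetical. One common reason to pass a callable to `re.sub` is that its return value is inserted verbatim, whereas a replacement string has backslash escapes processed; the commit does not say whether that was the motivation here.

```python
import re
from pathlib import Path


def has_extension(path: str, *allowed: str) -> bool:
    """Case-insensitive check, so `report.HTML` matches `.html`."""
    return Path(path).suffix.lower() in allowed


def inject(template: str, payload: str) -> str:
    """Substitute `payload` for the placeholder without escape processing."""
    # A plain string replacement would interpret sequences such as \n or \1
    # inside `payload`; a callable replacement is used as-is.
    return re.sub(r"\{\{BODY\}\}", lambda _match: payload, template)
```

For example, `inject("<p>{{BODY}}</p>", r"C:\new\1")` keeps the backslashes intact, while the same payload passed directly as a replacement string would raise an invalid group reference error.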

* style: move record counter to far right of pager toolbar

* refactor: remove dead theme-listener script and inline CSS constant

_THEME_LISTENER_SCRIPT and _SAMPLE_RECORD_DARK_CSS_INLINE became
orphaned after the pager simplification removed the postMessage
bridge. This removes both constants, drops the injection line,
switches the idempotency guard to the viewport meta tag, and
cleans up related test assertions.

* fix: move Path import out of TYPE_CHECKING block in test_visualization

* fix: rename _logger to logger to match codebase convention

* fix: remove unnecessary cast in preview command theme parameter

* refactor: extract DEFAULT_DISPLAY_WIDTH constant and make apply_html_post_processing public

* Update packages/data-designer-config/tests/config/utils/test_visualization.py

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
2026-02-18 15:58:35 -05:00
Johnny Greco
1514720596
feat: support loading config files from HTTP(S) URLs (#323)
* support loading config files from HTTP(S) URLs

- allow config builder and CLI loader to load YAML/JSON configs from
  HTTP(S) URLs
- reject unsupported URL extensions and remote Python module URLs
- update CLI help text and add tests for URL success/failure paths

* harden remote config loading and deduplicate URL validation

- Add size limit (10 MB) when fetching configs from URLs
- Validate parsed YAML is a dict before returning
- Make is_http_url public and reuse it in CLI validate_url
- Replace local CONFIG_FILE_EXTENSIONS with shared constant
- Add tests for is_http_url, URL-with-no-extension edge cases

* use requests for remote config loading

- replace urllib URL fetching with requests and status checks
- parse remote payloads via smart_load_yaml for consistent validation
- expand tests for HTTP errors, size limits, and non-dict payloads

* lower remote config size limit to 1 MB

* improve config URL HTTP error reporting

Add granular 401/403/404 and generic HTTP status errors for remote config fetching to make failures actionable. Clarify that authenticated config URL loading is not currently supported and update tests for status-aware behavior.

* rewrite github blob URLs for remote loading

Handle GitHub blob links by rewriting them to raw content URLs for
config and dataframe HTTP loaders, preserving query params but avoiding
query token leaks in logs. This also fixes extension detection for URLs
with query strings and adds coverage for rewrite behavior.
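
The blob-to-raw rewrite and the query-safe extension check could look roughly like this. The function names are hypothetical and this sketch is not the project's implementation; it relies only on the public `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` URL layout.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


def rewrite_github_blob_url(url: str) -> str:
    """Turn a github.com blob link into a raw content URL, keeping the query."""
    parts = urlsplit(url)
    segments = parts.path.split("/")  # ['', owner, repo, 'blob', ref, *file_path]
    if parts.netloc != "github.com" or len(segments) < 6 or segments[3] != "blob":
        return url  # not a blob link; leave untouched
    raw_path = "/".join(segments[:3] + segments[4:])  # drop the 'blob' segment
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, parts.query, parts.fragment))


def url_extension(url: str) -> str:
    """Read the extension from the URL path only, ignoring any query string."""
    return PurePosixPath(urlsplit(url).path).suffix.lower()


def redact_query(url: str) -> str:
    """Drop query and fragment before logging, so tokens are not leaked."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

Parsing with `urlsplit` first is what makes extension detection reliable: a naive `url.endswith(".yaml")` fails as soon as the URL carries `?token=...`.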

* remove validate_url wrapper in favor of is_http_url

The validate_url function in cli/utils was just a thin wrapper around
is_http_url from io_helpers. Remove it and have callers use is_http_url
directly for clarity and reduced indirection.

* fix optional type for artifact_path CLI option

* fix URL recursion in smart_load_yaml

- avoid treating remote payload strings as new URL inputs
- add regression test for URL string payloads from remote config

* rewrite huggingface blob URLs for remote loading
2026-02-11 15:12:52 -05:00
Johnny Greco
d3c4de76da
feat: add preview, create, and validate CLI commands (#313)
* feat: add preview, create, and validate CLI commands

Add three new top-level CLI commands for the data-designer workflow:
- `data-designer preview` - generate preview datasets for fast iteration
- `data-designer create` - create full datasets and save to disk
- `data-designer validate` - validate configuration files

Also includes:
- Move wait_for_navigation_key() UI primitive from preview.py to ui.py
- Add KeyPressEvent type annotations to all key binding handlers in ui.py
- Refactor cli/utils.py into cli/utils/ package with config_loader module
- Comprehensive test coverage for all new commands

* fix: update pythonjsonlogger import and clean up dev dependencies

- Update pythonjsonlogger import to use newer JsonFormatter API
- Consolidate dev-dependencies into [dependency-groups] dev section
- Remove unnecessary test cli/utils __init__.py

* small E

* address greptile feedback

* organize CLI commands into rich help panels

Group top-level commands under "Generation" and "Setup" panels
for clearer help output.

* refactor config loader to parse files directly and auto-detect config format

- Parse YAML/JSON files into dicts before passing to from_config,
  providing format-specific error messages for parse failures
- Auto-detect DataDesignerConfig format (columns at top level) and
  wrap it into BuilderConfig so users can provide either format
- Clean up Python module loading with try/except/finally for reliable
  sys.modules and sys.path cleanup
- Add comprehensive tests for parsing, validation, and auto-wrapping

* fix sys.path cleanup in config loader and simplify tests

- Use pop(0) instead of remove() to precisely undo the insert(0, ...)
  and avoid accidentally removing a different matching path entry
- Replace MagicMock with real DataDesignerConfigBuilder in tests

* move config format auto-detection into from_config

Centralize the shorthand DataDesignerConfig detection (columns at
top level without a data_designer wrapper) in
DataDesignerConfigBuilder.from_config so all callers benefit, not
just the CLI config loader. Simplify config_loader to delegate file
parsing and format normalization entirely to from_config.
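
The shorthand detection described here might be sketched as below. Only the `columns` and `data_designer` keys come from the commit message; the function name, error types, and messages are assumptions, and the real `from_config` operates on richer objects than plain dicts.

```python
from typing import Any


def normalize_config(raw: Any) -> dict:
    """Accept either the full builder format or the shorthand format."""
    if not isinstance(raw, dict):
        raise TypeError(f"config must be a mapping, got {type(raw).__name__}")
    if "data_designer" in raw:
        return raw  # already in the wrapped builder format
    if "columns" in raw:
        # Shorthand: columns at top level, so wrap it in the expected envelope.
        return {"data_designer": raw}
    raise ValueError("config has neither a 'data_designer' nor a 'columns' key")
```

Doing this in one place means every caller, not just the CLI loader, accepts both shapes.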

* extract GenerationController from CLI commands

Move shared generation logic (preview, validate, create) out of the
individual Typer command functions into a dedicated GenerationController,
matching the existing controller pattern (DownloadController, etc.).
The command functions now delegate to the controller, keeping them as
thin entry points. Tests updated accordingly — command tests verify
delegation while controller tests cover the full behavior.

* harden sys.path cleanup and add explanatory comments

Use sys.path.remove() instead of checking sys.path[0] so cleanup
succeeds even when exec_module inserts entries at index 0. Drop
unnecessary spec=DataDesignerConfigBuilder from test mocks.
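
A sketch of the cleanup pattern, with a hypothetical helper name and a callback standing in for `exec_module`:

```python
import sys
from typing import Callable


def run_with_sys_path(directory: str, action: Callable[[], None]) -> None:
    """Temporarily put `directory` first on sys.path while `action` runs."""
    sys.path.insert(0, directory)
    try:
        action()
    finally:
        # remove() deletes the first entry equal to `directory` wherever it
        # now sits. pop(0) or a sys.path[0] check would miss it if `action`
        # inserted its own entry at index 0.
        try:
            sys.path.remove(directory)
        except ValueError:
            pass  # already removed by the executed code
```

The trade-off against the earlier `pop(0)` approach is that `remove()` matches by value, so it could delete an identical entry that was already on `sys.path` before the call.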

* check stdout TTY in preview interactive mode detection

Previously only stdin was checked, so piping stdout (e.g.
`dd preview cfg.yaml | head`) would still attempt interactive
browsing. Now both stdin and stdout must be a TTY.
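
The detection rule reduces to requiring a TTY on both streams. A minimal sketch, in which `is_interactive` and `FakeTTY` are hypothetical names used only for illustration:

```python
import io
import sys
from typing import Optional, TextIO


def is_interactive(stdin: Optional[TextIO] = None, stdout: Optional[TextIO] = None) -> bool:
    """Interactive browsing requires BOTH stdin and stdout to be terminals."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    return stdin.isatty() and stdout.isatty()


class FakeTTY(io.StringIO):
    """Test double that claims to be a terminal."""

    def isatty(self) -> bool:
        return True


# Simulates piping output to another process: stdin is a terminal, but
# stdout is a plain stream, so interactive mode is off.
piped = is_interactive(FakeTTY(), io.StringIO())
```

With only the stdin check, the piped case above would have been treated as interactive.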
2026-02-11 14:06:06 -05:00