DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Eric W. Tramel	7a539c0e3d	fix: make display_tui the canonical run config flag	2026-05-21 21:51:40 -04:00
Eric W. Tramel	a867a66d58	fix: rename create progress flags to tui	2026-05-21 21:41:59 -04:00
Eric W. Tramel	a24edaa06a	feat: add create progress override flags	2026-05-21 21:36:58 -04:00
Johnny Greco	d14c9b3ccc	feat(cli): add plugin catalog core (#618 ) * feat(cli): add plugin catalog services Add typed catalog and tap models, persistent tap storage, cached catalog loading, compatibility evaluation, install plan generation, and runtime plugin discovery helpers. Refs #617 * feat(cli): add plugins command group Wire list, search, info, install, installed, and tap management commands through the existing command-controller CLI pattern. Refs #617 * test(cli): cover plugin catalog workflows Add regression coverage for tap caching, catalog compatibility, installer command generation, local path resolution, and Typer command delegation. Refs #617 * fix(cli): align plugin taps with schema v2 Validate tap catalogs against the schema v2 contract used by NVIDIA-NeMo/DataDesignerPlugins#36, including source union fields, docs URLs, package paths, compatibility metadata, and unique runtime plugin names. Derive Git install targets as package-qualified PEP 508 direct references so git tap entries install the package described by the catalog source metadata. Refs #617 * fix(cli): address plugin review feedback - Invalidate import caches before post-install entry point verification - Make tap aliases case-insensitive and cache catalogs by alias plus URL - Prefer compatible catalog entries before falling back to forced installs - Clarify unused --tap behavior and list installed entry points without imports - Add direct controller coverage and update CLI plugin documentation Refs #617 * fix(cli): gate incompatible plugin installs Fetch install targets before compatibility filtering so the controller owns the final --force decision and the incompatible install guard stays reachable. Refs #617 * style(cli): format plugin catalog files Apply ruff formatting to the plugin command and tap repository tests so CI format checks pass on the PR merge commit. 
Refs #617 * fix(cli): reject duplicate plugin entry names Key catalog duplicate detection by entry_point.name so distinct catalog entries cannot register the same runtime plugin name. Refs #617 * fix(cli): preserve GitHub tree tap paths * fix(cli): verify plugin entry point names * align plugin CLI with catalog schema - adopt catalog terminology for plugin source aliases - parse package-first plugin catalog metadata from the plugin repo - install package requirements with optional catalog indexes * tidy plugin catalog workflow docs * align plugin catalog CLI with package contract * add plugin package uninstall workflow * test plugin package command targets * document plugin package aliases * address plugin catalog review feedback * prefer runtime plugin lookup matches * rename plugins command to plugin * show plugin package descriptions * rename plugin catalogs command * add protected plugin package installs * document plugin package install modes * avoid building project during plugin installs * harden plugin package installs * tighten plugin catalog contracts * fix no-args help exit code * make plugin docs links robust * document plugin CLI catalog workflows * clarify plugin entry point verification * simplify plugin CLI docs * narrow plugin search fields * hide plugin catalog cache ttl * remove plugin catalog trust flag * improve plugin CLI recovery UX * polish plugin catalog table display * stabilize plugin catalog table test * tighten plugin catalog edge cases * harden plugin catalog verification - Escape catalog-provided Rich markup before rendering CLI output - Reject runtime plugin names that collide after enum-key normalization - Load installed runtime entry points in a subprocess before reporting success * simplify plugin entry point verification Load matching entry points directly after install instead of spawning a separate Python process. This keeps the check package-scoped while still catching broken entry-point targets and non-Plugin objects. 
* require newer uv for plugin plans Use uv >= 0.10.0 as the single supported uv requirement for plugin package commands. Auto mode now falls back to a pip plan with an upgrade warning when uv is unavailable or too old, while explicit uv selection remains strict. * verify pip fallback availability * polish plugin CLI status markers * clarify plugin compatibility labels * simplify plugin info install details * address plugin CLI review nits * support versioned plugin package installs * share plugin install metadata rendering * show installed plugin packages * harden versioned plugin installs - Preserve catalog requirement constraints for versioned installs - Remove stale install-plan metadata fields - Expand parser, uv, controller, and local-catalog dry-run coverage * harden plugin help tests * show plugin package versions Add package version metadata support for plugin catalogs and resolve current versions from exact requirements or simple indexes when catalog entries omit them. Update plugin list/info/install metadata to show the plugin package version and Data Designer compatibility requirement while removing the separate Data Designer version line. * format plugin catalog tests * harden plugin package metadata checks * harden plugin CLI test coverage * add plugin discovery docs (#642) Signed-off-by: Johnny Greco <jogreco@nvidia.com> --------- Signed-off-by: Johnny Greco <jogreco@nvidia.com>	2026-05-13 12:26:58 -04:00
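The installer rule described in this commit (uv >= 0.10.0 required, auto mode falling back to pip with a warning, explicit uv staying strict) can be sketched roughly as below. This is an illustration only, not the project's code: `InstallerChoice`, `parse_version`, `choose_installer`, and the explicit `"pip"` mode are hypothetical names and assumptions, and only plain dotted numeric versions are handled.

```python
from __future__ import annotations

from dataclasses import dataclass

# Minimum uv version named in the commit message.
MIN_UV_VERSION = (0, 10, 0)


@dataclass(frozen=True)
class InstallerChoice:
    """Hypothetical result type: the installer to plan with, plus an optional warning."""

    installer: str
    warning: str | None = None


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '0.10.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def choose_installer(mode: str, uv_version: str | None) -> InstallerChoice:
    """Pick an installer for mode 'auto', 'uv', or 'pip' (the 'pip' mode is an assumption).

    uv_version is None when uv is not available at all.
    """
    uv_ok = uv_version is not None and parse_version(uv_version) >= MIN_UV_VERSION
    if mode == "pip":
        return InstallerChoice("pip")
    if mode == "uv":
        # Explicit uv selection stays strict: no silent fallback.
        if not uv_ok:
            raise RuntimeError("uv >= 0.10.0 is required when uv is selected explicitly")
        return InstallerChoice("uv")
    if mode == "auto":
        if uv_ok:
            return InstallerChoice("uv")
        # Auto mode degrades to a pip plan and tells the user why.
        return InstallerChoice(
            "pip", warning="uv is missing or older than 0.10.0; falling back to pip"
        )
    raise ValueError(f"unknown installer mode: {mode!r}")
```

Keeping the decision in a pure function like this makes the fallback and strict paths easy to cover in unit tests without invoking uv or pip.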
Przemysław Boruta	810c681f7a	feat: resume interrupted dataset generation runs (sync + async engine) (#526)
* docs: add implementation plan for resume mechanism Fixes #525 * feat(storage): add resume flag and clear_partial_results() - ArtifactStorage gains a `resume: bool = False` field - resolved_dataset_name skips timestamp logic when resume=True, returning the existing dataset folder name as-is - Raises ArtifactStorageError on resume=True when the target folder is absent or empty (no data to resume from) - New clear_partial_results() removes in-flight partial results left over from an interrupted run Fixes #525 * feat(batch-manager): add start_batch param to start() DatasetBatchManager.start() now accepts: - start_batch: int = 0 (first batch index to process) - initial_actual_num_records: int = 0 (records already on disk) Both default to 0 so all existing call sites are unaffected.
Fixes #525 * feat(builder): implement resume logic in DatasetBuilder - build() gains a resume: bool = False parameter - _load_resume_state() reads metadata.json and validates that num_records and buffer_size match the original run - _build_with_resume() skips completed batches, clears in-flight partial results, and continues from the first incomplete batch - Raises DatasetGenerationError with clear messages for: - missing metadata.json (interrupted before first batch completes) - num_records mismatch - buffer_size mismatch - DATA_DESIGNER_ASYNC_ENGINE=1 (not yet supported) - Logs a warning and returns early when dataset is already complete Fixes #525 * feat(interface): expose resume on DataDesigner.create() - create() gains resume: bool = False - _create_resource_provider() passes resume to ArtifactStorage - builder.build() receives the resume flag Fixes #525 * test: add tests for resume mechanism Covers: - ArtifactStorage.resolved_dataset_name with resume=True - ArtifactStorage.clear_partial_results() - DatasetBatchManager.start() with start_batch and initial_actual_num_records - DatasetBuilder.build(resume=True): missing metadata, num_records mismatch, buffer_size mismatch, already-complete detection Fixes #525 * feat(builder): extend resume to async engine (DATA_DESIGNER_ASYNC_ENGINE=1) - Add _find_completed_row_group_ids() to scan parquet-files/ for already-written row groups by parsing batch_.parquet filenames - _build_async() now accepts resume=True: loads metadata, finds completed row groups, clears partial results, and logs progress; returns early if all row groups are done - _prepare_async_run() accepts skip_row_groups, initial_actual_num_records, and initial_total_num_batches so the scheduler only processes remaining row groups and RowGroupBufferManager starts from the correct counts - RowGroupBufferManager.__init__ gains initial_actual_num_records and initial_total_num_batches params to seed the counters on resume - finalize_row_group closure now writes 
incremental metadata after each checkpoint so any run (resume or not) can be resumed if interrupted mid-way - Remove the guard that rejected resume=True with DATA_DESIGNER_ASYNC_ENGINE=1 - Add tests for all new paths fix(builder): skip after-generation processors when resume finds dataset already complete _build_with_resume and _build_async now return False when the dataset is already complete (early-return path), True otherwise. build() skips _processor_runner.run_after_generation() on False, preventing processors from calling shutil.rmtree and rewriting an already-finalized dataset. Fixes the issue raised in review: greptile P1 comment on PR #526. * fix(builder): use filesystem count for initial_total_num_batches on async resume Metadata can lag by one row group if a crash occurs between move_partial_result_to_final_file_path and write_metadata. Using len(completed_ids) from the filesystem scan instead of state.num_completed_batches ensures the final metadata reflects the actual number of parquet files present, not the potentially stale metadata count. * feat(results): add export() method and --output-format CLI flag Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet. The CLI create command gains --output-format / -f which writes dataset.<format> alongside the parquet batch files. * fix(builder): handle resume when metadata.json missing (interrupted before first batch) When a run is interrupted before any row group or batch completes, metadata.json is never written. Previously resume=True would raise DatasetGenerationError in this case. Now build() detects the missing file, logs an info message, clears any leftover partial results and falls back to a clean fresh run. This is the common scenario for small datasets (fewer records than buffer_size) where all records fit in a single row group. 
* docs(interface): fix resume docstring — async engine is supported * fix(builder): derive initial_actual_num_records from filesystem in async resume In the crash window (row group written to disk but write_metadata crashed before updating the file), both initial_total_num_batches and initial_actual_num_records now use the filesystem-discovered completed_ids as source of truth. Previously initial_actual_num_records was read from potentially stale metadata, causing actual_num_records in the final metadata to be undercounted by one row group. Also adds a test covering the partial-resume crash-window scenario. * feat(resume): replace resume: bool with ResumeMode enum (NEVER/ALWAYS/IF_POSSIBLE) - Introduces ResumeMode(StrEnum) in artifact_storage.py for use across all layers - Replaces resume: bool with resume: ResumeMode in DatasetBuilder.build(), DataDesigner.create(), ArtifactStorage, and _build_async() - Adds _check_resume_config_compatibility() using config fingerprints to support IF_POSSIBLE: falls back to a fresh run when config has changed since last run - Relaxes num_records validation from strict equality to num_records >= actual_num_records, allowing dataset extension on resume; buffer_size must still match exactly - Preserves exception chain with 'from exc' on FileNotFoundError in _load_resume_state - Exports ResumeMode from data_designer.interface for users to import - Adds skip_row_groups assertion test and IF_POSSIBLE storage behavior tests * fix(resume): invalidate resolved_dataset_name cache when IF_POSSIBLE downgrades to NEVER ArtifactStorage's Pydantic model validator accesses base_dataset_path at construction time, caching resolved_dataset_name under IF_POSSIBLE semantics before build() can set resume=NEVER. Pop the stale cache entry so the property re-resolves with the correct NEVER semantics (timestamped directory). 
Also fixes _check_resume_config_compatibility() to use artifact_path/dataset_name directly instead of base_dataset_path, and adds a regression test covering the cache-bypass scenario. * fix(builder): move partial-completion warning before return in _build_async * fix(builder): IF_POSSIBLE now starts fresh when no dataset directory exists _check_resume_config_compatibility returned True when config_path was absent, even when the dataset directory itself didn't exist. This caused IF_POSSIBLE to upgrade to ALWAYS, which then raised ArtifactStorageError on the first-ever run because ALWAYS requires an existing directory. Fix: return False early when the dataset directory is absent. Also sets actual_num_records on mock buffer managers in two async resume tests that started failing after the partial-completion warning block was made reachable. * fix(builder): use original target_num_records in async resume record count When extending a non-aligned run (e.g. original num_records=5, buffer_size=2), the last completed row group has 1 record, not buffer_size=2. Using new num_records in the formula would overcount: min(2, 7-2*2)=2 instead of min(2, 5-2*2)=1. Fix: capture state from _load_resume_state (previously discarded) and pass state.target_num_records into the sum formula. Added target_num_records field to _ResumeState, populated from metadata.json. Test: test_build_async_resume_initial_actual_num_records_uses_original_target * fix(builder): IF_POSSIBLE starts fresh on empty dataset directory Empty directory (crash between mkdir and first file write) was treated as compatible: _check_resume_config_compatibility returned True, IF_POSSIBLE upgraded to ALWAYS, which then raised ArtifactStorageError. Fix: treat empty directory the same as missing, returning False from _check_resume_config_compatibility when any(dir.iterdir()) is False. 
Test: test_if_possible_starts_fresh_when_directory_is_empty * fix(builder): ALWAYS raises DatasetGenerationError on config fingerprint mismatch ResumeMode.ALWAYS was documented to raise when column/model config changed, but _check_resume_config_compatibility() was only called in the IF_POSSIBLE branch. A user resuming with ALWAYS after changing the config would silently mix records from two different configs. Fix: - Refactor _check_resume_config_compatibility() to return _ConfigCompatibility enum (COMPATIBLE / INCOMPATIBLE / NO_PRIOR_DATASET) instead of bool so callers can distinguish 'no prior run' from 'configs differ' - Call the check for both ALWAYS and IF_POSSIBLE before _write_builder_config() - ALWAYS + INCOMPATIBLE → DatasetGenerationError - IF_POSSIBLE + INCOMPATIBLE → silent fresh start (existing behaviour) - IF_POSSIBLE + NO_PRIOR_DATASET → silent fresh start (existing behaviour) Test: test_build_resume_always_raises_on_config_mismatch * fix(resume): address nabinchha review — drop export collision, add CLI flag, fix edge cases C1: drop commit `0bdf24ab` — remove export() / --output-format from this PR; that feature belongs to #540 which has a superior streaming implementation C2: add --resume / -r flag to data-designer create CLI, thread ResumeMode through GenerationController.run_create() into DataDesigner.create() C3: fix already-complete warning text — replace stale "Remove resume=True" with "Use resume=ResumeMode.NEVER" in _build_with_resume and _build_async C4: fix docstrings — ALWAYS does NOT raise when no checkpoint exists (silently restarts from scratch); clarify num_records >= actual semantics C5: sync artifact_storage.resume = NEVER when no-metadata fallback fires so both state holders agree after the downgrade C6: fix return_value=False → _ConfigCompatibility.INCOMPATIBLE in IF_POSSIBLE test; drop 3 direct _find_completed_row_group_ids tests (private API, covered by build()) W1: add logger.warning when builder_config.json is absent (silent 
COMPATIBLE was footgun) W2: narrow except Exception → (OSError, json.JSONDecodeError, ValidationError) W3: run make check-all-fix — ruff reformatted test_if_possible_starts_fresh_when_directory_is_empty * fix(builder): replace stdlib StrEnum with project compat shim for Python 3.10 * fix(builder): guard extension row groups in initial_actual_num_records formula on async resume When extending an async run (num_records > state.target_num_records) and a crash occurs after an extension row group is written to disk but before write_metadata, the formula `min(buffer_size, state.target_num_records - rg_id * buffer_size)` yields a negative value for any extension row group (rg_id * buffer_size >= target), making initial_actual_num_records silently undercount. The RowGroupBufferManager then starts at the wrong offset, and the final metadata reports an incorrect actual_num_records with a false partial-completion warning. Fix: use state.target_num_records for original row groups and num_records for extension row groups (guarded by rg_id * buffer_size < state.target_num_records). Covers the scenario with a new regression test. * fix(builder): pre-compute row-group list in _build_async to fix sizes on non-aligned extension resume The partitioning loop in _prepare_async_run decremented remaining by min(buffer_size, remaining) for every row group, including skipped ones. For a non-aligned original run (e.g. target=5, buffer_size=2, last group has 1 record), the loop deducted 2 for the skipped last group, leaving remaining one short. Extension row groups received smaller sizes than intended, so the generated dataset was silently short by the deficit and a false partial-completion warning fired. Fix: pre-compute the full row-group list with correct per-group sizes in _build_async where state.target_num_records is available, then pass it to _prepare_async_run as precomputed_row_groups (replacing the skip_row_groups param). 
Original groups use min(buffer_size, target - rg*bs); extension groups use min(buffer_size, extension_records - ext_idx*bs). Also updates the skip_row_groups test to assert on precomputed_row_groups and adds a regression test for the non-aligned extension case. * chore: remove stale implementation plan for #525 The plan described the initial resume: bool design which has since been replaced by the full ResumeMode enum (NEVER/ALWAYS/IF_POSSIBLE), async engine support, filesystem reconciliation, and config compatibility checks. The PR description is the authoritative record of what shipped. * fix(engine): fix false 'already complete' when extension fits in last group's slack original_target=5, buffer_size=2 produces 3 groups [2,2,1]. Extending to num_records=6: ceil(6/2)=3 equalled len(completed_ids)=3, triggering the already-complete branch on both the async and sync paths and silently returning the 5-record dataset. Fix (async): replace ceil(num_records/bs) with num_original_groups + ceil(extension_records/bs) so any extension always adds new groups beyond num_original_groups. Fix (sync): add num_records_list param to DatasetBatchManager.start() and pass the correct per-batch sizes in _build_with_resume, giving the batch manager the right total batch count (4 instead of 3 in the example). * fix(engine): raise error when num_records is below original target on resume Prevents negative extension_records in async path which silently truncated the dataset and corrupted metadata without triggering a partial-completion warning. * fix(storage): refresh MediaStorage path after IF_POSSIBLE → NEVER downgrade When build() detected an incompatible config and downgraded resume from IF_POSSIBLE to NEVER, _media_storage.base_path remained bound to the original directory while all other path properties resolved to the new timestamped directory, causing broken image references in image-column runs. 
* fix(engine): preserve original_target_num_records across extension resume writes After finalize_row_group successfully wrote incremental metadata during an extension run, target_num_records in metadata was updated to the extension target. A subsequent resume would read this as the original target, making _rg_size() incorrect for all row groups and silently corrupting actual_num_records. Stores original_target_num_records as an immutable field in metadata so the original group boundaries are always recoverable regardless of how many incremental writes have occurred. --------- Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>	2026-05-08 15:37:56 -06:00
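The row-group arithmetic that several of the resume fixes above converge on can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `_build_async`: `row_group_sizes` and `records_on_disk` are hypothetical helper names, and completed row groups are assumed to be identified by zero-based index.

```python
from __future__ import annotations

import math


def row_group_sizes(original_target: int, num_records: int, buffer_size: int) -> list[int]:
    """Per-row-group sizes for a run that may extend an earlier one.

    Original groups are sized against the original target so their boundaries
    never move; extension groups are sized against the extra records only.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    if num_records < original_target:
        # Mirrors the commit that rejects shrinking the target on resume.
        raise ValueError("num_records is below the original target")

    num_original = math.ceil(original_target / buffer_size)
    sizes = [
        min(buffer_size, original_target - rg * buffer_size)
        for rg in range(num_original)
    ]

    extension_records = num_records - original_target
    num_extension = math.ceil(extension_records / buffer_size)
    sizes += [
        min(buffer_size, extension_records - ext_idx * buffer_size)
        for ext_idx in range(num_extension)
    ]
    return sizes


def records_on_disk(sizes: list[int], completed_ids: set[int]) -> int:
    """Records already written, derived from the completed row-group indices."""
    return sum(sizes[rg] for rg in completed_ids)
```

With `original_target=5`, `buffer_size=2`, and `num_records=6`, this yields four groups rather than `ceil(6/2) = 3`, which is the case the "already complete" fix addresses.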
Eric W. Tramel	417b0c715d	feat(cli): show version update notice (#602)	2026-05-07 15:20:18 -04:00
Przemysław Boruta	0afe287a5f	feat(results): add export() method and --output-format CLI flag (#540 ) * feat(results): add export() method and --output-format CLI flag Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet. The CLI create command gains --output-format / -f which writes dataset.<format> alongside the parquet batch files. * fix(cli): validate output_format before dataset generation * fix(cli): remove top-level results import from create.py to preserve lazy loading * fix(results): address andreatgretel review — error types, UX ordering, import hygiene - Derive SUPPORTED_EXPORT_FORMATS from get_args(ExportFormat) so the two can't drift apart - Replace ValueError with InvalidFileFormatError in export() — consistent with project error conventions - Add date_format="iso" to to_json() for consistent datetime serialization across formats - Add click.Choice(SUPPORTED_EXPORT_FORMATS) to --output-format CLI option for parse-time validation, better --help output, and tab completion - Fix double load_dataset() in run_create: inline len() so the DataFrame ref dies before export - Move success message after the export block to avoid "Dataset created" followed by "Export failed" - Move imports to module level in test_results.py (json, Path, lazy already imported) - Add controller-level tests for output_format happy path, bad format rejection, and export failure * fix(results): correct Raises docstring — ValueError -> InvalidFileFormatError * feat(results): stream batch files in export() to avoid OOM on large datasets - Rewrite export() to read batch parquet files one at a time instead of materialising the full dataset via load_dataset(); peak memory is now proportional to a single batch regardless of dataset size - Infer output format from file extension by default; format= parameter kept as an explicit override (e.g. writing .txt as JSONL) - _export_parquet unifies schemas across batches (pa.unify_schemas) to handle type drift (e.g. 
int64 vs float64 in the same column) - Drop format= from the controller's export() call — path already carries the correct extension - Rewrite export tests around real batch parquet files (stub_batch_dir fixture); add tests for multi-batch output, schema unification, unknown extension, empty batch directory, and explicit format override * fix(results): address nabinchha review — memory safety, error wrapping, UX - Replace load_dataset() with count_records() in CLI to avoid OOM on large datasets; add count_records() method using pq.read_metadata (reads file metadata only, no data pages loaded) - Remove redundant format validation in controller — click.Choice in create.py already rejects invalid values at parse time; dead code removed along with corresponding test - Wrap pa.unify_schemas / table.cast ArrowInvalid as InvalidFileFormatError to normalize third-party exceptions at module boundaries per AGENTS.md - Lowercase file extension before format lookup so .JSONL/.CSV/.PARQUET are accepted without error - Add clarifying comment to trailing-newline guard in _export_jsonl - Add tests: count_records(), uppercase extension, incompatible schemas * fix(results): fix parquet export schema unification and controller path bug - Use promote_options="permissive" in pa.unify_schemas so minor numeric type drift (int64 vs float64) is handled by promotion instead of raising - Also catch ArrowTypeError from unify_schemas and ValueError from table.cast() — the actual exception types thrown by pyarrow for these cases (ArrowInvalid alone is not sufficient) - Wrap base_dataset_path in Path() in generation_controller.run_create to guard against callers that return a str (mock returns str, Path does not support / with str operands) - Update test_export_parquet_incompatible_schemas_raises to match the new error source: with permissive unification, different-column-name batches fail at cast() not at unify_schemas(), so the match string changes from "Cannot unify batch schemas" to "Cannot 
cast batch" * fix(results,cli): address nabinchha review round 2 - Use public pa.ArrowInvalid/ArrowTypeError instead of pa.lib.* in _export_parquet - Drop dead trailing-newline guard in _export_jsonl; skip empty batches with `if content` - Rename num_records→actual_record_count after count_records() call to avoid shadowing - Unlink partial export file before re-raising on export failure in run_create - Export filename now uses dataset_name (<dataset-name>.<format>) instead of literal "dataset" - Update help text and tests to match new export filename convention --------- Co-authored-by: Andre Manoel <165937436+andreatgretel@users.noreply.github.com>	2026-05-06 17:13:57 -06:00
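The format-resolution behaviour described in this commit (infer from the file extension, lowercase it first, allow an explicit `format=` override, derive the supported set from the `Literal` type) might look roughly like the sketch below. It is an assumption-laden illustration: `resolve_export_format` is a hypothetical helper, and `InvalidFileFormatError` here is a local stand-in for the project's error type of the same name.

```python
from __future__ import annotations

from pathlib import Path
from typing import Literal, get_args

ExportFormat = Literal["jsonl", "csv", "parquet"]
# Deriving the tuple from the Literal keeps the two from drifting apart.
SUPPORTED_EXPORT_FORMATS: tuple[str, ...] = get_args(ExportFormat)


class InvalidFileFormatError(Exception):
    """Local stand-in for the project's error type of the same name."""


def resolve_export_format(path: str | Path, format: str | None = None) -> str:
    """Return the export format for `path`, honouring an explicit override.

    Without an override, the extension is lowercased before lookup so that
    names like 'out.JSONL' are accepted.
    """
    if format is not None:
        chosen = format
    else:
        chosen = Path(path).suffix.lstrip(".").lower()
    if chosen not in SUPPORTED_EXPORT_FORMATS:
        raise InvalidFileFormatError(
            f"Unsupported export format {chosen!r}; "
            f"expected one of {', '.join(SUPPORTED_EXPORT_FORMATS)}"
        )
    return chosen
```

The explicit override covers cases such as writing JSONL to a `.txt` path, which the commit message mentions as the reason `format=` was kept.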
Nabin Mulepati	f73da1975c	feat(models): deprecate implicit default provider routing (#594)
* feat(models): deprecate implicit default provider routing Emit DeprecationWarning whenever the legacy "implicit default provider" path is exercised: `ModelConfig.provider=None`, the registry-level `ModelProviderRegistry.default`, the YAML `default:` key in `~/.data-designer/model_providers.yaml`, and the CLI's "Change default provider" workflow. `resolve_model_provider_registry` skips passing `default=` in the single-provider case so the common construction path stays quiet. Multi-provider registries still pass `default` (per `check_implicit_default`) and warn accordingly. Update docs, the package README, and test fixtures to specify `provider=` explicitly on every `ModelConfig`. New tests cover each warning entry point and pin the post-deprecation happy paths. Refs #589 Made-with: Cursor * fix(models): address PR #594 review feedback Greptile P1: ProviderRepository.load emitted its DeprecationWarning inside a `try/except Exception` block. Under `filterwarnings("error", DeprecationWarning)` the warn would raise, the except would swallow it, and `load()` would silently return None (losing the registry). Move the warn outside the catch-all so the strict-warning path no longer drops valid configs. 
Greptile P2 / johnnygreco: `_warn_on_implicit_provider` and `_warn_on_explicit_default` use `stacklevel=2`, which lands inside pydantic v2's validator dispatch rather than at the user's `ModelConfig(...)` / `ModelProviderRegistry(...)` call. That broke both attribution (the source line was unhelpful) and Python's once-per-location dedup (every call collapsed to the same pydantic-internal key, suppressing all but the first warning). Introduce `data_designer.config.utils.warning_helpers.warn_at_caller`, which walks past the helper, validator, and any pydantic frames to find the user's call site and emits via `warnings.warn_explicit` with the user frame's `__warningregistry__`. Keeps attribution accurate and dedup keyed on the user's (filename, lineno). johnnygreco: align the `provider_repository.py` warning copy with the sibling site in `default_model_settings.py` ("specify provider= explicitly on each ModelConfig instead") so both YAML-default warning sites give the same migration instruction. The previous wording pointed users at "ModelConfig entries" inside `model_providers.yaml`, where ModelConfig entries don't actually live. johnnygreco: dedup the cascade in `DataDesigner.__init__`. With `model_providers=None` and a YAML `default:`, the user previously saw two DeprecationWarnings for the same root cause — `get_default_provider_name()` warns about the YAML key, then `resolve_model_provider_registry(...)` re-warns from `_warn_on_explicit_default`. Suppress the registry-level duplicate in the YAML-fallback branch via `warnings.catch_warnings()` so users see exactly one warning per user action. johnnygreco: tighten `_warn_on_explicit_default` to fire only when `default is not None`. Passing `default=None` explicitly is semantically equivalent to omitting it (caller is opting out of a registry-level default), and shouldn't trigger the deprecation nudge. 
johnnygreco: add a `model_validate({...})` regression test for `ModelConfig` so the deserialization path (legacy on-disk configs) is pinned alongside the construction path. Tests: - Update `test_load_exists` and `test_save` to omit `default=` so the roundtrip stops exercising the deprecated YAML-default path unguarded (Greptile note). - Wrap `test_resolve_model_provider_registry_with_explicit_default`, `test_get_provider`, and `test_init_user_supplied_providers_preserve_first_wins_over_yaml_default` in `pytest.warns` so the suite stays green under `-W error::DeprecationWarning` (Greptile note). - Add `test_explicit_default_none_does_not_emit_deprecation_warning` to pin the tightened predicate. - Add `test_init_yaml_default_emits_single_deprecation_warning` to pin the cascade-dedup behavior. Refs #589 Made-with: Cursor * fix(models): make deprecation warnings visible under default filters andreatgretel (PR #594): the YAML-default warning in `get_default_provider_name` and the registry-default warning emitted from inside DataDesigner helpers were attributing to data_designer library frames, not user code. Python's default filter chain includes `ignore::DeprecationWarning`, so library-attributed entries are silenced — meaning a normal `DataDesigner()` call with a YAML `default:` set showed nothing, and `resolve_model_provider_registry` warnings were similarly invisible. Two related changes: 1. `warn_at_caller`: extend the default skip-list from `("pydantic",)` to `("pydantic", "pydantic_core", "data_designer")` so the walk escapes both pydantic's validator-dispatch frames and data_designer helper frames before attributing. Also tighten the prefix predicate to exact-or-dotted-prefix matching (`name == p or name.startswith(p + ".")`) so e.g. `pydantic_helpers` is not falsely matched as part of `pydantic` (johnnygreco nit). Allow callers to pass a custom `skip_prefixes` for flexibility. Drop the "skip frame 0+1 unconditionally" guard now that prefix matching covers it. 
2. `get_default_provider_name`: switch from `warnings.warn(stacklevel=2)` to `warn_at_caller`. The previous stacklevel pointed into `default_model_settings.py`, which is a library file → silenced under default filters. Verified the fix empirically with `python -W default`: warning is now attributed to the user's call site and rendered. johnnygreco (PR #594): add the missing `test_explicit_default_none_does_not_emit_deprecation_warning` regression for the `self.default is not None` predicate landed in the prior round. Tests: - New `test_warning_helpers.py` pins prefix-matching precision (rejects `pydantic_helpers` / `data_designer_other`), default skip-list contents, attribution past skip-prefix frames, and per-call-site dedup behavior. - `test_get_default_provider_name_warning_attributes_to_user_frame` pins andreatgretel's repro for the YAML-default site. - `test_explicit_default_warning_attributes_to_user_frame` pins the multi-frame case: construction goes through `resolve_model_provider_registry`, so the walk has to escape both pydantic and data_designer before landing on the test file. - `test_explicit_default_none_does_not_emit_deprecation_warning` pins johnnygreco's predicate-tightening regression. 3,124 tests pass (540 config + 1,923 engine + 653 interface; +10 net from this round). Refs #589 Made-with: Cursor * fix(models): apply warn_at_caller to remaining deprecation sites greptile-apps (PR #594, r3189904028): `ProviderRepository.load`'s YAML-default `DeprecationWarning` was using `warnings.warn(stacklevel=2)`, which attributes to whichever data_designer frame called `load()` — controllers, services, list/reset commands, agent introspection. Every real call path lands on `data_designer.cli.*`, which falls under Python's default `ignore::DeprecationWarning` filter and is silenced. 
Audit found two more sites with the same problem: - `DatasetBuilder._resolve_async_compatibility` (`allow_resize` / issue #552) — was using `stacklevel=4` to walk past `_resolve_async_compatibility -> build/build_preview -> interface -> user`. Brittle: any added frame (decorator, async wrapping, the `try/except DeprecationWarning: raise` boundary) shifts attribution silently. The existing test passed only because it used `simplefilter("always") + record=True`, which records warnings regardless of attribution. - `ProviderController._handle_change_default` — was using `stacklevel=2`, which lands on the menu dispatcher in the same controller module. `print_warning` already shows the message visually, but programmatic observers (`pytest.warns`, `filterwarnings("error", ...)`) saw a library-attributed entry that default filters silenced. All three migrated to `warn_at_caller` (the helper from `247fa30`) so attribution lands on the user's call site regardless of internal chain shape. `data_designer` is already in `DEFAULT_INTERNAL_PREFIXES`, so the walk escapes the entire library in one pass. Added attribution regression tests at each site asserting `warning.filename == __file__`. A future regression to `warnings.warn(stacklevel=N)` now fails CI instead of silently silencing the user-facing nudge: - `test_load_with_yaml_default_attributes_warning_to_caller` (test_provider_repository.py) - `test_resolve_async_compatibility` extended with the same assertion - `test_handle_change_default_emits_deprecation_warning` rewritten from `pytest.warns(...)` to a `catch_warnings(record=True)` block that filters for the message and asserts `filename == __file__` (`pytest.warns` does not check attribution, so the rewrite is required to actually catch the regression). 3,125 tests pass (548 config + 1,923 engine + 654 interface). Refs #589	2026-05-05 13:39:12 -06:00
Mike Knepper	98715dcd86	chore(cli): Add --org option to NGC download command (#604 )	2026-05-05 08:03:49 -05:00
Eric W. Tramel	fc0365cada	feat(cli): add data-designer --version (#599 )	2026-05-04 13:30:45 -04:00
Johnny Greco	a65903eb1a	chore: add ko_KR locale to nemotron personas datasets (#572 ) * chore: add ko_KR locale to nemotron personas datasets Register Korean (ko_KR, 2.66 GB) as an available managed persona dataset locale, update related CLI/repository tests, and document the new locale and its NGC download command. * update person fields * update fr_FR size * docs: reconcile personas field tables with installed parquet schemas Remove stale per-locale fields that no longer exist in any managed parquet (commune, departement, prefecture), drop district from the India-specific section since it's already listed in Core Fields, rename digital_skills → digital_skill to match the actual ja_JP column, and add sections for ko_KR, en_SG, and the en_US/en_SG shared ethnic_background. Corrects the religion-family membership to include en_SG. * test: add missing fr_FR assertion in test_run_personas_with_all_flag The test asserts all 9 locales were downloaded but only enumerates 8 in its per-locale checks; fr_FR has been missing since before the ko_KR addition. Align the enumeration with the count. * docs: add ko_KR to locale parameter list	2026-04-24 17:19:02 -04:00
Johnny Greco	0d10bf8dc6	feat: add fr_FR locale to nemotron personas datasets (#468 ) * feat: add fr_FR locale to nemotron personas datasets Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES and add 7 France-specific PII fields: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement. * fix: update download controller and service tests for fr_FR locale Update hardcoded locale counts from 7 to 8 and add fr_FR assertions in download controller and download service tests. * fix: generate CLI locale help text dynamically from constants The --locale help text was hardcoded and already stale (missing en_SG, pt_BR, fr_FR). Build it from LOCALES_WITH_MANAGED_DATASETS so it stays in sync automatically. * refactor: add LOCALES_WITH_MANAGED_DATASETS_STR constant Centralise the comma-joined locale list so it is defined once in constants and reused in the CLI help text, PersonSamplerParams field description, and locale validation error message.	2026-03-31 17:28:03 -04:00
Johnny Greco	164db0aeb4	refactor: simplify agent CLI to context, types, and state (#418 ) (#420 ) * refactor: simplify agent CLI to context, types, and state subcommands - Remove schema and builder subcommands and all supporting code - Add description column (docstring first paragraph) to types table - Add config_file per family (relative to data_designer package) - Add config_package_path and library_version to context output - Clean section hierarchy: ## for sections, ### for family sub-tables - Add docstrings to ScalarInequalityConstraint and ColumnInequalityConstraint * cleanup: remove dead code and fix redundant type discovery - Remove unused get_import_path (only used by deleted schema/builder) - Remove unused class_name from catalog dicts - Fix N+1: get_family_source_file uses get_args directly instead of rediscovering all types via discover_family_types * docs: update DropColumnsProcessorConfig docstring to prefer drop=True * fix: address Greptile review feedback - Add parameters:/params: to _SECTION_HEADERS for docstring parsing - Fix config_package_path to return parent of data_designer package so Path(base) / relative_file resolves correctly - Use last occurrence of data_designer in _get_source_file to handle nested paths (e.g. 
dev checkouts) - Return list of deduplicated files per family (get_family_source_files) instead of assuming all types live in one file - Add config_builder_file to context output * fix: resolve config_builder_file dynamically and fix fragile test - Use _get_source_file(DataDesignerConfigBuilder) instead of hardcoded string for config_builder_file, consistent with family file resolution - Fix test assertion that assumed "config" in path (only true in dev) * fix: return empty string for unresolvable source paths - _get_source_file returns "" instead of absolute path when data_designer is not in the path, consistent with error branch - Add Config Module section to context output pointing agent to the config module as the only part of the codebase to work with - Rename config_package_path to config_module_path (returns config dir) * refactor: remove ConfigBase.schema_text() and supporting helpers Schema rendering is no longer needed in the config layer — the agent CLI now provides file paths so agents can read source files directly. * Improve agent context output and processor discoverability - Redeclare `name: str` in DropColumnsProcessorConfig and SchemaTransformProcessorConfig so agents see the required field without reading the base class - Add base config file path to agent context output - Optimize agent context formatting: strip redundant path prefixes, remove family count summary, separate usable/unusable model aliases, rename sections for clarity * fix: restore emoji literal in get_column_emoji * fix: revert unnecessary name redeclarations and use posix paths - Remove bare name: str redeclarations in processor configs that silently dropped the parent Field(description=...) 
- Use Path.as_posix() in _get_source_file for consistent forward slashes * docs: standardize config docstrings with (required) markers and Inherited Attributes - Add (required) to all required parameters in Attributes sections - Add Inherited Attributes section to all config subclasses listing fields from parent classes (SingleColumnConfig, ProcessorConfig, Constraint) - Fix stale with_trace descriptions in LLM subclass inherited sections - Remove discriminator fields from Attributes sections - Remove redundant name: str redeclaration from ExpressionColumnConfig * fix: address Greptile feedback on model aliases and test paths - Show per-alias reason for unusable models instead of blanket "missing API keys" label - Surface model_config_present: tell agent when no config file exists - Fix test fixtures to use realistic data_designer/config/ paths that exercise _strip_config_prefix * test: add coverage for model_config_present=false branch * docs: put required attributes first in Inherited Attributes docstrings Move `name (required)` to the top of the Inherited Attributes section in LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig so required fields appear before optional ones. * fix: improve agent CLI output for clarity and agent comprehension - Use {config_root}/file.py path syntax across all agent output - Add config_root preamble to standalone `agent types` output - Replace type_name (discriminator) with type (class name) in tables - Show only usable model aliases; warn agent to surface config issues - Add directive scoping agents to the config module only - Reword import hint and config module description for directness * fix: fall back to absolute path for plugin source files _get_source_file() returned "" for types outside the data_designer package (e.g., plugin configs). Now returns the absolute path so the agent still gets a readable file reference. 
* fix: remove unreachable model_config_present branch from formatter main() calls ensure_cli_default_model_settings() before any agent command, so model config is always seeded. The model_config_present=False branch was dead code. * test: add coverage for no-usable-model-aliases warning Covers the remaining branch in _format_model_aliases_context where all aliases are unusable and the agent gets a warning to surface to the user. * fix: add inherited attributes to section headers and use posix paths Address two Greptile review comments: - Add "inherited attributes:" to _SECTION_HEADERS so docstring parsing stops before that section even without a preceding blank line. - Use .as_posix() in get_config_module_path() for consistent forward-slash paths across platforms.	2026-03-17 09:30:06 -07:00
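The source-path rules this commit converges on (use the last occurrence of `data_designer` in the path, emit forward slashes, and fall back to the absolute path for files outside the package) might look something like the sketch below. The function name `package_relative_path` is hypothetical and this is not the repository's `_get_source_file`. It uses `PurePosixPath` so the example behaves the same on every platform, whereas the commit mentions `Path.as_posix()`.

```python
from pathlib import PurePosixPath

PACKAGE_NAME = "data_designer"


def package_relative_path(source_path: str, package: str = PACKAGE_NAME) -> str:
    """Return the path from the innermost `package` directory onward (hypothetical helper)."""
    path = PurePosixPath(source_path)
    parts = path.parts
    # Collect every position where the package directory name appears.
    matches = [i for i, part in enumerate(parts) if part == package]
    if not matches:
        # Outside the package (e.g. a plugin config): keep the absolute path.
        return path.as_posix()
    # Use the LAST occurrence so nested dev checkouts resolve to the real package.
    return PurePosixPath(*parts[matches[-1]:]).as_posix()
```

Taking the last match matters for checkouts where the repository directory and the package share a name, which is the nested-path case the commit calls out.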
Johnny Greco	4c19dba74b	feat: agent CLI introspection (simplified) (#415 ) * feat: add agent introspection cli * refactor: remove agent cli schema version * refactor: omit missing builder docstrings from context * refactor: tighten agent cli contract * feat: add schema_text() to ConfigBase for human-readable field summaries ConfigBase.schema_text() returns a concise text representation including the class docstring summary, field names, types, defaults, and descriptions. Field descriptions added to column config types to surface through this method. * refactor: flatten agent CLI into plain functions with text output mode Delete AgentController class and agent_command_defs module. Move all logic into agent_introspection (data) and agent_text_formatter (display) as plain functions. Add --json flag so commands default to human-readable text using schema_text(), with JSON as opt-in. Unify _emit helper, remove include_docstrings parameter, deduplicate catalog calls, and fix N+1 discover_family_types in get_family_schemas. * fix: port stale controller tests and consolidate command descriptions Port test_agent_controller.py to use plain functions instead of deleted AgentController. Extract AGENT_COMMANDS constant as single source for operation descriptions, syncing with main.py help strings. * style: fix ruff formatting in agent_introspection * refactor: centralize agent command definitions Extract AGENT_COMMANDS into agent_command_defs.py so main.py and agent_introspection.py share a single source for command names, help text, and metadata. The new module has no heavy dependencies, keeping --help latency unaffected. * fix: handle default_factory and empty providers in schema_text and introspection - schema_text() now detects default_factory fields and renders e.g. 
"list()" instead of leaking PydanticUndefined - Guard against IndexError when provider registry has an empty providers list - Add 15 edge-case tests for schema_text covering default_factory, enum defaults, None defaults, scalar defaults, descriptions, and docstrings * refactor: remove JSON output mode from agent CLI commands Text-only output simplifies the interface. Structured output can be added back trivially since the functions already return dicts. * docs: update schema_text docstring to reflect agent focus * fix: include builder section and import_path in agent text output - format_context_text now renders a ## Builder section - format_types_text now includes import_path column in tables * refactor: drop import_path from types tables All config objects are imported via dd.<ClassName>, so the full import path is redundant noise in agent output. * docs: add family definition and import hint to context output * refactor: rename Types section to Families, drop redundant "types" from sub-headers * fix: coerce None to empty string in table cells row.get(col, '') returns None when the key exists with value None, causing str(None) to render "None" in the output. Use `or ''` instead. * refactor: move agent controller tests to utils as introspection integration tests There is no controller layer — these tests exercise functions in agent_introspection.py, so they belong in tests/cli/utils/. * fix: only coerce None to empty string in table cells, not False The previous `or ''` pattern treated all falsy values (including False) as empty. Use an explicit None check so booleans render correctly. 
* style: address review nits from nabin - Add explicit parentheses to and/or precedence in _build_agent_lazy_group - Rename loop variable l to line in test_schema_text - Move get_family_schema import to module level in test_agent_text_formatter * fix: improve schema_text Literal display, builder signature quotes, and docstring parsing - _format_annotation now renders Literal['value'] instead of bare Literal - _format_signature strips quotes from stringified annotations caused by `from __future__ import annotations` - _get_docstring_summary stops at any Google-style section header, not just Attributes:	2026-03-13 18:26:00 -04:00
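The two table-cell fixes in this commit come down to how `dict.get` and `or` treat `None` and `False`. A small sketch of the three variants, with hypothetical function names:

```python
def cell_with_get_default(row: dict, col: str) -> str:
    # First bug: the default only applies when the key is missing,
    # so an existing key holding None renders as the string "None".
    return str(row.get(col, ""))


def cell_with_or(row: dict, col: str) -> str:
    # Second bug: `or` treats every falsy value as empty, so False and 0 vanish.
    return str(row.get(col) or "")


def cell_with_none_check(row: dict, col: str) -> str:
    # Final form described in the commit: only None becomes an empty cell.
    value = row.get(col)
    return "" if value is None else str(value)
```

With `{"required": False, "default": None}`, only the explicit `None` check renders `"False"` for the first column and an empty string for the second.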
Johnny Greco	b94b88b7a4	feat(cli): bootstrap default configs on CLI startup (#401 ) * feat(cli): bootstrap default configs on command run * fix(cli): use active interpreter in bootstrap warning * refactor(cli): simplify bootstrap warning flow * refactor(cli): bootstrap defaults in main entrypoint * refactor(cli): keep bootstrap ownership in main * test(cli): cover lazy dispatch and runtime failure flag * refactor(cli): remove redundant bootstrap state * test(cli): assert bootstrap warning includes error * test: address cli bootstrap review feedback	2026-03-12 15:42:41 -04:00
Nabin Mulepati	e4857f62fa	feat: add Streamable HTTP transport support for remote MCP providers (#358 ) * feat: add Streamable HTTP transport support for remote MCP providers (#357) Add `streamable_http` as a supported transport type for `MCPProvider`, enabling connections to MCP servers that use the Streamable HTTP protocol (e.g. Tavily remote endpoints). Previously only SSE transport was supported, causing silent 5-minute timeouts when connecting to incompatible endpoints. - Expand `MCPProvider.provider_type` to `Literal["sse", "streamable_http"]` (default remains `"sse"` for backwards compatibility) - Route `streamable_http` providers through `streamablehttp_client` from the MCP SDK in `MCPIOService._get_or_create_session()` - Handle variable-length context manager results from MCP transport clients - Add `DataDesigner.list_mcp_tool_names()` for discovering available tools - Update CLI form builder and controller to support the new transport option - Add tests for streamable_http config, session creation, and form builder Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * updates * simplify import * address greptile comments --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 08:11:54 -07:00
Johnny Greco	03b3d6c726	chore: address Andre's feedback on --save-results and CLI preview (#335 ) * fix: suppress stdout when saving report and sample records to file Console(record=True) still prints to stdout by default. Use file=io.StringIO() to redirect output so save-path calls only write to disk. * refactor: --save-results skips terminal display When --save-results is used, records and the analysis report are no longer printed to the terminal. Extracted save logic into a dedicated _save_preview_results method and updated option help text accordingly. * feat: wrap-around navigation in sample records browser Prev/next buttons and arrow keys now cycle back to the beginning/end instead of clamping at boundaries. * test: reuse record_series fixture in visualization tests * feat: thread --theme through to sample records pager The pager shell was hardcoded dark, so --theme light produced light records inside a dark frame. Extract CSS variables into dark/light constants and pass the theme from the controller. * fix: cap terminal display width at display_width The module-level Console() had no width limit, so tables with expand=True stretched to the full terminal width. Cap terminal output at min(terminal_width, display_width) and thread the display_width parameter through the controller's display methods. * docs: update --display-width and --theme help text Remove "Only applies when --save-results is used" from --display-width since it now also affects terminal output. * fix: update generation controller tests to match display_width and save_results behavior	2026-02-18 20:17:03 -05:00
Johnny Greco	1439bbea7e	chore: Improve CLI startup with lazy heavy import cleanup (#330 ) * perf: defer heavy imports to improve CLI startup time Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes: - Defer controller imports to inside command functions - Remove eager re-export chains from CLI package __init__ files - Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time - Add lazy __getattr__ exports in interface/__init__.py - Replace module-level tokenizer init with cached lazy getter - Fix ModelProvider import to use config layer instead of engine - Update test mock paths to match new import locations Reduces CLI import-time from ~1.67s to ~0.46s. * perf: defer pandas/numpy in io_helpers and add config_list benchmark - Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading. - Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time. - Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag. - Update test mock paths to match new import locations. 
* Refine lazy import usage and TYPE_CHECKING cleanup * Run license header updater on PR-touched files * fix: update sqlfluff mock target for lazy imports in test_sql * perf: cache globals() in lazy __getattr__ to avoid repeated lookups Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__. * perf: lazy CLI command loading and deferred heavy import evaluations - Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files - Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes - Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks - Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use - Update test mock targets to patch at usage-site for module-level imports * refactor: use direct pandas import in seed_source_dataframe Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity. * update lazy import pattern * update tests to use lazy import namespace Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern. * tighten import perf test thresholds Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner. * document pandas import requirement Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support. 
* increase timeout time * use lazy pandas imports in visualization tests - replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports - add TYPE_CHECKING pandas import and keep CLI controller imports sorted * fix lazy pandas runtime usage and preview mocks Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios.	2026-02-18 16:24:15 -05:00
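The lazy `__getattr__` pattern with caching mentioned in this commit can be illustrated with a module-level `__getattr__` (PEP 562). The sketch below builds a throwaway module object so it is self-contained; the factory name and alias mapping are hypothetical and are not the contents of `lazy_heavy_imports`. In a real module file the cache step would be `globals()[name] = value`.

```python
import importlib
import types


def make_lazy_module(name: str, lazy_imports: dict) -> types.ModuleType:
    """Build a module whose attributes are imported on first access (illustrative)."""
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        # Only called when normal attribute lookup on the module fails.
        if attr not in lazy_imports:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        value = importlib.import_module(lazy_imports[attr])
        # Cache in the module namespace so later lookups bypass __getattr__.
        module.__dict__[attr] = value
        return value

    module.__getattr__ = __getattr__
    return module


# Usage: nothing is imported until the attribute is first touched.
lazy = make_lazy_module("lazy_demo", {"json": "json", "dec": "decimal"})
```

The first `lazy.json` access pays the import cost; after that the name lives in the module's `__dict__` and resolves like any ordinary attribute.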
Johnny Greco	f2a1657870	feat: add --save-results option to preview command (#333) * feat: add --save-report option to preview command * feat: add save_path option to display_sample_record Allow saving rendered sample records as HTML or SVG files via an optional save_path parameter on both the standalone function and the WithRecordSamplerMixin method. * feat: replace --save-report with --save-results on preview command Replace the single-file --save-report option with --save-results, which saves all preview artifacts (dataset parquet, analysis report HTML, and per-record sample HTMLs) into a timestamped directory under the artifact path. Add error handling around the save block, improve timestamp precision to microseconds, and expand test coverage for the new behavior. * feat: add sample records pager with theme toggle, postMessage bridge, and UI polish * feat: add dataset metadata subtitle to pager and clean up toolbar layout * fix: address review findings for preview save-results feature - Split try/except in generation_controller so report display errors don't produce misleading "failed to save" messages when not saving - Add browser HTML path to save success output for discoverability - Remove 5 unused CSS variables from pager theme constants - Add "N of M" record counter to pager toolbar - Add theme/display_width assertions to all preview_command tests - Add dedicated test for custom theme and display_width passthrough - Add tests for record counter and CSS variable cleanup * fix: address code review findings and simplify pager - Fix critical bug: analysis report now displays to console even when --save-results is active (was silently dropped via pass statement) - Fix latent UnboundLocalError in display_sample_record when index is out of bounds (num_records computed before try block) - Eliminate duplicated dark CSS between constant and theme listener script - Simplify sample_records_pager: remove dual-theme system, postMessage bridge, and responsive media queries; restore GitHub link; reorder toolbar to put prev/next buttons on the far left - Narrow except Exception to except OSError in save-results path - Use case-insensitive extension check and lambda-based re.sub - Collapse redundant preview command delegation tests into parametrize - Add missing type annotations and remove tautological assertions * style: move record counter to far right of pager toolbar * refactor: remove dead theme-listener script and inline CSS constant _THEME_LISTENER_SCRIPT and _SAMPLE_RECORD_DARK_CSS_INLINE became orphaned after the pager simplification removed the postMessage bridge. This removes both constants, drops the injection line, switches the idempotency guard to the viewport meta tag, and cleans up related test assertions. * fix: move Path import out of TYPE_CHECKING block in test_visualization * fix: rename _logger to logger to match codebase convention * fix: remove unnecessary cast in preview command theme parameter * refactor: extract DEFAULT_DISPLAY_WIDTH constant and make apply_html_post_processing public * Update packages/data-designer-config/tests/config/utils/test_visualization.py --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>	2026-02-18 15:58:35 -05:00
Johnny Greco	1514720596	feat: support loading config files from HTTP(S) URLs (#323) * support loading config files from http urls - allow config builder and CLI loader to load YAML/JSON configs from HTTP(S) URLs - reject unsupported URL extensions and remote Python module URLs - update CLI help text and add tests for URL success/failure paths * harden remote config loading and deduplicate URL validation - Add size limit (10 MB) when fetching configs from URLs - Validate parsed YAML is a dict before returning - Make is_http_url public and reuse it in CLI validate_url - Replace local CONFIG_FILE_EXTENSIONS with shared constant - Add tests for is_http_url, URL-with-no-extension edge cases * use requests for remote config loading - replace urllib URL fetching with requests and status checks - parse remote payloads via smart_load_yaml for consistent validation - expand tests for HTTP errors, size limits, and non-dict payloads * lower remote config size limit to 1 MB * improve config URL HTTP error reporting Add granular 401/403/404 and generic HTTP status errors for remote config fetching to make failures actionable. Clarify that authenticated config URL loading is not currently supported and update tests for status-aware behavior. * rewrite github blob URLs for remote loading Handle GitHub blob links by rewriting them to raw content URLs for config and dataframe HTTP loaders, preserving query params but avoiding query token leaks in logs. This also fixes extension detection for URLs with query strings and adds coverage for rewrite behavior. * remove validate_url wrapper in favor of is_http_url The validate_url function in cli/utils was just a thin wrapper around is_http_url from io_helpers. Remove it and have callers use is_http_url directly for clarity and reduced indirection. * fix optional type for artifact_path CLI option * fix URL recursion in smart_load_yaml - avoid treating remote payload strings as new URL inputs - add regression test for URL string payloads from remote config * rewrite huggingface blob URLs for remote loading	2026-02-11 15:12:52 -05:00
Johnny Greco	d3c4de76da	feat: add preview, create, and validate CLI commands (#313) * feat: add preview, create, and validate CLI commands Add three new top-level CLI commands for the data-designer workflow: - `data-designer preview` - generate preview datasets for fast iteration - `data-designer create` - create full datasets and save to disk - `data-designer validate` - validate configuration files Also includes: - Move wait_for_navigation_key() UI primitive from preview.py to ui.py - Add KeyPressEvent type annotations to all key binding handlers in ui.py - Refactor cli/utils.py into cli/utils/ package with config_loader module - Comprehensive test coverage for all new commands * fix: update pythonjsonlogger import and clean up dev dependencies - Update pythonjsonlogger import to use newer JsonFormatter API - Consolidate dev-dependencies into [dependency-groups] dev section - Remove unnecessary test cli/utils __init__.py * small E * address greptile feedback * organize CLI commands into rich help panels Group top-level commands under "Generation" and "Setup" panels for clearer help output. * refactor config loader to parse files directly and auto-detect config format - Parse YAML/JSON files into dicts before passing to from_config, providing format-specific error messages for parse failures - Auto-detect DataDesignerConfig format (columns at top level) and wrap it into BuilderConfig so users can provide either format - Clean up Python module loading with try/except/finally for reliable sys.modules and sys.path cleanup - Add comprehensive tests for parsing, validation, and auto-wrapping * fix sys.path cleanup in config loader and simplify tests - Use pop(0) instead of remove() to precisely undo the insert(0, ...) and avoid accidentally removing a different matching path entry - Replace MagicMock with real DataDesignerConfigBuilder in tests * move config format auto-detection into from_config Centralize the shorthand DataDesignerConfig detection (columns at top level without a data_designer wrapper) in DataDesignerConfigBuilder.from_config so all callers benefit, not just the CLI config loader. Simplify config_loader to delegate file parsing and format normalization entirely to from_config. * extract GenerationController from CLI commands Move shared generation logic (preview, validate, create) out of the individual Typer command functions into a dedicated GenerationController, matching the existing controller pattern (DownloadController, etc.). The command functions now delegate to the controller, keeping them as thin entry points. Tests updated accordingly: command tests verify delegation while controller tests cover the full behavior. * harden sys.path cleanup and add explanatory comments Use sys.path.remove() instead of checking sys.path[0] so cleanup succeeds even when exec_module inserts entries at index 0. Drop unnecessary spec=DataDesignerConfigBuilder from test mocks. * check stdout TTY in preview interactive mode detection Previously only stdin was checked, so piping stdout (e.g. `dd preview cfg.yaml | head`) would still attempt interactive browsing. Now both stdin and stdout must be a TTY.	2026-02-11 14:06:06 -05:00
Eric W. Tramel	e6e58e692e	feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248)	2026-02-02 09:41:58 -05:00
Johnny Greco	c19f35639f	chore: add publish script and update license headers (#253)	2026-01-28 08:47:34 -05:00
Johnny Greco	ae0665fa16	refactor: slim package refactor into three subpackages (#240) * remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protection rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main	2026-01-27 13:53:20 -05:00

24 commits