mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
24 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
7a539c0e3d | fix: make display_tui the canonical run config flag | ||
|
|
a867a66d58 | fix: rename create progress flags to tui | ||
|
|
a24edaa06a | feat: add create progress override flags | ||
|
|
d14c9b3ccc
|
feat(cli): add plugin catalog core (#618)
* feat(cli): add plugin catalog services
Add typed catalog and tap models, persistent tap storage, cached catalog loading, compatibility evaluation, install plan generation, and runtime plugin discovery helpers.
Refs #617
* feat(cli): add plugins command group
Wire list, search, info, install, installed, and tap management commands through the existing command-controller CLI pattern.
Refs #617
* test(cli): cover plugin catalog workflows
Add regression coverage for tap caching, catalog compatibility, installer command generation, local path resolution, and Typer command delegation.
Refs #617
* fix(cli): align plugin taps with schema v2
Validate tap catalogs against the schema v2 contract used by NVIDIA-NeMo/DataDesignerPlugins#36, including source union fields, docs URLs, package paths, compatibility metadata, and unique runtime plugin names.
Derive Git install targets as package-qualified PEP 508 direct references so git tap entries install the package described by the catalog source metadata.
Refs #617
* fix(cli): address plugin review feedback
- Invalidate import caches before post-install entry point verification
- Make tap aliases case-insensitive and cache catalogs by alias plus URL
- Prefer compatible catalog entries before falling back to forced installs
- Clarify unused --tap behavior and list installed entry points without imports
- Add direct controller coverage and update CLI plugin documentation
Refs #617
* fix(cli): gate incompatible plugin installs
Fetch install targets before compatibility filtering so the controller owns the final --force decision and the incompatible install guard stays reachable.
Refs #617
* style(cli): format plugin catalog files
Apply ruff formatting to the plugin command and tap repository tests so CI format checks pass on the PR merge commit.
Refs #617
* fix(cli): reject duplicate plugin entry names
Key catalog duplicate detection by entry_point.name so distinct catalog entries cannot register the same runtime plugin name.
Refs #617
* fix(cli): preserve GitHub tree tap paths
* fix(cli): verify plugin entry point names
* align plugin CLI with catalog schema
- adopt catalog terminology for plugin source aliases
- parse package-first plugin catalog metadata from the plugin repo
- install package requirements with optional catalog indexes
* tidy plugin catalog workflow docs
* align plugin catalog CLI with package contract
* add plugin package uninstall workflow
* test plugin package command targets
* document plugin package aliases
* address plugin catalog review feedback
* prefer runtime plugin lookup matches
* rename plugins command to plugin
* show plugin package descriptions
* rename plugin catalogs command
* add protected plugin package installs
* document plugin package install modes
* avoid building project during plugin installs
* harden plugin package installs
* tighten plugin catalog contracts
* fix no-args help exit code
* make plugin docs links robust
* document plugin CLI catalog workflows
* clarify plugin entry point verification
* simplify plugin CLI docs
* narrow plugin search fields
* hide plugin catalog cache ttl
* remove plugin catalog trust flag
* improve plugin CLI recovery UX
* polish plugin catalog table display
* stabilize plugin catalog table test
* tighten plugin catalog edge cases
* harden plugin catalog verification
- Escape catalog-provided Rich markup before rendering CLI output
- Reject runtime plugin names that collide after enum-key normalization
- Load installed runtime entry points in a subprocess before reporting success
* simplify plugin entry point verification
Load matching entry points directly after install instead of spawning a separate Python process. This keeps the check package-scoped while still catching broken entry-point targets and non-Plugin objects.
* require newer uv for plugin plans
Use uv >= 0.10.0 as the single supported uv requirement for plugin package commands. Auto mode now falls back to a pip plan with an upgrade warning when uv is unavailable or too old, while explicit uv selection remains strict.
* verify pip fallback availability
* polish plugin CLI status markers
* clarify plugin compatibility labels
* simplify plugin info install details
* address plugin CLI review nits
* support versioned plugin package installs
* share plugin install metadata rendering
* show installed plugin packages
* harden versioned plugin installs
- Preserve catalog requirement constraints for versioned installs
- Remove stale install-plan metadata fields
- Expand parser, uv, controller, and local-catalog dry-run coverage
* harden plugin help tests
* show plugin package versions
Add package version metadata support for plugin catalogs and resolve current versions from exact requirements or simple indexes when catalog entries omit them. Update plugin list/info/install metadata to show the plugin package version and Data Designer compatibility requirement while removing the separate Data Designer version line.
* format plugin catalog tests
* harden plugin package metadata checks
* harden plugin CLI test coverage
* add plugin discovery docs (#642)
Signed-off-by: Johnny Greco <jogreco@nvidia.com>
---------
Signed-off-by: Johnny Greco <jogreco@nvidia.com> |
||
|
|
810c681f7a
|
feat: resume interrupted dataset generation runs (sync + async engine) (#526)
* docs: add implementation plan for resume mechanism
Fixes #525
* feat(storage): add resume flag and clear_partial_results()
- ArtifactStorage gains a `resume: bool = False` field
- resolved_dataset_name skips timestamp logic when resume=True,
returning the existing dataset folder name as-is
- Raises ArtifactStorageError on resume=True when the target folder
is absent or empty (no data to resume from)
- New clear_partial_results() removes in-flight partial results
left over from an interrupted run
Fixes #525
* feat(batch-manager): add start_batch param to start()
DatasetBatchManager.start() now accepts:
- start_batch: int = 0 — first batch index to process
- initial_actual_num_records: int = 0 — records already on disk
Both default to 0 so all existing call sites are unaffected.
Fixes #525
* feat(builder): implement resume logic in DatasetBuilder
- build() gains a resume: bool = False parameter
- _load_resume_state() reads metadata.json and validates that
num_records and buffer_size match the original run
- _build_with_resume() skips completed batches, clears in-flight
partial results, and continues from the first incomplete batch
- Raises DatasetGenerationError with clear messages for:
- missing metadata.json (interrupted before first batch completes)
- num_records mismatch
- buffer_size mismatch
- DATA_DESIGNER_ASYNC_ENGINE=1 (not yet supported)
- Logs a warning and returns early when dataset is already complete
Fixes #525
* feat(interface): expose resume on DataDesigner.create()
- create() gains resume: bool = False
- _create_resource_provider() passes resume to ArtifactStorage
- builder.build() receives the resume flag
Fixes #525
* test: add tests for resume mechanism
Covers:
- ArtifactStorage.resolved_dataset_name with resume=True
- ArtifactStorage.clear_partial_results()
- DatasetBatchManager.start() with start_batch and
initial_actual_num_records
- DatasetBuilder.build(resume=True): missing metadata, num_records
mismatch, buffer_size mismatch, already-complete detection
Fixes #525
* feat(builder): extend resume to async engine (DATA_DESIGNER_ASYNC_ENGINE=1)
- Add _find_completed_row_group_ids() to scan parquet-files/ for already-written
row groups by parsing batch_*.parquet filenames
- _build_async() now accepts resume=True: loads metadata, finds completed row groups,
clears partial results, and logs progress; returns early if all row groups are done
- _prepare_async_run() accepts skip_row_groups, initial_actual_num_records, and
initial_total_num_batches so the scheduler only processes remaining row groups
and RowGroupBufferManager starts from the correct counts
- RowGroupBufferManager.__init__ gains initial_actual_num_records and
initial_total_num_batches params to seed the counters on resume
- finalize_row_group closure now writes incremental metadata after each checkpoint
so any run (resume or not) can be resumed if interrupted mid-way
- Remove the guard that rejected resume=True with DATA_DESIGNER_ASYNC_ENGINE=1
- Add tests for all new paths
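The filename scan described above can be sketched roughly as follows. This is a minimal illustration and not the project's `_find_completed_row_group_ids()`: the function name, the regular expression, and the assumption that the number in `batch_<n>.parquet` is the row group id are all hypothetical.

```python
import re
from pathlib import Path

# Assumed filename shape: batch_<digits>.parquet (zero-padding, if any, is a guess).
_BATCH_FILE_RE = re.compile(r"^batch_(\d+)\.parquet$")


def find_completed_row_group_ids(parquet_dir: Path) -> set[int]:
    """Return the row group ids that already have a final parquet file on disk."""
    if not parquet_dir.is_dir():
        # Nothing was ever written, so nothing is complete.
        return set()
    completed: set[int] = set()
    for path in parquet_dir.glob("batch_*.parquet"):
        match = _BATCH_FILE_RE.match(path.name)
        if match:  # ignore files that only look similar, e.g. batch_x.parquet
            completed.add(int(match.group(1)))
    return completed
```

A scheduler could then skip every id in the returned set and process only the remaining row groups.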
* fix(builder): skip after-generation processors when resume finds dataset already complete
_build_with_resume and _build_async now return False when the dataset is already
complete (early-return path), True otherwise. build() skips
_processor_runner.run_after_generation() on False, preventing processors from
calling shutil.rmtree and rewriting an already-finalized dataset.
Fixes the issue raised in review: greptile P1 comment on PR #526.
* fix(builder): use filesystem count for initial_total_num_batches on async resume
Metadata can lag by one row group if a crash occurs between
move_partial_result_to_final_file_path and write_metadata. Using
len(completed_ids) from the filesystem scan instead of
state.num_completed_batches ensures the final metadata reflects the
actual number of parquet files present, not the potentially stale
metadata count.
* feat(results): add export() method and --output-format CLI flag
Adds DatasetCreationResults.export(path, format=) supporting jsonl,
csv, and parquet. The CLI create command gains --output-format / -f
which writes dataset.<format> alongside the parquet batch files.
* fix(builder): handle resume when metadata.json missing (interrupted before first batch)
When a run is interrupted before any row group or batch completes, metadata.json
is never written. Previously resume=True would raise DatasetGenerationError in
this case. Now build() detects the missing file, logs an info message, clears
any leftover partial results and falls back to a clean fresh run.
This is the common scenario for small datasets (fewer records than buffer_size)
where all records fit in a single row group.
* docs(interface): fix resume docstring — async engine is supported
* fix(builder): derive initial_actual_num_records from filesystem in async resume
In the crash window (row group written to disk but write_metadata crashed before
updating the file), both initial_total_num_batches and initial_actual_num_records
now use the filesystem-discovered completed_ids as source of truth. Previously
initial_actual_num_records was read from potentially stale metadata, causing
actual_num_records in the final metadata to be undercounted by one row group.
Also adds a test covering the partial-resume crash-window scenario.
* feat(resume): replace resume: bool with ResumeMode enum (NEVER/ALWAYS/IF_POSSIBLE)
- Introduces ResumeMode(StrEnum) in artifact_storage.py for use across all layers
- Replaces resume: bool with resume: ResumeMode in DatasetBuilder.build(),
DataDesigner.create(), ArtifactStorage, and _build_async()
- Adds _check_resume_config_compatibility() using config fingerprints to support
IF_POSSIBLE: falls back to a fresh run when config has changed since last run
- Relaxes num_records validation from strict equality to num_records >= actual_num_records,
allowing dataset extension on resume; buffer_size must still match exactly
- Preserves exception chain with 'from exc' on FileNotFoundError in _load_resume_state
- Exports ResumeMode from data_designer.interface for users to import
- Adds skip_row_groups assertion test and IF_POSSIBLE storage behavior tests
* fix(resume): invalidate resolved_dataset_name cache when IF_POSSIBLE downgrades to NEVER
ArtifactStorage's Pydantic model validator accesses base_dataset_path at
construction time, caching resolved_dataset_name under IF_POSSIBLE semantics
before build() can set resume=NEVER. Pop the stale cache entry so the property
re-resolves with the correct NEVER semantics (timestamped directory).
Also fixes _check_resume_config_compatibility() to use artifact_path/dataset_name
directly instead of base_dataset_path, and adds a regression test covering the
cache-bypass scenario.
* fix(builder): move partial-completion warning before return in _build_async
* fix(builder): IF_POSSIBLE now starts fresh when no dataset directory exists
_check_resume_config_compatibility returned True when config_path was absent,
even when the dataset directory itself didn't exist. This caused IF_POSSIBLE to
upgrade to ALWAYS, which then raised ArtifactStorageError on the first-ever run
because ALWAYS requires an existing directory.
Fix: return False early when the dataset directory is absent. Also sets
actual_num_records on mock buffer managers in two async resume tests that
started failing after the partial-completion warning block was made reachable.
* fix(builder): use original target_num_records in async resume record count
When extending a non-aligned run (e.g. original num_records=5, buffer_size=2),
the last completed row group has 1 record, not buffer_size=2. Using new num_records
in the formula would overcount: min(2, 7-2*2)=2 instead of min(2, 5-2*2)=1.
Fix: capture state from _load_resume_state (previously discarded) and pass
state.target_num_records into the sum formula. Added target_num_records field to
_ResumeState, populated from metadata.json.
Test: test_build_async_resume_initial_actual_num_records_uses_original_target
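The arithmetic in this fix can be checked with a small sketch. The function name is hypothetical; only the `min(buffer_size, target - id * buffer_size)` term comes from the commit message above.

```python
def records_in_completed_row_groups(
    completed_ids: set[int], buffer_size: int, target_num_records: int
) -> int:
    """Sum the record counts of completed row groups.

    Every row group holds buffer_size records except possibly the last one of
    the ORIGINAL run, which holds whatever was left of the original target.
    """
    return sum(
        min(buffer_size, target_num_records - row_group_id * buffer_size)
        for row_group_id in completed_ids
    )


# Original run: num_records=5, buffer_size=2, so row groups hold 2, 2, and 1 records.
original = records_in_completed_row_groups({0, 1, 2}, buffer_size=2, target_num_records=5)

# Using the extended num_records=7 instead would count the last group as 2 records.
overcounted = records_in_completed_row_groups({0, 1, 2}, buffer_size=2, target_num_records=7)
```

Here `original` is 5 and `overcounted` is 6, which is the one-record overcount the fix removes.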
* fix(builder): IF_POSSIBLE starts fresh on empty dataset directory
Empty directory (crash between mkdir and first file write) was treated as
compatible — _check_resume_config_compatibility returned True, IF_POSSIBLE
upgraded to ALWAYS, which then raised ArtifactStorageError.
Fix: treat empty directory the same as missing — return False from
_check_resume_config_compatibility when any(dir.iterdir()) is False.
Test: test_if_possible_starts_fresh_when_directory_is_empty
* fix(builder): ALWAYS raises DatasetGenerationError on config fingerprint mismatch
ResumeMode.ALWAYS was documented to raise when column/model config changed, but
_check_resume_config_compatibility() was only called in the IF_POSSIBLE branch.
A user resuming with ALWAYS after changing the config would silently mix records
from two different configs.
Fix:
- Refactor _check_resume_config_compatibility() to return _ConfigCompatibility
enum (COMPATIBLE / INCOMPATIBLE / NO_PRIOR_DATASET) instead of bool so callers
can distinguish 'no prior run' from 'configs differ'
- Call the check for both ALWAYS and IF_POSSIBLE before _write_builder_config()
- ALWAYS + INCOMPATIBLE → DatasetGenerationError
- IF_POSSIBLE + INCOMPATIBLE → silent fresh start (existing behaviour)
- IF_POSSIBLE + NO_PRIOR_DATASET → silent fresh start (existing behaviour)
Test: test_build_resume_always_raises_on_config_mismatch
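The mode and compatibility combinations listed above can be summarized as a small decision function. This is a sketch under assumptions: the enum values, the function name, and the local exception class are stand-ins, and the real `ResumeMode` is described in this log as a `StrEnum`.

```python
from enum import Enum


class ResumeMode(str, Enum):  # stand-in for the project's StrEnum; values are assumed
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"


class ConfigCompatibility(Enum):  # stand-in for the private _ConfigCompatibility enum
    COMPATIBLE = "compatible"
    INCOMPATIBLE = "incompatible"
    NO_PRIOR_DATASET = "no_prior_dataset"


class DatasetGenerationError(Exception):
    """Local stand-in for the project's error type."""


def effective_resume_mode(requested: ResumeMode, compat: ConfigCompatibility) -> ResumeMode:
    """Collapse the requested mode to NEVER (fresh run) or ALWAYS (resume)."""
    if requested is ResumeMode.NEVER:
        return ResumeMode.NEVER
    if requested is ResumeMode.ALWAYS:
        if compat is ConfigCompatibility.INCOMPATIBLE:
            raise DatasetGenerationError("config changed since the last run; refusing to resume")
        # With no prior dataset, the storage layer is described as raising
        # ArtifactStorageError later; that check is not modeled here.
        return ResumeMode.ALWAYS
    # IF_POSSIBLE: resume only when a compatible prior dataset exists.
    if compat is ConfigCompatibility.COMPATIBLE:
        return ResumeMode.ALWAYS
    return ResumeMode.NEVER
```

Returning a three-valued enum from the compatibility check, rather than a bool, is what lets `ALWAYS` raise on a mismatch while `IF_POSSIBLE` quietly starts fresh.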
* fix(resume): address nabinchha review — drop export collision, add CLI flag, fix edge cases
C1: drop commit
|
||
|
|
417b0c715d
|
feat(cli): show version update notice (#602) | ||
|
|
0afe287a5f
|
feat(results): add export() method and --output-format CLI flag (#540)
* feat(results): add export() method and --output-format CLI flag
Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet. The CLI create command gains --output-format / -f which writes dataset.<format> alongside the parquet batch files.
* fix(cli): validate output_format before dataset generation
* fix(cli): remove top-level results import from create.py to preserve lazy loading
* fix(results): address andreatgretel review — error types, UX ordering, import hygiene
- Derive SUPPORTED_EXPORT_FORMATS from get_args(ExportFormat) so the two can't drift apart
- Replace ValueError with InvalidFileFormatError in export() — consistent with project error conventions
- Add date_format="iso" to to_json() for consistent datetime serialization across formats
- Add click.Choice(SUPPORTED_EXPORT_FORMATS) to --output-format CLI option for parse-time validation, better --help output, and tab completion
- Fix double load_dataset() in run_create: inline len() so the DataFrame ref dies before export
- Move success message after the export block to avoid "Dataset created" followed by "Export failed"
- Move imports to module level in test_results.py (json, Path, lazy already imported)
- Add controller-level tests for output_format happy path, bad format rejection, and export failure
* fix(results): correct Raises docstring — ValueError -> InvalidFileFormatError
* feat(results): stream batch files in export() to avoid OOM on large datasets
- Rewrite export() to read batch parquet files one at a time instead of materialising the full dataset via load_dataset(); peak memory is now proportional to a single batch regardless of dataset size
- Infer output format from file extension by default; format= parameter kept as an explicit override (e.g. writing .txt as JSONL)
- _export_parquet unifies schemas across batches (pa.unify_schemas) to handle type drift (e.g. int64 vs float64 in the same column)
- Drop format= from the controller's export() call — path already carries the correct extension
- Rewrite export tests around real batch parquet files (stub_batch_dir fixture); add tests for multi-batch output, schema unification, unknown extension, empty batch directory, and explicit format override
* fix(results): address nabinchha review — memory safety, error wrapping, UX
- Replace load_dataset() with count_records() in CLI to avoid OOM on large datasets; add count_records() method using pq.read_metadata (reads file metadata only, no data pages loaded)
- Remove redundant format validation in controller — click.Choice in create.py already rejects invalid values at parse time; dead code removed along with corresponding test
- Wrap pa.unify_schemas / table.cast ArrowInvalid as InvalidFileFormatError to normalize third-party exceptions at module boundaries per AGENTS.md
- Lowercase file extension before format lookup so .JSONL/.CSV/.PARQUET are accepted without error
- Add clarifying comment to trailing-newline guard in _export_jsonl
- Add tests: count_records(), uppercase extension, incompatible schemas
* fix(results): fix parquet export schema unification and controller path bug
- Use promote_options="permissive" in pa.unify_schemas so minor numeric type drift (int64 vs float64) is handled by promotion instead of raising
- Also catch ArrowTypeError from unify_schemas and ValueError from table.cast() — the actual exception types thrown by pyarrow for these cases (ArrowInvalid alone is not sufficient)
- Wrap base_dataset_path in Path() in generation_controller.run_create to guard against callers that return a str (mock returns str, Path does not support / with str operands)
- Update test_export_parquet_incompatible_schemas_raises to match the new error source: with permissive unification, different-column-name batches fail at cast() not at unify_schemas(), so the match string changes from "Cannot unify batch schemas" to "Cannot cast batch"
* fix(results,cli): address nabinchha review round 2
- Use public pa.ArrowInvalid/ArrowTypeError instead of pa.lib.* in _export_parquet
- Drop dead trailing-newline guard in _export_jsonl; skip empty batches with `if content`
- Rename num_records→actual_record_count after count_records() call to avoid shadowing
- Unlink partial export file before re-raising on export failure in run_create
- Export filename now uses dataset_name (<dataset-name>.<format>) instead of literal "dataset"
- Update help text and tests to match new export filename convention
---------
Co-authored-by: Andre Manoel <165937436+andreatgretel@users.noreply.github.com> |
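The format-selection rules mentioned in this commit (supported formats derived from the type alias, extension inferred and lowercased, `format=` as an explicit override) might look roughly like this. The helper name and the local error class are hypothetical stand-ins; the project's actual `export()` is not reproduced here.

```python
from __future__ import annotations

from pathlib import Path
from typing import Literal, get_args

ExportFormat = Literal["jsonl", "csv", "parquet"]
# Derived from the alias so the two cannot drift apart.
SUPPORTED_EXPORT_FORMATS = get_args(ExportFormat)


class InvalidFileFormatError(Exception):
    """Local stand-in for the project's error type."""


def infer_export_format(path: str | Path, format: str | None = None) -> str:
    """Pick the export format: explicit override first, else the file extension."""
    chosen = format if format is not None else Path(path).suffix.lstrip(".").lower()
    if chosen not in SUPPORTED_EXPORT_FORMATS:
        raise InvalidFileFormatError(
            f"Unsupported export format {chosen!r}; expected one of {SUPPORTED_EXPORT_FORMATS}"
        )
    return chosen
```

With this sketch, `infer_export_format("out.JSONL")` resolves to `"jsonl"`, and `infer_export_format("notes.txt", format="jsonl")` uses the override.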
||
|
|
f73da1975c
|
feat(models): deprecate implicit default provider routing (#594)
* feat(models): deprecate implicit default provider routing
Emit DeprecationWarning whenever the legacy "implicit default
provider" path is exercised: `ModelConfig.provider=None`, the
registry-level `ModelProviderRegistry.default`, the YAML
`default:` key in `~/.data-designer/model_providers.yaml`, and
the CLI's "Change default provider" workflow.
`resolve_model_provider_registry` skips passing `default=` in the
single-provider case so the common construction path stays quiet.
Multi-provider registries still pass `default` (per
`check_implicit_default`) and warn accordingly.
Update docs, the package README, and test fixtures to specify
`provider=` explicitly on every `ModelConfig`. New tests cover
each warning entry point and pin the post-deprecation happy paths.
Refs #589
Made-with: Cursor
* fix(models): address PR #594 review feedback
Greptile P1: ProviderRepository.load emitted its DeprecationWarning
inside a `try/except Exception` block. Under
`filterwarnings("error", DeprecationWarning)` the warn would raise,
the except would swallow it, and `load()` would silently return None
(losing the registry). Move the warn outside the catch-all so the
strict-warning path no longer drops valid configs.
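The failure mode is easy to reproduce in isolation, because `DeprecationWarning` is a subclass of `Exception`. The two functions below are hypothetical stand-ins for the before and after shapes of `load()`, not the project's code.

```python
import warnings


def load_buggy():
    """Warn inside the catch-all: under an 'error' filter the warning is swallowed."""
    try:
        registry = {"default": "some-provider"}  # placeholder for the parsed YAML
        warnings.warn("YAML 'default:' is deprecated", DeprecationWarning, stacklevel=2)
        return registry
    except Exception:
        # A DeprecationWarning raised by the filter lands here too.
        return None


def load_fixed():
    """Warn outside the catch-all: parse errors are handled, the warning propagates."""
    try:
        registry = {"default": "some-provider"}
    except Exception:
        return None
    warnings.warn("YAML 'default:' is deprecated", DeprecationWarning, stacklevel=2)
    return registry
```

Under `warnings.simplefilter("error", DeprecationWarning)`, `load_buggy()` returns `None` and loses the registry, while `load_fixed()` raises the warning to the caller as strict mode intends.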
Greptile P2 / johnnygreco: `_warn_on_implicit_provider` and
`_warn_on_explicit_default` use `stacklevel=2`, which lands inside
pydantic v2's validator dispatch rather than at the user's
`ModelConfig(...)` / `ModelProviderRegistry(...)` call. That broke
both attribution (the source line was unhelpful) and Python's
once-per-location dedup (every call collapsed to the same
pydantic-internal key, suppressing all but the first warning).
Introduce `data_designer.config.utils.warning_helpers.warn_at_caller`,
which walks past the helper, validator, and any pydantic frames to
find the user's call site and emits via `warnings.warn_explicit` with
the user frame's `__warningregistry__`. Keeps attribution accurate
and dedup keyed on the user's (filename, lineno).
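A frame-walking helper of the kind described above could be sketched as follows. This is an illustration and not the project's `warn_at_caller`: the function name, the parameter names, and the top-level-package matching rule are assumptions.

```python
import sys
import warnings


def warn_at_caller_sketch(message, category=DeprecationWarning, skip_roots=("pydantic",)):
    """Emit a warning attributed to the first frame outside the skipped packages."""
    frame = sys._getframe(1)  # start at whoever called this helper
    while frame.f_back is not None:
        module_name = frame.f_globals.get("__name__", "")
        # Assumed rule: skip frames whose top-level package is in skip_roots.
        if module_name.split(".", 1)[0] not in skip_roots:
            break
        frame = frame.f_back
    warnings.warn_explicit(
        message,
        category,
        filename=frame.f_code.co_filename,
        lineno=frame.f_lineno,
        module=frame.f_globals.get("__name__", "<unknown>"),
        # Using the target frame's registry keys dedup on that call site.
        registry=frame.f_globals.setdefault("__warningregistry__", {}),
    )
```

Unlike a fixed `stacklevel`, the walk does not depend on how many validator or helper frames sit between the warning site and the user's code.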
johnnygreco: align the `provider_repository.py` warning copy with the
sibling site in `default_model_settings.py` ("specify provider=
explicitly on each ModelConfig instead") so both YAML-default warning
sites give the same migration instruction. The previous wording
pointed users at "ModelConfig entries" inside `model_providers.yaml`,
where ModelConfig entries don't actually live.
johnnygreco: dedup the cascade in `DataDesigner.__init__`. With
`model_providers=None` and a YAML `default:`, the user previously saw
two DeprecationWarnings for the same root cause —
`get_default_provider_name()` warns about the YAML key, then
`resolve_model_provider_registry(...)` re-warns from
`_warn_on_explicit_default`. Suppress the registry-level duplicate in
the YAML-fallback branch via `warnings.catch_warnings()` so users see
exactly one warning per user action.
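The cascade suppression can be illustrated with a small sketch. All three functions are hypothetical stand-ins for the call chain named above; only the use of `warnings.catch_warnings()` around the second call is taken from the commit message.

```python
import warnings


def get_default_provider_name_sketch():
    """Stand-in for the YAML lookup: warns about the deprecated 'default:' key."""
    warnings.warn("YAML 'default:' key is deprecated", DeprecationWarning, stacklevel=2)
    return "some-provider"


def resolve_registry_sketch(default):
    """Stand-in for registry construction: warns about the registry-level default."""
    warnings.warn("registry-level default is deprecated", DeprecationWarning, stacklevel=2)
    return {"default": default}


def init_sketch():
    """YAML-fallback branch: one root cause, so only the first warning is shown."""
    default = get_default_provider_name_sketch()
    with warnings.catch_warnings():
        # Suppress only the duplicate raised while building the registry.
        warnings.simplefilter("ignore", DeprecationWarning)
        return resolve_registry_sketch(default)
```

The filter change is scoped to the `with` block, so warnings raised elsewhere in the process are unaffected.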
johnnygreco: tighten `_warn_on_explicit_default` to fire only when
`default is not None`. Passing `default=None` explicitly is
semantically equivalent to omitting it (caller is opting *out* of a
registry-level default), and shouldn't trigger the deprecation
nudge.
johnnygreco: add a `model_validate({...})` regression test for
`ModelConfig` so the deserialization path (legacy on-disk configs)
is pinned alongside the construction path.
Tests:
- Update `test_load_exists` and `test_save` to omit `default=` so the
roundtrip stops exercising the deprecated YAML-default path
unguarded (Greptile note).
- Wrap `test_resolve_model_provider_registry_with_explicit_default`,
`test_get_provider`, and
`test_init_user_supplied_providers_preserve_first_wins_over_yaml_default`
in `pytest.warns` so the suite stays green under
`-W error::DeprecationWarning` (Greptile note).
- Add `test_explicit_default_none_does_not_emit_deprecation_warning`
to pin the tightened predicate.
- Add `test_init_yaml_default_emits_single_deprecation_warning` to
pin the cascade-dedup behavior.
Refs #589
Made-with: Cursor
* fix(models): make deprecation warnings visible under default filters
andreatgretel (PR #594): the YAML-default warning in
`get_default_provider_name` and the registry-default warning emitted
from inside DataDesigner helpers were attributing to data_designer
library frames, not user code. Python's default filter chain includes
`ignore::DeprecationWarning`, so library-attributed entries are
silenced — meaning a normal `DataDesigner()` call with a YAML
`default:` set showed nothing, and `resolve_model_provider_registry`
warnings were similarly invisible. Two related changes:
1. `warn_at_caller`: extend the default skip-list from `("pydantic",)`
to `("pydantic", "pydantic_core", "data_designer")` so the walk
escapes both pydantic's validator-dispatch frames and data_designer
helper frames before attributing. Also tighten the prefix predicate
to exact-or-dotted-prefix matching (`name == p or
name.startswith(p + ".")`) so e.g. `pydantic_helpers` is not
falsely matched as part of `pydantic` (johnnygreco nit). Allow
callers to pass a custom `skip_prefixes` for flexibility. Drop the
"skip frame 0+1 unconditionally" guard now that prefix matching
covers it.
2. `get_default_provider_name`: switch from
`warnings.warn(stacklevel=2)` to `warn_at_caller`. The previous
stacklevel pointed into `default_model_settings.py`, which is a
library file → silenced under default filters. Verified the fix
empirically with `python -W default`: warning is now attributed to
the user's call site and rendered.
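The exact-or-dotted-prefix rule quoted in item 1 can be checked with a short sketch. The function name is hypothetical; the predicate body and the default skip-list mirror the text above.

```python
DEFAULT_SKIP_PREFIXES = ("pydantic", "pydantic_core", "data_designer")


def is_skipped_module(name: str, skip_prefixes=DEFAULT_SKIP_PREFIXES) -> bool:
    """True when name is one of the packages or a submodule of one of them."""
    return any(name == p or name.startswith(p + ".") for p in skip_prefixes)


# A plain startswith() would treat unrelated packages as part of pydantic.
naive_match = "pydantic_helpers".startswith("pydantic")
strict_match = is_skipped_module("pydantic_helpers")
```

Here `naive_match` is `True` and `strict_match` is `False`, which is the false positive the tightened predicate avoids.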
johnnygreco (PR #594): add the missing
`test_explicit_default_none_does_not_emit_deprecation_warning`
regression for the `self.default is not None` predicate landed in
the prior round.
Tests:
- New `test_warning_helpers.py` pins prefix-matching precision
(rejects `pydantic_helpers` / `data_designer_other`), default
skip-list contents, attribution past skip-prefix frames, and
per-call-site dedup behavior.
- `test_get_default_provider_name_warning_attributes_to_user_frame`
pins andreatgretel's repro for the YAML-default site.
- `test_explicit_default_warning_attributes_to_user_frame` pins the
multi-frame case: construction goes through
`resolve_model_provider_registry`, so the walk has to escape both
pydantic and data_designer before landing on the test file.
- `test_explicit_default_none_does_not_emit_deprecation_warning`
pins johnnygreco's predicate-tightening regression.
3,124 tests pass (540 config + 1,923 engine + 653 interface; +10 net
from this round).
Refs #589
Made-with: Cursor
* fix(models): apply warn_at_caller to remaining deprecation sites
greptile-apps (PR #594, r3189904028): `ProviderRepository.load`'s
YAML-default `DeprecationWarning` was using `warnings.warn(stacklevel=2)`,
which attributes to whichever data_designer frame called `load()` —
controllers, services, list/reset commands, agent introspection. Every
real call path lands on `data_designer.cli.*`, which falls under
Python's default `ignore::DeprecationWarning` filter and is silenced.
Audit found two more sites with the same problem:
- `DatasetBuilder._resolve_async_compatibility` (`allow_resize` /
issue #552) — was using `stacklevel=4` to walk past
`_resolve_async_compatibility -> build/build_preview -> interface ->
user`. Brittle: any added frame (decorator, async wrapping, the
`try/except DeprecationWarning: raise` boundary) shifts attribution
silently. The existing test passed only because it used
`simplefilter("always") + record=True`, which records warnings
regardless of attribution.
- `ProviderController._handle_change_default` — was using
`stacklevel=2`, which lands on the menu dispatcher in the same
controller module. `print_warning` already shows the message
visually, but programmatic observers (`pytest.warns`,
`filterwarnings("error", ...)`) saw a library-attributed entry that
default filters silenced.
All three migrated to `warn_at_caller` (the helper from
|
||
|
|
98715dcd86
|
chore(cli): Add --org option to NGC download command (#604)
|
||
|
|
fc0365cada
|
feat(cli): add data-designer --version (#599) | ||
|
|
a65903eb1a
|
chore: add ko_KR locale to nemotron personas datasets (#572)
* chore: add ko_KR locale to nemotron personas datasets
Register Korean (ko_KR, 2.66 GB) as an available managed persona dataset locale, update related CLI/repository tests, and document the new locale and its NGC download command.
* update person fields
* update fr_FR size
* docs: reconcile personas field tables with installed parquet schemas
Remove stale per-locale fields that no longer exist in any managed parquet (commune, departement, prefecture), drop district from the India-specific section since it's already listed in Core Fields, rename digital_skills → digital_skill to match the actual ja_JP column, and add sections for ko_KR, en_SG, and the en_US/en_SG shared ethnic_background. Corrects the religion-family membership to include en_SG.
* test: add missing fr_FR assertion in test_run_personas_with_all_flag
The test asserts all 9 locales were downloaded but only enumerates 8 in its per-locale checks; fr_FR has been missing since before the ko_KR addition. Align the enumeration with the count.
* docs: add ko_KR to locale parameter list |
||
|
|
0d10bf8dc6
|
feat: add fr_FR locale to nemotron personas datasets (#468)
* feat: add fr_FR locale to nemotron personas datasets
Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES and add 7 France-specific PII fields: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement.
* fix: update download controller and service tests for fr_FR locale
Update hardcoded locale counts from 7 to 8 and add fr_FR assertions in download controller and download service tests.
* fix: generate CLI locale help text dynamically from constants
The --locale help text was hardcoded and already stale (missing en_SG, pt_BR, fr_FR). Build it from LOCALES_WITH_MANAGED_DATASETS so it stays in sync automatically.
* refactor: add LOCALES_WITH_MANAGED_DATASETS_STR constant
Centralise the comma-joined locale list so it is defined once in constants and reused in the CLI help text, PersonSamplerParams field description, and locale validation error message. |
||
|
|
164db0aeb4
|
refactor: simplify agent CLI to context, types, and state (#418) (#420)
* refactor: simplify agent CLI to context, types, and state subcommands
- Remove schema and builder subcommands and all supporting code
- Add description column (docstring first paragraph) to types table
- Add config_file per family (relative to data_designer package)
- Add config_package_path and library_version to context output
- Clean section hierarchy: ## for sections, ### for family sub-tables
- Add docstrings to ScalarInequalityConstraint and ColumnInequalityConstraint
* cleanup: remove dead code and fix redundant type discovery
- Remove unused get_import_path (only used by deleted schema/builder)
- Remove unused class_name from catalog dicts
- Fix N+1: get_family_source_file uses get_args directly instead of
rediscovering all types via discover_family_types
* docs: update DropColumnsProcessorConfig docstring to prefer drop=True
* fix: address Greptile review feedback
- Add parameters:/params: to _SECTION_HEADERS for docstring parsing
- Fix config_package_path to return parent of data_designer package so
Path(base) / relative_file resolves correctly
- Use last occurrence of data_designer in _get_source_file to handle
nested paths (e.g. dev checkouts)
- Return list of deduplicated files per family (get_family_source_files)
instead of assuming all types live in one file
- Add config_builder_file to context output
* fix: resolve config_builder_file dynamically and fix fragile test
- Use _get_source_file(DataDesignerConfigBuilder) instead of hardcoded
string for config_builder_file, consistent with family file resolution
- Fix test assertion that assumed "config" in path (only true in dev)
* fix: return empty string for unresolvable source paths
- _get_source_file returns "" instead of absolute path when
data_designer is not in the path, consistent with error branch
- Add Config Module section to context output pointing agent to
the config module as the only part of the codebase to work with
- Rename config_package_path to config_module_path (returns config dir)
* refactor: remove ConfigBase.schema_text() and supporting helpers
Schema rendering is no longer needed in the config layer — the agent
CLI now provides file paths so agents can read source files directly.
* Improve agent context output and processor discoverability
- Redeclare `name: str` in DropColumnsProcessorConfig and
SchemaTransformProcessorConfig so agents see the required field
without reading the base class
- Add base config file path to agent context output
- Optimize agent context formatting: strip redundant path prefixes,
remove family count summary, separate usable/unusable model aliases,
rename sections for clarity
* fix: restore emoji literal in get_column_emoji
* fix: revert unnecessary name redeclarations and use posix paths
- Remove bare name: str redeclarations in processor configs that
silently dropped the parent Field(description=...)
- Use Path.as_posix() in _get_source_file for consistent forward slashes
* docs: standardize config docstrings with (required) markers and Inherited Attributes
- Add (required) to all required parameters in Attributes sections
- Add Inherited Attributes section to all config subclasses listing
fields from parent classes (SingleColumnConfig, ProcessorConfig, Constraint)
- Fix stale with_trace descriptions in LLM subclass inherited sections
- Remove discriminator fields from Attributes sections
- Remove redundant name: str redeclaration from ExpressionColumnConfig
* fix: address Greptile feedback on model aliases and test paths
- Show per-alias reason for unusable models instead of blanket
"missing API keys" label
- Surface model_config_present: tell agent when no config file exists
- Fix test fixtures to use realistic data_designer/config/ paths that
exercise _strip_config_prefix
* test: add coverage for model_config_present=false branch
* docs: put required attributes first in Inherited Attributes docstrings
Move `name (required)` to the top of the Inherited Attributes section
in LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig
so required fields appear before optional ones.
* fix: improve agent CLI output for clarity and agent comprehension
- Use {config_root}/file.py path syntax across all agent output
- Add config_root preamble to standalone `agent types` output
- Replace type_name (discriminator) with type (class name) in tables
- Show only usable model aliases; warn agent to surface config issues
- Add directive scoping agents to the config module only
- Reword import hint and config module description for directness
* fix: fall back to absolute path for plugin source files
_get_source_file() returned "" for types outside the data_designer
package (e.g., plugin configs). Now returns the absolute path so
the agent still gets a readable file reference.
* fix: remove unreachable model_config_present branch from formatter
main() calls ensure_cli_default_model_settings() before any agent
command, so model config is always seeded. The model_config_present=False
branch was dead code.
* test: add coverage for no-usable-model-aliases warning
Covers the remaining branch in _format_model_aliases_context where
all aliases are unusable and the agent gets a warning to surface to
the user.
* fix: add inherited attributes to section headers and use posix paths
Address two Greptile review comments:
- Add "inherited attributes:" to _SECTION_HEADERS so docstring parsing
stops before that section even without a preceding blank line.
- Use .as_posix() in get_config_module_path() for consistent
forward-slash paths across platforms.
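The path rules accumulated across these fixes (use the last occurrence of `data_designer`, emit forward slashes, fall back to the absolute path for plugin files) can be sketched as below. This is an illustration under assumptions, not the repository's `_get_source_file`: the function name `relative_source_file` and its signature are hypothetical.

```python
from pathlib import PurePath, PurePosixPath


def relative_source_file(path: PurePath, package: str = "data_designer") -> str:
    """Return the path starting at the LAST `package` component, posix-style.

    Files outside the package (e.g. plugin configs) fall back to the
    full path so the reader still gets a usable file reference.
    """
    parts = path.parts
    if package not in parts:
        return path.as_posix()
    # Index of the last occurrence, so nested dev checkouts such as
    # .../data_designer/src/data_designer/... resolve to the inner package.
    last = len(parts) - 1 - parts[::-1].index(package)
    return PurePosixPath(*parts[last:]).as_posix()


nested = PurePosixPath("/home/u/data_designer/src/data_designer/config/base.py")
plugin = PurePosixPath("/opt/plugins/my_plugin/config.py")
print(relative_source_file(nested))  # data_designer/config/base.py
print(relative_source_file(plugin))  # /opt/plugins/my_plugin/config.py
```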
|
||
|
|
4c19dba74b
|
feat: agent CLI introspection (simplified) (#415)
* feat: add agent introspection cli
* refactor: remove agent cli schema version
* refactor: omit missing builder docstrings from context
* refactor: tighten agent cli contract
* feat: add schema_text() to ConfigBase for human-readable field summaries
ConfigBase.schema_text() returns a concise text representation including the class docstring summary, field names, types, defaults, and descriptions. Field descriptions added to column config types to surface through this method.
* refactor: flatten agent CLI into plain functions with text output mode
Delete AgentController class and agent_command_defs module. Move all logic into agent_introspection (data) and agent_text_formatter (display) as plain functions. Add --json flag so commands default to human-readable text using schema_text(), with JSON as opt-in. Unify _emit helper, remove include_docstrings parameter, deduplicate catalog calls, and fix N+1 discover_family_types in get_family_schemas.
* fix: port stale controller tests and consolidate command descriptions
Port test_agent_controller.py to use plain functions instead of deleted AgentController. Extract AGENT_COMMANDS constant as single source for operation descriptions, syncing with main.py help strings.
* style: fix ruff formatting in agent_introspection
* refactor: centralize agent command definitions
Extract AGENT_COMMANDS into agent_command_defs.py so main.py and agent_introspection.py share a single source for command names, help text, and metadata. The new module has no heavy dependencies, keeping --help latency unaffected.
* fix: handle default_factory and empty providers in schema_text and introspection
- schema_text() now detects default_factory fields and renders e.g. "list()" instead of leaking PydanticUndefined
- Guard against IndexError when provider registry has an empty providers list
- Add 15 edge-case tests for schema_text covering default_factory, enum defaults, None defaults, scalar defaults, descriptions, and docstrings
* refactor: remove JSON output mode from agent CLI commands
Text-only output simplifies the interface. Structured output can be added back trivially since the functions already return dicts.
* docs: update schema_text docstring to reflect agent focus
* fix: include builder section and import_path in agent text output
- format_context_text now renders a ## Builder section
- format_types_text now includes import_path column in tables
* refactor: drop import_path from types tables
All config objects are imported via dd.<ClassName>, so the full import path is redundant noise in agent output.
* docs: add family definition and import hint to context output
* refactor: rename Types section to Families, drop redundant "types" from sub-headers
* fix: coerce None to empty string in table cells
row.get(col, '') returns None when the key exists with value None, causing str(None) to render "None" in the output. Use `or ''` instead.
* refactor: move agent controller tests to utils as introspection integration tests
There is no controller layer; these tests exercise functions in agent_introspection.py, so they belong in tests/cli/utils/.
* fix: only coerce None to empty string in table cells, not False
The previous `or ''` pattern treated all falsy values (including False) as empty. Use an explicit None check so booleans render correctly.
* style: address review nits from nabin
- Add explicit parentheses to and/or precedence in _build_agent_lazy_group
- Rename loop variable l to line in test_schema_text
- Move get_family_schema import to module level in test_agent_text_formatter
* fix: improve schema_text Literal display, builder signature quotes, and docstring parsing
- _format_annotation now renders Literal['value'] instead of bare Literal
- _format_signature strips quotes from stringified annotations caused by `from __future__ import annotations`
- _get_docstring_summary stops at any Google-style section header, not just Attributes: |
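The two table-cell fixes in this PR (first `or ''`, then an explicit None check) come down to the difference sketched here. The helper name `cell_text` is hypothetical; only the `row.get(col, '')` pitfall and the None-versus-False distinction come from the commit messages.

```python
def cell_text(row: dict, col: str) -> str:
    """Render one table cell, blanking only missing or None values."""
    value = row.get(col)  # a default of '' would not help when the key holds None
    # `value or ''` would also blank False, 0, and empty containers,
    # so compare against None explicitly.
    return "" if value is None else str(value)


row = {"name": "topic", "default": None, "required": False}
print([cell_text(row, c) for c in ("name", "default", "required", "missing")])
# ['topic', '', 'False', '']
```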
||
|
|
b94b88b7a4
|
feat(cli): bootstrap default configs on CLI startup (#401)
* feat(cli): bootstrap default configs on command run
* fix(cli): use active interpreter in bootstrap warning
* refactor(cli): simplify bootstrap warning flow
* refactor(cli): bootstrap defaults in main entrypoint
* refactor(cli): keep bootstrap ownership in main
* test(cli): cover lazy dispatch and runtime failure flag
* refactor(cli): remove redundant bootstrap state
* test(cli): assert bootstrap warning includes error
* test: address cli bootstrap review feedback |
||
|
|
e4857f62fa
|
feat: add Streamable HTTP transport support for remote MCP providers (#358)
* feat: add Streamable HTTP transport support for remote MCP providers (#357)
Add `streamable_http` as a supported transport type for `MCPProvider`, enabling connections to MCP servers that use the Streamable HTTP protocol (e.g. Tavily remote endpoints). Previously only SSE transport was supported, causing silent 5-minute timeouts when connecting to incompatible endpoints.
- Expand `MCPProvider.provider_type` to `Literal["sse", "streamable_http"]` (default remains `"sse"` for backwards compatibility)
- Route `streamable_http` providers through `streamablehttp_client` from the MCP SDK in `MCPIOService._get_or_create_session()`
- Handle variable-length context manager results from MCP transport clients
- Add `DataDesigner.list_mcp_tool_names()` for discovering available tools
- Update CLI form builder and controller to support the new transport option
- Add tests for streamable_http config, session creation, and form builder
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* updates
* simplify import
* address greptile comments
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
03b3d6c726
|
chore: address Andre's feedback on --save-results and CLI preview (#335)
* fix: suppress stdout when saving report and sample records to file
Console(record=True) still prints to stdout by default. Use file=io.StringIO() to redirect output so save-path calls only write to disk.
* refactor: --save-results skips terminal display
When --save-results is used, records and the analysis report are no longer printed to the terminal. Extracted save logic into a dedicated _save_preview_results method and updated option help text accordingly.
* feat: wrap-around navigation in sample records browser
Prev/next buttons and arrow keys now cycle back to the beginning/end instead of clamping at boundaries.
* test: reuse record_series fixture in visualization tests
* feat: thread --theme through to sample records pager
The pager shell was hardcoded dark, so --theme light produced light records inside a dark frame. Extract CSS variables into dark/light constants and pass the theme from the controller.
* fix: cap terminal display width at display_width
The module-level Console() had no width limit, so tables with expand=True stretched to the full terminal width. Cap terminal output at min(terminal_width, display_width) and thread the display_width parameter through the controller's display methods.
* docs: update --display-width and --theme help text
Remove "Only applies when --save-results is used" from --display-width since it now also affects terminal output.
* fix: update generation controller tests to match display_width and save_results behavior |
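Two of the behaviors above are small enough to sketch in isolation: wrap-around navigation and the `min(terminal_width, display_width)` cap. Both function names are hypothetical and the pager itself is not reproduced; this only illustrates the arithmetic, assuming a non-empty record list.

```python
import shutil
from typing import Optional


def step_index(current: int, delta: int, total: int) -> int:
    """Move through `total` records, cycling at both ends instead of clamping."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Python's % returns a non-negative result for a positive modulus,
    # so stepping back from index 0 lands on the last record.
    return (current + delta) % total


def effective_width(display_width: int, terminal_width: Optional[int] = None) -> int:
    """Cap output at the smaller of the terminal width and the requested width."""
    if terminal_width is None:
        terminal_width = shutil.get_terminal_size().columns
    return min(terminal_width, display_width)


print(step_index(4, +1, 5))       # 0
print(step_index(0, -1, 5))       # 4
print(effective_width(110, 200))  # 110
```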
||
|
|
1439bbea7e
|
chore: Improve CLI startup with lazy heavy import cleanup (#330)
* perf: defer heavy imports to improve CLI startup time
Move expensive imports (engine, models, controllers) out of the module-level import path so that data-designer --help and other non-generation commands no longer pay the full startup cost. Key changes:
- Defer controller imports to inside command functions
- Remove eager re-export chains from CLI package __init__ files
- Move default-settings bootstrap into load_config_builder() and DataDesigner.__init__() instead of running at import time
- Add lazy __getattr__ exports in interface/__init__.py
- Replace module-level tokenizer init with cached lazy getter
- Fix ModelProvider import to use config layer instead of engine
- Update test mock paths to match new import locations
Reduces CLI import-time from ~1.67s to ~0.46s.
* perf: defer pandas/numpy in io_helpers and add config_list benchmark
- Replace eager `from lazy_heavy_imports import pd, np` in io_helpers with module-level __getattr__ (for backwards-compatible external access / test mocks) and function-level imports in the 3 functions that actually use them (read_parquet_dataset, smart_load_dataframe, _convert_to_serializable). Importing io_helpers no longer triggers pandas/numpy loading.
- Defer heavy imports in list and reset CLI commands into function bodies to avoid loading repositories, Rich, and prompt_toolkit at module import time.
- Add `config_list` (data-designer config list) measurement to the CLI startup benchmark with isolated cold measurement in a separate venv and a --skip-config-list-check flag.
- Update test mock paths to match new import locations.
* Refine lazy import usage and TYPE_CHECKING cleanup
* Run license header updater on PR-touched files
* fix: update sqlfluff mock target for lazy imports in test_sql
* perf: cache globals() in lazy __getattr__ to avoid repeated lookups
Add globals() caching and explanatory comment to all three lazy __getattr__ implementations (lazy_heavy_imports, config/__init__, interface/__init__) so subsequent attribute accesses bypass __getattr__.
* perf: lazy CLI command loading and deferred heavy import evaluations
- Add LazyTyperGroup to defer command module loading until invocation, allowing module-level imports in all CLI command files
- Split DataFrameSeedSource into seed_source_dataframe.py to isolate pandas dependency from other seed source classes
- Move TypeVar/TypeAlias definitions (DataT, NumpyArray1dT, RadomStateT, EngineT) to TYPE_CHECKING blocks with runtime fallbacks
- Wrap module-level constants in lru_cache (phone_number parquet data, jsonschema validator) to defer I/O and heavy imports to first use
- Update test mock targets to patch at usage-site for module-level imports
* refactor: use direct pandas import in seed_source_dataframe
Drop lazy-loading for pandas in DataFrameSeedSource; use direct import for simplicity.
* update lazy import pattern
* update tests to use lazy import namespace
Switch test modules to import data_designer.lazy_heavy_imports as lazy and reference heavy libraries through that namespace. This keeps heavy imports deferred during module import and aligns tests with the new lazy-import usage pattern.
* tighten import perf test thresholds
Document recent baseline timings and lower the allowed average import time and timeout so regressions are detected sooner.
* document pandas import requirement
Clarify that Pydantic needs DataFrame resolved at module load and that keeping the direct import preserves IDE typing support.
* increase timeout time
* use lazy pandas imports in visualization tests
- replace direct pandas usage with lazy.pd in visualization tests to avoid eager imports
- add TYPE_CHECKING pandas import and keep CLI controller imports sorted
* fix lazy pandas runtime usage and preview mocks
Switch sample-record handling to lazy pandas types so runtime paths no longer depend on TYPE_CHECKING imports. Align preview controller tests to patch the module-local DataDesigner symbol, preventing real engine invocation in save results scenarios. |
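The lazy `__getattr__` with caching mentioned above follows the module-level `__getattr__` hook from PEP 562. The sketch below shows the pattern on a throwaway module so it stays self-contained; `install_lazy_imports` and the `demo_lazy` module are hypothetical names, and `json` stands in for a heavy dependency such as pandas.

```python
import importlib
import types


def install_lazy_imports(module: types.ModuleType, lazy_map: dict) -> None:
    """Attach a module-level __getattr__ that imports on first access.

    `lazy_map` maps the exported attribute name to the module to import.
    """

    def __getattr__(name: str):
        if name not in lazy_map:
            raise AttributeError(f"module {module.__name__!r} has no attribute {name!r}")
        value = importlib.import_module(lazy_map[name])
        # Cache in the module namespace: later lookups find the attribute
        # directly and never reach __getattr__ again.
        module.__dict__[name] = value
        return value

    module.__getattr__ = __getattr__


demo = types.ModuleType("demo_lazy")
install_lazy_imports(demo, {"js": "json"})

print("js" in demo.__dict__)  # False: nothing imported yet
print(demo.js.dumps([1, 2]))  # [1, 2]
print("js" in demo.__dict__)  # True: cached after first access
```

In a real package the same function body would live at module top level and write to `globals()`; the factory form is used here only so the example can run on its own.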
||
|
|
f2a1657870
|
feat: add --save-results option to preview command (#333)
* feat: add --save-report option to preview command
* feat: add save_path option to display_sample_record
Allow saving rendered sample records as HTML or SVG files via an optional save_path parameter on both the standalone function and the WithRecordSamplerMixin method.
* feat: replace --save-report with --save-results on preview command
Replace the single-file --save-report option with --save-results, which saves all preview artifacts (dataset parquet, analysis report HTML, and per-record sample HTMLs) into a timestamped directory under the artifact path. Add error handling around the save block, improve timestamp precision to microseconds, and expand test coverage for the new behavior.
* feat: add sample records pager with theme toggle, postMessage bridge, and UI polish
* feat: add dataset metadata subtitle to pager and clean up toolbar layout
* fix: address review findings for preview save-results feature
- Split try/except in generation_controller so report display errors don't produce misleading "failed to save" messages when not saving
- Add browser HTML path to save success output for discoverability
- Remove 5 unused CSS variables from pager theme constants
- Add "N of M" record counter to pager toolbar
- Add theme/display_width assertions to all preview_command tests
- Add dedicated test for custom theme and display_width passthrough
- Add tests for record counter and CSS variable cleanup
* fix: address code review findings and simplify pager
- Fix critical bug: analysis report now displays to console even when --save-results is active (was silently dropped via pass statement)
- Fix latent UnboundLocalError in display_sample_record when index is out of bounds (num_records computed before try block)
- Eliminate duplicated dark CSS between constant and theme listener script
- Simplify sample_records_pager: remove dual-theme system, postMessage bridge, and responsive media queries; restore GitHub link; reorder toolbar to put prev/next buttons on the far left
- Narrow except Exception to except OSError in save-results path
- Use case-insensitive extension check and lambda-based re.sub
- Collapse redundant preview command delegation tests into parametrize
- Add missing type annotations and remove tautological assertions
* style: move record counter to far right of pager toolbar
* refactor: remove dead theme-listener script and inline CSS constant
_THEME_LISTENER_SCRIPT and _SAMPLE_RECORD_DARK_CSS_INLINE became orphaned after the pager simplification removed the postMessage bridge. This removes both constants, drops the injection line, switches the idempotency guard to the viewport meta tag, and cleans up related test assertions.
* fix: move Path import out of TYPE_CHECKING block in test_visualization
* fix: rename _logger to logger to match codebase convention
* fix: remove unnecessary cast in preview command theme parameter
* refactor: extract DEFAULT_DISPLAY_WIDTH constant and make apply_html_post_processing public
* Update packages/data-designer-config/tests/config/utils/test_visualization.py
---------
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> |
||
|
|
1514720596
|
feat: support loading config files from HTTP(S) URLs (#323)
* support loading config files from http urls
- allow config builder and CLI loader to load YAML/JSON configs from HTTP(S) URLs
- reject unsupported URL extensions and remote Python module URLs
- update CLI help text and add tests for URL success/failure paths
* harden remote config loading and deduplicate URL validation
- Add size limit (10 MB) when fetching configs from URLs
- Validate parsed YAML is a dict before returning
- Make is_http_url public and reuse it in CLI validate_url
- Replace local CONFIG_FILE_EXTENSIONS with shared constant
- Add tests for is_http_url, URL-with-no-extension edge cases
* use requests for remote config loading
- replace urllib URL fetching with requests and status checks
- parse remote payloads via smart_load_yaml for consistent validation
- expand tests for HTTP errors, size limits, and non-dict payloads
* lower remote config size limit to 1 MB
* improve config URL HTTP error reporting
Add granular 401/403/404 and generic HTTP status errors for remote config fetching to make failures actionable. Clarify that authenticated config URL loading is not currently supported and update tests for status-aware behavior.
* rewrite github blob URLs for remote loading
Handle GitHub blob links by rewriting them to raw content URLs for config and dataframe HTTP loaders, preserving query params but avoiding query token leaks in logs. This also fixes extension detection for URLs with query strings and adds coverage for rewrite behavior.
* remove validate_url wrapper in favor of is_http_url
The validate_url function in cli/utils was just a thin wrapper around is_http_url from io_helpers. Remove it and have callers use is_http_url directly for clarity and reduced indirection.
* fix optional type for artifact_path CLI option
* fix URL recursion in smart_load_yaml
- avoid treating remote payload strings as new URL inputs
- add regression test for URL string payloads from remote config
* rewrite huggingface blob URLs for remote loading |
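A rough sketch of the GitHub blob-to-raw rewrite and the query-safe extension check described above, using only `urllib.parse`. The function names are hypothetical and the project's loaders may handle more cases (the Hugging Face rewrite is not covered here); the mapping from `github.com/<owner>/<repo>/blob/<ref>/<path>` to `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` is the standard one.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


def rewrite_github_blob_url(url: str) -> str:
    """Map a github.com blob link to its raw-content URL, keeping the query."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expect at least: owner / repo / "blob" / ref / file
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url  # not a blob link, leave untouched
    owner, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([owner, repo, *rest])
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, parts.query, parts.fragment))


def url_extension(url: str) -> str:
    """File extension taken from the URL path only, ignoring any query string."""
    return PurePosixPath(urlsplit(url).path).suffix.lower()


print(rewrite_github_blob_url("https://github.com/o/r/blob/main/cfg/a.yaml?x=1"))
# https://raw.githubusercontent.com/o/r/main/cfg/a.yaml?x=1
print(url_extension("https://example.com/cfg/a.YAML?token=abc"))  # .yaml
```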
||
|
|
d3c4de76da
|
feat: add preview, create, and validate CLI commands (#313)
* feat: add preview, create, and validate CLI commands
Add three new top-level CLI commands for the data-designer workflow:
- `data-designer preview` - generate preview datasets for fast iteration
- `data-designer create` - create full datasets and save to disk
- `data-designer validate` - validate configuration files
Also includes:
- Move wait_for_navigation_key() UI primitive from preview.py to ui.py
- Add KeyPressEvent type annotations to all key binding handlers in ui.py
- Refactor cli/utils.py into cli/utils/ package with config_loader module
- Comprehensive test coverage for all new commands
* fix: update pythonjsonlogger import and clean up dev dependencies
- Update pythonjsonlogger import to use newer JsonFormatter API
- Consolidate dev-dependencies into [dependency-groups] dev section
- Remove unnecessary test cli/utils __init__.py
* small E
* address greptile feedback
* organize CLI commands into rich help panels
Group top-level commands under "Generation" and "Setup" panels for clearer help output.
* refactor config loader to parse files directly and auto-detect config format
- Parse YAML/JSON files into dicts before passing to from_config, providing format-specific error messages for parse failures
- Auto-detect DataDesignerConfig format (columns at top level) and wrap it into BuilderConfig so users can provide either format
- Clean up Python module loading with try/except/finally for reliable sys.modules and sys.path cleanup
- Add comprehensive tests for parsing, validation, and auto-wrapping
* fix sys.path cleanup in config loader and simplify tests
- Use pop(0) instead of remove() to precisely undo the insert(0, ...) and avoid accidentally removing a different matching path entry
- Replace MagicMock with real DataDesignerConfigBuilder in tests
* move config format auto-detection into from_config
Centralize the shorthand DataDesignerConfig detection (columns at top level without a data_designer wrapper) in DataDesignerConfigBuilder.from_config so all callers benefit, not just the CLI config loader. Simplify config_loader to delegate file parsing and format normalization entirely to from_config.
* extract GenerationController from CLI commands
Move shared generation logic (preview, validate, create) out of the individual Typer command functions into a dedicated GenerationController, matching the existing controller pattern (DownloadController, etc.). The command functions now delegate to the controller, keeping them as thin entry points. Tests updated accordingly: command tests verify delegation while controller tests cover the full behavior.
* harden sys.path cleanup and add explanatory comments
Use sys.path.remove() instead of checking sys.path[0] so cleanup succeeds even when exec_module inserts entries at index 0. Drop unnecessary spec=DataDesignerConfigBuilder from test mocks.
* check stdout TTY in preview interactive mode detection
Previously only stdin was checked, so piping stdout (e.g. `dd preview cfg.yaml | head`) would still attempt interactive browsing. Now both stdin and stdout must be a TTY. |
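The shorthand auto-detection and the TTY check described above might look roughly like this. It is a sketch under assumptions: `normalize_config` and `is_interactive` are hypothetical names, and the real `from_config` performs validation that is omitted here; only the `columns` / `data_designer` keys and the "both stdin and stdout must be a TTY" rule come from the commit messages.

```python
import sys


def normalize_config(raw: dict) -> dict:
    """Wrap a shorthand config (columns at top level) in a data_designer key."""
    if "data_designer" not in raw and "columns" in raw:
        return {"data_designer": raw}
    return raw  # already wrapped, or not recognisable as shorthand


def is_interactive(stdin=None, stdout=None) -> bool:
    """Interactive browsing only makes sense when both ends are a terminal."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    return stdin.isatty() and stdout.isatty()


shorthand = {"columns": [{"name": "topic"}]}
print(normalize_config(shorthand))
# {'data_designer': {'columns': [{'name': 'topic'}]}}
```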
||
|
|
e6e58e692e
|
feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248) | ||
|
|
c19f35639f
|
chore: add publish script and update license headers (#253) | ||
|
|
ae0665fa16
|
refactor: slim package refactor into three subpackages (#240)
* remove old structure
* major shuffle
* streamline project configs
* update make commands
* updates to make commands
* remove essentials
* initialize logger in interface
* uv lock
* ignore notepad
* update workflows
* fix e2e project config
* generate colab notebooks
* resolve default model settings in interface
* fix build commands
* update perf import make command
* cleaning up some slop
* update recipes
* move conftest files to tests/
* update subpackage readmes
* streamline config_logging
* use exports
* update perf import usage pattern
* update for IDE behavior with ruff
* remove engine's fixtures file
* add note about lazy imports
* update dependencies
* update docs
* doc fixes
* uv lock
* updates to catch up with main
* clean up makefile
* remove package gitignores
* define deps only once
* isolate tests
* add test for protection rule
* create temp dirs for isolated tests
* catch up to main
* update headers
* re apply changes
* better result summaries for isolated tests
* move exports into top-level init
* fix client importlib version syntax
* catch up with main |