DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Andre Manoel	b6de38d894	docs: remove docs code reference (#674)	2026-05-21 18:29:18 -04:00
Johnny Greco	d14c9b3ccc	feat(cli): add plugin catalog core (#618 ) * feat(cli): add plugin catalog services Add typed catalog and tap models, persistent tap storage, cached catalog loading, compatibility evaluation, install plan generation, and runtime plugin discovery helpers. Refs #617 * feat(cli): add plugins command group Wire list, search, info, install, installed, and tap management commands through the existing command-controller CLI pattern. Refs #617 * test(cli): cover plugin catalog workflows Add regression coverage for tap caching, catalog compatibility, installer command generation, local path resolution, and Typer command delegation. Refs #617 * fix(cli): align plugin taps with schema v2 Validate tap catalogs against the schema v2 contract used by NVIDIA-NeMo/DataDesignerPlugins#36, including source union fields, docs URLs, package paths, compatibility metadata, and unique runtime plugin names. Derive Git install targets as package-qualified PEP 508 direct references so git tap entries install the package described by the catalog source metadata. Refs #617 * fix(cli): address plugin review feedback - Invalidate import caches before post-install entry point verification - Make tap aliases case-insensitive and cache catalogs by alias plus URL - Prefer compatible catalog entries before falling back to forced installs - Clarify unused --tap behavior and list installed entry points without imports - Add direct controller coverage and update CLI plugin documentation Refs #617 * fix(cli): gate incompatible plugin installs Fetch install targets before compatibility filtering so the controller owns the final --force decision and the incompatible install guard stays reachable. Refs #617 * style(cli): format plugin catalog files Apply ruff formatting to the plugin command and tap repository tests so CI format checks pass on the PR merge commit. 
Refs #617 * fix(cli): reject duplicate plugin entry names Key catalog duplicate detection by entry_point.name so distinct catalog entries cannot register the same runtime plugin name. Refs #617 * fix(cli): preserve GitHub tree tap paths * fix(cli): verify plugin entry point names * align plugin CLI with catalog schema - adopt catalog terminology for plugin source aliases - parse package-first plugin catalog metadata from the plugin repo - install package requirements with optional catalog indexes * tidy plugin catalog workflow docs * align plugin catalog CLI with package contract * add plugin package uninstall workflow * test plugin package command targets * document plugin package aliases * address plugin catalog review feedback * prefer runtime plugin lookup matches * rename plugins command to plugin * show plugin package descriptions * rename plugin catalogs command * add protected plugin package installs * document plugin package install modes * avoid building project during plugin installs * harden plugin package installs * tighten plugin catalog contracts * fix no-args help exit code * make plugin docs links robust * document plugin CLI catalog workflows * clarify plugin entry point verification * simplify plugin CLI docs * narrow plugin search fields * hide plugin catalog cache ttl * remove plugin catalog trust flag * improve plugin CLI recovery UX * polish plugin catalog table display * stabilize plugin catalog table test * tighten plugin catalog edge cases * harden plugin catalog verification - Escape catalog-provided Rich markup before rendering CLI output - Reject runtime plugin names that collide after enum-key normalization - Load installed runtime entry points in a subprocess before reporting success * simplify plugin entry point verification Load matching entry points directly after install instead of spawning a separate Python process. This keeps the check package-scoped while still catching broken entry-point targets and non-Plugin objects. 
* require newer uv for plugin plans Use uv >= 0.10.0 as the single supported uv requirement for plugin package commands. Auto mode now falls back to a pip plan with an upgrade warning when uv is unavailable or too old, while explicit uv selection remains strict. * verify pip fallback availability * polish plugin CLI status markers * clarify plugin compatibility labels * simplify plugin info install details * address plugin CLI review nits * support versioned plugin package installs * share plugin install metadata rendering * show installed plugin packages * harden versioned plugin installs - Preserve catalog requirement constraints for versioned installs - Remove stale install-plan metadata fields - Expand parser, uv, controller, and local-catalog dry-run coverage * harden plugin help tests * show plugin package versions Add package version metadata support for plugin catalogs and resolve current versions from exact requirements or simple indexes when catalog entries omit them. Update plugin list/info/install metadata to show the plugin package version and Data Designer compatibility requirement while removing the separate Data Designer version line. * format plugin catalog tests * harden plugin package metadata checks * harden plugin CLI test coverage * add plugin discovery docs (#642) Signed-off-by: Johnny Greco <jogreco@nvidia.com> --------- Signed-off-by: Johnny Greco <jogreco@nvidia.com>	2026-05-13 12:26:58 -04:00
Nabin Mulepati	4b93f5b245	feat: let column configs declare all model aliases for the startup health check (#626 ) * feat(engine): let column configs declare all model aliases for the startup health check Plugin column configs that depend on more than one model alias (generator + judge, critic, etc.) previously could not opt their secondary aliases into the standard startup health check, and configs without a `model_alias` field crashed the collection loop with AttributeError. Add `SingleColumnConfig.get_model_aliases()` as the single override hook the builder uses to enumerate aliases. The default returns the column's primary `model_alias` (if any), so built-in LLM, embedding, and image columns work unchanged. `CustomColumnConfig` overrides it to surface decorator-declared aliases, replacing the special-case `isinstance` branch in the builder. Plugin configs with multiple model fields override it to opt every endpoint into the health check. Fixes #606 Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com> * fix(config): forward empty model_alias to startup health check SingleColumnConfig.get_model_aliases() used `if alias` to filter, which also dropped empty-string aliases. Empty model_alias values are accepted by the config model and previously reached run_health_check, where they failed fast with "No model config with alias '' found!". Treating them as "no model endpoints" silently delayed that error to first generation. Use `alias is not None` so only a truly missing attribute skips the health check, and add a regression test that exercises an empty-string model_alias on a built-in config. Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com> --------- Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>	2026-05-11 11:33:50 -06:00
Johnny Greco	8b8d748446	docs: graduate plugins out of experimental mode (#603 ) * chore: add __init__.py to engine namespace subpackages Griffe (used by mkdocstrings) skips directories without __init__.py when resolving module paths, which prevented the new plugins code reference from rendering SeedReader, FileSystemSeedReader, and Processor. Adding empty __init__.py files in engine/resources/, engine/processing/, and engine/processing/processors/ aligns with the convention already used in engine/mcp/, engine/models/, etc. * docs: flesh out docstrings on plugin extension-point classes Plugin authors now see meaningful descriptions for every field and method on the bases rendered in the plugins code reference: - Plugin and PluginType: class docstrings + Attributes tables for fields and enum members; fix typo in config_qualified_name field description. - SingleColumnConfig: document allow_resize. - ProcessorConfig: document processor_type discriminator. - SeedSource: document seed_type discriminator. - FileSystemSeedSource: add class docstring + Attributes table for path / file_pattern / recursive. - ColumnGeneratorFullColumn and ColumnGeneratorCellByCell: add class docstrings explaining when to use each base, plus method docstrings on the abstract generate() implementations. * docs: graduate plugins out of experimental mode Restructures plugin documentation around the now-stable extension points (column generator, seed reader, processor) and treats plugins as a first-class story for customizing Data Designer. - Add code_reference/plugins.md: single-stop reference for the Plugin object and the config + implementation base classes used by all three plugin types. - Add code_reference/generators.md: column generator implementation base classes, separated from column configs. - Surface SingleColumnConfig in code_reference/column_configs.md. 
- Add plugins/implement.md ("Build Your Own"): per-type implementation instructions across column generators, seed readers, and processors. - Add plugins/processor.md: complete processor plugin package example. - Rewrite plugins/overview.md: open with why plugins exist, drop the internal-helpers note (PluginRegistry / PluginManager), and focus the guide on what plugin builders need. - Refresh plugins/available.md (Catalog) and plugins/filesystem_seed_reader.md to match the new structure. - Delete plugins/example.md (replaced by per-type guides). - Reorder Code Reference nav alphabetically and add the new pages. - Minor link / wording fixes in concepts/processors.md and concepts/deployment-options.md. * docs: simplify plugin docs structure Replace the overview's how-to walkthrough and the per-type plugin guides with a single Build Your Own page that covers all three plugin types side-by-side. Add a dedicated Using Models in Plugins guide and a seed_readers code reference, and trim the overview down to what the plugin types are, how to use one, and how discovery works. - Rename plugins/implement.md to plugins/build_your_own.md. - Delete plugins/filesystem_seed_reader.md and plugins/processor.md (their content is now in build_your_own.md and the per-type code references). - Add plugins/models.md for model-backed column generator authoring. - Add code_reference/seed_readers.md for seed reader implementation base classes. - Rewrite plugins/overview.md: shorter intro, type bullets link to the relevant code reference, drop the multi-step "How do you create plugins" walkthrough in favor of a single Build a Plugin pointer, tighten Discovery troubleshooting. - Refresh plugins/available.md (Available Plugins): point to the DataDesignerPlugins catalog and explain how to request a community listing. 
- Update cross-page links in concepts/processors.md, concepts/seed-datasets.md, recipes/plugin_development/markdown_seed_reader.md, code_reference/plugins.md, and code_reference/generators.md to match the new structure. - Update mkdocs.yml nav: rename to Build Your Own, add Using Models, add seed_readers code reference. * docs: scroll wide tables horizontally instead of wrapping Code-heavy reference tables (plugin bases, column generators, etc.) were wrapping aggressively on narrow viewports, breaking long identifiers across multiple lines. Switch the table container to horizontal overflow and prevent code cells from wrapping so identifiers stay readable. * docs: address PR #603 review feedback - Add an Implementation base section to code_reference/processors.md rendering the engine-side Processor class. This justifies the engine/processing/__init__.py files added earlier and gives processor plugin authors an auto-rendered API reference, matching the pattern used by code_reference/generators.md and seed_readers.md. - build_your_own.md: replace the placeholder "x" emoji on the IndexMultiplier example with the actual multiplication sign. - build_your_own.md: drop the manual `re.compile + apply(lambda)` pattern in the regex-filter processor in favor of the idiomatic `Series.str.contains(..., regex=True)`. - build_your_own.md: add a kernel-restart caveat after the editable install instructions — PluginRegistry caches discovery on first import, so notebooks need a fresh kernel to pick up freshly installed plugins. - build_your_own.md: state explicitly what `assert_valid_plugin` checks (config base + plugin-type-appropriate impl base). - code_reference/plugins.md: link out to the processors code reference alongside generators and seed_readers. 
* docs: split code reference by package * docs: add interface code reference * docs: add code reference overviews * docs: refine code reference pages * docs: improve code reference tables * docs: correct reference docstrings * docs: embed plugin catalog table * docs: note plugin discovery restart caveat * docs: explain generator base class choice * docs: mention async cell generator examples * docs: clarify plugin model usage * docs: clarify plugin model aliases * docs: address plugin review feedback * docs: update available plugins page	2026-05-06 18:12:44 -04:00
Eric W. Tramel	116184b5e6	docs: consolidated seed reader documentation for filesystem and agent rollout sources (#481 ) Add comprehensive documentation for DirectorySeedSource, FileContentsSeedSource, and AgentRolloutSeedSource to the seed datasets concept page. Add FileSystemSeedReader plugin authoring guide and Markdown section seed reader recipe. Supersedes #425 and #452. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-31 13:31:42 -04:00
Andre Manoel	982ce79ca9	feat: add processor plugin support (#299 ) * feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/	2026-02-25 16:40:01 -03:00
Andre Manoel	70dc48884e	feat: add allow_resize for 1:N and N:1 generation patterns (#286 ) * feat: add allow_resize for 1:N and N:1 generation patterns Adds support for generators that produce a different number of records than the input (expansion or retraction). This addresses GitHub issue #265. Changes: - Add `allow_resize` parameter to `update_records()` in DatasetBatchManager - Add `allow_resize` field to CustomColumnConfig - Add validation requiring FULL_COLUMN strategy when allow_resize=True - Track and report actual_num_records in metadata (may differ from target) - Add logging when batch size changes - Add example_allow_resize.py demonstrating the feature - Add comprehensive tests * docs: add allow_resize to custom columns documentation * refactor: consolidate buffer API and elevate allow_resize to base config - Merge update_records and replace_buffer into a single replace_buffer method with allow_resize parameter on DatasetBatchManager - Move allow_resize field from CustomColumnConfig to SingleColumnConfig so plugins inherit it without needing a mixin - Align example and logging with final CustomColumn API - Parametrize resize tests and extract shared stub in test_columns * test: add chained resize and multi-batch integration tests - Add expand->retract->expand chaining test (single batch) - Add multi-batch resize test verifying combined parquet output - Update example to chain expand/retract/expand with preview+build - Use 💥/✂️ emojis for resize logging (expand/retract) * extend allow_resize to cell-by-cell (return dict or list[dict]) - Config: allow allow_resize with CELL_BY_CELL; relax validator - Custom generator: accept dict \| list[dict] when cell_by_cell + allow_resize; validate per row via _validate_cell_output - Builder: collect results by index when cell allow_resize, flatten and replace_buffer; add _log_resize_if_changed and _column_display_name - Docs: ALL_CAPS for strategies, simplify allow_resize table text - Tests: parametrized preview and 
multibatch; factories with n param; _RESIZE_SPECS with inline factory calls; ids ordered like specs * reorder allow_resize specs and add edge-case tests - Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd, cell_drop_all to _RESIZE_SPECS - Stubs before specs: _resize_full_keep_first, _resize_cell_expand, _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories - Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS - Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3 multibatch cases (5_2, 4_2) first - Handle all-batches-skipped in multibatch test (empty df when path missing) - test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list) * tidy allow_resize: drop validator, shared stub, explicit flag - Remove validate_allow_resize_requires_full_column from CustomColumnConfig - Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns - Pass allow_resize=False in _write_processed_batch replace_buffer call * fix: add missing f prefix to error message in custom.py * docs(plugins): add section on setting allow_resize=True for resize plugins * fix: address PR review comments on allow_resize - Replace getattr with direct attribute access where config is always SingleColumnConfig (custom.py, cell-by-cell path in builder) - Keep getattr in _run_full_column_generator which also handles multi-column configs without allow_resize - Restructure allow_resize validation branching in CustomColumnGenerator - Fix error message wording: "key" -> "column" * fix: remove duplicate tool_alias log, fix test docstring - Remove tool_alias log from _setup_fan_out (callers already log it) - Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory * fix: avoid duplicate undeclared-column warning in _validate_output Inline the strip instead of delegating to _validate_cell_output, which would log the same warning a second time. 
* fix: use lazy.pd instead of pd for runtime pandas usage in tests The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.	2026-02-18 18:39:31 -03:00
Johnny Greco	11143c788f	docs: restructure plugin docs with multi-file layout and seed reader type (#302 ) * docs: restructure plugin docs with multi-file layout and seed reader type - Update plugin overview to document both column generator and seed reader plugin types - Restructure example plugin to use separate config.py, impl.py, and plugin.py files instead of a single-file approach - Add sections for plugin validation and multiple plugins per package - Document required config class methods (get_column_emoji, required_columns, side_effect_columns) * docs: clarify benefits of multi-file plugin structure Expand explanation to mention circular dependency prevention as a key reason for separating config, impl, and plugin modules. * docs: fix import ordering in plugin example * import spacing * better example column name * add a bit to the comment * Updated plugin docs * update plugin overview call-to-action wording --------- Co-authored-by: Kirit93 <kthadaka@nvidia.com>	2026-02-09 16:03:56 -05:00
Johnny Greco	87119a545b	refactor: move SingleColumnConfig to config.base module (#287 ) * create top-level base file * add note * update license header * move exportable config and move base to config module * update references in docs * do not include single column config in init * add inverse import order e2e test	2026-02-03 14:04:04 -05:00
Johnny Greco	ae0665fa16	refactor: slim package refactor into three subpackages (#240 ) * remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note to about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protetion rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main	2026-01-27 13:53:20 -05:00
Johnny Greco	3d9f5185d7	refactor: remove task metadata property (#216 ) * remove metadata * docs and tests * don't need that test * use static method for generation strategy * update docs * add docstring	2026-01-15 14:12:11 -05:00
Johnny Greco	69cd989285	refactor: update required resources treatment and use subclasses over mixins (#184 ) * removing required resources * fix tests * add get required resources method to base column generator * move classification functions to engine; remove required resources * drop single from subclass names * update model config logging * fix unit test * typo * update type hint * move tests	2026-01-09 14:42:09 -05:00
Mike Knepper	8e69ab0336	refactor: Plugins rename task to impl (#189 )	2026-01-08 13:34:05 -06:00
Mike Knepper	36a174af04	refactor: plugin system updates (#168 )	2026-01-06 10:29:47 -06:00
Johnny Greco	48fdc8c838	docs: add initial plugin documentation (#107 ) * add docstrings * add analysis modules * include toc for plugins section * add plugin docs * remove scope creep * Update docs/plugins/example.md Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com> * address feedback --------- Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>	2025-12-11 16:05:11 -05:00

15 commits