* chore: add __init__.py to engine namespace subpackages
Griffe (used by mkdocstrings) skips directories without __init__.py
when resolving module paths, which prevented the new plugins code
reference from rendering SeedReader, FileSystemSeedReader, and
Processor. Adding empty __init__.py files in engine/resources/,
engine/processing/, and engine/processing/processors/ aligns with
the convention already used in engine/mcp/, engine/models/, etc.
* docs: flesh out docstrings on plugin extension-point classes
Plugin authors now see meaningful descriptions for every field and
method on the bases rendered in the plugins code reference:
- Plugin and PluginType: class docstrings + Attributes tables for
fields and enum members; fix typo in config_qualified_name field
description.
- SingleColumnConfig: document allow_resize.
- ProcessorConfig: document processor_type discriminator.
- SeedSource: document seed_type discriminator.
- FileSystemSeedSource: add class docstring + Attributes table for
path / file_pattern / recursive.
- ColumnGeneratorFullColumn and ColumnGeneratorCellByCell: add
class docstrings explaining when to use each base, plus method
docstrings on the abstract generate() implementations.
* docs: graduate plugins out of experimental mode
Restructures plugin documentation around the now-stable extension
points (column generator, seed reader, processor) and treats plugins
as a first-class story for customizing Data Designer.
- Add code_reference/plugins.md: single-stop reference for the Plugin
object and the config + implementation base classes used by all
three plugin types.
- Add code_reference/generators.md: column generator implementation
base classes, separated from column configs.
- Surface SingleColumnConfig in code_reference/column_configs.md.
- Add plugins/implement.md ("Build Your Own"): per-type implementation
instructions across column generators, seed readers, and processors.
- Add plugins/processor.md: complete processor plugin package example.
- Rewrite plugins/overview.md: open with why plugins exist, drop the
internal-helpers note (PluginRegistry / PluginManager), and focus
the guide on what plugin builders need.
- Refresh plugins/available.md (Catalog) and
plugins/filesystem_seed_reader.md to match the new structure.
- Delete plugins/example.md (replaced by per-type guides).
- Reorder Code Reference nav alphabetically and add the new pages.
- Minor link / wording fixes in concepts/processors.md and
concepts/deployment-options.md.
* docs: simplify plugin docs structure
Replace the overview's how-to walkthrough and the per-type plugin
guides with a single Build Your Own page that covers all three
plugin types side-by-side. Add a dedicated Using Models in Plugins
guide and a seed_readers code reference, and trim the overview down
to what the plugin types are, how to use one, and how discovery
works.
- Rename plugins/implement.md to plugins/build_your_own.md.
- Delete plugins/filesystem_seed_reader.md and plugins/processor.md
(their content is now in build_your_own.md and the per-type code
references).
- Add plugins/models.md for model-backed column generator authoring.
- Add code_reference/seed_readers.md for seed reader implementation
base classes.
- Rewrite plugins/overview.md: shorter intro, type bullets link to
the relevant code reference, drop the multi-step "How do you
create plugins" walkthrough in favor of a single Build a Plugin
pointer, tighten Discovery troubleshooting.
- Refresh plugins/available.md (Available Plugins): point to the
DataDesignerPlugins catalog and explain how to request a community
listing.
- Update cross-page links in concepts/processors.md,
concepts/seed-datasets.md, recipes/plugin_development/markdown_seed_reader.md,
code_reference/plugins.md, and code_reference/generators.md to
match the new structure.
- Update mkdocs.yml nav: rename to Build Your Own, add Using Models,
add seed_readers code reference.
* docs: scroll wide tables horizontally instead of wrapping
Code-heavy reference tables (plugin bases, column generators, etc.)
were wrapping aggressively on narrow viewports, breaking long
identifiers across multiple lines. Switch the table container to
horizontal overflow and prevent code cells from wrapping so
identifiers stay readable.
* docs: address PR #603 review feedback
- Add an Implementation base section to code_reference/processors.md
rendering the engine-side Processor class. This justifies the
engine/processing/__init__.py files added earlier and gives
processor plugin authors an auto-rendered API reference, matching
the pattern used by code_reference/generators.md and seed_readers.md.
- build_your_own.md: replace the placeholder "x" emoji on the
IndexMultiplier example with the actual multiplication sign.
- build_your_own.md: drop the manual `re.compile + apply(lambda)`
pattern in the regex-filter processor in favor of the idiomatic
`Series.str.contains(..., regex=True)`.
- build_your_own.md: add a kernel-restart caveat after the editable
install instructions — PluginRegistry caches discovery on first
import, so notebooks need a fresh kernel to pick up freshly
installed plugins.
- build_your_own.md: state explicitly what `assert_valid_plugin`
checks (config base + plugin-type-appropriate impl base).
- code_reference/plugins.md: link out to the processors code
reference alongside generators and seed_readers.
* docs: split code reference by package
* docs: add interface code reference
* docs: add code reference overviews
* docs: refine code reference pages
* docs: improve code reference tables
* docs: correct reference docstrings
* docs: embed plugin catalog table
* docs: note plugin discovery restart caveat
* docs: explain generator base class choice
* docs: mention async cell generator examples
* docs: clarify plugin model usage
* docs: clarify plugin model aliases
* docs: address plugin review feedback
* docs: update available plugins page
Add comprehensive documentation for DirectorySeedSource, FileContentsSeedSource,
and AgentRolloutSeedSource to the seed datasets concept page. Add FileSystemSeedReader
plugin authoring guide and Markdown section seed reader recipe. Supersedes #425 and #452.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add processor plugin support
Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).
- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation
* test: add processor plugin registration test
Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.
* test: simplify processor plugin registration test
* move ProcessorConfig to base and convert demo to e2e test
- Move ProcessorConfig from processors.py to config.base to guard
against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/
* move plan to plans/299/
* feat: add allow_resize for 1:N and N:1 generation patterns
Adds support for generators that produce a different number of records
than the input (expansion or retraction). This addresses GitHub issue #265.
Changes:
- Add `allow_resize` parameter to `update_records()` in DatasetBatchManager
- Add `allow_resize` field to CustomColumnConfig
- Add validation requiring FULL_COLUMN strategy when allow_resize=True
- Track and report actual_num_records in metadata (may differ from target)
- Add logging when batch size changes
- Add example_allow_resize.py demonstrating the feature
- Add comprehensive tests
* docs: add allow_resize to custom columns documentation
* refactor: consolidate buffer API and elevate allow_resize to base config
- Merge update_records and replace_buffer into a single replace_buffer
method with allow_resize parameter on DatasetBatchManager
- Move allow_resize field from CustomColumnConfig to SingleColumnConfig
so plugins inherit it without needing a mixin
- Align example and logging with final CustomColumn API
- Parametrize resize tests and extract shared stub in test_columns
* test: add chained resize and multi-batch integration tests
- Add expand->retract->expand chaining test (single batch)
- Add multi-batch resize test verifying combined parquet output
- Update example to chain expand/retract/expand with preview+build
- Use 💥/✂️ emojis for resize logging (expand/retract)
* extend allow_resize to cell-by-cell (return dict or list[dict])
- Config: allow allow_resize with CELL_BY_CELL; relax validator
- Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize;
validate per row via _validate_cell_output
- Builder: collect results by index when cell allow_resize, flatten and
replace_buffer; add _log_resize_if_changed and _column_display_name
- Docs: ALL_CAPS for strategies, simplify allow_resize table text
- Tests: parametrized preview and multibatch; factories with n param;
_RESIZE_SPECS with inline factory calls; ids ordered like specs
* reorder allow_resize specs and add edge-case tests
- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd,
cell_drop_all to _RESIZE_SPECS
- Stubs before specs: _resize_full_keep_first, _resize_cell_expand,
_resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories
- Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS
- Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3
multibatch cases (5_2, 4_2) first
- Handle all-batches-skipped in multibatch test (empty df when path missing)
- test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)
* tidy allow_resize: drop validator, shared stub, explicit flag
- Remove validate_allow_resize_requires_full_column from CustomColumnConfig
- Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns
- Pass allow_resize=False in _write_processed_batch replace_buffer call
* fix: add missing f prefix to error message in custom.py
* docs(plugins): add section on setting allow_resize=True for resize plugins
* fix: address PR review comments on allow_resize
- Replace getattr with direct attribute access where config is always
SingleColumnConfig (custom.py, cell-by-cell path in builder)
- Keep getattr in _run_full_column_generator which also handles
multi-column configs without allow_resize
- Restructure allow_resize validation branching in CustomColumnGenerator
- Fix error message wording: "key" -> "column"
* fix: remove duplicate tool_alias log, fix test docstring
- Remove tool_alias log from _setup_fan_out (callers already log it)
- Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory
* fix: avoid duplicate undeclared-column warning in _validate_output
Inline the strip instead of delegating to _validate_cell_output,
which would log the same warning a second time.
* fix: use lazy.pd instead of pd for runtime pandas usage in tests
The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.
* docs: restructure plugin docs with multi-file layout and seed reader type
- Update plugin overview to document both column generator and seed
reader plugin types
- Restructure example plugin to use separate config.py, impl.py, and
plugin.py files instead of a single-file approach
- Add sections for plugin validation and multiple plugins per package
- Document required config class methods (get_column_emoji,
required_columns, side_effect_columns)
* docs: clarify benefits of multi-file plugin structure
Expand explanation to mention circular dependency prevention
as a key reason for separating config, impl, and plugin modules.
* docs: fix import ordering in plugin example
* import spacing
* better example column name
* add a bit to the comment
* Updated plugin docs
* update plugin overview call-to-action wording
---------
Co-authored-by: Kirit93 <kthadaka@nvidia.com>
* create top-level base file
* add note
* update license header
* move exportable config and move base to config module
* update references in docs
* do not include single column config in init
* add inverse import order e2e test
* removing required resources
* fix tests
* add get required resources method to base column generator
* move classification functions to engine; remove required resources
* drop single from subclass names
* update model config logging
* fix unit test
* typo
* update type hint
* move tests