!!! warning "Experimental Feature"
    The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions).

# Example Plugin: Column Generator
Data Designer supports two plugin types: **column generators** and **seed readers**. This page walks through a complete column generator example.

A Data Designer plugin is implemented as a Python package with three main components:

1. **Configuration Class**: Defines the parameters users can configure
2. **Implementation Class**: Contains the core logic of the plugin
3. **Plugin Object**: Connects the config and implementation classes to make the plugin discoverable

We recommend separating these into individual files (`config.py`, `impl.py`, `plugin.py`) within a plugin subdirectory. This keeps the code organized, makes it easy to test each component independently, and guards against circular dependencies: the config module can be imported without pulling in the engine-level implementation classes, and the plugin object can be discovered without importing either.

---

## Column Generator Plugin: Index Multiplier
In this section, we will build a simple column generator plugin that generates values by multiplying the row index by a user-specified multiplier.
### Step 1: Create a Python package
We recommend the following structure for column generator plugins:

```
data-designer-index-multiplier/
├── pyproject.toml
└── src/
    └── data_designer_index_multiplier/
        ├── __init__.py
        ├── config.py
        ├── impl.py
        └── plugin.py
```

### Step 2: Create the config class

The configuration class defines what parameters users can set when using your plugin. For column generator plugins, it must inherit from [SingleColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SingleColumnConfig) and include a [discriminator field](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions).

Create `src/data_designer_index_multiplier/config.py`:

```python
from typing import Literal

from data_designer.config.base import SingleColumnConfig


class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"

    # Configurable parameter for this plugin
    multiplier: int = 2

    @staticmethod
    def get_column_emoji() -> str:
        return "✖️"

    @property
    def required_columns(self) -> list[str]:
        """Columns that must exist before this generator runs."""
        return []

    @property
    def side_effect_columns(self) -> list[str]:
        """Additional columns produced beyond the primary column."""
        return []
```

**Key points:**

- The `column_type` field must be a `Literal` type with a string default
- This value uniquely identifies your plugin (use kebab-case)
- Add any custom parameters your plugin needs (here: `multiplier`)
- `SingleColumnConfig` is a Pydantic model, so you can leverage all of Pydantic's validation features
- `get_column_emoji()` returns the emoji displayed in logs for this column type
- `required_columns` lists any columns this generator depends on (empty if none)
- `side_effect_columns` lists any additional columns this generator produces beyond the primary column (empty if none)
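
To make the role of the discriminator field more concrete, the sketch below mimics the dispatch with plain dataclasses instead of Pydantic. It is only an illustration of the idea: `CONFIG_REGISTRY`, `parse_column_config`, and both dataclasses are hypothetical names, and Data Designer itself relies on Pydantic discriminated unions rather than on this code.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for two column config classes (not Data Designer classes).
@dataclass
class IndexMultiplierConfig:
    name: str
    multiplier: int = 2
    column_type: str = "index-multiplier"


@dataclass
class ConstantConfig:
    name: str
    value: str = ""
    column_type: str = "constant"


# Each discriminator value maps to exactly one config class.
CONFIG_REGISTRY = {
    "index-multiplier": IndexMultiplierConfig,
    "constant": ConstantConfig,
}


def parse_column_config(raw: dict):
    """Pick the config class from raw["column_type"] and instantiate it."""
    column_type = raw.get("column_type")
    if column_type not in CONFIG_REGISTRY:
        raise ValueError(f"Unknown column_type: {column_type!r}")
    return CONFIG_REGISTRY[column_type](**raw)


config = parse_column_config(
    {"column_type": "index-multiplier", "name": "scaled_index", "multiplier": 5}
)
print(type(config).__name__, config.multiplier)  # IndexMultiplierConfig 5
```

This is why the `column_type` value has to be unique: two plugins sharing the same value would make the lookup ambiguous.
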
**If your plugin can expand or retract the number of rows (1:N or N:1):** set `allow_resize=True` in the config class so the pipeline updates batch bookkeeping correctly. For example:
```python
class MyColumnConfig(SingleColumnConfig):
    column_type: Literal["my-plugin"] = "my-plugin"
    allow_resize: bool = True  # required when output row count can differ from input
    # ...
```
The default is `False`; only set it to `True` when your `generate` method can return more or fewer rows than it receives.

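
As a rough illustration of what "more or fewer rows" means, the sketch below shows a 1:N expansion and an N:1 retraction on a batch. It uses plain lists of dicts as a stand-in for the `pd.DataFrame` a real generator receives, and `expand_rows` and `keep_even_ids` are hypothetical helpers, not part of Data Designer.

```python
def expand_rows(rows: list[dict], copies: int) -> list[dict]:
    """1:N pattern: emit `copies` output rows for every input row."""
    return [{**row, "variant": i} for row in rows for i in range(copies)]


def keep_even_ids(rows: list[dict]) -> list[dict]:
    """N:1 pattern: drop rows, so the output is shorter than the input."""
    return [row for row in rows if row["id"] % 2 == 0]


batch = [{"id": 0}, {"id": 1}, {"id": 2}]
expanded = expand_rows(batch, copies=2)
retracted = keep_even_ids(batch)
print(len(batch), len(expanded), len(retracted))  # 3 6 2
```

In both cases the output length differs from the input length, which is the situation `allow_resize=True` is meant to declare.
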
### Step 3: Create the implementation class

The implementation class defines the actual business logic of the plugin. For column generator plugins, inherit from `ColumnGeneratorFullColumn` or `ColumnGeneratorCellByCell` and implement the `generate` method.

Create `src/data_designer_index_multiplier/impl.py`:

```python
import logging

import pandas as pd
from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn

from data_designer_index_multiplier.config import IndexMultiplierColumnConfig

logger = logging.getLogger(__name__)


class IndexMultiplierColumnGenerator(ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]):
    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
        """Generate the column data.

        Args:
            data: The current DataFrame being built

        Returns:
            The DataFrame with the new column added
        """
        logger.info(
            f"Generating column {self.config.name} "
            f"with multiplier {self.config.multiplier}"
        )

        data[self.config.name] = data.index * self.config.multiplier

        return data
```

**Key points:**

- Generic type `ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]` connects the implementation to its config
- You have access to the configuration parameters via `self.config`

!!! info "Understanding generation_strategy"
    The `generation_strategy` specifies how the column generator will generate data. You choose a strategy by inheriting from the corresponding base class:

    - **`ColumnGeneratorFullColumn`**: Generates the full column (at the batch level) in a single call to `generate`
        - `generate` must take as input a `pd.DataFrame` with all previous columns and return a `pd.DataFrame` with the generated column appended.

    - **`ColumnGeneratorCellByCell`**: Generates one cell at a time
        - `generate` must take as input a `dict` with key/value pairs for all previous columns and return a `dict` with an additional key/value for the generated cell
        - Supports concurrent workers via a `max_parallel_requests` parameter on the configuration

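
The difference between the two input/output shapes can be sketched with plain functions. This is a simplified illustration, not the Data Designer API: `generate_cell` and `generate_full_column` are hypothetical names, the `count` column is made up, and a list of dicts stands in for the `pd.DataFrame` that a real full-column generator receives.

```python
def generate_cell(row: dict, *, column_name: str, source: str, multiplier: int) -> dict:
    """Cell-by-cell shape: one row in, the same row plus one new key out."""
    return {**row, column_name: row[source] * multiplier}


def generate_full_column(rows: list[dict], *, column_name: str, multiplier: int) -> list[dict]:
    """Full-column shape: the whole batch in, the whole batch plus one new column out."""
    return [{**row, column_name: index * multiplier} for index, row in enumerate(rows)]


batch = [{"count": 1}, {"count": 4}]
per_cell = [generate_cell(r, column_name="scaled", source="count", multiplier=3) for r in batch]
per_batch = generate_full_column(batch, column_name="scaled_index", multiplier=5)
print(per_cell)   # [{'count': 1, 'scaled': 3}, {'count': 4, 'scaled': 12}]
print(per_batch)  # [{'count': 1, 'scaled_index': 0}, {'count': 4, 'scaled_index': 5}]
```

Note that the cell-by-cell function only ever sees a single row, so it cannot know the row's position in the batch. The index multiplier depends on that position, which is one reason this example uses the full-column strategy.
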
### Step 4: Create the plugin object

Create a `Plugin` object that makes the plugin discoverable and connects the implementation and config classes.

Create `src/data_designer_index_multiplier/plugin.py`:

```python
from data_designer.plugins import Plugin, PluginType

plugin = Plugin(
    config_qualified_name="data_designer_index_multiplier.config.IndexMultiplierColumnConfig",
    impl_qualified_name="data_designer_index_multiplier.impl.IndexMultiplierColumnGenerator",
    plugin_type=PluginType.COLUMN_GENERATOR,
)
```
### Step 5: Package your plugin
Create a `pyproject.toml` file to define your package and register the entry point:
```toml
[project]
name = "data-designer-index-multiplier"
version = "1.0.0"
description = "Data Designer index multiplier plugin"
requires-python = ">=3.10"
dependencies = [
    "data-designer",
]

# Register this plugin via entry points
[project.entry-points."data_designer.plugins"]
index-multiplier = "data_designer_index_multiplier.plugin:plugin"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/data_designer_index_multiplier"]
```

!!! info "Entry Point Registration"
    Plugins are discovered automatically using [Python entry points](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata). It is important to register your plugin as an entry point under the `data_designer.plugins` group.

    The entry point format is:

    ```toml
    [project.entry-points."data_designer.plugins"]
    <entry-point-name> = "<module.path>:<plugin-instance-name>"
    ```

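
To check which entry points are registered in the current environment, the standard library's `importlib.metadata` can list them. The snippet below is a small diagnostic sketch rather than part of Data Designer; `list_plugin_entry_points` is a hypothetical helper name.

```python
from importlib.metadata import entry_points


def list_plugin_entry_points(group: str = "data_designer.plugins") -> dict[str, str]:
    """Return {entry point name: "module.path:object"} for one entry point group."""
    all_entry_points = entry_points()
    if hasattr(all_entry_points, "select"):  # Python 3.10+
        selected = all_entry_points.select(group=group)
    else:  # older interpreters return a plain dict keyed by group
        selected = all_entry_points.get(group, [])
    return {ep.name: ep.value for ep in selected}


for name, target in list_plugin_entry_points().items():
    print(f"{name} -> {target}")
```

Once the package from this guide is installed, the listing should include `index-multiplier -> data_designer_index_multiplier.plugin:plugin`. An empty result usually means the package is not installed in the active environment or the group name is misspelled.
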
### Step 6: Install and use your plugin locally

Install your plugin in editable mode. This is all you need to start using it; no PyPI publishing is required:

```bash
# From the plugin directory
uv pip install -e .
```

That's it. The editable install registers the entry point so Data Designer discovers your plugin automatically. Any changes you make to the plugin source code are picked up immediately without reinstalling.

Once installed, your plugin works just like built-in column types:

```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

from data_designer_index_multiplier.config import IndexMultiplierColumnConfig

data_designer = DataDesigner()
builder = dd.DataDesignerConfigBuilder()

# Add a regular column
builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type="category",
        params=dd.CategorySamplerParams(values=["A", "B", "C"]),
    )
)

# Add your custom plugin column
builder.add_column(
    IndexMultiplierColumnConfig(
        name="scaled_index",
        multiplier=5,
    )
)

# Generate data
results = data_designer.create(builder, num_records=10)
print(results.load_dataset())
```

Output:
```
  category  scaled_index
0        B             0
1        A             5
2        C            10
3        A            15
4        B            20
...
```

---
## Validating Your Plugin
Data Designer provides a testing utility to validate that your plugin is structured correctly. Use `assert_valid_plugin` to check that your config and implementation classes are properly defined:
```python
from data_designer.engine.testing.utils import assert_valid_plugin
from data_designer_index_multiplier.plugin import plugin
# Raises AssertionError with a descriptive message if anything is wrong with the general plugin structure
assert_valid_plugin(plugin)
```
This validates that:
- The config class is a subclass of `ConfigBase`
- For column generator plugins: the implementation class is a subclass of `ConfigurableTask`
- For seed reader plugins: the implementation class is a subclass of `SeedReader`
---
## Multiple Plugins in One Package
A single Python package can register multiple plugins. Simply define multiple `Plugin` instances and register each one as a separate entry point:
```toml
[project.entry-points."data_designer.plugins"]
my-column-generator = "my_package.plugins.column_generator.plugin:column_generator_plugin"
my-seed-reader = "my_package.plugins.seed_reader.plugin:seed_reader_plugin"
```
For an example of this pattern, see the end-to-end test plugins in the [tests_e2e/](https://github.com/NVIDIA-NeMo/DataDesigner/tree/main/tests_e2e) directory.

That's it! You now know how to create a Data Designer plugin. A local editable install (`uv pip install -e .`) is all you need to develop, test, and use your plugin. If you want to make it available for others to install via `pip install`, publish it to PyPI or your organization's package index.