mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
# Processors

Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.

!!! tip "When to Use Processors"
    Processors handle transformations that don't fit the "column" model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.

## Overview

Each processor:

- Receives the complete batch DataFrame
- Applies its transformation
- Passes the result to the next processor (or to output)

Processors can run at three stages, determined by which callback methods they implement:

| Stage | When it runs | Callback method | Use cases |
|-------|--------------|-----------------|-----------|
| Pre-batch | After seed columns, before dependent columns | `process_before_batch()` | Transform seed data before other columns are generated |
| Post-batch | After each batch completes | `process_after_batch()` | Drop columns, transform schema per batch |
| After generation | Once, on final dataset after all batches | `process_after_generation()` | Deduplicate, aggregate statistics, final cleanup |

!!! info "Full Schema Available During Generation"
    Each batch carries the full dataset schema during generation. Post-batch schema changes such as column dropping only affect batches that have already completed, so every column remains available to generators while later batches are built.

!!! warning "Row-count changes under the async engine"
    The async engine (the default) enforces row-count invariance in `process_before_batch()` and `process_after_batch()`: a processor that returns a different number of rows raises `DatasetGenerationError`. Put row-filtering or row-expansion logic in `process_after_generation()`, which operates on the final dataset and supports row-count changes. The legacy sync engine (opt out of the async engine with `DATA_DESIGNER_ASYNC_ENGINE=0`) permits row-count changes at every stage.

A processor can implement any combination of these callbacks. The built-in processors use `process_after_batch()` by default.

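The stage model and the row-count rule described above can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: `NormalizeAndDedupe` and `run_stage` are hypothetical names, the class does not subclass the library's `Processor` base, and a plain `ValueError` stands in for `DatasetGenerationError`. It only shows the idea of calling whichever callbacks a processor implements and rejecting row-count changes outside the final stage.

```python
import pandas as pd


class NormalizeAndDedupe:
    """Hypothetical processor for illustration; not the library's Processor base."""

    def process_before_batch(self, df: pd.DataFrame) -> pd.DataFrame:
        # Pre-batch: clean up seed data without changing the number of rows.
        return df.assign(topic=df["topic"].str.strip().str.lower())

    def process_after_generation(self, df: pd.DataFrame) -> pd.DataFrame:
        # After generation: row-count changes such as deduplication go here.
        return df.drop_duplicates(subset="topic").reset_index(drop=True)


def run_stage(processors: list, stage: str, df: pd.DataFrame, *, allow_row_change: bool) -> pd.DataFrame:
    """Call `stage` on every processor that implements it, in list order."""
    for processor in processors:
        callback = getattr(processor, stage, None)
        if callback is None:
            continue  # this processor does not take part in this stage
        result = callback(df)
        if not allow_row_change and len(result) != len(df):
            # Stand-in for the error the async engine is described as raising.
            raise ValueError(f"{type(processor).__name__}.{stage} changed the row count")
        df = result
    return df


processors = [NormalizeAndDedupe()]
batch = pd.DataFrame({"topic": [" Math ", "math", "History"]})

batch = run_stage(processors, "process_before_batch", batch, allow_row_change=False)
# ... dependent columns would be generated here ...
batch = run_stage(processors, "process_after_batch", batch, allow_row_change=False)
final = run_stage(processors, "process_after_generation", batch, allow_row_change=True)
```

In this toy run the batch keeps its three rows through both per-batch stages, and deduplication only shrinks the data in the final stage.
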
## Processor Types

### 🗑️ Drop Columns Processor

Removes specified columns from the output dataset. Dropped columns are saved separately in the `dropped-columns` directory for reference.

!!! tip "Dropping Columns is More Easily Achieved via `drop=True`"
    Unlike the other processors, the Drop Columns Processor does not need to be added explicitly: setting `drop=True` on a column's config has the same effect.

**Configuration:**

```python
import data_designer.config as dd

processor = dd.DropColumnsProcessorConfig(
    name="remove_intermediate",
    column_names=["temp_calculation", "raw_input", "debug_info"],
)
```

**Behavior:**

- Columns specified in `column_names` are removed from the output
- Original values are preserved in a separate parquet file
- Missing columns produce a warning but don't fail the build
- Column configs are automatically marked with `drop=True` when this processor is added

**Use Cases:**

- Removing intermediate columns used only for LLM context
- Cleaning up debug or validation columns before final output
- Separating sensitive data from the main dataset

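The behavior listed for the Drop Columns Processor can be approximated in a few lines of pandas. This is a rough sketch under stated assumptions, not the library's implementation: `split_dropped_columns` is a hypothetical helper, and it returns the dropped columns instead of writing them to a parquet file.

```python
import warnings

import pandas as pd


def split_dropped_columns(df: pd.DataFrame, column_names: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: return (kept, dropped) frames for the requested columns."""
    present = [name for name in column_names if name in df.columns]
    missing = [name for name in column_names if name not in df.columns]
    if missing:
        # Missing columns are reported but do not abort the run.
        warnings.warn(f"Columns not found, skipping: {missing}")
    dropped = df[present].copy()  # original values, kept for reference
    kept = df.drop(columns=present)
    return kept, dropped


df = pd.DataFrame(
    {
        "question": ["q1", "q2"],
        "raw_input": ["a", "b"],
        "answer": ["x", "y"],
    }
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept, dropped = split_dropped_columns(df, ["raw_input", "debug_info"])
```

Here `raw_input` moves to the `dropped` frame, while the unknown `debug_info` column only triggers a warning.
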
### 🔄 Schema Transform Processor

Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.

**Configuration:**

```python
import data_designer.config as dd

processor = dd.SchemaTransformProcessorConfig(
    name="chat_format",
    template={
        "messages": [
            {"role": "user", "content": "{{ question }}"},
            {"role": "assistant", "content": "{{ answer }}"},
        ],
        "metadata": "{{ category | upper }}",
    },
)
```

**Behavior:**

- Each key in `template` becomes a column in the transformed dataset
- Values are Jinja2 templates with access to all columns in the batch
- Complex structures (lists, nested dicts) are supported
- Output is saved to the `processors-files/{name}/` directory
- The original dataset passes through unchanged

**Template Capabilities:**

- **Variable substitution**: `{{ column_name }}`
- **Filters**: `{{ text | upper }}`, `{{ text | lower }}`, `{{ text | trim }}`
- **Nested structures**: Arbitrarily deep JSON structures
- **Lists**: `["{{ col1 }}", "{{ col2 }}"]`

**Use Cases:**

- Converting flat columns to chat message format
- Restructuring data for specific model training formats
- Creating derived views without modifying the source dataset

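To make the template behavior concrete, here is a minimal sketch of how a nested template could be rendered against a single row. It assumes the `jinja2` package is installed; `render_template` is a hypothetical helper written for this example and is not claimed to match how the Schema Transform Processor is implemented internally.

```python
from typing import Any

from jinja2 import Template


def render_template(node: Any, row: dict[str, Any]) -> Any:
    """Hypothetical helper: render every string leaf of a nested template."""
    if isinstance(node, str):
        # Each string is treated as a Jinja2 template with the row's columns in scope.
        return Template(node).render(**row)
    if isinstance(node, dict):
        return {key: render_template(value, row) for key, value in node.items()}
    if isinstance(node, list):
        return [render_template(item, row) for item in node]
    return node  # numbers, booleans, and None pass through unchanged


template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
    "metadata": "{{ category | upper }}",
}
row = {"question": "What is 2 + 2?", "answer": "4", "category": "math"}

transformed = render_template(template, row)
```

Each top-level key of `transformed` corresponds to one column of the transformed dataset, and the input `row` is left untouched.
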
## Using Processors

Add processors to your configuration using the builder's `add_processor` method:

```python
import data_designer.config as dd

builder = dd.DataDesignerConfigBuilder()

# ... add columns ...

# Drop intermediate columns
builder.add_processor(
    dd.DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["scratch_work", "raw_context"],
    )
)

# Transform to chat format
builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)
```

### Execution Order

Processors execute in the order they're added. Plan accordingly when one processor's output affects another.

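A toy example shows why the order matters. The two functions below are hypothetical stand-ins for processors, not library classes: one derives a column from `answer`, the other removes `answer`. Running them in the opposite order fails because the derived column can no longer be computed.

```python
import pandas as pd


def add_length(df: pd.DataFrame) -> pd.DataFrame:
    # Reads the `answer` column, so it must run while that column still exists.
    return df.assign(answer_length=df["answer"].str.len())


def drop_answer(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop(columns=["answer"])


def run_in_order(df: pd.DataFrame, steps: list) -> pd.DataFrame:
    """Apply each step to the output of the previous one."""
    for step in steps:
        df = step(df)
    return df


df = pd.DataFrame({"question": ["q1"], "answer": ["forty-two"]})

# Derive first, then drop: works.
ok = run_in_order(df, [add_length, drop_answer])

# Drop first, then derive: the column is already gone.
try:
    run_in_order(df, [drop_answer, add_length])
    reversed_order_failed = False
except KeyError:
    reversed_order_failed = True
```
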
## Processor Plugins

You can extend Data Designer with custom processors via the [plugin system](../plugins/overview.md). Once installed, plugin processors are automatically discovered and can be used with `add_processor()` like built-in processors.

```python
from my_processor_plugin.config import MyProcessorConfig

builder.add_processor(
    MyProcessorConfig(
        name="my_processor",
        # ... plugin-specific parameters ...
    )
)
```

For implementation instructions across all plugin types, see [Build Your Own](../plugins/build_your_own.md).

## Configuration Parameters

### Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | str | Identifier for the processor, used in output directory names |

### DropColumnsProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `column_names` | list[str] | Columns to remove from output |

### SchemaTransformProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `template` | dict[str, Any] | Jinja2 template defining the output schema. Must be JSON-serializable. |