mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
# Processors

Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.

!!! tip "When to Use Processors"
    Processors handle transformations that don't fit the "column" model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.

## Overview

Each processor:

- Receives the complete batch DataFrame
- Applies its transformation
- Passes the result to the next processor (or to output)

Processors can run at three stages, determined by which callback methods they implement:

| Stage | When it runs | Callback method | Use cases |
|-------|--------------|-----------------|-----------|
| Pre-batch | After seed columns, before dependent columns | `process_before_batch()` | Transform seed data before other columns are generated |
| Post-batch | After each batch completes | `process_after_batch()` | Drop columns, transform schema per batch |
| After generation | Once, on final dataset after all batches | `process_after_generation()` | Deduplicate, aggregate statistics, final cleanup |

!!! info "Full Schema Available During Generation"
    Every batch is generated with the full dataset schema. Post-batch schema changes, such as dropping columns, affect only batches that have already completed, so generators can still access every column while building subsequent batches.

A processor can implement any combination of these callbacks. The built-in processors use `process_after_batch()` by default.

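
The three stages can be illustrated with a small, self-contained sketch. It does not use the Data Designer API: `SketchProcessor`, `LowercaseSeed`, `Deduplicate`, and `run` are hypothetical stand-ins, and a list of dicts stands in for the batch DataFrame. Only the callback names come from the table above.

```python
from typing import Any

Row = dict[str, Any]
Batch = list[Row]  # stand-in for the batch DataFrame (assumption for this sketch)


class SketchProcessor:
    """Hypothetical base class: every callback defaults to a pass-through."""

    def process_before_batch(self, batch: Batch) -> Batch:
        return batch

    def process_after_batch(self, batch: Batch) -> Batch:
        return batch

    def process_after_generation(self, dataset: Batch) -> Batch:
        return dataset


class LowercaseSeed(SketchProcessor):
    """Pre-batch: normalize a seed column before dependent columns are built."""

    def process_before_batch(self, batch: Batch) -> Batch:
        return [{**row, "topic": row["topic"].lower()} for row in batch]


class Deduplicate(SketchProcessor):
    """After generation: keep only the first row for each topic."""

    def process_after_generation(self, dataset: Batch) -> Batch:
        seen: set[str] = set()
        unique: Batch = []
        for row in dataset:
            if row["topic"] not in seen:
                seen.add(row["topic"])
                unique.append(row)
        return unique


def run(seed_batches: list[Batch], processors: list[SketchProcessor]) -> Batch:
    """Toy driver that calls each callback at the stage named in the table."""
    dataset: Batch = []
    for batch in seed_batches:
        for p in processors:
            batch = p.process_before_batch(batch)
        # Stand-in for generating the dependent columns.
        batch = [{**row, "summary": f"About {row['topic']}"} for row in batch]
        for p in processors:
            batch = p.process_after_batch(batch)
        dataset.extend(batch)
    for p in processors:
        dataset = p.process_after_generation(dataset)
    return dataset


result = run(
    [[{"topic": "Cats"}, {"topic": "dogs"}], [{"topic": "CATS"}]],
    [LowercaseSeed(), Deduplicate()],
)
```

Because `Deduplicate` only overrides `process_after_generation()`, it sees rows from both batches at once, which is why the duplicate topic from the second batch can be removed.
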
## Processor Types

### 🗑️ Drop Columns Processor

Removes specified columns from the output dataset. Dropped columns are saved separately in the `dropped-columns` directory for reference.

!!! tip "Dropping Columns is More Easily Achieved via `drop=True`"
    Unlike other processors, the Drop Columns Processor does not need to be added explicitly: setting `drop=True` when configuring a column accomplishes the same thing.

**Configuration:**

```python
import data_designer.config as dd

processor = dd.DropColumnsProcessorConfig(
    name="remove_intermediate",
    column_names=["temp_calculation", "raw_input", "debug_info"],
)
```

**Behavior:**

- Columns specified in `column_names` are removed from the output
- Original values are preserved in a separate parquet file
- Missing columns produce a warning but don't fail the build
- Column configs are automatically marked with `drop=True` when this processor is added

**Use Cases:**

- Removing intermediate columns used only for LLM context
- Cleaning up debug or validation columns before final output
- Separating sensitive data from the main dataset

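
The behavior listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the processor's implementation: `drop_columns` is a hypothetical helper, rows are plain dicts rather than a DataFrame, and the dropped values are returned instead of being written to a parquet file.

```python
import warnings
from typing import Any

Row = dict[str, Any]


def drop_columns(rows: list[Row], column_names: list[str]) -> tuple[list[Row], list[Row]]:
    """Return (kept, dropped): rows without the named columns, and the removed values."""
    present = {name for row in rows for name in row}
    for name in column_names:
        if name not in present:
            # Mirrors the documented behavior: warn, but do not fail.
            warnings.warn(f"Column {name!r} not found; skipping")
    to_drop = set(column_names) & present
    kept = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
    dropped = [{k: v for k, v in row.items() if k in to_drop} for row in rows]
    return kept, dropped


rows = [{"question": "2+2?", "answer": "4", "raw_input": "two plus two"}]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept, dropped = drop_columns(rows, ["raw_input", "debug_info"])
```

Here `raw_input` is removed and preserved separately, while the missing `debug_info` column only triggers a warning.
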
### 🔄 Schema Transform Processor

Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.

**Configuration:**

```python
import data_designer.config as dd

processor = dd.SchemaTransformProcessorConfig(
    name="chat_format",
    template={
        "messages": [
            {"role": "user", "content": "{{ question }}"},
            {"role": "assistant", "content": "{{ answer }}"},
        ],
        "metadata": "{{ category | upper }}",
    },
)
```

**Behavior:**

- Each key in `template` becomes a column in the transformed dataset
- Values are Jinja2 templates with access to all columns in the batch
- Complex structures (lists, nested dicts) are supported
- Output is saved to the `processors-outputs/{name}/` directory
- The original dataset passes through unchanged

**Template Capabilities:**

- **Variable substitution**: `{{ column_name }}`
- **Filters**: `{{ text | upper }}`, `{{ text | lower }}`, `{{ text | trim }}`
- **Nested structures**: Arbitrarily deep JSON structures
- **Lists**: `["{{ col1 }}", "{{ col2 }}"]`

**Use Cases:**

- Converting flat columns to chat message format
- Restructuring data for specific model training formats
- Creating derived views without modifying the source dataset

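
To make the row-by-row rendering concrete, here is a rough sketch of how a nested template could be applied to a single row. It deliberately avoids Jinja2 and Data Designer: `render_string` and `render` are hypothetical helpers, and the regex handles only `{{ column }}` with an optional `upper`, `lower`, or `trim` filter. The real processor uses Jinja2, which supports far more.

```python
import re
from typing import Any

# Simplified stand-ins for the three filters named above (not Jinja2 itself).
_FILTERS = {"upper": str.upper, "lower": str.lower, "trim": str.strip}
_EXPR = re.compile(r"\{\{\s*(\w+)\s*(?:\|\s*(\w+)\s*)?\}\}")


def render_string(template: str, row: dict[str, Any]) -> str:
    """Substitute `{{ column }}` and `{{ column | filter }}` with values from one row."""

    def substitute(match: re.Match) -> str:
        column, filter_name = match.group(1), match.group(2)
        value = str(row[column])
        return _FILTERS[filter_name](value) if filter_name else value

    return _EXPR.sub(substitute, template)


def render(template: Any, row: dict[str, Any]) -> Any:
    """Walk nested dicts and lists, rendering every string leaf against the row."""
    if isinstance(template, str):
        return render_string(template, row)
    if isinstance(template, list):
        return [render(item, row) for item in template]
    if isinstance(template, dict):
        return {key: render(value, row) for key, value in template.items()}
    return template


template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
    "metadata": "{{ category | upper }}",
}
row = {"question": "What is 2 + 2?", "answer": "4", "category": "math"}
transformed = render(template, row)
```

Each top-level key of `template` (`messages`, `metadata`) ends up as one column of the transformed record, while `row` itself is left untouched.
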
## Using Processors

Add processors to your configuration using the builder's `add_processor` method:

```python
import data_designer.config as dd

builder = dd.DataDesignerConfigBuilder()

# ... add columns ...

# Drop intermediate columns
builder.add_processor(
    dd.DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["scratch_work", "raw_context"],
    )
)

# Transform to chat format
builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)
```

### Execution Order

Processors execute in the order they're added. Plan accordingly when one processor's output affects another.

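
A toy example shows why order matters. The two steps below are hypothetical functions over plain dicts, not Data Designer processors: one drops a `scratch_work` column and the other derives a word count from it.

```python
from typing import Any, Callable

Row = dict[str, Any]
Step = Callable[[list[Row]], list[Row]]


def drop_scratch(rows: list[Row]) -> list[Row]:
    """Remove the intermediate column."""
    return [{k: v for k, v in row.items() if k != "scratch_work"} for row in rows]


def count_scratch_words(rows: list[Row]) -> list[Row]:
    """Derive a word count from the intermediate column (0 if it is missing)."""
    return [
        {**row, "scratch_words": len(row.get("scratch_work", "").split())}
        for row in rows
    ]


def run_in_order(rows: list[Row], steps: list[Step]) -> list[Row]:
    """Apply each step to the output of the previous one."""
    for step in steps:
        rows = step(rows)
    return rows


rows = [{"answer": "4", "scratch_work": "two plus two"}]
count_then_drop = run_in_order(rows, [count_scratch_words, drop_scratch])
drop_then_count = run_in_order(rows, [drop_scratch, count_scratch_words])
```

With the count first, the derived value reflects the original text; with the drop first, the second step no longer sees the column it depends on.
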
## Processor Plugins

You can extend Data Designer with custom processors via the [plugin system](../plugins/overview.md). A processor plugin is a Python package that provides:

- A **config class** inheriting from `ProcessorConfig` with a `processor_type: Literal["your-type"]` discriminator
- An **implementation class** inheriting from `Processor` that overrides the desired callback methods
- A **`Plugin` instance** connecting the two

Once installed, plugin processors are automatically discovered and can be used with `add_processor()` like built-in processors.

```python
from my_processor_plugin.config import MyProcessorConfig

builder.add_processor(
    MyProcessorConfig(
        name="my_processor",
        # ... plugin-specific parameters ...
    )
)
```

**Entry point configuration** in `pyproject.toml`:

```toml
[project.entry-points."data_designer.plugins"]
my-processor = "my_processor_plugin.plugin:my_processor_plugin"
```

See the [plugins overview](../plugins/overview.md) for the full guide on creating plugins.

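
The relationship between the three pieces can be sketched without the real base classes. Everything here is a hypothetical stand-in: `RegexFilterConfig` is a dataclass rather than a `ProcessorConfig` subclass, `RegexFilterProcessor` does not inherit from `Processor`, and `SketchPlugin` and `REGISTRY` only imitate the idea of a `Plugin` instance being looked up by its `processor_type` discriminator. See the plugins overview for the actual classes and signatures.

```python
import re
from dataclasses import dataclass
from typing import Any, Literal

Row = dict[str, Any]


@dataclass
class RegexFilterConfig:
    """Config class: `processor_type` acts as the discriminator."""

    name: str
    column: str
    pattern: str
    processor_type: Literal["regex-filter"] = "regex-filter"


class RegexFilterProcessor:
    """Implementation class: overrides only the callback it needs."""

    def __init__(self, config: RegexFilterConfig) -> None:
        self.config = config
        self._regex = re.compile(config.pattern)

    def process_before_batch(self, rows: list[Row]) -> list[Row]:
        # Keep rows whose column value matches the configured pattern.
        return [row for row in rows if self._regex.search(str(row[self.config.column]))]


@dataclass(frozen=True)
class SketchPlugin:
    """Stand-in for the `Plugin` instance that connects config and implementation."""

    config_cls: type
    impl_cls: type


regex_filter_plugin = SketchPlugin(config_cls=RegexFilterConfig, impl_cls=RegexFilterProcessor)

# Toy registry keyed by the discriminator value (assumption for this sketch).
REGISTRY = {"regex-filter": regex_filter_plugin}


def build_processor(config: Any) -> Any:
    """Pick the implementation class that matches the config's discriminator."""
    plugin = REGISTRY[config.processor_type]
    return plugin.impl_cls(config)


config = RegexFilterConfig(name="keep_questions", column="text", pattern=r"\?$")
processor = build_processor(config)
filtered = processor.process_before_batch(
    [{"text": "Is it raining?"}, {"text": "It is raining."}]
)
```

The discriminator is what lets a single configuration field select the right implementation, which is why each plugin needs a unique `processor_type` value.
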
## Configuration Parameters

### Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | str | Identifier for the processor, used in output directory names |

### DropColumnsProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `column_names` | list[str] | Columns to remove from output |

### SchemaTransformProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `template` | dict[str, Any] | Jinja2 template defining the output schema. Must be JSON-serializable. |