DataDesigner/docs/concepts/processors.md
Andre Manoel 982ce79ca9
feat: add processor plugin support (#299)
2026-02-25 16:40:01 -03:00


# Processors
Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.
!!! tip "When to Use Processors"
    Processors handle transformations that don't fit the "column" model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.
## Overview
Each processor:
- Receives the complete batch DataFrame
- Applies its transformation
- Passes the result to the next processor (or to output)
Processors can run at three stages, determined by which callback methods they implement:
| Stage | When it runs | Callback method | Use cases |
|-------|--------------|-----------------|-----------|
| Pre-batch | After seed columns, before dependent columns | `process_before_batch()` | Transform seed data before other columns are generated |
| Post-batch | After each batch completes | `process_after_batch()` | Drop columns, transform schema per batch |
| After generation | Once, on final dataset after all batches | `process_after_generation()` | Deduplicate, aggregate statistics, final cleanup |
!!! info "Full Schema Available During Generation"
    Each batch carries the full dataset schema during generation. Post-batch schema changes, such as dropping columns, apply only to batches that have already completed, so all columns remain available to generators while later batches are built.
A processor can implement any combination of these callbacks. The built-in processors use `process_after_batch()` by default.
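
To make the three stages concrete, the stand-alone sketch below mimics the call order described in the table. It does not use Data Designer's actual `Processor` base class: `SketchProcessor`, `FilterAndDedup`, `run_generation`, and the list-of-dicts row format are hypothetical stand-ins, and the callback signatures are assumptions.

```python
Rows = list[dict]  # stand-in for a batch DataFrame (assumption)


class SketchProcessor:
    """Hypothetical stand-in; NOT Data Designer's real Processor base class."""

    def process_before_batch(self, rows: Rows) -> Rows:
        return rows  # default: pass through unchanged

    def process_after_batch(self, rows: Rows) -> Rows:
        return rows

    def process_after_generation(self, rows: Rows) -> Rows:
        return rows


class FilterAndDedup(SketchProcessor):
    """Implements two of the three callbacks; the third keeps the default."""

    def process_before_batch(self, rows: Rows) -> Rows:
        # Pre-batch: filter seed rows before dependent columns are generated.
        return [r for r in rows if r["seed"].strip()]

    def process_after_generation(self, rows: Rows) -> Rows:
        # After generation: deduplicate the final dataset once.
        seen, unique = set(), []
        for r in rows:
            if r["text"] not in seen:
                seen.add(r["text"])
                unique.append(r)
        return unique


def run_generation(seed_batches, generate, processor):
    """Toy driver showing when each callback fires (assumed control flow)."""
    final = []
    for seeds in seed_batches:
        rows = processor.process_before_batch(seeds)       # after seed columns
        rows = [{**r, "text": generate(r)} for r in rows]  # dependent columns
        rows = processor.process_after_batch(rows)         # once per batch
        final.extend(rows)
    return processor.process_after_generation(final)       # once, at the end


batches = [
    [{"seed": "a"}, {"seed": " "}],
    [{"seed": "a"}, {"seed": "b"}],
]
result = run_generation(batches, lambda r: r["seed"].upper(), FilterAndDedup())
```

Here the blank seed is filtered before generation in the first batch, and the duplicate `"A"` row is removed only once, after both batches have finished.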
## Processor Types
### 🗑️ Drop Columns Processor
Removes specified columns from the output dataset. Dropped columns are saved separately in the `dropped-columns` directory for reference.
!!! tip "Dropping Columns is More Easily Achieved via `drop=True`"
    Unlike the other processors, the Drop Columns Processor does not need to be added explicitly: setting `drop=True` when configuring a column accomplishes the same thing.
**Configuration:**
```python
import data_designer.config as dd

processor = dd.DropColumnsProcessorConfig(
    name="remove_intermediate",
    column_names=["temp_calculation", "raw_input", "debug_info"],
)
```
**Behavior:**
- Columns specified in `column_names` are removed from the output
- Original values are preserved in a separate parquet file
- Missing columns produce a warning but don't fail the build
- Column configs are automatically marked with `drop=True` when this processor is added
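
The behavior listed above can be illustrated with a small stand-alone sketch. It is not the library's implementation: `drop_columns` is a hypothetical helper, rows are plain dicts rather than a DataFrame, and the dropped values are returned to the caller instead of being written to a parquet file.

```python
import warnings


def drop_columns(rows: list[dict], column_names: list[str]) -> tuple[list[dict], list[dict]]:
    """Split rows into (kept, dropped) parts. Hypothetical helper."""
    present = set().union(*(r.keys() for r in rows))
    for name in column_names:
        if name not in present:
            # Missing columns produce a warning instead of failing.
            warnings.warn(f"Column {name!r} not found; skipping")
    to_drop = [c for c in column_names if c in present]
    kept = [{k: v for k, v in r.items() if k not in to_drop} for r in rows]
    # Original values are preserved separately rather than discarded.
    dropped = [{k: r[k] for k in to_drop if k in r} for r in rows]
    return kept, dropped


rows = [{"question": "2+2?", "answer": "4", "raw_input": "x"}]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept, dropped = drop_columns(rows, ["raw_input", "debug_info"])
```

In this run `raw_input` moves to the dropped set, while the absent `debug_info` column triggers a single warning and the call still succeeds.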
**Use Cases:**
- Removing intermediate columns used only for LLM context
- Cleaning up debug or validation columns before final output
- Separating sensitive data from the main dataset
### 🔄 Schema Transform Processor
Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.
**Configuration:**
```python
import data_designer.config as dd

processor = dd.SchemaTransformProcessorConfig(
    name="chat_format",
    template={
        "messages": [
            {"role": "user", "content": "{{ question }}"},
            {"role": "assistant", "content": "{{ answer }}"},
        ],
        "metadata": "{{ category | upper }}",
    },
)
```
**Behavior:**
- Each key in `template` becomes a column in the transformed dataset
- Values are Jinja2 templates with access to all columns in the batch
- Complex structures (lists, nested dicts) are supported
- Output is saved to the `processors-outputs/{name}/` directory
- The original dataset passes through unchanged
**Template Capabilities:**
- **Variable substitution**: `{{ column_name }}`
- **Filters**: `{{ text | upper }}`, `{{ text | lower }}`, `{{ text | trim }}`
- **Nested structures**: Arbitrarily deep JSON structures
- **Lists**: `["{{ col1 }}", "{{ col2 }}"]`
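
How a nested template expands for a single row can be sketched without Data Designer. The example below deliberately swaps Jinja2 for a small regular-expression substitution so that it runs with the standard library alone. It handles only plain `{{ column_name }}` variables, not filters, and `render_template` is a hypothetical helper rather than part of the library.

```python
import re
from typing import Any

_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: Any, row: dict[str, Any]) -> Any:
    """Recursively render strings inside dicts and lists (hypothetical helper)."""
    if isinstance(template, str):
        # Simplified stand-in for Jinja2: plain variable substitution only.
        return _VAR.sub(lambda m: str(row[m.group(1)]), template)
    if isinstance(template, dict):
        return {key: render_template(value, row) for key, value in template.items()}
    if isinstance(template, list):
        return [render_template(item, row) for item in template]
    return template  # numbers, booleans, and None pass through unchanged


template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
}
row = {"question": "What is 2 + 2?", "answer": "4", "category": "math"}
record = render_template(template, row)
```

Each top-level key of the template (`messages` here) corresponds to one column of the transformed dataset, and the nested list of dicts keeps its shape after rendering.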
**Use Cases:**
- Converting flat columns to chat message format
- Restructuring data for specific model training formats
- Creating derived views without modifying the source dataset
## Using Processors
Add processors to your configuration using the builder's `add_processor` method:
```python
import data_designer.config as dd

builder = dd.DataDesignerConfigBuilder()
# ... add columns ...

# Drop intermediate columns
builder.add_processor(
    dd.DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["scratch_work", "raw_context"],
    )
)

# Transform to chat format
builder.add_processor(
    dd.SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)
```
### Execution Order
Processors execute in the order they're added. Plan accordingly when one processor's output affects another.
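
Order matters because each processor sees the output of the one before it. A minimal sketch, using plain functions as hypothetical stand-ins for processors (`drop_scratch`, `summarize`, and `run_in_order` are illustrative names, not library APIs):

```python
from functools import reduce


def drop_scratch(rows):
    # Stand-in for a processor that removes an intermediate column.
    return [{k: v for k, v in r.items() if k != "scratch_work"} for r in rows]


def summarize(rows):
    # Stand-in for a processor that checks whether `scratch_work` is still present.
    return [{**r, "has_scratch": "scratch_work" in r} for r in rows]


def run_in_order(processors, rows):
    # Apply processors one after another, in the order they were added.
    return reduce(lambda acc, proc: proc(acc), processors, rows)


rows = [{"answer": "4", "scratch_work": "2 + 2"}]
drop_first = run_in_order([drop_scratch, summarize], rows)
drop_last = run_in_order([summarize, drop_scratch], rows)
```

When the drop runs first, `summarize` never sees `scratch_work`; when it runs last, `summarize` can still read the column before it is removed.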
## Processor Plugins
You can extend Data Designer with custom processors via the [plugin system](../plugins/overview.md). A processor plugin is a Python package that provides:
- A **config class** inheriting from `ProcessorConfig` with a `processor_type: Literal["your-type"]` discriminator
- An **implementation class** inheriting from `Processor` that overrides the desired callback methods
- A **`Plugin` instance** connecting the two
Once installed, plugin processors are automatically discovered and can be used with `add_processor()` like built-in processors.
```python
from my_processor_plugin.config import MyProcessorConfig

builder.add_processor(
    MyProcessorConfig(
        name="my_processor",
        # ... plugin-specific parameters ...
    )
)
```
**Entry point configuration** in `pyproject.toml`:
```toml
[project.entry-points."data_designer.plugins"]
my-processor = "my_processor_plugin.plugin:my_processor_plugin"
```
See the [plugins overview](../plugins/overview.md) for the full guide on creating plugins.
## Configuration Parameters
### Common Parameters
| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | str | Identifier for the processor, used in output directory names |
### DropColumnsProcessorConfig
| Parameter | Type | Description |
|-----------|------|-------------|
| `column_names` | list[str] | Columns to remove from output |
### SchemaTransformProcessorConfig
| Parameter | Type | Description |
|-----------|------|-------------|
| `template` | dict[str, Any] | Jinja2 template defining the output schema. Must be JSON-serializable. |