DataDesigner/plans/299/processor-plugins.md

# Plan: Processor Plugins

Created: 2026-02-03
Updated: 2026-02-19
Status: Complete

## Goal

Enable third-party processor plugins via the existing plugin discovery mechanism, and create a demo plugin package.

## Context

The callback-based processor design is already on main:
- `ProcessorConfig` base class in `processors.py`
- `Processor` base class with `process_before_batch()` and `process_after_generation()` callbacks
- `ProcessorRunner` handles all stages in the builder
- Preview runs `process_after_generation` via `run_after_generation_on_df()`

The plugin system exists for `COLUMN_GENERATOR` and `SEED_READER` types. The `_types` module pattern
(e.g., `column_types.py`, `seed_source_types.py`) separates base classes from type unions with plugin injection.

`ProcessorConfigT` is currently defined inline in `processors.py` without plugin injection.

## Success Criteria

- [x] `PluginType.PROCESSOR` enables external processor plugins
- [x] ProcessorRegistry loads plugins from entry points
- [x] Demo plugin package demonstrates both preprocessing and postprocessing
- [x] Existing `POST_BATCH` behavior unchanged

## Implementation Steps

### Step 1: Add Processor Plugin Support

Enable third-party processor plugins through the existing plugin system.

- [x] Add `PluginType.PROCESSOR` to the plugin types enum
- [x] Update `discriminator_field` property to return `"processor_type"` for processors
- [x] Update `ProcessorRegistry` to discover and load processor plugins
  - Follow the pattern used for column generator plugins
  - Use string keys for plugin processors (not enum values)

- [x] Inject plugin processor configs into the `ProcessorConfigT` type union
  - Follow the existing `_types` pattern used for columns and seed sources

**Follow the `_types` Module Pattern**:

- Keep `processors.py` with base classes and concrete configs
- Create `processor_types.py` for `ProcessorConfigT` with plugin injection
- Plugin configs import from `processors.py` (no circular dependency)

**Threading Note**:

The `PluginRegistry` uses `Lock`. If plugin discovery triggers nested imports that re-enter
the registry (e.g., a plugin imports `data_designer.config.config_builder`), this will deadlock.
Use `RLock` instead of `Lock` to allow reentrant acquisition.

Import chain that triggers this:
```
plugin.py → data_designer.config.config_builder
          → data_designer.config.column_types (calls PluginManager())
          → data_designer.config.processor_types (calls PluginManager())
          → data_designer.config.seed_source_types (calls PluginManager())
```

### Step 2: Create Demo Plugin Package

Create a separate package demonstrating both processor types.

- [x] Create package structure under `demo/data_designer_demo_processors/`
- [x] Implement `RegexFilterProcessor` (runs at `process_before_batch`)
  - Config: column, pattern, invert flag
  - Filters rows based on regex matching
- [x] Implement `SemanticDedupProcessor` (runs at `process_after_generation`)
  - Config: column, similarity_threshold, model_name
  - Uses embeddings to find and remove similar rows
  - Use sentence-transformers with a small model like `all-MiniLM-L6-v2`

- [x] Configure entry points in `pyproject.toml` under `data_designer.plugins`
- [x] Add unit tests for each processor
- [x] Add README with installation and usage examples

**Logging Suppression** (for sentence-transformers):

- Use `transformers.utils.logging.set_verbosity_error()` to suppress info/warning messages
- Use `transformers.utils.logging.disable_progress_bar()` to suppress progress bars
- Pass `show_progress_bar=False` to `model.encode()` for batch encoding

### Step 3: Demo Notebook

Create a simple, short demo that tests all features end-to-end.

- [x] Use `#%%` cell markers for IDE compatibility
- [x] Keep the demo minimal - just enough to verify the feature works
- [x] Include sample seed data with rows to filter (process_before_batch test)
- [x] Add an LLM column to generate content, use the `openai-text` model
- [x] Configure both process_before_batch and process_after_generation processors
- [x] **Run the demo and fix any issues** - don't just write it, execute it
- [x] Verify the output shows filtering and deduplication working

**Important**: The demo must actually run successfully. Test it before considering this step complete.

**API Notes**: Check the docs for correct Data Designer API usage.

### Step 4: Documentation

Update existing documentation to cover new capabilities.

- [x] Update processor concepts doc with plugin processor info
- [x] Update plugins overview to mention processor plugins
- [x] Include example entry point configuration

## Testing Strategy

- Write tests alongside implementation, not as a separate step
- Use mocks for external dependencies (seed readers, artifact storage)
- For plugin registry tests, create actual mock classes (not Mock objects) to satisfy type validation

## Risks & Considerations

- **Memory usage**: `process_after_generation` holds full dataset in memory
- **Model download**: Embedding models download on first use; perform pre-download on uv install
feat: add processor plugin support (#299) * feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/ 2026-02-25 19:40:01 +00:00			`# Plan: Processor Plugins`

			`Created: 2026-02-03`
			`Updated: 2026-02-19`
			`Status: Complete`

			`## Goal`

			`Enable third-party processor plugins via the existing plugin discovery mechanism, and create a demo plugin package.`

			`## Context`

			`The callback-based processor design is already on main:`
			- `ProcessorConfig` base class in `processors.py`
			- `Processor` base class with `process_before_batch()` and `process_after_generation()` callbacks
			- `ProcessorRunner` handles all stages in the builder
			- Preview runs `process_after_generation` via `run_after_generation_on_df()`

			The plugin system exists for `COLUMN_GENERATOR` and `SEED_READER` types. The `_types` module pattern
			(e.g., `column_types.py`, `seed_source_types.py`) separates base classes from type unions with plugin injection.

			`ProcessorConfigT` is currently defined inline in `processors.py` without plugin injection.

			`## Success Criteria`

			- [x] `PluginType.PROCESSOR` enables external processor plugins
			`- [x] ProcessorRegistry loads plugins from entry points`
			`- [x] Demo plugin package demonstrates both preprocessing and postprocessing`
			- [x] Existing `POST_BATCH` behavior unchanged

			`## Implementation Steps`

			`### Step 1: Add Processor Plugin Support`

			`Enable third-party processor plugins through the existing plugin system.`

			- [x] Add `PluginType.PROCESSOR` to the plugin types enum
			- [x] Update `discriminator_field` property to return `"processor_type"` for processors
			- [x] Update `ProcessorRegistry` to discover and load processor plugins
			`- Follow the pattern used for column generator plugins`
			`- Use string keys for plugin processors (not enum values)`

			- [x] Inject plugin processor configs into the `ProcessorConfigT` type union
			- Follow the existing `_types` pattern used for columns and seed sources

			Follow the `_types` Module Pattern:

			- Keep `processors.py` with base classes and concrete configs
			- Create `processor_types.py` for `ProcessorConfigT` with plugin injection
			- Plugin configs import from `processors.py` (no circular dependency)

			`Threading Note:`

			The `PluginRegistry` uses `Lock`. If plugin discovery triggers nested imports that re-enter
			the registry (e.g., a plugin imports `data_designer.config.config_builder`), this will deadlock.
			Use `RLock` instead of `Lock` to allow reentrant acquisition.

			`Import chain that triggers this:`
			```
			`plugin.py → data_designer.config.config_builder`
			`→ data_designer.config.column_types (calls PluginManager())`
			`→ data_designer.config.processor_types (calls PluginManager())`
			`→ data_designer.config.seed_source_types (calls PluginManager())`
			```

			`### Step 2: Create Demo Plugin Package`

			`Create a separate package demonstrating both processor types.`

			- [x] Create package structure under `demo/data_designer_demo_processors/`
			- [x] Implement `RegexFilterProcessor` (runs at `process_before_batch`)
			`- Config: column, pattern, invert flag`
			`- Filters rows based on regex matching`
			- [x] Implement `SemanticDedupProcessor` (runs at `process_after_generation`)
			`- Config: column, similarity_threshold, model_name`
			`- Uses embeddings to find and remove similar rows`
			- Use sentence-transformers with a small model like `all-MiniLM-L6-v2`

			- [x] Configure entry points in `pyproject.toml` under `data_designer.plugins`
			`- [x] Add unit tests for each processor`
			`- [x] Add README with installation and usage examples`

			`Logging Suppression (for sentence-transformers):`

			- Use `transformers.utils.logging.set_verbosity_error()` to suppress info/warning messages
			- Use `transformers.utils.logging.disable_progress_bar()` to suppress progress bars
			- Pass `show_progress_bar=False` to `model.encode()` for batch encoding

			`### Step 3: Demo Notebook`

			`Create a simple, short demo that tests all features end-to-end.`

			- [x] Use `#%%` cell markers for IDE compatibility
			`- [x] Keep the demo minimal - just enough to verify the feature works`
			`- [x] Include sample seed data with rows to filter (process_before_batch test)`
			- [x] Add an LLM column to generate content, use the `openai-text` model
			`- [x] Configure both process_before_batch and process_after_generation processors`
			`- [x] Run the demo and fix any issues - don't just write it, execute it`
			`- [x] Verify the output shows filtering and deduplication working`

			`Important: The demo must actually run successfully. Test it before considering this step complete.`

			`API Notes: Check the docs for correct Data Designer API usage.`

			`### Step 4: Documentation`

			`Update existing documentation to cover new capabilities.`

			`- [x] Update processor concepts doc with plugin processor info`
			`- [x] Update plugins overview to mention processor plugins`
			`- [x] Include example entry point configuration`

			`## Testing Strategy`

			`- Write tests alongside implementation, not as a separate step`
			`- Use mocks for external dependencies (seed readers, artifact storage)`
			`- For plugin registry tests, create actual mock classes (not Mock objects) to satisfy type validation`

			`## Risks & Considerations`

			- Memory usage: `process_after_generation` holds full dataset in memory
			`- Model download: Embedding models download on first use; perform pre-download on uv install`