* feat: add processor plugin support

  Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation).

  - Add PluginType.PROCESSOR with processor_type discriminator
  - Create processor_types.py for ProcessorConfigT with plugin injection
  - Register plugin processors in engine ProcessorRegistry
  - Use RLock in PluginRegistry to prevent deadlocks during discovery
  - Add demo package: data-designer-demo-processors
  - Update processor and plugin documentation

* test: add processor plugin registration test

  Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly.

* test: simplify processor plugin registration test

* move ProcessorConfig to base and convert demo to e2e test

  - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig)
  - Delete demo/ directory with regex_filter and semantic_dedup plugins
  - Add regex_filter as an e2e processor plugin test in tests_e2e/

* move plan to plans/299/
Plan: Processor Plugins
Created: 2026-02-03 Updated: 2026-02-19 Status: Complete
Goal
Enable third-party processor plugins via the existing plugin discovery mechanism, and create a demo plugin package.
Context
The callback-based processor design is already on main:
- ProcessorConfig base class in processors.py
- Processor base class with process_before_batch() and process_after_generation() callbacks
- ProcessorRunner handles all stages in the builder
- Preview runs process_after_generation via run_after_generation_on_df()
The plugin system exists for COLUMN_GENERATOR and SEED_READER types. The _types module pattern
(e.g., column_types.py, seed_source_types.py) separates base classes from type unions with plugin injection.
ProcessorConfigT is currently defined inline in processors.py without plugin injection.
Success Criteria
- PluginType.PROCESSOR enables external processor plugins
- ProcessorRegistry loads plugins from entry points
- Demo plugin package demonstrates both preprocessing and postprocessing
- Existing POST_BATCH behavior unchanged
Implementation Steps
Step 1: Add Processor Plugin Support
Enable third-party processor plugins through the existing plugin system.
- Add PluginType.PROCESSOR to the plugin types enum
- Update discriminator_field property to return "processor_type" for processors
- Update ProcessorRegistry to discover and load processor plugins
  - Follow the pattern used for column generator plugins
  - Use string keys for plugin processors (not enum values)
- Inject plugin processor configs into the ProcessorConfigT type union
  - Follow the existing _types pattern used for columns and seed sources
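The string-key requirement can be sketched as follows. This is a minimal illustration, not the actual ProcessorRegistry API: BuiltinProcessorType, SketchProcessorRegistry, the stub processor classes, and the "regex-filter" key are all hypothetical names chosen for the example.

```python
from enum import Enum


class BuiltinProcessorType(str, Enum):
    # Hypothetical built-in processor type; the real enum may differ.
    DROP_COLUMNS = "drop_columns"


class SketchProcessorRegistry:
    """Hypothetical registry mapping a processor_type string to an implementation class."""

    def __init__(self) -> None:
        self._impls: dict[str, type] = {}

    def register(self, processor_type, impl: type) -> None:
        # Normalize enum members to their string value so that built-ins
        # and plugins share a single key space.
        key = processor_type.value if isinstance(processor_type, Enum) else processor_type
        if key in self._impls:
            raise ValueError(f"processor type {key!r} is already registered")
        self._impls[key] = impl

    def get(self, processor_type: str) -> type:
        return self._impls[processor_type]


class DropColumnsProcessor:
    """Stand-in for a built-in processor."""


class RegexFilterProcessor:
    """Stand-in for a plugin processor."""


registry = SketchProcessorRegistry()
registry.register(BuiltinProcessorType.DROP_COLUMNS, DropColumnsProcessor)
registry.register("regex-filter", RegexFilterProcessor)  # plugin: plain string key
```

Normalizing to strings means a plugin never has to extend an enum owned by the core package, and a name collision with a built-in is rejected at registration time.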
Follow the _types Module Pattern:
- Keep processors.py with base classes and concrete configs
- Create processor_types.py for ProcessorConfigT with plugin injection
- Plugin configs import from processors.py (no circular dependency)
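A rough sketch of the injection idea, using only the standard library. The config classes and build_processor_config_union are hypothetical stand-ins; the real processor_types.py presumably obtains plugin configs from the plugin manager and may build the union differently.

```python
from dataclasses import dataclass
from typing import Union, get_args


@dataclass
class ProcessorConfig:
    # Hypothetical stand-in for the base class kept in processors.py.
    name: str


@dataclass
class DropColumnsProcessorConfig(ProcessorConfig):
    # Hypothetical built-in config.
    processor_type: str = "drop_columns"


@dataclass
class RegexFilterProcessorConfig(ProcessorConfig):
    # Hypothetical plugin config; it depends only on the base class.
    processor_type: str = "regex-filter"


def build_processor_config_union(builtin_configs, plugin_configs):
    """Return a Union over the built-in and plugin config classes."""
    members = tuple(builtin_configs) + tuple(plugin_configs)
    # Subscripting Union with a tuple is equivalent to Union[A, B, ...].
    return Union[members]


# In this sketch the plugin list is passed in by hand.
ProcessorConfigT = build_processor_config_union(
    [DropColumnsProcessorConfig], [RegexFilterProcessorConfig]
)
```

Because the union is assembled in a separate module, plugin configs only need the base class, which is what keeps the import graph acyclic.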
Threading Note:
The PluginRegistry uses Lock. If plugin discovery triggers nested imports that re-enter
the registry (e.g., a plugin imports data_designer.config.config_builder), this will deadlock.
Use RLock instead of Lock to allow reentrant acquisition.
Import chain that triggers this:
plugin.py → data_designer.config.config_builder
→ data_designer.config.column_types (calls PluginManager())
→ data_designer.config.processor_types (calls PluginManager())
→ data_designer.config.seed_source_types (calls PluginManager())
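The difference between Lock and RLock under re-entry can be reproduced in isolation. The sketch below uses a hypothetical SketchRegistry rather than the real PluginRegistry, and acquires the lock without blocking so that the failing case returns False instead of hanging.

```python
import threading


class SketchRegistry:
    """Hypothetical registry that guards discovery with a lock."""

    def __init__(self, lock) -> None:
        self._lock = lock
        self.discovered: list[str] = []

    def discover(self, depth: int = 0) -> bool:
        # Non-blocking acquire: a blocking acquire on a plain Lock would
        # wait forever when the same thread re-enters.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self.discovered.append(f"depth-{depth}")
            if depth == 0:
                # Simulates a plugin import that re-enters the registry.
                return self.discover(depth=1)
            return True
        finally:
            self._lock.release()


reentrant_ok = SketchRegistry(threading.RLock()).discover()  # True
plain_ok = SketchRegistry(threading.Lock()).discover()       # False
```

With RLock the owning thread may acquire the lock again, so nested discovery completes; with Lock the second acquisition fails, which corresponds to the deadlock described above when the acquire is blocking.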
Step 2: Create Demo Plugin Package
Create a separate package demonstrating both processor types.
- Create package structure under demo/data_designer_demo_processors/
- Implement RegexFilterProcessor (runs at process_before_batch)
  - Config: column, pattern, invert flag
  - Filters rows based on regex matching
- Implement SemanticDedupProcessor (runs at process_after_generation)
  - Config: column, similarity_threshold, model_name
  - Uses embeddings to find and remove similar rows
  - Use sentence-transformers with a small model like all-MiniLM-L6-v2
- Configure entry points in pyproject.toml under data_designer.plugins
- Add unit tests for each processor
- Add README with installation and usage examples
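The filtering logic of RegexFilterProcessor can be sketched without any Data Designer dependencies. The config fields follow the plan (column, pattern, invert); everything else is an assumption. In particular, rows are modeled as a list of dicts and regex_filter is a free function, whereas the real processor would implement process_before_batch and probably operate on a DataFrame.

```python
import re
from dataclasses import dataclass


@dataclass
class RegexFilterConfig:
    # Field names follow the plan; the class itself is hypothetical.
    column: str
    pattern: str
    invert: bool = False


def regex_filter(rows: list[dict], config: RegexFilterConfig) -> list[dict]:
    """Keep rows whose column matches the pattern (or does not, if invert is set)."""
    compiled = re.compile(config.pattern)
    kept = []
    for row in rows:
        matched = compiled.search(str(row[config.column])) is not None
        # Without invert, keep matches; with invert, keep non-matches.
        if matched != config.invert:
            kept.append(row)
    return kept


rows = [
    {"id": 1, "text": "order shipped"},
    {"id": 2, "text": "TEST: ignore me"},
    {"id": 3, "text": "refund issued"},
]
kept = regex_filter(rows, RegexFilterConfig(column="text", pattern=r"^TEST:", invert=True))
```

Here invert=True drops the row that starts with "TEST:" and keeps the other two.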
Logging Suppression (for sentence-transformers):
- Use transformers.utils.logging.set_verbosity_error() to suppress info/warning messages
- Use transformers.utils.logging.disable_progress_bar() to suppress progress bars
- Pass show_progress_bar=False to model.encode() for batch encoding
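Independent of the embedding model and its logging, the deduplication step of SemanticDedupProcessor could look roughly like the following. This is a sketch under assumptions: embeddings are taken as precomputed vectors (for example the output of model.encode()), the strategy is a greedy first-occurrence-wins pass, and the helper names are hypothetical. The plan does not specify the actual algorithm.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup_indices(embeddings: list[list[float]], similarity_threshold: float) -> list[int]:
    """Return indices of rows to keep, dropping rows too similar to an earlier kept row."""
    kept: list[int] = []
    for i, vec in enumerate(embeddings):
        # Keep the row only if it is below the threshold against every kept row.
        if all(cosine(vec, embeddings[j]) < similarity_threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D "embeddings": the second vector is nearly parallel to the first.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
keep = dedup_indices(embeddings, similarity_threshold=0.9)
```

The greedy pass is quadratic in the number of kept rows, which is acceptable for a demo but adds to the memory and runtime concerns for large datasets.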
Step 3: Demo Notebook
Create a simple, short demo that tests all features end-to-end.
- Use #%% cell markers for IDE compatibility
- Keep the demo minimal - just enough to verify the feature works
- Include sample seed data with rows to filter (process_before_batch test)
- Add an LLM column to generate content, use the openai-text model
- Configure both process_before_batch and process_after_generation processors
- Run the demo and fix any issues - don't just write it, execute it
- Verify the output shows filtering and deduplication working
Important: The demo must actually run successfully. Test it before considering this step complete.
API Notes: Check the docs for correct Data Designer API usage.
Step 4: Documentation
Update existing documentation to cover new capabilities.
- Update processor concepts doc with plugin processor info
- Update plugins overview to mention processor plugins
- Include example entry point configuration
Testing Strategy
- Write tests alongside implementation, not as a separate step
- Use mocks for external dependencies (seed readers, artifact storage)
- For plugin registry tests, create actual mock classes (not Mock objects) to satisfy type validation
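The last point can be illustrated with a small sketch. ProcessorBase and validate_impl are hypothetical; they stand in for whatever subclass or type checks the plugin registry applies to registered classes.

```python
from unittest.mock import Mock


class ProcessorBase:
    """Hypothetical base class that plugin implementations must subclass."""


def validate_impl(impl: object) -> bool:
    # Hypothetical validation of the kind a registry might perform:
    # the registered object must be a class deriving from the base.
    return isinstance(impl, type) and issubclass(impl, ProcessorBase)


class StubProcessor(ProcessorBase):
    """A real class defined for the test, so subclass checks pass."""


stub_ok = validate_impl(StubProcessor)  # True
mock_ok = validate_impl(Mock())         # False: a Mock instance is not a class
```

A Mock object is an instance, not a class, so it cannot pass a subclass check; a small stub class can.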
Risks & Considerations
- Memory usage: process_after_generation holds full dataset in memory
- Model download: Embedding models download on first use; perform pre-download on uv install