Elgato_dark/DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Andre Manoel 982ce79ca9

feat: add processor plugin support (#299)

* feat: add processor plugin support

Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).

- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation

* test: add processor plugin registration test

Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.

* test: simplify processor plugin registration test

* move ProcessorConfig to base and convert demo to e2e test

- Move ProcessorConfig from processors.py to config.base to guard
  against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/

* move plan to plans/299/

2026-02-25 16:40:01 -03:00


Plan: Processor Plugins

Created: 2026-02-03
Updated: 2026-02-19
Status: Complete

Goal

Enable third-party processor plugins via the existing plugin discovery mechanism, and create a demo plugin package.

Context

The callback-based processor design is already on main:

ProcessorConfig base class in processors.py
Processor base class with process_before_batch() and process_after_generation() callbacks
ProcessorRunner handles all stages in the builder
Preview runs process_after_generation via run_after_generation_on_df()

The plugin system exists for COLUMN_GENERATOR and SEED_READER types. The _types module pattern (e.g., column_types.py, seed_source_types.py) separates base classes from type unions with plugin injection.

ProcessorConfigT is currently defined inline in processors.py without plugin injection.

Success Criteria

PluginType.PROCESSOR enables external processor plugins
ProcessorRegistry loads plugins from entry points
Demo plugin package demonstrates both preprocessing and postprocessing
Existing POST_BATCH behavior unchanged

Implementation Steps

Step 1: Add Processor Plugin Support

Enable third-party processor plugins through the existing plugin system.

Add PluginType.PROCESSOR to the plugin types enum
Update discriminator_field property to return "processor_type" for processors
Update ProcessorRegistry to discover and load processor plugins
- Follow the pattern used for column generator plugins
- Use string keys for plugin processors (not enum values)
Inject plugin processor configs into the ProcessorConfigT type union
- Follow the existing _types pattern used for columns and seed sources
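
The discovery step above can be sketched with the standard library alone. This is a minimal illustration, not the actual ProcessorRegistry code: the discover() helper and its return shape are assumptions, and only the data_designer.plugins group name comes from this plan. Plugins are keyed by their entry point name, a plain string, as opposed to an enum member.

```python
from importlib.metadata import EntryPoint, entry_points

# Entry point group named in this plan (see Step 2).
PLUGIN_GROUP = "data_designer.plugins"


def discover(eps=None) -> dict[str, object]:
    """Map entry point names (plain strings) to the objects they load.

    Hypothetical helper: the real registry may validate and wrap the loaded
    objects instead of storing them directly.
    """
    if eps is None:
        # Selection by group keyword requires Python 3.10 or newer.
        eps = entry_points(group=PLUGIN_GROUP)
    registry: dict[str, object] = {}
    for ep in eps:
        # String key, so plugins need no entry in a built-in enum.
        registry[ep.name] = ep.load()
    return registry


# Usage with a hand-built entry point; "json:loads" is only a stand-in target.
fake = EntryPoint(name="regex-filter", value="json:loads", group=PLUGIN_GROUP)
plugins = discover([fake])
```

Passing the entry points in as an argument keeps the helper testable without installing a plugin package.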

Follow the _types Module Pattern:

Keep processors.py with base classes and concrete configs
Create processor_types.py for ProcessorConfigT with plugin injection
Plugin configs import from processors.py (no circular dependency)
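
As a rough sketch of what plugin injection into a type union could look like, the example below widens a union of built-in configs with plugin-provided ones. All class names other than ProcessorConfig, and the inject_into_union() helper, are hypothetical stand-ins; the real processor_types.py may build ProcessorConfigT differently.

```python
from dataclasses import dataclass
from typing import Union, get_args


@dataclass
class ProcessorConfig:
    """Stand-in for the real base class."""
    name: str


@dataclass
class DropColumnsConfig(ProcessorConfig):
    """Hypothetical built-in config."""
    processor_type: str = "drop_columns"


@dataclass
class RegexFilterConfig(ProcessorConfig):
    """Hypothetical plugin-provided config."""
    processor_type: str = "regex-filter"


def inject_into_union(base_union, plugin_configs):
    """Return a union of the built-in members plus the plugin config classes."""
    # get_args() returns () for a single class, so fall back to that class.
    members = get_args(base_union) or (base_union,)
    return Union[tuple(members) + tuple(plugin_configs)]


# Built-ins only, then widened with whatever plugins were discovered.
BuiltinConfigT = DropColumnsConfig
ProcessorConfigT = inject_into_union(BuiltinConfigT, [RegexFilterConfig])
```

Because the plugin config module only needs the base class, and the union is assembled in a separate module, neither side has to import the other at class-definition time.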

Threading Note:

The PluginRegistry uses Lock, which is not reentrant. If plugin discovery triggers nested imports that re-enter the registry on the same thread (e.g., a plugin imports data_designer.config.config_builder), the second acquisition blocks forever and discovery deadlocks. Use RLock instead of Lock to allow reentrant acquisition.

Import chain that triggers this:

plugin.py → data_designer.config.config_builder
          → data_designer.config.column_types (calls PluginManager())
          → data_designer.config.processor_types (calls PluginManager())
          → data_designer.config.seed_source_types (calls PluginManager())
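
The difference between the two lock types can be reproduced in isolation. The Registry class below is a toy stand-in, not the real PluginRegistry; it uses a non-blocking acquire so that the Lock case reports a failure instead of hanging.

```python
import threading


class Registry:
    """Toy registry whose discover() re-enters itself once, like a nested import."""

    def __init__(self, lock):
        self._lock = lock
        self.depths = []

    def discover(self, depth=0):
        # A blocking acquire would hang here for a plain Lock; non-blocking
        # acquire turns the would-be deadlock into a False return value.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self.depths.append(depth)
            if depth == 0:
                # Simulates a plugin import that calls back into the registry.
                return self.discover(depth + 1)
            return True
        finally:
            self._lock.release()


with_lock = Registry(threading.Lock()).discover()    # re-entry is refused
with_rlock = Registry(threading.RLock()).discover()  # re-entry succeeds
```

With RLock the same thread may acquire the lock again, so the nested call completes; other threads are still excluded until every acquisition has been released.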

Step 2: Create Demo Plugin Package

Create a separate package demonstrating both processor callbacks (process_before_batch and process_after_generation).

Create package structure under demo/data_designer_demo_processors/
Implement RegexFilterProcessor (runs at process_before_batch)
- Config: column, pattern, invert flag
- Filters rows based on regex matching
Implement SemanticDedupProcessor (runs at process_after_generation)
- Config: column, similarity_threshold, model_name
- Uses embeddings to find and remove similar rows
- Use sentence-transformers with a small model like all-MiniLM-L6-v2
Configure entry points in pyproject.toml under data_designer.plugins
Add unit tests for each processor
Add README with installation and usage examples
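
The filtering behavior described for RegexFilterProcessor can be sketched without the Data Designer base classes. This is an assumption-laden illustration: rows are modeled as a list of dicts rather than whatever batch type the processor callback actually receives, regex_filter() is a hypothetical helper, and re.search semantics (match anywhere in the value) are a guess at the intended behavior.

```python
import re


def regex_filter(rows, column, pattern, invert=False):
    """Keep rows whose `column` value matches `pattern`; drop them if invert is set."""
    compiled = re.compile(pattern)
    kept = []
    for row in rows:
        matched = compiled.search(str(row[column])) is not None
        # Keep on match by default; with invert=True keep the non-matching rows.
        if matched != invert:
            kept.append(row)
    return kept


seed = [{"text": "hello world"}, {"text": "order 123"}]
with_digits = regex_filter(seed, column="text", pattern=r"\d+")
without_digits = regex_filter(seed, column="text", pattern=r"\d+", invert=True)
```

The three parameters mirror the config fields listed above (column, pattern, invert flag).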

Logging Suppression (for sentence-transformers):

Use transformers.utils.logging.set_verbosity_error() to suppress info/warning messages
Use transformers.utils.logging.disable_progress_bar() to suppress progress bars
Pass show_progress_bar=False to model.encode() for batch encoding
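
Independently of how the embeddings are produced, the row-removal step of SemanticDedupProcessor can be sketched on precomputed vectors. The greedy strategy below (keep a row only if it is not too similar to any row already kept) is one plausible reading of "find and remove similar rows", not necessarily what the plugin implements; in the plugin the vectors would come from the embedding model, while here they are hand-written.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup_indices(embeddings, similarity_threshold):
    """Return indices of rows to keep, dropping near-duplicates of earlier rows."""
    kept = []
    for i, vec in enumerate(embeddings):
        # Keep the row only if it is below the threshold against every kept row.
        if all(cosine(vec, embeddings[j]) < similarity_threshold for j in kept):
            kept.append(i)
    return kept


# Rows 0 and 1 point in almost the same direction; row 2 is orthogonal to row 0.
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
keep = dedup_indices(vectors, similarity_threshold=0.9)
```

The pairwise comparison is quadratic in the number of kept rows, which is acceptable for a demo but ties into the memory and scale concerns of process_after_generation.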

Step 3: Demo Notebook

Create a simple, short demo that tests all features end-to-end.

Use #%% cell markers for IDE compatibility
Keep the demo minimal - just enough to verify the feature works
Include sample seed data with rows to filter (process_before_batch test)
Add an LLM column to generate content, using the openai-text model
Configure both process_before_batch and process_after_generation processors
Run the demo and fix any issues - don't just write it, execute it
Verify the output shows filtering and deduplication working

Important: The demo must actually run successfully. Test it before considering this step complete.

API Notes: Check the docs for correct Data Designer API usage.

Step 4: Documentation

Update existing documentation to cover new capabilities.

Update processor concepts doc with plugin processor info
Update plugins overview to mention processor plugins
Include example entry point configuration

Testing Strategy

Write tests alongside implementation, not as a separate step
Use mocks for external dependencies (seed readers, artifact storage)
For plugin registry tests, create actual mock classes (not Mock objects) to satisfy type validation
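
The last point can be illustrated with a small check. The validation function here is an assumption about what "type validation" involves (a subclass check against the processor base class); the real registry may check more. Under that assumption, a Mock instance fails while a small hand-written subclass passes.

```python
from unittest.mock import Mock


class Processor:
    """Stand-in for the real processor base class."""


class StubProcessor(Processor):
    """Hand-written test double: a real class, so subclass checks succeed."""

    def process_before_batch(self, data):
        return data


def is_valid_impl(candidate):
    """Hypothetical validation: the candidate must be a Processor subclass."""
    return isinstance(candidate, type) and issubclass(candidate, Processor)


stub_ok = is_valid_impl(StubProcessor)  # real subclass passes
mock_ok = is_valid_impl(Mock())         # Mock instance is not a class
```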

Risks & Considerations

Memory usage: process_after_generation holds the full dataset in memory
Model download: Embedding models are downloaded on first use; pre-download them during uv install