mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
123 lines
5 KiB
Markdown
123 lines
5 KiB
Markdown
|
|
# Plan: Processor Plugins
|
||
|
|
|
||
|
|
Created: 2026-02-03
|
||
|
|
Updated: 2026-02-19
|
||
|
|
Status: Complete
|
||
|
|
|
||
|
|
## Goal
|
||
|
|
|
||
|
|
Enable third-party processor plugins via the existing plugin discovery mechanism, and create a demo plugin package.
|
||
|
|
|
||
|
|
## Context
|
||
|
|
|
||
|
|
The callback-based processor design is already on main:
|
||
|
|
- `ProcessorConfig` base class in `processors.py`
|
||
|
|
- `Processor` base class with `process_before_batch()` and `process_after_generation()` callbacks
|
||
|
|
- `ProcessorRunner` handles all stages in the builder
|
||
|
|
- Preview runs `process_after_generation` via `run_after_generation_on_df()`
|
||
|
|
|
||
|
|
The plugin system exists for `COLUMN_GENERATOR` and `SEED_READER` types. The `_types` module pattern
|
||
|
|
(e.g., `column_types.py`, `seed_source_types.py`) separates base classes from type unions with plugin injection.
|
||
|
|
|
||
|
|
`ProcessorConfigT` is currently defined inline in `processors.py` without plugin injection.
|
||
|
|
|
||
|
|
## Success Criteria
|
||
|
|
|
||
|
|
- [x] `PluginType.PROCESSOR` enables external processor plugins
|
||
|
|
- [x] ProcessorRegistry loads plugins from entry points
|
||
|
|
- [x] Demo plugin package demonstrates both preprocessing and postprocessing
|
||
|
|
- [x] Existing `POST_BATCH` behavior unchanged
|
||
|
|
|
||
|
|
## Implementation Steps
|
||
|
|
|
||
|
|
### Step 1: Add Processor Plugin Support
|
||
|
|
|
||
|
|
Enable third-party processor plugins through the existing plugin system.
|
||
|
|
|
||
|
|
- [x] Add `PluginType.PROCESSOR` to the plugin types enum
|
||
|
|
- [x] Update `discriminator_field` property to return `"processor_type"` for processors
|
||
|
|
- [x] Update `ProcessorRegistry` to discover and load processor plugins
|
||
|
|
- Follow the pattern used for column generator plugins
|
||
|
|
- Use string keys for plugin processors (not enum values)
|
||
|
|
|
||
|
|
- [x] Inject plugin processor configs into the `ProcessorConfigT` type union
|
||
|
|
- Follow the existing `_types` pattern used for columns and seed sources
|
||
|
|
|
||
|
|
**Follow the `_types` Module Pattern**:
|
||
|
|
|
||
|
|
- Keep `processors.py` with base classes and concrete configs
|
||
|
|
- Create `processor_types.py` for `ProcessorConfigT` with plugin injection
|
||
|
|
- Plugin configs import from `processors.py` (no circular dependency)
|
||
|
|
|
||
|
|
**Threading Note**:
|
||
|
|
|
||
|
|
The `PluginRegistry` uses `Lock`. If plugin discovery triggers nested imports that re-enter
|
||
|
|
the registry (e.g., a plugin imports `data_designer.config.config_builder`), this will deadlock.
|
||
|
|
Use `RLock` instead of `Lock` to allow reentrant acquisition.
|
||
|
|
|
||
|
|
Import chain that triggers this:
|
||
|
|
```
|
||
|
|
plugin.py → data_designer.config.config_builder
|
||
|
|
→ data_designer.config.column_types (calls PluginManager())
|
||
|
|
→ data_designer.config.processor_types (calls PluginManager())
|
||
|
|
→ data_designer.config.seed_source_types (calls PluginManager())
|
||
|
|
```
|
||
|
|
|
||
|
|
### Step 2: Create Demo Plugin Package
|
||
|
|
|
||
|
|
Create a separate package demonstrating both processor types.
|
||
|
|
|
||
|
|
- [x] Create package structure under `demo/data_designer_demo_processors/`
|
||
|
|
- [x] Implement `RegexFilterProcessor` (runs at `process_before_batch`)
|
||
|
|
- Config: column, pattern, invert flag
|
||
|
|
- Filters rows based on regex matching
|
||
|
|
- [x] Implement `SemanticDedupProcessor` (runs at `process_after_generation`)
|
||
|
|
- Config: column, similarity_threshold, model_name
|
||
|
|
- Uses embeddings to find and remove similar rows
|
||
|
|
- Use sentence-transformers with a small model like `all-MiniLM-L6-v2`
|
||
|
|
|
||
|
|
- [x] Configure entry points in `pyproject.toml` under `data_designer.plugins`
|
||
|
|
- [x] Add unit tests for each processor
|
||
|
|
- [x] Add README with installation and usage examples
|
||
|
|
|
||
|
|
**Logging Suppression** (for sentence-transformers):
|
||
|
|
|
||
|
|
- Use `transformers.utils.logging.set_verbosity_error()` to suppress info/warning messages
|
||
|
|
- Use `transformers.utils.logging.disable_progress_bar()` to suppress progress bars
|
||
|
|
- Pass `show_progress_bar=False` to `model.encode()` for batch encoding
|
||
|
|
|
||
|
|
### Step 3: Demo Notebook
|
||
|
|
|
||
|
|
Create a simple, short demo that tests all features end-to-end.
|
||
|
|
|
||
|
|
- [x] Use `#%%` cell markers for IDE compatibility
|
||
|
|
- [x] Keep the demo minimal - just enough to verify the feature works
|
||
|
|
- [x] Include sample seed data with rows to filter (process_before_batch test)
|
||
|
|
- [x] Add an LLM column to generate content, use the `openai-text` model
|
||
|
|
- [x] Configure both process_before_batch and process_after_generation processors
|
||
|
|
- [x] **Run the demo and fix any issues** - don't just write it, execute it
|
||
|
|
- [x] Verify the output shows filtering and deduplication working
|
||
|
|
|
||
|
|
**Important**: The demo must actually run successfully. Test it before considering this step complete.
|
||
|
|
|
||
|
|
**API Notes**: Check the docs for correct Data Designer API usage.
|
||
|
|
|
||
|
|
### Step 4: Documentation
|
||
|
|
|
||
|
|
Update existing documentation to cover new capabilities.
|
||
|
|
|
||
|
|
- [x] Update processor concepts doc with plugin processor info
|
||
|
|
- [x] Update plugins overview to mention processor plugins
|
||
|
|
- [x] Include example entry point configuration
|
||
|
|
|
||
|
|
## Testing Strategy
|
||
|
|
|
||
|
|
- Write tests alongside implementation, not as a separate step
|
||
|
|
- Use mocks for external dependencies (seed readers, artifact storage)
|
||
|
|
- For plugin registry tests, create actual mock classes (not Mock objects) to satisfy type validation
|
||
|
|
|
||
|
|
## Risks & Considerations
|
||
|
|
|
||
|
|
- **Memory usage**: `process_after_generation` holds full dataset in memory
|
||
|
|
- **Model download**: Embedding models download on first use; perform pre-download on uv install
|