mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

docs: graduate plugins out of experimental mode (#603 )

* chore: add __init__.py to engine namespace subpackages

Griffe (used by mkdocstrings) skips directories without __init__.py
when resolving module paths, which prevented the new plugins code
reference from rendering SeedReader, FileSystemSeedReader, and
Processor. Adding empty __init__.py files in engine/resources/,
engine/processing/, and engine/processing/processors/ aligns with
the convention already used in engine/mcp/, engine/models/, etc.

* docs: flesh out docstrings on plugin extension-point classes

Plugin authors now see meaningful descriptions for every field and
method on the bases rendered in the plugins code reference:

- Plugin and PluginType: class docstrings + Attributes tables for
  fields and enum members; fix typo in config_qualified_name field
  description.
- SingleColumnConfig: document allow_resize.
- ProcessorConfig: document processor_type discriminator.
- SeedSource: document seed_type discriminator.
- FileSystemSeedSource: add class docstring + Attributes table for
  path / file_pattern / recursive.
- ColumnGeneratorFullColumn and ColumnGeneratorCellByCell: add
  class docstrings explaining when to use each base, plus method
  docstrings on the abstract generate() implementations.

* docs: graduate plugins out of experimental mode

Restructures plugin documentation around the now-stable extension
points (column generator, seed reader, processor) and treats plugins
as a first-class story for customizing Data Designer.

- Add code_reference/plugins.md: single-stop reference for the Plugin
  object and the config + implementation base classes used by all
  three plugin types.
- Add code_reference/generators.md: column generator implementation
  base classes, separated from column configs.
- Surface SingleColumnConfig in code_reference/column_configs.md.
- Add plugins/implement.md ("Build Your Own"): per-type implementation
  instructions across column generators, seed readers, and processors.
- Add plugins/processor.md: complete processor plugin package example.
- Rewrite plugins/overview.md: open with why plugins exist, drop the
  internal-helpers note (PluginRegistry / PluginManager), and focus
  the guide on what plugin builders need.
- Refresh plugins/available.md (Catalog) and
  plugins/filesystem_seed_reader.md to match the new structure.
- Delete plugins/example.md (replaced by per-type guides).
- Reorder Code Reference nav alphabetically and add the new pages.
- Minor link / wording fixes in concepts/processors.md and
  concepts/deployment-options.md.

* docs: simplify plugin docs structure

Replace the overview's how-to walkthrough and the per-type plugin
guides with a single Build Your Own page that covers all three
plugin types side-by-side. Add a dedicated Using Models in Plugins
guide and a seed_readers code reference, and trim the overview down
to what the plugin types are, how to use one, and how discovery
works.

- Rename plugins/implement.md to plugins/build_your_own.md.
- Delete plugins/filesystem_seed_reader.md and plugins/processor.md
  (their content is now in build_your_own.md and the per-type code
  references).
- Add plugins/models.md for model-backed column generator authoring.
- Add code_reference/seed_readers.md for seed reader implementation
  base classes.
- Rewrite plugins/overview.md: shorter intro, type bullets link to
  the relevant code reference, drop the multi-step "How do you
  create plugins" walkthrough in favor of a single Build a Plugin
  pointer, tighten Discovery troubleshooting.
- Refresh plugins/available.md (Available Plugins): point to the
  DataDesignerPlugins catalog and explain how to request a community
  listing.
- Update cross-page links in concepts/processors.md,
  concepts/seed-datasets.md, recipes/plugin_development/markdown_seed_reader.md,
  code_reference/plugins.md, and code_reference/generators.md to
  match the new structure.
- Update mkdocs.yml nav: rename to Build Your Own, add Using Models,
  add seed_readers code reference.

* docs: scroll wide tables horizontally instead of wrapping

Code-heavy reference tables (plugin bases, column generators, etc.)
were wrapping aggressively on narrow viewports, breaking long
identifiers across multiple lines. Switch the table container to
horizontal overflow and prevent code cells from wrapping so
identifiers stay readable.

* docs: address PR #603 review feedback

- Add an Implementation base section to code_reference/processors.md
  rendering the engine-side Processor class. This justifies the
  engine/processing/__init__.py files added earlier and gives
  processor plugin authors an auto-rendered API reference, matching
  the pattern used by code_reference/generators.md and seed_readers.md.
- build_your_own.md: replace the placeholder "x" emoji on the
  IndexMultiplier example with the actual multiplication sign.
- build_your_own.md: drop the manual `re.compile + apply(lambda)`
  pattern in the regex-filter processor in favor of the idiomatic
  `Series.str.contains(..., regex=True)`.
- build_your_own.md: add a kernel-restart caveat after the editable
  install instructions — PluginRegistry caches discovery on first
  import, so notebooks need a fresh kernel to pick up freshly
  installed plugins.
- build_your_own.md: state explicitly what `assert_valid_plugin`
  checks (config base + plugin-type-appropriate impl base).
- code_reference/plugins.md: link out to the processors code
  reference alongside generators and seed_readers.

* docs: split code reference by package

* docs: add interface code reference

* docs: add code reference overviews

* docs: refine code reference pages

* docs: improve code reference tables

* docs: correct reference docstrings

* docs: embed plugin catalog table

* docs: note plugin discovery restart caveat

* docs: explain generator base class choice

* docs: mention async cell generator examples

* docs: clarify plugin model usage

* docs: clarify plugin model aliases

* docs: address plugin review feedback

* docs: update available plugins page

2026-05-06 18:12:44 -04:00

10 KiB

Raw Blame History

Using Models in Plugins

Model access belongs in column generator implementations, not config objects. Keep the config declarative by asking users for model aliases, then resolve those aliases at runtime through the model registry.

Do not construct model clients in plugin configs, read API keys in configs, or bypass Data Designer's model providers. The engine builds a ResourceProvider and exposes its model registry to every generator at:

self.resource_provider.model_registry

Access the registry

Use a model-aware column generator base whenever your plugin needs the registry:

Need	Base class	Registry access
Primary model alias	`ColumnGeneratorWithModel`	Use `self.model`, `self.model_config`, and `self.inference_parameters`.
Multiple aliases or provider inspection	`ColumnGeneratorWithModelRegistry`	Use `self.get_model(alias)`, `self.get_model_config(alias)`, and `self.get_model_provider_name(alias)`.

ColumnGeneratorWithModel is a convenience subclass of ColumnGeneratorWithModelRegistry. It expects the config to have a model_alias field and resolves that one alias for you. For independent model calls, return GenerationStrategy.CELL_BY_CELL so the runtime can fan out rows like the built-in LLM, embedding, and image generators. Use full-column generation only when your plugin intentionally calls a batched API for the whole DataFrame.

from __future__ import annotations

from data_designer.config.column_configs import GenerationStrategy
from data_designer.engine.column_generators.generators.base import ColumnGeneratorWithModel
from data_designer.engine.models.parsers.errors import ParserException

from data_designer_sentiment_label.config import SentimentLabelColumnConfig


def parse_sentiment_label(response: str) -> str:
    label = response.strip().lower()
    if label not in {"positive", "neutral", "negative"}:
        raise ParserException("Expected exactly one of: positive, neutral, negative.", source=response)
    return label


class SentimentLabelColumnGenerator(ColumnGeneratorWithModel[SentimentLabelColumnConfig]):
    @staticmethod
    def get_generation_strategy() -> GenerationStrategy:
        return GenerationStrategy.CELL_BY_CELL

    async def agenerate(self, data: dict) -> dict:
        label, _ = await self.model.agenerate(
            prompt=f"Classify the sentiment of this text: {data[self.config.source_column]}",
            system_prompt="Return exactly one label: positive, neutral, or negative.",
            parser=parse_sentiment_label,
            max_correction_steps=self.resource_provider.run_config.max_conversation_correction_steps,
            max_conversation_restarts=self.resource_provider.run_config.max_conversation_restarts,
            purpose=f"running generation for column '{self.config.name}'",
        )
        data[self.config.name] = label
        return data

The matching config must include model_alias: str as a normal user-facing field:

from __future__ import annotations

from typing import Literal

from data_designer.config.base import SingleColumnConfig


class SentimentLabelColumnConfig(SingleColumnConfig):
    column_type: Literal["sentiment-label"] = "sentiment-label"
    source_column: str
    model_alias: str

    @property
    def required_columns(self) -> list[str]:
        return [self.source_column]

    @property
    def side_effect_columns(self) -> list[str]:
        return []

Users set that alias from default model settings or from DataDesignerConfigBuilder(model_configs=...).

Use multiple models

If your plugin uses multiple model aliases, inherit from ColumnGeneratorWithModelRegistry and resolve each alias explicitly with self.get_model(...).

The config must include a primary model_alias: str field. Startup health checks read it directly from any column config whose generator inherits from ColumnGeneratorWithModelRegistry, including generators that inherit through ColumnGeneratorWithModel. A config for this pattern might also define judge_model_alias, critic_model_alias, or another task-specific alias.

Validate additional alias fields in _validate() or _initialize() with get_model_config(...) so missing aliases fail before generation starts. get_model_config(alias) only verifies that the alias is registered; it does not call the endpoint. Endpoint reachability is only exercised for the primary model_alias collected by the standard startup health check.

The matching config shows which alias gets the standard startup health check and which alias the plugin validates itself:

from __future__ import annotations

from typing import Literal

from data_designer.config.base import SingleColumnConfig


class PairwiseJudgeColumnConfig(SingleColumnConfig):
    column_type: Literal["pairwise-judge"] = "pairwise-judge"
    question_column: str
    model_alias: str
    judge_model_alias: str

    @property
    def required_columns(self) -> list[str]:
        return [self.question_column]

    @property
    def side_effect_columns(self) -> list[str]:
        return []

from __future__ import annotations

from data_designer.config.column_configs import GenerationStrategy
from data_designer.engine.column_generators.generators.base import ColumnGeneratorWithModelRegistry
from data_designer.engine.models.parsers.errors import ParserException

from data_designer_pairwise_judge.config import PairwiseJudgeColumnConfig


def parse_score(response: str) -> int:
    text = response.strip()
    if text not in {"1", "2", "3", "4", "5"}:
        raise ParserException("Expected an integer score from 1 to 5.", source=response)
    return int(text)


class PairwiseJudgeColumnGenerator(ColumnGeneratorWithModelRegistry[PairwiseJudgeColumnConfig]):
    @staticmethod
    def get_generation_strategy() -> GenerationStrategy:
        return GenerationStrategy.CELL_BY_CELL

    def _validate(self) -> None:
        self.get_model_config(self.config.model_alias)
        self.get_model_config(self.config.judge_model_alias)

    async def agenerate(self, data: dict) -> dict:
        generator_model = self.get_model(self.config.model_alias)
        judge_model = self.get_model(self.config.judge_model_alias)
        retry_kwargs = {
            "max_correction_steps": self.resource_provider.run_config.max_conversation_correction_steps,
            "max_conversation_restarts": self.resource_provider.run_config.max_conversation_restarts,
        }

        draft, _ = await generator_model.agenerate(
            prompt=f"Draft an answer for: {data[self.config.question_column]}",
            purpose=f"drafting an answer for column '{self.config.name}'",
            **retry_kwargs,
        )
        score, _ = await judge_model.agenerate(
            prompt=f"Score this answer from 1 to 5: {draft}",
            system_prompt="Return exactly one integer from 1 to 5.",
            parser=parse_score,
            purpose=f"judging an answer for column '{self.config.name}'",
            **retry_kwargs,
        )
        data[self.config.name] = {"draft": draft, "score": score}
        return data

What the registry returns

get_model(...) returns a ModelFacade. Call the facade based on the modality your plugin needs:

Chat completion aliases use model.generate(...) or await model.agenerate(...) and return (parsed_output, trace).
Embedding aliases use model.generate_text_embeddings(...) or await model.agenerate_text_embeddings(...) and return list[list[float]].
Image aliases use model.generate_image(...) or await model.agenerate_image(...) and return list[str] of base64-encoded image data.

Choose a model alias whose ModelConfig.inference_parameters.generation_type matches the facade method you call. The facade merges the alias's configured inference parameters into each request.

Pass runtime context such as prompt, system_prompt, parser, tool_alias, multi_modal_context, max_correction_steps, max_conversation_restarts, and purpose at the call site. Parser functions should raise ParserException for invalid model responses; that is what allows ModelFacade.generate(...) and ModelFacade.agenerate(...) to run correction turns and conversation restarts.

Prefer implementing agenerate(...) for model-backed plugins. The base generate(...) method can bridge to agenerate(...) for sync runs when the subclass only implements async generation. If your plugin has a sync-specific path, implement both generate(...) and agenerate(...), as the built-in generators do.

Health checks and scheduling

The model-aware bases mark the generator as LLM-bound, so the async scheduler treats the work like other model calls.

Plugin discovery treats column generator implementations that inherit from ColumnGeneratorWithModelRegistry as model-generated column types for startup model health checks. The standard health-check collection reads a primary model_alias field directly from the config. Additional alias fields should be registration-validated by the plugin implementation; their endpoints are not pinged by the standard startup health check.

Built-in patterns

The built-in model-backed generators use these same hooks:

LLMTextCellGenerator, LLMCodeCellGenerator, LLMStructuredCellGenerator, and LLMJudgeCellGenerator inherit through a chat-completion base that uses ColumnGeneratorWithModel. They render prompts from row data, call self.model.generate(...) or self.model.agenerate(...), pass parsers into the ModelFacade, and store optional trace side-effect columns.
EmbeddingCellGenerator uses ColumnGeneratorWithModel but calls the facade's embedding methods instead of chat completion.
ImageCellGenerator uses ColumnGeneratorWithModel, renders a prompt, calls the facade's image methods, and writes generated media through the artifact storage supplied by the same ResourceProvider.
CustomColumnGenerator is the inline-function counterpart: when users declare model_aliases, it builds a models dict from resource_provider.model_registry. Packaged plugins usually use ColumnGeneratorWithModel or ColumnGeneratorWithModelRegistry directly instead of recreating that dict.

See Column Generators for the full base-class API and Custom Model Settings for configuring model aliases.

10 KiB Raw Blame History