mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

docs: graduate plugins out of experimental mode (#603 )

* chore: add __init__.py to engine namespace subpackages

Griffe (used by mkdocstrings) skips directories without __init__.py
when resolving module paths, which prevented the new plugins code
reference from rendering SeedReader, FileSystemSeedReader, and
Processor. Adding empty __init__.py files in engine/resources/,
engine/processing/, and engine/processing/processors/ aligns with
the convention already used in engine/mcp/, engine/models/, etc.

* docs: flesh out docstrings on plugin extension-point classes

Plugin authors now see meaningful descriptions for every field and
method on the bases rendered in the plugins code reference:

- Plugin and PluginType: class docstrings + Attributes tables for
  fields and enum members; fix typo in config_qualified_name field
  description.
- SingleColumnConfig: document allow_resize.
- ProcessorConfig: document processor_type discriminator.
- SeedSource: document seed_type discriminator.
- FileSystemSeedSource: add class docstring + Attributes table for
  path / file_pattern / recursive.
- ColumnGeneratorFullColumn and ColumnGeneratorCellByCell: add
  class docstrings explaining when to use each base, plus method
  docstrings on the abstract generate() implementations.

* docs: graduate plugins out of experimental mode

Restructures plugin documentation around the now-stable extension
points (column generator, seed reader, processor) and treats plugins
as a first-class story for customizing Data Designer.

- Add code_reference/plugins.md: single-stop reference for the Plugin
  object and the config + implementation base classes used by all
  three plugin types.
- Add code_reference/generators.md: column generator implementation
  base classes, separated from column configs.
- Surface SingleColumnConfig in code_reference/column_configs.md.
- Add plugins/implement.md ("Build Your Own"): per-type implementation
  instructions across column generators, seed readers, and processors.
- Add plugins/processor.md: complete processor plugin package example.
- Rewrite plugins/overview.md: open with why plugins exist, drop the
  internal-helpers note (PluginRegistry / PluginManager), and focus
  the guide on what plugin builders need.
- Refresh plugins/available.md (Catalog) and
  plugins/filesystem_seed_reader.md to match the new structure.
- Delete plugins/example.md (replaced by per-type guides).
- Reorder Code Reference nav alphabetically and add the new pages.
- Minor link / wording fixes in concepts/processors.md and
  concepts/deployment-options.md.

* docs: simplify plugin docs structure

Replace the overview's how-to walkthrough and the per-type plugin
guides with a single Build Your Own page that covers all three
plugin types side-by-side. Add a dedicated Using Models in Plugins
guide and a seed_readers code reference, and trim the overview down
to what the plugin types are, how to use one, and how discovery
works.

- Rename plugins/implement.md to plugins/build_your_own.md.
- Delete plugins/filesystem_seed_reader.md and plugins/processor.md
  (their content is now in build_your_own.md and the per-type code
  references).
- Add plugins/models.md for model-backed column generator authoring.
- Add code_reference/seed_readers.md for seed reader implementation
  base classes.
- Rewrite plugins/overview.md: shorter intro, type bullets link to
  the relevant code reference, drop the multi-step "How do you
  create plugins" walkthrough in favor of a single Build a Plugin
  pointer, tighten Discovery troubleshooting.
- Refresh plugins/available.md (Available Plugins): point to the
  DataDesignerPlugins catalog and explain how to request a community
  listing.
- Update cross-page links in concepts/processors.md,
  concepts/seed-datasets.md, recipes/plugin_development/markdown_seed_reader.md,
  code_reference/plugins.md, and code_reference/generators.md to
  match the new structure.
- Update mkdocs.yml nav: rename to Build Your Own, add Using Models,
  add seed_readers code reference.

* docs: scroll wide tables horizontally instead of wrapping

Code-heavy reference tables (plugin bases, column generators, etc.)
were wrapping aggressively on narrow viewports, breaking long
identifiers across multiple lines. Switch the table container to
horizontal overflow and prevent code cells from wrapping so
identifiers stay readable.

* docs: address PR #603 review feedback

- Add an Implementation base section to code_reference/processors.md
  rendering the engine-side Processor class. This justifies the
  engine/processing/__init__.py files added earlier and gives
  processor plugin authors an auto-rendered API reference, matching
  the pattern used by code_reference/generators.md and seed_readers.md.
- build_your_own.md: replace the placeholder "x" emoji on the
  IndexMultiplier example with the actual multiplication sign.
- build_your_own.md: drop the manual `re.compile + apply(lambda)`
  pattern in the regex-filter processor in favor of the idiomatic
  `Series.str.contains(..., regex=True)`.
- build_your_own.md: add a kernel-restart caveat after the editable
  install instructions — PluginRegistry caches discovery on first
  import, so notebooks need a fresh kernel to pick up freshly
  installed plugins.
- build_your_own.md: state explicitly what `assert_valid_plugin`
  checks (config base + plugin-type-appropriate impl base).
- code_reference/plugins.md: link out to the processors code
  reference alongside generators and seed_readers.

* docs: split code reference by package

* docs: add interface code reference

* docs: add code reference overviews

* docs: refine code reference pages

* docs: improve code reference tables

* docs: correct reference docstrings

* docs: embed plugin catalog table

* docs: note plugin discovery restart caveat

* docs: explain generator base class choice

* docs: mention async cell generator examples

* docs: clarify plugin model usage

* docs: clarify plugin model aliases

* docs: address plugin review feedback

* docs: update available plugins page

2026-05-06 18:12:44 -04:00

8 KiB

Raw Permalink Blame History

Deployment Options: Library vs. Microservice

Data Designer is available as both an open-source library and a NeMo Microservice. This guide helps you choose the right deployment option for your use case.

Deployment Architectures at a Glance

Data Designer supports three main deployment patterns:

Library + Your LLM Provider

Each user runs the library locally and connects to their choice of LLM provider.
Library + Enterprise Gateway

Users run the library locally but share a centralized enterprise LLM gateway with RBAC and governance.
SDG as a Service (Microservice)

A centralized SDG service that multiple users access via REST API.

Quick Comparison

Aspect	Open-Source Library	NeMo Microservice
What it is	Python package you import and run	REST API service exposing `preview` and `create` methods
Best for	Developers with LLM access who want flexibility and customization	Teams using NeMo Microservices platform
LLM Access	You provide (any OpenAI-compatible API)	Integrated with NeMo Microservices Platform
Installation	`pip install data-designer`	Deploy via NeMo Microservices platform
Scaling	You manage inference capacity	Managed alongside other NeMo services

!!! success "Same Configuration API" Both the library and microservice use the same DataDesignerConfigBuilder API. Start with the library, and your configurations migrate seamlessly if you later adopt the NeMo platform.

📦 When to Use the Open-Source Library

The library is the right choice for most users. Choose it if you:

You Have Access to LLMs

{ align=right width="350" }

You have API keys or endpoints for LLM inference:

Cloud APIs: NVIDIA API Catalog (build.nvidia.com), OpenAI, Azure OpenAI, Anthropic
Self-hosted: vLLM, TGI, TensorRT-LLM, or any OpenAI-compatible server
Enterprise gateways: Centralized LLM gateway with RBAC, rate limiting, or other enterprise features

from data_designer.interface import DataDesigner
from data_designer.config import ModelConfig

# Use any OpenAI-compatible endpoint
model = ModelConfig(
    alias="my-model",
    model="nvidia/nemotron-3-nano-30b-a3b",
    provider="nvidia",  # or "openai", or a custom ModelProvider
)

dd = DataDesigner()
# Your code controls the full workflow

You Need Maximum Flexibility

Custom plugins: Extend Data Designer with custom column generators, seed readers, or processors
Local development: Rapid iteration with immediate feedback
Integration: Embed Data Designer into existing Python pipelines or notebooks
Experimentation: Research workflows with custom models or configurations

You Already Have Enterprise LLM Infrastructure

{ align=right width="350" }

!!! tip "Library + Enterprise LLM Gateway" Many enterprises already have centralized LLM access through API gateways with:

- Role-based access control (RBAC)
- Rate limiting and quotas
- Audit logging
- Cost allocation

In this case, **use the library** and point it at your enterprise gateway. You get enterprise-grade LLM access while retaining full control over your Data Designer workflows.

from data_designer.config import ModelConfig, ModelProvider

# Define your enterprise gateway as a provider
enterprise_provider = ModelProvider(
    name="enterprise-gateway",
    endpoint="https://llm-gateway.yourcompany.com/v1",
    api_key="ENTERPRISE_LLM_KEY",  # Environment variable name (uppercase) or actual key
)

# Use the provider in your model config
model = ModelConfig(
    alias="enterprise-llm",
    model="gpt-4",
    provider="enterprise-gateway",  # References the provider above
)

☁️ When to Use the Microservice

{ align=right width="350" }

The NeMo Microservice exposes Data Designer's preview and create methods as REST API endpoints. Choose it if you:

You're Using the NeMo Microservices Platform

The primary value of the microservice is integration with other NeMo Microservices:

NeMo Inference Microservices (NIMs): Seamless integration with NVIDIA's optimized inference endpoints
NeMo Customizer: Generate synthetic data for model fine-tuning workflows
NeMo Evaluator: Create evaluation datasets alongside model assessment
Unified deployment: Single platform for your entire AI pipeline

You Want to Expose SDG as a Team Service

If you need to provide synthetic data generation as a shared service:

Multi-tenant access: Multiple teams submit generation jobs via API
Job management: Queue, monitor, and manage generation jobs centrally
Resource sharing: Shared infrastructure for SDG workloads

When users can submit configs containing Jinja templates to a shared engine, template rendering becomes a remote code execution concern and part of your security boundary. See Security for guidance on when to keep the default JinjaRenderingEngine.SECURE mode.

🧭 Decision Flowchart

                    ┌─────────────────────────┐
                    │ Are you using the NeMo  │
                    │ Microservices platform? │
                    └───────────┬─────────────┘
                                │
                    ┌───────────┴───────────┐
                    ▼                       ▼
                   YES                      NO
                    │                       │
                    ▼                       ▼
        ┌───────────────────┐   ┌───────────────────────────┐
        │ Use Microservice  │   │ Do you need to expose SDG │
        │                   │   │ as a shared REST service? │
        │ Integrates with   │   └─────────────┬─────────────┘
        │ NIMs, Customizer, │                 │
        │ Evaluator         │     ┌───────────┴───────────┐
        └───────────────────┘     ▼                       ▼
                                 YES                      NO
                                  │                       │
                                  ▼                       ▼
                      ┌─────────────────────┐   ┌─────────────────┐
                      │ Consider if the     │   │ Use the Library │
                      │ overhead is worth   │   │                 │
                      │ it vs. library +    │   │ Most flexible   │
                      │ enterprise gateway  │   │ option for      │
                      └─────────────────────┘   │ direct use      │
                                                └─────────────────┘

Learn More

Library: Continue with this documentation
Microservice: See the NeMo Data Designer Microservice documentation{target="_blank"}
Security model: See Security

8 KiB Raw Permalink Blame History

Deployment Options: Library vs. Microservice

Deployment Architectures at a Glance

Quick Comparison

📦 When to Use the Open-Source Library

You Have Access to LLMs

You Need Maximum Flexibility

You Already Have Enterprise LLM Infrastructure

☁️ When to Use the Microservice

You're Using the NeMo Microservices Platform

You Want to Expose SDG as a Team Service

🧭 Decision Flowchart

Learn More

8 KiB

Raw Permalink Blame History