mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

docs: graduate plugins out of experimental mode (#603)

* chore: add __init__.py to engine namespace subpackages

Griffe (used by mkdocstrings) skips directories without __init__.py
when resolving module paths, which prevented the new plugins code
reference from rendering SeedReader, FileSystemSeedReader, and
Processor. Adding empty __init__.py files in engine/resources/,
engine/processing/, and engine/processing/processors/ aligns with
the convention already used in engine/mcp/, engine/models/, etc.

* docs: flesh out docstrings on plugin extension-point classes

Plugin authors now see meaningful descriptions for every field and
method on the bases rendered in the plugins code reference:

- Plugin and PluginType: class docstrings + Attributes tables for
  fields and enum members; fix typo in config_qualified_name field
  description.
- SingleColumnConfig: document allow_resize.
- ProcessorConfig: document processor_type discriminator.
- SeedSource: document seed_type discriminator.
- FileSystemSeedSource: add class docstring + Attributes table for
  path / file_pattern / recursive.
- ColumnGeneratorFullColumn and ColumnGeneratorCellByCell: add
  class docstrings explaining when to use each base, plus method
  docstrings on the abstract generate() implementations.

* docs: graduate plugins out of experimental mode

Restructures plugin documentation around the now-stable extension
points (column generator, seed reader, processor) and treats plugins
as a first-class story for customizing Data Designer.

- Add code_reference/plugins.md: single-stop reference for the Plugin
  object and the config + implementation base classes used by all
  three plugin types.
- Add code_reference/generators.md: column generator implementation
  base classes, separated from column configs.
- Surface SingleColumnConfig in code_reference/column_configs.md.
- Add plugins/implement.md ("Build Your Own"): per-type implementation
  instructions across column generators, seed readers, and processors.
- Add plugins/processor.md: complete processor plugin package example.
- Rewrite plugins/overview.md: open with why plugins exist, drop the
  internal-helpers note (PluginRegistry / PluginManager), and focus
  the guide on what plugin builders need.
- Refresh plugins/available.md (Catalog) and
  plugins/filesystem_seed_reader.md to match the new structure.
- Delete plugins/example.md (replaced by per-type guides).
- Reorder Code Reference nav alphabetically and add the new pages.
- Minor link / wording fixes in concepts/processors.md and
  concepts/deployment-options.md.

* docs: simplify plugin docs structure

Replace the overview's how-to walkthrough and the per-type plugin
guides with a single Build Your Own page that covers all three
plugin types side-by-side. Add a dedicated Using Models in Plugins
guide and a seed_readers code reference, and trim the overview down
to what the plugin types are, how to use one, and how discovery
works.

- Rename plugins/implement.md to plugins/build_your_own.md.
- Delete plugins/filesystem_seed_reader.md and plugins/processor.md
  (their content is now in build_your_own.md and the per-type code
  references).
- Add plugins/models.md for model-backed column generator authoring.
- Add code_reference/seed_readers.md for seed reader implementation
  base classes.
- Rewrite plugins/overview.md: shorter intro, type bullets link to
  the relevant code reference, drop the multi-step "How do you
  create plugins" walkthrough in favor of a single Build a Plugin
  pointer, tighten Discovery troubleshooting.
- Refresh plugins/available.md (Available Plugins): point to the
  DataDesignerPlugins catalog and explain how to request a community
  listing.
- Update cross-page links in concepts/processors.md,
  concepts/seed-datasets.md, recipes/plugin_development/markdown_seed_reader.md,
  code_reference/plugins.md, and code_reference/generators.md to
  match the new structure.
- Update mkdocs.yml nav: rename to Build Your Own, add Using Models,
  add seed_readers code reference.

* docs: scroll wide tables horizontally instead of wrapping

Code-heavy reference tables (plugin bases, column generators, etc.)
were wrapping aggressively on narrow viewports, breaking long
identifiers across multiple lines. Switch the table container to
horizontal overflow and prevent code cells from wrapping so
identifiers stay readable.

* docs: address PR #603 review feedback

- Add an Implementation base section to code_reference/processors.md
  rendering the engine-side Processor class. This justifies the
  engine/processing/__init__.py files added earlier and gives
  processor plugin authors an auto-rendered API reference, matching
  the pattern used by code_reference/generators.md and seed_readers.md.
- build_your_own.md: replace the placeholder "x" emoji on the
  IndexMultiplier example with the actual multiplication sign.
- build_your_own.md: drop the manual `re.compile + apply(lambda)`
  pattern in the regex-filter processor in favor of the idiomatic
  `Series.str.contains(..., regex=True)`.
- build_your_own.md: add a kernel-restart caveat after the editable
  install instructions — PluginRegistry caches discovery on first
  import, so notebooks need a fresh kernel to pick up freshly
  installed plugins.
- build_your_own.md: state explicitly what `assert_valid_plugin`
  checks (config base + plugin-type-appropriate impl base).
- code_reference/plugins.md: link out to the processors code
  reference alongside generators and seed_readers.

* docs: split code reference by package

* docs: add interface code reference

* docs: add code reference overviews

* docs: refine code reference pages

* docs: improve code reference tables

* docs: correct reference docstrings

* docs: embed plugin catalog table

* docs: note plugin discovery restart caveat

* docs: explain generator base class choice

* docs: mention async cell generator examples

* docs: clarify plugin model usage

* docs: clarify plugin model aliases

* docs: address plugin review feedback

* docs: update available plugins page

2026-05-06 18:12:44 -04:00

Overview

Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.

🚀 Setting Up Your Environment

Local Setup Best Practices

First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:

=== "uv (Recommended)"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Launch Jupyter
uv run jupyter notebook
```

=== "pip + venv"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter

# Launch Jupyter
jupyter notebook
```

API Keys and Authentication

Data Designer can interface with various LLM providers. Set an API key for each provider whose models you want to use:

# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"

# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"

For more information, see the Welcome page, Default Model Settings, and Configure Model Settings Using The CLI.
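Before opening a notebook, it can help to confirm which of the keys above are visible to Python. The following is a minimal sketch using only the standard library; the helper name `missing_api_keys` is hypothetical and is not part of Data Designer.

```python
import os

# Variable names copied from the export commands above.
PROVIDER_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def missing_api_keys(env=None, names=PROVIDER_KEYS):
    """Return the names in `names` that are unset or blank in `env`.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Report on the current process environment.
for name in PROVIDER_KEYS:
    status = "missing" if name in missing_api_keys() else "set"
    print(f"{name}: {status}")
```

A key reported as missing only matters if you plan to use that provider. Remember that `export` applies to the current shell, so launch Jupyter from the same terminal session.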

📚 Tutorial Series

The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:

1. The Basics

Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:

- Setting up the DataDesigner interface
- Configuring models and inference parameters
- Using built-in samplers (Category, Person, Uniform)
- Generating LLM text columns with dependencies
- Understanding the generation workflow

Start here if you're new to Data Designer!

2. Structured Outputs, Jinja Expressions, and Conditional Generation

Explore more advanced data generation capabilities:

- Creating structured JSON outputs with schemas
- Using Jinja expressions for derived columns
- Combining samplers with structured data
- Building complex data dependencies
- Working with nested data structures
- Conditional generation with skip.when

3. Seeding with an External Dataset

Learn how to leverage existing datasets to guide synthetic data generation:

- Loading and using seed datasets
- Sampling from real data distributions
- Combining seed data with LLM generation
- Creating realistic synthetic data based on existing patterns

4. Providing Images as Context

Learn how to use vision-language models to generate text descriptions from images:

- Processing and converting images to base64 format for model consumption
- Using vision-language models (VLMs) to analyze visual documents
- Generating detailed summaries from document images
- Inspecting and validating vision-based generation results
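The first step in that list, converting an image to base64, can be sketched with the standard library alone. The helper names below are hypothetical, and the notebook may prepare images differently (for example by resizing or changing format first); this only shows the encoding itself.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(data).decode("ascii")


def encode_image_file(path) -> str:
    """Read an image file from disk and return its base64 encoding."""
    return encode_image_bytes(Path(path).read_bytes())


# Round trip on a few bytes standing in for real image data.
sample = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature
encoded = encode_image_bytes(sample)
assert base64.b64decode(encoded) == sample
```

Base64 output is roughly a third larger than the input bytes, so large document images translate into long strings in the request sent to the model.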

5. Generating Images

Generate synthetic image data with Data Designer:

- Configuring image-generation models with ImageInferenceParams
- Adding image columns with Jinja2 prompts and sampler-driven diversity
- Preview (base64 in dataframe) vs create (images saved to disk, paths in dataframe)
- Displaying generated images in the notebook
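The preview and create modes differ in where the image bytes live: inline as a base64 string, or on disk with only a path kept in the dataframe. The sketch below illustrates that conversion with the standard library. It is not Data Designer's own code, and the function name and file naming are assumptions for illustration.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_image(b64_string: str, out_dir, filename: str) -> Path:
    """Decode a base64 string and write the bytes to out_dir/filename.

    Hypothetical helper: returns the path that would replace the
    inline base64 value in a dataframe cell.
    """
    out_path = Path(out_dir) / filename
    out_path.write_bytes(base64.b64decode(b64_string))
    return out_path


# Demonstrate with placeholder bytes in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    inline_value = base64.b64encode(b"fake image bytes").decode("ascii")
    path = save_base64_image(inline_value, tmp, "image_0.png")
    print(path.name, path.stat().st_size, "bytes")
```

Keeping paths rather than inline strings keeps the dataframe small, which matters once a run produces many images.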

6. Image-to-Image Editing

Chain image generation columns to generate and then edit images:

- Generating images from text and then editing them in a follow-up column
- Using ImageContext with auto-detection to pass generated images to an editing model
- Combining sampled accessories and settings for varied edits
- Comparing generated vs edited images in preview and create modes

📖 Important Documentation Sections

Before diving into the tutorials, familiarize yourself with these key documentation sections:

Getting Started

- Welcome & Installation - Overview of Data Designer capabilities and installation instructions

Core Concepts

Understanding these concepts will help you make the most of the tutorials:

- Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
- Validators - Understand how to validate generated data with Python, SQL, and remote validators
- Person Sampling - Learn how to sample realistic person data with demographic attributes

Code Reference

Quick reference guides for the main configuration objects:

- column_configs - All column configuration types
- config_builder - The DataDesignerConfigBuilder API
- data_designer_config - Main configuration schema
- validator_params - Validator configuration options
