DataDesigner/docs/concepts/seed-datasets.md
Johnny Greco 8b8d748446
docs: graduate plugins out of experimental mode (#603)
2026-05-06 18:12:44 -04:00

Seed Datasets

Seed datasets let you bootstrap synthetic data generation from existing data. Instead of generating everything from scratch, you provide a dataset whose columns become available as context in your prompts and expressions—grounding your synthetic data in real-world examples.

!!! tip "When to Use Seed Datasets"
    Seed datasets shine when you have real data you want to build on:

    - Product catalogs → generate customer reviews
    - Medical diagnoses → generate physician notes
    - Code snippets → generate documentation
    - Company profiles → generate financial reports

    The seed data provides realism and domain specificity; Data Designer adds volume and variation.

The Basic Pattern

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Define your model configuration
model_configs = [
    dd.ModelConfig(
        alias="my-model",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

# 1. Attach a seed dataset
seed_source = dd.LocalFileSeedSource(path="products.csv")
config_builder.with_seed_dataset(seed_source)

# 2. Reference seed columns in your prompts
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="my-model",
        prompt="""\
Write a customer review for {{ product_name }}.
Category: {{ category }}
Price: ${{ price }}
""",
    )
)

Every column in your seed dataset becomes available as a Jinja2 variable in prompts and expressions. Data Designer automatically:

  • Reads rows from the seed dataset
  • Injects seed column values into templates
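
As a rough mental model of the two steps above, the sketch below renders one prompt per seed row. It uses a small regex-based stand-in for Jinja2 that only handles plain `{{ column }}` placeholders. The `render` helper and the sample rows are hypothetical and do not reflect how Data Designer is implemented internally.

```python
import re

# Hypothetical seed rows, as if they had been read from products.csv.
seed_rows = [
    {"product_name": "Trail Mug", "category": "Outdoors", "price": "14.99"},
    {"product_name": "Desk Lamp", "category": "Home", "price": "32.00"},
]

TEMPLATE = (
    "Write a customer review for {{ product_name }}.\n"
    "Category: {{ category }}\n"
    "Price: ${{ price }}\n"
)

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, row: dict[str, str]) -> str:
    """Replace each {{ column }} placeholder with the value from one seed row."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


# One rendered prompt per seed row.
prompts = [render(TEMPLATE, row) for row in seed_rows]
print(prompts[0])
```

Real Jinja2 templates also support attribute access, filters, and control flow, which this stand-in deliberately leaves out.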

Seed Sources

Data Designer supports multiple ways to provide seed data, including:

📁 LocalFileSeedSource

Load from a local file—CSV, Parquet, or JSON.

# Single file
seed_source = dd.LocalFileSeedSource(path="data/products.csv")

# Parquet files with wildcard
seed_source = dd.LocalFileSeedSource(path="data/products/*.parquet")

!!! note "Supported Formats"
    - CSV (.csv)
    - Parquet (.parquet)
    - JSON (.json, .jsonl)

🤗 HuggingFaceSeedSource

Load a dataset directly from Hugging Face without downloading it manually.

seed_source = dd.HuggingFaceSeedSource(
    path="datasets/gretelai/symptom_to_diagnosis/data/train.parquet",
    token="hf_...",  # Optional, for private datasets
)

🐼 DataFrameSeedSource

Use an in-memory pandas DataFrame—great for preprocessing or combining multiple sources.

import pandas as pd

df = pd.read_csv("raw_data.csv")
df = df[df["quality_score"] > 0.8]  # Filter to high-quality rows

seed_source = dd.DataFrameSeedSource(df=df)

!!! warning "Serialization"
    DataFrameSeedSource can't be serialized to YAML/JSON configs. Use LocalFileSeedSource if you need to save and share configurations.

🗂️ DirectorySeedSource

Treat a directory tree as the seed dataset. Each matching file becomes one seed row, exposing file metadata you can reference in prompts and expressions.

seed_source = dd.DirectorySeedSource(
    path="docs/",
    file_pattern="*.md",
    recursive=True,
)

config_builder.with_seed_dataset(seed_source)
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="doc_label",
        expr="{{ source_kind }}::{{ relative_path }}",
    )
)

Directory-backed seed datasets expose these columns:

  • source_kind — always "directory_file"
  • source_path — full path to the matched file
  • relative_path — path relative to the configured directory
  • file_name — basename of the matched file

!!! note "Filesystem matching"
    file_pattern matches file names only, not relative paths. recursive=True is the default, so nested subdirectories are searched unless you turn it off.
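
To make the matching rules concrete, here is a minimal standard-library sketch that mimics the behavior described above: the pattern is tested against the file name only, and `recursive` controls whether subdirectories are walked. The `list_directory_rows` helper is hypothetical and is not the library's reader. It only reproduces the four columns listed above.

```python
import fnmatch
import tempfile
from pathlib import Path


def list_directory_rows(root: str, file_pattern: str = "*", recursive: bool = True) -> list[dict[str, str]]:
    """Build one row per matching file with the directory-backed seed columns."""
    root_path = Path(root)
    candidates = root_path.rglob("*") if recursive else root_path.glob("*")
    rows = []
    for path in sorted(candidates):
        # Match against the file name only, never the relative path.
        if path.is_file() and fnmatch.fnmatch(path.name, file_pattern):
            rows.append(
                {
                    "source_kind": "directory_file",
                    "source_path": str(path),
                    "relative_path": path.relative_to(root_path).as_posix(),
                    "file_name": path.name,
                }
            )
    return rows


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "guides").mkdir()
    (Path(tmp) / "index.md").write_text("# Index")
    (Path(tmp) / "guides" / "setup.md").write_text("# Setup")
    (Path(tmp) / "guides" / "notes.txt").write_text("scratch")

    nested = [row["relative_path"] for row in list_directory_rows(tmp, "*.md")]
    top_only = [row["relative_path"] for row in list_directory_rows(tmp, "*.md", recursive=False)]
    # A path-style pattern matches nothing because only file names are compared.
    by_path = list_directory_rows(tmp, "guides/*.md")

print(nested, top_only, by_path)
```

In this sketch, narrowing the seed dataset to one subdirectory means pointing `path` at that subdirectory rather than encoding it in `file_pattern`.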

📄 FileContentsSeedSource

Read matching text files into the seed dataset. Each file becomes one seed row with the same metadata as DirectorySeedSource, plus the decoded file contents in a content column.

seed_source = dd.FileContentsSeedSource(
    path="docs/",
    file_pattern="*.md",
    encoding="utf-8",
)

config_builder.with_seed_dataset(seed_source)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="summary",
        model_alias="my-model",
        prompt="""\
Summarize the following document.

File: {{ file_name }}
Path: {{ relative_path }}

{{ content }}
""",
    )
)

FileContentsSeedSource exposes these seeded columns:

  • source_kind — always "file_contents"
  • source_path — full path to the matched file
  • relative_path — path relative to the configured directory
  • file_name — basename of the matched file
  • content — decoded text contents of the matched file

!!! tip "Custom Filesystem Readers"
    If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom FileSystemSeedReader and pass it via DataDesigner(seed_readers=[...]). For packaging and registration, see Build Your Own.

!!! note "Encoding"
    encoding="utf-8" is the default. Set a different Python codec name if your files use another text encoding.

🤖 AgentRolloutSeedSource

Parse agent rollout trace files (e.g. from ATIF, Claude Code, Codex, or Hermes Agent) into a structured seed dataset. Each trace becomes one seed row with normalized metadata and the full message history, ready for distillation or analysis pipelines.

seed_source = dd.AgentRolloutSeedSource(
    format=dd.AgentRolloutFormat.CLAUDE_CODE,
)

config_builder.with_seed_dataset(seed_source)

!!! info "Dedicated guide"
    See Agent Rollout Ingestion for the rollout-specific guide, including:

    - supported rollout formats and default locations
    - format-specific configuration details like `path` and `file_pattern`
    - the full normalized seeded-column schema exposed by `AgentRolloutSeedSource`

!!! tip "Trace Distillation"
    See the Agent Rollout Trace Distillation recipe for a complete example that turns agent traces into supervised fine-tuning data.

Sampling Strategies

Control how rows are read from the seed dataset.

Ordered (Default)

Rows are read sequentially in their original order. Each generated record corresponds to the next row in the seed dataset. If you request more records than the seed dataset contains, reading wraps around to the first row and continues in order until all records are generated.

config_builder.with_seed_dataset(
    seed_source,
    sampling_strategy=dd.SamplingStrategy.ORDERED,
)
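
The wrap-around behavior described above can be pictured with a few lines of plain Python. This is an illustration of the ordering only, using a hypothetical three-row seed dataset, and not the library's reader code.

```python
from itertools import cycle, islice

seed_rows = ["row-0", "row-1", "row-2"]  # hypothetical 3-row seed dataset
num_records = 7

# Read rows in order, wrapping back to the first row when the seed data runs out.
assigned = list(islice(cycle(seed_rows), num_records))
print(assigned)
# ['row-0', 'row-1', 'row-2', 'row-0', 'row-1', 'row-2', 'row-0']
```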

Shuffle

Rows are randomly shuffled before sampling. Useful when your seed data has some ordering you want to break.

config_builder.with_seed_dataset(
    seed_source,
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)

Selection Strategies

Select a subset of your seed dataset—useful for large datasets or parallel processing.

IndexRange

Select a specific range of row indices.

# Use only rows 100-199, both bounds inclusive (100 rows total)
config_builder.with_seed_dataset(
    seed_source,
    selection_strategy=dd.IndexRange(start=100, end=199),
)
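
The row count in the comment above implies that both `start` and `end` are inclusive. Under that assumption, the selection is equivalent to the slice below. The `select_index_range` helper is hypothetical and exists only to show the off-by-one detail.

```python
def select_index_range(rows: list, start: int, end: int) -> list:
    """Return rows[start..end], treating `end` as inclusive."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # Python slices exclude the upper bound, so add 1 to include `end`.
    return rows[start : end + 1]


rows = list(range(1_000))  # stand-in for a 1,000-row seed dataset
selected = select_index_range(rows, start=100, end=199)
print(len(selected), selected[0], selected[-1])  # 100 100 199
```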

PartitionBlock

Split the dataset into N equal partitions and select one. Perfect for distributing work across multiple jobs.

# Split into 5 partitions, use the 3rd one (index=2, zero-based)
config_builder.with_seed_dataset(
    seed_source,
    selection_strategy=dd.PartitionBlock(index=2, num_partitions=5),
)

!!! tip "Parallel Processing"
    Run 5 jobs, each with a different partition index, to process a large seed dataset in parallel:

    ```python
    # Job 0: PartitionBlock(index=0, num_partitions=5)
    # Job 1: PartitionBlock(index=1, num_partitions=5)
    # Job 2: PartitionBlock(index=2, num_partitions=5)
    # ...
    ```
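
For intuition about which rows each job sees, the sketch below maps a partition index to an inclusive row range. It assumes contiguous blocks of `dataset_size // num_partitions` rows, with the last partition absorbing any remainder. That remainder rule is an assumption made for illustration, and the `partition_bounds` helper is hypothetical, so check the library's behavior before relying on exact boundaries.

```python
def partition_bounds(index: int, num_partitions: int, dataset_size: int) -> tuple[int, int]:
    """Return inclusive (start, end) row indices for one partition."""
    if not 0 <= index < num_partitions:
        raise ValueError("index must be in [0, num_partitions)")
    block = dataset_size // num_partitions
    start = index * block
    # Assumption: the last partition absorbs any leftover rows.
    end = dataset_size - 1 if index == num_partitions - 1 else start + block - 1
    return start, end


bounds = [partition_bounds(i, 5, 1_003) for i in range(5)]
print(bounds)
# [(0, 199), (200, 399), (400, 599), (600, 799), (800, 1002)]
```

Under these assumptions every row belongs to exactly one partition, which is the property that makes the jobs safe to run independently.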

Combining Strategies

Sampling and selection strategies work together. For example, shuffle rows within a specific partition:

config_builder.with_seed_dataset(
    seed_source,
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
    selection_strategy=dd.PartitionBlock(index=0, num_partitions=10),
)

Complete Example

Here's a complete example generating physician notes from a symptom-to-diagnosis seed dataset:

import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

model_configs = [
    dd.ModelConfig(
        alias="medical-notes",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

# Attach seed dataset (has 'diagnosis' and 'symptoms' columns)
seed_source = dd.LocalFileSeedSource(path="symptom_to_diagnosis.csv")
config_builder.with_seed_dataset(seed_source)

# Generate patient info
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="patient_name",
        expr="{{ patient.first_name }} {{ patient.last_name }}",
    )
)

# Generate notes grounded in seed data
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        model_alias="medical-notes",
        prompt="""\
You are a physician writing notes after a patient visit.

Patient: {{ patient_name }}
Diagnosis: {{ diagnosis }}
Reported Symptoms: {{ symptoms }}

Write detailed clinical notes for this visit.
""",
    )
)

# Preview
preview = data_designer.preview(config_builder, num_records=5)
preview.display_sample_record()

Best Practices

Keep Seed Data Clean

Garbage in, garbage out. Clean your seed data before using it:

  • Remove duplicates
  • Fix encoding issues
  • Filter out low-quality rows
  • Standardize column names
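
A minimal standard-library sketch of three of these steps (standardizing column names, filtering incomplete rows, and removing duplicates) might look like the following. The helper names and sample rows are hypothetical. In practice you may prefer pandas for the same work, and encoding problems are best fixed when the file is first read.

```python
import re


def standardize(name: str) -> str:
    """Lowercase a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_rows(rows: list[dict[str, str]], required: list[str]) -> list[dict[str, str]]:
    """Standardize column names, drop rows missing required values, and deduplicate."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {standardize(key): value.strip() for key, value in row.items()}
        if any(not normalized.get(column) for column in required):
            continue  # low-quality row: a required field is empty
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(normalized)
    return cleaned


raw = [
    {"Product Name": "Trail Mug ", "Category": "Outdoors"},
    {"Product Name": "Trail Mug", "Category": "Outdoors"},  # duplicate after trimming
    {"Product Name": "", "Category": "Home"},  # missing name
]
cleaned = clean_rows(raw, required=["product_name"])
print(cleaned)
# [{'product_name': 'Trail Mug', 'category': 'Outdoors'}]
```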

Match Generation Volume to Seed Size

If your seed dataset has 1,000 rows and you generate 10,000 records, each seed row will be used ~10 times. Consider whether that's appropriate for your use case.
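
The reuse arithmetic is easy to check before launching a large job. The sketch below assumes rows are cycled in order, as with the ordered strategy. `seed_row_reuse` is a hypothetical helper, not part of the library.

```python
def seed_row_reuse(num_records: int, num_seed_rows: int) -> tuple[int, int]:
    """Return (min_uses, max_uses) per seed row when rows are cycled in order."""
    full_passes, remainder = divmod(num_records, num_seed_rows)
    # Rows reached during the final partial pass are used one extra time.
    max_uses = full_passes + (1 if remainder else 0)
    return full_passes, max_uses


print(seed_row_reuse(10_000, 1_000))  # (10, 10)
print(seed_row_reuse(2_500, 1_000))   # (2, 3)
```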

Use Seed Data for Diversity Control

Seed datasets are excellent for controlling the distribution of your synthetic data. Want 30% electronics, 50% clothing, 20% home goods? Curate your seed dataset to match.
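
One way to curate such a seed dataset is to sample a fixed quota from each category before handing the rows to Data Designer. The sketch below is a standard-library illustration. `curate_by_quota`, the `category` column, and the sample pool are all hypothetical.

```python
import random
from collections import Counter


def curate_by_quota(rows, key, quotas, total, seed=0):
    """Sample `total` rows so each category's share matches `quotas`."""
    rng = random.Random(seed)
    curated = []
    for category, share in quotas.items():
        pool = [row for row in rows if row[key] == category]
        count = round(total * share)
        if count > len(pool):
            raise ValueError(f"not enough rows for {category!r}")
        # Sample without replacement so no seed row is duplicated.
        curated.extend(rng.sample(pool, count))
    rng.shuffle(curated)  # avoid long runs of a single category
    return curated


pool = [
    {"category": category, "sku": f"{category}-{i}"}
    for category in ("electronics", "clothing", "home goods")
    for i in range(100)
]
quotas = {"electronics": 0.3, "clothing": 0.5, "home goods": 0.2}
seed_rows = curate_by_quota(pool, "category", quotas, total=100)
counts = Counter(row["category"] for row in seed_rows)
print(counts)
```

Because each quota is rounded independently, the curated size can differ slightly from `total` for shares that do not divide evenly.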