* chore: add __init__.py to engine namespace subpackages
Griffe (used by mkdocstrings) skips directories without __init__.py
when resolving module paths, which prevented the new plugins code
reference from rendering SeedReader, FileSystemSeedReader, and
Processor. Adding empty __init__.py files in engine/resources/,
engine/processing/, and engine/processing/processors/ aligns with
the convention already used in engine/mcp/, engine/models/, etc.
* docs: flesh out docstrings on plugin extension-point classes
Plugin authors now see meaningful descriptions for every field and
method on the bases rendered in the plugins code reference:
- Plugin and PluginType: class docstrings + Attributes tables for
fields and enum members; fix typo in config_qualified_name field
description.
- SingleColumnConfig: document allow_resize.
- ProcessorConfig: document processor_type discriminator.
- SeedSource: document seed_type discriminator.
- FileSystemSeedSource: add class docstring + Attributes table for
path / file_pattern / recursive.
- ColumnGeneratorFullColumn and ColumnGeneratorCellByCell: add
class docstrings explaining when to use each base, plus method
docstrings on the abstract generate() implementations.
* docs: graduate plugins out of experimental mode
Restructures plugin documentation around the now-stable extension
points (column generator, seed reader, processor) and treats plugins
as a first-class story for customizing Data Designer.
- Add code_reference/plugins.md: single-stop reference for the Plugin
object and the config + implementation base classes used by all
three plugin types.
- Add code_reference/generators.md: column generator implementation
base classes, separated from column configs.
- Surface SingleColumnConfig in code_reference/column_configs.md.
- Add plugins/implement.md ("Build Your Own"): per-type implementation
instructions across column generators, seed readers, and processors.
- Add plugins/processor.md: complete processor plugin package example.
- Rewrite plugins/overview.md: open with why plugins exist, drop the
internal-helpers note (PluginRegistry / PluginManager), and focus
the guide on what plugin builders need.
- Refresh plugins/available.md (Catalog) and
plugins/filesystem_seed_reader.md to match the new structure.
- Delete plugins/example.md (replaced by per-type guides).
- Reorder Code Reference nav alphabetically and add the new pages.
- Minor link / wording fixes in concepts/processors.md and
concepts/deployment-options.md.
* docs: simplify plugin docs structure
Replace the overview's how-to walkthrough and the per-type plugin
guides with a single Build Your Own page that covers all three
plugin types side-by-side. Add a dedicated Using Models in Plugins
guide and a seed_readers code reference, and trim the overview down
to what the plugin types are, how to use one, and how discovery
works.
- Rename plugins/implement.md to plugins/build_your_own.md.
- Delete plugins/filesystem_seed_reader.md and plugins/processor.md
(their content is now in build_your_own.md and the per-type code
references).
- Add plugins/models.md for model-backed column generator authoring.
- Add code_reference/seed_readers.md for seed reader implementation
base classes.
- Rewrite plugins/overview.md: shorter intro, type bullets link to
the relevant code reference, drop the multi-step "How do you
create plugins" walkthrough in favor of a single Build a Plugin
pointer, tighten Discovery troubleshooting.
- Refresh plugins/available.md (Available Plugins): point to the
DataDesignerPlugins catalog and explain how to request a community
listing.
- Update cross-page links in concepts/processors.md,
concepts/seed-datasets.md, recipes/plugin_development/markdown_seed_reader.md,
code_reference/plugins.md, and code_reference/generators.md to
match the new structure.
- Update mkdocs.yml nav: rename to Build Your Own, add Using Models,
add seed_readers code reference.
* docs: scroll wide tables horizontally instead of wrapping
Code-heavy reference tables (plugin bases, column generators, etc.)
were wrapping aggressively on narrow viewports, breaking long
identifiers across multiple lines. Switch the table container to
horizontal overflow and prevent code cells from wrapping so
identifiers stay readable.
* docs: address PR #603 review feedback
- Add an Implementation base section to code_reference/processors.md
rendering the engine-side Processor class. This justifies the
engine/processing/__init__.py files added earlier and gives
processor plugin authors an auto-rendered API reference, matching
the pattern used by code_reference/generators.md and seed_readers.md.
- build_your_own.md: replace the placeholder "x" emoji on the
IndexMultiplier example with the actual multiplication sign.
- build_your_own.md: drop the manual `re.compile + apply(lambda)`
pattern in the regex-filter processor in favor of the idiomatic
`Series.str.contains(..., regex=True)`.
- build_your_own.md: add a kernel-restart caveat after the editable
install instructions — PluginRegistry caches discovery on first
import, so notebooks need a fresh kernel to pick up freshly
installed plugins.
- build_your_own.md: state explicitly what `assert_valid_plugin`
checks (config base + plugin-type-appropriate impl base).
- code_reference/plugins.md: link out to the processors code
reference alongside generators and seed_readers.
* docs: split code reference by package
* docs: add interface code reference
* docs: add code reference overviews
* docs: refine code reference pages
* docs: improve code reference tables
* docs: correct reference docstrings
* docs: embed plugin catalog table
* docs: note plugin discovery restart caveat
* docs: explain generator base class choice
* docs: mention async cell generator examples
* docs: clarify plugin model usage
* docs: clarify plugin model aliases
* docs: address plugin review feedback
* docs: update available plugins page
Seed Datasets
Seed datasets let you bootstrap synthetic data generation from existing data. Instead of generating everything from scratch, you provide a dataset whose columns become available as context in your prompts and expressions—grounding your synthetic data in real-world examples.
!!! tip "When to Use Seed Datasets"
Seed datasets shine when you have real data you want to build on:
- Product catalogs → generate customer reviews
- Medical diagnoses → generate physician notes
- Code snippets → generate documentation
- Company profiles → generate financial reports
The seed data provides realism and domain specificity; Data Designer adds volume and variation.
The Basic Pattern
```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Define your model configuration
model_configs = [
    dd.ModelConfig(
        alias="my-model",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

# 1. Attach a seed dataset
seed_source = dd.LocalFileSeedSource(path="products.csv")
config_builder.with_seed_dataset(seed_source)

# 2. Reference seed columns in your prompts
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="my-model",
        prompt="""\
Write a customer review for {{ product_name }}.
Category: {{ category }}
Price: ${{ price }}
""",
    )
)
```
Every column in your seed dataset becomes available as a Jinja2 variable in prompts and expressions. Data Designer automatically:
- Reads rows from the seed dataset
- Injects seed column values into templates
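To make the injection step concrete, here is a minimal sketch of what substituting one seed row into a prompt amounts to. It is a simplified standard-library stand-in, not Data Designer's Jinja2 rendering: `render_prompt` and the sample row are hypothetical, and only plain `{{ name }}` placeholders are handled.

```python
import re

# Hypothetical stand-in for template rendering; Data Designer itself uses Jinja2.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, seed_row: dict[str, object]) -> str:
    """Replace each {{ column }} placeholder with the value from one seed row."""

    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in seed_row:
            raise KeyError(f"Seed row has no column named {column!r}")
        return str(seed_row[column])

    return PLACEHOLDER.sub(substitute, template)


# Illustrative seed row, as if it had been read from products.csv.
row = {"product_name": "Trail Mug", "category": "Outdoors", "price": 14.5}
template = (
    "Write a customer review for {{ product_name }}.\n"
    "Category: {{ category }}\n"
    "Price: ${{ price }}\n"
)
prompt = render_prompt(template, row)
print(prompt)
```

A placeholder that names a column missing from the seed row raises an error in this sketch, which mirrors the practical rule that prompts should only reference columns the seed dataset actually provides.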
Seed Sources
Data Designer supports multiple ways to provide seed data, including:
📁 LocalFileSeedSource
Load from a local file—CSV, Parquet, or JSON.
```python
# Single file
seed_source = dd.LocalFileSeedSource(path="data/products.csv")

# Parquet files with wildcard
seed_source = dd.LocalFileSeedSource(path="data/products/*.parquet")
```
!!! note "Supported Formats"
- CSV (.csv)
- Parquet (.parquet)
- JSON (.json, .jsonl)
🤗 HuggingFaceSeedSource
Load directly from HuggingFace datasets without downloading manually.
```python
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/gretelai/symptom_to_diagnosis/data/train.parquet",
    token="hf_...",  # Optional, for private datasets
)
```
🐼 DataFrameSeedSource
Use an in-memory pandas DataFrame—great for preprocessing or combining multiple sources.
```python
import pandas as pd

df = pd.read_csv("raw_data.csv")
df = df[df["quality_score"] > 0.8]  # Filter to high-quality rows

seed_source = dd.DataFrameSeedSource(df=df)
```
!!! warning "Serialization"
`DataFrameSeedSource` can't be serialized to YAML/JSON configs. Use `LocalFileSeedSource` if you need to save and share configurations.
🗂️ DirectorySeedSource
Treat a directory tree as the seed dataset. Each matching file becomes one seed row, exposing file metadata you can reference in prompts and expressions.
```python
seed_source = dd.DirectorySeedSource(
    path="docs/",
    file_pattern="*.md",
    recursive=True,
)

config_builder.with_seed_dataset(seed_source)
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="doc_label",
        expr="{{ source_kind }}::{{ relative_path }}",
    )
)
```
Directory-backed seed datasets expose these columns:
- `source_kind`: always `"directory_file"`
- `source_path`: full path to the matched file
- `relative_path`: path relative to the configured directory
- `file_name`: basename of the matched file
!!! note "Filesystem matching"
`file_pattern` matches file names only, not relative paths. `recursive=True` is the default, so nested subdirectories are searched unless you turn it off.
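The matching rule above can be sketched with the standard library. This is an illustration of the described behavior, not the library's implementation: `list_seed_rows`, its default pattern, and the sorted row order are assumptions made for the example.

```python
import fnmatch
import tempfile
from pathlib import Path


def list_seed_rows(root: Path, file_pattern: str = "*", recursive: bool = True) -> list[dict[str, str]]:
    """Build one metadata row per file whose *name* matches file_pattern (hypothetical helper)."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    rows = []
    for path in sorted(candidates):
        # Match against the basename only, never the relative path.
        if path.is_file() and fnmatch.fnmatch(path.name, file_pattern):
            rows.append(
                {
                    "source_kind": "directory_file",
                    "source_path": str(path),
                    "relative_path": path.relative_to(root).as_posix(),
                    "file_name": path.name,
                }
            )
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guides").mkdir()
    (root / "index.md").write_text("# Home")
    (root / "guides" / "setup.md").write_text("# Setup")
    (root / "guides" / "notes.txt").write_text("scratch")

    nested = [row["relative_path"] for row in list_seed_rows(root, "*.md")]
    top_level = [row["relative_path"] for row in list_seed_rows(root, "*.md", recursive=False)]
    by_path = list_seed_rows(root, "guides/*.md")

print(nested)     # ['guides/setup.md', 'index.md']
print(top_level)  # ['index.md']
print(by_path)    # []
```

The last call returns nothing because a pattern containing a directory component can never equal a bare file name. To restrict matches to one subdirectory, point `path` at that subdirectory instead.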
📄 FileContentsSeedSource
Read matching text files into the seed dataset. Each file becomes one seed row with the same metadata as `DirectorySeedSource`, plus the decoded file contents in a `content` column.
```python
seed_source = dd.FileContentsSeedSource(
    path="docs/",
    file_pattern="*.md",
    encoding="utf-8",
)

config_builder.with_seed_dataset(seed_source)
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="summary",
        model_alias="my-model",
        prompt="""\
Summarize the following document.
File: {{ file_name }}
Path: {{ relative_path }}
{{ content }}
""",
    )
)
```
`FileContentsSeedSource` exposes these seeded columns:
- `source_kind`: always `"file_contents"`
- `source_path`: full path to the matched file
- `relative_path`: path relative to the configured directory
- `file_name`: basename of the matched file
- `content`: decoded text contents of the matched file
!!! tip "Custom Filesystem Readers"
If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom `FileSystemSeedReader` and pass it via `DataDesigner(seed_readers=[...])`. For packaging and registration, see Build Your Own.
!!! note "Encoding"
`encoding="utf-8"` is the default. Set a different Python codec name if your files use another text encoding.
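As a quick illustration of why the codec matters, the sketch below writes a file in Latin-1 and reads it back with two codecs. It uses only the standard library and does not call Data Designer; the file name and contents are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "menu.txt"
    # Write the file in Latin-1, where "é" is the single byte 0xE9.
    path.write_bytes("café".encode("latin-1"))

    # Reading with the matching codec recovers the original text.
    decoded = path.read_text(encoding="latin-1")

    # Reading the same bytes as UTF-8 fails, because 0xE9 starts an
    # incomplete multi-byte sequence.
    try:
        path.read_text(encoding="utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

print(decoded, utf8_ok)
```

If files in your directory fail to decode with the default, passing the matching codec name (for example `"latin-1"`) as `encoding` is the corresponding fix.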
🤖 AgentRolloutSeedSource
Parse agent rollout trace files (e.g. from ATIF, Claude Code, Codex, or Hermes Agent) into a structured seed dataset. Each trace becomes one seed row with normalized metadata and the full message history, ready for distillation or analysis pipelines.
```python
seed_source = dd.AgentRolloutSeedSource(
    format=dd.AgentRolloutFormat.CLAUDE_CODE,
)

config_builder.with_seed_dataset(seed_source)
```
!!! info "Dedicated guide"
See Agent Rollout Ingestion for the rollout-specific guide, including:
- supported rollout formats and default locations
- format-specific configuration details like `path` and `file_pattern`
- the full normalized seeded-column schema exposed by `AgentRolloutSeedSource`
!!! tip "Trace Distillation"
See the Agent Rollout Trace Distillation recipe for a complete example that turns agent traces into supervised fine-tuning data.
Sampling Strategies
Control how rows are read from the seed dataset.
Ordered (Default)
Rows are read sequentially in their original order, so each generated record corresponds to the next row in the seed dataset. If you request more records than the seed dataset contains, reading wraps around to the first row and continues in order until all records are generated.
```python
config_builder.with_seed_dataset(
    seed_source,
    sampling_strategy=dd.SamplingStrategy.ORDERED,
)
```
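The cycling behavior of ordered sampling can be pictured with a few lines of plain Python. This is a sketch of the semantics described above, not the library's reader; `ordered_rows` is a hypothetical helper.

```python
from itertools import cycle, islice


def ordered_rows(seed_rows: list[str], num_records: int) -> list[str]:
    """Read rows in order, wrapping back to the start when the seed runs out."""
    return list(islice(cycle(seed_rows), num_records))


seed_rows = ["row-0", "row-1", "row-2"]
sampled = ordered_rows(seed_rows, 7)
print(sampled)
# ['row-0', 'row-1', 'row-2', 'row-0', 'row-1', 'row-2', 'row-0']
```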
Shuffle
Rows are randomly shuffled before sampling. Useful when your seed data has some ordering you want to break.
```python
config_builder.with_seed_dataset(
    seed_source,
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
)
```
Selection Strategies
Select a subset of your seed dataset—useful for large datasets or parallel processing.
IndexRange
Select a specific range of row indices.
```python
# Use only rows 100-199 (100 rows total)
config_builder.with_seed_dataset(
    seed_source,
    selection_strategy=dd.IndexRange(start=100, end=199),
)
```
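The comment in the example (rows 100-199, 100 rows total) implies that `end` is inclusive. A small sketch of that reading, using a hypothetical `select_index_range` helper on a plain list:

```python
def select_index_range(rows: list[int], start: int, end: int) -> list[int]:
    """Return rows[start..end], treating `end` as inclusive (assumed from the example above)."""
    if start < 0 or end < start:
        raise ValueError("Expected 0 <= start <= end")
    return rows[start : end + 1]


rows = list(range(1000))
subset = select_index_range(rows, start=100, end=199)
print(len(subset), subset[0], subset[-1])  # 100 100 199
```

Note the off-by-one difference from Python slicing, where the stop index is exclusive.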
PartitionBlock
Split the dataset into N equal partitions and select one. Perfect for distributing work across multiple jobs.
```python
# Split into 5 partitions, use the 3rd one (index=2, zero-based)
config_builder.with_seed_dataset(
    seed_source,
    selection_strategy=dd.PartitionBlock(index=2, num_partitions=5),
)
```
!!! tip "Parallel Processing"
Run 5 jobs in parallel, each with a different partition index, to process a large seed dataset:
```python
# Job 0: PartitionBlock(index=0, num_partitions=5)
# Job 1: PartitionBlock(index=1, num_partitions=5)
# Job 2: PartitionBlock(index=2, num_partitions=5)
# ...
```
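To see how partitions might map onto row indices, here is one plausible scheme. It is an assumption for illustration: blocks are sized by floor division and the last partition absorbs the remainder. The boundary rule Data Designer actually applies is not stated on this page and may differ; `partition_bounds` is a hypothetical helper.

```python
def partition_bounds(index: int, num_partitions: int, dataset_size: int) -> tuple[int, int]:
    """Return inclusive (start, end) row indices for one partition (assumed scheme)."""
    if not 0 <= index < num_partitions:
        raise ValueError("index must satisfy 0 <= index < num_partitions")
    block = dataset_size // num_partitions
    start = index * block
    # Assumption: the final partition picks up any leftover rows.
    end = dataset_size - 1 if index == num_partitions - 1 else start + block - 1
    return start, end


bounds = [partition_bounds(i, num_partitions=5, dataset_size=1003) for i in range(5)]
print(bounds)
# [(0, 199), (200, 399), (400, 599), (600, 799), (800, 1002)]
```

Whatever the exact rule, the property that matters for parallel jobs is that partitions do not overlap and together cover every row.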
Combining Strategies
Sampling and selection strategies work together. For example, shuffle rows within a specific partition:
```python
config_builder.with_seed_dataset(
    seed_source,
    sampling_strategy=dd.SamplingStrategy.SHUFFLE,
    selection_strategy=dd.PartitionBlock(index=0, num_partitions=10),
)
```
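Conceptually, the selection narrows the rows first and the shuffle then reorders only what was selected. The sketch below shows that ordering on a plain list. The block-size rule and the `seed` parameter are assumptions for the example, and `shuffled_partition` is a hypothetical helper rather than a library function.

```python
import random


def shuffled_partition(rows: list[int], index: int, num_partitions: int, seed: int) -> list[int]:
    """Select one partition, then shuffle only the rows inside it."""
    block = len(rows) // num_partitions
    start = index * block
    # Assumption: the last partition takes any leftover rows.
    stop = len(rows) if index == num_partitions - 1 else start + block
    selected = list(rows[start:stop])
    random.Random(seed).shuffle(selected)  # seeded so the example is repeatable
    return selected


rows = list(range(100))
part = shuffled_partition(rows, index=0, num_partitions=10, seed=7)
print(sorted(part))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Rows outside partition 0 never appear in the result, regardless of the shuffle.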
Complete Example
Here's a complete example generating physician notes from a symptom-to-diagnosis seed dataset:
```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()

model_configs = [
    dd.ModelConfig(
        alias="medical-notes",
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="nvidia",
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

# Attach seed dataset (has 'diagnosis' and 'symptoms' columns)
seed_source = dd.LocalFileSeedSource(path="symptom_to_diagnosis.csv")
config_builder.with_seed_dataset(seed_source)

# Generate patient info
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="patient",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(),
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="patient_name",
        expr="{{ patient.first_name }} {{ patient.last_name }}",
    )
)

# Generate notes grounded in seed data
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        model_alias="medical-notes",
        prompt="""\
You are a physician writing notes after a patient visit.
Patient: {{ patient_name }}
Diagnosis: {{ diagnosis }}
Reported Symptoms: {{ symptoms }}
Write detailed clinical notes for this visit.
""",
    )
)

# Preview
preview = data_designer.preview(config_builder, num_records=5)
preview.display_sample_record()
```
Best Practices
Keep Seed Data Clean
Garbage in, garbage out. Clean your seed data before using it:
- Remove duplicates
- Fix encoding issues
- Filter out low-quality rows
- Standardize column names
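A minimal cleaning pass covering three of these steps (duplicates, low-quality rows, column names) might look like the sketch below. It uses only the standard library; the `quality_score` column, the threshold, and the `clean_seed_rows` helper are hypothetical, and encoding repair is left out.

```python
import csv
import io


def clean_seed_rows(raw_csv: str, min_quality: float) -> list[dict[str, str]]:
    """Standardize column names, drop low-quality rows, and remove exact duplicates."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    seen = set()
    cleaned = []
    for raw in reader:
        # Standardize column names: trim, lowercase, spaces to underscores.
        row = {key.strip().lower().replace(" ", "_"): value.strip() for key, value in raw.items()}
        if float(row["quality_score"]) < min_quality:
            continue
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # exact duplicate of a row already kept
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


raw = (
    "Product Name,Quality Score\n"
    "Trail Mug,0.9\n"
    "Trail Mug,0.9\n"
    "Camp Stove,0.4\n"
    "Headlamp ,0.85\n"
)
cleaned = clean_seed_rows(raw, min_quality=0.8)
print(cleaned)
```

The cleaned rows can then be written back to a CSV file and attached as the seed dataset.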
Match Generation Volume to Seed Size
If your seed dataset has 1,000 rows and you generate 10,000 records, each seed row will be used ~10 times. Consider whether that's appropriate for your use case.
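The reuse count is simple division. Assuming ordered sampling that cycles through the seed, a hypothetical `seed_reuse` helper makes the arithmetic explicit:

```python
def seed_reuse(num_records: int, seed_size: int) -> tuple[int, int]:
    """Return (full_passes, extra_rows) for ordered sampling that cycles through the seed.

    Every seed row is used `full_passes` times, and the first `extra_rows`
    rows are used one additional time.
    """
    return divmod(num_records, seed_size)


print(seed_reuse(10_000, 1_000))  # (10, 0): every row is used exactly 10 times
print(seed_reuse(2_500, 1_000))   # (2, 500): the first 500 rows are used 3 times
```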
Use Seed Data for Diversity Control
Seed datasets are excellent for controlling the distribution of your synthetic data. Want 30% electronics, 50% clothing, 20% home goods? Curate your seed dataset to match.
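Before attaching a curated seed dataset, it can help to confirm that its category mix matches the target. The sketch below computes the share of each value in a column; the `category` column and the `category_shares` helper are illustrative assumptions.

```python
from collections import Counter


def category_shares(rows: list[dict[str, str]], column: str) -> dict[str, float]:
    """Return the fraction of rows holding each distinct value of `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Illustrative seed rows curated to a 30/50/20 split.
rows = (
    [{"category": "electronics"}] * 3
    + [{"category": "clothing"}] * 5
    + [{"category": "home goods"}] * 2
)
shares = category_shares(rows, "category")
print(shares)  # {'electronics': 0.3, 'clothing': 0.5, 'home goods': 0.2}
```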