mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.


Nabin Mulepati	52d42fe1b7	feat: add audio and video context (#701 )	2026-05-22 11:54:40 -06:00

* feat: add audio and video context
  Add audio/video context config models and canonical media helpers. Translate canonical media blocks for OpenAI-compatible clients while preserving URL media as URLs. Reject unsupported audio/video blocks in the Anthropic adapter. Refs #671
* fix: harden media context review gaps
  Preserve extensionless HTTP(S) audio and video URLs as URL media, reject local path-looking audio/video context values, and reject provider-specific audio/video blocks in the Anthropic adapter. Refs #671
* test: add audio video context smoke notebook
  Add a Jupytext source notebook and generated Colab artifact that exercise audio/video context URL, base64, local path rejection, OpenAI-compatible payload translation, and Anthropic unsupported-media handling. Refs #671
* test: make media context notebook end to end
  Rewrite the audio/video smoke notebook to run a full Data Designer preview against a local OpenAI-compatible HTTP server. Assert the generated dataset, captured endpoint payload, URL/base64 translation, and local path rejection through the interface pipeline. Refs #671
* test: remove media context notebook from docs
  Move the generated audio/video context E2E notebook out of the PR docs surface and keep it locally under the main checkout's .scratch directory. Refs #671
* harden multimodal media context handling
* address media context review notes
  Remove unused URL-specific media helpers, share the base64 data URI parser in Anthropic translation, align AudioContext validation messaging, and update config docs for audio/video contexts. Refs #671
* docs: update media context guidance
* refactor: consolidate media helpers
* support local audio and video paths
* refactor: combine media path checks
* address media context review feedback
* remove openai media preflight
* sync generated colab notebooks
* align media local path autodetection
.agents	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
.claude	docs: restructure agent and contributor documentation (plan 427, PR 1) (#454 )	2026-03-25 12:38:42 -06:00
.github	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
architecture	feat: enable bounded-borrow task admission (#693 )	2026-05-22 10:28:12 -04:00
docs	feat: add audio and video context (#701 )	2026-05-22 11:54:40 -06:00
fern	feat: add audio and video context (#701 )	2026-05-22 11:54:40 -06:00
packages	feat: add audio and video context (#701 )	2026-05-22 11:54:40 -06:00
plans	feat: implement async scheduling admission control (#661 )	2026-05-20 20:58:05 -04:00
scripts	feat: enable bounded-borrow task admission (#693 )	2026-05-22 10:28:12 -04:00
skills/data-designer	fix: prevent skill load failure when data-designer CLI is not installed (#501 )	2026-04-07 17:36:18 -04:00
tests_e2e	feat: implement async scheduling admission control (#661 )	2026-05-20 20:58:05 -04:00
.gitignore	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
.pre-commit-config.yaml	chore: use uv run ruff in pre-commit hooks (#436 )	2026-03-19 10:14:28 -03:00
AGENTS.md	docs: restructure agent and contributor documentation (plan 427, PR 1) (#454 )	2026-03-25 12:38:42 -06:00
CLAUDE.md	add agent instruction files	2025-10-27 18:47:12 -04:00
CODE_OF_CONDUCT.md	add email to code of conduct	2025-10-30 14:27:46 -04:00
CONTRIBUTING.md	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
DCO	add code of conduct	2025-10-29 15:51:17 -04:00
DEVELOPMENT.md	docs: restructure agent and contributor documentation (plan 427, PR 1) (#454 )	2026-03-25 12:38:42 -06:00
greptile.json	chore: reduce Greptile review noise from defensive coding suggestions (#423 )	2026-03-30 17:42:52 -03:00
LICENSE	initial port	2025-10-27 14:29:12 -04:00
Makefile	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
mkdocs.yml	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
pyproject.toml	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
README.md	docs: update generated token badge (#678 )	2026-05-19 15:11:13 -04:00
SECURITY.md	chore: bump pillow and python-multipart for CVEs, add SECURITY.md (#564 )	2026-04-20 18:36:22 -04:00
STYLEGUIDE.md	docs: restructure agent and contributor documentation (plan 427, PR 1) (#454 )	2026-03-25 12:38:42 -06:00
uv.lock	docs: remove docs code reference (#674 )	2026-05-21 18:29:18 -04:00
VERSIONING.md	docs: fix Fern versioned publishing (#656 )	2026-05-15 17:09:59 -03:00

README.md

🎨 NeMo Data Designer

Generate high-quality synthetic datasets from scratch or using your own seed data.

Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

Generate diverse data using statistical samplers, LLMs, or existing seed datasets
Control relationships between fields with dependency-aware generation
Validate quality with built-in Python, SQL, and custom local and remote validators
Score outputs using LLM-as-a-judge for quality assessment
Iterate quickly with preview mode before full-scale generation
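
To make "dependency-aware generation" concrete, here is a minimal standard-library sketch of the general idea: columns that reference other columns through `{{ name }}` placeholders must be generated after the columns they reference. This is an illustration only, not Data Designer's implementation; the `columns` mapping and the `generation_order` helper are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical column specs: name -> prompt template (None for sampled columns).
columns = {
    "review": "Write a review of a {{ product_category }} item for {{ customer_name }}.",
    "product_category": None,
    "customer_name": None,
    "summary": "Summarize this review in one sentence: {{ review }}",
}

# Matches Jinja-style references such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(specs):
    """Return column names ordered so each column follows the columns it references."""
    graph = {
        name: set(PLACEHOLDER.findall(template or ""))
        for name, template in specs.items()
    }
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(graph).static_order())


order = generation_order(columns)
print(order)
```

In this sketch `product_category` and `customer_name` always come before `review`, and `review` before `summary`; a circular reference would raise `graphlib.CycleError`.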

📣 Heads-up: async engine is now the default

Data Designer now runs pipelines on a cell-level async engine that overlaps independent columns and adapts concurrency per (provider, model). On most pipelines this is faster with no config changes; on slow self-hosted endpoints, set inference_parameters.timeout to your real per-request latency. See Architecture & Performance → Async Engine for the behaviors worth knowing about.

If you hit anything unexpected, fall back to the legacy sync engine for one transitional release with DATA_DESIGNER_ASYNC_ENGINE=0, and please open an issue so we can fix the async path.
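
In a notebook or script where exporting a shell variable is inconvenient, the fallback can also be set from Python. The sketch below assumes the variable should be in place before Data Designer is imported; when the library actually reads it is not stated here, so setting it first is the conservative choice.

```python
import os

# Assumption: set the flag before importing Data Designer so it is visible
# whenever the library checks it.
os.environ["DATA_DESIGNER_ASYNC_ENGINE"] = "0"

# Import Data Designer only after the variable is set, for example:
# from data_designer.interface import DataDesigner
```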

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Start with one of our default model providers. The environment variables below correspond to NVIDIA, OpenAI, and OpenRouter.

Grab your API key(s) from your chosen provider(s) and set one or more of the following environment variables:

export NVIDIA_API_KEY="your-api-key-here"

export OPENAI_API_KEY="your-openai-api-key-here"

export OPENROUTER_API_KEY="your-openrouter-api-key-here"
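
Before running a pipeline, it can help to confirm which of these keys are visible to your Python process. The following is a small standalone check, not part of the Data Designer API; the `PROVIDER_KEYS` labels and the `configured_providers` helper are hypothetical names, and only the variable names come from the exports above.

```python
import os

# Variable names come from the export lines above; the labels are illustrative.
PROVIDER_KEYS = {
    "nvidia": "NVIDIA_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def configured_providers(env=None):
    """Return the labels whose API-key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [label for label, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) or "none")
```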

3. Start generating data!

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default settings
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate personalized customer reviews
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()
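
The `review` prompt above refers to the sampled column through the `{{ product_category }}` placeholder. As a rough illustration of what that substitution means for a single record, here is a standard-library sketch; it is not how Data Designer renders prompts, and `render_prompt` is a hypothetical helper.

```python
import re

# Matches Jinja-style references such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, row):
    """Replace each {{ column }} placeholder with that column's value in `row`."""

    def substitute(match):
        name = match.group(1)
        if name not in row:
            raise KeyError(f"prompt references unknown column: {name}")
        return str(row[name])

    return PLACEHOLDER.sub(substitute, template)


prompt = "Write a brief product review for a {{ product_category }} item you recently purchased."
rendered = render_prompt(prompt, {"product_category": "Books"})
print(rendered)
# prints: Write a brief product review for a Books item you recently purchased.
```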

What's next?

📚 Learn more

Getting Started – Install, configure, and generate your first dataset
Tutorial Notebooks – Step-by-step interactive tutorials
Column Types – Explore samplers, LLM columns, validators, and more
Validators – Learn how to validate generated data with Python, SQL, and remote validators
Model Configuration – Configure custom models and providers
Person Sampling – Learn how to sample realistic person data with demographic attributes

📝 Documentation transition

Data Designer is gradually moving documentation from MkDocs to Fern. During the transition, maintainers publish both docs builds for a few releases so the Fern site can mature without losing the existing MkDocs release archive.

Contributors should keep editing the existing docs sources under docs/. Tutorial notebook source lives in docs/notebook_source/*.py; generated notebooks and Fern artifacts are not the source of truth.

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

🤖 Agent Skill

Data Designer has a skill for coding agents. Just describe the dataset you want, and your agent handles schema design, validation, and generation. The skill should work with other coding agents that support skills, but our development and testing have focused on Claude Code at this stage.

Install via skills.sh (be sure to select Claude Code as an additional agent):

npx skills add NVIDIA-NeMo/DataDesigner

After installation, type /data-designer or describe the dataset you want and the skill will kick in.

🤝 Get involved

This repository supports agent-assisted development — see CONTRIBUTING.md for the recommended workflow.

Contributing Guide – How to contribute, including agent-assisted workflows
GitHub Issues – Report bugs or make a feature request

Telemetry

Data Designer collects telemetry to help us improve the library for developers. This data is not used to track individual user behavior; it is used in aggregate to see which models are most popular for synthetic data generation (SDG). We will share this usage data with the community.

Disable with NEMO_TELEMETRY_ENABLED=false. For more details, see Telemetry & privacy below.

Top models (YTD)

Aggregate model usage across synthetic data generation jobs, year-to-date 1/1/2026–5/1/2026:

Last updated on May 1, 2026

License

Apache License 2.0 – see LICENSE for details.

Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team, NVIDIA},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}

Telemetry & privacy

NeMo Data Designer includes an optional function to share anonymous telemetry data with NVIDIA for product improvement. Data collected is limited to names of models used and token counts (input and output). No user or device information is collected. This data is used to prioritize product improvements and will be shared in aggregate with the community. It is not used to track any individual user behavior.

You may opt out of telemetry collection at any time. Opting out applies only to data collection by the NeMo Data Designer library itself.

Use of third-party endpoints, including NVIDIA Build: NeMo Data Designer can be configured to use various inference endpoints, including build.nvidia.com (NVIDIA Build). If you choose to use NVIDIA Build or any other third-party endpoint, that endpoint's own terms of service and privacy practices apply independently of this library. Any opt-out you exercise within NeMo Data Designer does not extend to data collection by your chosen endpoint. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not submit any confidential information or personal data when using NVIDIA Build.
