mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.




🎨 NeMo Data Designer

Generate high-quality synthetic datasets from scratch or using your own seed data.

Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

Generate diverse data using statistical samplers, LLMs, or existing seed datasets
Control relationships between fields with dependency-aware generation
Validate quality with built-in Python, SQL, and custom local and remote validators
Score outputs using LLM-as-a-judge for quality assessment
Iterate quickly with preview mode before full-scale generation
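To make "dependency-aware generation" concrete, here is a minimal sketch of the idea, not Data Designer's actual implementation: if a column's prompt references another column through a `{{ name }}` placeholder, the referenced column has to be generated first. The helper `generation_order` and the `summary` column are hypothetical and exist only for this illustration.

```python
import re
from graphlib import TopologicalSorter

# Matches Jinja-style placeholders such as {{ product_category }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def generation_order(prompts):
    """Return column names ordered so that dependencies come first.

    `prompts` maps a column name to its prompt template, or to None for
    columns (such as samplers) that reference no other column.
    Hypothetical helper for illustration only.
    """
    graph = {
        name: set(PLACEHOLDER.findall(template or ""))
        for name, template in prompts.items()
    }
    return list(TopologicalSorter(graph).static_order())


columns = {
    "summary": "Summarize this review in one sentence: {{ review }}",
    "review": "Write a brief product review for a {{ product_category }} item.",
    "product_category": None,
}
order = generation_order(columns)
print(order)
```

Even though `summary` is declared first, it is scheduled last, because it depends on `review`, which in turn depends on `product_category`.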

⚠️ Security Notice: LiteLLM Supply-Chain Incident (2026-03-24)

On March 24, 2026, malicious versions of litellm (1.82.7 and 1.82.8) were published to PyPI containing a credential stealer. The compromised packages were available for approximately five hours (10:39 – 16:00 UTC) before being removed.

The only Data Designer releases that could resolve to these versions are v0.2.2 (Dec 2025) and v0.2.3 (Jan 2026), which carried a looser litellm<2 upper bound. These are nearly three months old and have been superseded by eight subsequent releases — both have been yanked from PyPI as a precaution. All other releases (v0.3.0 – v0.5.3) pinned litellm to >=1.73.6,<1.80.12 and were never compatible with 1.82.x. Starting with v0.5.4, litellm is no longer a dependency.

To have been impacted through Data Designer, you would need to have had one of these two old versions explicitly pinned and run a fresh pip install or dependency-cache update that resolved litellm during the five-hour window on March 24. If you believe you may be affected, see BerriAI's incident report for remediation steps.
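As a quick first check, the sketch below reports whether the litellm version currently installed in your environment is one of the two compromised releases named above. It uses only the standard library; the function names are illustrative. Note that a clean result only describes the current environment and does not prove that a compromised version was never installed earlier.

```python
from importlib import metadata

# The two malicious releases named in the notice above.
COMPROMISED_LITELLM = {"1.82.7", "1.82.8"}


def installed_litellm_version():
    """Return the installed litellm version string, or None if it is absent."""
    try:
        return metadata.version("litellm")
    except metadata.PackageNotFoundError:
        return None


def litellm_status(version):
    """Classify a litellm version string (or None) against the known-bad set."""
    if version is None:
        return "not installed"
    if version in COMPROMISED_LITELLM:
        return "compromised"
    return "not a known-compromised version"


if __name__ == "__main__":
    print(litellm_status(installed_litellm_version()))
```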

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Start with one of our default model providers: NVIDIA, OpenAI, or OpenRouter.

Grab your API key(s) from your chosen provider(s) and set one or more of the following environment variables:

export NVIDIA_API_KEY="your-api-key-here"

export OPENAI_API_KEY="your-openai-api-key-here"

export OPENROUTER_API_KEY="your-openrouter-api-key-here"
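Before moving on, it can help to confirm that at least one key is visible to your Python process. The following is a small standard-library sketch; the helper name `configured_provider_keys` is hypothetical and not part of Data Designer.

```python
import os

# Environment variable names taken from the export lines above.
PROVIDER_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def configured_provider_keys(environ=os.environ):
    """Return the provider key names that are set to a non-empty value."""
    return [name for name in PROVIDER_KEYS if environ.get(name)]


if __name__ == "__main__":
    found = configured_provider_keys()
    if found:
        print("Found keys for:", ", ".join(found))
    else:
        print("No provider API key is set; export at least one of:", ", ".join(PROVIDER_KEYS))
```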

3. Start generating data!

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default settings
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate personalized customer reviews
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()

What's next?

📚 Learn more

Getting Started – Install, configure, and generate your first dataset
Tutorial Notebooks – Step-by-step interactive tutorials
Column Types – Explore samplers, LLM columns, validators, and more
Validators – Learn how to validate generated data with Python, SQL, and remote validators
Model Configuration – Configure custom models and providers
Person Sampling – Learn how to sample realistic person data with demographic attributes

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

🤖 Agent Skill

Data Designer has a skill for coding agents. Just describe the dataset you want, and your agent handles schema design, validation, and generation. While the skill should work with other coding agents that support skills, our development and testing have focused on Claude Code at this stage.

Install via skills.sh (be sure to select Claude Code as an additional agent):

npx skills add NVIDIA-NeMo/DataDesigner

After installation, type /data-designer or describe the dataset you want and the skill will kick in.

🤝 Get involved

This repository supports agent-assisted development — see CONTRIBUTING.md for the recommended workflow.

Contributing Guide – How to contribute, including agent-assisted workflows
GitHub Issues – Report bugs or make a feature request

Telemetry

Data Designer collects telemetry to help us improve the library for developers. We collect:

The names of models used
The count of input tokens
The count of output tokens

No user or device information is collected. This data is not used to track any individual user behavior. It is used to see an aggregation of which models are the most popular for SDG. We will share this usage data with the community.

Specifically, the model name defined in a ModelConfig object is what will be collected. In the example config below:

ModelConfig(
    alias="nv-reasoning",
    model="openai/gpt-oss-20b",
    provider="nvidia",
    inference_parameters=ChatCompletionInferenceParams(
        temperature=0.3,
        top_p=0.9,
        max_tokens=4096,
    ),
)

The value openai/gpt-oss-20b would be collected.

To disable telemetry capture, set NEMO_TELEMETRY_ENABLED=false.
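If you prefer to opt out from inside a script or notebook rather than from the shell, one option is to set the variable in the process environment. This sketch assumes the variable is read when Data Designer is imported or run, so it sets the value before any Data Designer import; that ordering is a precaution, not documented behavior.

```python
import os

# Assumption: set this before importing data_designer so the setting is in
# place whenever the library reads it.
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"

print(os.environ["NEMO_TELEMETRY_ENABLED"])
```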

Top Models

This chart represents the breakdown of models used for Data Designer across all synthetic data generation jobs from 2/23/2026 to 3/23/2026.

Last updated on 3/23/2026

License

Apache License 2.0 – see LICENSE for details.

Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team, NVIDIA},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}
