mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.

🎨 NeMo Data Designer

Generate high-quality synthetic datasets from scratch or using your own seed data.

Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

Generate diverse data using statistical samplers, LLMs, or existing seed datasets
Control relationships between fields with dependency-aware generation
Validate quality with built-in Python, SQL, and custom local and remote validators
Score outputs using LLM-as-a-judge for quality assessment
Iterate quickly with preview mode before full-scale generation
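
To make "dependency-aware generation" concrete, here is a minimal, stdlib-only sketch of the idea: a column whose prompt references another column must be generated after it. This is not Data Designer's implementation; the `generation_order` helper, the `customer_name` column, and the regex are assumptions made up for illustration.

```python
import re
from graphlib import TopologicalSorter

# Matches the column name inside a {{ column_name }} placeholder.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)")


def generation_order(columns):
    """Order columns so that every referenced column comes before its users.

    `columns` maps a column name to its prompt template, or to None for
    columns (such as samplers) that reference nothing. Hypothetical helper.
    """
    graph = {}
    for name, template in columns.items():
        referenced = set(PLACEHOLDER.findall(template or ""))
        # Keep only references to columns that actually exist in the config.
        graph[name] = referenced & set(columns)
    # static_order() yields dependencies before the nodes that need them.
    return list(TopologicalSorter(graph).static_order())


columns = {
    "review": "Review a {{ product_category }} item as {{ customer_name }}.",
    "product_category": None,
    "customer_name": None,
}
order = generation_order(columns)
print(order)
```

Whatever order the two sampler columns come out in, `review` is always last because it depends on both.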

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Get your API key from build.nvidia.com or OpenAI:

export NVIDIA_API_KEY="your-api-key-here"
# Or use OpenAI
export OPENAI_API_KEY="your-openai-api-key-here"
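
If you want to confirm the key is visible to Python before generating data, a small check like the following can help. It is a sketch that only inspects environment variables; the `configured_providers` helper is hypothetical and not part of Data Designer.

```python
import os

# Variable names taken from the export commands above.
PROVIDER_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY")


def configured_providers(env=None):
    """Return the names in PROVIDER_KEYS that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]


found = configured_providers()
if found:
    print("API keys found:", ", ".join(found))
else:
    print("No API key set; export NVIDIA_API_KEY or OPENAI_API_KEY first.")
```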

3. Start generating data!

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    LLMTextColumnConfig,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)

# Initialize with default settings
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate a customer review for the sampled product category
config_builder.add_column(
    LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="""Write a brief product review for a {{ product_category }} item you recently purchased.""",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()
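
To see what the two columns above describe, the sketch below mimics them with the standard library only: it samples a category and substitutes it into the `{{ product_category }}` placeholder. It does not call an LLM and does not reflect Data Designer's internals; `render_prompt`, `sample_records`, and the `review_prompt` field are hypothetical names used for illustration.

```python
import random
import re

CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books"]
PROMPT = (
    "Write a brief product review for a {{ product_category }} "
    "item you recently purchased."
)

# Matches {{ name }} placeholders, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, record):
    """Replace each {{ name }} placeholder with the matching record value."""
    return PLACEHOLDER.sub(lambda m: str(record[m.group(1)]), template)


def sample_records(n, seed=0):
    """Sample n categories and render the review prompt for each one."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        record = {"product_category": rng.choice(CATEGORIES)}
        record["review_prompt"] = render_prompt(PROMPT, record)
        records.append(record)
    return records


for row in sample_records(3):
    print(row["review_prompt"])
```

Each printed line is the prompt that an LLM column would conceptually receive for one record, with the sampled category filled in.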

What's next?

📚 Learn more

Quick Start Guide – Detailed walkthrough with more examples
Tutorial Notebooks – Step-by-step interactive tutorials
Column Types – Explore samplers, LLM columns, validators, and more
Validators – Learn how to validate generated data with Python, SQL, and remote validators
Model Configuration – Configure custom models and providers
Person Sampling – Learn how to sample realistic person data with demographic attributes

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

🤝 Get involved

Contributing Guide – Help improve Data Designer
GitHub Issues – Report bugs or request features

License

Apache License 2.0 – see LICENSE for details.

Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}
