🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.

🎨 NeMo Data Designer

Create synthetic datasets from scratch.

Installation

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

Test your installation:

make test

Example Usage

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner(artifact_path="./artifacts")

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# This system prompt disables reasoning for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        inference_parameters=InferenceParameters(
            temperature=0.5,
            top_p=1.0,
            max_tokens=1024,
        ),
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)


config_builder.add_column(
    SamplerColumnConfig(
        name="customer",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(age_range=[18, 70]),
    )
)


config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

config_builder.add_column(
    LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
            "{{ customer.state }}. Tell me about your experience shopping in the "
            "{{ product_category }} department of our company."
        ),
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
    )
)

preview = data_designer.preview(config_builder)

preview.display_sample_record()
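
The prompt in the example refers to earlier columns through `{{ ... }}` placeholders, including dotted access such as `{{ customer.first_name }}`. As a rough illustration of that idea only, the sketch below fills such placeholders from a nested dictionary using the Python standard library. It is not Data Designer's implementation: `render_prompt`, `PLACEHOLDER`, and the sample `record` are hypothetical names and values made up for this example.

```python
import re

# Matches {{ name }} or {{ name.attr }} placeholders like those in the prompt above.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render_prompt(template: str, record: dict) -> str:
    """Fill each placeholder by walking dotted keys through nested dicts."""

    def lookup(match: re.Match) -> str:
        value = record
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if a column or field is missing
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical record; in the library, values would come from the sampler columns.
record = {
    "customer": {"first_name": "Ada", "city": "Austin", "state": "TX"},
    "product_category": "Books",
}

prompt = render_prompt(
    "You are a customer named {{ customer.first_name }} from "
    "{{ customer.city }}, {{ customer.state }}, shopping for {{ product_category }}.",
    record,
)
print(prompt)
```

A missing column name raises `KeyError` in this sketch, which makes a typo in a placeholder easy to spot before any model call is made.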

A note about Person Sampling

Note: The usage below is only temporary. The library's support for the Nemotron-Personas datasets will evolve as we prepare to open source.

The PII and persona managed datasets have been updated for 25.11. If you want to use our Nemotron-Personas datasets for person / persona sampling, you need to do the following.

Download the datasets from NGC:

ngc registry resource download-version --org nvidian nvidian/nemo-llm/nemotron-personas-datasets:0.0.6-slim

The "slim" version is smaller for fast development. Remove the "-slim" to get the full datasets.

Tell DataDesigner where to find the datasets:

data_designer = DataDesigner(artifact_path="./artifacts", blob_storage_path="/path/to/nemotron-personas-datasets")