🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.
Johnny Greco b056650311 fixing ci
2025-10-27 19:25:52 -04:00

🎨 NeMo Data Designer


Create synthetic datasets from scratch.

Installation

git clone https://gitlab-master.nvidia.com/jogreco/data-designer.git
cd data-designer
uv sync

Test your installation:

make test

Example Usage

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner(artifact_path="./artifacts")

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# Passing "/no_think" as the system prompt disables reasoning for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        inference_parameters=InferenceParameters(
            temperature=0.5,
            top_p=1.0,
            max_tokens=1024,
        ),
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)


# Sample a synthetic person (aged 18 to 70) for each record.
config_builder.add_column(
    SamplerColumnConfig(
        name="customer",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(age_range=[18, 70]),
    )
)


# Sample a product category from a fixed list of values.
config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

# Generate free text with the LLM; the prompt references the sampled columns above.
config_builder.add_column(
    LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
            "{{ customer.state }}. Write a review of a product you recently purchased from the "
            "{{ product_category }} category."
        ),
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
    )
)

preview = data_designer.preview(config_builder)

preview.display_sample_record()
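
The `customer_review` prompt refers to earlier columns through `{{ ... }}` placeholders, including dotted access into the sampled person (for example `customer.first_name`). As a rough mental model only, the sketch below picks a category at random, uses a hard-coded stand-in person, and fills the placeholders with a small regex-based renderer. None of this is Data Designer code: `lookup`, `render_prompt`, `sample_record`, and the example person are hypothetical, and the library's actual templating and samplers may behave differently.

```python
import random
import re

# Hypothetical stand-ins for illustration; not part of the data_designer package.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

CATEGORIES = ["Electronics", "Clothing", "Home & Kitchen", "Books", "Home Office"]


def lookup(record: dict, dotted_name: str):
    """Resolve a name such as 'customer.first_name' by walking nested dicts."""
    value = record
    for key in dotted_name.split("."):
        value = value[key]
    return value


def render_prompt(template: str, record: dict) -> str:
    """Replace each {{ name }} placeholder with the matching record value."""
    return PLACEHOLDER.sub(lambda m: str(lookup(record, m.group(1))), template)


def sample_record(rng: random.Random) -> dict:
    """Build one record: a fixed example person plus a randomly chosen category."""
    return {
        "customer": {"first_name": "Ada", "city": "Austin", "state": "TX"},
        "product_category": rng.choice(CATEGORIES),
    }


template = (
    "You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
    "{{ customer.state }}. Category: {{ product_category }}."
)
record = sample_record(random.Random(0))
prompt = render_prompt(template, record)
print(prompt)
```

A placeholder that names a column missing from the record raises `KeyError` in this sketch, which mirrors the practical point that a prompt can only reference columns that were added to the config.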

A note about Person Sampling

Note: The usage below is temporary. The library's support for the Nemotron-Personas datasets will evolve as we prepare to open source.

The PII and persona managed datasets have been updated for 25.11. To use the Nemotron-Personas datasets for person or persona sampling, follow the steps below.

Download the datasets from NGC:

ngc registry resource download-version --org nvidian nvidian/nemo-llm/nemotron-personas-datasets:0.0.6-slim

The "slim" version is smaller for fast development. Remove the "-slim" to get the full datasets.

Tell DataDesigner where to find the datasets:

data_designer = DataDesigner(artifact_path="./artifacts", blob_storage_path="/path/to/nemotron-personas-datasets")
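
Because `blob_storage_path` has to point at the downloaded datasets, checking the directory before constructing `DataDesigner` can surface a wrong path early. The helper below is a sketch under stated assumptions, not part of the library: `resolve_personas_path` and the `NEMOTRON_PERSONAS_PATH` environment variable are hypothetical names, and the check only confirms that the directory exists and is non-empty, not that its contents are valid.

```python
import os
from pathlib import Path


def resolve_personas_path(default: str = "/path/to/nemotron-personas-datasets") -> Path:
    """Return the datasets directory, preferring a (hypothetical) env var override."""
    # NEMOTRON_PERSONAS_PATH is an illustrative name, not one the library reads.
    raw = os.environ.get("NEMOTRON_PERSONAS_PATH", default)
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Personas dataset directory not found: {path}")
    if not any(path.iterdir()):
        raise ValueError(f"Personas dataset directory is empty: {path}")
    return path
```

The result could then be passed along as `blob_storage_path=str(resolve_personas_path())` in the `DataDesigner(...)` call shown above.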