mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.

Nabin Mulepati cb0b1c6f6a docs: docs for quickstart, cli, model settings (#37 ) * vibe it baby * clean up * iterate with claude * Save prog * Update info pipeine * Fix tests * Fix typo * remove redundant overload * Add support for multiple default model providers and config * pull user-defined model configs and providers if available * Added tests for default model settings * save progress * refactor cli to be modular and use OOP * new tests for cli components * config_dir > config_path * simplify list * list tests * stranded commit * tests for commands * tests for field.py * tests for form.py * more tests * deleting providers should delete associated model configs * add readme.md for cli * clean up * Fix tests * feat: (FTUE) pull user-defined (via cli) model configs and providers (#24) * added docs for quick start and default model settings * Updates per chat * update quickstart.md * update default-model-settings.md * add check for interface.py as well * move default model config resolution to src/data_designer/__init__.py * Revert "move default model config resolution to src/data_designer/__init__.py" This reverts commit `806a81dc93`. * docs for cli * update default-model-settings.md * docs for model provider * more docs * add new tests for get provider name * add lru cache * remove non doc related changes * PR feedback * update reset info * tip for settings files * update * update info about default inference providers * DATA_DESIGNER_HOME_DIR -> DATA_DESIGNER_HOME --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>		2025-11-18 21:28:03 -07:00
.github/workflows	chore: add 3.10 to ci (#39 )	2025-11-17 10:44:04 -05:00
docs	docs: docs for quickstart, cli, model settings (#37 )	2025-11-18 21:28:03 -07:00
scripts	docs: establish doc templating, building, and strategy (#31 )	2025-11-12 17:04:50 -05:00
src/data_designer	default to composite resolver (#47 )	2025-11-18 20:57:55 -07:00
tests	default to composite resolver (#47 )	2025-11-18 20:57:55 -07:00
.gitignore	docs: establish doc templating, building, and strategy (#31 )	2025-11-12 17:04:50 -05:00
.pre-commit-config.yaml	add and run pre-commit	2025-10-27 18:10:36 -04:00
AGENTS.md	add agent instruction files	2025-10-27 18:47:12 -04:00
CLAUDE.md	add agent instruction files	2025-10-27 18:47:12 -04:00
CODE_OF_CONDUCT.md	add email to code of conduct	2025-10-30 14:27:46 -04:00
CONTRIBUTING.md	docs: welcome and concepts/columns (#43 )	2025-11-17 17:07:01 -05:00
DCO	add code of conduct	2025-10-29 15:51:17 -04:00
LICENSE	initial port	2025-10-27 14:29:12 -04:00
Makefile	docs: welcome and concepts/columns (#43 )	2025-11-17 17:07:01 -05:00
mkdocs.yml	docs: docs for quickstart, cli, model settings (#37 )	2025-11-18 21:28:03 -07:00
pyproject.toml	docs: welcome and concepts/columns (#43 )	2025-11-17 17:07:01 -05:00
README.md	add ci badge	2025-10-28 14:21:24 -04:00
uv.lock	docs: welcome and concepts/columns (#43 )	2025-11-17 17:07:01 -05:00
VERSIONING.md	hatchling versioning; ci updates	2025-10-28 14:10:56 -04:00

README.md

🎨 NeMo Data Designer

Create synthetic datasets from scratch.

Installation

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

Test your installation:

make test
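
As a lighter-weight check than the full test suite, the snippet below sketches one way to confirm the package is importable from the active environment. The helper name `is_installed` is made up for this illustration and is not part of Data Designer; it assumes the package installs under the top-level name `data_designer`, as the imports in the usage example suggest.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no top-level module of that name exists.
    return importlib.util.find_spec(package) is not None


# Hypothetical sanity check; run it in the environment that `make install` set up.
print("data_designer importable:", is_installed("data_designer"))
```

This only confirms that the package can be located; `make test` remains the way to verify that it works.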

Example Usage

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner(artifact_path="./artifacts")

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# The "/no_think" system prompt turns off reasoning for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        inference_parameters=InferenceParameters(
            temperature=0.5,
            top_p=1.0,
            max_tokens=1024,
        ),
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)


config_builder.add_column(
    SamplerColumnConfig(
        name="customer",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(age_range=[18, 70]),
    )
)


config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

config_builder.add_column(
    LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
            "{{ customer.state }}. Tell me about your experience working in the "
            "{{ product_category }} department of our company."
        ),
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
    )
)

preview = data_designer.preview(config_builder)

preview.display_sample_record()
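
The `customer_review` prompt above refers to other columns through `{{ ... }}` placeholders, including dotted access into the person record (`customer.first_name`). The sketch below illustrates that substitution idea with a small regex-based renderer. It is not Data Designer's rendering code, and `render_prompt` and the sample record values are invented for this illustration.

```python
import re
from typing import Any, Mapping

# Matches "{{ name }}" or "{{ name.attr }}" placeholders.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render_prompt(template: str, record: Mapping[str, Any]) -> str:
    """Fill {{ dotted.path }} placeholders from a nested record (illustrative only)."""

    def lookup(match: re.Match) -> str:
        value: Any = record
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if a column or field is missing
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)


# Hypothetical sampled record; the field values are made up.
record = {
    "customer": {"first_name": "Ada", "city": "Austin", "state": "TX"},
    "product_category": "Books",
}
template = (
    "You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
    "{{ customer.state }}. Tell me about your experience working in the "
    "{{ product_category }} department of our company."
)
prompt = render_prompt(template, record)
print(prompt)
```

Each generated row would get its own filled-in prompt, so the sampled `customer` and `product_category` values shape what the model is asked to write.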

A note about Person Sampling

Note: The usage below is only temporary. The library's support for the Nemotron-Personas datasets will evolve as we prepare to open source.

The PII and persona managed datasets have been updated for 25.11. If you want to use our Nemotron-Personas datasets for person / persona sampling, you need to do the following.

Download the datasets from NGC:

ngc registry resource download-version --org nvidian nvidian/nemo-llm/nemotron-personas-datasets:0.0.6-slim

The "-slim" version is a smaller download intended for fast development. Remove the "-slim" suffix to get the full datasets.

Tell DataDesigner where to find the datasets:

data_designer = DataDesigner(artifact_path="./artifacts", blob_storage_path="/path/to/nemotron-personas-datasets")
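
Because the dataset location is passed as a plain path, it can help to fail early when the directory is missing. The helper below is a minimal sketch of that check; `resolve_personas_path` is a hypothetical name, not part of the library, and the commented-out usage assumes that `data_designer` is installed and the datasets have been downloaded.

```python
from pathlib import Path


def resolve_personas_path(raw_path: str) -> str:
    """Return an absolute path to the datasets directory, or raise if it is missing."""
    path = Path(raw_path).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"Personas dataset directory not found: {path}")
    return str(path)


# Usage sketch (uncomment once the datasets are downloaded):
# from data_designer.essentials import DataDesigner
# data_designer = DataDesigner(
#     artifact_path="./artifacts",
#     blob_storage_path=resolve_personas_path("/path/to/nemotron-personas-datasets"),
# )
```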