🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.
Nabin Mulepati cb0b1c6f6a
docs: docs for quickstart, cli, model settings (#37)
Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2025-11-18 21:28:03 -07:00
.github/workflows chore: add 3.10 to ci (#39) 2025-11-17 10:44:04 -05:00
docs docs: docs for quickstart, cli, model settings (#37) 2025-11-18 21:28:03 -07:00
scripts docs: establish doc templating, building, and strategy (#31) 2025-11-12 17:04:50 -05:00
src/data_designer default to composite resolver (#47) 2025-11-18 20:57:55 -07:00
tests default to composite resolver (#47) 2025-11-18 20:57:55 -07:00
.gitignore docs: establish doc templating, building, and strategy (#31) 2025-11-12 17:04:50 -05:00
.pre-commit-config.yaml add and run pre-commit 2025-10-27 18:10:36 -04:00
AGENTS.md add agent instruction files 2025-10-27 18:47:12 -04:00
CLAUDE.md add agent instruction files 2025-10-27 18:47:12 -04:00
CODE_OF_CONDUCT.md add email to code of conduct 2025-10-30 14:27:46 -04:00
CONTRIBUTING.md docs: welcome and concepts/columns (#43) 2025-11-17 17:07:01 -05:00
DCO add code of conduct 2025-10-29 15:51:17 -04:00
LICENSE initial port 2025-10-27 14:29:12 -04:00
Makefile docs: welcome and concepts/columns (#43) 2025-11-17 17:07:01 -05:00
mkdocs.yml docs: docs for quickstart, cli, model settings (#37) 2025-11-18 21:28:03 -07:00
pyproject.toml docs: welcome and concepts/columns (#43) 2025-11-17 17:07:01 -05:00
README.md add ci badge [ 2025-10-28 14:21:24 -04:00
uv.lock docs: welcome and concepts/columns (#43) 2025-11-17 17:07:01 -05:00
VERSIONING.md hatchling versioning; ci updates 2025-10-28 14:10:56 -04:00

🎨 NeMo Data Designer

Create synthetic datasets from scratch.

Installation

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

Test your installation:

make test

Example Usage

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMTextColumnConfig,
    ModelConfig,
    PersonSamplerParams,
    SamplerColumnConfig,
    SamplerType,
)

data_designer = DataDesigner(artifact_path="./artifacts")

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nvidia-nemotron-nano-9b-v2"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v2"

# The "/no_think" system prompt turns reasoning off for the nemotron-nano-v2 model.
SYSTEM_PROMPT = "/no_think"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        inference_parameters=InferenceParameters(
            temperature=0.5,
            top_p=1.0,
            max_tokens=1024,
        ),
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)


config_builder.add_column(
    SamplerColumnConfig(
        name="customer",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(age_range=[18, 70]),
    )
)


config_builder.add_column(
    SamplerColumnConfig(
        name="product_category",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

config_builder.add_column(
    LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
            "{{ customer.state }}. Tell me about your experience working in the "
            "{{ product_category }} department of our company."
        ),
        system_prompt=SYSTEM_PROMPT,
        model_alias=MODEL_ALIAS,
    )
)

preview = data_designer.preview(config_builder)

preview.display_sample_record()
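
The prompt above refers to earlier columns through `{{ ... }}` placeholders, including dotted fields such as `customer.first_name`. As a rough illustration of what that substitution amounts to for a single record, here is a minimal standard-library sketch. The `render_prompt` helper and the sample record are hypothetical and exist only for this example; this is not how Data Designer renders templates internally.

```python
import re

# Hypothetical record, shaped like one row built from the sampler columns above.
record = {
    "customer": {"first_name": "Ada", "city": "Austin", "state": "TX"},
    "product_category": "Books",
}

# Matches placeholders such as {{ product_category }} or {{ customer.city }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template: str, row: dict) -> str:
    """Replace each {{ dotted.path }} placeholder with the matching value in row."""

    def lookup(match: re.Match) -> str:
        value = row
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if a referenced column is missing
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


prompt = render_prompt(
    "You are a customer named {{ customer.first_name }} from {{ customer.city }}, "
    "{{ customer.state }}.",
    record,
)
print(prompt)
```

A placeholder that names a column not yet present in the record fails loudly here, which mirrors why a column should be added to the config before any prompt that references it.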

A note about Person Sampling

Note: The usage below is temporary. The library's support for the Nemotron-Personas datasets will evolve as we prepare to open source.

The PII and persona managed datasets have been updated for 25.11. If you want to use our Nemotron-Personas datasets for person / persona sampling, you need to do the following.

Download the datasets from NGC:

ngc registry resource download-version --org nvidian nvidian/nemo-llm/nemotron-personas-datasets:0.0.6-slim

The "slim" version is smaller, which makes for faster development. Remove the "-slim" suffix from the version tag to get the full datasets.
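
If you switch between the slim and full datasets often, the resource string can be built rather than edited by hand. The sketch below only assembles the same `ngc` command shown above; the helper names `resource_spec` and `download_command` are hypothetical, and the full-dataset tag is assumed to be the slim tag with "-slim" removed, as described above.

```python
# Resource name and version taken from the download command above.
RESOURCE = "nvidian/nemo-llm/nemotron-personas-datasets"
VERSION = "0.0.6"


def resource_spec(slim: bool = True) -> str:
    """Return the resource:version string, with or without the -slim suffix."""
    suffix = "-slim" if slim else ""
    return f"{RESOURCE}:{VERSION}{suffix}"


def download_command(slim: bool = True) -> list:
    """Return the ngc download command as an argument list (does not run it)."""
    return [
        "ngc", "registry", "resource", "download-version",
        "--org", "nvidian",
        resource_spec(slim),
    ]


print(" ".join(download_command(slim=True)))
print(" ".join(download_command(slim=False)))
```

The argument list can be passed to `subprocess.run` if you want to script the download, assuming the `ngc` CLI is installed and configured.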

Tell DataDesigner where to find the datasets:

data_designer = DataDesigner(artifact_path="./artifacts", blob_storage_path="/path/to/nemotron-personas-datasets")
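
A mistyped dataset path is easy to miss until sampling starts, so it can help to check the directory before constructing `DataDesigner`. The following is a minimal sketch using only the standard library; `resolve_personas_dir` is a hypothetical helper, not part of the library, and it checks only that the directory exists, not that its contents are valid.

```python
from pathlib import Path


def resolve_personas_dir(path: str) -> Path:
    """Expand and resolve path, raising early if it is not an existing directory."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise FileNotFoundError(f"Personas dataset directory not found: {resolved}")
    return resolved


# Intended usage (commented out because the path here is a placeholder):
# personas_dir = resolve_personas_dir("/path/to/nemotron-personas-datasets")
# data_designer = DataDesigner(
#     artifact_path="./artifacts",
#     blob_storage_path=str(personas_dir),
# )
```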