mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.


Johnny Greco 4c19dba74b feat: agent CLI introspection (simplified) (#415 )	2026-03-13 18:26:00 -04:00
.claude	chore: add Claude Code skill for code review (#372 )	2026-03-06 14:33:49 -07:00
.github	fix: cache notebook builds to avoid flaky upstream model failures (#370 )	2026-03-05 12:30:14 -03:00
docs	docs: add Open in Colab badges to tutorial notebooks (#391 )	2026-03-13 17:47:10 -04:00
packages	feat: agent CLI introspection (simplified) (#415 )	2026-03-13 18:26:00 -04:00
plans	docs: trace visualization in display_sample_record (#396 ) (#397 )	2026-03-13 09:48:33 -06:00
scripts	chore: Improve CLI startup with lazy heavy import cleanup (#330 )	2026-02-18 16:24:15 -05:00
tests_e2e	feat: add processor plugin support (#299 )	2026-02-25 16:40:01 -03:00
.gitignore	refactor: Decouple ModelFacade from LiteLLM via ModelClient adapter (#373 )	2026-03-11 14:30:40 -06:00
.pre-commit-config.yaml	refactor: slim package refactor into three subpackages (#240 )	2026-01-27 13:53:20 -05:00
AGENTS.md	chore: improve test guidelines in AGENTS.md (#387 )	2026-03-09 15:39:49 -06:00
CLAUDE.md	add agent instruction files	2025-10-27 18:47:12 -04:00
CODE_OF_CONDUCT.md	add email to code of conduct	2025-10-30 14:27:46 -04:00
CONTRIBUTING.md	docs: some link fixes (#65 )	2025-11-21 16:33:03 -05:00
DCO	add code of conduct	2025-10-29 15:51:17 -04:00
greptile.json	chore: enable status check in greptile.json (#295 )	2026-02-04 13:57:09 -03:00
LICENSE	initial port	2025-10-27 14:29:12 -04:00
Makefile	fix: cache notebook builds to avoid flaky upstream model failures (#370 )	2026-03-05 12:30:14 -03:00
mkdocs.yml	docs: search agent dev note (#350 )	2026-03-12 11:43:39 -07:00
pyproject.toml	fix: add chardet<6 constraint to published engine package (#406 )	2026-03-12 18:34:41 -04:00
README.md	docs: update README token badge to 150+ billion (#367 )	2026-03-04 13:20:26 -05:00
uv.lock	fix: add chardet<6 constraint to published engine package (#406 )	2026-03-12 18:34:41 -04:00
VERSIONING.md	feat: add dynamic version pinning for inter-package dependencies (#282 )	2026-02-03 11:14:55 -05:00

README.md

🎨 NeMo Data Designer

Generate high-quality synthetic datasets from scratch or using your own seed data.

Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

Generate diverse data using statistical samplers, LLMs, or existing seed datasets
Control relationships between fields with dependency-aware generation
Validate quality with built-in Python, SQL, and custom local and remote validators
Score outputs using LLM-as-a-judge for quality assessment
Iterate quickly with preview mode before full-scale generation

Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Start with one of our default model providers (NVIDIA, OpenAI, or OpenRouter, matching the variables below). Get an API key from the provider(s) you plan to use and set one or more of the following environment variables:

export NVIDIA_API_KEY="your-api-key-here"

export OPENAI_API_KEY="your-openai-api-key-here"

export OPENROUTER_API_KEY="your-openrouter-api-key-here"
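
Before running a job, it can help to confirm which of these keys are actually visible to the Python process. The following is a minimal sketch using only the standard library. The helper name `configured_providers` and the provider labels are illustrative and are not part of Data Designer; only the environment variable names come from this README.

```python
import os

# Environment variable names are taken from this README.
# The short labels on the left are illustrative, not Data Designer identifiers.
PROVIDER_KEYS = {
    "nvidia": "NVIDIA_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def configured_providers(env=None):
    """Return the labels whose API key variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        label
        for label, var in PROVIDER_KEYS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    found = configured_providers()
    print("API keys found for:", ", ".join(found) if found else "none")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.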

3. Start generating data!

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default settings
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate personalized customer reviews
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()
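
The `{{ product_category }}` placeholder in the prompt is what ties the review column to the sampled category: each record's prompt is filled in with that record's own value. The snippet below is only a stand-in illustration of that substitution in plain Python. It does not use or reproduce Data Designer's own template rendering, and `render_prompt` is a hypothetical helper.

```python
import re

# Matches placeholders of the form {{ column_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, record):
    """Replace each {{ column }} placeholder with the record's value."""

    def substitute(match):
        column = match.group(1)
        if column not in record:
            raise KeyError(f"prompt references unknown column: {column}")
        return str(record[column])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Write a brief product review for a {{ product_category }} "
    "item you recently purchased."
)

# One prompt per record, each using that record's sampled category.
for category in ["Electronics", "Books"]:
    print(render_prompt(template, {"product_category": category}))
```

Because the prompt depends on `product_category`, that column has to be generated first, which is the ordering that dependency-aware generation takes care of.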

What's next?

📚 Learn more

Getting Started – Install, configure, and generate your first dataset
Tutorial Notebooks – Step-by-step interactive tutorials
Column Types – Explore samplers, LLM columns, validators, and more
Validators – Learn how to validate generated data with Python, SQL, and remote validators
Model Configuration – Configure custom models and providers
Person Sampling – Learn how to sample realistic person data with demographic attributes

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

🤝 Get involved

Contributing Guide – Help improve Data Designer
GitHub Issues – Report bugs or make a feature request

Telemetry

Data Designer collects telemetry to help us improve the library for developers. We collect:

The names of models used
The count of input tokens
The count of output tokens

No user or device information is collected, and this data is not used to track any individual user's behavior. It is used in aggregate to see which models are most popular for synthetic data generation (SDG). We will share this usage data with the community.

Specifically, the model name defined in a ModelConfig object is what is collected. In the example config below:

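import data_designer.config as dd

# The value passed as `model` is the name that telemetry records.
dd.ModelConfig(
    alias="nv-reasoning",
    model="openai/gpt-oss-20b",
    provider="nvidia",
    inference_parameters=dd.ChatCompletionInferenceParams(
        temperature=0.3,
        top_p=0.9,
        max_tokens=4096,
    ),
)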

The value openai/gpt-oss-20b would be collected.

To disable telemetry capture, set NEMO_TELEMETRY_ENABLED=false.
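
If you launch jobs from a script or notebook rather than a shell, the same variable can be set from Python. This is a minimal sketch that assumes the library reads the variable from the process environment at runtime, so it sets the value before Data Designer is imported or used.

```python
import os

# Assumption: Data Designer reads this variable from the process environment,
# so it is set before the library is imported or used.
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"

# Import and use Data Designer only after the variable is set, for example:
# import data_designer.config as dd

print("NEMO_TELEMETRY_ENABLED =", os.environ["NEMO_TELEMETRY_ENABLED"])
```

Setting the variable this way affects only the current process and any child processes it starts.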

Top Models

This chart represents the breakdown of models used for Data Designer across all synthetic data generation jobs from 1/24/2026 to 2/24/2026.

Last updated on 2/24/2026

License

Apache License 2.0 – see LICENSE for details.

Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team, NVIDIA},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}
