🎨 NeMo Data Designer: Generate high-quality synthetic data from scratch or from seed data.
Find a file
Nabin Mulepati bbcd7d3995
Some checks are pending
CI / Test Config (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Config (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / Coverage Check (Python 3.11) (push) Waiting to run
CI / End to end test (Python 3.11 on macos-latest) (push) Waiting to run
CI / Lint and Format Check (push) Waiting to run
CI / End to end test (Python 3.10 on macos-latest) (push) Waiting to run
CI / Check License Headers (push) Waiting to run
CI / End to end test (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.13 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.12 on ubuntu-latest) (push) Waiting to run
CI / Test Config (Python 3.10 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.13 on macos-latest) (push) Waiting to run
CI / End to end test (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.13 on ubuntu-latest) (push) Waiting to run
CI / End to end test (Python 3.11 on ubuntu-latest) (push) Waiting to run
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.11 on macos-latest) (push) Waiting to run
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.12 on macos-latest) (push) Waiting to run
CI / Test Config (Python 3.10 on ubuntu-latest) (push) Waiting to run
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
fix: harden resume checkpoint handling (#624)
* fix: harden resume checkpoint handling

Persist config identity in metadata, make checkpoints atomic, and reject unsafe resume states so interrupted runs do not mix incompatible or post-processed data.

* fix: close resume edge cases

Let IF_POSSIBLE start fresh for resize configs and mark after-generation processing before mutation so interrupted processors cannot be resumed unsafely.

* refactor: drop dataset directory lock

Single-user CLI/notebook flows don't race on the artifact directory, and
the timestamped-directory fallback already handles the "ran it twice"
case. The lock added complexity (re-entrancy, stale cleanup, the
cached-property trap where IF_POSSIBLE→NEVER moves writes to a
timestamped directory while the lock stays pinned to the original) for
no real protection. Atomic metadata writes still cover the actual hazard
(crash mid-write).

Also fix a pre-existing test bug in
test_initial_actual_num_records_uses_actual_parquet_rows_for_partial_row_group
where the mocked scheduler hit the partial-completion path with
unconfigured Mock attributes.

* fix: address Greptile review on resume edge cases

* Drop the unreachable ResumeMode.IF_POSSIBLE branch in
  _post_generation_processed_resume_result. By the time this helper
  runs, build() has normalised IF_POSSIBLE to ALWAYS or NEVER, so the
  guard now matches reality. Tighten the docstring to document the
  three outcomes (no-op return / fall through / raise).

* Split the post-processed extension/raise into two cases. When
  num_records < prior_target the user just asked for fewer records than
  already exist; the previous "would mix pre- and post-processor
  records" message only describes the extension case. Mirror the
  wording used by _load_resume_state and add a regression test.

* Remove the dead _find_completed_row_group_ids wrapper now that
  _build_async uses _find_completed_row_groups directly. Rename the
  related test to match.

* refactor: unify sync + async resume around filesystem-derived progress

Both engines now derive `num_completed_batches` and `actual_num_records`
from `parquet-files/batch_*.parquet` via `_recover_progress_from_disk`.
`metadata.json` keeps describing the run *configuration* (`buffer_size`,
`target_num_records`, `original_target_num_records`, config fingerprint),
while the filesystem is the source of truth for *progress*. This closes
the sync engine's race window between `move_partial_result_to_final_file_path`
and the metadata write that follows it, matching the crash-recovery the
async engine already had.

The sync engine additionally rejects non-contiguous batch IDs (a hole can
only mean external mutation or a directory written by an incompatible
engine); the async engine continues to tolerate gaps from out-of-order
completion via `allow_holes=True`.

Existing sync resume tests now seed parquet files alongside metadata,
and two new tests cover the unified behaviour: filesystem progress wins
when metadata lags, and sync rejects non-contiguous IDs.

* docs: clarify DatasetCreationResults observability scope on resume

`load_dataset`, `count_records`, `load_analysis`, `export`, and `push_to_hub`
all read from the artifact directory, so they reflect the cumulative dataset
(original + resume rows). `task_traces`, model-usage logs, and telemetry
events are scoped to the current invocation only because the original run's
in-memory state is not persisted. Document this in the class docstring,
the architecture note, and the Fern resume guide.

* docs: explain DeprecationWarning re-raise in create()/preview()

Future readers were puzzled by the ``except DeprecationWarning: raise``
short-circuits before the generic generation-error wrappers. Add a
comment in ``create()`` (with a back-reference from ``preview()``) to
record that strict warning filters (``pytest.warns``,
``-W error::DeprecationWarning``) turn the engine's
``warnings.warn(..., DeprecationWarning)`` calls — most notably the
``allow_resize=True`` deprecation in ``_resolve_async_compatibility`` —
into raised exceptions, and we want them to surface untouched instead of
being swallowed by ``DataDesignerGenerationError``.

* fix: close after-generation crash window and tighten metadata typing on resume

Address review feedback on resume hardening:

* Run after-generation processors unconditionally on the on-disk dataset
  rather than gating on the generation return value. The previous gate
  silently skipped after-generation when resume saw every row group
  already on disk, leaving a crash window between the final parquet write
  and the ``post_generation_state="started"`` marker write: in that
  window the dataset is complete but after-generation never ran, and the
  on-disk parquet files are still clean. The "started" short-circuit
  still rejects the other direction (crashed mid-rewrite, ambiguous
  state), so resume only re-runs after-generation when it is safe to do
  so.

* Raise ``DatasetGenerationError`` (instead of letting a raw
  ``TypeError`` leak out of ``num_records < prior_target``) when a
  post-processed dataset's metadata is missing ``target_num_records``.
  Mirrors the wording used by ``_load_resume_state``.

* Document the new behaviour in ``architecture/dataset-builders.md`` and
  the Fern resume invariants.

Tests:

* ``test_build_resume_complete_dataset_runs_after_generation_when_no_marker``
  covers the closed crash window via the public ``set_processor_runner``
  API.
* ``test_build_resume_post_generation_processed_missing_target_raises_clearly``
  covers the typed-error gap.
2026-05-11 11:44:46 -06:00
.agents fix: quote review-code skill argument hint (#616) 2026-05-07 14:55:10 -04:00
.claude docs: restructure agent and contributor documentation (plan 427, PR 1) (#454) 2026-03-25 12:38:42 -06:00
.github ci: bump NVIDIA-NeMo/FW-CI-templates/.github/workflows/_semantic_pull_request.yml (#621) 2026-05-08 17:10:38 -03:00
architecture fix: harden resume checkpoint handling (#624) 2026-05-11 11:44:46 -06:00
docs feat: let column configs declare all model aliases for the startup health check (#626) 2026-05-11 11:33:50 -06:00
fern fix: harden resume checkpoint handling (#624) 2026-05-11 11:44:46 -06:00
packages fix: harden resume checkpoint handling (#624) 2026-05-11 11:44:46 -06:00
plans ci: add daily audit suites with 5 rotating recipes and scheduled workflow (#543) 2026-04-17 14:48:55 -03:00
scripts fix(config): update OpenRouter vision model id (#630) 2026-05-11 13:49:17 -03:00
skills/data-designer fix: prevent skill load failure when data-designer CLI is not installed (#501) 2026-04-07 17:36:18 -04:00
tests_e2e feat: make async engine the default execution path (#592) 2026-05-04 16:22:13 -03:00
.gitignore docs: migrate documentation from MkDocs to Fern (#581) 2026-05-07 14:12:58 -03:00
.pre-commit-config.yaml chore: use uv run ruff in pre-commit hooks (#436) 2026-03-19 10:14:28 -03:00
AGENTS.md docs: restructure agent and contributor documentation (plan 427, PR 1) (#454) 2026-03-25 12:38:42 -06:00
CLAUDE.md add agent instruction files 2025-10-27 18:47:12 -04:00
CODE_OF_CONDUCT.md add email to code of conduct 2025-10-30 14:27:46 -04:00
CONTRIBUTING.md ci: add PR hygiene automation (linked issue check + stale PR cleanup) (#521) 2026-04-13 20:26:02 -03:00
DCO add code of conduct 2025-10-29 15:51:17 -04:00
DEVELOPMENT.md docs: restructure agent and contributor documentation (plan 427, PR 1) (#454) 2026-03-25 12:38:42 -06:00
greptile.json chore: reduce Greptile review noise from defensive coding suggestions (#423) 2026-03-30 17:42:52 -03:00
LICENSE initial port 2025-10-27 14:29:12 -04:00
Makefile docs: migrate documentation from MkDocs to Fern (#581) 2026-05-07 14:12:58 -03:00
mkdocs.yml docs: graduate plugins out of experimental mode (#603) 2026-05-06 18:12:44 -04:00
pyproject.toml chore: bump lxml and nbconvert to address security advisories (#574) 2026-04-27 12:32:49 -04:00
README.md feat: make async engine the default execution path (#592) 2026-05-04 16:22:13 -03:00
SECURITY.md chore: bump pillow and python-multipart for CVEs, add SECURITY.md (#564) 2026-04-20 18:36:22 -04:00
STYLEGUIDE.md docs: restructure agent and contributor documentation (plan 427, PR 1) (#454) 2026-03-25 12:38:42 -06:00
uv.lock feat(cli): show version update notice (#602) 2026-05-07 15:20:18 -04:00
VERSIONING.md feat: add dynamic version pinning for inter-package dependencies (#282) 2026-02-03 11:14:55 -05:00

🎨 NeMo Data Designer

CI License Python 3.10 - 3.13 NeMo Microservices Code Tokens

Generate high-quality synthetic datasets from scratch or using your own seed data.


Welcome!

Data Designer helps you create synthetic datasets that go beyond simple LLM prompting. Whether you need diverse statistical distributions, meaningful correlations between fields, or validated high-quality outputs, Data Designer provides a flexible framework for building production-grade synthetic data.

What can you do with Data Designer?

  • Generate diverse data using statistical samplers, LLMs, or existing seed datasets
  • Control relationships between fields with dependency-aware generation
  • Validate quality with built-in Python, SQL, and custom local and remote validators
  • Score outputs using LLM-as-a-judge for quality assessment
  • Iterate quickly with preview mode before full-scale generation

📣 Heads-up: async engine is now the default

Data Designer now runs pipelines on a cell-level async engine that overlaps independent columns and adapts concurrency per (provider, model). On most pipelines this is faster with no config changes; on slow self-hosted endpoints, set inference_parameters.timeout to your real per-request latency. See Architecture & Performance → Async Engine for the behaviors worth knowing about.

If you hit anything unexpected, fall back to the legacy sync engine for one transitional release with DATA_DESIGNER_ASYNC_ENGINE=0, and please open an issue so we can fix the async path.


Quick Start

1. Install

pip install data-designer

Or install from source:

git clone https://github.com/NVIDIA-NeMo/DataDesigner.git
cd DataDesigner
make install

2. Set your API key

Start with one of our default model providers:

Grab your API key(s) using the above links and set one or more of the following environment variables:

export NVIDIA_API_KEY="your-api-key-here"

export OPENAI_API_KEY="your-openai-api-key-here"

export OPENROUTER_API_KEY="your-openrouter-api-key-here"

3. Start generating data!

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default settings
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Add a product category
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books"],
        ),
    )
)

# Generate personalized customer reviews
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="review",
        model_alias="nvidia-text",
        prompt="Write a brief product review for a {{ product_category }} item you recently purchased.",
    )
)

# Preview your dataset
preview = data_designer.preview(config_builder=config_builder)
preview.display_sample_record()

What's next?

📚 Learn more

🔧 Configure models via CLI

data-designer config providers # Configure model providers
data-designer config models    # Set up your model configurations
data-designer config list      # View current settings

🤖 Agent Skill

Data Designer has a skill for coding agents. Just describe the dataset you want, and your agent handles schema design, validation, and generation. While the skill should work with other coding agents that support skills, our development and testing has focused on Claude Code at this stage.

Install via skills.sh (be sure to select Claude Code as an additional agent):

npx skills add NVIDIA-NeMo/DataDesigner

After installation, type /data-designer or describe the dataset you want and the skill will kick in.

🤝 Get involved

This repository supports agent-assisted development — see CONTRIBUTING.md for the recommended workflow.


Telemetry

Data Designer collects telemetry to help us improve the library for developers. This data is not used to track any individual user behavior. It is used to see an aggregation of which models are the most popular for SDG. We will share this usage data with the community.

Disable with NEMO_TELEMETRY_ENABLED=false. More details →

Top models (YTD)

Aggregate model usage across synthetic data generation jobs, year-to-date 1/1/20265/1/2026:

Top models used for synthetic data generation

Last updated on May 1, 2026


License

Apache License 2.0 see LICENSE for details.


Citation

If you use NeMo Data Designer in your research, please cite it using the following BibTeX entry:

@misc{nemo-data-designer,
  author = {The NeMo Data Designer Team, NVIDIA},
  title = {NeMo Data Designer: A framework for generating synthetic data from scratch or based on your own seed data},
  howpublished = {\url{https://github.com/NVIDIA-NeMo/DataDesigner}},
  year = {2025},
  note = {GitHub Repository},
}

Telemetry & privacy

NeMo Data Designer includes an optional function to share anonymous telemetry data with NVIDIA for product improvement. Data collected is limited to names of models used and token counts (input and output). No user or device information is collected. This data is used to prioritize product improvements and will be shared in aggregate with the community. It is not used to track any individual user behavior.

You may opt out of telemetry collection at any time. Opting out applies only to data collection by the NeMo Data Designer library itself.

Use of third-party endpoints, including NVIDIA Build: NeMo Data Designer can be configured to use various inference endpoints, including build.nvidia.com (NVIDIA Build). If you choose to use NVIDIA Build or any other third-party endpoint, that endpoint's own terms of service and privacy practices apply independently of this library. Any opt-out you exercise within NeMo Data Designer does not extend to data collection by your chosen endpoint. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not submit any confidential information or personal data when using NVIDIA Build.