Elgato_dark/DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Nabin Mulepati 52d42fe1b7

feat: add audio and video context (#701)

* feat: add audio and video context

Add audio/video context config models and canonical media helpers.

Translate canonical media blocks for OpenAI-compatible clients while preserving URL media as URLs. Reject unsupported audio/video blocks in the Anthropic adapter.

Refs #671

* fix: harden media context review gaps

Preserve extensionless HTTP(S) audio and video URLs as URL media, reject local path-looking audio/video context values, and reject provider-specific audio/video blocks in the Anthropic adapter.

Refs #671

* test: add audio video context smoke notebook

Add a Jupytext source notebook and generated Colab artifact that exercise audio/video context URL, base64, local path rejection, OpenAI-compatible payload translation, and Anthropic unsupported-media handling.

Refs #671

* test: make media context notebook end to end

Rewrite the audio/video smoke notebook to run a full Data Designer preview against a local OpenAI-compatible HTTP server. Assert the generated dataset, captured endpoint payload, URL/base64 translation, and local path rejection through the interface pipeline.

Refs #671

* test: remove media context notebook from docs

Move the generated audio/video context E2E notebook out of the PR docs surface and keep it locally under the main checkout's .scratch directory.

Refs #671

* harden multimodal media context handling

* address media context review notes

Remove unused URL-specific media helpers, share the base64 data URI parser in Anthropic translation, align AudioContext validation messaging, and update config docs for audio/video contexts.

Refs #671

* docs: update media context guidance

* refactor: consolidate media helpers

* support local audio and video paths

* refactor: combine media path checks

* address media context review feedback

* remove openai media preflight

* sync generated colab notebooks

* align media local path autodetection

2026-05-22 11:54:40 -06:00

Overview

Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.

🚀 Setting Up Your Environment

Local Setup Best Practices

First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:

=== "uv (Recommended)"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Launch Jupyter
uv run jupyter notebook
```

=== "pip + venv"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter

# Launch Jupyter
jupyter notebook
```

API Keys and Authentication

Data Designer can interface with various LLM providers. Set an API key for each provider whose models you plan to use:

```bash
# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"

# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"
```

For more information, see the Welcome page, Default Model Settings, and Configure Model Settings Using The CLI.
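Before launching a notebook, it can help to confirm which of these keys are visible to the Python process. The snippet below is a minimal sketch that uses only the standard library. The helper name `missing_api_keys` is hypothetical and not part of Data Designer; the variable names are the ones from the export commands above.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEYS = {
    "NVIDIA API Catalog": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def missing_api_keys(env=None):
    """Return the provider names whose API key is unset or empty.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if not env.get(var)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("No key found for:", ", ".join(missing))
    else:
        print("All provider keys are set.")
```

A missing key is only a problem for a provider you intend to use, so treat the output as a checklist, not an error.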

📚 Tutorial Series

The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:

1. The Basics

Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:

- Setting up the DataDesigner interface
- Configuring models and inference parameters
- Using built-in samplers (Category, Person, Uniform)
- Generating LLM text columns with dependencies
- Understanding the generation workflow

Start here if you're new to Data Designer!

2. Structured Outputs, Jinja Expressions, and Conditional Generation

Explore more advanced data generation capabilities:

- Creating structured JSON outputs with schemas
- Using Jinja expressions for derived columns
- Combining samplers with structured data
- Building complex data dependencies
- Working with nested data structures
- Conditional generation with `skip.when`

3. Seeding with an External Dataset

Learn how to leverage existing datasets to guide synthetic data generation:

- Loading and using seed datasets
- Sampling from real data distributions
- Combining seed data with LLM generation
- Creating realistic synthetic data based on existing patterns

4. Providing Images as Context

Learn how to use vision-language models to generate text descriptions from images:

- Processing and converting images to base64 format for model consumption
- Using vision-language models (VLMs) to analyze visual documents
- Understanding how image, audio, and video context share the same `multi_modal_context` field, while still requiring model support for each modality
- Generating detailed summaries from document images
- Inspecting and validating vision-based generation results
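The base64 conversion mentioned in this list can be sketched with the standard library alone. The helpers `image_to_base64` and `image_to_data_uri` below are hypothetical names for illustration, not Data Designer functions. Whether a model expects the raw base64 string or a full data URI is an assumption to check against the notebook itself.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_base64(path):
    """Read an image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def image_to_data_uri(path):
    """Wrap the base64 payload in a data URI.

    The MIME type is guessed from the file extension only; the file
    contents are not inspected.
    """
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path!r}")
    return f"data:{mime};base64,{image_to_base64(path)}"
```

Base64 encoding makes the payload roughly a third larger than the original file, which is worth keeping in mind when seeding a dataset with many large images.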

5. Generating Images

Generate synthetic image data with Data Designer:

- Configuring image-generation models with `ImageInferenceParams`
- Adding image columns with Jinja2 prompts and sampler-driven diversity
- Preview (base64 in dataframe) vs create (images saved to disk, paths in dataframe)
- Displaying generated images in the notebook
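The preview and create modes differ in how an image is represented: inline as a base64 string, or as a file on disk referenced by its path. The sketch below shows one way to move from the first form to the second using only the standard library. `save_base64_image` is a hypothetical helper and does not reflect how Data Designer stores images internally.

```python
import base64
from pathlib import Path


def save_base64_image(b64_data, out_dir, name):
    """Decode a base64 image string, write it under out_dir, and return the file path.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    path.write_bytes(base64.b64decode(b64_data))
    return path
```

Storing paths instead of inline payloads keeps the dataframe small, at the cost of having to keep the image directory alongside the dataset.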

6. Image-to-Image Editing

Chain image generation columns to generate and then edit images:

- Generating images from text and then editing them in a follow-up column
- Using `ImageContext` with auto-detection to pass generated images to an editing model
- Combining sampled accessories and settings for varied edits
- Comparing generated vs edited images in preview and create modes

📖 Important Documentation Sections

Before diving into the tutorials, familiarize yourself with these key documentation sections:

Getting Started

- Welcome & Installation - Overview of Data Designer capabilities and installation instructions

Core Concepts

Understanding these concepts will help you make the most of the tutorials:

- Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
- Validators - Understand how to validate generated data with Python, SQL, and remote validators
- Person Sampling - Learn how to sample realistic person data with demographic attributes
