DataDesigner/docs/notebook_source/_README.md
Nabin Mulepati 52d42fe1b7
2026-05-22 11:54:40 -06:00

Overview

Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.

🚀 Setting Up Your Environment

Local Setup Best Practices

First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:

=== "uv (Recommended)"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Launch Jupyter
uv run jupyter notebook
```

=== "pip + venv"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter

# Launch Jupyter
jupyter notebook
```

API Keys and Authentication

Data Designer can interface with various LLM providers. Set an API key for each provider whose models you want to use:

# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"

# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"

For more information, see the Welcome and Default Model Settings pages, and the guide on how to Configure Model Settings Using The CLI.
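
Before launching a notebook, it can help to confirm which of these keys are actually visible to Python. The following is a minimal sketch using only the standard library; the helper name `configured_providers` is hypothetical and is not part of Data Designer.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEYS = {
    "NVIDIA API Catalog": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("API keys found for:", ", ".join(found))
    else:
        print("No provider API keys are set; export at least one before running the notebooks.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching your shell configuration.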

📚 Tutorial Series

The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:

1. The Basics

Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:

  • Setting up the DataDesigner interface
  • Configuring models and inference parameters
  • Using built-in samplers (Category, Person, Uniform)
  • Generating LLM text columns with dependencies
  • Understanding the generation workflow

Start here if you're new to Data Designer!
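
As a rough mental model of that workflow, the sketch below samples two columns and then fills a prompt that depends on them. It is plain Python, not the Data Designer API: the column names, category values, and `str.format` template are all illustrative assumptions standing in for the samplers and prompt templating used in the notebook.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sample_row(rng):
    """Sample one row the way category and uniform samplers conceptually do."""
    return {
        # Hypothetical category values, chosen with equal probability.
        "product_category": rng.choice(["Electronics", "Clothing", "Books"]),
        # Uniform integer rating between 1 and 5 inclusive.
        "stars": rng.randint(1, 5),
    }


# The prompt depends on the sampled columns, so it can only be built
# after those columns exist. An LLM column would be generated from it.
PROMPT_TEMPLATE = "Write a {stars}-star review for a product in the {product_category} category."


def build_prompt(row):
    """Fill the template with the values sampled for this row."""
    return PROMPT_TEMPLATE.format(**row)


rows = [sample_row(rng) for _ in range(3)]
prompts = [build_prompt(row) for row in rows]
```

The ordering is the point of the sketch: sampled columns come first, and any column that references them is produced afterwards.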

2. Structured Outputs, Jinja Expressions, and Conditional Generation

Explore more advanced data generation capabilities:

  • Creating structured JSON outputs with schemas
  • Using Jinja expressions for derived columns
  • Combining samplers with structured data
  • Building complex data dependencies
  • Working with nested data structures
  • Conditional generation with skip.when
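
To make two of these ideas concrete, here is a small standard-library sketch of checking a JSON output against a schema and skipping a dependent value when a condition holds. It is a stand-in for the concepts only: `REVIEW_SCHEMA`, `validate_structured`, and `generate_followup` are hypothetical names, and the code does not use Data Designer's schema support or its skip.when feature.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REVIEW_SCHEMA = {"title": str, "rating": int, "pros": list}


def validate_structured(raw_json, schema):
    """Parse a JSON string and check required fields and their types."""
    data = json.loads(raw_json)
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"field {field!r} should be {expected.__name__}")
    return data


def generate_followup(row, skip_when):
    """Return None when the skip condition holds, otherwise a derived value."""
    if skip_when(row):
        return None
    return f"Follow up on '{row['title']}' (rating {row['rating']})"
```

For example, passing `lambda row: row["rating"] >= 4` as `skip_when` produces a follow-up only for low-rated reviews and leaves the value empty for the rest.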

3. Seeding with an External Dataset

Learn how to leverage existing datasets to guide synthetic data generation:

  • Loading and using seed datasets
  • Sampling from real data distributions
  • Combining seed data with LLM generation
  • Creating realistic synthetic data based on existing patterns
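
The idea of seeding can be sketched without any library: draw rows from existing data, then build generation prompts from them. The seed values, column names, and helper functions below are assumptions made for illustration and do not reflect the dataset or API used in the notebook.

```python
import csv
import io
import random

# Hypothetical seed data; in practice this would come from a real dataset file.
SEED_CSV = """diagnosis,age_group
flu,adult
asthma,child
flu,senior
migraine,adult
"""


def load_seed(text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def sample_seed_rows(seed_rows, n, rng):
    """Sample with replacement so frequent seed values stay frequent."""
    return [dict(rng.choice(seed_rows)) for _ in range(n)]


def seeded_prompt(row):
    """Build a generation prompt grounded in one sampled seed row."""
    return (
        f"Write a short clinical note about {row['diagnosis']} "
        f"for a patient in the {row['age_group']} age group."
    )


seed_rows = load_seed(SEED_CSV)
sampled = sample_seed_rows(seed_rows, 5, random.Random(42))
prompts = [seeded_prompt(row) for row in sampled]
```

Because rows are drawn from the seed data itself, values that are common in the seed (here, `flu`) tend to stay common in the generated prompts.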

4. Providing Images as Context

Learn how to use vision-language models to generate text descriptions from images:

  • Processing and converting images to base64 format for model consumption
  • Using vision-language models (VLMs) to analyze visual documents
  • Understanding how image, audio, and video context share the same multi_modal_context field, while still requiring model support for each modality
  • Generating detailed summaries from document images
  • Inspecting and validating vision-based generation results
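
The base64 conversion step can be sketched with the standard library alone. The helper name `image_to_data_uri` is hypothetical, and whether an endpoint expects a raw base64 string or a full data URI depends on the provider, so treat the notebook as the reference for the exact format.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Read an image file and return it as a base64 data URI string."""
    path = Path(path)
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The raw base64 payload is everything after the first comma, so `uri.split(",", 1)[1]` recovers it when only the encoded string is needed.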

5. Generating Images

Generate synthetic image data with Data Designer:

  • Configuring image-generation models with ImageInferenceParams
  • Adding image columns with Jinja2 prompts and sampler-driven diversity
  • Comparing preview (images returned as base64 in the dataframe) with create (images saved to disk, with paths in the dataframe)
  • Displaying generated images in the notebook
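
The difference between the two modes comes down to where the image bytes live. The sketch below illustrates it with the standard library: a preview-style row holds base64 text, while a create-style row holds a file path. The helper name, file name, and directory layout are assumptions, not the ones Data Designer uses.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_image(b64_string, out_dir, name):
    """Decode a base64 string and write the bytes to out_dir/name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / name
    out_path.write_bytes(base64.b64decode(b64_string))
    return out_path


# Preview-style row: the image is held in memory as base64 text.
preview_row = {"image": base64.b64encode(b"fake image bytes").decode("ascii")}

# Create-style row: the image is written to disk and the row stores its path.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_base64_image(preview_row["image"], tmp, "image_0.png")
    create_row = {"image": str(saved)}
    round_trip = saved.read_bytes()  # read back while the directory still exists
```

Storing paths keeps large datasets light in memory, while base64 in the dataframe is convenient for a quick look at a handful of rows.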

6. Image-to-Image Editing

Chain image generation columns to generate and then edit images:

  • Generating images from text and then editing them in a follow-up column
  • Using ImageContext with auto-detection to pass generated images to an editing model
  • Combining sampled accessories and settings for varied edits
  • Comparing generated vs edited images in preview and create modes

📖 Important Documentation Sections

Before diving into the tutorials, familiarize yourself with these key documentation sections:

Getting Started

Core Concepts

Understanding these concepts will help you make the most of the tutorials:

  • Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
  • Validators - Understand how to validate generated data with Python, SQL, and remote validators
  • Person Sampling - Learn how to sample realistic person data with demographic attributes