Overview
Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.
🚀 Setting Up Your Environment
Local Setup Best Practices
First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:
=== "uv (Recommended)"
```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial
# Launch Jupyter
uv run jupyter notebook
```
=== "pip + venv"
```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial
# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter
# Launch Jupyter
jupyter notebook
```
API Keys and Authentication
Data Designer can interface with various LLM providers. Set up API keys for the models you want to use:
```bash
# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"
# For OpenAI
export OPENAI_API_KEY="your-api-key-here"
# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"
```
For more information, see the Welcome page, Default Model Settings, and Configure Model Settings Using the CLI.
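Before launching a notebook, it can help to confirm which keys are actually visible to the Python process. The following is a minimal sketch using only the standard library; the variable names come from the export commands above, while `PROVIDER_KEYS` and `find_configured_providers` are hypothetical helpers for illustration and not part of Data Designer.

```python
import os

# Environment variable names taken from the export commands above.
# This mapping is a hypothetical helper, not a Data Designer API.
PROVIDER_KEYS = {
    "NVIDIA API Catalog": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def find_configured_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [
        name
        for name, var in PROVIDER_KEYS.items()
        if env.get(var, "").strip()
    ]


if __name__ == "__main__":
    configured = find_configured_providers()
    if configured:
        print("API keys found for:", ", ".join(configured))
    else:
        print("No provider API keys are set; export at least one first.")
```

Only the providers you plan to use need a key, so an empty entry for the others is expected.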
📚 Tutorial Series
The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:
1. The Basics
Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:
- Setting up the `DataDesigner` interface
- Configuring models and inference parameters
- Using built-in samplers (Category, Person, Uniform)
- Generating LLM text columns with dependencies
- Understanding the generation workflow
Start here if you're new to Data Designer!
2. Structured Outputs, Jinja Expressions, and Conditional Generation
Explore more advanced data generation capabilities:
- Creating structured JSON outputs with schemas
- Using Jinja expressions for derived columns
- Combining samplers with structured data
- Building complex data dependencies
- Working with nested data structures
- Conditional generation with `skip.when`
3. Seeding with an External Dataset
Learn how to leverage existing datasets to guide synthetic data generation:
- Loading and using seed datasets
- Sampling from real data distributions
- Combining seed data with LLM generation
- Creating realistic synthetic data based on existing patterns
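The idea behind seeding can be illustrated in plain Python, independent of Data Designer's own API (which the notebook covers). This sketch assumes a small, made-up seed table and draws rows from it to fill a prompt template; `SEED_ROWS`, `sample_seed_rows`, and `build_prompt` are hypothetical names used only for illustration.

```python
import random

# Hypothetical seed records standing in for an existing dataset.
SEED_ROWS = [
    {"category": "electronics", "product": "wireless earbuds"},
    {"category": "kitchen", "product": "cast iron skillet"},
    {"category": "outdoors", "product": "two-person tent"},
]


def sample_seed_rows(rows, num_records, seed=0):
    """Draw rows with replacement so frequent seed values stay frequent."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(num_records)]


def build_prompt(row):
    """Fill a prompt template with the values from one seed row."""
    return (
        f"Write a short customer review for a {row['product']} "
        f"in the {row['category']} category."
    )


prompts = [build_prompt(row) for row in sample_seed_rows(SEED_ROWS, num_records=5)]
for prompt in prompts:
    print(prompt)
```

Each prompt would then be sent to an LLM, so the generated text stays grounded in values that occur in the real data.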
4. Providing Images as Context
Learn how to use vision-language models to generate text descriptions from images:
- Processing and converting images to base64 format for model consumption
- Using vision-language models (VLMs) to analyze visual documents
- Understanding how image, audio, and video context share the same `multi_modal_context` field, while still requiring model support for each modality
- Generating detailed summaries from document images
- Inspecting and validating vision-based generation results
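The base64 conversion step mentioned above can be sketched with the standard library alone. The helper names here are hypothetical, and whether a given model endpoint expects a bare base64 string or a full data URI is an assumption to check against the notebook and your provider.

```python
import base64


def image_bytes_to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap the base64 payload in a data URI (assumed format; verify per provider)."""
    return f"data:{mime_type};base64,{image_bytes_to_base64(image_bytes)}"


# The 8-byte PNG file signature stands in for a real image here.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = image_bytes_to_base64(png_signature)

# Round-trip check: decoding returns the original bytes.
assert base64.b64decode(encoded) == png_signature
print(to_data_uri(png_signature))
```

For a real file, read the bytes with `pathlib.Path("page.png").read_bytes()` (the filename is a placeholder) and pass them to the same helpers.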
5. Generating Images
Generate synthetic image data with Data Designer:
- Configuring image-generation models with `ImageInferenceParams`
- Adding image columns with Jinja2 prompts and sampler-driven diversity
- Preview (base64 in dataframe) vs create (images saved to disk, paths in dataframe)
- Displaying generated images in the notebook
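To make the preview-versus-create distinction concrete, the sketch below shows the underlying conversion in plain Python: a base64 string (as a preview dataframe cell might hold) is decoded and written to disk, leaving a file path (as a created dataset might hold). `save_base64_image` is a hypothetical helper, and the actual file layout Data Designer uses is not assumed here.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_image(encoded: str, output_path: Path) -> Path:
    """Decode a base64 string and write the bytes to output_path."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_bytes(base64.b64decode(encoded))
    return output_path


# Placeholder payload: the PNG file signature, base64-encoded.
preview_cell = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_base64_image(preview_cell, Path(tmp_dir) / "images" / "sample.png")
    print(f"Wrote {saved.stat().st_size} bytes to {saved.name}")
```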
6. Image-to-Image Editing
Chain image generation columns to generate and then edit images:
- Generating images from text and then editing them in a follow-up column
- Using `ImageContext` with auto-detection to pass generated images to an editing model
- Combining sampled accessories and settings for varied edits
- Comparing generated vs edited images in preview and create modes
📖 Important Documentation Sections
Before diving into the tutorials, familiarize yourself with these key documentation sections:
Getting Started
- Welcome & Installation - Overview of Data Designer capabilities and installation instructions
Core Concepts
Understanding these concepts will help you make the most of the tutorials:
- Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
- Validators - Understand how to validate generated data with Python, SQL, and remote validators
- Person Sampling - Learn how to sample realistic person data with demographic attributes