Overview
Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.
🚀 Setting Up Your Environment
Local Setup Best Practices
First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:
=== "uv (Recommended)"
```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial
# Launch Jupyter
uv run jupyter notebook
```
=== "pip + venv"
```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial
# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter
# Launch Jupyter
jupyter notebook
```
API Keys and Authentication
Data Designer can interface with various LLM providers. Set an API key for each provider whose models you plan to use:
```bash
# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"
# For OpenAI
export OPENAI_API_KEY="your-api-key-here"
# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"
```
For more information, see the Welcome page, Default Model Settings, and Configure Model Settings Using the CLI.
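Before launching a notebook, it can help to confirm which of these keys are actually visible to Python. The following is a minimal sketch using only the standard library. The helper name `find_configured_providers` is hypothetical and not part of Data Designer; only the variable names come from the export commands above.

```python
import os
from collections.abc import Mapping

# Environment variable names taken from the export commands above.
PROVIDER_ENV_VARS = {
    "NVIDIA API Catalog": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def find_configured_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the providers whose API key variable is set to a non-empty value."""
    return [
        provider
        for provider, var_name in PROVIDER_ENV_VARS.items()
        if env.get(var_name, "").strip()
    ]


if __name__ == "__main__":
    configured = find_configured_providers()
    if configured:
        print("API keys found for:", ", ".join(configured))
    else:
        print("No provider API keys found; export at least one before running the notebooks.")
```

Passing the environment as a parameter keeps the check easy to test without touching real credentials.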
📚 Tutorial Series
The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:
1. The Basics
Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:
- Setting up the `DataDesigner` interface
- Configuring models and inference parameters
- Using built-in samplers (Category, Person, Uniform)
- Generating LLM text columns with dependencies
- Understanding the generation workflow
Start here if you're new to Data Designer!
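To make the idea of sampler columns feeding an LLM text column concrete, here is a conceptual sketch in plain Python. It does not use the Data Designer API: the sampler functions, column names, and prompt template are hypothetical stand-ins, and the model call is omitted. It only illustrates that sampled columns are produced first and that a dependent column's prompt is rendered from them.

```python
import random


# Hypothetical stand-ins for built-in samplers; not the Data Designer API.
def sample_category(rng: random.Random, values: list[str]) -> str:
    """Pick one value uniformly from a fixed list of categories."""
    return rng.choice(values)


def sample_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a float uniformly from [low, high]."""
    return rng.uniform(low, high)


# Prompt for a dependent text column; it references the sampled columns by name.
REVIEW_PROMPT = "Write a {stars}-star review of a product in the {category} category."


def build_record(rng: random.Random) -> dict:
    """Generate sampled columns first, then render the prompt that depends on them."""
    record = {
        "category": sample_category(rng, ["Electronics", "Books", "Home"]),
        "stars": round(sample_uniform(rng, 1, 5)),
    }
    # In a real workflow this prompt would be sent to a configured model.
    record["review_prompt"] = REVIEW_PROMPT.format(**record)
    return record


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    for _ in range(3):
        print(build_record(rng))
```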
2. Structured Outputs and Jinja Expressions
Explore more advanced data generation capabilities:
- Creating structured JSON outputs with schemas
- Using Jinja expressions for derived columns
- Combining samplers with structured data
- Building complex data dependencies
- Working with nested data structures
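The two ideas in this notebook can be sketched with the standard library alone, without Data Designer itself: a model response is parsed as JSON and checked against a small nested schema, and a derived value is then computed from the parsed fields. The schema, field names, and sample response are invented for illustration, and a Python f-string stands in for the Jinja expression.

```python
import json
from dataclasses import dataclass


@dataclass
class Dimensions:
    width_cm: float
    height_cm: float


@dataclass
class Product:
    name: str
    price: float
    dimensions: Dimensions  # nested structure


def parse_product(raw_json: str) -> Product:
    """Parse a JSON string and fail loudly if it does not match the schema."""
    data = json.loads(raw_json)
    dims = data["dimensions"]  # raises KeyError if the nested object is missing
    return Product(
        name=str(data["name"]),
        price=float(data["price"]),
        dimensions=Dimensions(
            width_cm=float(dims["width_cm"]),
            height_cm=float(dims["height_cm"]),
        ),
    )


def describe(product: Product) -> str:
    """Derived column: computed from other fields, with no model call."""
    area = product.dimensions.width_cm * product.dimensions.height_cm
    return f"{product.name} costs ${product.price:.2f} and covers {area:.0f} cm^2"


if __name__ == "__main__":
    # Invented sample of what a model might return for this schema.
    response = '{"name": "Desk Mat", "price": 19.5, "dimensions": {"width_cm": 80, "height_cm": 30}}'
    print(describe(parse_product(response)))
```

Rejecting malformed responses at parse time means downstream derived columns never see partial data.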
3. Seeding with an External Dataset
Learn how to leverage existing datasets to guide synthetic data generation:
- Loading and using seed datasets
- Sampling from real data distributions
- Combining seed data with LLM generation
- Creating realistic synthetic data based on existing patterns
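Combining seed data with LLM generation can be sketched without Data Designer: rows are drawn from an existing dataset, so the sample follows the distribution of the real data, and each drawn row fills a prompt template. The seed records, column names, and template below are hypothetical, and the model call itself is left out.

```python
import random
from collections import Counter

# Hypothetical seed dataset; in practice this would be loaded from a file.
SEED_ROWS = [
    {"specialty": "cardiology", "symptom": "chest pain"},
    {"specialty": "cardiology", "symptom": "palpitations"},
    {"specialty": "cardiology", "symptom": "shortness of breath"},
    {"specialty": "dermatology", "symptom": "rash"},
]

PROMPT_TEMPLATE = "Write a short {specialty} clinic note for a patient reporting {symptom}."


def sample_seed_prompts(rows: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Sample rows with replacement and attach a prompt rendered from each row."""
    rng = random.Random(seed)
    # Uniform sampling over rows keeps column frequencies close to the seed data.
    sampled = rng.choices(rows, k=n)
    return [{**row, "prompt": PROMPT_TEMPLATE.format(**row)} for row in sampled]


if __name__ == "__main__":
    records = sample_seed_prompts(SEED_ROWS, n=1000)
    # Expect roughly three cardiology rows for every dermatology row.
    print(Counter(r["specialty"] for r in records))
    print(records[0]["prompt"])
```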
4. Providing Images as Context
Learn how to use vision-language models to generate text descriptions from images:
- Processing and converting images to base64 format for model consumption
- Using vision-language models (VLMs) to analyze visual documents
- Generating detailed summaries from document images
- Inspecting and validating vision-based generation results
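The base64 conversion step can be sketched with the standard library alone. The helper name `image_to_data_url` is hypothetical, and whether a given model endpoint expects a full data URL or only the raw base64 string is an assumption to verify against the notebook and the provider's documentation.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def image_to_data_url(path: Path) -> str:
    """Read an image file and return it as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"{path} does not look like an image file")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "page.png"
        sample.write_bytes(b"\x89PNG\r\n\x1a\n")  # PNG signature only, not a full image
        print(image_to_data_url(sample)[:40])
```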
5. Generating Images
Generate synthetic image data with Data Designer:
- Configuring image-generation models with `ImageInferenceParams`
- Adding image columns with Jinja2 prompts and sampler-driven diversity
- Comparing preview (images returned as base64 in the dataframe) with create (images saved to disk, with file paths in the dataframe)
- Displaying generated images in the notebook
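The difference between holding an image inline as base64 and holding a path to a saved file can be illustrated in plain Python. This is a sketch of the general idea, not Data Designer's implementation: the helper `save_base64_image`, the row layout, and the `.png` extension are all assumptions made for illustration.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_image(image_b64: str, output_dir: Path, stem: str) -> Path:
    """Decode a base64 image and write it to disk, returning the file path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{stem}.png"  # assumes PNG data for illustration
    path.write_bytes(base64.b64decode(image_b64))
    return path


if __name__ == "__main__":
    fake_image_b64 = base64.b64encode(b"not a real image").decode("ascii")

    # Preview-style row: the image travels inline as a base64 string.
    preview_row = {"prompt": "a red bicycle", "image": fake_image_b64}

    # Create-style row: the image lives on disk and the row holds its path.
    with tempfile.TemporaryDirectory() as tmp:
        path = save_base64_image(fake_image_b64, Path(tmp) / "images", "row_0")
        create_row = {"prompt": "a red bicycle", "image": str(path)}
        print(len(preview_row["image"]), "base64 characters vs path", create_row["image"])
```

Inline base64 is convenient for quick inspection, while paths keep large datasets small in memory.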
📖 Important Documentation Sections
Before diving into the tutorials, familiarize yourself with these key documentation sections:
Getting Started
- Welcome & Installation - Overview of Data Designer capabilities and installation instructions
Core Concepts
Understanding these concepts will help you make the most of the tutorials:
- Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
- Validators - Understand how to validate generated data with Python, SQL, and remote validators
- Person Sampling - Learn how to sample realistic person data with demographic attributes
Code Reference
Quick reference guides for the main configuration objects:
- column_configs - All column configuration types
- config_builder - The `DataDesignerConfigBuilder` API
- data_designer_config - Main configuration schema
- validator_params - Validator configuration options