Overview

Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.

🚀 Setting Up Your Environment

Local Setup Best Practices

First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:

=== "uv (Recommended)"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Launch Jupyter
uv run jupyter notebook
```

=== "pip + venv"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter

# Launch Jupyter
jupyter notebook
```

API Keys and Authentication

Data Designer can interface with various LLM providers. Set an API key for each provider whose models you want to use:

```bash
# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"

# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"
```

For more information, see the Quick Start and Default Model Settings pages, and the guide on how to Configure Model Settings Using The CLI.
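
Before launching a notebook, it can help to confirm which of these keys are visible to Python. The following is a minimal sketch that uses only the standard library; the variable names come from the export commands above, while the `configured_providers` helper is a hypothetical name and not part of Data Designer.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEYS = {
    "NVIDIA API Catalog": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


found = configured_providers()
if found:
    print("API keys found for:", ", ".join(found))
else:
    print("No provider API keys are set; export at least one before running the notebooks.")
```

If a key you exported is not listed, the shell that started Jupyter probably did not inherit it; export the variable and restart Jupyter from the same shell.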

📚 Tutorial Series

The tutorials are designed to be completed in sequence; each one builds on concepts introduced in earlier notebooks:

1. The Basics

Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:

  • Setting up the DataDesigner interface
  • Configuring models and inference parameters
  • Using built-in samplers (Category, Person, Uniform)
  • Generating LLM text columns with dependencies
  • Understanding the generation workflow

Start here if you're new to Data Designer!
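
To give a feel for the ideas in this notebook, here is a conceptual sketch in plain Python. It deliberately does not use the Data Designer API: the column names, value lists, and helper functions are all illustrative assumptions. It shows category-style and uniform-style sampling, and a prompt that depends on the sampled values, which is the role an LLM text column plays.

```python
import random

CATEGORIES = ["Electronics", "Books", "Home"]  # illustrative values


def sample_record(rng):
    """Draw one row of sampled columns (plain Python, not the Data Designer API)."""
    return {
        # Category-style sampling: pick one value from a fixed set.
        "product_category": rng.choice(CATEGORIES),
        # Uniform-style sampling: draw a number from a continuous range.
        "price": round(rng.uniform(5.0, 200.0), 2),
        "rating": rng.randint(1, 5),
    }


def build_prompt(record):
    """Fill a prompt from sampled columns; an LLM column would depend on these values."""
    return (
        f"Write a {record['rating']}-star review for a product in the "
        f"{record['product_category']} category priced at ${record['price']:.2f}."
    )


rng = random.Random(0)  # fixed seed so runs are repeatable
record = sample_record(rng)
print(build_prompt(record))
```

In the notebook itself, the library handles sampling and prompt rendering for you; the sketch only illustrates why sampled columns must exist before a column that references them.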

2. Structured Outputs and Jinja Expressions

Explore more advanced data generation capabilities:

  • Creating structured JSON outputs with schemas
  • Using Jinja expressions for derived columns
  • Combining samplers with structured data
  • Building complex data dependencies
  • Working with nested data structures

3. Seeding with an External Dataset

Learn how to leverage existing datasets to guide synthetic data generation:

  • Loading and using seed datasets
  • Sampling from real data distributions
  • Combining seed data with LLM generation
  • Creating realistic synthetic data based on existing patterns
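
The idea behind seeding can be sketched with the standard library alone. The example below is an assumption-laden illustration, not the Data Designer API: the seed data, column names, and helper functions are hypothetical. It loads seed records, samples them with replacement so that common values stay common, and uses each sampled row as context for a generation prompt.

```python
import csv
import io
import random

# Hypothetical seed data; in practice this would be an existing dataset on disk.
SEED_CSV = """product,category
Desk lamp,Home
Wireless mouse,Electronics
Cookbook,Books
"""


def load_seed_rows(text):
    """Parse CSV text into a list of dictionaries, one per seed record."""
    return list(csv.DictReader(io.StringIO(text)))


def sample_seed_rows(rows, n, rng):
    """Sample n seed rows with replacement, so frequent values stay frequent."""
    return [rng.choice(rows) for _ in range(n)]


rows = load_seed_rows(SEED_CSV)
rng = random.Random(42)  # fixed seed so runs are repeatable
for seed in sample_seed_rows(rows, 2, rng):
    # Each sampled row becomes context for a generated column.
    print(f"Write a review of the {seed['product']} from the {seed['category']} category.")
```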

4. Providing Images as Context

Learn how to use vision-language models to generate text descriptions from images:

  • Processing and converting images to base64 format for model consumption
  • Using vision-language models (VLMs) to analyze visual documents
  • Generating detailed summaries from document images
  • Inspecting and validating vision-based generation results
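
The base64 conversion step can be sketched with Python's standard `base64` module. This is a minimal illustration rather than the notebook's own code: the helper names are hypothetical, and the data URI wrapper is one common way to pass inline images that some vision model APIs accept.

```python
import base64


def image_bytes_to_base64(data):
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(data).decode("ascii")


def to_data_uri(data, mime_type="image/png"):
    """Wrap base64 image data in a data URI with the given MIME type."""
    return f"data:{mime_type};base64,{image_bytes_to_base64(data)}"


# The 8-byte PNG signature stands in for a real file here; for an image on disk,
# read it with pathlib.Path("page.png").read_bytes() (hypothetical file name).
png_signature = b"\x89PNG\r\n\x1a\n"
print(to_data_uri(png_signature))
```

Base64 output is roughly a third larger than the raw bytes, so large document scans may be worth resizing before encoding.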

📖 Important Documentation Sections

Before diving into the tutorials, familiarize yourself with these key documentation sections:

Getting Started

  • Installation - Detailed installation instructions for various setups
  • Welcome Guide - Overview of Data Designer capabilities and architecture

Core Concepts

Understanding these concepts will help you make the most of the tutorials:

  • Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
  • Validators - Understand how to validate generated data with Python, SQL, and remote validators
  • Person Sampling - Learn how to sample realistic person data with demographic attributes

Code Reference

Quick reference guides for the main configuration objects: