mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

feat: add OpenRouter as one of the default providers (#161)

* Add openrouter as a default provider

* Update docs

2026-01-06 10:22:18 -07:00

4.6 KiB


Overview

Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.

🚀 Setting Up Your Environment

Local Setup Best Practices

First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:

=== "uv (Recommended)"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Launch Jupyter
uv run jupyter notebook
```

=== "pip + venv"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter

# Launch Jupyter
jupyter notebook
```

API Keys and Authentication

Data Designer can interface with various LLM providers. Set an API key for each provider whose models you want to use:

```bash
# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"

# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"
```

For more information, see the Quick Start, Default Model Settings, and Configure Model Settings Using The CLI guides.
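Before launching the notebooks, it can help to confirm which of these keys are visible to Python. The following is a minimal sketch using only the standard library; the helper name `find_configured_providers` is hypothetical and is not part of Data Designer. Only the environment variable names come from the commands above.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEYS = {
    "NVIDIA API Catalog": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def find_configured_providers(environ=None):
    """Return the provider names whose API key variable is set and non-empty.

    Hypothetical helper for illustration; it only inspects environment
    variables and does not contact any provider.
    """
    env = os.environ if environ is None else environ
    return [
        name
        for name, variable in PROVIDER_KEYS.items()
        if env.get(variable, "").strip()
    ]


if __name__ == "__main__":
    configured = find_configured_providers()
    if configured:
        print("API keys found for:", ", ".join(configured))
    else:
        print("No provider API keys are set; export at least one first.")
```

A key that is set but blank is treated as missing, since an empty value would most likely fail at request time anyway.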

📚 Tutorial Series

The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:

1. The Basics

Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:

Setting up the DataDesigner interface
Configuring models and inference parameters
Using built-in samplers (Category, Person, Uniform)
Generating LLM text columns with dependencies
Understanding the generation workflow

Start here if you're new to Data Designer!
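To give a feel for the workflow before opening the notebook, here is a plain-Python stand-in for the idea of sampler columns feeding a dependent text column. It does not use the Data Designer API; every function name, category value, and prompt below is an assumption made for illustration, and the notebook itself shows the real interface.

```python
import random


def sample_category(rng, values):
    """Pick one value uniformly at random from a fixed list (category-style sampler)."""
    return rng.choice(values)


def sample_uniform(rng, low, high):
    """Draw a float from the interval [low, high] (uniform-style sampler)."""
    return rng.uniform(low, high)


def build_review_prompt(row):
    """Fill a prompt from earlier columns, as a dependent text column would."""
    return (
        f"Write a {row['rating']}-star review of a product in the "
        f"{row['category']} category."
    )


def generate_rows(num_rows, seed=0):
    """Generate rows column by column, sampled columns first."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_rows):
        row = {
            "category": sample_category(rng, ["Electronics", "Books", "Home"]),
            "rating": round(sample_uniform(rng, 1, 5)),
        }
        # Dependent column: built only after the columns it references exist.
        row["review_prompt"] = build_review_prompt(row)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in generate_rows(3):
        print(row["review_prompt"])
```

In this sketch the prompt would be sent to a model to produce the review text; here it is only printed, so the example runs without any API key.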

2. Structured Outputs and Jinja Expressions

Explore more advanced data generation capabilities:

Creating structured JSON outputs with schemas
Using Jinja expressions for derived columns
Combining samplers with structured data
Building complex data dependencies
Working with nested data structures
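As a rough illustration of what "structured output plus a derived column" means, the sketch below parses JSON text, checks it against a very small schema, and computes a derived value from the result. It uses only the standard library; the schema, the field names, and both helper functions are hypothetical. In the notebook, a Jinja expression takes the place of the f-string used here.

```python
import json

# Hypothetical schema: field name -> expected Python type.
PRODUCT_SCHEMA = {"name": str, "price": float, "tags": list}


def parse_structured_output(raw_text, schema):
    """Parse JSON text and check each schema field is present with the expected type."""
    record = json.loads(raw_text)
    for field, expected_type in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    return record


def derive_label(record):
    """Derived column computed from fields of the structured value."""
    return f"{record['name']} ({len(record['tags'])} tags)"


if __name__ == "__main__":
    raw = '{"name": "Lamp", "price": 19.99, "tags": ["home", "light"]}'
    product = parse_structured_output(raw, PRODUCT_SCHEMA)
    print(derive_label(product))
```

The type check is deliberately strict: a whole-number price such as `20` parses as an `int` and would be rejected, which is one reason real schema validation is usually delegated to a dedicated library.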

3. Seeding with an External Dataset

Learn how to leverage existing datasets to guide synthetic data generation:

Loading and using seed datasets
Sampling from real data distributions
Combining seed data with LLM generation
Creating realistic synthetic data based on existing patterns
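The idea behind seeding can be sketched without Data Designer at all: rows are drawn from an existing table, and each drawn row supplies context for a generation prompt. The example below is a standard-library sketch under that assumption; the inline CSV, the column names, and the helper functions are all made up for illustration.

```python
import csv
import io
import random

# Hypothetical seed data; in practice this would be loaded from a real file.
SEED_CSV = """symptom,department
persistent cough,Pulmonology
knee pain,Orthopedics
blurred vision,Ophthalmology
"""


def load_seed_rows(csv_text):
    """Read seed rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def sample_seeded_prompts(seed_rows, num_records, seed=0):
    """Sample seed rows with replacement and attach a prompt to each one."""
    rng = random.Random(seed)
    records = []
    for _ in range(num_records):
        seed_row = rng.choice(seed_rows)
        prompt = (
            f"Write a short patient note about {seed_row['symptom']} "
            f"for the {seed_row['department']} department."
        )
        records.append({**seed_row, "prompt": prompt})
    return records


if __name__ == "__main__":
    for record in sample_seeded_prompts(load_seed_rows(SEED_CSV), 4):
        print(record["prompt"])
```

Sampling with replacement means frequent seed values stay frequent in the output, which is how the generated data can follow the distribution of the seed data.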

4. Providing Images as Context

Learn how to use vision-language models to generate text descriptions from images:

Processing and converting images to base64 format for model consumption
Using vision-language models (VLMs) to analyze visual documents
Generating detailed summaries from document images
Inspecting and validating vision-based generation results
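The base64 conversion step can be sketched with the standard library alone. The helper name `to_base64_data_url` is hypothetical, and wrapping the result in a data URL is an assumption; the notebook may pass the encoded string to the model in a different form.

```python
import base64


def to_base64_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as base64 and wrap them in a data URL.

    Hypothetical helper for illustration; the data URL wrapper is an
    assumption about how the encoded image is handed to a model.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    # The 8-byte PNG signature stands in for real image content here.
    png_signature = b"\x89PNG\r\n\x1a\n"
    print(to_base64_data_url(png_signature))
```

With a real file, the bytes would come from something like `pathlib.Path("page.png").read_bytes()`, where `page.png` is a placeholder file name.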

📖 Important Documentation Sections

Before diving into the tutorials, familiarize yourself with these key documentation sections:

Getting Started

Installation - Detailed installation instructions for various setups
Welcome Guide - Overview of Data Designer capabilities and architecture

Core Concepts

Understanding these concepts will help you make the most of the tutorials:

Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
Validators - Understand how to validate generated data with Python, SQL, and remote validators
Person Sampling - Learn how to sample realistic person data with demographic attributes

Code Reference

Quick reference guides for the main configuration objects:

column_configs - All column configuration types
config_builder - The DataDesignerConfigBuilder API
data_designer_config - Main configuration schema
validator_params - Validator configuration options
