DataDesigner/docs/notebook_source/_README.md

134 lines
4.9 KiB
Markdown
Raw Permalink Normal View History

# Overview
Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.
## 🚀 Setting Up Your Environment
### Local Setup Best Practices
First, download the tutorial [from the release assets](https://github.com/NVIDIA-NeMo/DataDesigner/releases/latest/download/data_designer_tutorial.zip).
To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:
=== "uv (Recommended)"
```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial
# Launch Jupyter
uv run jupyter notebook
```
=== "pip + venv"
```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial
# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter
# Launch Jupyter
jupyter notebook
```
### API Keys and Authentication
Data Designer is able to interface with various LLM providers. You'll need to set up API keys for the models you want to use:
```bash
# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"
# For OpenAI
export OPENAI_API_KEY="your-api-key-here"
# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"
```
For more information, check the [Welcome](../index.md), [Default Model Settings](../concepts/models/default-model-settings.md) and how to [Configure Model Settings Using The CLI](../concepts/models/configure-model-settings-with-the-cli.md).
## 📚 Tutorial Series
The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:
### [1. The Basics](1-the-basics.ipynb)
Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:
- Setting up the `DataDesigner` interface
- Configuring models and inference parameters
- Using built-in samplers (Category, Person, Uniform)
- Generating LLM text columns with dependencies
- Understanding the generation workflow
**Start here if you're new to Data Designer!**
feat: add skip.when conditional column generation (#502) * plan: add skip_when for conditional column generation (#479) Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: remove HopChain example from skip_when plan Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: replace HopChain example with generic product review example Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: add open questions on skip sentinel value and row filtering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: major revision — SkipConfig model, sync engine support, decouple propagation - Introduce SkipConfig(when, value) as nested model on SingleColumnConfig - Move propagate_skip to SingleColumnConfig as independent field, fixing bug where columns with no SkipConfig couldn't participate in propagation - Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths - Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async) - Simplify architecture diagrams for readability - Update all references, design decisions, verification plan Made-with: Cursor * updates * plan: document get_required_columns for skip propagation - Explain why propagation must not use get_upstream_columns() once skip.when adds DAG edges; add _required_columns and get_required_columns() to the execution graph plan - Point async _run_cell at get_required_columns for parity with sync - Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for DataFrames; tighten resolved-questions wording - Extend DAG/graph verification with gating_col regression case Refs #479 Made-with: Cursor * plan: centralize __skipped__ handling in skip_provenance - Document new skip_provenance.py (key constant, read/write/strip API) - Point sync builder, async scheduler, and batch buffers at shared helpers - Strip metadata before every DataFrame from buffer dicts, including FULL_COLUMN active subsets - Split §3 into skip_evaluator vs skip_provenance; extend verification Refs #479 Made-with: Cursor * plan: align doc title with SkipConfig / skip.when Drop legacy skip_when naming in headings and #362 cross-reference. Refs #479 Made-with: Cursor * plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization - SkipConfig._validate_when_syntax now checks find_undeclared_variables is non-empty, rejecting expressions without {{ }} delimiters that would silently skip every row - evaluate_skip_when centralizes try/except so both sync and async engines get identical fail-safe behavior on eval errors - evaluate_skip_when takes a single pre-deserialized record; caller runs deserialize_json_values once and passes to both skip eval and generator (no double deserialization, no redundant parameter) - Update _should_skip_cell, async _run_cell, Files Modified table, and verification section accordingly Refs #479 Made-with: Cursor * plan: add get_side_effect_columns accessor to execution graph spec Document _side_effects_by_producer inverse map and get_side_effect_columns() accessor on ExecutionGraph, needed by _write_skip_to_record / apply_skip_to_record to clear __trace, __reasoning_content, etc. on skip. Added to both Step 2b metadata section and Files Modified table. The __skipped__ leak into active_df (greptile's other P1) was already fixed in 70463789 via strip_skip_metadata_from_records. Refs #479 Made-with: Cursor * add skip.when conditional column generation Introduce SkipConfig on SingleColumnConfig to gate column generation with a Jinja2 expression. Columns can be skipped by expression or by upstream propagation (propagate_skip flag). - SkipConfig: Pydantic model with config-time syntax/delimiter/variable validation and cached column extraction from the Jinja2 AST - skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment with fail-safe error handling (skip on expected failures) - skip_provenance: centralized __skipped__ record tracking shared by sync builder, async scheduler, and buffer managers - DAG/ExecutionGraph: skip.columns wired as dependency edges in both topological sort and static execution graph - Validation: validate_skip_references checks reference existence, sampler/seed scope, and allow_resize conflicts - Sync builder: cell-by-cell and full-column skip with merge-back - Async scheduler: cell and batch skip with live-buffer provenance Made-with: Cursor * fix review findings for skip.when implementation - Add skip evaluation to _fan_out_with_async (was missing, causing skipped rows to still be sent to the LLM) - Preserve __skipped__ provenance on non-skipped records after full-column generation so multi-hop propagation works - Use single live-buffer reference in _run_batch skip loop for consistency with _run_cell - Move Template import to TYPE_CHECKING and reorder import blocks - Replace O(n²) sum() with itertools.chain in dag.py - Add set_required_columns/set_propagate_skip/set_skip_config setters to ExecutionGraph for symmetry with existing API Made-with: Cursor * add conditional generation with skip recipe and refactor skip helpers Add a new recipe demonstrating skip.when patterns (expression gate, propagation, opt-out) with a customer support ticket pipeline. Also extract _should_skip_record in async_scheduler, remove the redundant propagate_skip param from should_skip_by_propagation, and pass a precomputed all_side_effects set through the DAG sort. Made-with: Cursor * updates * fixes * remove recipe > inject conditional gen into existing tutorial * regen colab notebooks * fix: handle missing execution graph in _column_can_skip Return False when the graph has not been initialized instead of raising, since skip logic cannot apply before generators are set up. Made-with: Cursor * parametrize some tests * public before private * slight refactor for readability * parametrize some tests * minor fixes * reanme internla skip tracker key name * clarify intent in comment * when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it * remove inline import * minor refactor for clarity * fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch Two bugs in the sequential engine's _run_full_column_generator: 1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False fallthrough), breaking propagate_skip for downstream columns when an independent FULL_COLUMN generator ran between skip-setting and propagating columns. 2. _column_can_skip returned True for allow_resize=True columns via propagation, causing the skip-aware merge path to raise on the 1:1 row-count check for 1:N generators. - Add restore_skip_metadata helper to skip_tracker.py - Guard _column_can_skip against allow_resize=True columns - Refactor _run_full_column_generator into three focused methods - Remove dead allow_resize / _log_resize_if_changed from skip path - Remove redundant _require_graph() calls in skip helpers - Add single_column_config_by_name cached property - Add integration tests for both bugs and unit tests for the helper Made-with: Cursor * address review comments on skip.when PR (#502) - Extract shared skip decision logic (_should_skip_cell / _should_skip_record) into should_skip_column_for_record() in skip_evaluator.py so both sync and async engines call the same function (andreatgretel review comment) - Extend SkipConfig self-reference validation to cover side-effect columns (e.g. review__trace on the review column) — previously only checked self.name, now checks self.name | self.side_effect_columns - Add async engine integration tests for skip paths: cell-by-cell with propagation and full-column batch skip (exercises _run_cell / _run_batch) - Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default propagate_skip=True so it actually exercises the allow_resize guard - Move get_skipped_column_names from skip_tracker to skip_evaluator (sole production consumer) Made-with: Cursor * address cr feedback * Fix issue with full column generating messing up order of skipped rows * add skip conditional generation edge case tests - test_skip_evaluator: parametrized should_skip_column_for_record covering propagation, expression gates, short-circuiting, and disabled propagation - test_execution_graph: skip metadata accessors (get_skip_config, should_propagate_skip, get_required_columns, get_side_effect_columns, resolve_side_effect, skip.when DAG edges) - test_dataset_builder: chained transitive propagation (4 levels), two independent skip gates, custom skip.value, row count preservation Made-with: Cursor * fix: make expression jinja validator private Rename assert_expression_valid_jinja to _assert_expression_valid_jinja to match the private naming convention used by other model validators. Made-with: Cursor --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
### [2. Structured Outputs, Jinja Expressions, and Conditional Generation](2-structured-outputs-and-jinja-expressions.ipynb)
Explore more advanced data generation capabilities:
- Creating structured JSON outputs with schemas
- Using Jinja expressions for derived columns
- Combining samplers with structured data
- Building complex data dependencies
- Working with nested data structures
feat: add skip.when conditional column generation (#502) * plan: add skip_when for conditional column generation (#479) Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: remove HopChain example from skip_when plan Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: replace HopChain example with generic product review example Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: add open questions on skip sentinel value and row filtering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * plan: major revision — SkipConfig model, sync engine support, decouple propagation - Introduce SkipConfig(when, value) as nested model on SingleColumnConfig - Move propagate_skip to SingleColumnConfig as independent field, fixing bug where columns with no SkipConfig couldn't participate in propagation - Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths - Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async) - Simplify architecture diagrams for readability - Update all references, design decisions, verification plan Made-with: Cursor * updates * plan: document get_required_columns for skip propagation - Explain why propagation must not use get_upstream_columns() once skip.when adds DAG edges; add _required_columns and get_required_columns() to the execution graph plan - Point async _run_cell at get_required_columns for parity with sync - Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for DataFrames; tighten resolved-questions wording - Extend DAG/graph verification with gating_col regression case Refs #479 Made-with: Cursor * plan: centralize __skipped__ handling in skip_provenance - Document new skip_provenance.py (key constant, read/write/strip API) - Point sync builder, async scheduler, and batch buffers at shared helpers - Strip metadata before every DataFrame from buffer dicts, including FULL_COLUMN active subsets - Split §3 into skip_evaluator vs skip_provenance; extend verification Refs #479 Made-with: Cursor * plan: align doc title with SkipConfig / skip.when Drop legacy skip_when naming in headings and #362 cross-reference. Refs #479 Made-with: Cursor * plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization - SkipConfig._validate_when_syntax now checks find_undeclared_variables is non-empty, rejecting expressions without {{ }} delimiters that would silently skip every row - evaluate_skip_when centralizes try/except so both sync and async engines get identical fail-safe behavior on eval errors - evaluate_skip_when takes a single pre-deserialized record; caller runs deserialize_json_values once and passes to both skip eval and generator (no double deserialization, no redundant parameter) - Update _should_skip_cell, async _run_cell, Files Modified table, and verification section accordingly Refs #479 Made-with: Cursor * plan: add get_side_effect_columns accessor to execution graph spec Document _side_effects_by_producer inverse map and get_side_effect_columns() accessor on ExecutionGraph, needed by _write_skip_to_record / apply_skip_to_record to clear __trace, __reasoning_content, etc. on skip. Added to both Step 2b metadata section and Files Modified table. The __skipped__ leak into active_df (greptile's other P1) was already fixed in 70463789 via strip_skip_metadata_from_records. Refs #479 Made-with: Cursor * add skip.when conditional column generation Introduce SkipConfig on SingleColumnConfig to gate column generation with a Jinja2 expression. Columns can be skipped by expression or by upstream propagation (propagate_skip flag). - SkipConfig: Pydantic model with config-time syntax/delimiter/variable validation and cached column extraction from the Jinja2 AST - skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment with fail-safe error handling (skip on expected failures) - skip_provenance: centralized __skipped__ record tracking shared by sync builder, async scheduler, and buffer managers - DAG/ExecutionGraph: skip.columns wired as dependency edges in both topological sort and static execution graph - Validation: validate_skip_references checks reference existence, sampler/seed scope, and allow_resize conflicts - Sync builder: cell-by-cell and full-column skip with merge-back - Async scheduler: cell and batch skip with live-buffer provenance Made-with: Cursor * fix review findings for skip.when implementation - Add skip evaluation to _fan_out_with_async (was missing, causing skipped rows to still be sent to the LLM) - Preserve __skipped__ provenance on non-skipped records after full-column generation so multi-hop propagation works - Use single live-buffer reference in _run_batch skip loop for consistency with _run_cell - Move Template import to TYPE_CHECKING and reorder import blocks - Replace O(n²) sum() with itertools.chain in dag.py - Add set_required_columns/set_propagate_skip/set_skip_config setters to ExecutionGraph for symmetry with existing API Made-with: Cursor * add conditional generation with skip recipe and refactor skip helpers Add a new recipe demonstrating skip.when patterns (expression gate, propagation, opt-out) with a customer support ticket pipeline. Also extract _should_skip_record in async_scheduler, remove the redundant propagate_skip param from should_skip_by_propagation, and pass a precomputed all_side_effects set through the DAG sort. Made-with: Cursor * updates * fixes * remove recipe > inject conditional gen into existing tutorial * regen colab notebooks * fix: handle missing execution graph in _column_can_skip Return False when the graph has not been initialized instead of raising, since skip logic cannot apply before generators are set up. Made-with: Cursor * parametrize some tests * public before private * slight refactor for readability * parametrize some tests * minor fixes * reanme internla skip tracker key name * clarify intent in comment * when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it * remove inline import * minor refactor for clarity * fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch Two bugs in the sequential engine's _run_full_column_generator: 1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False fallthrough), breaking propagate_skip for downstream columns when an independent FULL_COLUMN generator ran between skip-setting and propagating columns. 2. _column_can_skip returned True for allow_resize=True columns via propagation, causing the skip-aware merge path to raise on the 1:1 row-count check for 1:N generators. - Add restore_skip_metadata helper to skip_tracker.py - Guard _column_can_skip against allow_resize=True columns - Refactor _run_full_column_generator into three focused methods - Remove dead allow_resize / _log_resize_if_changed from skip path - Remove redundant _require_graph() calls in skip helpers - Add single_column_config_by_name cached property - Add integration tests for both bugs and unit tests for the helper Made-with: Cursor * address review comments on skip.when PR (#502) - Extract shared skip decision logic (_should_skip_cell / _should_skip_record) into should_skip_column_for_record() in skip_evaluator.py so both sync and async engines call the same function (andreatgretel review comment) - Extend SkipConfig self-reference validation to cover side-effect columns (e.g. review__trace on the review column) — previously only checked self.name, now checks self.name | self.side_effect_columns - Add async engine integration tests for skip paths: cell-by-cell with propagation and full-column batch skip (exercises _run_cell / _run_batch) - Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default propagate_skip=True so it actually exercises the allow_resize guard - Move get_skipped_column_names from skip_tracker to skip_evaluator (sole production consumer) Made-with: Cursor * address cr feedback * Fix issue with full column generating messing up order of skipped rows * add skip conditional generation edge case tests - test_skip_evaluator: parametrized should_skip_column_for_record covering propagation, expression gates, short-circuiting, and disabled propagation - test_execution_graph: skip metadata accessors (get_skip_config, should_propagate_skip, get_required_columns, get_side_effect_columns, resolve_side_effect, skip.when DAG edges) - test_dataset_builder: chained transitive propagation (4 levels), two independent skip gates, custom skip.value, row count preservation Made-with: Cursor * fix: make expression jinja validator private Rename assert_expression_valid_jinja to _assert_expression_valid_jinja to match the private naming convention used by other model validators. Made-with: Cursor --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
- Conditional generation with `skip.when`
### [3. Seeding with an External Dataset](3-seeding-with-a-dataset.ipynb)
Learn how to leverage existing datasets to guide synthetic data generation:
- Loading and using seed datasets
- Sampling from real data distributions
- Combining seed data with LLM generation
- Creating realistic synthetic data based on existing patterns
### [4. Providing Images as Context](4-providing-images-as-context.ipynb)
Learn how to use vision-language models to generate text descriptions from images:
- Processing and converting images to base64 format for model consumption
- Using vision-language models (VLMs) to analyze visual documents
- Generating detailed summaries from document images
- Inspecting and validating vision-based generation results
### [5. Generating Images](5-generating-images.ipynb)
Generate synthetic image data with Data Designer:
- Configuring image-generation models with `ImageInferenceParams`
- Adding image columns with Jinja2 prompts and sampler-driven diversity
- Preview (base64 in dataframe) vs create (images saved to disk, paths in dataframe)
- Displaying generated images in the notebook
### [6. Image-to-Image Editing](6-editing-images-with-image-context.ipynb)
Chain image generation columns to generate and then edit images:
- Generating images from text and then editing them in a follow-up column
- Using `ImageContext` with auto-detection to pass generated images to an editing model
- Combining sampled accessories and settings for varied edits
- Comparing generated vs edited images in preview and create modes
## 📖 Important Documentation Sections
Before diving into the tutorials, familiarize yourself with these key documentation sections:
### Getting Started
- **[Welcome & Installation](../index.md)** - Overview of Data Designer capabilities and installation instructions
### Core Concepts
Understanding these concepts will help you make the most of the tutorials:
- **[Columns](../concepts/columns.md)** - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
- **[Validators](../concepts/validators.md)** - Understand how to validate generated data with Python, SQL, and remote validators
- **[Person Sampling](../concepts/person_sampling.md)** - Learn how to sample realistic person data with demographic attributes