DataDesigner/docs/notebook_source/_README.md
Nabin Mulepati a9af365e8e
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)

Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: remove HopChain example from skip_when plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: replace HopChain example with generic product review example

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: add open questions on skip sentinel value and row filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: major revision — SkipConfig model, sync engine support, decouple propagation

- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
  bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
  _fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
  and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan

Made-with: Cursor

* updates

* plan: document get_required_columns for skip propagation

- Explain why propagation must not use get_upstream_columns() once
  skip.when adds DAG edges; add _required_columns and
  get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
  DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case

Refs #479

Made-with: Cursor

* plan: centralize __skipped__ handling in skip_provenance

- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
  FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification

Refs #479

Made-with: Cursor

* plan: align doc title with SkipConfig / skip.when

Drop legacy skip_when naming in headings and #362 cross-reference.

Refs #479

Made-with: Cursor

* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization

- SkipConfig._validate_when_syntax now checks find_undeclared_variables
  is non-empty, rejecting expressions without {{ }} delimiters that
  would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
  engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
  runs deserialize_json_values once and passes to both skip eval and
  generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
  and verification section accordingly

Refs #479

Made-with: Cursor

* plan: add get_side_effect_columns accessor to execution graph spec

Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.

The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.

Refs #479

Made-with: Cursor

* add skip.when conditional column generation

Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).

- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
  validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
  with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
  sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
  topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
  sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance

Made-with: Cursor

* fix review findings for skip.when implementation

- Add skip evaluation to _fan_out_with_async (was missing, causing
  skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
  full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
  consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
  setters to ExecutionGraph for symmetry with existing API

Made-with: Cursor

* add conditional generation with skip recipe and refactor skip helpers

Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.

Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.

Made-with: Cursor

* updates

* fixes

* remove recipe > inject conditional gen into existing tutorial

* regen colab notebooks

* fix: handle missing execution graph in _column_can_skip

Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.

Made-with: Cursor

* parametrize some tests

* public before private

* slight refactor for readability

* parametrize some tests

* minor fixes

* rename internal skip tracker key name

* clarify intent in comment

* when skipped, _run_cell should return the skipped value even though the consumer doesn't currently care about it

* remove inline import

* minor refactor for clarity

* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch

Two bugs in the sequential engine's _run_full_column_generator:

1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
   three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
   fallthrough), breaking propagate_skip for downstream columns when an
   independent FULL_COLUMN generator ran between skip-setting and
   propagating columns.

2. _column_can_skip returned True for allow_resize=True columns via
   propagation, causing the skip-aware merge path to raise on the 1:1
   row-count check for 1:N generators.

- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper

Made-with: Cursor

* address review comments on skip.when PR (#502)

- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
  into should_skip_column_for_record() in skip_evaluator.py so both sync and
  async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
  (e.g. review__trace on the review column) — previously only checked
  self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
  propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
  propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
  production consumer)

Made-with: Cursor

* address cr feedback

* Fix issue with full-column generation messing up order of skipped rows

* add skip conditional generation edge case tests

- test_skip_evaluator: parametrized should_skip_column_for_record covering
  propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
  should_propagate_skip, get_required_columns, get_side_effect_columns,
  resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
  two independent skip gates, custom skip.value, row count preservation

Made-with: Cursor

* fix: make expression jinja validator private

Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.

Made-with: Cursor

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:31:50 -06:00

5.4 KiB

Overview

Welcome to the Data Designer tutorial series! These hands-on notebooks will guide you through the core concepts and features of Data Designer, from basic synthetic data generation to advanced techniques like structured outputs and dataset seeding.

🚀 Setting Up Your Environment

Local Setup Best Practices

First, download the tutorial from the release assets. To run the tutorial notebooks locally, we recommend using a virtual environment to manage dependencies:

=== "uv (Recommended)"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Launch Jupyter
uv run jupyter notebook
```

=== "pip + venv"

```bash
# Extract tutorial notebooks
unzip data_designer_tutorial.zip
cd data_designer_tutorial

# Create Python virtual environment and install required packages
python -m venv venv
source venv/bin/activate
pip install data-designer jupyter

# Launch Jupyter
jupyter notebook
```

API Keys and Authentication

Data Designer can interface with various LLM providers. Set an API key for each provider whose models you want to use:

```bash
# For NVIDIA API Catalog (build.nvidia.com)
export NVIDIA_API_KEY="your-api-key-here"

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"

# For OpenRouter
export OPENROUTER_API_KEY="your-api-key-here"
```
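
Before launching a notebook, it can help to confirm which of these keys are visible to Python. The following is a minimal sketch using only the standard library; the helper name `find_configured_providers` is hypothetical and not part of Data Designer, and the variable names are the ones exported above.

```python
import os

# Environment variable names taken from the export commands above.
PROVIDER_KEYS = {
    "NVIDIA API Catalog": "NVIDIA_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def find_configured_providers(env=None):
    """Return the provider names whose API key variable is set and non-empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    configured = find_configured_providers()
    if configured:
        print("API keys found for:", ", ".join(configured))
    else:
        print("No provider API keys are set; export at least one first.")
```

Only one key is needed if you plan to use a single provider; the check simply reports what the current shell session exposes.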

For more information, see the Welcome page, Default Model Settings, and Configure Model Settings Using The CLI.

📚 Tutorial Series

The tutorials are designed to be completed in sequence, building upon concepts introduced in previous notebooks:

1. The Basics

Learn the fundamentals of Data Designer by generating a simple product review dataset. This notebook covers:

  • Setting up the DataDesigner interface
  • Configuring models and inference parameters
  • Using built-in samplers (Category, Person, Uniform)
  • Generating LLM text columns with dependencies
  • Understanding the generation workflow

Start here if you're new to Data Designer!

2. Structured Outputs, Jinja Expressions, and Conditional Generation

Explore more advanced data generation capabilities:

  • Creating structured JSON outputs with schemas
  • Using Jinja expressions for derived columns
  • Combining samplers with structured data
  • Building complex data dependencies
  • Working with nested data structures
  • Conditional generation with skip.when
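
To build intuition for conditional generation before opening the notebook, here is a minimal sketch of the idea in plain Python. It is not Data Designer's API: the `generate_column` function and its parameters are hypothetical, and a Python callable stands in for the Jinja2 expression that `skip.when` uses. It assumes the behavior described for `skip.when`: when the condition is truthy the cell is set to `None` and the generator is not called, and the skip propagates to columns that depend on the skipped one.

```python
from typing import Any, Callable, Optional

Row = dict[str, Any]


def generate_column(
    rows: list[Row],
    name: str,
    generator: Callable[[Row], Any],
    skip_when: Optional[Callable[[Row], bool]] = None,
    depends_on: tuple[str, ...] = (),
    skipped: Optional[dict[int, set[str]]] = None,
) -> dict[int, set[str]]:
    """Fill column `name` in place, skipping rows gated by `skip_when`.

    `skipped` maps a row index to the set of columns skipped for that row,
    so a skip can propagate to columns that depend on a skipped column.
    """
    skipped = {} if skipped is None else skipped
    for index, row in enumerate(rows):
        already_skipped = skipped.get(index, set())
        upstream_skipped = any(dep in already_skipped for dep in depends_on)
        gated = skip_when is not None and skip_when(row)
        if upstream_skipped or gated:
            row[name] = None  # the generator is never called for this row
            skipped.setdefault(index, set()).add(name)
        else:
            row[name] = generator(row)
    return skipped


rows = [{"rating": 5}, {"rating": 2}]

# Gate: only write a complaint when the rating is low.
skipped = generate_column(
    rows,
    "complaint",
    generator=lambda r: f"complaint about a {r['rating']}-star product",
    skip_when=lambda r: r["rating"] >= 4,
)

# Propagation: the apology depends on the complaint, so it is skipped too.
generate_column(
    rows,
    "apology",
    generator=lambda r: f"Sorry about: {r['complaint']}",
    depends_on=("complaint",),
    skipped=skipped,
)
```

In this sketch the 5-star row ends with `None` in both `complaint` and `apology`, while the 2-star row gets both values. The row count stays the same, which is the main difference from filtering rows out.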

3. Seeding with an External Dataset

Learn how to leverage existing datasets to guide synthetic data generation:

  • Loading and using seed datasets
  • Sampling from real data distributions
  • Combining seed data with LLM generation
  • Creating realistic synthetic data based on existing patterns

4. Providing Images as Context

Learn how to use vision-language models to generate text descriptions from images:

  • Processing and converting images to base64 format for model consumption
  • Using vision-language models (VLMs) to analyze visual documents
  • Generating detailed summaries from document images
  • Inspecting and validating vision-based generation results
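
The base64 conversion step can be sketched with the Python standard library alone. This is an illustration of the encoding itself, not the notebook's code: the helper names are hypothetical, and the 8-byte PNG signature stands in for the contents of a real image file.

```python
import base64


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return the base64 text representation of raw image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image_text(encoded: str) -> bytes:
    """Recover the original bytes from base64 text."""
    return base64.b64decode(encoded)


# The 8-byte PNG signature stands in for a real image file's contents.
# For a file on disk, read it with open(path, "rb").read() instead.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(png_signature)

# The encoding is lossless: decoding returns the exact original bytes.
assert decode_image_text(encoded) == png_signature
```

Base64 output is plain ASCII text, which is why it can travel inside a JSON request or a dataframe cell where raw bytes cannot.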

5. Generating Images

Generate synthetic image data with Data Designer:

  • Configuring image-generation models with ImageInferenceParams
  • Adding image columns with Jinja2 prompts and sampler-driven diversity
  • Comparing preview mode (images returned as base64 in the dataframe) with create mode (images saved to disk, paths stored in the dataframe)
  • Displaying generated images in the notebook

6. Image-to-Image Editing

Chain image generation columns to generate and then edit images:

  • Generating images from text and then editing them in a follow-up column
  • Using ImageContext with auto-detection to pass generated images to an editing model
  • Combining sampled accessories and settings for varied edits
  • Comparing generated vs edited images in preview and create modes

📖 Important Documentation Sections

Before diving into the tutorials, familiarize yourself with these key documentation sections:

Getting Started

Core Concepts

Understanding these concepts will help you make the most of the tutorials:

  • Columns - Learn about different column types (Sampler, LLM, Expression, Validation, etc.)
  • Validators - Understand how to validate generated data with Python, SQL, and remote validators
  • Person Sampling - Learn how to sample realistic person data with demographic attributes

Code Reference

Quick reference guides for the main configuration objects: