Mirror of https://github.com/NVIDIA-NeMo/DataDesigner (synced 2026-05-24 09:48:29 +00:00)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* rename internal skip tracker key name
* clarify intent in comment
* when skipped, _run_cell should return the skipped value even though the consumer doesn't currently care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
334 lines
9.7 KiB
Python
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.18.1
#   kernelspec:
#     display_name: .venv
#     language: python
#     name: python3
# ---

# %% [markdown]
# # 🎨 Data Designer Tutorial: The Basics
#
# #### 📚 What you'll learn
#
# This notebook demonstrates the basics of Data Designer by generating a simple product review dataset.
#

# %% [markdown]
# ### 📦 Import Data Designer
#
# - `data_designer.config` provides access to the configuration API.
#
# - `DataDesigner` is the main interface for data generation.
#

# %%
import data_designer.config as dd
from data_designer.interface import DataDesigner

# %% [markdown]
# ### ⚙️ Initialize the Data Designer interface
#
# - `DataDesigner` is the main object responsible for managing the data generation process.
#
# - When initialized without arguments, the [default model providers](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) are used.
#

# %%
data_designer = DataDesigner()

# %% [markdown]
# ### 🎛️ Define model configurations
#
# - Each `ModelConfig` defines a model that can be used during the generation process.
#
# - The "model alias" is used to reference the model in the Data Designer config (as we will see below).
#
# - The "model provider" is the external service that hosts the model (see the [model config](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/models/default-model-settings/) docs for more details).
#
# - By default, we use [build.nvidia.com](https://build.nvidia.com/models) as the model provider.
#

# %%
# This name is set in the model provider configuration.
MODEL_PROVIDER = "nvidia"

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/nemotron-3-nano-30b-a3b"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-nano-v3"

model_configs = [
    dd.ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
            max_tokens=2048,
            extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        ),
    )
]

# %% [markdown]
# ### 🏗️ Initialize the Data Designer Config Builder
#
# - The Data Designer config defines the dataset schema and generation process.
#
# - The config builder provides an intuitive interface for building this configuration.
#
# - The list of model configs is provided to the builder at initialization.
#

# %%
config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

# %% [markdown]
# ## 🎲 Getting started with sampler columns
#
# - Sampler columns offer non-LLM based generation of synthetic data.
#
# - They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.
#
# <br>
#
# You can view available samplers using the config builder's `info` property:
#

# %%
config_builder.info.display("samplers")

# %% [markdown]
# Let's start designing our product review dataset by adding product category and subcategory columns.
#

# %%
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_subcategory",
        sampler_type=dd.SamplerType.SUBCATEGORY,
        params=dd.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": [
                    "Smartphones",
                    "Laptops",
                    "Headphones",
                    "Cameras",
                    "Accessories",
                ],
                "Clothing": [
                    "Men's Clothing",
                    "Women's Clothing",
                    "Winter Coats",
                    "Activewear",
                    "Accessories",
                ],
                "Home & Kitchen": [
                    "Appliances",
                    "Cookware",
                    "Furniture",
                    "Decor",
                    "Organization",
                ],
                "Books": [
                    "Fiction",
                    "Non-Fiction",
                    "Self-Help",
                    "Textbooks",
                    "Classics",
                ],
                "Home Office": [
                    "Desks",
                    "Chairs",
                    "Storage",
                    "Office Supplies",
                    "Lighting",
                ],
            },
        ),
    )
)

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="target_age_range",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["18-25", "25-35", "35-50", "50-65", "65+"]),
    )
)

# Optionally validate that the columns are configured correctly.
data_designer.validate(config_builder)

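# %% [markdown]
# To make the category/subcategory relationship concrete, here is a minimal standard-library sketch of conditional sampling: draw a category first, then draw a subcategory only from the values registered for that category. This is an illustration, not Data Designer's implementation. The helper `sample_category_pair` and the trimmed `SUBCATEGORIES` mapping are hypothetical, and a uniform choice is assumed at both levels.
#

```python
# %%
import random

# Hypothetical, trimmed-down copy of the mapping used above.
SUBCATEGORIES = {
    "Electronics": ["Smartphones", "Laptops"],
    "Books": ["Fiction", "Non-Fiction"],
}


def sample_category_pair(rng: random.Random) -> tuple[str, str]:
    """Pick a category first, then a subcategory that is valid for it."""
    category = rng.choice(sorted(SUBCATEGORIES))
    subcategory = rng.choice(SUBCATEGORIES[category])
    return category, subcategory


rng = random.Random(0)  # Seeded so repeated runs give the same draws.
pairs = [sample_category_pair(rng) for _ in range(5)]
for category, subcategory in pairs:
    print(f"{category} -> {subcategory}")
```
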
# %% [markdown]
# Next, let's add samplers to generate data related to the customer and their review.
#

# %%
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(age_range=[18, 70], locale="en_US"),
    )
)

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="number_of_stars",
        sampler_type=dd.SamplerType.UNIFORM,
        params=dd.UniformSamplerParams(low=1, high=5),
        convert_to="int",  # Convert the sampled float to an integer.
    )
)

config_builder.add_column(
    dd.SamplerColumnConfig(
        name="review_style",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1],
        ),
    )
)

data_designer.validate(config_builder)

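# %% [markdown]
# The `weights=[1, 2, 2, 1]` passed to `review_style` read most naturally as relative weights rather than probabilities. Assuming they are normalized by their sum (a common convention, but an assumption about Data Designer here), the sketch below shows the implied probabilities and a seeded draw using only Python's standard library. It does not call the Data Designer sampler.
#

```python
# %%
import random
from collections import Counter

values = ["rambling", "brief", "detailed", "structured with bullet points"]
weights = [1, 2, 2, 1]

# Normalizing by the total turns relative weights into probabilities.
total = sum(weights)
probabilities = {value: weight / total for value, weight in zip(values, weights)}
for value, probability in probabilities.items():
    print(f"{value}: {probability:.3f}")

# random.choices accepts relative weights directly; a seeded draw shows the skew.
rng = random.Random(0)
counts = Counter(rng.choices(values, weights=weights, k=6000))
print(counts.most_common())
```
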
# %% [markdown]
# ## 🦜 LLM-generated columns
#
# - The real power of Data Designer comes from leveraging LLMs to generate text, code, and structured data.
#
# - When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.
#
# - As we see below, nested JSON fields can be accessed using dot notation.
#

# %%
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="product_name",
        prompt=(
            "You are a helpful assistant that generates product names. DO NOT add quotes around the product name.\n\n"
            "Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
            "on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. Respond with only the product name, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)

config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
            "Respond with only the review, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)

data_designer.validate(config_builder)

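# %% [markdown]
# To see what dot notation such as `{{ customer.first_name }}` does to a prompt, the sketch below resolves dotted placeholders against a nested record. It is a deliberately simplified standard-library stand-in, not Jinja2 itself (which also supports filters, conditionals, and more), and not Data Designer's rendering code. The `resolve` and `render` helpers and the sample record are hypothetical.
#

```python
# %%
import re
from typing import Any

# Matches "{{ name }}" or "{{ name.attr.more }}" placeholders only.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(record: dict[str, Any], dotted_path: str) -> Any:
    """Walk nested dicts one key at a time, e.g. 'customer.city'."""
    value: Any = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def render(template: str, record: dict[str, Any]) -> str:
    """Substitute every placeholder with the value looked up in the record."""
    return PLACEHOLDER.sub(lambda match: str(resolve(record, match.group(1))), template)


# Hypothetical record; the field names mirror the prompt above.
record = {
    "customer": {"first_name": "Ada", "city": "Austin", "state": "Texas"},
    "number_of_stars": 4,
}
prompt = render(
    "You are {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
    "You gave {{ number_of_stars }} stars.",
    record,
)
print(prompt)
```
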
# %% [markdown]
# ### 🔁 Iteration is key – preview the dataset!
#
# 1. Use the `preview` method to generate a sample of records quickly.
#
# 2. Inspect the results for quality and format issues.
#
# 3. Adjust column configurations, prompts, or parameters as needed.
#
# 4. Re-run the preview until satisfied.
#

# %%
preview = data_designer.preview(config_builder, num_records=2)

# %%
# Run this cell multiple times to cycle through the 2 preview records.
preview.display_sample_record()

# %%
# The preview dataset is available as a pandas DataFrame.
preview.dataset

# %% [markdown]
# ### 📊 Analyze the generated data
#
# - Data Designer automatically generates a basic statistical analysis of the generated data.
#
# - This analysis is available via the `analysis` property of generation result objects.
#

# %%
# Print the analysis as a table.
preview.analysis.to_report()

# %% [markdown]
# ### 🆙 Scale up!
#
# - Happy with your preview data?
#
# - Use the `create` method to submit larger Data Designer generation jobs.
#

# %%
results = data_designer.create(config_builder, num_records=10, dataset_name="tutorial-1")

# %%
# Load the generated dataset as a pandas DataFrame.
dataset = results.load_dataset()

dataset.head()

# %%
# Load the analysis results into memory.
analysis = results.load_analysis()

analysis.to_report()

# %% [markdown]
# ## ⏭️ Next Steps
#
# Now that you've seen the basics of Data Designer, check out the following notebooks to learn more about:
#
# - [Structured outputs, jinja expressions, and conditional generation](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/2-structured-outputs-and-jinja-expressions/)
#
# - [Seeding synthetic data generation with an external dataset](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/3-seeding-with-a-dataset/)
#
# - [Providing images as context](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/4-providing-images-as-context/)
#
# - [Generating images](https://nvidia-nemo.github.io/DataDesigner/latest/notebooks/5-generating-images/)
#