DataDesigner/docs/notebook_source
Nabin Mulepati a9af365e8e
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)

Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: remove HopChain example from skip_when plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: replace HopChain example with generic product review example

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: add open questions on skip sentinel value and row filtering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* plan: major revision — SkipConfig model, sync engine support, decouple propagation

- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
  bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
  _fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
  and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan

Made-with: Cursor

* updates

* plan: document get_required_columns for skip propagation

- Explain why propagation must not use get_upstream_columns() once
  skip.when adds DAG edges; add _required_columns and
  get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
  DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case

Refs #479

Made-with: Cursor

* plan: centralize __skipped__ handling in skip_provenance

- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
  FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification

Refs #479

Made-with: Cursor

* plan: align doc title with SkipConfig / skip.when

Drop legacy skip_when naming in headings and #362 cross-reference.

Refs #479

Made-with: Cursor

* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization

- SkipConfig._validate_when_syntax now checks find_undeclared_variables
  is non-empty, rejecting expressions without {{ }} delimiters that
  would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
  engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
  runs deserialize_json_values once and passes to both skip eval and
  generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
  and verification section accordingly

Refs #479

Made-with: Cursor

* plan: add get_side_effect_columns accessor to execution graph spec

Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.

The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.

Refs #479

Made-with: Cursor

* add skip.when conditional column generation

Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).

- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
  validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
  with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
  sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
  topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
  sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance

Made-with: Cursor

* fix review findings for skip.when implementation

- Add skip evaluation to _fan_out_with_async (was missing, causing
  skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
  full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
  consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
  setters to ExecutionGraph for symmetry with existing API

Made-with: Cursor

* add conditional generation with skip recipe and refactor skip helpers

Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.

Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.

Made-with: Cursor

* updates

* fixes

* remove recipe > inject conditional gen into existing tutorial

* regen colab notebooks

* fix: handle missing execution graph in _column_can_skip

Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.

Made-with: Cursor

* parametrize some tests

* public before private

* slight refactor for readability

* parametrize some tests

* minor fixes

* rename internal skip tracker key name

* clarify intent in comment

* when skipped, _run_cell should return the skipped value even though the consumer doesn't currently care about it

* remove inline import

* minor refactor for clarity

* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch

Two bugs in the sequential engine's _run_full_column_generator:

1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
   three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
   fallthrough), breaking propagate_skip for downstream columns when an
   independent FULL_COLUMN generator ran between skip-setting and
   propagating columns.

2. _column_can_skip returned True for allow_resize=True columns via
   propagation, causing the skip-aware merge path to raise on the 1:1
   row-count check for 1:N generators.

- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper

Made-with: Cursor

* address review comments on skip.when PR (#502)

- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
  into should_skip_column_for_record() in skip_evaluator.py so both sync and
  async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
  (e.g. review__trace on the review column) — previously only checked
  self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
  propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
  propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
  production consumer)

Made-with: Cursor

* address cr feedback

* Fix issue with full-column generation messing up order of skipped rows

* add skip conditional generation edge case tests

- test_skip_evaluator: parametrized should_skip_column_for_record covering
  propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
  should_propagate_skip, get_required_columns, get_side_effect_columns,
  resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
  two independent skip gates, custom skip.value, row count preservation

Made-with: Cursor

* fix: make expression jinja validator private

Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.

Made-with: Cursor

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:31:50 -06:00
1-the-basics.py feat: add skip.when conditional column generation (#502) 2026-04-15 09:31:50 -06:00
2-structured-outputs-and-jinja-expressions.py feat: add skip.when conditional column generation (#502) 2026-04-15 09:31:50 -06:00
3-seeding-with-a-dataset.py feat: add image generation support with multi-modal context (#317) 2026-02-12 14:00:28 -07:00
4-providing-images-as-context.py chore: simplify tutorial 4 image dataset and use default model config (#403) 2026-03-13 12:26:41 -06:00
5-generating-images.py docs: add image generation documentation and image-to-image editing tutorial (#319) 2026-02-12 14:38:52 -07:00
6-editing-images-with-image-context.py fix: repair notebook CI (dead model, missing API key, pyarrow type bug) (#348) 2026-02-23 13:27:47 -03:00
_pyproject.toml chore: moving notebooks to jupytext and cleaning up workflows (#91) 2025-12-03 17:29:07 -03:00
_README.md feat: add skip.when conditional column generation (#502) 2026-04-15 09:31:50 -06:00
README.md fix: small typo on text file (#95) 2025-12-03 18:31:35 -03:00

📓 Notebooks in .py Format

This folder contains all of our tutorial notebooks in .py format. To convert them to Jupyter notebooks, run

make convert-execute-notebooks

from the root of the repository. This both converts and executes all of the notebooks, so make sure you have gone through our Quick Start and have your API keys set first. The command creates a new folder, docs/notebooks, that includes README.md and pyproject.toml files.

Alternatively, you can use Jupytext directly to convert the files without executing them:

uv run --group notebooks --group docs jupytext --to ipynb *.py
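If you would rather convert the files one at a time (for example, to leave a notebook out), the same invocation can be scripted. The following is a minimal sketch and not part of the repository: the helper name `build_convert_commands` is hypothetical, and it only assembles the Jupytext command shown above for each .py file without running anything.

```python
from pathlib import Path


def build_convert_commands(folder="."):
    """Build one jupytext command per .py file in `folder` (hypothetical helper)."""
    base = ["uv", "run", "--group", "notebooks", "--group", "docs",
            "jupytext", "--to", "ipynb"]
    # Sort by file name so the commands come out in a stable order.
    return [base + [path.name] for path in sorted(Path(folder).glob("*.py"))]


# Print the commands instead of executing them. To actually convert, pass
# each list to subprocess.run(cmd, check=True, cwd=folder).
for cmd in build_convert_commands():
    print(" ".join(cmd))
```

Building the commands as lists rather than a single shell string avoids quoting problems if a notebook name ever contains spaces.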

🔄 Converting Jupyter notebooks to .py

If you want to contribute your own notebook, use the following command to generate a .py file in the same format as the ones in this folder:

uv run jupytext --to py [notebook-name].ipynb -o [notebook-name].py