2026-01-28 13:47:34 +00:00
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2025-10-27 18:29:12 +00:00
# SPDX-License-Identifier: Apache-2.0
from unittest . mock import Mock , patch
2026-02-19 20:21:42 +00:00
import pytest
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
from data_designer . config . base import SkipConfig
2025-11-11 20:36:52 +00:00
from data_designer . config . column_configs import (
2025-10-27 18:29:12 +00:00
ExpressionColumnConfig ,
LLMCodeColumnConfig ,
LLMJudgeColumnConfig ,
LLMTextColumnConfig ,
SamplerColumnConfig ,
Score ,
2026-05-18 23:15:47 +00:00
SeedDatasetColumnConfig ,
2025-10-27 18:29:12 +00:00
ValidationColumnConfig ,
)
from data_designer . config . models import ImageContext , ModalityDataType
2025-12-10 23:59:30 +00:00
from data_designer . config . processors import (
DropColumnsProcessorConfig ,
SchemaTransformProcessorConfig ,
)
2025-10-27 18:29:12 +00:00
from data_designer . config . utils . code_lang import CodeLang
2026-01-08 17:48:14 +00:00
from data_designer . config . validator_params import CodeValidatorParams
from data_designer . engine . validation import (
2025-10-27 18:29:12 +00:00
Violation ,
ViolationLevel ,
ViolationType ,
rich_print_violations ,
validate_code_validation ,
validate_columns_not_all_dropped ,
validate_data_designer_config ,
2026-02-19 20:21:42 +00:00
validate_drop_columns_processor ,
2025-10-27 18:29:12 +00:00
validate_expression_references ,
validate_prompt_templates ,
2025-12-10 23:59:30 +00:00
validate_schema_transform_processor ,
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
validate_skip_references ,
2025-10-27 18:29:12 +00:00
)
STUB_MODEL_ALIAS = " stub-alias "
VALID_COLUMNS = [
SamplerColumnConfig (
name = " random_number " ,
sampler_type = " uniform " ,
params = { " low " : 0 , " high " : 10 } ,
) ,
LLMTextColumnConfig (
name = " valid_reference " ,
prompt = " Why is {{ random_number }} your favorite number? " ,
model_alias = STUB_MODEL_ALIAS ,
) ,
LLMCodeColumnConfig (
name = " code_column_python " ,
prompt = " Generate some python about {{ valid_reference }}. " ,
code_lang = " python " ,
model_alias = STUB_MODEL_ALIAS ,
) ,
]
INVALID_COLUMNS = [
LLMTextColumnConfig (
name = " text_no_references " ,
prompt = " Generate a name for the person " ,
model_alias = STUB_MODEL_ALIAS ,
) ,
LLMTextColumnConfig (
name = " text_invalid_reference " ,
prompt = " Generate a name for the person: {{ this_column_does_not_exist }} " ,
model_alias = STUB_MODEL_ALIAS ,
) ,
LLMJudgeColumnConfig (
name = " judge_no_references " ,
prompt = " Judge the name for the person. " ,
scores = [ Mock ( spec = Score ) ] ,
model_alias = STUB_MODEL_ALIAS ,
) ,
LLMJudgeColumnConfig (
name = " judge_invalid_reference " ,
prompt = " Judge the name for the person: {{ this_column_does_not_exist }} " ,
scores = [ Mock ( spec = Score ) ] ,
model_alias = STUB_MODEL_ALIAS ,
) ,
ValidationColumnConfig (
name = " code_validation_python " ,
target_columns = [ " code_column_missing " ] ,
validator_type = " code " ,
validator_params = CodeValidatorParams ( code_lang = CodeLang . SQL_ANSI ) ,
) ,
ValidationColumnConfig (
name = " code_validation_ansi " ,
target_columns = [ " code_column_python " ] ,
validator_type = " code " ,
validator_params = CodeValidatorParams ( code_lang = CodeLang . SQL_ANSI ) ,
) ,
ValidationColumnConfig (
name = " code_validation_not_code " ,
target_columns = [ " text_no_references " ] ,
validator_type = " code " ,
validator_params = CodeValidatorParams ( code_lang = CodeLang . PYTHON ) ,
) ,
]
COLUMNS = VALID_COLUMNS + INVALID_COLUMNS
2025-11-04 23:09:55 +00:00
PROCESSOR_CONFIGS = [
DropColumnsProcessorConfig (
2025-12-10 23:59:30 +00:00
name = " drop_columns_processor " ,
2025-11-04 23:09:55 +00:00
column_names = [ " inexistent_column " ] ,
2025-12-10 23:59:30 +00:00
) ,
SchemaTransformProcessorConfig (
name = " schema_transform_processor_invalid_reference " ,
template = { " text " : " {{ invalid_reference }} " } ,
) ,
2025-11-04 23:09:55 +00:00
]
2025-10-27 18:29:12 +00:00
ALLOWED_REFERENCE = [ c . name for c in COLUMNS ]
2026-01-08 17:48:14 +00:00
@patch ( " data_designer.engine.validation.validate_prompt_templates " )
@patch ( " data_designer.engine.validation.validate_code_validation " )
@patch ( " data_designer.engine.validation.validate_expression_references " )
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
@patch ( " data_designer.engine.validation.validate_skip_references " )
2026-01-08 17:48:14 +00:00
@patch ( " data_designer.engine.validation.validate_columns_not_all_dropped " )
@patch ( " data_designer.engine.validation.validate_drop_columns_processor " )
@patch ( " data_designer.engine.validation.validate_schema_transform_processor " )
2025-10-27 18:29:12 +00:00
def test_validate_data_designer_config (
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
mock_validate_schema_transform_processor : Mock ,
mock_validate_drop_columns_processor : Mock ,
mock_validate_columns_not_all_dropped : Mock ,
mock_validate_skip_references : Mock ,
mock_validate_expression_references : Mock ,
mock_validate_code_validation : Mock ,
mock_validate_prompt_templates : Mock ,
) - > None :
2025-10-27 18:29:12 +00:00
mock_validate_columns_not_all_dropped . return_value = [
Violation (
column = " test_column " ,
type = ViolationType . ALL_COLUMNS_DROPPED ,
message = " test error message " ,
level = ViolationLevel . ERROR ,
)
]
mock_validate_expression_references . return_value = [
Violation (
column = " test_column " ,
type = ViolationType . EXPRESSION_REFERENCE_MISSING ,
message = " test error message " ,
level = ViolationLevel . ERROR ,
)
]
mock_validate_code_validation . return_value = [
Violation (
column = " test_column " ,
type = ViolationType . CODE_COLUMN_MISSING ,
message = " test error message " ,
level = ViolationLevel . ERROR ,
)
]
mock_validate_prompt_templates . return_value = [
Violation (
column = " test_column " ,
type = ViolationType . PROMPT_WITHOUT_REFERENCES ,
message = " test error message " ,
level = ViolationLevel . ERROR ,
)
]
2025-11-04 23:09:55 +00:00
mock_validate_drop_columns_processor . return_value = [
Violation (
column = " test_column " ,
type = ViolationType . INVALID_COLUMN ,
message = " test error message " ,
level = ViolationLevel . ERROR ,
)
]
2025-12-10 23:59:30 +00:00
mock_validate_schema_transform_processor . return_value = [
Violation (
column = " text " ,
type = ViolationType . INVALID_REFERENCE ,
message = " Ancillary dataset processor attempts to reference columns ' invalid_reference ' in the template for ' text ' , but the columns are not defined in the dataset. " ,
level = ViolationLevel . ERROR ,
)
]
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
mock_validate_skip_references . return_value = [
Violation (
column = " test_column " ,
type = ViolationType . SKIP_REFERENCE_MISSING ,
message = " test error message " ,
level = ViolationLevel . ERROR ,
)
]
2025-12-10 23:59:30 +00:00
2025-11-04 23:09:55 +00:00
violations = validate_data_designer_config ( COLUMNS , PROCESSOR_CONFIGS , ALLOWED_REFERENCE )
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
assert len ( violations ) == 7
2025-10-27 18:29:12 +00:00
mock_validate_columns_not_all_dropped . assert_called_once ( )
mock_validate_expression_references . assert_called_once ( )
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
mock_validate_skip_references . assert_called_once ( )
2025-10-27 18:29:12 +00:00
mock_validate_code_validation . assert_called_once ( )
mock_validate_prompt_templates . assert_called_once ( )
2025-11-04 23:09:55 +00:00
mock_validate_drop_columns_processor . assert_called_once ( )
2025-12-10 23:59:30 +00:00
mock_validate_schema_transform_processor . assert_called_once ( )
2025-10-27 18:29:12 +00:00
def test_validate_prompt_templates ( ) :
violations = validate_prompt_templates ( COLUMNS , ALLOWED_REFERENCE )
assert len ( violations ) == 4
assert violations [ 0 ] . type == ViolationType . PROMPT_WITHOUT_REFERENCES
assert violations [ 1 ] . type == ViolationType . INVALID_REFERENCE
assert violations [ 2 ] . type == ViolationType . PROMPT_WITHOUT_REFERENCES
assert violations [ 3 ] . type == ViolationType . INVALID_REFERENCE
def test_validate_code_validation ( ) :
violations = validate_code_validation ( COLUMNS )
assert len ( violations ) == 3
assert violations [ 0 ] . type == ViolationType . CODE_COLUMN_MISSING
assert violations [ 1 ] . type == ViolationType . CODE_LANG_MISMATCH
assert violations [ 2 ] . type == ViolationType . CODE_COLUMN_NOT_CODE
def test_validate_detect_f_string_syntax ( ) :
columns = VALID_COLUMNS
columns . append (
LLMTextColumnConfig (
name = " f_string_ref " ,
prompt = " Why is {random_number} your favorite number? {{ valid_reference }} " ,
model_alias = STUB_MODEL_ALIAS ,
)
)
violations = validate_prompt_templates ( columns , [ c . name for c in columns ] )
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . F_STRING_SYNTAX
assert violations [ 0 ] . level == ViolationLevel . WARNING
def test_validate_column_config_with_multi_modal_context ( ) :
column = LLMTextColumnConfig (
name = " image_description " ,
prompt = " Describe the image in no less that 10 sentences. " ,
model_alias = STUB_MODEL_ALIAS ,
multi_modal_context = [ ImageContext ( column_name = " image_url " , data_type = ModalityDataType . URL ) ] ,
)
violations = validate_prompt_templates ( [ column ] , [ column . name ] )
# there should be no violations because the prompt does not reference any columns and it's not necessary
# when multi modal context is provided
assert len ( violations ) == 0
def test_validate_columns_not_all_dropped ( ) :
violations = validate_columns_not_all_dropped (
[
SamplerColumnConfig (
name = " random_number " ,
sampler_type = " uniform " ,
params = { " low " : 0 , " high " : 10 } ,
drop = True ,
) ,
LLMTextColumnConfig (
name = " valid_reference " ,
prompt = " Why is {{ random_number }} your favorite number? " ,
model_alias = STUB_MODEL_ALIAS ,
drop = True ,
) ,
]
)
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . ALL_COLUMNS_DROPPED
2026-05-18 23:15:47 +00:00
def test_validate_columns_not_all_dropped_allows_seeded_processor_only_config ( ) :
violations = validate_columns_not_all_dropped (
[ SeedDatasetColumnConfig ( name = " seed_text " ) ] ,
processor_configs = [
SchemaTransformProcessorConfig ( name = " format " , template = { " text " : " {{ seed_text }} " } ) ,
] ,
)
assert violations == [ ]
def test_validate_columns_not_all_dropped_rejects_seeded_processor_only_config_with_no_output_columns ( ) :
violations = validate_columns_not_all_dropped (
[ SeedDatasetColumnConfig ( name = " seed_text " ) ] ,
processor_configs = [
DropColumnsProcessorConfig ( name = " drop_seed " , column_names = [ " seed_text " ] ) ,
] ,
)
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . ALL_COLUMNS_DROPPED
def test_validate_columns_not_all_dropped_allows_generated_columns_dropped_by_processors ( ) :
violations = validate_columns_not_all_dropped (
[
LLMTextColumnConfig ( name = " question " , prompt = " Generate a question. " , model_alias = STUB_MODEL_ALIAS ) ,
LLMTextColumnConfig ( name = " answer " , prompt = " Answer {{ question }}. " , model_alias = STUB_MODEL_ALIAS ) ,
] ,
processor_configs = [
DropColumnsProcessorConfig ( name = " drop_raw " , column_names = [ " question " , " answer " ] ) ,
SchemaTransformProcessorConfig ( name = " format " , template = { " messages " : " {{ question }} {{ answer }} " } ) ,
] ,
)
assert violations == [ ]
def test_validate_columns_not_all_dropped_still_rejects_seed_only_config ( ) :
violations = validate_columns_not_all_dropped ( [ SeedDatasetColumnConfig ( name = " seed_text " ) ] )
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . ALL_COLUMNS_DROPPED
2025-10-27 18:29:12 +00:00
def test_validate_expression_references ( ) :
violations = validate_expression_references (
[
ExpressionColumnConfig (
name = " expression_column " ,
expr = " {{ random_number }} " ,
dtype = " int " ,
) ,
] ,
allowed_references = [ " some_other_column " ] ,
)
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . EXPRESSION_REFERENCE_MISSING
2025-12-10 23:59:30 +00:00
def test_validate_schema_transform_processor ( ) :
violations = validate_schema_transform_processor ( COLUMNS , PROCESSOR_CONFIGS )
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . INVALID_REFERENCE
assert violations [ 0 ] . column is None
assert (
violations [ 0 ] . message
== " Ancillary dataset processor attempts to reference columns ' invalid_reference ' in the template for ' text ' , but the columns are not defined in the dataset. "
)
assert violations [ 0 ] . level == ViolationLevel . ERROR
2026-02-19 20:21:42 +00:00
@pytest.mark.parametrize (
" extract_reasoning, expected_violations " ,
[
( True , 0 ) ,
( False , 1 ) ,
] ,
)
def test_validate_drop_columns_processor_reasoning_column ( extract_reasoning , expected_violations ) :
columns = [
LLMTextColumnConfig (
name = " answer " ,
prompt = " Answer the question. " ,
model_alias = STUB_MODEL_ALIAS ,
extract_reasoning_content = extract_reasoning ,
) ,
]
processor_configs = [
DropColumnsProcessorConfig (
name = " drop_reasoning " ,
column_names = [ " answer__reasoning_content " ] ,
) ,
]
violations = validate_drop_columns_processor ( columns , processor_configs )
assert len ( violations ) == expected_violations
@pytest.mark.parametrize (
" pattern, expected_violations, expected_level " ,
[
( " *__reasoning_content " , 0 , None ) ,
( " zzz_* " , 1 , ViolationLevel . WARNING ) ,
] ,
)
def test_validate_drop_columns_processor_glob ( pattern , expected_violations , expected_level ) :
columns = [
LLMTextColumnConfig (
name = " answer " ,
prompt = " Answer the question. " ,
model_alias = STUB_MODEL_ALIAS ,
extract_reasoning_content = True ,
) ,
]
processor_configs = [
DropColumnsProcessorConfig ( name = " drop_glob " , column_names = [ pattern ] ) ,
]
violations = validate_drop_columns_processor ( columns , processor_configs )
assert len ( violations ) == expected_violations
if expected_level :
assert violations [ 0 ] . level == expected_level
2026-01-08 17:48:14 +00:00
@patch ( " data_designer.engine.validation.Console.print " )
2025-10-27 18:29:12 +00:00
def test_rich_print_violations ( mock_console_print ) :
rich_print_violations ( [ ] )
mock_console_print . assert_not_called ( )
rich_print_violations (
[
Violation (
column = " test_column " ,
type = ViolationType . EXPRESSION_REFERENCE_MISSING ,
message = " test error message " ,
level = ViolationLevel . ERROR ,
)
]
)
mock_console_print . assert_called_once ( )
feat: add skip.when conditional column generation (#502)
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:31:50 +00:00
def test_validate_skip_references_missing_column ( ) - > None :
columns = [
LLMTextColumnConfig (
name = " with_skip " ,
prompt = " test {{ real_col }} " ,
model_alias = STUB_MODEL_ALIAS ,
skip = SkipConfig ( when = " {{ ghost }} " ) ,
) ,
]
violations = validate_skip_references ( columns , allowed_references = [ " real_col " ] )
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . SKIP_REFERENCE_MISSING
assert violations [ 0 ] . column == " with_skip "
def test_validate_skip_references_valid ( ) - > None :
columns = [
LLMTextColumnConfig (
name = " with_skip " ,
prompt = " test {{ gate }} " ,
model_alias = STUB_MODEL_ALIAS ,
skip = SkipConfig ( when = " {{ gate == 0 }} " ) ,
) ,
]
violations = validate_skip_references ( columns , allowed_references = [ " gate " , " with_skip " ] )
assert len ( violations ) == 0
def test_validate_skip_on_sampler_seed ( ) - > None :
col = SamplerColumnConfig . model_construct (
name = " sampler_with_skip " ,
column_type = " sampler " ,
sampler_type = " uniform " ,
params = { " low " : 0 , " high " : 10 } ,
skip = SkipConfig ( when = " {{ y }} " ) ,
drop = False ,
allow_resize = False ,
propagate_skip = True ,
)
violations = validate_skip_references ( [ col ] , allowed_references = [ " y " ] )
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . SKIP_ON_SAMPLER_SEED
assert violations [ 0 ] . column == " sampler_with_skip "
def test_validate_skip_with_allow_resize ( ) - > None :
col = LLMTextColumnConfig . model_construct (
name = " with_skip " ,
column_type = " llm-text " ,
prompt = " test {{ gate }} " ,
model_alias = STUB_MODEL_ALIAS ,
skip = SkipConfig ( when = " {{ gate == 0 }} " ) ,
allow_resize = True ,
drop = False ,
propagate_skip = True ,
)
violations = validate_skip_references ( [ col ] , allowed_references = [ " gate " ] )
assert len ( violations ) == 1
assert violations [ 0 ] . type == ViolationType . SKIP_WITH_ALLOW_RESIZE
assert violations [ 0 ] . column == " with_skip "