* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* reanme internla skip tracker key name
* clarify intent in comment
* when skipped _run_cell should return skipped value even though the consumer doesn't currenlty care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full column generating messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: simplify tutorial 4 image dataset and use default model config
Switch from the large ColPali dataset (52 GB) to rokmr/pets (~23 MB)
for faster downloads in the vision tutorial. Use the default
nvidia-vision model alias instead of a custom ModelConfig block.
* regen colab notebooks
* fix: repair notebook CI by replacing dead vision model and adding missing API key
- Replace `meta/llama-4-scout-17b-16e-instruct` (no longer serving on
build.nvidia.com) with `nvidia/nemotron-nano-12b-v2-vl` (project default)
in tutorial notebook 4
- Add `OPENROUTER_API_KEY` to the `build-notebooks` workflow so notebooks
5 and 6 (which use OpenRouter for image generation) can authenticate
- Regenerate colab notebooks to reflect the model change
* fix: handle pyarrow list types in notebook 6 display_image
When image columns are loaded from parquet with pyarrow backend,
list values are pyarrow ListScalars, not Python lists. The
isinstance(x, list) check fails, causing the whole ListScalar to be
treated as a single path string (producing filenames ending in
`png')]`). Use isinstance(x, str) instead to correctly handle any
iterable type.
- Convert notebook 3 from string-based columns to class specs (dd.SamplerColumnConfig, etc.)
- Fix grammar: "is the main object is responsible" → "is the main object responsible"
- Remove stray "A" at end of URL in notebook 2
- Remove empty markdown cell in notebook 4
- Add missing data_designer.validate() call in notebook 4
- Regenerate colab notebooks from source
* docs: add deployment, performance tuning guides and streamline getting started
- Add deployment-options.md: Library vs. Microservice decision guide
- Add inference-architecture.md: Separation of concerns with LLM servers
- Add performance-tuning.md: Concurrency and batching optimization guide
- Streamline index.md: Merge installation, add quick example, simplify
- Remove quick-start.md: Content merged into welcome page
- Remove installation.md: Content merged into welcome page
- Update model docs: Add concurrency control sections and cross-references
- Update mkdocs.yml: Add new Architecture section to navigation
* docs: add tasteful emojis to new documentation pages
* docs: consolidate redundant concurrency and troubleshooting content
- Remove duplicate max_parallel_requests tables from model-configs.md and inference-parameters.md
- Remove duplicate Concurrency Control section from model-configs.md
- Simplify Concurrency Control in inference-parameters.md to link to performance-tuning.md
- Remove Troubleshooting section from inference-architecture.md (covered in performance-tuning.md)
- performance-tuning.md is now the authoritative source for tuning guidance
* Simplified doc additions
* Switched default model to nemotron 3 nano
* Addressed feedback
* Added first blog draft
* Add generation type to ModelConfig
* pass tests
* added generate_text_embeddings
* tests
* remove sensitive=True old artifact no longer needed
* Slight refactor
* slight refactor
* Added embedding generator
* chunk_separator -> chunk_pattern
* update tests
* rename for consistency
* Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters
* Remove purpose from consolidated kwargs
* WithModelConfiguration.inference_parameters should should be typed with BaseInferenceParameters
* Type as WithModelGeneration
* Add image generation modality
* update return type for generate_kwargs
* make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters
* remove regex based chunking from embedding generator
* Remove image generation for now
* more tests and updates
* column_type_is_llm_generated -> column_type_is_model_generated
* change set to list: fix flaky tests
* CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type
* Update docs
* fix deprecation warning originating from cli model settings
* update display of inference parameters in cli list
* save prog on inference parameter
* updates for the ocnfig builder
* update cli readme
* update cli for inference parmeters
* update inference parameter names
* flip order of vars
* WithCompletion -> WithChatCompletion
* specify InferenceParamsT
* Update columns.md with EmbeddingColumnConfig info
* make generation_type a descriminator field in inference params. add configuration support for max_parallel_requests and timeout
* DRY out some stuff in field.py
* Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency
* Add nvidia-embedding and openai-embedding to default model configs
* Fix typo in docs
* Make generate collab notebooks
* fine-tune -> adjust
* Add example notebook showing how to use image contexts
* change 101 -> tutorial
* update _README.md with info on the new tutorial
* add reference in mkdocs.yml
* simplify vlm tutorial
* update num_records on tutorials. Update .gitignore
* update readme info
* add models module to code reference
* fix links to generated ipynb
* change vlm in example tutorial to llama4-scout
* adding basic jupytext structure
Co-authored-by: Johnny Greco <jogreco@nvidia.com>
* few fixes
* first test for ci
* adding error intentionally to check workflow behavior
* test calling from other workflows
* typo
* trying as job instead
* couple of fixes
* checking path
* trying to fix path
* wrapping up
---------
Co-authored-by: Johnny Greco <jogreco@nvidia.com>