* feat: add audio and video context
Add audio/video context config models and canonical media helpers.
Translate canonical media blocks for OpenAI-compatible clients while preserving URL media as URLs. Reject unsupported audio/video blocks in the Anthropic adapter.
Refs #671
* fix: harden media context review gaps
Preserve extensionless HTTP(S) audio and video URLs as URL media, reject local path-looking audio/video context values, and reject provider-specific audio/video blocks in the Anthropic adapter.
Refs #671
* test: add audio video context smoke notebook
Add a Jupytext source notebook and generated Colab artifact that exercise audio/video context URL, base64, local path rejection, OpenAI-compatible payload translation, and Anthropic unsupported-media handling.
Refs #671
* test: make media context notebook end to end
Rewrite the audio/video smoke notebook to run a full Data Designer preview against a local OpenAI-compatible HTTP server. Assert the generated dataset, captured endpoint payload, URL/base64 translation, and local path rejection through the interface pipeline.
Refs #671
* test: remove media context notebook from docs
Move the generated audio/video context E2E notebook out of the PR docs surface and keep it locally under the main checkout's .scratch directory.
Refs #671
* harden multimodal media context handling
* address media context review notes
Remove unused URL-specific media helpers, share the base64 data URI parser in Anthropic translation, align AudioContext validation messaging, and update config docs for audio/video contexts.
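For reviewers, a minimal sketch of what a shared base64 data URI parser could look like. The names `DataURI` and `parse_base64_data_uri` are hypothetical and used only for illustration; the helper in this change may differ in name, signature, and error handling.

```python
import base64
import binascii
from typing import NamedTuple, Optional


class DataURI(NamedTuple):
    """Hypothetical container for a parsed data URI."""
    mime_type: str
    data: bytes


def parse_base64_data_uri(value: str) -> Optional[DataURI]:
    """Parse `data:<media type>;base64,<payload>`; return None for anything else."""
    prefix, sep, payload = value.partition(",")
    if not sep or not prefix.startswith("data:") or not prefix.endswith(";base64"):
        return None  # plain URLs and local paths fall through here
    mime_type = prefix[len("data:"):-len(";base64")]
    try:
        data = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return None  # payload is not valid base64
    return DataURI(mime_type, data)
```

Returning None instead of raising lets each adapter decide whether a non-data-URI value is a URL to pass through or an error.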
Refs #671
* docs: update media context guidance
* refactor: consolidate media helpers
* support local audio and video paths
* refactor: combine media path checks
* address media context review feedback
* remove openai media preflight
* sync generated colab notebooks
* align media local path autodetection
* chore: add __init__.py to engine namespace subpackages
Griffe (used by mkdocstrings) skips directories without __init__.py
when resolving module paths, which prevented the new plugins code
reference from rendering SeedReader, FileSystemSeedReader, and
Processor. Adding empty __init__.py files in engine/resources/,
engine/processing/, and engine/processing/processors/ aligns with
the convention already used in engine/mcp/, engine/models/, etc.
* docs: flesh out docstrings on plugin extension-point classes
Plugin authors now see meaningful descriptions for every field and
method on the bases rendered in the plugins code reference:
- Plugin and PluginType: class docstrings + Attributes tables for
fields and enum members; fix typo in config_qualified_name field
description.
- SingleColumnConfig: document allow_resize.
- ProcessorConfig: document processor_type discriminator.
- SeedSource: document seed_type discriminator.
- FileSystemSeedSource: add class docstring + Attributes table for
path / file_pattern / recursive.
- ColumnGeneratorFullColumn and ColumnGeneratorCellByCell: add
class docstrings explaining when to use each base, plus method
docstrings on the abstract generate() implementations.
* docs: graduate plugins out of experimental mode
Restructures plugin documentation around the now-stable extension
points (column generator, seed reader, processor) and treats plugins
as a first-class story for customizing Data Designer.
- Add code_reference/plugins.md: single-stop reference for the Plugin
object and the config + implementation base classes used by all
three plugin types.
- Add code_reference/generators.md: column generator implementation
base classes, separated from column configs.
- Surface SingleColumnConfig in code_reference/column_configs.md.
- Add plugins/implement.md ("Build Your Own"): per-type implementation
instructions across column generators, seed readers, and processors.
- Add plugins/processor.md: complete processor plugin package example.
- Rewrite plugins/overview.md: open with why plugins exist, drop the
internal-helpers note (PluginRegistry / PluginManager), and focus
the guide on what plugin builders need.
- Refresh plugins/available.md (Catalog) and
plugins/filesystem_seed_reader.md to match the new structure.
- Delete plugins/example.md (replaced by per-type guides).
- Reorder Code Reference nav alphabetically and add the new pages.
- Minor link / wording fixes in concepts/processors.md and
concepts/deployment-options.md.
* docs: simplify plugin docs structure
Replace the overview's how-to walkthrough and the per-type plugin
guides with a single Build Your Own page that covers all three
plugin types side-by-side. Add a dedicated Using Models in Plugins
guide and a seed_readers code reference, and trim the overview down
to what the plugin types are, how to use one, and how discovery
works.
- Rename plugins/implement.md to plugins/build_your_own.md.
- Delete plugins/filesystem_seed_reader.md and plugins/processor.md
(their content is now in build_your_own.md and the per-type code
references).
- Add plugins/models.md for model-backed column generator authoring.
- Add code_reference/seed_readers.md for seed reader implementation
base classes.
- Rewrite plugins/overview.md: shorter intro, type bullets link to
the relevant code reference, drop the multi-step "How do you
create plugins" walkthrough in favor of a single Build a Plugin
pointer, tighten Discovery troubleshooting.
- Refresh plugins/available.md (Available Plugins): point to the
DataDesignerPlugins catalog and explain how to request a community
listing.
- Update cross-page links in concepts/processors.md,
concepts/seed-datasets.md, recipes/plugin_development/markdown_seed_reader.md,
code_reference/plugins.md, and code_reference/generators.md to
match the new structure.
- Update mkdocs.yml nav: rename to Build Your Own, add Using Models,
add seed_readers code reference.
* docs: scroll wide tables horizontally instead of wrapping
Code-heavy reference tables (plugin bases, column generators, etc.)
were wrapping aggressively on narrow viewports, breaking long
identifiers across multiple lines. Switch the table container to
horizontal overflow and prevent code cells from wrapping so
identifiers stay readable.
* docs: address PR #603 review feedback
- Add an Implementation base section to code_reference/processors.md
rendering the engine-side Processor class. This justifies the
engine/processing/__init__.py files added earlier and gives
processor plugin authors an auto-rendered API reference, matching
the pattern used by code_reference/generators.md and seed_readers.md.
- build_your_own.md: replace the placeholder "x" emoji on the
IndexMultiplier example with the actual multiplication sign.
- build_your_own.md: drop the manual `re.compile + apply(lambda)`
pattern in the regex-filter processor in favor of the idiomatic
`Series.str.contains(..., regex=True)`.
- build_your_own.md: add a kernel-restart caveat after the editable
install instructions — PluginRegistry caches discovery on first
import, so notebooks need a fresh kernel to pick up freshly
installed plugins.
- build_your_own.md: state explicitly what `assert_valid_plugin`
checks (config base + plugin-type-appropriate impl base).
- code_reference/plugins.md: link out to the processors code
reference alongside generators and seed_readers.
* docs: split code reference by package
* docs: add interface code reference
* docs: add code reference overviews
* docs: refine code reference pages
* docs: improve code reference tables
* docs: correct reference docstrings
* docs: embed plugin catalog table
* docs: note plugin discovery restart caveat
* docs: explain generator base class choice
* docs: mention async cell generator examples
* docs: clarify plugin model usage
* docs: clarify plugin model aliases
* docs: address plugin review feedback
* docs: update available plugins page
* plan: add skip_when for conditional column generation (#479)
Adds implementation plan for a `skip_when` field on `SingleColumnConfig`
that enables conditional column generation. When the Jinja2 expression
evaluates truthy, the cell is set to None and the generator is skipped.
Skips auto-propagate through the DAG to downstream columns.
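A rough sketch of the intended semantics, under loud assumptions: a plain Python predicate stands in for the Jinja2 expression, and `generate_cell` plus its parameters are hypothetical names, not the planned API.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]


def generate_cell(
    record: Record,
    column: str,
    generator: Callable[[Record], Any],
    skip_when: Optional[Callable[[Record], bool]] = None,
    depends_on: tuple[str, ...] = (),
    skipped: Optional[set[str]] = None,
) -> set[str]:
    """Fill one cell, or set it to None when gated or when an upstream cell was skipped."""
    skipped = set() if skipped is None else skipped
    upstream_skipped = any(dep in skipped for dep in depends_on)
    gated = skip_when is not None and skip_when(record)
    if upstream_skipped or gated:
        record[column] = None  # the generator is never called
        skipped.add(column)
    else:
        record[column] = generator(record)
    return skipped


record = {"in_stock": 0}
skipped = generate_cell(
    record, "review", lambda r: "great product",
    skip_when=lambda r: r["in_stock"] == 0,
)
# "summary" has no gate of its own; it is skipped because "review" was.
skipped = generate_cell(
    record, "summary", lambda r: r["review"].upper(),
    depends_on=("review",), skipped=skipped,
)
```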
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: remove HopChain example from skip_when plan
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: replace HopChain example with generic product review example
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: add open questions on skip sentinel value and row filtering
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* plan: major revision — SkipConfig model, sync engine support, decouple propagation
- Introduce SkipConfig(when, value) as nested model on SingleColumnConfig
- Move propagate_skip to SingleColumnConfig as independent field, fixing
bug where columns with no SkipConfig couldn't participate in propagation
- Add full sync engine implementation (Steps 4a-4d) covering both
_fan_out_with_threads and _run_full_column_generator dispatch paths
- Add serialization boundary stripping for both DatasetBatchManager (sync)
and RowGroupBufferManager (async)
- Simplify architecture diagrams for readability
- Update all references, design decisions, verification plan
Made-with: Cursor
* updates
* plan: document get_required_columns for skip propagation
- Explain why propagation must not use get_upstream_columns() once
skip.when adds DAG edges; add _required_columns and
get_required_columns() to the execution graph plan
- Point async _run_cell at get_required_columns for parity with sync
- Clarify DropSkippedRowsProcessorConfig vs stripping __skipped__ for
DataFrames; tighten resolved-questions wording
- Extend DAG/graph verification with gating_col regression case
Refs #479
Made-with: Cursor
* plan: centralize __skipped__ handling in skip_provenance
- Document new skip_provenance.py (key constant, read/write/strip API)
- Point sync builder, async scheduler, and batch buffers at shared helpers
- Strip metadata before every DataFrame from buffer dicts, including
FULL_COLUMN active subsets
- Split §3 into skip_evaluator vs skip_provenance; extend verification
Refs #479
Made-with: Cursor
* plan: align doc title with SkipConfig / skip.when
Drop legacy skip_when naming in headings and #362 cross-reference.
Refs #479
Made-with: Cursor
* plan: address review — delimiter validation, centralized error handling, caller-owns-deserialization
- SkipConfig._validate_when_syntax now checks find_undeclared_variables
is non-empty, rejecting expressions without {{ }} delimiters that
would silently skip every row
- evaluate_skip_when centralizes try/except so both sync and async
engines get identical fail-safe behavior on eval errors
- evaluate_skip_when takes a single pre-deserialized record; caller
runs deserialize_json_values once and passes to both skip eval and
generator (no double deserialization, no redundant parameter)
- Update _should_skip_cell, async _run_cell, Files Modified table,
and verification section accordingly
Refs #479
Made-with: Cursor
* plan: add get_side_effect_columns accessor to execution graph spec
Document _side_effects_by_producer inverse map and
get_side_effect_columns() accessor on ExecutionGraph, needed by
_write_skip_to_record / apply_skip_to_record to clear __trace,
__reasoning_content, etc. on skip. Added to both Step 2b metadata
section and Files Modified table.
The __skipped__ leak into active_df (greptile's other P1) was already
fixed in 70463789 via strip_skip_metadata_from_records.
Refs #479
Made-with: Cursor
* add skip.when conditional column generation
Introduce SkipConfig on SingleColumnConfig to gate column generation
with a Jinja2 expression. Columns can be skipped by expression or by
upstream propagation (propagate_skip flag).
- SkipConfig: Pydantic model with config-time syntax/delimiter/variable
validation and cached column extraction from the Jinja2 AST
- skip_evaluator: runtime expression evaluation via NativeSandboxedEnvironment
with fail-safe error handling (skip on expected failures)
- skip_provenance: centralized __skipped__ record tracking shared by
sync builder, async scheduler, and buffer managers
- DAG/ExecutionGraph: skip.columns wired as dependency edges in both
topological sort and static execution graph
- Validation: validate_skip_references checks reference existence,
sampler/seed scope, and allow_resize conflicts
- Sync builder: cell-by-cell and full-column skip with merge-back
- Async scheduler: cell and batch skip with live-buffer provenance
Made-with: Cursor
* fix review findings for skip.when implementation
- Add skip evaluation to _fan_out_with_async (was missing, causing
skipped rows to still be sent to the LLM)
- Preserve __skipped__ provenance on non-skipped records after
full-column generation so multi-hop propagation works
- Use single live-buffer reference in _run_batch skip loop for
consistency with _run_cell
- Move Template import to TYPE_CHECKING and reorder import blocks
- Replace O(n²) sum() with itertools.chain in dag.py
- Add set_required_columns/set_propagate_skip/set_skip_config
setters to ExecutionGraph for symmetry with existing API
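The `sum()` replacement above follows a common pattern, sketched here with hypothetical function names rather than the actual dag.py code. Flattening with `sum(groups, [])` builds a new list on every addition, so the cost grows quadratically with the total number of items, while `itertools.chain.from_iterable` visits each item once.

```python
from itertools import chain


def flatten_with_sum(groups: list[list[str]]) -> list[str]:
    # Each `+` copies the accumulated list, giving O(n^2) work overall.
    return sum(groups, [])


def flatten_with_chain(groups: list[list[str]]) -> list[str]:
    # chain.from_iterable yields items lazily; list() copies each item once.
    return list(chain.from_iterable(groups))


groups = [["a", "b"], [], ["c"]]
same = flatten_with_sum(groups) == flatten_with_chain(groups)
```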
Made-with: Cursor
* add conditional generation with skip recipe and refactor skip helpers
Add a new recipe demonstrating skip.when patterns (expression gate,
propagation, opt-out) with a customer support ticket pipeline.
Also extract _should_skip_record in async_scheduler, remove the
redundant propagate_skip param from should_skip_by_propagation, and
pass a precomputed all_side_effects set through the DAG sort.
Made-with: Cursor
* updates
* fixes
* remove recipe > inject conditional gen into existing tutorial
* regen colab notebooks
* fix: handle missing execution graph in _column_can_skip
Return False when the graph has not been initialized instead of raising,
since skip logic cannot apply before generators are set up.
Made-with: Cursor
* parametrize some tests
* public before private
* slight refactor for readability
* parametrize some tests
* minor fixes
* rename internal skip tracker key name
* clarify intent in comment
* when skipped, _run_cell should return the skipped value even though the consumer doesn't currently care about it
* remove inline import
* minor refactor for clarity
* fix: preserve skip metadata across replace_buffer and exclude allow_resize from skip branch
Two bugs in the sequential engine's _run_full_column_generator:
1. replace_buffer(df.to_dict()) erased __internal_skipped_columns in
three code paths (MultiColumnConfig, non-skip-aware, has_skipped=False
fallthrough), breaking propagate_skip for downstream columns when an
independent FULL_COLUMN generator ran between skip-setting and
propagating columns.
2. _column_can_skip returned True for allow_resize=True columns via
propagation, causing the skip-aware merge path to raise on the 1:1
row-count check for 1:N generators.
- Add restore_skip_metadata helper to skip_tracker.py
- Guard _column_can_skip against allow_resize=True columns
- Refactor _run_full_column_generator into three focused methods
- Remove dead allow_resize / _log_resize_if_changed from skip path
- Remove redundant _require_graph() calls in skip helpers
- Add single_column_config_by_name cached property
- Add integration tests for both bugs and unit tests for the helper
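To illustrate the first bug, a simplified sketch that uses plain dict records instead of DataFrames. `restore_skipped_columns` is a hypothetical stand-in; the real restore_skip_metadata helper in skip_tracker.py may take different arguments.

```python
from typing import Any

SKIP_KEY = "__internal_skipped_columns"  # key name taken from the description above

Record = dict[str, Any]


def restore_skipped_columns(old: list[Record], new: list[Record]) -> list[Record]:
    """Copy skip metadata from the pre-generation records onto their replacements."""
    if len(old) != len(new):
        raise ValueError("row count changed; metadata cannot be restored 1:1")
    for before, after in zip(old, new):
        if SKIP_KEY in before and SKIP_KEY not in after:
            after[SKIP_KEY] = set(before[SKIP_KEY])
    return new


old = [{"a": 1, SKIP_KEY: {"review"}}, {"a": 2}]
# A round trip through a tabular form drops the metadata key.
new = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
restored = restore_skipped_columns(old, new)
```

The explicit row-count check mirrors why allow_resize columns are kept out of this path: a 1:N generator has no 1:1 mapping to restore against.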
Made-with: Cursor
* address review comments on skip.when PR (#502)
- Extract shared skip decision logic (_should_skip_cell / _should_skip_record)
into should_skip_column_for_record() in skip_evaluator.py so both sync and
async engines call the same function (andreatgretel review comment)
- Extend SkipConfig self-reference validation to cover side-effect columns
(e.g. review__trace on the review column) — previously only checked
self.name, now checks self.name | self.side_effect_columns
- Add async engine integration tests for skip paths: cell-by-cell with
propagation and full-column batch skip (exercises _run_cell / _run_batch)
- Fix test_allow_resize_column_not_blocked_by_upstream_skip to use default
propagate_skip=True so it actually exercises the allow_resize guard
- Move get_skipped_column_names from skip_tracker to skip_evaluator (sole
production consumer)
Made-with: Cursor
* address cr feedback
* Fix issue with full-column generation messing up order of skipped rows
* add skip conditional generation edge case tests
- test_skip_evaluator: parametrized should_skip_column_for_record covering
propagation, expression gates, short-circuiting, and disabled propagation
- test_execution_graph: skip metadata accessors (get_skip_config,
should_propagate_skip, get_required_columns, get_side_effect_columns,
resolve_side_effect, skip.when DAG edges)
- test_dataset_builder: chained transitive propagation (4 levels),
two independent skip gates, custom skip.value, row count preservation
Made-with: Cursor
* fix: make expression jinja validator private
Rename assert_expression_valid_jinja to _assert_expression_valid_jinja
to match the private naming convention used by other model validators.
Made-with: Cursor
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: simplify tutorial 4 image dataset and use default model config
Switch from the large ColPali dataset (52 GB) to rokmr/pets (~23 MB)
for faster downloads in the vision tutorial. Use the default
nvidia-vision model alias instead of a custom ModelConfig block.
* regen colab notebooks
* fix: repair notebook CI by replacing dead vision model and adding missing API key
- Replace `meta/llama-4-scout-17b-16e-instruct` (no longer serving on
build.nvidia.com) with `nvidia/nemotron-nano-12b-v2-vl` (project default)
in tutorial notebook 4
- Add `OPENROUTER_API_KEY` to the `build-notebooks` workflow so notebooks
5 and 6 (which use OpenRouter for image generation) can authenticate
- Regenerate colab notebooks to reflect the model change
* fix: handle pyarrow list types in notebook 6 display_image
When image columns are loaded from parquet with pyarrow backend,
list values are pyarrow ListScalars, not Python lists. The
isinstance(x, list) check fails, causing the whole ListScalar to be
treated as a single path string (producing filenames ending in
`png')]`). Use isinstance(x, str) instead to correctly handle any
iterable type.
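A small sketch of the failure mode, using a stand-in class instead of a real pyarrow value so it runs without pyarrow installed. `FakeListScalar`, `to_paths_buggy`, and `to_paths_fixed` are hypothetical names and not the notebook's code.

```python
from typing import Any


class FakeListScalar:
    """Stand-in for an iterable that is not a Python list."""

    def __init__(self, items: list[str]) -> None:
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

    def __repr__(self) -> str:
        return f"FakeListScalar({self._items!r})"


def to_paths_buggy(value: Any) -> list[str]:
    # Anything that is not exactly a list is treated as one path string.
    return list(value) if isinstance(value, list) else [str(value)]


def to_paths_fixed(value: Any) -> list[str]:
    # Only a str is a single path; every other iterable is expanded.
    return [value] if isinstance(value, str) else [str(v) for v in value]


value = FakeListScalar(["a.png", "b.png"])
buggy = to_paths_buggy(value)
fixed = to_paths_fixed(value)
```

With the buggy check, the repr of the whole container becomes a single "path", which is how a filename ending in `png')]` can arise.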
- Convert notebook 3 from string-based columns to class specs (dd.SamplerColumnConfig, etc.)
- Fix grammar: "is the main object is responsible" → "is the main object responsible"
- Remove stray "A" at end of URL in notebook 2
- Remove empty markdown cell in notebook 4
- Add missing data_designer.validate() call in notebook 4
- Regenerate colab notebooks from source
* docs: add deployment, performance tuning guides and streamline getting started
- Add deployment-options.md: Library vs. Microservice decision guide
- Add inference-architecture.md: Separation of concerns with LLM servers
- Add performance-tuning.md: Concurrency and batching optimization guide
- Streamline index.md: Merge installation, add quick example, simplify
- Remove quick-start.md: Content merged into welcome page
- Remove installation.md: Content merged into welcome page
- Update model docs: Add concurrency control sections and cross-references
- Update mkdocs.yml: Add new Architecture section to navigation
* docs: add tasteful emojis to new documentation pages
* docs: consolidate redundant concurrency and troubleshooting content
- Remove duplicate max_parallel_requests tables from model-configs.md and inference-parameters.md
- Remove duplicate Concurrency Control section from model-configs.md
- Simplify Concurrency Control in inference-parameters.md to link to performance-tuning.md
- Remove Troubleshooting section from inference-architecture.md (covered in performance-tuning.md)
- performance-tuning.md is now the authoritative source for tuning guidance
* Simplified doc additions
* Switched default model to nemotron 3 nano
* Addressed feedback
* Added first blog draft
* Add generation type to ModelConfig
* pass tests
* added generate_text_embeddings
* tests
* remove sensitive=True old artifact no longer needed
* Slight refactor
* slight refactor
* Added embedding generator
* chunk_separator -> chunk_pattern
* update tests
* rename for consistency
* Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters
* Remove purpose from consolidated kwargs
* WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters
* Type as WithModelGeneration
* Add image generation modality
* update return type for generate_kwargs
* make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters
* remove regex based chunking from embedding generator
* Remove image generation for now
* more tests and updates
* column_type_is_llm_generated -> column_type_is_model_generated
* change set to list: fix flaky tests
* CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type
* Update docs
* fix deprecation warning originating from cli model settings
* update display of inference parameters in cli list
* save progress on inference parameter
* updates for the config builder
* update cli readme
* update cli for inference parameters
* update inference parameter names
* flip order of vars
* WithCompletion -> WithChatCompletion
* specify InferenceParamsT
* Update columns.md with EmbeddingColumnConfig info
* make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout
* DRY out some stuff in field.py
* Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency
* Add nvidia-embedding and openai-embedding to default model configs
* Fix typo in docs
* Make generate colab notebooks
* fine-tune -> adjust
* Add example notebook showing how to use image contexts
* change 101 -> tutorial
* update _README.md with info on the new tutorial
* add reference in mkdocs.yml
* simplify vlm tutorial
* update num_records on tutorials. Update .gitignore
* update readme info
* add models module to code reference
* fix links to generated ipynb
* change vlm in example tutorial to llama4-scout
* adding basic jupytext structure
Co-authored-by: Johnny Greco <jogreco@nvidia.com>
* few fixes
* first test for ci
* adding error intentionally to check workflow behavior
* test calling from other workflows
* typo
* trying as job instead
* couple of fixes
* checking path
* trying to fix path
* wrapping up
---------
Co-authored-by: Johnny Greco <jogreco@nvidia.com>