* feat: add Streamable HTTP transport support for remote MCP providers (#357)
Add `streamable_http` as a supported transport type for `MCPProvider`,
enabling connections to MCP servers that use the Streamable HTTP protocol
(e.g. Tavily remote endpoints). Previously only SSE transport was supported,
causing silent 5-minute timeouts when connecting to incompatible endpoints.
- Expand `MCPProvider.provider_type` to `Literal["sse", "streamable_http"]`
(default remains `"sse"` for backwards compatibility)
- Route `streamable_http` providers through `streamablehttp_client` from
the MCP SDK in `MCPIOService._get_or_create_session()`
- Handle variable-length context manager results from MCP transport clients
- Add `DataDesigner.list_mcp_tool_names()` for discovering available tools
- Update CLI form builder and controller to support the new transport option
- Add tests for streamable_http config, session creation, and form builder
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
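A minimal sketch of the two ideas in this change: validating the transport name early, and tolerating transport clients that yield different numbers of items. It assumes the SSE client yields a (read, write) pair and the Streamable HTTP client yields one extra item. The helper names `validate_provider_type` and `unpack_streams` are hypothetical and are not the project's API.

```python
from typing import Any, Literal, get_args

# Transport names come from the commit message; the helpers are illustrative.
ProviderType = Literal["sse", "streamable_http"]


def validate_provider_type(value: str = "sse") -> str:
    """Reject unknown transports up front instead of timing out later."""
    if value not in get_args(ProviderType):
        raise ValueError(f"Unsupported provider_type: {value!r}")
    return value


def unpack_streams(transport_result: tuple[Any, ...]) -> tuple[Any, Any]:
    """Return (read, write) regardless of how many extra items were yielded."""
    # Assumption: the first two items are always the read and write streams.
    read_stream, write_stream, *_extras = transport_result
    return read_stream, write_stream
```

Star-unpacking keeps the session code identical for both transports, so adding a third transport later would not require another branch at the unpacking step.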
* updates
* simplify import
* address greptile comments
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add processor plugin support
Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).
- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation
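The RLock change can be illustrated with a small sketch. It assumes discovery may call back into the registry on the same thread, which would block forever on a plain `threading.Lock`. `PluginRegistrySketch` and its methods are hypothetical stand-ins, not the real PluginRegistry.

```python
import threading


class PluginRegistrySketch:
    """Hypothetical registry; only the locking pattern mirrors the commit."""

    def __init__(self) -> None:
        self._lock = threading.RLock()  # re-entrant: the owner may re-acquire
        self._plugins: dict[str, object] = {}
        self._discovered = False

    def discover(self) -> None:
        with self._lock:
            if self._discovered:
                return
            self._discovered = True
            # Loading a plugin may re-enter the registry on this same thread.
            self.register("regex-filter", object())

    def register(self, name: str, plugin: object) -> None:
        with self._lock:  # nested acquisition; a plain Lock would hang here
            self._plugins[name] = plugin

    def names(self) -> list[str]:
        with self._lock:
            self.discover()
            return sorted(self._plugins)
```

Calling `names()` acquires the lock three times on one thread (names, discover, register), which is exactly the nesting an RLock permits.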
* test: add processor plugin registration test
Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.
* test: simplify processor plugin registration test
* move ProcessorConfig to base and convert demo to e2e test
- Move ProcessorConfig from processors.py to config.base to guard
against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/
* move plan to plans/299/
* feat: add allow_resize for 1:N and N:1 generation patterns
Adds support for generators that produce a different number of records
than the input (expansion or retraction). This addresses GitHub issue #265.
Changes:
- Add `allow_resize` parameter to `update_records()` in DatasetBatchManager
- Add `allow_resize` field to CustomColumnConfig
- Add validation requiring FULL_COLUMN strategy when allow_resize=True
- Track and report actual_num_records in metadata (may differ from target)
- Add logging when batch size changes
- Add example_allow_resize.py demonstrating the feature
- Add comprehensive tests
* docs: add allow_resize to custom columns documentation
* refactor: consolidate buffer API and elevate allow_resize to base config
- Merge update_records and replace_buffer into a single replace_buffer
method with allow_resize parameter on DatasetBatchManager
- Move allow_resize field from CustomColumnConfig to SingleColumnConfig
so plugins inherit it without needing a mixin
- Align example and logging with final CustomColumn API
- Parametrize resize tests and extract shared stub in test_columns
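A rough sketch of what a single `replace_buffer` with an `allow_resize` flag could look like. `BatchBufferSketch` is a hypothetical stand-in for DatasetBatchManager, and the error wording is invented for illustration.

```python
class BatchBufferSketch:
    """Hypothetical buffer holder; not the real DatasetBatchManager."""

    def __init__(self, records: list[dict]) -> None:
        self._buffer = list(records)

    def replace_buffer(self, records: list[dict], *, allow_resize: bool = False) -> None:
        # Without allow_resize the record count must stay fixed (1:1 generation).
        if not allow_resize and len(records) != len(self._buffer):
            raise ValueError(
                f"Expected {len(self._buffer)} records, got {len(records)}; "
                "pass allow_resize=True to change the batch size."
            )
        self._buffer = list(records)

    @property
    def actual_num_records(self) -> int:
        # May differ from the requested target once a resize has happened.
        return len(self._buffer)
```

Making `allow_resize` keyword-only keeps call sites explicit about whether a size change is intended.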
* test: add chained resize and multi-batch integration tests
- Add expand->retract->expand chaining test (single batch)
- Add multi-batch resize test verifying combined parquet output
- Update example to chain expand/retract/expand with preview+build
- Use 💥/✂️ emojis for resize logging (expand/retract)
* extend allow_resize to cell-by-cell (return dict or list[dict])
- Config: allow allow_resize with CELL_BY_CELL; relax validator
- Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize;
validate per row via _validate_cell_output
- Builder: collect results by index when cell allow_resize, flatten and
replace_buffer; add _log_resize_if_changed and _column_display_name
- Docs: ALL_CAPS for strategies, simplify allow_resize table text
- Tests: parametrized preview and multibatch; factories with n param;
_RESIZE_SPECS with inline factory calls; ids ordered like specs
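The cell-by-cell resize path (collect results by index, then flatten) might be sketched as follows. `flatten_cell_results` is a hypothetical helper: a dict keeps the row, a list of dicts expands it, and an empty list drops it.

```python
from typing import Union

CellOutput = Union[dict, list[dict]]


def flatten_cell_results(results_by_index: dict[int, CellOutput]) -> list[dict]:
    """Flatten per-row outputs into one record list, preserving row order."""
    rows: list[dict] = []
    for index in sorted(results_by_index):  # results may arrive out of order
        output = results_by_index[index]
        if isinstance(output, dict):
            rows.append(output)  # 1:1
        elif isinstance(output, list) and all(isinstance(r, dict) for r in output):
            rows.extend(output)  # 1:N, or 1:0 when the list is empty
        else:
            raise TypeError(f"Row {index}: expected dict or list[dict], got {type(output).__name__}")
    return rows
```

Sorting by index matters because concurrent cell generation can complete rows in any order, while the flattened buffer should still follow the input order.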
* reorder allow_resize specs and add edge-case tests
- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd,
cell_drop_all to _RESIZE_SPECS
- Stubs before specs: _resize_full_keep_first, _resize_cell_expand,
_resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories
- Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS
- Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3
multibatch cases (5_2, 4_2) first
- Handle all-batches-skipped in multibatch test (empty df when path missing)
- test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)
* tidy allow_resize: drop validator, shared stub, explicit flag
- Remove validate_allow_resize_requires_full_column from CustomColumnConfig
- Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns
- Pass allow_resize=False in _write_processed_batch replace_buffer call
* fix: add missing f prefix to error message in custom.py
* docs(plugins): add section on setting allow_resize=True for resize plugins
* fix: address PR review comments on allow_resize
- Replace getattr with direct attribute access where config is always
SingleColumnConfig (custom.py, cell-by-cell path in builder)
- Keep getattr in _run_full_column_generator which also handles
multi-column configs without allow_resize
- Restructure allow_resize validation branching in CustomColumnGenerator
- Fix error message wording: "key" -> "column"
* fix: remove duplicate tool_alias log, fix test docstring
- Remove tool_alias log from _setup_fan_out (callers already log it)
- Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory
* fix: avoid duplicate undeclared-column warning in _validate_output
Inline the strip instead of delegating to _validate_cell_output,
which would log the same warning a second time.
* fix: use lazy.pd instead of pd for runtime pandas usage in tests
The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.
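For context, a simplified sketch of the pattern behind this fix: a name imported under `TYPE_CHECKING` exists only for type checkers, so runtime code has to go through a lazy accessor. The `_LazyModules` class is hypothetical, and the standard-library `json` module stands in for pandas so the sketch runs anywhere.

```python
import importlib
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import json as js  # seen by type checkers only; never executed at runtime


class _LazyModules:
    """Hypothetical lazy accessor: imports a module on first attribute use."""

    _targets = {"js": "json"}  # attribute name -> module to import

    def __getattr__(self, name: str):
        try:
            target = self._targets[name]
        except KeyError:
            raise AttributeError(name) from None
        module = importlib.import_module(target)
        setattr(self, name, module)  # cache so later lookups skip __getattr__
        return module


lazy = _LazyModules()


def dump(values: list) -> str:
    # Using the bare name `js` here would raise NameError at runtime.
    return lazy.js.dumps(values)
```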
* first attempt
* iterating a bit
* some improvements + multiturn example
* adapting to new monorepo structure
* refining
* fixed test
* fixing license headers
* adding docs
* adding test for failed generation
* allowing strategy to be picked
* renaming argument
* lint
* remove recommendation
* renaming for consistency
* addressing comments pt1
* addressing comments pt2
* addressing comments pt3
* adding a mock for development
* addressing greptile comments
* revamping
* docs: streamline custom columns documentation
* docs: simplify CustomColumnConfig docstring
Remove verbose code example and detailed function signatures from
docstring to match the pattern of other config classes in the file.
* test: clean up custom column tests
- Remove tests for private _custom_column_metadata attribute
- Combine redundant generator creation tests
- Reuse stub_resource_provider and stub_model_facade fixtures
* test: consolidate custom column tests
Reduce from 26 to 11 tests while maintaining coverage:
- Combine redundant config/decorator/creation tests
- Use parametrized tests for error conditions
- Remove duplicate validation tests for full_column strategy
- Simplify section headers
* refactor: deduplicate CustomColumnGenerator logic
Merge cell-by-cell and full-column code paths:
- _generate_cell_by_cell + _generate_full_column -> _generate
- _validate_output_columns + _validate_output_columns_df -> _validate_output
* chore: merge example files into single notebook-style example.py
Combine example.py, example_multiturn.py, and example_benchmark_strategies.py
into a single file with #%% cell markers for Jupyter/VS Code notebook mode.
* addressing greptile comments
* refactor: reuse generate_text in generate_text_batch
* refactor: replace CustomColumnContext with models dict
- Remove CustomColumnContext class; users now receive models dict directly
- Add DataDesigner.get_models() for experimentation outside pipeline
- Make parser optional in ModelFacade.generate() (defaults to identity)
- Validate parameter names: row/df, generator_params, models
- Update examples, tests, and docs for new API
* fix: address PR review comments from Nabin and greptile
- Make decorator metadata public (custom_column_metadata)
- Simplify get_generation_strategy() to directly return config value
- Use !r formatting in error messages
- Use lazy imports pattern for pandas (TYPE_CHECKING + lazy_heavy_imports)
- Remove redundant error logging before re-raise
- Validate max 3 positional parameters
- Use GenerationStrategy enum in example instead of string
* fix: replace lambda with module-level identity function in facade
Use pickleable _identity function instead of lambda x: x for the
default parser argument, ensuring compatibility with multiprocessing.
* fix: restore inherited attributes in LLM column docstrings
Restores the "Inherited Attributes" sections that were unintentionally
removed from LLMCodeColumnConfig, LLMStructuredColumnConfig, and
LLMJudgeColumnConfig docstrings.
* docs: clarify model_aliases is required for LLM access
Updated documentation and docstrings to clarify that model_aliases
populates the models dict (not just health checks).
* fix: address PR review comments from nabinchha
- clarify model_aliases requirement in docs
- add note about model alias validation during health check
- combine two loops into one in _run_model_health_check_if_needed
- add signature validation at decoration time
- enforce decorated functions in CustomColumnConfig validator
- simplify generator to only validate strategy-specific first param
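Decoration-time signature validation could look roughly like the sketch below. The decorator name `custom_column_sketch` is hypothetical, and the parameter order (`row` or `df`, then `generator_params`, then `models`) is an assumption drawn from the parameter names listed in the earlier commits.

```python
import inspect
from typing import Callable

FIRST_PARAM_NAMES = ("row", "df")  # cell-by-cell vs. full-column input
OPTIONAL_PARAM_NAMES = ("generator_params", "models")  # assumed order


def custom_column_sketch(func: Callable) -> Callable:
    """Validate the signature when the decorator runs, not at generation time."""
    names = list(inspect.signature(func).parameters)
    if not 1 <= len(names) <= 1 + len(OPTIONAL_PARAM_NAMES):
        raise TypeError(f"{func.__name__!r} must take 1 to 3 parameters, got {len(names)}")
    if names[0] not in FIRST_PARAM_NAMES:
        raise TypeError(f"First parameter must be one of {FIRST_PARAM_NAMES!r}, got {names[0]!r}")
    for name, expected in zip(names[1:], OPTIONAL_PARAM_NAMES):
        if name != expected:
            raise TypeError(f"Expected parameter {expected!r}, got {name!r}")
    return func
```

Failing at decoration time surfaces a bad signature when the module is imported, before any model calls are made.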
* fix: address remaining PR review comments
- remove example.py (development artifact)
- fix get_models return type to dict[str, ModelFacade]
* test: update tests for decoration-time validation
- expect ValidationError instead of InvalidConfigError for non-callable
- split param validation test into decoration-time and runtime tests
* docs: add deployment, performance tuning guides and streamline getting started
- Add deployment-options.md: Library vs. Microservice decision guide
- Add inference-architecture.md: Separation of concerns with LLM servers
- Add performance-tuning.md: Concurrency and batching optimization guide
- Streamline index.md: Merge installation, add quick example, simplify
- Remove quick-start.md: Content merged into welcome page
- Remove installation.md: Content merged into welcome page
- Update model docs: Add concurrency control sections and cross-references
- Update mkdocs.yml: Add new Architecture section to navigation
* docs: add tasteful emojis to new documentation pages
* docs: consolidate redundant concurrency and troubleshooting content
- Remove duplicate max_parallel_requests tables from model-configs.md and inference-parameters.md
- Remove duplicate Concurrency Control section from model-configs.md
- Simplify Concurrency Control in inference-parameters.md to link to performance-tuning.md
- Remove Troubleshooting section from inference-architecture.md (covered in performance-tuning.md)
- performance-tuning.md is now the authoritative source for tuning guidance
* Simplified doc additions
* Switched default model to nemotron 3 nano
* Addressed feedback
* Added first blog draft
Add support for five high-priority programming languages to Data Designer's
code generation capabilities:
- **Bash**: Universal DevOps and automation scripting
- **C, C++, C#**: Systems programming and enterprise development
- **COBOL**: Legacy mainframe systems and modernization
These languages address critical enterprise use cases including legacy code
maintenance, systems programming, and infrastructure automation.
Changes:
- Add new CodeLang enum values for bash, c, cpp, csharp, cobol
- Update code_lang_to_syntax_lexer() with Pygments lexer mappings
- Update documentation to reflect new supported languages
- Update tests to account for 21 total supported languages (up from 16)
Co-authored-by: Johnny Greco <jogreco@nvidia.com>
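As an illustration of the enum and lexer mapping described above, here is a sketch limited to the five new languages. The enum values are the ones named in the commit; `CodeLangSketch`, the member names, and `code_lang_to_lexer_alias` are hypothetical, and the mapping targets are Pygments lexer aliases.

```python
from enum import Enum


class CodeLangSketch(str, Enum):
    """Only the five new values; the real CodeLang enum has more members."""

    BASH = "bash"
    C = "c"
    CPP = "cpp"
    CSHARP = "csharp"
    COBOL = "cobol"


# Pygments lexer aliases for each new language.
_LEXER_ALIASES = {
    CodeLangSketch.BASH: "bash",
    CodeLangSketch.C: "c",
    CodeLangSketch.CPP: "cpp",
    CodeLangSketch.CSHARP: "csharp",
    CodeLangSketch.COBOL: "cobol",
}


def code_lang_to_lexer_alias(lang: str) -> str:
    """Map a language value to a lexer alias; unknown values raise ValueError."""
    return _LEXER_ALIASES[CodeLangSketch(lang)]
```

The returned alias can be passed to `pygments.lexers.get_lexer_by_name` when Pygments is installed.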
Add support for capturing full conversation traces during LLM generation,
enabling debugging and fine-tuning dataset creation.
Changes:
- Add `with_trace` field to LLMTextColumnConfig for per-column trace control
- Add `debug_override_save_all_column_traces` to RunConfig for global trace
- Introduce ChatMessage dataclass for structured message representation
- Update ModelFacade.generate() to return full message trace
- Rename trace column postfix from `__reasoning_trace` to `__trace`
- Add comprehensive traces documentation
Traces capture system/user/assistant messages in order, enabling visibility
into the full generation conversation including correction retries.
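A sketch of how an ordered trace might be represented and stored next to its column. The `role` and `content` fields and both helper functions are assumptions; only the `__trace` postfix and the ChatMessage name come from the commit.

```python
from dataclasses import asdict, dataclass

TRACE_POSTFIX = "__trace"  # renamed from "__reasoning_trace"


@dataclass(frozen=True)
class ChatMessageSketch:
    """Hypothetical message shape; the real ChatMessage may carry more fields."""

    role: str  # e.g. "system", "user", or "assistant"
    content: str


def trace_column_name(column_name: str) -> str:
    """Name of the side column that stores the trace for column_name."""
    return f"{column_name}{TRACE_POSTFIX}"


def serialize_trace(messages: list[ChatMessageSketch]) -> list[dict]:
    """Convert messages to plain dicts in conversation order."""
    return [asdict(message) for message in messages]
```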
* Add generation type to ModelConfig
* pass tests
* added generate_text_embeddings
* tests
* remove sensitive=True old artifact no longer needed
* Slight refactor
* slight refactor
* Added embedding generator
* chunk_separator -> chunk_pattern
* update tests
* rename for consistency
* Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters
* Remove purpose from consolidated kwargs
* WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters
* Type as WithModelGeneration
* Add image generation modality
* update return type for generate_kwargs
* make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters
* remove regex based chunking from embedding generator
* Remove image generation for now
* more tests and updates
* column_type_is_llm_generated -> column_type_is_model_generated
* change set to list: fix flaky tests
* CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type
* Update docs
* fix deprecation warning originating from cli model settings
* update display of inference parameters in cli list
* save progress on inference parameter
* updates for the config builder
* update cli readme
* update cli for inference parameters
* update inference parameter names
* flip order of vars
* WithCompletion -> WithChatCompletion
* specify InferenceParamsT
* Update columns.md with EmbeddingColumnConfig info
* make generation_type a discriminator field in inference params; add configuration support for max_parallel_requests and timeout
* DRY out some stuff in field.py
* docs for custom model settings
* Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency
* Add nvidia-embedding and openai-embedding to default model configs
* Fix typo in docs
* Make generate Colab notebooks
* Address PR comments
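The `generation_type` discriminator from the commits above can be sketched with plain dataclasses. The class names, the literal values "chat-completion" and "embedding", the `temperature` field, and the defaults are all assumptions; only `generation_type`, `max_parallel_requests`, and `timeout` are named in the log.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BaseParamsSketch:
    max_parallel_requests: int = 4  # hypothetical default
    timeout: Optional[int] = None


@dataclass
class ChatCompletionParamsSketch(BaseParamsSketch):
    temperature: Optional[float] = None  # assumed chat-only field
    generation_type: str = field(default="chat-completion", init=False)


@dataclass
class EmbeddingParamsSketch(BaseParamsSketch):
    generation_type: str = field(default="embedding", init=False)


_PARAMS_BY_TYPE = {
    "chat-completion": ChatCompletionParamsSketch,
    "embedding": EmbeddingParamsSketch,
}


def parse_inference_parameters(raw: dict) -> BaseParamsSketch:
    """Pick the parameter class from the generation_type discriminator."""
    data = dict(raw)  # copy so the caller's dict is not mutated
    generation_type = data.pop("generation_type", "chat-completion")
    try:
        params_cls = _PARAMS_BY_TYPE[generation_type]
    except KeyError:
        raise ValueError(f"Unknown generation_type: {generation_type!r}") from None
    return params_cls(**data)
```

Keeping the discriminator as an explicit field means the parameter class is chosen from the data itself rather than inferred from which fields happen to be present.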
* fine-tune -> adjust
* Update persona docs
* Updated person sampling docs based on feedback
* remove nemotron personas sampling
* Remove nemotron personas sampling
* Update docs/concepts/person_sampling.md
---------
Co-authored-by: Johnny Greco <jogreco@nvidia.com>