Commit graph

39 commits

Author SHA1 Message Date
Nabin Mulepati
e4857f62fa
feat: add Streamable HTTP transport support for remote MCP providers (#358)
* feat: add Streamable HTTP transport support for remote MCP providers (#357)

Add `streamable_http` as a supported transport type for `MCPProvider`,
enabling connections to MCP servers that use the Streamable HTTP protocol
(e.g. Tavily remote endpoints). Previously only SSE transport was supported,
causing silent 5-minute timeouts when connecting to incompatible endpoints.

- Expand `MCPProvider.provider_type` to `Literal["sse", "streamable_http"]`
  (default remains `"sse"` for backwards compatibility)
- Route `streamable_http` providers through `streamablehttp_client` from
  the MCP SDK in `MCPIOService._get_or_create_session()`
- Handle variable-length context manager results from MCP transport clients
- Add `DataDesigner.list_mcp_tool_names()` for discovering available tools
- Update CLI form builder and controller to support the new transport option
- Add tests for streamable_http config, session creation, and form builder

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* updates

* simplify import

* address greptile comments

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 08:11:54 -07:00
Andre Manoel
982ce79ca9
feat: add processor plugin support (#299)
* feat: add processor plugin support

Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).

- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation
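
A rough illustration of why a re-entrant lock helps during discovery, assuming discovery registers plugins while the lock is already held by the same thread. `TinyRegistry` and its methods are hypothetical and are not the project's `PluginRegistry`.

```python
import threading


class TinyRegistry:
    """Hypothetical registry; names are illustrative only."""

    def __init__(self):
        # RLock lets the owning thread acquire the lock again without blocking.
        self._lock = threading.RLock()
        self._plugins = {}
        self._discovered = False

    def discover(self):
        with self._lock:
            if self._discovered:
                return
            self._discovered = True
            # Discovery registers plugins, which re-enters the lock on this thread.
            self.register("regex-filter", object())

    def register(self, name, plugin):
        with self._lock:
            self._plugins[name] = plugin

    def names(self):
        with self._lock:
            self.discover()
            return sorted(self._plugins)


registry = TinyRegistry()
plugin_names = registry.names()
```

With a plain `threading.Lock`, the nested `with self._lock` in `discover()` would block forever because the same thread already holds the lock.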

* test: add processor plugin registration test

Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.

* test: simplify processor plugin registration test

* move ProcessorConfig to base and convert demo to e2e test

- Move ProcessorConfig from processors.py to config.base to guard
  against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/

* move plan to plans/299/
2026-02-25 16:40:01 -03:00
Andre Manoel
70dc48884e
feat: add allow_resize for 1:N and N:1 generation patterns (#286)
* feat: add allow_resize for 1:N and N:1 generation patterns

Adds support for generators that produce a different number of records
than the input (expansion or retraction). This addresses GitHub issue #265.

Changes:
- Add `allow_resize` parameter to `update_records()` in DatasetBatchManager
- Add `allow_resize` field to CustomColumnConfig
- Add validation requiring FULL_COLUMN strategy when allow_resize=True
- Track and report actual_num_records in metadata (may differ from target)
- Add logging when batch size changes
- Add example_allow_resize.py demonstrating the feature
- Add comprehensive tests
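
A small sketch of the resize idea, assuming records are plain dicts in a list. `update_records_sketch`, `expand`, `retract`, and the `target_num_records` key are hypothetical and only illustrate the size check and the expansion/retraction patterns.

```python
def update_records_sketch(buffer, new_records, allow_resize=False):
    """Hypothetical stand-in for a batch-buffer update; not the project's real method."""
    if len(new_records) != len(buffer) and not allow_resize:
        raise ValueError(
            f"expected {len(buffer)} records, got {len(new_records)}; "
            "pass allow_resize=True to permit a size change"
        )
    return list(new_records)


def expand(records):
    # Expansion (1:N): each input record yields two output records.
    return [{**r, "variant": v} for r in records for v in ("a", "b")]


def retract(records):
    # Retraction: keep only records with an even id.
    return [r for r in records if r["id"] % 2 == 0]


batch = [{"id": i} for i in range(3)]
expanded = update_records_sketch(batch, expand(batch), allow_resize=True)
retracted = update_records_sketch(expanded, retract(expanded), allow_resize=True)

# The actual record count can differ from the requested target.
metadata = {"target_num_records": len(batch), "actual_num_records": len(retracted)}
```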

* docs: add allow_resize to custom columns documentation

* refactor: consolidate buffer API and elevate allow_resize to base config

- Merge update_records and replace_buffer into a single replace_buffer
  method with allow_resize parameter on DatasetBatchManager
- Move allow_resize field from CustomColumnConfig to SingleColumnConfig
  so plugins inherit it without needing a mixin
- Align example and logging with final CustomColumn API
- Parametrize resize tests and extract shared stub in test_columns

* test: add chained resize and multi-batch integration tests

- Add expand->retract->expand chaining test (single batch)
- Add multi-batch resize test verifying combined parquet output
- Update example to chain expand/retract/expand with preview+build
- Use 💥/✂️ emojis for resize logging (expand/retract)

* extend allow_resize to cell-by-cell (return dict or list[dict])

- Config: allow allow_resize with CELL_BY_CELL; relax validator
- Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize;
  validate per row via _validate_cell_output
- Builder: collect results by index when cell allow_resize, flatten and
  replace_buffer; add _log_resize_if_changed and _column_display_name
- Docs: ALL_CAPS for strategies, simplify allow_resize table text
- Tests: parametrized preview and multibatch; factories with n param;
  _RESIZE_SPECS with inline factory calls; ids ordered like specs
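
A sketch of the "collect results by index, then flatten" step, assuming each cell-level call returns either one row (`dict`) or a list of rows (`list[dict]`). `flatten_cell_results` is a hypothetical helper, not the builder's actual code.

```python
def flatten_cell_results(results_by_index):
    """Flatten per-row results into one list of rows, in input-row order."""
    rows = []
    for index in sorted(results_by_index):
        result = results_by_index[index]
        # A dict is a single row; a list holds zero or more rows.
        rows.extend([result] if isinstance(result, dict) else result)
    return rows


# Results may arrive out of order when cells are generated concurrently.
results_by_index = {
    2: [],                                             # row dropped
    0: {"id": 0},                                      # 1:1
    1: [{"id": 1, "part": 1}, {"id": 1, "part": 2}],   # 1:N
}
rows = flatten_cell_results(results_by_index)
```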

* reorder allow_resize specs and add edge-case tests

- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd,
  cell_drop_all to _RESIZE_SPECS
- Stubs before specs: _resize_full_keep_first, _resize_cell_expand,
  _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories
- Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS
- Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3
  multibatch cases (5_2, 4_2) first
- Handle all-batches-skipped in multibatch test (empty df when path missing)
- test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)

* tidy allow_resize: drop validator, shared stub, explicit flag

- Remove validate_allow_resize_requires_full_column from CustomColumnConfig
- Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns
- Pass allow_resize=False in _write_processed_batch replace_buffer call

* fix: add missing f prefix to error message in custom.py

* docs(plugins): add section on setting allow_resize=True for resize plugins

* fix: address PR review comments on allow_resize

- Replace getattr with direct attribute access where config is always
  SingleColumnConfig (custom.py, cell-by-cell path in builder)
- Keep getattr in _run_full_column_generator which also handles
  multi-column configs without allow_resize
- Restructure allow_resize validation branching in CustomColumnGenerator
- Fix error message wording: "key" -> "column"

* fix: remove duplicate tool_alias log, fix test docstring

- Remove tool_alias log from _setup_fan_out (callers already log it)
- Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory

* fix: avoid duplicate undeclared-column warning in _validate_output

Inline the strip instead of delegating to _validate_cell_output,
which would log the same warning a second time.

* fix: use lazy.pd instead of pd for runtime pandas usage in tests

The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.
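
A generic sketch of the pattern behind this fix, using the standard-library `decimal` module in place of pandas so it runs anywhere. `LazyModules` is a hypothetical stand-in for the project's lazy import helper.

```python
import importlib
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; this import never runs at runtime.
    import decimal


class LazyModules:
    """Hypothetical lazy-import helper; imports a module on first attribute access."""

    def __getattr__(self, name):
        module = importlib.import_module(name)
        setattr(self, name, module)  # cache so later lookups skip __getattr__
        return module


lazy = LazyModules()


def parse_amount(text: str) -> "decimal.Decimal":
    # The TYPE_CHECKING import above does not bind `decimal` at runtime,
    # so runtime code goes through `lazy` instead.
    return lazy.decimal.Decimal(text)


amount = parse_amount("1.50")
```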
2026-02-18 18:39:31 -03:00
Nabin Mulepati
d8d1e668b0
docs: add image generation documentation and image-to-image editing tutorial (#319)
2026-02-12 14:38:52 -07:00
Andre Manoel
429b558588
refactor: callback-based processor design (#294)
2026-02-11 21:32:24 -03:00
Kirit Thadaka
4cfc1669bd
docs: Added documentation for seed datasets (#300)
* Added images for deployment options

* Add seed datasets documentation

- New concepts page explaining seed datasets
- Covers seed sources (LocalFile, HuggingFace, DataFrame)
- Documents sampling and selection strategies
- Includes complete example and best practices

* Incorporated greptile feedback

* Update docs/concepts/seed-datasets.md

Co-authored-by: Johnny Greco <jogreco@nvidia.com>

* Update docs/concepts/seed-datasets.md

Co-authored-by: Johnny Greco <jogreco@nvidia.com>

* Addressed feedback

* Addressed comments

---------

Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2026-02-05 14:29:05 -08:00
Kirit Thadaka
6dc35b2875
Added images for deployment options (#297)
2026-02-04 14:22:56 -08:00
Andre Manoel
62bae42dc2
feat: Add CustomColumnGenerator for user-defined column generation (#254)
* first attempt

* iterating a bit

* some improvements + multiturn example

* adapting to new monorepo structure

* refining

* fixed test

* fixing license headers

* adding docs

* adding test for failed generation

* allowing strategy to be picked

* renaming argument

* lint

* remove recommendation

* renaming for consistency

* addressing comments pt1

* addressing comments pt2

* addressing comments pt3

* adding a mock for development

* addressing greptile comments

* revamping

* docs: streamline custom columns documentation

* docs: simplify CustomColumnConfig docstring

Remove verbose code example and detailed function signatures from
docstring to match the pattern of other config classes in the file.

* test: clean up custom column tests

- Remove tests for private _custom_column_metadata attribute
- Combine redundant generator creation tests
- Reuse stub_resource_provider and stub_model_facade fixtures

* test: consolidate custom column tests

Reduce from 26 to 11 tests while maintaining coverage:
- Combine redundant config/decorator/creation tests
- Use parametrized tests for error conditions
- Remove duplicate validation tests for full_column strategy
- Simplify section headers

* refactor: deduplicate CustomColumnGenerator logic

Merge cell-by-cell and full-column code paths:
- _generate_cell_by_cell + _generate_full_column -> _generate
- _validate_output_columns + _validate_output_columns_df -> _validate_output

* chore: merge example files into single notebook-style example.py

Combine example.py, example_multiturn.py, and example_benchmark_strategies.py
into a single file with #%% cell markers for Jupyter/VS Code notebook mode.

* addressing greptile comments

* refactor: reuse generate_text in generate_text_batch

* refactor: replace CustomColumnContext with models dict

- Remove CustomColumnContext class; users now receive models dict directly
- Add DataDesigner.get_models() for experimentation outside pipeline
- Make parser optional in ModelFacade.generate() (defaults to identity)
- Validate parameter names: row/df, generator_params, models
- Update examples, tests, and docs for new API

* fix: address PR review comments from Nabin and greptile

- Make decorator metadata public (custom_column_metadata)
- Simplify get_generation_strategy() to directly return config value
- Use !r formatting in error messages
- Use lazy imports pattern for pandas (TYPE_CHECKING + lazy_heavy_imports)
- Remove redundant error logging before re-raise
- Validate max 3 positional parameters
- Use GenerationStrategy enum in example instead of string

* fix: replace lambda with module-level identity function in facade

Use pickleable _identity function instead of lambda x: x for the
default parser argument, ensuring compatibility with multiprocessing.
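
A short demonstration of the difference, using only the standard library. The module-level function pickles by reference to its qualified name, while a lambda has no importable name; this sketch assumes it runs as an ordinary module or script.

```python
import pickle


def _identity(value):
    """Module-level identity function; picklable by qualified name."""
    return value


# Round-trips because pickle can look the function up by name.
restored = pickle.loads(pickle.dumps(_identity))

try:
    pickle.dumps(lambda x: x)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError):
    # The lambda cannot be found under an importable name, so pickling fails.
    lambda_pickles = False
```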

* fix: restore inherited attributes in LLM column docstrings

Restores the "Inherited Attributes" sections that were unintentionally
removed from LLMCodeColumnConfig, LLMStructuredColumnConfig, and
LLMJudgeColumnConfig docstrings.

* docs: clarify model_aliases is required for LLM access

Updated documentation and docstrings to clarify that model_aliases
populates the models dict (not just health checks).

* fix: address PR review comments from nabinchha

- clarify model_aliases requirement in docs
- add note about model alias validation during health check
- combine two loops into one in _run_model_health_check_if_needed
- add signature validation at decoration time
- enforce decorated functions in CustomColumnConfig validator
- simplify generator to only validate strategy-specific first param
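
A sketch of signature validation at decoration time, based on the parameter names mentioned in this PR (`row`/`df`, `generator_params`, `models`) and the three-parameter limit. `custom_column_sketch` is hypothetical, and the exact rules and metadata contents in the real decorator may differ.

```python
import inspect


def custom_column_sketch(func):
    """Hypothetical decorator; validates the signature when the function is defined."""
    names = list(inspect.signature(func).parameters)
    if not 1 <= len(names) <= 3:
        raise TypeError(f"{func.__name__!r} must take 1 to 3 parameters, got {len(names)}")
    if names[0] not in ("row", "df"):
        raise TypeError(f"first parameter must be 'row' or 'df', got {names[0]!r}")
    expected_rest = ["generator_params", "models"][: len(names) - 1]
    if names[1:] != expected_rest:
        raise TypeError(f"expected {expected_rest!r} after {names[0]!r}, got {names[1:]!r}")
    # Public metadata attribute; the contents here are illustrative only.
    func.custom_column_metadata = {"first_param": names[0]}
    return func


@custom_column_sketch
def shout(row, generator_params):
    return {"shout": row["text"].upper() + generator_params["suffix"]}


try:
    @custom_column_sketch
    def bad(record):
        return record
    decoration_error = None
except TypeError as exc:
    # The error surfaces at definition time, before any data is generated.
    decoration_error = str(exc)
```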

* fix: address remaining PR review comments

- remove example.py (development artifact)
- fix get_models return type to dict[str, ModelFacade]

* test: update tests for decoration-time validation

- expect ValidationError instead of InvalidConfigError for non-callable
- split param validation test into decoration-time and runtime tests
2026-02-03 19:23:39 -03:00
Eric W. Tramel
5430bcbe99
Remove debug_trace_override (#290)
2026-02-03 12:09:30 -05:00
Eric W. Tramel
532d21a8d7
feat: add extract_reasoning_content option to LLM columns (#285)
2026-02-03 10:25:24 -05:00
Kirit Thadaka
de7c3ab99a
docs: add deployment, performance tuning guides and streamline gettin… (#277)
* docs: add deployment, performance tuning guides and streamline getting started

- Add deployment-options.md: Library vs. Microservice decision guide
- Add inference-architecture.md: Separation of concerns with LLM servers
- Add performance-tuning.md: Concurrency and batching optimization guide
- Streamline index.md: Merge installation, add quick example, simplify
- Remove quick-start.md: Content merged into welcome page
- Remove installation.md: Content merged into welcome page
- Update model docs: Add concurrency control sections and cross-references
- Update mkdocs.yml: Add new Architecture section to navigation

* docs: add tasteful emojis to new documentation pages

* docs: consolidate redundant concurrency and troubleshooting content

- Remove duplicate max_parallel_requests tables from model-configs.md and inference-parameters.md
- Remove duplicate Concurrency Control section from model-configs.md
- Simplify Concurrency Control in inference-parameters.md to link to performance-tuning.md
- Remove Troubleshooting section from inference-architecture.md (covered in performance-tuning.md)
- performance-tuning.md is now the authoritative source for tuning guidance

* Simplified doc additions

* Switched default model to nemotron 3 nano

* Addressed feedback

* Added first blog draft
2026-02-02 21:03:58 -08:00
Eric W. Tramel
510761107b
feat: Add TraceType enum for granular trace control (#284)
2026-02-02 19:43:51 -05:00
Eric W. Tramel
7248b9fc8f
Update trace normalization to ChatML content blocks (#283)
2026-02-02 18:22:16 -05:00
Eric W. Tramel
e6e58e692e
feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248)
2026-02-02 09:41:58 -05:00
Kirit Thadaka
9e1c6ec679
feat: Add Phase 1 languages (Bash, C, C++, C#, COBOL) to CodeLang (#271)
Add support for five high-priority programming languages to Data Designer's
code generation capabilities:

- **Bash**: Universal DevOps and automation scripting
- **C, C++, C#**: Systems programming and enterprise development
- **COBOL**: Legacy mainframe systems and modernization

These languages address critical enterprise use cases including legacy code
maintenance, systems programming, and infrastructure automation.

Changes:
- Add new CodeLang enum values for bash, c, cpp, csharp, cobol
- Update code_lang_to_syntax_lexer() with Pygments lexer mappings
- Update documentation to reflect new supported languages
- Update tests to account for 21 total supported languages (up from 16)

Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2026-01-30 19:52:28 -05:00
Johnny Greco
0d51539aa6
feat: add message trace support for LLM generation (#272)
Add support for capturing full conversation traces during LLM generation,
enabling debugging and fine-tuning dataset creation.

Changes:
- Add `with_trace` field to LLMTextColumnConfig for per-column trace control
- Add `debug_override_save_all_column_traces` to RunConfig for global trace
- Introduce ChatMessage dataclass for structured message representation
- Update ModelFacade.generate() to return full message trace
- Rename trace column postfix from `__reasoning_trace` to `__trace`
- Add comprehensive traces documentation

Traces capture system/user/assistant messages in order, enabling visibility
into the full generation conversation including correction retries.
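
A sketch of what a trace with a correction retry might look like, assuming a message carries a role and text content. `ChatMessageSketch` and `trace_column_name` are hypothetical; only the `__trace` postfix comes from this commit.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChatMessageSketch:
    """Hypothetical message shape; the real ChatMessage fields may differ."""

    role: str
    content: str


TRACE_POSTFIX = "__trace"


def trace_column_name(column_name):
    return f"{column_name}{TRACE_POSTFIX}"


# Messages are kept in order, including the failed attempt and the retry.
trace = [
    ChatMessageSketch("system", "You write product reviews."),
    ChatMessageSketch("user", "Review the blue kettle as JSON."),
    ChatMessageSketch("assistant", "It is great"),
    ChatMessageSketch("user", "That was not valid JSON. Try again."),
    ChatMessageSketch("assistant", '{"review": "It is great"}'),
]

record = {
    "review": trace[-1].content,
    trace_column_name("review"): [asdict(message) for message in trace],
}
```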
2026-01-30 17:03:07 -05:00
Nabin Mulepati
b238d06880
feat: allow skipping health checks (#244)
2026-01-28 10:15:00 -07:00
Johnny Greco
ae0665fa16
refactor: slim package refactor into three subpackages (#240)
* remove old structure

* major shuffle

* streamline project configs

* update make commands

* updates to make commands

* remove essentials

* initialize logger in interface

* uv lock

* ignore notepad

* update workflows

* fix e2e project config

* generate colab notebooks

* resolve default model settings in interface

* fix build commands

* update perf import make command

* cleaning up some slop

* update recipes

* move conftest files to tests/

* update subpackage readmes

* streamline config_logging

* use exports

* update perf import usage pattern

* update for IDE behavior with ruff

* remove engine's fixtures file

* add note about lazy imports

* update dependencies

* update docs

* doc fixes

* uv lock

* updates to catch up with main

* clean up makefile

* remove package gitignores

* define deps only once

* isolate tests

* add test for protection rule

* create temp dirs for isolated tests

* catch up to main

* update headers

* re apply changes

* better result summaries for isolated tests

* move exports into top-level init

* fix client importlib version syntax

* catch up with main
2026-01-27 13:53:20 -05:00
Johnny Greco
50fc50efc7
docs: Fix mkdocs syntax and update person sampling documentation (#249)
* remove colon

* update person sampling docs
2026-01-27 10:18:42 -05:00
Nabin Mulepati
01f8d887f8
chore: deprecate InferenceParameters (#183)
* deprecate InferenceParameters

* update docs and references
2026-01-08 10:43:02 -07:00
Mike Knepper
1c0bf65cc0
docs: Add extra_headers to model provider docs (#178)
2026-01-07 08:27:36 -06:00
Nabin Mulepati
645c7995b7
Fix documentation on max_tokens (#176)
2026-01-06 16:31:05 -07:00
Nabin Mulepati
3b4e296baf
feat: add OpenRouter as one of the default providers (#161)
* Add openrouter as a default provider

* Update docs
2026-01-06 10:22:18 -07:00
Johnny Greco
b71c6c11a8
docs: fix links and tweak person sampling (#152)
* update person sampling

* update docstring
2025-12-18 10:10:41 -08:00
Johnny Greco
b635e41033
update docs (#151)
2025-12-18 12:43:29 -05:00
Andre Manoel
d50a8aef95
docs: add processors (#147)
* first draft

* adding to code reference as well

* docstrings

* addressing comments

* forgot opening line

* docstring too
2025-12-17 15:47:33 -03:00
Nabin Mulepati
8d4c6c12b4
chore: Update nvidia text default model alias to nano v3 (#133)
2025-12-15 15:03:12 -07:00
Nabin Mulepati
3065179f8a
docs: add documentation on how to configure custom model settings (#124)
* Add generation type to ModelConfig

* pass tests

* added generate_text_embeddings

* tests

* remove sensitive=True old artifact no longer needed

* Slight refactor

* slight refactor

* Added embedding generator

* chunk_separator -> chunk_pattern

* update tests

* rename for consistency

* Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters

* Remove purpose from consolidated kwargs

* WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters

* Type as WithModelGeneration

* Add image generation modality

* update return type for generate_kwargs

* make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters

* remove regex based chunking from embedding generator

* Remove image generation for now

* more tests and updates

* column_type_is_llm_generated -> column_type_is_model_generated

* change set to list: fix flaky tests

* CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type

* Update docs

* fix deprecation warning originating from cli model settings

* update display of inference parameters in cli list

* save prog on inference parameter

* updates for the config builder

* update cli readme

* update cli for inference parameters

* update inference parameter names

* flip order of vars

* WithCompletion -> WithChatCompletion

* specify InferenceParamsT

* Update columns.md with EmbeddingColumnConfig info

* make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout

* DRY out some stuff in field.py

* docs for custom model settings

* Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency

* Add nvidia-embedding and openai-embedding to default model configs

* Fix typo in docs

* Make generate colab notebooks

* Address PR comments
2025-12-15 14:00:31 -07:00
Nabin Mulepati
8370e4a00b
feat: support native embedding generation (#106)
* Add generation type to ModelConfig

* pass tests

* added generate_text_embeddings

* tests

* remove sensitive=True old artifact no longer needed

* Slight refactor

* slight refactor

* Added embedding generator

* chunk_separator -> chunk_pattern

* update tests

* rename for consistency

* Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters

* Remove purpose from consolidated kwargs

* WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters

* Type as WithModelGeneration

* Add image generation modality

* update return type for generate_kwargs

* make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters

* remove regex based chunking from embedding generator

* Remove image generation for now

* more tests and updates

* column_type_is_llm_generated -> column_type_is_model_generated

* change set to list: fix flaky tests

* CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type

* Update docs

* fix deprecation warning originating from cli model settings

* update display of inference parameters in cli list

* save prog on inference parameter

* updates for the config builder

* update cli readme

* update cli for inference parameters

* update inference parameter names

* flip order of vars

* WithCompletion -> WithChatCompletion

* specify InferenceParamsT

* Update columns.md with EmbeddingColumnConfig info

* make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout

* DRY out some stuff in field.py

* Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency

* Add nvidia-embedding and openai-embedding to default model configs

* Fix typo in docs

* Make generate colab notebooks

* fine-tune -> adjust
2025-12-15 11:03:33 -07:00
Kirit Thadaka
8d7a073e3a
docs: Updated Person Sampling docs (#120)
* Updated Person Sampling docs

* Updated mv command

* Removed versions

* Updated mv command

---------

Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2025-12-12 10:43:57 -05:00
Johnny Greco
48fdc8c838
docs: add initial plugin documentation (#107)
* add docstrings

* add analysis modules

* include toc for plugins section

* add plugin docs

* remove scope creep

* Update docs/plugins/example.md

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>

* address feedback

---------

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>
2025-12-11 16:05:11 -05:00
Johnny Greco
57b5f6f798
set up initial recipe section (#114)
2025-12-10 14:51:07 -05:00
Nabin Mulepati
8e3080241b
docs: move models docs to concepts > models (#93)
2025-12-03 14:10:01 -07:00
Kirit Thadaka
4bee6d9088
docs: remove nemotron personas sampling from docs (for now) (#60)
* Update persona docs

* Updated person sampling docs based on feedback

* remove nemotron personas sampling

* Remove nemotron personas sampling

* Update docs/concepts/person_sampling.md

---------

Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2025-11-21 16:39:00 -05:00
Johnny Greco
ec98211862
chore: some readme and docs cleanup (#56)
* update classifiers

* remove commented section for now

* update readme badges and links

* rename persons section to person sampling
2025-11-20 15:33:55 -05:00
Johnny Greco
14dc495341
docs: some documentation cleanup (#52)
* some documentation cleanup

* typo
2025-11-19 17:40:14 -05:00
Johnny Greco
362ec51544
docs: sampler params code ref and more (#50)
* add sampler params code ref

* add persons section

* add person from faker sampler
2025-11-19 16:27:40 -05:00
Andre Manoel
01fbf4d848
docs: validators etc. (#45)
* got a little help from Claude, will still double check everything

* fixing, adding docstrings

* forgotten file + overview to tutorial

* minor

* applying suggestions

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>
Co-authored-by: Johnny Greco <jogreco@nvidia.com>

* addressing comments pt1

* addressing comments pt2

* trying something out

* fix

* typo

* trying again

* rollback workflow, add download links

* minor

* adapting notebooks to use fakersampler

---------

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>
Co-authored-by: Johnny Greco <jogreco@nvidia.com>
2025-11-19 17:39:10 -03:00
Johnny Greco
d4f32456a9
docs: welcome and concepts/columns (#43)
* add mike

* meth -> method; mod -> module in TOC

* messing with dark/light mode default

* staging stuff

* remove code examples from docstrings

* writing

* add columns with style
2025-11-17 17:07:01 -05:00