DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Nabin Mulepati	e4857f62fa	feat: add Streamable HTTP transport support for remote MCP providers (#358) * feat: add Streamable HTTP transport support for remote MCP providers (#357) Add `streamable_http` as a supported transport type for `MCPProvider`, enabling connections to MCP servers that use the Streamable HTTP protocol (e.g. Tavily remote endpoints). Previously only SSE transport was supported, causing silent 5-minute timeouts when connecting to incompatible endpoints. - Expand `MCPProvider.provider_type` to `Literal["sse", "streamable_http"]` (default remains `"sse"` for backwards compatibility) - Route `streamable_http` providers through `streamablehttp_client` from the MCP SDK in `MCPIOService._get_or_create_session()` - Handle variable-length context manager results from MCP transport clients - Add `DataDesigner.list_mcp_tool_names()` for discovering available tools - Update CLI form builder and controller to support the new transport option - Add tests for streamable_http config, session creation, and form builder Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * updates * simplify import * address greptile comments --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 08:11:54 -07:00
Andre Manoel	982ce79ca9	feat: add processor plugin support (#299) * feat: add processor plugin support Add PluginType.PROCESSOR to the plugin system, enabling third-party processor plugins via entry points. Includes a demo plugin package with RegexFilterProcessor (process_before_batch) and SemanticDedupProcessor (process_after_generation). - Add PluginType.PROCESSOR with processor_type discriminator - Create processor_types.py for ProcessorConfigT with plugin injection - Register plugin processors in engine ProcessorRegistry - Use RLock in PluginRegistry to prevent deadlocks during discovery - Add demo package: data-designer-demo-processors - Update processor and plugin documentation * test: add processor plugin registration test Verify that processor plugins from PluginRegistry are picked up by create_default_processor_registry and registered correctly. * test: simplify processor plugin registration test * move ProcessorConfig to base and convert demo to e2e test - Move ProcessorConfig from processors.py to config.base to guard against circular deps (alongside SingleColumnConfig) - Delete demo/ directory with regex_filter and semantic_dedup plugins - Add regex_filter as an e2e processor plugin test in tests_e2e/ * move plan to plans/299/	2026-02-25 16:40:01 -03:00
Andre Manoel	70dc48884e	feat: add allow_resize for 1:N and N:1 generation patterns (#286) * feat: add allow_resize for 1:N and N:1 generation patterns Adds support for generators that produce a different number of records than the input (expansion or retraction). This addresses GitHub issue #265. Changes: - Add `allow_resize` parameter to `update_records()` in DatasetBatchManager - Add `allow_resize` field to CustomColumnConfig - Add validation requiring FULL_COLUMN strategy when allow_resize=True - Track and report actual_num_records in metadata (may differ from target) - Add logging when batch size changes - Add example_allow_resize.py demonstrating the feature - Add comprehensive tests * docs: add allow_resize to custom columns documentation * refactor: consolidate buffer API and elevate allow_resize to base config - Merge update_records and replace_buffer into a single replace_buffer method with allow_resize parameter on DatasetBatchManager - Move allow_resize field from CustomColumnConfig to SingleColumnConfig so plugins inherit it without needing a mixin - Align example and logging with final CustomColumn API - Parametrize resize tests and extract shared stub in test_columns * test: add chained resize and multi-batch integration tests - Add expand->retract->expand chaining test (single batch) - Add multi-batch resize test verifying combined parquet output - Update example to chain expand/retract/expand with preview+build - Use 💥/✂️ emojis for resize logging (expand/retract) * extend allow_resize to cell-by-cell (return dict or list[dict]) - Config: allow allow_resize with CELL_BY_CELL; relax validator - Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize; validate per row via _validate_cell_output - Builder: collect results by index when cell allow_resize, flatten and replace_buffer; add _log_resize_if_changed and _column_display_name - Docs: ALL_CAPS for strategies, simplify allow_resize table text - Tests: parametrized preview and multibatch; factories with n param; _RESIZE_SPECS with inline factory calls; ids ordered like specs * reorder allow_resize specs and add edge-case tests - Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd, cell_drop_all to _RESIZE_SPECS - Stubs before specs: _resize_full_keep_first, _resize_cell_expand, _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories - Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS - Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3 multibatch cases (5_2, 4_2) first - Handle all-batches-skipped in multibatch test (empty df when path missing) - test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list) * tidy allow_resize: drop validator, shared stub, explicit flag - Remove validate_allow_resize_requires_full_column from CustomColumnConfig - Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns - Pass allow_resize=False in _write_processed_batch replace_buffer call * fix: add missing f prefix to error message in custom.py * docs(plugins): add section on setting allow_resize=True for resize plugins * fix: address PR review comments on allow_resize - Replace getattr with direct attribute access where config is always SingleColumnConfig (custom.py, cell-by-cell path in builder) - Keep getattr in _run_full_column_generator which also handles multi-column configs without allow_resize - Restructure allow_resize validation branching in CustomColumnGenerator - Fix error message wording: "key" -> "column" * fix: remove duplicate tool_alias log, fix test docstring - Remove tool_alias log from _setup_fan_out (callers already log it) - Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory * fix: avoid duplicate undeclared-column warning in _validate_output Inline the strip instead of delegating to _validate_cell_output, which would log the same warning a second time. * fix: use lazy.pd instead of pd for runtime pandas usage in tests The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.	2026-02-18 18:39:31 -03:00
Nabin Mulepati	d8d1e668b0	docs: add image generation documentation and image-to-image editing tutorial (#319)	2026-02-12 14:38:52 -07:00
Andre Manoel	429b558588	refactor: callback-based processor design (#294)	2026-02-11 21:32:24 -03:00
Kirit Thadaka	4cfc1669bd	docs: Added documentation for seed datasets (#300) * Added images for deployment options * Add seed datasets documentation - New concepts page explaining seed datasets - Covers seed sources (LocalFile, HuggingFace, DataFrame) - Documents sampling and selection strategies - Includes complete example and best practices * Incorporated greptile feedback * Update docs/concepts/seed-datasets.md Co-authored-by: Johnny Greco <jogreco@nvidia.com> * Update docs/concepts/seed-datasets.md Co-authored-by: Johnny Greco <jogreco@nvidia.com> * Addressed feedback * Addressed comments --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-02-05 14:29:05 -08:00
Kirit Thadaka	6dc35b2875	Added images for deployment options (#297)	2026-02-04 14:22:56 -08:00
Andre Manoel	62bae42dc2	feat: Add CustomColumnGenerator for user-defined column generation (#254) * first attempt * iterating a bit * some improvements + multiturn example * adapting to new monorepo structure * refining * fixed test * fixing license headers * adding docs * adding test for failed generation * allowing strategy to be picked * renaming argument * lint * remove recommendation * renaming for consistency * addressing comments pt1 * addressing comments pt2 * addressing comments pt3 * adding a mock for development * addressing greptile comments * revamping * docs: streamline custom columns documentation * docs: simplify CustomColumnConfig docstring Remove verbose code example and detailed function signatures from docstring to match the pattern of other config classes in the file. * test: clean up custom column tests - Remove tests for private _custom_column_metadata attribute - Combine redundant generator creation tests - Reuse stub_resource_provider and stub_model_facade fixtures * test: consolidate custom column tests Reduce from 26 to 11 tests while maintaining coverage: - Combine redundant config/decorator/creation tests - Use parametrized tests for error conditions - Remove duplicate validation tests for full_column strategy - Simplify section headers * refactor: deduplicate CustomColumnGenerator logic Merge cell-by-cell and full-column code paths: - _generate_cell_by_cell + _generate_full_column -> _generate - _validate_output_columns + _validate_output_columns_df -> _validate_output * chore: merge example files into single notebook-style example.py Combine example.py, example_multiturn.py, and example_benchmark_strategies.py into a single file with #%% cell markers for Jupyter/VS Code notebook mode. * addressing greptile comments * refactor: reuse generate_text in generate_text_batch * refactor: replace CustomColumnContext with models dict - Remove CustomColumnContext class; users now receive models dict directly - Add DataDesigner.get_models() for experimentation outside pipeline - Make parser optional in ModelFacade.generate() (defaults to identity) - Validate parameter names: row/df, generator_params, models - Update examples, tests, and docs for new API * fix: address PR review comments from Nabin and greptile - Make decorator metadata public (custom_column_metadata) - Simplify get_generation_strategy() to directly return config value - Use !r formatting in error messages - Use lazy imports pattern for pandas (TYPE_CHECKING + lazy_heavy_imports) - Remove redundant error logging before re-raise - Validate max 3 positional parameters - Use GenerationStrategy enum in example instead of string * fix: replace lambda with module-level identity function in facade Use pickleable _identity function instead of lambda x: x for the default parser argument, ensuring compatibility with multiprocessing. * fix: restore inherited attributes in LLM column docstrings Restores the "Inherited Attributes" sections that were unintentionally removed from LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig docstrings. * docs: clarify model_aliases is required for LLM access Updated documentation and docstrings to clarify that model_aliases populates the models dict (not just health checks). * fix: address PR review comments from nabinchha - clarify model_aliases requirement in docs - add note about model alias validation during health check - combine two loops into one in _run_model_health_check_if_needed - add signature validation at decoration time - enforce decorated functions in CustomColumnConfig validator - simplify generator to only validate strategy-specific first param * fix: address remaining PR review comments - remove example.py (development artifact) - fix get_models return type to dict[str, ModelFacade] * test: update tests for decoration-time validation - expect ValidationError instead of InvalidConfigError for non-callable - split param validation test into decoration-time and runtime tests	2026-02-03 19:23:39 -03:00
Eric W. Tramel	5430bcbe99	Remove `debug_trace_override` (#290)	2026-02-03 12:09:30 -05:00
Eric W. Tramel	532d21a8d7	feat: add extract_reasoning_content option to LLM columns (#285)	2026-02-03 10:25:24 -05:00
Kirit Thadaka	de7c3ab99a	docs: add deployment, performance tuning guides and streamline gettin… (#277) * docs: add deployment, performance tuning guides and streamline getting started - Add deployment-options.md: Library vs. Microservice decision guide - Add inference-architecture.md: Separation of concerns with LLM servers - Add performance-tuning.md: Concurrency and batching optimization guide - Streamline index.md: Merge installation, add quick example, simplify - Remove quick-start.md: Content merged into welcome page - Remove installation.md: Content merged into welcome page - Update model docs: Add concurrency control sections and cross-references - Update mkdocs.yml: Add new Architecture section to navigation * docs: add tasteful emojis to new documentation pages * docs: consolidate redundant concurrency and troubleshooting content - Remove duplicate max_parallel_requests tables from model-configs.md and inference-parameters.md - Remove duplicate Concurrency Control section from model-configs.md - Simplify Concurrency Control in inference-parameters.md to link to performance-tuning.md - Remove Troubleshooting section from inference-architecture.md (covered in performance-tuning.md) - performance-tuning.md is now the authoritative source for tuning guidance * Simplified doc additions * Switched default model to nemotron 3 nano * Addressed feedback * Added first blog draft	2026-02-02 21:03:58 -08:00
Eric W. Tramel	510761107b	feat: Add TraceType enum for granular trace control (#284)	2026-02-02 19:43:51 -05:00
Eric W. Tramel	7248b9fc8f	Update trace normalization to ChatML content blocks (#283)	2026-02-02 18:22:16 -05:00
Eric W. Tramel	e6e58e692e	feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248)	2026-02-02 09:41:58 -05:00
Kirit Thadaka	9e1c6ec679	feat: Add Phase 1 languages (Bash, C, C++, C#, COBOL) to CodeLang (#271) Add support for five high-priority programming languages to Data Designer's code generation capabilities: - Bash: Universal DevOps and automation scripting - C, C++, C#: Systems programming and enterprise development - COBOL: Legacy mainframe systems and modernization These languages address critical enterprise use cases including legacy code maintenance, systems programming, and infrastructure automation. Changes: - Add new CodeLang enum values for bash, c, cpp, csharp, cobol - Update code_lang_to_syntax_lexer() with Pygments lexer mappings - Update documentation to reflect new supported languages - Update tests to account for 21 total supported languages (up from 16) Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-01-30 19:52:28 -05:00
Johnny Greco	0d51539aa6	feat: add message trace support for LLM generation (#272) Add support for capturing full conversation traces during LLM generation, enabling debugging and fine-tuning dataset creation. Changes: - Add `with_trace` field to LLMTextColumnConfig for per-column trace control - Add `debug_override_save_all_column_traces` to RunConfig for global trace - Introduce ChatMessage dataclass for structured message representation - Update ModelFacade.generate() to return full message trace - Rename trace column postfix from `__reasoning_trace` to `__trace` - Add comprehensive traces documentation Traces capture system/user/assistant messages in order, enabling visibility into the full generation conversation including correction retries.	2026-01-30 17:03:07 -05:00
Nabin Mulepati	b238d06880	feat: allow skipping health checks (#244)	2026-01-28 10:15:00 -07:00
Johnny Greco	ae0665fa16	refactor: slim package refactor into three subpackages (#240) * remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protection rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main	2026-01-27 13:53:20 -05:00
Johnny Greco	50fc50efc7	docs: Fix mkdocs syntax and update person sampling documentation (#249) * remove colon * update person sampling docs	2026-01-27 10:18:42 -05:00
Nabin Mulepati	01f8d887f8	chore: deprecate InferenceParameters (#183) * deprecate InferenceParameters * update docs and references	2026-01-08 10:43:02 -07:00
Mike Knepper	1c0bf65cc0	docs: Add extra_headers to model provider docs (#178)	2026-01-07 08:27:36 -06:00
Nabin Mulepati	645c7995b7	Fix documentation on max_tokens (#176)	2026-01-06 16:31:05 -07:00
Nabin Mulepati	3b4e296baf	feat: add OpenRouter as one of the default providers (#161) * Add openrouter as a default provider * Update docs	2026-01-06 10:22:18 -07:00
Johnny Greco	b71c6c11a8	docs: fix links and tweak person sampling (#152) * update person sampling * update docstring	2025-12-18 10:10:41 -08:00
Johnny Greco	b635e41033	update docs (#151)	2025-12-18 12:43:29 -05:00
Andre Manoel	d50a8aef95	docs: add processors (#147) * first draft * adding to code reference as well * docstrings * addressing comments * forgot opening line * docstring too	2025-12-17 15:47:33 -03:00
Nabin Mulepati	8d4c6c12b4	chore: Update nvidia text default model alias to nano v3 (#133)	2025-12-15 15:03:12 -07:00
Nabin Mulepati	3065179f8a	docs: add documentation on how to configure custom model settings (#124) * Add generation type to ModelConfig * pass tests * added generate_text_embeddings * tests * remove sensitive=True old artifact no longer needed * Slight refactor * slight refactor * Added embedding generator * chunk_separator -> chunk_pattern * update tests * rename for consistency * Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters * Remove purpose from consolidated kwargs * WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters * Type as WithModelGeneration * Add image generation modality * update return type for generate_kwargs * make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters * remove regex based chunking from embedding generator * Remove image generation for now * more tests and updates * column_type_is_llm_generated -> column_type_is_model_generated * change set to list: fix flaky tests * CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type * Update docs * fix deprecation warning originating from cli model settings * update display of inference parameters in cli list * save prog on inference parameter * updates for the config builder * update cli readme * update cli for inference parameters * update inference parameter names * flip order of vars * WithCompletion -> WithChatCompletion * specify InferenceParamsT * Update columns.md with EmbeddingColumnConfig info * make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout * DRY out some stuff in field.py * docs for custom model settings * Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency * Add nvidia-embedding and openai-embedding to default model configs * Fix typo in docs * Make generate colab notebooks * Address PR comments	2025-12-15 14:00:31 -07:00
Nabin Mulepati	8370e4a00b	feat: support native embedding generation (#106) * Add generation type to ModelConfig * pass tests * added generate_text_embeddings * tests * remove sensitive=True old artifact no longer needed * Slight refactor * slight refactor * Added embedding generator * chunk_separator -> chunk_pattern * update tests * rename for consistency * Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters * Remove purpose from consolidated kwargs * WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters * Type as WithModelGeneration * Add image generation modality * update return type for generate_kwargs * make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters * remove regex based chunking from embedding generator * Remove image generation for now * more tests and updates * column_type_is_llm_generated -> column_type_is_model_generated * change set to list: fix flaky tests * CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type * Update docs * fix deprecation warning originating from cli model settings * update display of inference parameters in cli list * save prog on inference parameter * updates for the config builder * update cli readme * update cli for inference parameters * update inference parameter names * flip order of vars * WithCompletion -> WithChatCompletion * specify InferenceParamsT * Update columns.md with EmbeddingColumnConfig info * make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout * DRY out some stuff in field.py * Update nomenclature. prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency * Add nvidia-embedding and openai-embedding to default model configs * Fix typo in docs * Make generate colab notebooks * fine-tune -> adjust	2025-12-15 11:03:33 -07:00
Kirit Thadaka	8d7a073e3a	docs: Updated Person Sampling docs (#120) * Updated Person Sampling docs * Updated mv command * Removed versions * Updated mv command --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2025-12-12 10:43:57 -05:00
Johnny Greco	48fdc8c838	docs: add initial plugin documentation (#107) * add docstrings * add analysis modules * include toc for plugins section * add plugin docs * remove scope creep * Update docs/plugins/example.md Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com> * address feedback --------- Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>	2025-12-11 16:05:11 -05:00
Johnny Greco	57b5f6f798	set up initial recipe section (#114)	2025-12-10 14:51:07 -05:00
Nabin Mulepati	8e3080241b	docs: move models docs to concepts > models (#93)	2025-12-03 14:10:01 -07:00
Kirit Thadaka	4bee6d9088	docs: remove nemotron personas sampling from docs (for now) (#60) * Update persona docs * Updated person sampling docs based on feedback * remove nemotron personas sampling * Remove nemotron personas sampling * Update docs/concepts/person_sampling.md --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2025-11-21 16:39:00 -05:00
Johnny Greco	ec98211862	chore: some readme and docs cleanup (#56) * update classifiers * remove commented section for now * update readme badges and links * rename persons section to person sampling	2025-11-20 15:33:55 -05:00
Johnny Greco	14dc495341	docs: some documentation cleanup (#52) * some documentation cleanup * typo	2025-11-19 17:40:14 -05:00
Johnny Greco	362ec51544	docs: sampler params code ref and more (#50) * add sampler params code ref * add persons section * add person from faker sampler	2025-11-19 16:27:40 -05:00
Andre Manoel	01fbf4d848	docs: validators etc. (#45) * got a little help from Claude, will still double check everything * fixing, adding docstrings * forgotten file + overview to tutorial * minor * applying suggestions Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com> Co-authored-by: Johnny Greco <jogreco@nvidia.com> * addressing comments pt1 * addressing comments pt2 * trying something out * fix * typo * trying again * rollback workflow, add download links * minor * adapting notebooks to use fakersampler --------- Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com> Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2025-11-19 17:39:10 -03:00
Johnny Greco	d4f32456a9	docs: welcome and concepts/columns (#43) * add mike * meth -> method; mod -> module in TOC * messing with dark/light mode default * staging stuff * remove code examples from docstrings * writing * add columns with style	2025-11-17 17:07:01 -05:00

39 commits