DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Nabin Mulepati	d8d1e668b0	docs: add image generation documentation and image-to-image editing tutorial (#319)	2026-02-12 14:38:52 -07:00
Nabin Mulepati	8e2fd3286f	feat: add image generation support with multi-modal context (#317)	2026-02-12 14:00:28 -07:00
Andre Manoel	429b558588	refactor: callback-based processor design (#294)	2026-02-11 21:32:24 -03:00
Eric W. Tramel	d9f6559cf9	docs: deep research trajectories with NDD and MCP tool use (#326)	2026-02-11 19:01:17 -05:00
Kirit Thadaka	565fe4ebb9	Updated url (#325)	2026-02-11 14:43:38 -08:00
Kirit Thadaka	b03201086b	docs: New post on SDG design principles (#318) * Added cat emoji sequence * Added post on SDG * Updated post * Added image * refined post * Added one line on personas	2026-02-11 08:27:13 -08:00
Johnny Greco	11143c788f	docs: restructure plugin docs with multi-file layout and seed reader type (#302) * docs: restructure plugin docs with multi-file layout and seed reader type - Update plugin overview to document both column generator and seed reader plugin types - Restructure example plugin to use separate config.py, impl.py, and plugin.py files instead of a single-file approach - Add sections for plugin validation and multiple plugins per package - Document required config class methods (get_column_emoji, required_columns, side_effect_columns) * docs: clarify benefits of multi-file plugin structure Expand explanation to mention circular dependency prevention as a key reason for separating config, impl, and plugin modules. * docs: fix import ordering in plugin example * import spacing * better example column name * add a bit to the comment * Updated plugin docs * update plugin overview call-to-action wording --------- Co-authored-by: Kirit93 <kthadaka@nvidia.com>	2026-02-09 16:03:56 -05:00
Kirit Thadaka	6dd7dca9ba	docs: updated usage chart (#304) * updated usage chart * Updated readme	2026-02-05 20:09:05 -08:00
Kirit Thadaka	4cfc1669bd	docs: Added documentation for seed datasets (#300) * Added images for deployment options * Add seed datasets documentation - New concepts page explaining seed datasets - Covers seed sources (LocalFile, HuggingFace, DataFrame) - Documents sampling and selection strategies - Includes complete example and best practices * Incorporated greptile feedback * Update docs/concepts/seed-datasets.md Co-authored-by: Johnny Greco <jogreco@nvidia.com> * Update docs/concepts/seed-datasets.md Co-authored-by: Johnny Greco <jogreco@nvidia.com> * Addressed feedback * Addressed comments --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-02-05 14:29:05 -08:00
Johnny Greco	f74f25872c	chore: quiet tool call logs and add tool usage statistics (#293 ) * add tool usage statistics tracking - Add ToolUsageStats class with metrics for tool calls, turns, and statistical aggregates (mean/stddev per generation) - Extend ModelUsageStats to include tool_usage tracking - Update ModelFacade.generate() to track total tool calls and turns - Update tests with tool_call_count method and new assertions * silence noisy mcp logs * log message updates * add tools enabled info message * exclude empty tool_usage from usage stats output * add tool usage summary logging after column generation - Track tool usage snapshots before/after column processing - Log mean tool calls per generation for columns with tools enabled - Add get_tool_usage_snapshot/get_tool_usage_delta methods to ModelRegistry - Remove unused extra_info parameter from progress_tracker.log_start() - Add comprehensive tests for ToolUsageStats * pretty format model usage logs * reuse stubs and fixtures * add merge method to ToolUsageStats for accurate stats aggregation The previous implementation used extend() to combine tool usage stats, but extend() is designed for single generation data. This caused incorrect stddev calculations when merging stats from multiple sources. - Add ToolUsageStats.merge() that properly combines sum-of-squares - Update ModelUsageStats.extend() to use merge() for tool usage - Add tests verifying stddev accuracy after merging * fix tool usage stats missing generations_with_tools count When tracking tool usage after generation, the ToolUsageStats was created without setting generations_with_tools, causing the usage summary to report zeros for calls/gen and turns/gen metrics. 
* fix tool usage delta objects returning incorrect stddev values - Simplify facade API to use tool_usage.extend() directly - Return NaN for stddev when sum of squares wasn't tracked - Add docstring to get_tool_usage_delta explaining NaN behavior - Add comprehensive tests for stddev variance calculation * fix tool usage delta stddev by including sum of squares in deltas Convert sum_of_squares_turns and sum_of_squares_calls from private attributes to public fields, enabling them to be included in delta calculations. This allows get_tool_usage_delta to return objects that compute accurate stddev values instead of NaN. * fix test to use get_tool_usage_snapshot for accurate stddev tracking The test was manually constructing a ToolUsageStats snapshot without sum_of_squares fields, causing stddev to be NaN. Now uses the proper snapshot method that includes all fields needed for delta calculations. * use nvidia-reasoning by default * mean -> average in log message * refactor log indentation to use centralized LOG_INDENT constant - Add LOG_INDENT constant to logging.py for consistent indentation - Replace hardcoded " \|-- " strings across all log statements - Add tool alias and MCP provider info to pre-generation logs - Improve model usage log format for better consistency - Update tests to match new log formats * simplify usage stats dict access in model registry Remove defensive .get() calls and unnecessary type casts since the usage statistics dictionary structure is now guaranteed. 
* walrus baby * simplify tool usage tracking and reduce log verbosity - Remove mean/stddev calculations from ToolUsageStats in favor of simple counts and generation ratios - Add total_generations field to track all tool-enabled generations - Simplify registry log output to show generations ratio (with_tools/total) - Remove per-column tool usage snapshot/delta logging from column builder - Track tool usage for all tool-enabled generations, not just those with calls * format inference parameters as multi-line log output - Add get_formatted_params() method to BaseInferenceParams - Add LOG_DOUBLE_INDENT constant for nested indentation - Update log_pre_generation() to display each parameter on its own line * update tests to use LOG_INDENT constants Align test assertions with the centralized log indentation constants introduced in the logging module refactor. * two-space consistency	2026-02-05 10:14:02 -05:00
Kirit Thadaka	624f87f6fe	docs: Add RQA dataset blog post and improve blog navigation (#296) * Add RQA dataset blog post and improve blog navigation - Add new blog post about RQA (Reasoning Question-Answer) dataset - Add excerpt separator for blog index blurbs - Configure left nav to show individual blog posts - Add navigation.indexes feature for better section handling - Update authors.yml with new contributors * Update avatar. * Update Eric avatar. * Fix formatting. * Fix formatting. --------- Co-authored-by: Dane Corneil <dane.corneil@gretel.ai> Co-authored-by: Eric W. Tramel <eric.tramel@gmail.com>	2026-02-04 14:28:12 -08:00
Kirit Thadaka	6dc35b2875	Added images for deployment options (#297)	2026-02-04 14:22:56 -08:00
Johnny Greco	4e89c2f9f3	standardize recipe script metadata (#292)	2026-02-04 10:43:27 -05:00
Andre Manoel	62bae42dc2	feat: Add CustomColumnGenerator for user-defined column generation (#254 ) * first attempt * iterating a bit * some improvements + multiturn example * adapting to new monorepo structure * refining * fixed test * fixing license headers * adding docs * adding test for failed generation * allowing strategy to be picked * renaming argument * lint * remove recommendation * renaming for consistency * addressing comments pt1 * addressing comments pt2 * addressing comments pt3 * adding a mock for development * addressing greptile comments * revamping * docs: streamline custom columns documentation * docs: simplify CustomColumnConfig docstring Remove verbose code example and detailed function signatures from docstring to match the pattern of other config classes in the file. * test: clean up custom column tests - Remove tests for private _custom_column_metadata attribute - Combine redundant generator creation tests - Reuse stub_resource_provider and stub_model_facade fixtures * test: consolidate custom column tests Reduce from 26 to 11 tests while maintaining coverage: - Combine redundant config/decorator/creation tests - Use parametrized tests for error conditions - Remove duplicate validation tests for full_column strategy - Simplify section headers * refactor: deduplicate CustomColumnGenerator logic Merge cell-by-cell and full-column code paths: - _generate_cell_by_cell + _generate_full_column -> _generate - _validate_output_columns + _validate_output_columns_df -> _validate_output * chore: merge example files into single notebook-style example.py Combine example.py, example_multiturn.py, and example_benchmark_strategies.py into a single file with #%% cell markers for Jupyter/VS Code notebook mode. 
* addressing greptile comments * refactor: reuse generate_text in generate_text_batch * refactor: replace CustomColumnContext with models dict - Remove CustomColumnContext class; users now receive models dict directly - Add DataDesigner.get_models() for experimentation outside pipeline - Make parser optional in ModelFacade.generate() (defaults to identity) - Validate parameter names: row/df, generator_params, models - Update examples, tests, and docs for new API * fix: address PR review comments from Nabin and greptile - Make decorator metadata public (custom_column_metadata) - Simplify get_generation_strategy() to directly return config value - Use !r formatting in error messages - Use lazy imports pattern for pandas (TYPE_CHECKING + lazy_heavy_imports) - Remove redundant error logging before re-raise - Validate max 3 positional parameters - Use GenerationStrategy enum in example instead of string * fix: replace lambda with module-level identity function in facade Use pickleable _identity function instead of lambda x: x for the default parser argument, ensuring compatibility with multiprocessing. * fix: restore inherited attributes in LLM column docstrings Restores the "Inherited Attributes" sections that were unintentionally removed from LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig docstrings. * docs: clarify model_aliases is required for LLM access Updated documentation and docstrings to clarify that model_aliases populates the models dict (not just health checks). 
* fix: address PR review comments from nabinchha - clarify model_aliases requirement in docs - add note about model alias validation during health check - combine two loops into one in _run_model_health_check_if_needed - add signature validation at decoration time - enforce decorated functions in CustomColumnConfig validator - simplify generator to only validate strategy-specific first param * fix: address remaining PR review comments - remove example.py (development artifact) - fix get_models return type to dict[str, ModelFacade] * test: update tests for decoration-time validation - expect ValidationError instead of InvalidConfigError for non-callable - split param validation test into decoration-time and runtime tests	2026-02-03 19:23:39 -03:00
Johnny Greco	87119a545b	refactor: move SingleColumnConfig to config.base module (#287) * create top-level base file * add note * update license header * move exportable config and move base to config module * update references in docs * do not include single column config in init * add inverse import order e2e test	2026-02-03 14:04:04 -05:00
Eric W. Tramel	5430bcbe99	Remove `debug_trace_override` (#290)	2026-02-03 12:09:30 -05:00
Eric W. Tramel	532d21a8d7	feat: add extract_reasoning_content option to LLM columns (#285)	2026-02-03 10:25:24 -05:00
Andre Manoel	b6d400ef7d	chore: update tutorial notebooks to use dd. notation consistently (#288) - Convert notebook 3 from string-based columns to class specs (dd.SamplerColumnConfig, etc.) - Fix grammar: "is the main object is responsible" → "is the main object responsible" - Remove stray "A" at end of URL in notebook 2 - Remove empty markdown cell in notebook 4 - Add missing data_designer.validate() call in notebook 4 - Regenerate colab notebooks from source	2026-02-03 12:03:32 -03:00
Kirit Thadaka	de7c3ab99a	docs: add deployment, performance tuning guides and streamline gettin… (#277 ) * docs: add deployment, performance tuning guides and streamline getting started - Add deployment-options.md: Library vs. Microservice decision guide - Add inference-architecture.md: Separation of concerns with LLM servers - Add performance-tuning.md: Concurrency and batching optimization guide - Streamline index.md: Merge installation, add quick example, simplify - Remove quick-start.md: Content merged into welcome page - Remove installation.md: Content merged into welcome page - Update model docs: Add concurrency control sections and cross-references - Update mkdocs.yml: Add new Architecture section to navigation * docs: add tasteful emojis to new documentation pages * docs: consolidate redundant concurrency and troubleshooting content - Remove duplicate max_parallel_requests tables from model-configs.md and inference-parameters.md - Remove duplicate Concurrency Control section from model-configs.md - Simplify Concurrency Control in inference-parameters.md to link to performance-tuning.md - Remove Troubleshooting section from inference-architecture.md (covered in performance-tuning.md) - performance-tuning.md is now the authoritative source for tuning guidance * Simplified doc additions * Switched default model to nemotron 3 nano * Addressed feedback * Added first blog draft	2026-02-02 21:03:58 -08:00
Eric W. Tramel	510761107b	feat: Add TraceType enum for granular trace control (#284)	2026-02-02 19:43:51 -05:00
Eric W. Tramel	7248b9fc8f	Update trace normalization to ChatML content blocks (#283)	2026-02-02 18:22:16 -05:00
Eric W. Tramel	e6e58e692e	feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248)	2026-02-02 09:41:58 -05:00
Kirit Thadaka	9e1c6ec679	feat: Add Phase 1 languages (Bash, C, C++, C#, COBOL) to CodeLang (#271) Add support for five high-priority programming languages to Data Designer's code generation capabilities: - Bash: Universal DevOps and automation scripting - C, C++, C#: Systems programming and enterprise development - COBOL: Legacy mainframe systems and modernization These languages address critical enterprise use cases including legacy code maintenance, systems programming, and infrastructure automation. Changes: - Add new CodeLang enum values for bash, c, cpp, csharp, cobol - Update code_lang_to_syntax_lexer() with Pygments lexer mappings - Update documentation to reflect new supported languages - Update tests to account for 21 total supported languages (up from 16) Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-01-30 19:52:28 -05:00
Johnny Greco	0d51539aa6	feat: add message trace support for LLM generation (#272) Add support for capturing full conversation traces during LLM generation, enabling debugging and fine-tuning dataset creation. Changes: - Add `with_trace` field to LLMTextColumnConfig for per-column trace control - Add `debug_override_save_all_column_traces` to RunConfig for global trace - Introduce ChatMessage dataclass for structured message representation - Update ModelFacade.generate() to return full message trace - Rename trace column postfix from `__reasoning_trace` to `__trace` - Add comprehensive traces documentation Traces capture system/user/assistant messages in order, enabling visibility into the full generation conversation including correction retries.	2026-01-30 17:03:07 -05:00
Nabin Mulepati	b238d06880	feat: allow skipping health checks (#244)	2026-01-28 10:15:00 -07:00
Johnny Greco	ae0665fa16	refactor: slim package refactor into three subpackages (#240) * remove old structure * major shuffle * streamline project configs * update make commands * updates to make commands * remove essentials * initialize logger in interface * uv lock * ignore notepad * update workflows * fix e2e project config * generate colab notebooks * resolve default model settings in interface * fix build commands * update perf import make command * cleaning up some slop * update recipes * move conftest files to tests/ * update subpackage readmes * streamline config_logging * use exports * update perf import usage pattern * update for IDE behavior with ruff * remove engine's fixtures file * add note about lazy imports * update dependencies * update docs * doc fixes * uv lock * updates to catch up with main * clean up makefile * remove package gitignores * define deps only once * isolate tests * add test for protection rule * create temp dirs for isolated tests * catch up to main * update headers * re apply changes * better result summaries for isolated tests * move exports into top-level init * fix client importlib version syntax * catch up with main	2026-01-27 13:53:20 -05:00
Johnny Greco	50fc50efc7	docs: Fix mkdocs syntax and update person sampling documentation (#249) * remove colon * update person sampling docs	2026-01-27 10:18:42 -05:00
Eric W. Tramel	613509f323	feat: Elevate non-LLM concurrency limits to `RunConfig` (#242)	2026-01-26 11:11:36 -05:00
Kirit Thadaka	0ab3613b83	docs: Updated recipe card (#153) * Updated recipe card * Apply suggestions from code review --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-01-22 11:44:01 -05:00
Johnny Greco	3d9f5185d7	refactor: remove task metadata property (#216) * remove metadata * docs and tests * don't need that test * use static method for generation strategy * update docs * add docstring	2026-01-15 14:12:11 -05:00
Kirit Thadaka	ab660d01d1	docs: Added top models pie chart (#217) * Added top models pie chart * Updated image and added description	2026-01-14 11:54:05 -08:00
Johnny Greco	d962c86843	fix: update example runner command with notebooks dep group (#204) * update dep groups; use in makefile * add quotes to packages in pip command	2026-01-13 11:49:31 -05:00
Johnny Greco	910d22dfa0	chore: add make commands to run examples as e2e tests (#199) * update makefile * fix bug	2026-01-12 15:37:00 -05:00
Johnny Greco	69cd989285	refactor: update required resources treatment and use subclasses over mixins (#184) * removing required resources * fix tests * add get required resources method to base column generator * move classification functions to engine; remove required resources * drop single from subclass names * update model config logging * fix unit test * typo * update type hint * move tests	2026-01-09 14:42:09 -05:00
Mike Knepper	7b5ea13f8b	Fix stray validate calls in notebooks (#192)	2026-01-08 15:46:20 -06:00
Mike Knepper	8e69ab0336	refactor: Plugins rename task to impl (#189)	2026-01-08 13:34:05 -06:00
Mike Knepper	6bf7698bc2	refactor: Overhaul to seed datasets (#167)	2026-01-08 11:48:14 -06:00
Nabin Mulepati	01f8d887f8	chore: deprecate InferenceParameters (#183) * deprecate InferenceParameters * update docs and references	2026-01-08 10:43:02 -07:00
Mike Knepper	1c0bf65cc0	docs: Add extra_headers to model provider docs (#178)	2026-01-07 08:27:36 -06:00
Nabin Mulepati	645c7995b7	Fix documentation on max_tokens (#176)	2026-01-06 16:31:05 -07:00
Nabin Mulepati	3b4e296baf	feat: add OpenRouter as one of the default providers (#161) * Add openrouter as a default provider * Update docs	2026-01-06 10:22:18 -07:00
Mike Knepper	36a174af04	refactor: plugin system updates (#168)	2026-01-06 10:29:47 -06:00
Johnny Greco	b71c6c11a8	docs: fix links and tweak person sampling (#152) * update person sampling * update docstring	2025-12-18 10:10:41 -08:00
Johnny Greco	b635e41033	update docs (#151)	2025-12-18 12:43:29 -05:00
Johnny Greco	0a60f869c1	docs: just some tutorial notebook tweaks and a docstring update (#150) * update docstring * notebook tweaks * generate colab notebooks	2025-12-18 12:01:50 -05:00
Johnny Greco	6e6efc009f	docs: some updates for nano3 (#149) * some fixes * generate colab notebooks	2025-12-17 18:24:39 -05:00
Andre Manoel	d50a8aef95	docs: add processors (#147) * first draft * adding to code reference as well * docstrings * addressing comments * forgot opening line * docstring too	2025-12-17 15:47:33 -03:00
Nabin Mulepati	8d4c6c12b4	chore: Update nvidia text default model alias to nano v3 (#133)	2025-12-15 15:03:12 -07:00
Nabin Mulepati	3065179f8a	docs: add documentation on how to configure custom model settings (#124) * Add generation type to ModelConfig * pass tests * added generate_text_embeddings * tests * remove sensitive=True old artifact no longer needed * Slight refactor * slight refactor * Added embedding generator * chunk_separator -> chunk_pattern * update tests * rename for consistency * Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters * Remove purpose from consolidated kwargs * WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters * Type as WithModelGeneration * Add image generation modality * update return type for generate_kwargs * make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters * remove regex based chunking from embedding generator * Remove image generation for now * more tests and updates * column_type_is_llm_generated -> column_type_is_model_generated * change set to list: fix flaky tests * CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type * Update docs * fix deprecation warning originating from cli model settings * update display of inference parameters in cli list * save prog on inference parameter * updates for the config builder * update cli readme * update cli for inference parameters * update inference parameter names * flip order of vars * WithCompletion -> WithChatCompletion * specify InferenceParamsT * Update columns.md with EmbeddingColumnConfig info * make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout * DRY out some stuff in field.py * docs for custom model settings * Update nomenclature. 
prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency * Add nvidia-embedding and openai-embedding to default model configs * Fix typo in docs * Make generate colab notebooks * Address PR comments	2025-12-15 14:00:31 -07:00
Nabin Mulepati	8370e4a00b	feat: support native embedding generation (#106) * Add generation type to ModelConfig * pass tests * added generate_text_embeddings * tests * remove sensitive=True old artifact no longer needed * Slight refactor * slight refactor * Added embedding generator * chunk_separator -> chunk_pattern * update tests * rename for consistency * Restructure InferenceParameters -> CompletionInferenceParameters, BaseInferenceParameters, EmbeddingInferenceParameters * Remove purpose from consolidated kwargs * WithModelConfiguration.inference_parameters should be typed with BaseInferenceParameters * Type as WithModelGeneration * Add image generation modality * update return type for generate_kwargs * make generation_type a field of ModelConfig as opposed to a prop resolved based on the type of InferenceParameters * remove regex based chunking from embedding generator * Remove image generation for now * more tests and updates * column_type_is_llm_generated -> column_type_is_model_generated * change set to list: fix flaky tests * CompletionInferenceParameters -> ChatCompletionInferenceParameters for consistency with generation_type * Update docs * fix deprecation warning originating from cli model settings * update display of inference parameters in cli list * save prog on inference parameter * updates for the config builder * update cli readme * update cli for inference parameters * update inference parameter names * flip order of vars * WithCompletion -> WithChatCompletion * specify InferenceParamsT * Update columns.md with EmbeddingColumnConfig info * make generation_type a discriminator field in inference params. add configuration support for max_parallel_requests and timeout * DRY out some stuff in field.py * Update nomenclature. 
prompt tokens -> input tokens, completion tokens -> output tokens in column statistics for consistency * Add nvidia-embedding and openai-embedding to default model configs * Fix typo in docs * Make generate colab notebooks * fine-tune -> adjust	2025-12-15 11:03:33 -07:00


82 commits