DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Nabin Mulepati	d8d1e668b0	docs: add image generation documentation and image-to-image editing tutorial (#319 )	2026-02-12 14:38:52 -07:00
Nabin Mulepati	8e2fd3286f	feat: add image generation support with multi-modal context (#317 )	2026-02-12 14:00:28 -07:00
Andre Manoel	429b558588	refactor: callback-based processor design (#294 )	2026-02-11 21:32:24 -03:00
Eric W. Tramel	d9f6559cf9	docs: deep research trajectories with NDD and MCP tool use (#326 )	2026-02-11 19:01:17 -05:00
Kirit Thadaka	565fe4ebb9	Updated url (#325 )	2026-02-11 14:43:38 -08:00
Johnny Greco	631f1f970e	fix: trim LLM response content before parsing (#322 ) Strip leading and trailing whitespace from final LLM response content so parsers receive normalized text. Add a parametrized unit test that covers multiple whitespace patterns.	2026-02-11 15:45:15 -05:00
Andre Manoel	f8b7c905e8	fix: include CUSTOM type in execution DAG and warn on generator errors (#324 ) * fix: include CUSTOM type in execution DAG classification Custom columns have required_columns and side_effect_columns but were excluded from the DAG, causing incorrect execution order when they depend on or are depended upon by other columns. Co-authored-by: Lipika Ramaswamy <lramaswamy@nvidia.com> * add warning when custom generator function fails Log a warning in cell-by-cell mode so users know the record will be skipped. In full-column mode the error message is already descriptive enough via the DatasetGenerationError chain. Co-authored-by: Lipika Ramaswamy <lramaswamy@nvidia.com> --------- Co-authored-by: Lipika Ramaswamy <lramaswamy@nvidia.com>	2026-02-11 17:21:33 -03:00
Johnny Greco	1514720596	feat: support loading config files from HTTP(S) URLs (#323 ) * support loading config files from http urls - allow config builder and CLI loader to load YAML/JSON configs from HTTP(S) URLs - reject unsupported URL extensions and remote Python module URLs - update CLI help text and add tests for URL success/failure paths * harden remote config loading and deduplicate URL validation - Add size limit (10 MB) when fetching configs from URLs - Validate parsed YAML is a dict before returning - Make is_http_url public and reuse it in CLI validate_url - Replace local CONFIG_FILE_EXTENSIONS with shared constant - Add tests for is_http_url, URL-with-no-extension edge cases * use requests for remote config loading - replace urllib URL fetching with requests and status checks - parse remote payloads via smart_load_yaml for consistent validation - expand tests for HTTP errors, size limits, and non-dict payloads * lower remote config size limit to 1 MB * improve config URL HTTP error reporting Add granular 401/403/404 and generic HTTP status errors for remote config fetching to make failures actionable. Clarify that authenticated config URL loading is not currently supported and update tests for status-aware behavior. * rewrite github blob URLs for remote loading Handle GitHub blob links by rewriting them to raw content URLs for config and dataframe HTTP loaders, preserving query params but avoiding query token leaks in logs. This also fixes extension detection for URLs with query strings and adds coverage for rewrite behavior. * remove validate_url wrapper in favor of is_http_url The validate_url function in cli/utils was just a thin wrapper around is_http_url from io_helpers. Remove it and have callers use is_http_url directly for clarity and reduced indirection. 
* fix optional type for artifact_path CLI option * fix URL recursion in smart_load_yaml - avoid treating remote payload strings as new URL inputs - add regression test for URL string payloads from remote config * rewrite huggingface blob URLs for remote loading	2026-02-11 15:12:52 -05:00
Johnny Greco	d3c4de76da	feat: add preview, create, and validate CLI commands (#313 ) * feat: add preview, create, and validate CLI commands Add three new top-level CLI commands for the data-designer workflow: - `data-designer preview` - generate preview datasets for fast iteration - `data-designer create` - create full datasets and save to disk - `data-designer validate` - validate configuration files Also includes: - Move wait_for_navigation_key() UI primitive from preview.py to ui.py - Add KeyPressEvent type annotations to all key binding handlers in ui.py - Refactor cli/utils.py into cli/utils/ package with config_loader module - Comprehensive test coverage for all new commands * fix: update pythonjsonlogger import and clean up dev dependencies - Update pythonjsonlogger import to use newer JsonFormatter API - Consolidate dev-dependencies into [dependency-groups] dev section - Remove unnecessary test cli/utils __init__.py * small E * address greptile feedback * organize CLI commands into rich help panels Group top-level commands under "Generation" and "Setup" panels for clearer help output. * refactor config loader to parse files directly and auto-detect config format - Parse YAML/JSON files into dicts before passing to from_config, providing format-specific error messages for parse failures - Auto-detect DataDesignerConfig format (columns at top level) and wrap it into BuilderConfig so users can provide either format - Clean up Python module loading with try/except/finally for reliable sys.modules and sys.path cleanup - Add comprehensive tests for parsing, validation, and auto-wrapping * fix sys.path cleanup in config loader and simplify tests - Use pop(0) instead of remove() to precisely undo the insert(0, ...) 
and avoid accidentally removing a different matching path entry - Replace MagicMock with real DataDesignerConfigBuilder in tests * move config format auto-detection into from_config Centralize the shorthand DataDesignerConfig detection (columns at top level without a data_designer wrapper) in DataDesignerConfigBuilder.from_config so all callers benefit, not just the CLI config loader. Simplify config_loader to delegate file parsing and format normalization entirely to from_config. * extract GenerationController from CLI commands Move shared generation logic (preview, validate, create) out of the individual Typer command functions into a dedicated GenerationController, matching the existing controller pattern (DownloadController, etc.). The command functions now delegate to the controller, keeping them as thin entry points. Tests updated accordingly — command tests verify delegation while controller tests cover the full behavior. * harden sys.path cleanup and add explanatory comments Use sys.path.remove() instead of checking sys.path[0] so cleanup succeeds even when exec_module inserts entries at index 0. Drop unnecessary spec=DataDesignerConfigBuilder from test mocks. * check stdout TTY in preview interactive mode detection Previously only stdin was checked, so piping stdout (e.g. `dd preview cfg.yaml \| head`) would still attempt interactive browsing. Now both stdin and stdout must be a TTY.	2026-02-11 14:06:06 -05:00
Kirit Thadaka	b03201086b	docs: New post on SDG design principles (#318 ) * Added cat emoji sequence * Added post on SDG * Updated post * Added image * refined post * Added one line on personas	2026-02-11 08:27:13 -08:00
Andre Manoel	f012703a96	fix: use reasoning_effort for OpenAI gpt-5 models (#315 ) gpt-5 is a reasoning model that doesn't support temperature/top_p. Use reasoning_effort via extra_body instead. Fixes #314	2026-02-10 12:01:52 -03:00
Kirit Thadaka	e4ff980adb	docs: Added cat emoji sequence (#316 ) * Added cat emoji sequence * a couple more progress emojis --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-02-09 18:11:31 -05:00
Johnny Greco	11143c788f	docs: restructure plugin docs with multi-file layout and seed reader type (#302 ) * docs: restructure plugin docs with multi-file layout and seed reader type - Update plugin overview to document both column generator and seed reader plugin types - Restructure example plugin to use separate config.py, impl.py, and plugin.py files instead of a single-file approach - Add sections for plugin validation and multiple plugins per package - Document required config class methods (get_column_emoji, required_columns, side_effect_columns) * docs: clarify benefits of multi-file plugin structure Expand explanation to mention circular dependency prevention as a key reason for separating config, impl, and plugin modules. * docs: fix import ordering in plugin example * import spacing * better example column name * add a bit to the comment * Updated plugin docs * update plugin overview call-to-action wording --------- Co-authored-by: Kirit93 <kthadaka@nvidia.com>	2026-02-09 16:03:56 -05:00
Johnny Greco	c31888b051	chore: export ConstraintType and InequalityOperator from config init (#308 ) * update init constraint imports * add missing columns	2026-02-09 12:35:34 -05:00
Johnny Greco	5b84a6261e	fix: allow BuilderConfig round-trip serialization for library_version (#311 ) BuilderConfig.library_version was a @computed_field that got serialized to YAML/JSON but rejected on deserialization due to extra="forbid". Changed to a regular field with a model_validator that auto-sets the version and warns when loading configs from a different version.	2026-02-09 11:10:10 -05:00
Johnny Greco	2e413d31ce	bump pytest, nbconvert, and pyjwt for vulnerability fixes (#312 ) - pytest: 8.x -> 9.0.2 (with pytest-asyncio 1.3.0, pytest-httpx 0.36.0) - nbconvert: 7.16.6 -> 7.17.0 - pyjwt: 2.10.1 -> 2.11.0	2026-02-09 10:02:36 -05:00
Andre Manoel	58734d09f0	test: add provider health checks script and CI workflow (#301 ) * test: add e2e health checks for default provider models Add parametrized tests that verify model connectivity for all default providers (nvidia, openai, openrouter). Tests check API key availability and skip when not configured. * chore: move health checks out of e2e tests - Convert pytest test to standalone script at scripts/health_checks.py - Add `make health-checks` target - Add CI workflow (weekly + on release + manual dispatch) - Remove test_health_checks.py from tests_e2e/ * chore: make health checks non-blocking in CI * fix: print traceback to stdout to avoid interleaving * chore: add all provider API keys to health checks CI Co-authored-by: Cursor <cursoragent@cursor.com> * chore: remove temporary push trigger from health checks Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-06 15:18:35 -03:00
Johnny Greco	79392b83ef	add badge (#306 )	2026-02-06 12:45:25 -05:00
Johnny Greco	f69ec87914	chore: update HF card citation copy and add library version to builder config (#303 ) * update citation copy * add version to builder config * centralize library version lookup into get_library_version Extract version retrieval into a dedicated version.py module with graceful error handling (returns "unknown" if package not found). Replace direct importlib.metadata.version() calls in config_builder and column_wise_builder with the new helper. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-06 12:42:17 -05:00
Kirit Thadaka	1f3b00a0f6	docs: Update README.md (#305 )	2026-02-05 20:41:00 -08:00
Kirit Thadaka	6dd7dca9ba	docs: updated usage chart (#304 ) * updated usage chart * Updated readme	2026-02-05 20:09:05 -08:00
Kirit Thadaka	4cfc1669bd	docs: Added documentation for seed datasets (#300 ) * Added images for deployment options * Add seed datasets documentation - New concepts page explaining seed datasets - Covers seed sources (LocalFile, HuggingFace, DataFrame) - Documents sampling and selection strategies - Includes complete example and best practices * Incorporated greptile feedback * Update docs/concepts/seed-datasets.md Co-authored-by: Johnny Greco <jogreco@nvidia.com> * Update docs/concepts/seed-datasets.md Co-authored-by: Johnny Greco <jogreco@nvidia.com> * Addressed feedback * Addressed comments --------- Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-02-05 14:29:05 -08:00
Johnny Greco	f74f25872c	chore: quiet tool call logs and add tool usage statistics (#293 ) * add tool usage statistics tracking - Add ToolUsageStats class with metrics for tool calls, turns, and statistical aggregates (mean/stddev per generation) - Extend ModelUsageStats to include tool_usage tracking - Update ModelFacade.generate() to track total tool calls and turns - Update tests with tool_call_count method and new assertions * silence noisy mcp logs * log message updates * add tools enabled info message * exclude empty tool_usage from usage stats output * add tool usage summary logging after column generation - Track tool usage snapshots before/after column processing - Log mean tool calls per generation for columns with tools enabled - Add get_tool_usage_snapshot/get_tool_usage_delta methods to ModelRegistry - Remove unused extra_info parameter from progress_tracker.log_start() - Add comprehensive tests for ToolUsageStats * pretty format model usage logs * reuse stubs and fixtures * add merge method to ToolUsageStats for accurate stats aggregation The previous implementation used extend() to combine tool usage stats, but extend() is designed for single generation data. This caused incorrect stddev calculations when merging stats from multiple sources. - Add ToolUsageStats.merge() that properly combines sum-of-squares - Update ModelUsageStats.extend() to use merge() for tool usage - Add tests verifying stddev accuracy after merging * fix tool usage stats missing generations_with_tools count When tracking tool usage after generation, the ToolUsageStats was created without setting generations_with_tools, causing the usage summary to report zeros for calls/gen and turns/gen metrics. 
* fix tool usage delta objects returning incorrect stddev values - Simplify facade API to use tool_usage.extend() directly - Return NaN for stddev when sum of squares wasn't tracked - Add docstring to get_tool_usage_delta explaining NaN behavior - Add comprehensive tests for stddev variance calculation * fix tool usage delta stddev by including sum of squares in deltas Convert sum_of_squares_turns and sum_of_squares_calls from private attributes to public fields, enabling them to be included in delta calculations. This allows get_tool_usage_delta to return objects that compute accurate stddev values instead of NaN. * fix test to use get_tool_usage_snapshot for accurate stddev tracking The test was manually constructing a ToolUsageStats snapshot without sum_of_squares fields, causing stddev to be NaN. Now uses the proper snapshot method that includes all fields needed for delta calculations. * use nvidia-reasoning by default * mean -> average in log message * refactor log indentation to use centralized LOG_INDENT constant - Add LOG_INDENT constant to logging.py for consistent indentation - Replace hardcoded " \|-- " strings across all log statements - Add tool alias and MCP provider info to pre-generation logs - Improve model usage log format for better consistency - Update tests to match new log formats * simplify usage stats dict access in model registry Remove defensive .get() calls and unnecessary type casts since the usage statistics dictionary structure is now guaranteed. 
* walrus baby * simplify tool usage tracking and reduce log verbosity - Remove mean/stddev calculations from ToolUsageStats in favor of simple counts and generation ratios - Add total_generations field to track all tool-enabled generations - Simplify registry log output to show generations ratio (with_tools/total) - Remove per-column tool usage snapshot/delta logging from column builder - Track tool usage for all tool-enabled generations, not just those with calls * format inference parameters as multi-line log output - Add get_formatted_params() method to BaseInferenceParams - Add LOG_DOUBLE_INDENT constant for nested indentation - Update log_pre_generation() to display each parameter on its own line * update tests to use LOG_INDENT constants Align test assertions with the centralized log indentation constants introduced in the logging module refactor. * two-space consistency	2026-02-05 10:14:02 -05:00
Kirit Thadaka	624f87f6fe	docs: Add RQA dataset blog post and improve blog navigation (#296 ) * Add RQA dataset blog post and improve blog navigation - Add new blog post about RQA (Reasoning Question-Answer) dataset - Add excerpt separator for blog index blurbs - Configure left nav to show individual blog posts - Add navigation.indexes feature for better section handling - Update authors.yml with new contributors * Update avatar. * Update Eric avatar. * Fix formatting. * Fix formatting. --------- Co-authored-by: Dane Corneil <dane.corneil@gretel.ai> Co-authored-by: Eric W. Tramel <eric.tramel@gmail.com>	2026-02-04 14:28:12 -08:00
Kirit Thadaka	6dc35b2875	Added images for deployment options (#297 )	2026-02-04 14:22:56 -08:00
Nabin Mulepati	236f62b3d1	feat: add HuggingFace Hub integration for dataset publishing (#275 ) * feat: add push_to_hub integration for HuggingFace datasets Implement HuggingFace Hub integration to upload DataDesigner datasets: - Add HuggingFaceHubClient with upload_dataset method - Upload main parquet files to data/ subset - Upload processor outputs to data/{processor_name}/ subsets - Generate dataset card from metadata.json with column statistics - Include sdg.json and metadata.json configuration files - Comprehensive validation and error handling - Add push_to_hub() method to DatasetCreationResults * feat: improve push_to_hub with logging, path mapping, and config definitions - Add progress logging with emojis following codebase style - Add repository exists check before creation - Update metadata.json paths for HuggingFace structure (parquet-files/ → data/, processors-files/{name}/ → {name}/) - Enhance dataset card with detailed intro, tabular schema/statistics, and clickable config links - Add explicit configs in YAML frontmatter to fix schema mismatch between main dataset and processor outputs - Set data config as default configuration * feat: add optional description parameter to push_to_hub - Add description parameter to push_to_hub() for custom dataset card content - Description appears after NeMo Data Designer intro section - Update dataset card template to conditionally render custom description - Add tests for with/without custom description scenarios * feat: make description required and enhance dataset card design - Make description parameter required in push_to_hub() - Improve dataset card layout with flexbox header (title + right-aligned tagline) - Add horizontal dividers between sections for visual separation - Add emoji icons to section headers for better readability - Move About NeMo Data Designer section after Citation - Update section order: Description → Quick Start → Dataset Summary → Schema & Statistics → Generation Details → Citation → About 
- Update all tests to provide required description parameter * fix license headers * remove modality deteciton * break up upload_dataset * make token private * HuggingFace -> Hugging Face * remove inline imports * simplify tests + remvoe create pr option for simplicity * Update packages/data-designer/src/data_designer/integrations/huggingface/dataset_card.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * use consistent indentaion * fix temp file clean up * huggingface hub already a dep in engine * add missing spaces * reuse vars from artifact_storage.py * pull put hf hub datasets url to constants * HuggingfaceUploadError -> HuggingFaceHubClientUploadError * defer to hfhub repo validation * Update packages/data-designer/src/data_designer/integrations/huggingface/client.py Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com> * Update packages/data-designer/src/data_designer/interface/results.py Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com> * Update packages/data-designer/src/data_designer/integrations/huggingface/client.py Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com> * allow custom tags * change sdg.json -> builder_config.json --------- Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>	2026-02-04 11:40:53 -07:00
Daksh Gupta	13c4aded14	chore: enable status check in greptile.json (#295 )	2026-02-04 13:57:09 -03:00
Johnny Greco	4e89c2f9f3	standardize recipe script metadata (#292 )	2026-02-04 10:43:27 -05:00
Andre Manoel	62bae42dc2	feat: Add CustomColumnGenerator for user-defined column generation (#254 ) * first attempt * iterating a bit * some improvements + multiturn example * adapting to new monorepo structure * refining * fixed test * fixing license headers * adding docs * adding test for failed generation * allowing strategy to be picked * renaming argument * lint * remove recommendation * renaming for consistency * addressing comments pt1 * addressing comments pt2 * addressing comments pt3 * adding a mock for development * addressing greptile comments * revamping * docs: streamline custom columns documentation * docs: simplify CustomColumnConfig docstring Remove verbose code example and detailed function signatures from docstring to match the pattern of other config classes in the file. * test: clean up custom column tests - Remove tests for private _custom_column_metadata attribute - Combine redundant generator creation tests - Reuse stub_resource_provider and stub_model_facade fixtures * test: consolidate custom column tests Reduce from 26 to 11 tests while maintaining coverage: - Combine redundant config/decorator/creation tests - Use parametrized tests for error conditions - Remove duplicate validation tests for full_column strategy - Simplify section headers * refactor: deduplicate CustomColumnGenerator logic Merge cell-by-cell and full-column code paths: - _generate_cell_by_cell + _generate_full_column -> _generate - _validate_output_columns + _validate_output_columns_df -> _validate_output * chore: merge example files into single notebook-style example.py Combine example.py, example_multiturn.py, and example_benchmark_strategies.py into a single file with #%% cell markers for Jupyter/VS Code notebook mode. 
* addressing greptile comments * refactor: reuse generate_text in generate_text_batch * refactor: replace CustomColumnContext with models dict - Remove CustomColumnContext class; users now receive models dict directly - Add DataDesigner.get_models() for experimentation outside pipeline - Make parser optional in ModelFacade.generate() (defaults to identity) - Validate parameter names: row/df, generator_params, models - Update examples, tests, and docs for new API * fix: address PR review comments from Nabin and greptile - Make decorator metadata public (custom_column_metadata) - Simplify get_generation_strategy() to directly return config value - Use !r formatting in error messages - Use lazy imports pattern for pandas (TYPE_CHECKING + lazy_heavy_imports) - Remove redundant error logging before re-raise - Validate max 3 positional parameters - Use GenerationStrategy enum in example instead of string * fix: replace lambda with module-level identity function in facade Use pickleable _identity function instead of lambda x: x for the default parser argument, ensuring compatibility with multiprocessing. * fix: restore inherited attributes in LLM column docstrings Restores the "Inherited Attributes" sections that were unintentionally removed from LLMCodeColumnConfig, LLMStructuredColumnConfig, and LLMJudgeColumnConfig docstrings. * docs: clarify model_aliases is required for LLM access Updated documentation and docstrings to clarify that model_aliases populates the models dict (not just health checks). 
* fix: address PR review comments from nabinchha - clarify model_aliases requirement in docs - add note about model alias validation during health check - combine two loops into one in _run_model_health_check_if_needed - add signature validation at decoration time - enforce decorated functions in CustomColumnConfig validator - simplify generator to only validate strategy-specific first param * fix: address remaining PR review comments - remove example.py (development artifact) - fix get_models return type to dict[str, ModelFacade] * test: update tests for decoration-time validation - expect ValidationError instead of InvalidConfigError for non-callable - split param validation test into decoration-time and runtime tests	2026-02-03 19:23:39 -03:00
Johnny Greco	87119a545b	refactor: move SingleColumnConfig to config.base module (#287 ) * create top-level base file * add note * update license header * move exportable config and move base to config module * update references in docs * do not include single column config in init * add inverse import order e2e test	2026-02-03 14:04:04 -05:00
Johnny Greco	09c09dc0dc	perf: implement lazy loading for config module exports (#291 ) * perf: implement lazy loading for config module exports - Replace eager imports with __getattr__-based lazy loading - Add TYPE_CHECKING block for IDE autocomplete support - Defer submodule loading until attributes are accessed * add single column config back * refactor: use module path constants in lazy imports dictionary Extract repeated module paths into constants to reduce string duplication and improve maintainability. Also removes unused get_config_exports() function.	2026-02-03 13:09:29 -05:00
Eric W. Tramel	5430bcbe99	Remove `debug_trace_override` (#290 )	2026-02-03 12:09:30 -05:00
Johnny Greco	a578265151	feat: add dynamic version pinning for inter-package dependencies (#282 ) Switch from hatch-vcs to uv-dynamic-versioning to enable Jinja2 templating in dependencies. This ensures all three subpackages (data-designer, data-designer-config, data-designer-engine) are locked to the same version when published to PyPI. - Use `{{ version }}` template for sibling package dependencies - Update VERSIONING.md to document importlib.metadata approach - Remove unused _version.py file generation	2026-02-03 11:14:55 -05:00
Andre Manoel	44624accbf	chore: add greptile.json to reduce review verbosity (#289 ) * chore: add greptile.json to reduce review verbosity Configure Greptile to review only on PR open (not every commit), update existing comments instead of creating new ones, and collapse summary sections by default. * chore: disable status comments to reduce noise * chore: enable triggerOnUpdates for continuous review	2026-02-03 12:50:07 -03:00
Eric W. Tramel	532d21a8d7	feat: add extract_reasoning_content option to LLM columns (#285 )	2026-02-03 10:25:24 -05:00
Andre Manoel	b6d400ef7d	chore: update tutorial notebooks to use dd. notation consistently (#288 ) - Convert notebook 3 from string-based columns to class specs (dd.SamplerColumnConfig, etc.) - Fix grammar: "is the main object is responsible" → "is the main object responsible" - Remove stray "A" at end of URL in notebook 2 - Remove empty markdown cell in notebook 4 - Add missing data_designer.validate() call in notebook 4 - Regenerate colab notebooks from source	2026-02-03 12:03:32 -03:00
Kirit Thadaka	de7c3ab99a	docs: add deployment, performance tuning guides and streamline gettin… (#277 ) * docs: add deployment, performance tuning guides and streamline getting started - Add deployment-options.md: Library vs. Microservice decision guide - Add inference-architecture.md: Separation of concerns with LLM servers - Add performance-tuning.md: Concurrency and batching optimization guide - Streamline index.md: Merge installation, add quick example, simplify - Remove quick-start.md: Content merged into welcome page - Remove installation.md: Content merged into welcome page - Update model docs: Add concurrency control sections and cross-references - Update mkdocs.yml: Add new Architecture section to navigation * docs: add tasteful emojis to new documentation pages * docs: consolidate redundant concurrency and troubleshooting content - Remove duplicate max_parallel_requests tables from model-configs.md and inference-parameters.md - Remove duplicate Concurrency Control section from model-configs.md - Simplify Concurrency Control in inference-parameters.md to link to performance-tuning.md - Remove Troubleshooting section from inference-architecture.md (covered in performance-tuning.md) - performance-tuning.md is now the authoritative source for tuning guidance * Simplified doc additions * Switched default model to nemotron 3 nano * Addressed feedback * Added first blog draft	2026-02-02 21:03:58 -08:00
Eric W. Tramel	510761107b	feat: Add TraceType enum for granular trace control (#284 )	2026-02-02 19:43:51 -05:00
Eric W. Tramel	7248b9fc8f	Update trace normalization to ChatML content blocks (#283 )	2026-02-02 18:22:16 -05:00
Johnny Greco	932e1a1ac2	chore: configure independent pytest settings per subpackage (#278 ) * chore: configure independent pytest settings per subpackage - Add [tool.pytest.ini_options] to each package's pyproject.toml - Update conftest.py in each package to declare pytest_plugins directly - Remove root-level conftest.py (no longer needed) - Remove testpaths from root pyproject.toml This enables running tests independently per package without relying on root-level configuration. * update out of date comments	2026-02-02 13:37:15 -05:00
Johnny Greco	3045208599	fix: normalize license header year format in mcp module (#279 ) * fix: normalize license header year format in mcp module * existing header dates are authoritative	2026-02-02 10:56:35 -05:00
Eric W. Tramel	e6e58e692e	feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248 )	2026-02-02 09:41:58 -05:00
Johnny Greco	754ff71092	fix: ensure 100% progress is logged exactly once (#276 ) The progress tracker was logging 100% multiple times - once from _record_completion() and again from log_final(). Now _record_completion() skips logging at 100%, leaving that responsibility to log_final(). Also refactors tests from class-based to flat functions and adds explicit tests for the 100% logging behavior.	2026-01-30 21:03:18 -05:00
Kirit Thadaka	9e1c6ec679	feat: Add Phase 1 languages (Bash, C, C++, C#, COBOL) to CodeLang (#271 ) Add support for five high-priority programming languages to Data Designer's code generation capabilities: - Bash: Universal DevOps and automation scripting - C, C++, C#: Systems programming and enterprise development - COBOL: Legacy mainframe systems and modernization These languages address critical enterprise use cases including legacy code maintenance, systems programming, and infrastructure automation. Changes: - Add new CodeLang enum values for bash, c, cpp, csharp, cobol - Update code_lang_to_syntax_lexer() with Pygments lexer mappings - Update documentation to reflect new supported languages - Update tests to account for 21 total supported languages (up from 16) Co-authored-by: Johnny Greco <jogreco@nvidia.com>	2026-01-30 19:52:28 -05:00
Johnny Greco	fe5a1ec6af	chore: add animated emoji progress indicators to progress tracker (#273 ) * chore: add animated emoji progress indicators to progress tracker Add fun visual feedback during dataset generation with emoji that evolve based on completion percentage. Randomly selects from moon phases, weather, or hatching styles at tracker initialization. * always log 100% * refactor: move progress emoji logic into RandomEmoji class Add progress() method to RandomEmoji that returns an emoji based on completion percentage. This centralizes the progress style logic that was previously duplicated in ProgressTracker. * add some tests	2026-01-30 19:42:10 -05:00
Johnny Greco	0d51539aa6	feat: add message trace support for LLM generation (#272 ) Add support for capturing full conversation traces during LLM generation, enabling debugging and fine-tuning dataset creation. Changes: - Add `with_trace` field to LLMTextColumnConfig for per-column trace control - Add `debug_override_save_all_column_traces` to RunConfig for global trace - Introduce ChatMessage dataclass for structured message representation - Update ModelFacade.generate() to return full message trace - Rename trace column postfix from `__reasoning_trace` to `__trace` - Add comprehensive traces documentation Traces capture system/user/assistant messages in order, enabling visibility into the full generation conversation including correction retries.	2026-01-30 17:03:07 -05:00
Eric W. Tramel	4fddb4d900	feat: add job progress logging for cell-by-cell generation (#259 )	2026-01-29 13:06:21 -05:00
Johnny Greco	63c8dcc11d	chore: simplify publish script by removing redundant rebuild step (#268 ) - Remove rebuild_with_tag() function that caused double builds - Add dedicated delete_local_tag() function for TestPyPI cleanup - Production workflow now builds once: create local tag -> build -> upload -> push tag - Tag is only pushed after successful upload, so local tag can be deleted if build fails	2026-01-29 11:55:08 -05:00
Andre Manoel	e46fbd0759	fix: automate README sync for data-designer package builds (#266 ) * fix: uv sync or build requires copying README * update header (script doesn't check it) * changing path, ensuring proper checks	2026-01-29 13:10:26 -03:00
Johnny Greco	8bd7aafe45	feat: Add /commit skill for conventional commit messages (#252 ) * feat: Add /commit skill for conventional commit messages Adds a new Claude skill that creates well-formatted commit messages following Conventional Commits standards.	2026-01-29 10:39:52 -05:00


259 commits