DataDesigner

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Author	SHA1	Message	Date
Nabin Mulepati	cebfb0e967	docs: Added starter dev notes on push to hugging face hub (#355 ) * Added starter dev notes on push to huggingface hub * fix: move excerpt marker to intro and remove redundant markers Move the single <\!-- more --> to after the intro paragraph for a shorter blog teaser and remove the 6 redundant markers throughout the post. * Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * docs: add HF ecosystem context to push-to-hub dev notes (#474) * docs: add HF ecosystem context to push-to-hub dev notes Add section on what datasets get on the Hub (Dataset Viewer, streaming, Viewer API), link to Hub search for DataDesigner datasets, and note that private datasets can be flipped to public. * Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix: remove doubled library: prefix in Hub search URL --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update date * fix date for text-to-sql * update hero images" * updates --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>	2026-04-16 11:29:33 -06:00
dhruvnathawani	1448f9cbda	docs: add text-to-sql dev note (#349 ) * docs: add text-to-sql devnote * add diagram, update content * correct inconsistencies * docs: address PR #349 feedback and add BIRD benchmark results PR feedback fixes: - Fix Window Functions contradiction: Key Takeaway #1 now uses "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate) - Fix score-0 truthiness bug: use `is not none` instead of truthy check in Jinja2 expression columns (inline example + production pipeline) - Soften Code Sandbox language: "A natural next step would be..." instead of "We are actively implementing..." - Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron team description - Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME, ASCII diagram labels, Pipeline Overview prose - Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck - Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default provider pattern (matches structured-outputs dev note), remove unused explicit ModelConfig - Remove placeholder dataset link (#), add "Dataset: Internal" note New content: - Add BIRD Benchmark Results section with bar chart (JPG), data table, BIRD caveat paragraph, and Jocelyn Huang acknowledgement (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B) - Replace "Looking Ahead: Code Sandbox" with broader "Next Steps": Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0 - Add Project Summary table at end of post * docs: address second round of PR #349 feedback - Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile) - Add admonition clarifying code snippets are illustrative, not runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha) - Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha) - Add companion file note and recipe link to production pipeline details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha) * docs: address round 2 PR #349 feedback, replace production block with recipe - Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile) - Add admonition clarifying inline code snippets are illustrative, with link to runnable Enterprise Text-to-SQL Recipe (nabinchha) - Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha) - Replace production pipeline <details> block (230 lines with phantom imports from prompts.py, rubrics.py, text2sql_seed.json) with snippet include of enterprise_text_to_sql.py recipe — self-contained and runnable, consistent with other merged dev notes (nabinchha) * docs: polish Try It Yourself and Summary sections - Wrap minimal inline example in collapsible <details> dropdown - Rename "A Team Effort" section to "Summary" - Remove redundant Scale/Dialects/Dataset line * docs: add missing sql_dialect sampler to Step 1 code snippet The Step 3/4 prompt templates reference {{ sql_dialect }} but the Step 1 seeding code never defined it, leaving an unresolved Jinja2 variable for readers following along. Add the sql_dialect sampler with a comment explaining the pipeline runs once per dialect. * fix ascii diagram * docs: fix BIRD score framing and MySQL dialect wording - Remove specific "60-70%" BIRD claim from intro to avoid contradiction with the 41.80%/38.25% direct-generation results shown later (those higher figures come from specialized systems with schema linking) - Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and CONVERT_TZ are valid MySQL functions; the pipeline excluded them for portability, not because the dialect forbids them * docs: move text-to-sql images to assets/ convention and update refs * docs: address text-to-sql devnote review comments - Add devnote to mkdocs nav after Async All the Way Down - Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe) - Fix score extraction truthy check to use 'is not none' (preserves score-0 values) - Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect) - Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y - Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe) --------- Signed-off-by: Yev Meyer <ymeyer@nvidia.com> Co-authored-by: Yev Meyer <ymeyer@nvidia.com>	2026-04-14 11:10:14 -07:00
Andre Manoel	0e90ea644b	docs: add async engine dev note (#490 ) * fix: address review feedback on async engine dev note - Fix wall-clock claim: 41% -> 22% to match benchmark table - Fix dual-model speedup rounding: 1.7x -> 1.6x (10.0/6.1 = 1.64) - Fix run_config API: use dd.set_run_config() instead of passing to create() * docs: add async engine dev note Add "Async All the Way Down" dev note covering the async task-queue scheduler built across PRs #356, #378, #404, #429, #456. Includes benchmark results, architecture diagrams, and DAG shape illustrations. * feat: add docs preview workflow for PRs Build MkDocs site on PRs that touch docs and deploy to Cloudflare Pages. Each PR gets a browseable preview URL posted as a comment. Notebook tutorials use placeholder stubs since they require API keys to execute. Requires CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID repo secrets. * fix: update speedup chart alt text from 1.7x to 1.6x * docs: improve timeline figure context and labeling Add DAG subtitle to sync-vs-async timeline figure and bridge the surrounding text to explain which workload shape is being shown. * edits+additions to async-all-the-way-down dev notes * clarify two semaphore dance * remove dead link * replace hero image * docs: update scale figures with nginx-accurate data and adjust sizing Regenerate scale-model-timeline and scale-boxplot from nginx access logs (column_progress.csv, sync/summary.json) instead of buffered execution logs. Optimize both PNGs to palette mode. Adjust figure widths and update model timeline commentary. * add link from owning-the-model-stack to async-dev-node * docs: address review feedback on async blog post - Tighten intro to a concise abstract, move pipeline narrative into "The Bottleneck Was Structural" section - Remove multi-column generators / seed readers paragraph (TMI) - Clarify sync engine ran columns sequentially within each batch --------- Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>	2026-04-08 15:51:04 -03:00
Nabin Mulepati	f78c4e0cf7	Fix repeated header/footer on native-model-client-hero image (#492 )	2026-04-06 09:45:27 -06:00
Nabin Mulepati	a1eb244321	docs: add native model client dev note (#465 ) * add images * re-ran slopguard * update dev notes * address greptile comments * update example model name * add info on throttlemanager * address pr feedback * Add link to model aliases * address pr feedback * update key resources * update key resources * crop image for better fit * Fix max_parallel_requests * refine concluding paragraph	2026-03-31 15:45:56 -06:00
Johnny Greco	0a7b9e0d6d	docs: Data Designer Got Skills dev note (#457 ) * docs: add skeleton for "Data Designer Got Skills" dev note * create assets folder and add blog directory name * docs: add Claude Code plugin marketplace configuration Register the repo as a Claude Code plugin marketplace so users can install the data-designer skill via `/plugin marketplace add`. * docs: write first draft of "Data Designer Got Skills" dev note Full prose for all sections: intro with hero benchmark figure, agents as first-class users, baseline trace walkthrough, CLI and skill design, benchmark results (228 sessions), getting started with marketplace and npx install paths, and what's next. * docs: add error breakdown table and minor refinements * docs: add sdg and data-designer keywords to plugin metadata * docs: refine CLI framing, reduce em dashes, slop guard pass * docs: fix grammar in dev note (serial comma, double-which clause) * update hero image * docs: swap hero image, move benchmark figure, minor wording tweaks * docs: add narrative lead-in to skill trace summary * docs: refine quality bullet, streamline getting started modes * remove old image * slope-guard tweaks	2026-03-24 21:03:00 -04:00

6 commits