DataDesigner

Elgato_dark/DataDesigner

Fork 0

mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

Commit graph

Author	SHA1	Message	Date
Nabin Mulepati	cebfb0e967	docs: Added starter dev notes on push to hugging face hub (#355 ) * Added starter dev notes on push to huggingface hub * fix: move excerpt marker to intro and remove redundant markers Move the single <\!-- more --> to after the intro paragraph for a shorter blog teaser and remove the 6 redundant markers throughout the post. * Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * docs: add HF ecosystem context to push-to-hub dev notes (#474) * docs: add HF ecosystem context to push-to-hub dev notes Add section on what datasets get on the Hub (Dataset Viewer, streaming, Viewer API), link to Hub search for DataDesigner datasets, and note that private datasets can be flipped to public. * Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * fix: remove doubled library: prefix in Hub search URL --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> * Update date * fix date for text-to-sql * update hero images" * updates --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>	2026-04-16 11:29:33 -06:00
Andre Manoel	1a237d95d0	fix: text-to-sql devnote date, images, and publish-devnotes nav (#546 ) - Update post date from 2026-03-11 to 2026-04-14 so it appears as the newest post on the devnotes page. - Replace raw <img> tags with markdown image syntax so mkdocs rewrites relative paths correctly for the blog plugin's slug-based URLs. - Overlay mkdocs.yml from HEAD in publish-devnotes workflow so new nav entries are included in devnotes-only rebuilds.	2026-04-14 15:48:23 -03:00
dhruvnathawani	1448f9cbda	docs: add text-to-sql dev note (#349 ) * docs: add text-to-sql devnote * add diagram, update content * correct inconsistencies * docs: address PR #349 feedback and add BIRD benchmark results PR feedback fixes: - Fix Window Functions contradiction: Key Takeaway #1 now uses "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate) - Fix score-0 truthiness bug: use `is not none` instead of truthy check in Jinja2 expression columns (inline example + production pipeline) - Soften Code Sandbox language: "A natural next step would be..." instead of "We are actively implementing..." - Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron team description - Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME, ASCII diagram labels, Pipeline Overview prose - Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck - Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default provider pattern (matches structured-outputs dev note), remove unused explicit ModelConfig - Remove placeholder dataset link (#), add "Dataset: Internal" note New content: - Add BIRD Benchmark Results section with bar chart (JPG), data table, BIRD caveat paragraph, and Jocelyn Huang acknowledgement (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B) - Replace "Looking Ahead: Code Sandbox" with broader "Next Steps": Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0 - Add Project Summary table at end of post * docs: address second round of PR #349 feedback - Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile) - Add admonition clarifying code snippets are illustrative, not runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha) - Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha) - Add companion file note and recipe link to production pipeline details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha) * docs: address round 2 PR #349 feedback, replace production block with recipe - Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1 to match the exact taxonomy string in the code example (greptile) - Add admonition clarifying inline code snippets are illustrative, with link to runnable Enterprise Text-to-SQL Recipe (nabinchha) - Add context before score extraction snippet referencing the five LLMJudgeColumnConfig columns and linking to full recipe (nabinchha) - Replace production pipeline <details> block (230 lines with phantom imports from prompts.py, rubrics.py, text2sql_seed.json) with snippet include of enterprise_text_to_sql.py recipe — self-contained and runnable, consistent with other merged dev notes (nabinchha) * docs: polish Try It Yourself and Summary sections - Wrap minimal inline example in collapsible <details> dropdown - Rename "A Team Effort" section to "Summary" - Remove redundant Scale/Dialects/Dataset line * docs: add missing sql_dialect sampler to Step 1 code snippet The Step 3/4 prompt templates reference {{ sql_dialect }} but the Step 1 seeding code never defined it, leaving an unresolved Jinja2 variable for readers following along. Add the sql_dialect sampler with a comment explaining the pipeline runs once per dialect. * fix ascii diagram * docs: fix BIRD score framing and MySQL dialect wording - Remove specific "60-70%" BIRD claim from intro to avoid contradiction with the 41.80%/38.25% direct-generation results shown later (those higher figures come from specialized systems with schema linking) - Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and CONVERT_TZ are valid MySQL functions; the pipeline excluded them for portability, not because the dialect forbids them * docs: move text-to-sql images to assets/ convention and update refs * docs: address text-to-sql devnote review comments - Add devnote to mkdocs nav after Async All the Way Down - Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe) - Fix score extraction truthy check to use 'is not none' (preserves score-0 values) - Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect) - Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y - Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe) --------- Signed-off-by: Yev Meyer <ymeyer@nvidia.com> Co-authored-by: Yev Meyer <ymeyer@nvidia.com>	2026-04-14 11:10:14 -07:00

Author

SHA1

Message

Date

Nabin Mulepati

cebfb0e967

docs: Added starter dev notes on push to hugging face hub (#355 )

* Added starter dev notes on push to huggingface hub

* fix: move excerpt marker to intro and remove redundant markers

Move the single <\!-- more --> to after the intro paragraph for a shorter
blog teaser and remove the 6 redundant markers throughout the post.

* Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* docs: add HF ecosystem context to push-to-hub dev notes (#474)

* docs: add HF ecosystem context to push-to-hub dev notes

Add section on what datasets get on the Hub (Dataset Viewer, streaming,
Viewer API), link to Hub search for DataDesigner datasets, and note that
private datasets can be flipped to public.

* Update docs/devnotes/posts/push-datasets-to-hugging-face-hub.md

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* fix: remove doubled library: prefix in Hub search URL

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

* Update date

* fix date for text-to-sql

* update hero images"

* updates

---------

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Daniel van Strien <davanstrien@users.noreply.github.com>

2026-04-16 11:29:33 -06:00

Andre Manoel

1a237d95d0

fix: text-to-sql devnote date, images, and publish-devnotes nav (#546 )

- Update post date from 2026-03-11 to 2026-04-14 so it appears as the
  newest post on the devnotes page.
- Replace raw <img> tags with markdown image syntax so mkdocs rewrites
  relative paths correctly for the blog plugin's slug-based URLs.
- Overlay mkdocs.yml from HEAD in publish-devnotes workflow so new nav
  entries are included in devnotes-only rebuilds.

2026-04-14 15:48:23 -03:00

dhruvnathawani

1448f9cbda

docs: add text-to-sql dev note (#349 )

* docs: add text-to-sql devnote

* add diagram, update content

* correct inconsistencies

* docs: address PR #349 feedback and add BIRD benchmark results
PR feedback fixes:
- Fix Window Functions contradiction: Key Takeaway #1 now uses
  "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
- Fix score-0 truthiness bug: use `is not none` instead of truthy check
  in Jinja2 expression columns (inline example + production pipeline)
- Soften Code Sandbox language: "A natural next step would be..." instead
  of "We are actively implementing..."
- Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron
  team description
- Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME,
  ASCII diagram labels, Pipeline Overview prose
- Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
- Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default
  provider pattern (matches structured-outputs dev note), remove unused
  explicit ModelConfig
- Remove placeholder dataset link (#), add "Dataset: Internal" note
New content:
- Add BIRD Benchmark Results section with bar chart (JPG), data table,
  BIRD caveat paragraph, and Jocelyn Huang acknowledgement
  (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
- Replace "Looking Ahead: Code Sandbox" with broader "Next Steps":
  Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
- Add Project Summary table at end of post

* docs: address second round of PR #349 feedback

- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying code snippets are illustrative, not
  runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Add companion file note and recipe link to production pipeline
  details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)

* docs: address round 2 PR #349 feedback, replace production block with recipe
- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying inline code snippets are illustrative,
  with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Replace production pipeline <details> block (230 lines with phantom
  imports from prompts.py, rubrics.py, text2sql_seed.json) with
  snippet include of enterprise_text_to_sql.py recipe — self-contained
  and runnable, consistent with other merged dev notes (nabinchha)

* docs: polish Try It Yourself and Summary sections
- Wrap minimal inline example in collapsible <details> dropdown
- Rename "A Team Effort" section to "Summary"
- Remove redundant Scale/Dialects/Dataset line

* docs: add missing sql_dialect sampler to Step 1 code snippet

The Step 3/4 prompt templates reference {{ sql_dialect }} but the
Step 1 seeding code never defined it, leaving an unresolved Jinja2
variable for readers following along. Add the sql_dialect sampler
with a comment explaining the pipeline runs once per dialect.

* fix ascii diagram

* docs: fix BIRD score framing and MySQL dialect wording
- Remove specific "60-70%" BIRD claim from intro to avoid contradiction
  with the 41.80%/38.25% direct-generation results shown later (those
  higher figures come from specialized systems with schema linking)
- Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and
  CONVERT_TZ are valid MySQL functions; the pipeline excluded them for
  portability, not because the dialect forbids them

* docs: move text-to-sql images to assets/ convention and update refs

* docs: address text-to-sql devnote review comments

  - Add devnote to mkdocs nav after Async All the Way Down
  - Swap Recursive CTEs to Advanced, CASE Expressions to Intermediate (matches recipe)
  - Fix score extraction truthy check to use 'is not none' (preserves score-0 values)
  - Drop REPLACE() vs regexp_replace from dialect takeaway (REPLACE is cross-dialect)
  - Tighten prose: remove 'The key insight:', use actual BIRD number, trim X-not-Y
  - Fix knowledge dependency count: 8 -> 9 concepts (3x3 in recipe)

---------

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Co-authored-by: Yev Meyer <ymeyer@nvidia.com>

2026-04-14 11:10:14 -07:00

3 commits