mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
Preserves tree from previous docs-website head: 5e47d33ea8. This branch is a CI-managed publish artifact like gh-pages; source provenance is tracked in commit messages rather than Git ancestry.
472 lines
24 KiB
HTML
<style>
/* ── Nuke ALL Material-injected pseudo-elements on our details ── */
.md-typeset details.session-trace > summary,
.md-typeset .st-body details > summary {
  padding-left: 0 !important; /* Material adds 2.4em for its icon */
  position: relative;
}
.md-typeset details.session-trace > summary::before,
.md-typeset details.session-trace > summary::after,
.md-typeset .st-body details > summary::before,
.md-typeset .st-body details > summary::after {
  display: none !important;
  content: none !important;
}

/* ── Arrow icon (real <span>, not pseudo-element) ── */
.st-arrow {
  display: inline-block;
  font-size: .55em;
  transition: transform .2s;
  flex-shrink: 0;
  color: var(--md-default-fg-color--light);
}
details[open] > summary > .st-arrow { transform: rotate(90deg); }

/* ── Session trace container ── */
.md-typeset details.session-trace { border: 1px solid var(--md-default-fg-color--lightest); border-radius: .4rem; margin: .6rem 0; overflow: hidden; }
.md-typeset details.session-trace > summary { padding: .65rem 1rem !important; background: var(--md-code-bg-color); cursor: pointer; display: flex; align-items: center; gap: .5rem; flex-wrap: wrap; list-style: none; user-select: none; font-size: .85rem; }
.st-badge { padding: .1rem .4rem; border-radius: .2rem; font-size: .68rem; font-weight: 700; text-transform: uppercase; letter-spacing: .04em; }
.st-badge.skill { background: #76B900; color: #000; }
.st-badge.baseline { background: #f59e0b; color: #000; }
.st-name { font-weight: 600; }
.st-stats { display: inline-flex; gap: .3rem; flex-wrap: wrap; margin-left: auto; }
.st-stat { padding: .08rem .4rem; border-radius: 1rem; font-size: .68rem; font-family: var(--md-code-font-family); background: var(--md-default-fg-color--lightest); white-space: nowrap; }
.st-stat.err { background: rgba(239,68,68,.15); color: #ef4444; }
.st-stat.zero { background: rgba(118,185,0,.12); color: #76B900; }

/* ── Trace body / steps ── */
.st-body { padding: .25rem 0; }
.st-step { padding: .4rem 1rem; display: flex; gap: .6rem; align-items: baseline; border-top: 1px solid var(--md-default-fg-color--lightest); border-left: 3px solid transparent; font-size: .84rem; }
.st-step:first-child { border-top: none; }
.st-step.err { background: rgba(239,68,68,.06); border-left-color: #ef4444; }
.st-step.ok { border-left-color: #76B900; }
.st-num { min-width: 1.5rem; font-size: .7rem; color: var(--md-default-fg-color--lighter); font-family: var(--md-code-font-family); text-align: right; flex-shrink: 0; }
.st-tool { padding: .05rem .35rem; border-radius: .2rem; font-size: .68rem; font-family: var(--md-code-font-family); font-weight: 600; min-width: 2.8rem; text-align: center; flex-shrink: 0; white-space: nowrap; }
.st-tool.bash { background: #3b82f6; color: #fff; }
.st-tool.read { background: #8b5cf6; color: #fff; }
.st-tool.write { background: #10b981; color: #fff; }
.st-tool.edit { background: #f59e0b; color: #000; }
.st-tool.grep { background: #ec4899; color: #fff; }
.st-tool.glob { background: #a855f7; color: #fff; }
.st-tool.agent { background: #06b6d4; color: #000; }
.st-tool.msg { background: transparent; border: 1px solid var(--md-default-fg-color--lighter); color: var(--md-default-fg-color--light); font-weight: 500; font-style: italic; }
.st-desc { flex: 1; min-width: 0; }
.st-desc code { font-size: .78rem; }

/* ── Inner collapsible blocks ── */
.md-typeset .st-body details.st-expand { margin: .25rem 0 .1rem; border: none; box-shadow: none; }
.md-typeset .st-body details.st-expand > summary { cursor: pointer; color: var(--md-default-fg-color--light); font-size: .76rem; padding: .15rem 0 !important; list-style: none; min-height: 0; font-weight: normal; }
.md-typeset .st-body details.st-expand > summary:hover { color: var(--md-default-fg-color); }
.md-typeset .st-body details.st-expand > div { margin-top: .2rem; padding: .5rem .75rem; background: var(--md-code-bg-color); border-radius: .25rem; font-size: .73rem; font-family: var(--md-code-font-family); max-height: 30rem; overflow: auto; line-height: 1.45; white-space: pre; }
</style>
<!-- ── 1. Typical Skill Session ── -->
<details class="session-trace" open>
<summary>
<span class="st-arrow">▶</span>
<span class="st-badge skill">Skill</span>
<span class="st-name">Typical Skill Session</span>
<span class="st-stats">
<span class="st-stat">9 tools</span>
<span class="st-stat zero">0 errors</span>
<span class="st-stat">92s</span>
</span>
</summary>
<div class="st-body">
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>I'll start by learning about the Data Designer library to understand the available column types, samplers, validators, and processors.</em></span>
</div>
<div class="st-step">
<span class="st-num">1</span>
<span class="st-tool bash">Bash</span>
<span class="st-desc">
<code>data-designer agent context 2>&1</code><details class="st-expand"><summary><span class="st-arrow">▶</span> output (7,427 chars)</summary><div>Data Designer v0.5.4rc2

## Config Module

The config module contains all user-facing configuration types. Do not search other modules in the library.
config_root: .venv/lib/python3.13/site-packages/data_designer/config

builder: {config_root}/config_builder.py
base: {config_root}/base.py (read for inherited fields shared by columns and processors)
All config types are accessible via: import data_designer.config as dd

## Types

### columns
file: {config_root}/column_configs.py

type                       description
-------------------------  ---------------------------------------------------------------------------------
CustomColumnConfig         Configuration for custom user-defined column generators.
EmbeddingColumnConfig      Configuration for embedding generation columns.
ExpressionColumnConfig     Configuration for derived columns using Jinja2 expressions.
ImageColumnConfig          Configuration for image generation columns.
LLMCodeColumnConfig        Configuration for code generation columns using Large Language Models.
LLMJudgeColumnConfig       Configuration for LLM-as-a-judge quality assessment and scoring columns.
LLMStructuredColumnConfig  Configuration for structured JSON generation columns using Large Language Models.
LLMTextColumnConfig        Configuration for text generation columns using Large Language Models.
SamplerColumnConfig        Configuration for columns generated using numerical samplers.
SeedDatasetColumnConfig    Configuration for columns sourced from seed datasets.
ValidationColumnConfig     Configuration for validation columns that validate existing columns.

### constraints
file: {config_root}/sampler_constraints.py

type                        description
--------------------------  --------------------------------------------------------------------------
ColumnInequalityConstraint  Constrain a sampler column to be less/greater than another sampler column.
ScalarInequalityConstraint  Constrain a sampler column to be less/greater than a scalar value.

### processors
file: {config_root}/processors.py

type                            description
------------------------------  ---------------------------------------------------------------------------------
DropColumnsProcessorConfig      Drop columns from the output dataset (prefer ``drop=True`` in the column config).
SchemaTransformProcessorConfig  Configuration for transforming the dataset schema using Jinja2 templates.

### samplers
file: {config_root}/sampler_params.py

type                           description
-----------------------------  -------------------------------------------------------------------------------------
BernoulliSamplerParams         Parameters for sampling from a Bernoulli distribution.
BernoulliMixtureSamplerParams  Parameters for sampling from a Bernoulli mixture distribution.
BinomialSamplerParams          Parameters for sampling from a Binomial distribution.
CategorySamplerParams          Parameters for categorical sampling with optional probability weighting.
DatetimeSamplerParams          Parameters for uniform datetime sampling within a specified range.
GaussianSamplerParams          Parameters for sampling from a Gaussian (Normal) distribution.
PersonSamplerParams            Parameters for sampling synthetic person data with demographic attributes.
PersonFromFakerSamplerParams   Parameters for sampling synthetic person data with demographic attributes from Faker.
PoissonSamplerParams           Parameters for sampling from a Poisson distribution.
ScipySamplerParams             Parameters for sampling from any scipy.stats continuous or discrete distribution.
SubcategorySamplerParams       Parameters for subcategory sampling conditioned on a parent category column.
TimeDeltaSamplerParams         Parameters for sampling time deltas relative to a reference datetime column.
UniformSamplerParams           Parameters for sampling from a continuous Uniform distribution.
UUIDSamplerParams              Parameters for generating UUID (Universally Unique Identifier) values.

### validators
file: {config_root}/validator_params.py

type                          description
----------------------------  -----------------------------------------------------------------------------------------------------
CodeValidatorParams           Configuration for code validation. Supports Python and SQL code validation.
LocalCallableValidatorParams  Configuration for local callable validation. Expects a function to be passed that validates the data.
RemoteValidatorParams         Configuration for remote validation. Sends data to a remote endpoint for validation.

## Model Aliases

default_provider: nvidia

model_alias       model                              generation_type  provider
----------------  ---------------------------------  ---------------  --------
nvidia-embedding  nvidia/llama-3.2-nv-embedqa-1b-v2  embedding        nvidia
nvidia-reasoning  openai/gpt-oss-20b                 chat-completion  nvidia
nvidia-text       nvidia/nemotron-3-nano-30b-a3b     chat-completion  nvidia
nvidia-vision     nvidia/nemotron-nano-12b-v2-vl     chat-completion  nvidia
openai-embedding  text-embedding-3-large             embedding        openai
openai-reasoning  openai/openai/gpt-5.2              chat-completion  openai
openai-text       us/azure/openai/gpt-4.1            chat-completion  openai
openai-vision     openai/openai/gpt-5.2              chat-completion  openai

## Persona Datasets

locale      size     installed
----------  -------  ---------
en_IN       2.39 GB  True
en_SG       0.30 GB  True
en_US       1.24 GB  True
hi_Deva_IN  4.14 GB  True
hi_Latn_IN  2.7 GB   True
ja_JP       1.69 GB  True
pt_BR       2.33 GB  True

## Commands

command_pattern                             description
------------------------------------------  -------------------------------------------------------------------
data-designer agent context                 Prints output from all agent subcommands to bootstrap context.
data-designer agent types [family]          Type names, descriptions, and source files for one or all families.
data-designer agent state model-aliases     Model aliases and usability status.
data-designer agent state persona-datasets  Persona locales and install status.</div></details>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Now let me inspect the specific config schemas I'll need.</em></span>
</div>
<div class="st-step">
<span class="st-num">2</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/base.py</code>
</span>
</div>
<div class="st-step">
<span class="st-num">3</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/column_configs.py</code>
</span>
</div>
<div class="st-step">
<span class="st-num">4</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/sampler_params.py</code>
</span>
</div>
<div class="st-step">
<span class="st-num">5</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/validator_params.py</code>
</span>
</div>
<div class="st-step">
<span class="st-num">6</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/config_builder.py</code>
</span>
</div>
<div class="st-step">
<span class="st-num">7</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/processors.py</code>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc">
<details class="st-expand"><summary><span class="st-arrow">▶</span> message (863 chars)</summary><div>Now I have a complete understanding of the library. Here are my key design decisions:

**Schema:**
1. `difficulty` — Category sampler: beginner, intermediate, advanced (equal weights)
2. `subtopic` — Category sampler: 7 data science subtopics
3. `instruction` — LLM text column: generates a natural language task description conditioned on difficulty + subtopic
4. `solution` — LLM code column (Python): generates a complete solution conditioned on instruction + difficulty + subtopic
5. `code_validation` — Validation column (code validator, Python): checks syntactic correctness of the solution
6. `quality_score` — LLM judge column: scores correctness, code quality, and instruction clarity on 1-5 scale

**Diversity axes:** difficulty x subtopic combinations ensure broad coverage. The LLM prompts emphasize varying libraries, complexity, and operation types.</div></details>
</span>
</div>
<div class="st-step">
<span class="st-num">8</span>
<span class="st-tool write">Write</span>
<span class="st-desc">
<code>text_to_python.py</code><details class="st-expand"><summary><span class="st-arrow">▶</span> written content (7,540 chars)</summary><div># /// script
# dependencies = [
#   "data-designer",
# ]
# ///
import data_designer.config as dd


def load_config_builder() -> dd.DataDesignerConfigBuilder:
    config_builder = dd.DataDesignerConfigBuilder()

    # 1. Difficulty level
    config_builder.add_column(
        dd.SamplerColumnConfig(
            name="difficulty",
            sampler_type="category",
            params=dd.CategorySamplerParams(
                values=["beginner", "intermediate", "advanced"],
            ),
        )
    )

    # 2. Subtopic
    config_builder.add_column(
        dd.SamplerColumnConfig(
            name="subtopic",
            sampler_type="category",
            params=dd.CategorySamplerParams(
                values=[
                    "data cleaning",
                    "exploratory analysis",
                    "aggregation and groupby operations",
                    "visualization with matplotlib/seaborn",
                    "statistical testing",
                    "feature engineering",
                    "working with messy or missing data",
                ],
            ),
        )
    )

    # 3. Natural language instruction
    config_builder.add_column(
        dd.LLMTextColumnConfig(
            name="instruction",
            model_alias="nvidia-text",
            system_prompt=(
                "You are an expert data science educator who writes clear, realistic Python coding tasks. "
                "Your instructions should be specific enough that a developer can implement a complete solution "
                "without ambiguity. Vary the libraries used (pandas, numpy, scipy, sklearn, matplotlib, seaborn, etc.) "
                "and the operations required to match the difficulty level."
            ),
            prompt=(
                "Write a single, self-contained Python coding instruction for a {{ difficulty }}-level data science task "
                "in the subtopic of '{{ subtopic }}'.\n\n"
                "Requirements:\n"
                "- The task should be realistic and something a data scientist would actually encounter.\n"
                "- For beginner tasks, focus on simple operations with clear steps.\n"
                "- For intermediate tasks, combine multiple operations or require domain knowledge.\n"
                "- For advanced tasks, involve complex logic, performance considerations, or sophisticated techniques.\n"
                "- The instruction should describe what the code should do, including any expected input/output behavior.\n"
                "- Do NOT include the solution — only the task description.\n\n"
                "Return only the instruction text, nothing else."
            ),
        )
    )

    # 4. Python solution
    config_builder.add_column(
        dd.LLMCodeColumnConfig(
            name="solution",
            model_alias="nvidia-text",
            code_lang="python",
            system_prompt=(
                "You are an expert Python data scientist. Write clean, well-structured, and correct Python code. "
                "The code must be completely self-contained — include all necessary imports, sample data creation, "
                "and function definitions. Do not rely on external files or databases. "
                "Use realistic variable names and follow PEP 8 conventions."
            ),
            prompt=(
                "Write a complete, self-contained Python solution for the following {{ difficulty }}-level "
                "data science task in the area of '{{ subtopic }}'.\n\n"
                "Task:\n{{ instruction }}\n\n"
                "Requirements:\n"
                "- Include all necessary imports at the top.\n"
                "- Create any sample/mock data needed so the code runs independently.\n"
                "- The code must be syntactically valid and runnable as-is.\n"
                "- Match the complexity to the {{ difficulty }} level.\n"
                "- Add brief inline comments explaining key steps.\n\n"
                "Return only the Python code."
            ),
        )
    )

    # 5. Code validation
    config_builder.add_column(
        dd.ValidationColumnConfig(
            name="code_validation",
            target_columns=["solution"],
            validator_type="code",
            validator_params=dd.CodeValidatorParams(code_lang="python"),
        )
    )

    # 6. LLM-as-a-judge quality scoring
    config_builder.add_column(
        dd.LLMJudgeColumnConfig(
            name="quality_score",
            model_alias="nvidia-text",
            prompt=(
                "Evaluate the following Python data science coding task and its solution.\n\n"
                "Difficulty: {{ difficulty }}\n"
                "Subtopic: {{ subtopic }}\n\n"
                "Instruction:\n{{ instruction }}\n\n"
                "Solution:\n{{ solution }}\n\n"
                "Score each dimension below on a 1-5 scale using the provided rubrics."
            ),
            scores=[
                dd.Score(
                    name="correctness",
                    description=(
                        "Does the solution correctly implement what the instruction asks for? "
                        "Consider whether the code logic is sound, handles edge cases appropriately, "
                        "and would produce the expected output."
                    ),
                    options={
                        1: "Completely incorrect or does not address the instruction.",
                        2: "Major logical errors; partially addresses the instruction.",
                        3: "Mostly correct with minor bugs or missing edge cases.",
                        4: "Correct implementation with only trivial issues.",
                        5: "Fully correct, handles edge cases, and produces expected output.",
                    },
                ),
                dd.Score(
                    name="code_quality",
                    description=(
                        "Assess the code quality including readability, proper use of libraries, "
                        "PEP 8 compliance, appropriate variable naming, and idiomatic Python/pandas usage."
                    ),
                    options={
                        1: "Very poor quality: unreadable, no structure, bad naming.",
                        2: "Below average: some structure but many style issues.",
                        3: "Acceptable: readable but could improve in idiom or style.",
                        4: "Good: clean, well-structured, follows conventions.",
                        5: "Excellent: idiomatic, elegant, and highly readable.",
                    },
                ),
                dd.Score(
                    name="instruction_clarity",
                    description=(
                        "How clear and unambiguous is the instruction? Could a competent developer "
                        "implement the solution from the instruction alone without guessing?"
                    ),
                    options={
                        1: "Extremely vague or incomprehensible.",
                        2: "Unclear with significant ambiguity.",
                        3: "Understandable but some details are missing or ambiguous.",
                        4: "Clear with only minor room for interpretation.",
                        5: "Crystal clear, specific, and unambiguous.",
                    },
                ),
            ],
        )
    )

    return config_builder
</div></details><div>File created successfully at: ./text_to_python.py</div>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Now let me validate the config.</em></span>
</div>
<div class="st-step ok">
<span class="st-num">9</span>
<span class="st-tool bash">Bash</span>
<span class="st-desc">
<code>data-designer validate text_to_python.py 2>&1</code><div> ────────────────────────── Data Designer Validate ──────────────────────────

Config: text_to_python.py

[22:15:26] [INFO] ✅ Validation passed
✅ Configuration is valid</div>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc">
<details class="st-expand"><summary><span class="st-arrow">▶</span> message (1,025 chars)</summary><div>Validation passed. Here's a summary of the dataset pipeline:

| Column | Type | Description |
|---|---|---|
| `difficulty` | Category sampler | beginner, intermediate, advanced (uniform) |
| `subtopic` | Category sampler | 7 data science subtopics (data cleaning, exploratory analysis, aggregation/groupby, visualization, statistical testing, feature engineering, messy/missing data) |
| `instruction` | LLM text | Natural language task description conditioned on difficulty + subtopic |
| `solution` | LLM code (Python) | Complete self-contained Python solution for the instruction |
| `code_validation` | Code validator | Checks the solution for syntactic correctness |
| `quality_score` | LLM judge | Scores correctness, code_quality, and instruction_clarity (1-5 scale with structured reasoning) |

To generate the dataset, run:

```
data-designer create text_to_python.py --num-records &lt;N&gt; --dataset-name text-to-python
```

Note: generation can take a long time for large record counts (50+), so choose `N` accordingly.</div></details>
</span>
</div>
</div>
</details>
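
The trace above describes the `code_validation` column as checking the solution for syntactic correctness. As a rough illustration of what a syntax-only check involves, here is a minimal sketch using Python's standard `ast` module. It is not Data Designer's implementation, which may use different tooling. The helper name `is_valid_python` and the two sample snippets are hypothetical and exist only for this example.

```python
import ast


def is_valid_python(source: str) -> tuple[bool, str]:
    """Return (ok, message) for a purely syntactic check of `source`.

    Hypothetical helper for illustration only; it parses the code
    without executing it, so runtime errors are not detected.
    """
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"
    return True, "ok"


# Sample inputs (made up for this sketch).
good = "import math\nprint(math.sqrt(16))\n"
bad = "def broken(:\n    pass\n"

print(is_valid_python(good))
print(is_valid_python(bad))
```

A check like this only confirms that the code parses. Whether a generated solution actually does what the instruction asks is a separate question, which the trace delegates to the `quality_score` judge column.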