
<!-- ── 2. Typical Baseline Session ── -->
<details class="session-trace" open>
<summary>
<span class="st-arrow">&#9654;</span>
<span class="st-badge baseline">Baseline</span>
<span class="st-name">Typical Baseline Session</span>
<span class="st-stats">
<span class="st-stat">35 tools</span>
<span class="st-stat err">1 error</span>
<span class="st-stat">159s</span>
</span>
</summary>
<div class="st-body">
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Let me start by exploring the Data Designer package and checking available configurations.</em></span>
</div>
<div class="st-step">
<span class="st-num">1</span>
<span class="st-tool bash">Bash</span>
<span class="st-desc">
<code>.venv/bin/python -c &quot;import data_designer.config as dd, os; print(os.path.dirname(dd.__file__))&quot;</code><div>.venv/lib/python3.13/site-packages/data_designer/config</div>
</span>
</div>
<div class="st-step">
<span class="st-num">2</span>
<span class="st-tool bash">Bash</span>
<span class="st-desc">
<code>.venv/bin/data-designer config list</code><details class="st-expand"><summary><span class="st-arrow">&#9654;</span> output (5,284 chars)</summary><div>  ─────────────────────── Data Designer Configurations ───────────────────────

  💡  Configuration directory: ~/.data-designer

                                Model Providers
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Name       ┃ Endpoint             ┃ Type   ┃ API Key               ┃ Default ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ nvidia     │ https://integrate.a… │ openai │ NVIDIA_API_KEY        │    ✓    │
│ openai     │ https://inference-a… │ openai │ NVIDIA_INFERENCE_API… │         │
│ openrouter │ https://openrouter.… │ openai │ OPENROUTER_API_KEY    │         │
└────────────┴──────────────────────┴────────┴───────────────────────┴─────────┘

                              Model Configurations
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃                      ┃                     ┃            ┃ Inference          ┃
┃ Alias                ┃ Model               ┃ Provider   ┃ Parameters         ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ nvidia-text          │ nvidia/nemotron-3-… │ nvidia     │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ temperature=1.00,  │
│                      │                     │            │ top_p=1.00         │
│ nvidia-reasoning     │ openai/gpt-oss-20b  │ nvidia     │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ temperature=0.35,  │
│                      │                     │            │ top_p=0.95         │
│ nvidia-vision        │ nvidia/nemotron-na… │ nvidia     │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ temperature=0.85,  │
│                      │                     │            │ top_p=0.95         │
│ nvidia-embedding     │ nvidia/llama-3.2-n… │ nvidia     │ generation_type=e… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ extra_body={&#x27;inpu… │
│                      │                     │            │ &#x27;query&#x27;},          │
│                      │                     │            │ encoding_format=f… │
│ openai-text          │ us/azure/openai/gp… │ openai     │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ temperature=0.85,  │
│                      │                     │            │ top_p=0.95         │
│ openai-reasoning     │ openai/openai/gpt-… │ openai     │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ extra_body={&#x27;reas… │
│                      │                     │            │ &#x27;medium&#x27;}          │
│ openai-vision        │ openai/openai/gpt-… │ openai     │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ extra_body={&#x27;reas… │
│                      │                     │            │ &#x27;medium&#x27;}          │
│ openai-embedding     │ text-embedding-3-l… │ openai     │ generation_type=e… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ encoding_format=f… │
│ openrouter-text      │ nvidia/nemotron-3-… │ openrouter │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ temperature=1.00,  │
│                      │                     │            │ top_p=1.00         │
│ openrouter-reasoning │ openai/gpt-oss-20b  │ openrouter │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ temperature=0.35,  │
│                      │                     │            │ top_p=0.95         │
│ openrouter-vision    │ nvidia/nemotron-na… │ openrouter │ generation_type=c… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ temperature=0.85,  │
│                      │                     │            │ top_p=0.95         │
│ openrouter-embedding │ openai/text-embedd… │ openrouter │ generation_type=e… │
│                      │                     │            │ max_parallel_requ… │
│                      │                     │            │ encoding_format=f… │
└──────────────────────┴─────────────────────┴────────────┴────────────────────┘

  ⚠️   MCP providers have not been configured. Run &#x27;data-designer config mcp&#x27; to
configure them.

  ⚠️   Tool configs have not been configured. Run &#x27;data-designer config tools&#x27;
to configure them.</div></details>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Now let me explore the Data Designer package structure and understand the API.</em></span>
</div>
<div class="st-step err">
<span class="st-num">3</span>
<span class="st-tool agent">Agent</span>
<span class="st-desc">
<code>Explore</code> &mdash; Explore Data Designer API<details class="st-expand"><summary><span class="st-arrow">&#9654;</span> prompt (974 chars)</summary><div>Explore the Data Designer package thoroughly. The package is located at:
.venv/lib/python3.13/site-packages/data_designer/

I need to understand:
1. How to create a DataDesignerConfig or ConfigBuilder — look for the main config builder API
2. What column types are available (especially for LLM generation, code validation, sampling/categorical, LLM-as-judge)
3. How columns are defined and added to a config
4. How to specify model aliases for columns
5. Any example configs or tests that show how to build a config
6. How the `data-designer validate` CLI command works (what it expects)
7. The `load_config_builder` function pattern — what should it return?

Be very thorough — check __init__.py files, column types, config module, CLI module, and any examples directory. Look at the actual class signatures and constructors.</div></details><details class="st-expand"><summary><span class="st-arrow">&#9654;</span> subagent work (25 tool calls, <strong style="color:#ef4444">1 error</strong>)</summary><div style="white-space:normal"><div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">1</span><span class="st-tool bash" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Bash</span><span style="font-size:.72rem"><code>find .venv/lib/python3.13/site-packages/data_designer</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">2</span><span class="st-tool bash" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Bash</span><span style="font-size:.72rem"><code>find .venv/lib/python3.13/site-packages/data_designer</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid #ef4444;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">3</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/__init__.py</code> &mdash; <strong style="color:#ef4444">error</strong></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">4</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/__init__.py</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">5</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/config_builder.py</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">6</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/column_types.py</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">7</span><span class="st-tool bash" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Bash</span><span style="font-size:.72rem"><code>find .venv/lib/python3.13/site-packages/data_designer</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">8</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/column_configs.py</code> <span style="opacity:.5">(limit=300)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">9</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/column_configs.py</code> <span style="opacity:.5">(offset=300, limit=300)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">10</span><span class="st-tool bash" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Bash</span><span style="font-size:.72rem"><code>find .venv/lib/python3.13/site-packages/data_designer</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">11</span><span class="st-tool bash" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Bash</span><span style="font-size:.72rem"><code>find .venv/lib/python3.13/site-packages/data_designer</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">12</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/cli/commands/validate.py</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">13</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/cli/utils/config_loader.py</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">14</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/cli/controllers/generation_controller.py</code> <span style="opacity:.5">(limit=200)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">15</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/models.py</code> <span style="opacity:.5">(limit=250)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">16</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/models.py</code> <span style="opacity:.5">(offset=250, limit=250)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">17</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/models.py</code> <span style="opacity:.5">(offset=500, limit=200)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">18</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/sampler_params.py</code> <span style="opacity:.5">(limit=200)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">19</span><span class="st-tool bash" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Bash</span><span style="font-size:.72rem"><code>find .venv/lib/python3.13/site-packages/data_designer</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">20</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/data_designer_config.py</code> <span style="opacity:.5">(limit=200)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">21</span><span class="st-tool bash" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Bash</span><span style="font-size:.72rem"><code>find .venv/lib/python3.13/site-packages/data_designer</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">22</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/validator_params.py</code> <span style="opacity:.5">(limit=200)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">23</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/interface/data_designer.py</code> <span style="opacity:.5">(limit=150)</span></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">24</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/utils/code_lang.py</code></span></div>
<div style="display:flex;gap:.5rem;align-items:baseline;padding:.15rem 0;border-left:2px solid transparent;padding-left:.4rem"><span style="min-width:1.2rem;text-align:right;opacity:.4;font-size:.7rem">25</span><span class="st-tool read" style="font-size:.62rem;min-width:2.2rem;padding:.02rem .25rem">Read</span><span style="font-size:.72rem"><code>data_designer/config/sampler_params.py</code> <span style="opacity:.5">(offset=200, limit=300)</span></span></div></div></details><details class="st-expand"><summary><span class="st-arrow">&#9654;</span> subagent report (12,076 chars)</summary><div>Now I have comprehensive information about the Data Designer package. Let me compile a thorough report.

## Data Designer Package Exploration Report

I&#x27;ve conducted a thorough exploration of the Data Designer package located at `.venv/lib/python3.13/site-packages/data_designer/`. Here&#x27;s my comprehensive understanding:

### 1. Creating a DataDesignerConfig or ConfigBuilder

**Main API:** The primary builder class is `DataDesignerConfigBuilder` located at:
- `.venv/lib/python3.13/site-packages/data_designer/config/config_builder.py`

**Constructor:**
```python
DataDesignerConfigBuilder(
    model_configs: list[ModelConfig] | str | Path | None = None,
    tool_configs: list[ToolConfig] | None = None
)
```

**Key Methods:**
- `__init__()`: Initialize with optional model configs (defaults to local models) and tool configs
- `add_column()`: Add columns to the config (most commonly used)
- `add_model_config()`: Add model configurations
- `add_tool_config()`: Add MCP tool configurations
- `with_seed_dataset()`: Attach a seed dataset source
- `build()`: Returns a `DataDesignerConfig` object
- `get_builder_config()`: Returns a `BuilderConfig` wrapping the config
- `write_config()`: Save to YAML/JSON file
- `from_config()`: Create builder from existing config file or dict

**Alternative Creation:**
```python
# From existing config file/dict
builder = DataDesignerConfigBuilder.from_config(config_path_or_dict)
```

### 2. Available Column Types

**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/column_types.py`

**Column Types (enum: DataDesignerColumnType):**

1. **LLM Generation Columns:**
   - `llm-text` → `LLMTextColumnConfig`: Free-form text generation with Jinja2 templates
   - `llm-code` → `LLMCodeColumnConfig`: Code generation with language specification (Python, SQL, JS, etc.)
   - `llm-structured` → `LLMStructuredColumnConfig`: Structured JSON output using JSON schema or Pydantic models
   - `llm-judge` → `LLMJudgeColumnConfig`: Multi-dimensional evaluation/scoring of other columns

2. **Sampling/Categorical Columns:**
   - `sampler` → `SamplerColumnConfig`: Numerical and categorical sampling (13+ sampler types available)

3. **Code Validation:**
   - `validation` → `ValidationColumnConfig`: Run validation code (Python/SQL) or call validation functions

4. **Other Column Types:**
   - `seed-dataset` → `SeedDatasetColumnConfig`: Columns sourced from seed data (auto-created)
   - `expression` → `ExpressionColumnConfig`: Jinja2-based derived/computed columns
   - `embedding` → `EmbeddingColumnConfig`: Generate embeddings for text
   - `image` → `ImageColumnConfig`: Generate images from prompts
   - `custom` → `CustomColumnConfig`: User-defined generation logic with decorators

### 3. How Columns Are Defined and Added

**Two approaches:**

**Approach 1: Direct column objects**
```python
from data_designer.config import LLMTextColumnConfig, SamplerColumnConfig

text_col = LLMTextColumnConfig(
    name=&quot;review&quot;,
    prompt=&quot;Write a product review about {{ product_name }}&quot;,
    model_alias=&quot;claude-3.5-sonnet&quot;
)
builder.add_column(text_col)

# Sampler column
sampler_col = SamplerColumnConfig(
    name=&quot;age&quot;,
    sampler_type=&quot;uniform&quot;,
    params={&quot;low&quot;: 18, &quot;high&quot;: 65}
)
builder.add_column(sampler_col)
```

**Approach 2: Via kwargs (using add_column)**
```python
builder.add_column(
    name=&quot;review&quot;,
    column_type=&quot;llm-text&quot;,
    prompt=&quot;Write a product review about {{ product_name }}&quot;,
    model_alias=&quot;claude-3.5-sonnet&quot;
)
```

**Column Properties:**
- Each column config extends `SingleColumnConfig` with a `column_type` discriminator field
- Columns support:
  - `required_columns`: List of columns this column depends on
  - `side_effect_columns`: Additional columns created as side effects (e.g., traces, reasoning_content)
  - Jinja2 template validation for prompts
  - Type conversion and formatting options

### 4. Model Aliases for Columns

**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/models.py`

**Model Configuration:**
```python
from data_designer.config import ModelConfig, ChatCompletionInferenceParams

model_config = ModelConfig(
    alias=&quot;my-claude&quot;,  # User-defined alias referenced in columns
    model=&quot;claude-3.5-sonnet&quot;,
    inference_parameters=ChatCompletionInferenceParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=2000
    ),
    provider=None,  # Optional custom provider
    skip_health_check=False
)

builder.add_model_config(model_config)
```

**Inference Parameter Types:**
- `ChatCompletionInferenceParams`: For LLM generation (temperature, top_p, max_tokens, dynamic distributions)
- `EmbeddingInferenceParams`: For embeddings (encoding_format, dimensions)
- `ImageInferenceParams`: For image generation (extra_body for model-specific options)

**Model Properties:**
- `alias`: User-defined reference used in columns (e.g., `model_alias=&quot;my-claude&quot;`)
- `model`: Model identifier (e.g., &quot;claude-3.5-sonnet&quot;)
- `inference_parameters`: Supports fixed values or distributions for sampling
- `generation_type`: Auto-determined from inference_parameters (chat-completion, embedding, or image)

### 5. Available Sampler Types

**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/sampler_params.py`

**All Sampler Types (SamplerType enum):**

1. **Statistical Distributions:**
   - `uniform`: Continuous uniform distribution
   - `gaussian`: Normal/Gaussian distribution
   - `binomial`: Binomial distribution (success count)
   - `bernoulli`: Binary (0/1) outcome
   - `bernoulli_mixture`: Mixed zero-inflation with continuous distribution
   - `poisson`: Poisson count distribution
   - `scipy`: Any scipy.stats distribution

2. **Categorical/Discrete:**
   - `category`: Discrete categories with optional weights
   - `subcategory`: Hierarchical sampling based on parent category

3. **Temporal:**
   - `datetime`: Uniform datetime sampling within range
   - `timedelta`: Time offset relative to reference column

4. **Special:**
   - `uuid`: UUID generation (with optional prefix/formatting)
   - `person`: Synthetic person data with demographic filtering (from managed datasets or Faker)
   - `person_from_faker`: Pure Faker-based person generation

### 6. `data-designer validate` CLI Command

**Location:** `.venv/lib/python3.13/site-packages/data_designer/cli/commands/validate.py`

**Usage:**
```bash
data-designer validate &lt;config_source&gt;
```

**What it validates:**
- Checks configuration is well-formed
- Resolves all references (models, columns, seed datasets, etc.)
- Validates model aliases are defined
- Validates column dependencies and required columns
- Checks constraint validity
- Validates processor configurations

**Accepted Config Sources:**
1. **YAML/JSON files**: Local paths or HTTP(S) URLs
2. **Python modules**: `.py` files that define a `load_config_builder()` function

**Controller Implementation:**
```python
# In GenerationController.run_validate()
data_designer = DataDesigner()
data_designer.validate(config_builder)
```

Exits with code 0 on success, code 1 on validation failure.

### 7. `load_config_builder` Function Pattern

**Location:** `.venv/lib/python3.13/site-packages/data_designer/cli/utils/config_loader.py`

**Function Signature Required:**
```python
def load_config_builder() -&gt; DataDesignerConfigBuilder:
    &quot;&quot;&quot;Must return a DataDesignerConfigBuilder instance.&quot;&quot;&quot;
    ...
```

**What it should return:**
- A `DataDesignerConfigBuilder` instance (not BuilderConfig, not DataDesignerConfig)
- The builder is used by the CLI to access the full configuration

**Validation Requirements:**
1. The `.py` file **must** define `load_config_builder()` function
2. Function must be callable and return exactly a `DataDesignerConfigBuilder`
3. The CLI will call this function and validate the returned builder
4. The parent directory of the `.py` file is temporarily added to sys.path to support local imports

**Example Pattern:**
```python
# my_config.py
from data_designer.config import DataDesignerConfigBuilder

def load_config_builder() -&gt; DataDesignerConfigBuilder:
    builder = DataDesignerConfigBuilder()

    # Add model
    builder.add_model_config(...)

    # Add columns
    builder.add_column(...)

    return builder

# CLI usage:
# data-designer validate my_config.py
```

### 8. LLM-as-Judge Column Configuration

**Location:** `.venv/lib/python3.13/site-packages/data_designer/config/column_configs.py`

**Class:** `LLMJudgeColumnConfig` (extends `LLMTextColumnConfig`)

**Configuration Example:**
```python
from data_designer.config import LLMJudgeColumnConfig, Score

judge_col = LLMJudgeColumnConfig(
    name=&quot;quality_assessment&quot;,
    prompt=&quot;Evaluate the quality of: {{ generated_text }}&quot;,
    model_alias=&quot;claude-3.5-sonnet&quot;,
    system_prompt=&quot;You are a quality evaluator...&quot;,
    scores=[
        Score(
            name=&quot;Relevance&quot;,
            description=&quot;How relevant is the content to the topic?&quot;,
            options={
                1: &quot;Not relevant&quot;,
                2: &quot;Somewhat relevant&quot;,
                3: &quot;Highly relevant&quot;
            }
        ),
        Score(
            name=&quot;Fluency&quot;,
            description=&quot;How fluent and natural is the writing?&quot;,
            options={
                &quot;poor&quot;: &quot;Difficult to understand&quot;,
                &quot;good&quot;: &quot;Clear and readable&quot;,
                &quot;excellent&quot;: &quot;Excellent quality&quot;
            }
        )
    ],
    with_trace=TraceType.ALL_MESSAGES,
    extract_reasoning_content=False
)

builder.add_column(judge_col)
```

**Key Features:**
- Inherits all LLMTextColumnConfig capabilities (prompts, traces, reasoning extraction, tool calling)
- `scores`: List of `Score` objects defining evaluation dimensions
- Each score has: name, description, options (dict of value → description)
- Can reference other columns in prompts via Jinja2 templating
- Supports system prompts and tool aliases for complex evaluations

### Key File Locations Summary

| Component | File Path |
|-----------|-----------|
| Config Builder | `.../data_designer/config/config_builder.py` |
| Column Types | `.../data_designer/config/column_types.py` |
| Column Configs | `.../data_designer/config/column_configs.py` |
| Model Config | `.../data_designer/config/models.py` |
| Sampler Params | `.../data_designer/config/sampler_params.py` |
| Validator Params | `.../data_designer/config/validator_params.py` |
| Code Languages | `.../data_designer/config/utils/code_lang.py` |
| Validate CLI | `.../data_designer/cli/commands/validate.py` |
| Config Loader | `.../data_designer/cli/utils/config_loader.py` |
| Main Interface | `.../data_designer/interface/data_designer.py` |
| Data Config | `.../data_designer/config/data_designer_config.py` |
agentId: af6f4be5db40ad2af (use SendMessage with to: &#x27;af6f4be5db40ad2af&#x27; to continue this agent)
&lt;usage&gt;total_tokens: 87439
tool_uses: 25
duration_ms: 75309&lt;/usage&gt;</div></details>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Let me check a few specific details about the column configs and validation column.</em></span>
</div>
<div class="st-step">
<span class="st-num">4</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/column_configs.py</code>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Now let me check the validator params and the config builder&#x27;s `add_column` method, plus the `__init__` exports.</em></span>
</div>
<div class="st-step">
<span class="st-num">5</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/validator_params.py</code>
</span>
</div>
<div class="st-step">
<span class="st-num">6</span>
<span class="st-tool read">Read</span>
<span class="st-desc">
<code>data_designer/config/__init__.py</code>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Let me check the sampler params for category type and the config builder&#x27;s `add_column` signature.</em></span>
</div>
<div class="st-step">
<span class="st-num">7</span>
<span class="st-tool grep">Grep</span>
<span class="st-desc">
<code>class CategorySamplerParams</code><details class="st-expand"><summary><span class="st-arrow">&#9654;</span> results (1,072 chars)</summary><div>45:class CategorySamplerParams(ConfigBase):
46-    &quot;&quot;&quot;Parameters for categorical sampling with optional probability weighting.
47-
48-    Samples values from a discrete set of categories. When weights are provided, values are
49-    sampled according to their assigned probabilities. Without weights, uniform sampling is used.
50-
51-    Attributes:
52-        values: List of possible categorical values to sample from. Can contain strings, integers,
53-            or floats. Must contain at least one value.
54-        weights: Optional unnormalized probability weights for each value. If provided, must be
55-            the same length as `values`. Weights are automatically normalized to sum to 1.0.
56-            Larger weights result in higher sampling probability for the corresponding value.
57-    &quot;&quot;&quot;
58-
59-    values: list[str | int | float] = Field(
60-        ...,
61-        min_length=1,
62-        description=&quot;List of possible categorical values that can be sampled from.&quot;,
63-    )
64-    weights: list[float] | None = Field(
65-        default=None,</div></details>
</span>
</div>
<div class="st-step">
<span class="st-num">8</span>
<span class="st-tool grep">Grep</span>
<span class="st-desc">
<code>def add_column</code><details class="st-expand"><summary><span class="st-arrow">&#9654;</span> results (1,590 chars)</summary><div>267:    def add_column(
268-        self,
269-        column_config: ColumnConfigT | None = None,
270-        *,
271-        name: str | None = None,
272-        column_type: DataDesignerColumnType | None = None,
273-        **kwargs,
274-    ) -&gt; Self:
275-        &quot;&quot;&quot;Add a Data Designer column configuration to the current Data Designer configuration.
276-
277-        If no column config object is provided, you must provide the `name`, `column_type`, and any
278-        additional keyword arguments that are required by the column config constructor.
279-
280-        Args:
281-            column_config: Data Designer column config object to add.
282-            name: Name of the column to add. This is only used if `column_config` is not provided.
283-            column_type: Column type to add. This is only used if `column_config` is not provided.
284-            **kwargs: Additional keyword arguments to pass to the column constructor.
285-
286-        Returns:
287-            The current Data Designer config builder instance.
288-
289-        Raises:
290-            BuilderConfigurationError: If the column name collides with an existing seed dataset column.
291-        &quot;&quot;&quot;
292-        if column_config is None:
293-            if name is None or column_type is None:
294-                raise BuilderConfigurationError(
295-                    &quot;🛑 You must provide either a &#x27;column_config&#x27; object or &#x27;name&#x27; *and* &#x27;column_type&#x27; &quot;
296-                    f&quot;with additional keyword arguments. You provided {column_config=}, {name=}, and {column_type=}.&quot;
297-                )</div></details>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Now I have everything I need. Let me write the config file.</em></span>
</div>
<div class="st-step">
<span class="st-num">9</span>
<span class="st-tool write">Write</span>
<span class="st-desc">
<code>config.py</code><details class="st-expand"><summary><span class="st-arrow">&#9654;</span> written content (6,872 chars)</summary><div>from data_designer.config import (
    CategorySamplerParams,
    CodeLang,
    CodeValidatorParams,
    DataDesignerConfigBuilder,
    LLMCodeColumnConfig,
    LLMJudgeColumnConfig,
    LLMTextColumnConfig,
    SamplerColumnConfig,
    Score,
    ValidationColumnConfig,
    ValidatorType,
)


def load_config_builder() -&gt; DataDesignerConfigBuilder:
    builder = DataDesignerConfigBuilder()

    # --- Difficulty level ---
    builder.add_column(
        SamplerColumnConfig(
            name=&quot;difficulty&quot;,
            sampler_type=&quot;category&quot;,
            params=CategorySamplerParams(
                values=[&quot;beginner&quot;, &quot;intermediate&quot;, &quot;advanced&quot;],
                weights=[0.35, 0.40, 0.25],
            ),
        )
    )

    # --- Subtopic ---
    builder.add_column(
        SamplerColumnConfig(
            name=&quot;subtopic&quot;,
            sampler_type=&quot;category&quot;,
            params=CategorySamplerParams(
                values=[
                    &quot;data cleaning&quot;,
                    &quot;exploratory analysis&quot;,
                    &quot;aggregation and groupby operations&quot;,
                    &quot;visualization with matplotlib/seaborn&quot;,
                    &quot;statistical testing&quot;,
                    &quot;feature engineering&quot;,
                    &quot;working with messy or missing data&quot;,
                ],
            ),
        )
    )

    # --- Natural language instruction ---
    builder.add_column(
        LLMTextColumnConfig(
            name=&quot;instruction&quot;,
            model_alias=&quot;nvidia-text&quot;,
            prompt=(
                &quot;Write a clear, specific natural-language instruction for a {{ difficulty }}-level &quot;
                &quot;Python data science task in the area of {{ subtopic }}.\n\n&quot;
                &quot;The instruction should describe a concrete problem that a data scientist might encounter &quot;
                &quot;and should be solvable in a single self-contained Python script. &quot;
                &quot;Do NOT include any code in your response — only the task description.\n\n&quot;
                &quot;Vary the complexity to match the difficulty level:\n&quot;
                &quot;- beginner: simple operations using pandas, basic plots, straightforward cleaning\n&quot;
                &quot;- intermediate: multi-step workflows, combining libraries, moderate data manipulation\n&quot;
                &quot;- advanced: complex pipelines, statistical modeling, performance optimization, &quot;
                &quot;advanced visualization techniques\n\n&quot;
                &quot;Return only the instruction text, nothing else.&quot;
            ),
        )
    )

    # --- Python solution ---
    builder.add_column(
        LLMCodeColumnConfig(
            name=&quot;solution&quot;,
            model_alias=&quot;nvidia-reasoning&quot;,
            code_lang=CodeLang.PYTHON,
            prompt=(
                &quot;Write a complete, self-contained Python solution for the following &quot;
                &quot;{{ difficulty }}-level data science task:\n\n&quot;
                &quot;{{ instruction }}\n\n&quot;
                &quot;Requirements:\n&quot;
                &quot;- The code must be syntactically valid and runnable as-is.\n&quot;
                &quot;- Include all necessary imports.\n&quot;
                &quot;- Generate or define any sample data needed (do not load from external files).\n&quot;
                &quot;- Use appropriate libraries (pandas, numpy, matplotlib, seaborn, scipy, &quot;
                &quot;scikit-learn, etc.) as needed.\n&quot;
                &quot;- Add brief inline comments explaining key steps.\n&quot;
                &quot;- Follow Python best practices and PEP 8 style.\n&quot;
            ),
        )
    )

    # --- Code validation ---
    builder.add_column(
        ValidationColumnConfig(
            name=&quot;code_validation&quot;,
            target_columns=[&quot;solution&quot;],
            validator_type=ValidatorType.CODE,
            validator_params=CodeValidatorParams(code_lang=CodeLang.PYTHON),
        )
    )

    # --- LLM-as-judge evaluation ---
    builder.add_column(
        LLMJudgeColumnConfig(
            name=&quot;quality_score&quot;,
            model_alias=&quot;nvidia-reasoning&quot;,
            prompt=(
                &quot;Evaluate the following Python data science solution against its instruction.\n\n&quot;
                &quot;**Difficulty:** {{ difficulty }}\n&quot;
                &quot;**Subtopic:** {{ subtopic }}\n\n&quot;
                &quot;**Instruction:**\n{{ instruction }}\n\n&quot;
                &quot;**Solution:**\n```python\n{{ solution }}\n```\n\n&quot;
                &quot;Score the solution on each dimension below. Be rigorous but fair.&quot;
            ),
            scores=[
                Score(
                    name=&quot;correctness&quot;,
                    description=(
                        &quot;Does the code correctly implement what the instruction asks for? &quot;
                        &quot;Consider logical correctness, proper use of libraries, and whether &quot;
                        &quot;the output would match expectations.&quot;
                    ),
                    options={
                        1: &quot;Completely wrong or does not address the instruction&quot;,
                        2: &quot;Major errors that would produce incorrect results&quot;,
                        3: &quot;Partially correct but has notable issues&quot;,
                        4: &quot;Mostly correct with minor issues&quot;,
                        5: &quot;Fully correct and complete implementation&quot;,
                    },
                ),
                Score(
                    name=&quot;code_quality&quot;,
                    description=(
                        &quot;Assess the code style, readability, and Pythonic practices. &quot;
                        &quot;Consider variable naming, structure, comments, and adherence &quot;
                        &quot;to best practices.&quot;
                    ),
                    options={
                        1: &quot;Very poor quality, unreadable or badly structured&quot;,
                        2: &quot;Below average, multiple style issues&quot;,
                        3: &quot;Acceptable but could be improved&quot;,
                        4: &quot;Good quality, clean and well-structured&quot;,
                        5: &quot;Excellent quality, exemplary Python code&quot;,
                    },
                ),
                Score(
                    name=&quot;instruction_clarity&quot;,
                    description=(
                        &quot;How clear and well-defined is the original instruction? &quot;
                        &quot;Consider whether the task is unambiguous, specific enough to implement, &quot;
                        &quot;and appropriately scoped for the stated difficulty level.&quot;
                    ),
                    options={
                        1: &quot;Vague or incomprehensible instruction&quot;,
                        2: &quot;Unclear, missing important details&quot;,
                        3: &quot;Adequate but could be more specific&quot;,
                        4: &quot;Clear and well-defined&quot;,
                        5: &quot;Excellent clarity, perfectly scoped for difficulty level&quot;,
                    },
                ),
            ],
        )
    )

    return builder
</div></details><div>File created successfully at: ./config.py</div>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc"><em>Now let me validate it.</em></span>
</div>
<div class="st-step ok">
<span class="st-num">10</span>
<span class="st-tool bash">Bash</span>
<span class="st-desc">
<code>.venv/bin/data-designer validate config.py</code><div>  ────────────────────────── Data Designer Validate ──────────────────────────

  Config: config.py

[21:25:04] [INFO] ✅ Validation passed
  ✅  Configuration is valid</div>
</span>
</div>
<div class="st-step">
<span class="st-tool msg">Agent</span>
<span class="st-desc">
<details class="st-expand"><summary><span class="st-arrow">&#9654;</span> message (999 chars)</summary><div>The configuration validates successfully. Here&#x27;s what the pipeline does:

1. **`difficulty`** — Sampler column that randomly picks from beginner (35%), intermediate (40%), advanced (25%)
2. **`subtopic`** — Sampler column that uniformly samples from 7 data science subtopics (data cleaning, exploratory analysis, aggregation/groupby, visualization, statistical testing, feature engineering, messy/missing data)
3. **`instruction`** — LLM text column (`nvidia-text`) that generates a natural language task description conditioned on the difficulty and subtopic
4. **`solution`** — LLM code column (`nvidia-reasoning`) that generates a complete, self-contained Python solution for the instruction
5. **`code_validation`** — Validation column that checks the generated Python code for syntactic correctness
6. **`quality_score`** — LLM judge column (`nvidia-reasoning`) that scores each (instruction, solution) pair on three dimensions (correctness, code quality, instruction clarity) using a 1–5 scale</div></details>
</span>
</div>
</div>
</details>