Custom Columns
Custom columns let you implement your own generation logic using Python functions. Use them for multi-step LLM workflows, external API integration, or any scenario requiring full programmatic control. For reusable, distributable components, see Plugins instead.
Quick Start
import data_designer.config as dd

@dd.custom_column_generator(required_columns=["name"])
def create_greeting(row: dict) -> dict:
    row["greeting"] = f"Hello, {row['name']}!"
    return row

config_builder.add_column(
    dd.CustomColumnConfig(
        name="greeting",
        generator_function=create_greeting,
    )
)
Function Signatures
Three signatures are supported. Parameter names are validated:
| Args | Signature | Use Case |
|---|---|---|
| 1 | fn(row) -> dict | Simple transforms |
| 2 | fn(row, generator_params) -> dict | With typed params |
| 3 | fn(row, generator_params, models) -> dict | LLM access via models dict |
For full_column strategy, use df instead of row.
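Because parameter names are validated, it can help to check a function locally before wiring it into a config. The following is a minimal sketch of such a pre-flight check, not the library's own validator: `check_signature` and `EXPECTED_TAIL` are hypothetical names, and the rule it applies (first parameter `row` or `df`, then `generator_params`, then `models`) is inferred from the table above.

```python
import inspect

# Assumed from the signature table: after the first parameter (`row`, or `df`
# for full_column), the remaining names are `generator_params` and `models`.
EXPECTED_TAIL = ("generator_params", "models")


def check_signature(fn, first_param="row"):
    """Return a list of naming problems; an empty list means the names line up."""
    names = list(inspect.signature(fn).parameters)
    expected = [first_param, *EXPECTED_TAIL][: len(names)]
    problems = []
    if not 1 <= len(names) <= 3:
        problems.append(f"expected 1-3 parameters, got {len(names)}")
    for got, want in zip(names, expected):
        if got != want:
            problems.append(f"parameter {got!r} should be named {want!r}")
    return problems


def good(row, generator_params, models):
    return row


def bad(row, params):  # second parameter is misnamed
    return row


print(check_signature(good))  # []
print(check_signature(bad))
```

Pass `first_param="df"` when checking a function intended for the full_column strategy.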
For LLM access without params, use generator_params: None:
@dd.custom_column_generator(required_columns=["name"], model_aliases=["my-model"])
def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row
Model aliases are validated before generation starts. If an alias doesn't exist in your config, an error is raised during the health check.
Generation Strategies
| Strategy | Input | Use Case |
|---|---|---|
| cell_by_cell (default) | row: dict | LLM calls, row-by-row logic |
| full_column | df: DataFrame | Vectorized DataFrame operations |
Recommendation: Use cell_by_cell for LLM calls. The framework handles parallelization automatically. Use full_column only for vectorized operations that don't involve LLM calls.
For full_column, set generation_strategy=dd.GenerationStrategy.FULL_COLUMN.
The Decorator
@dd.custom_column_generator(
    required_columns=["col1"],        # DAG ordering
    side_effect_columns=["extra"],    # Additional columns created
    model_aliases=["model1"],         # Required for LLM access
)
Models Dict
The third argument is a dict of ModelFacade instances, keyed by alias. Declare every model your generator uses in model_aliases; this populates the models dict and enables health checks before generation starts.
@dd.custom_column_generator(model_aliases=["my-model"])
def my_generator(row: dict, generator_params: None, models: dict) -> dict:
    model = models["my-model"]
    response, trace = model.generate(
        prompt="...",
        parser=my_custom_parser,  # optional, defaults to identity
        system_prompt="...",
        max_correction_steps=3,
    )
    row["result"] = response
    return row
This gives you direct access to all ModelFacade capabilities: custom parsers, correction loops, structured output, tool use, etc.
Configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Column name |
| generator_function | Callable | Yes | Decorated function |
| generation_strategy | GenerationStrategy | No | CELL_BY_CELL or FULL_COLUMN |
| generator_params | BaseModel | No | Typed params passed to function |
| allow_resize | bool | No | Allow 1:N or N:1 generation |
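To make the generator_params row concrete, here is a small sketch of the two-argument signature with a pydantic model. `GreetingParams` and its fields are hypothetical, and the decorator is left off so the snippet runs without the library installed.

```python
from pydantic import BaseModel


class GreetingParams(BaseModel):
    """Hypothetical typed params for a greeting generator."""

    salutation: str = "Hello"
    punctuation: str = "!"


def create_greeting(row: dict, generator_params: GreetingParams) -> dict:
    # Typed params give attribute access and validated defaults
    row["greeting"] = (
        f"{generator_params.salutation}, {row['name']}{generator_params.punctuation}"
    )
    return row


result = create_greeting({"name": "Alice"}, GreetingParams(salutation="Welcome"))
print(result["greeting"])  # Welcome, Alice!
```

In a pipeline, the same function would carry the `@dd.custom_column_generator` decorator, and the params instance would presumably be supplied through the generator_params field of CustomColumnConfig.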
Resizing (1:N and N:1)
FULL_COLUMN: Set allow_resize=True and return a DataFrame with more or fewer rows than the input:
import pandas as pd

@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["variation_id"],
)
def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:
    rows = []
    for _, row in df.iterrows():
        for i in range(3):  # Generate 3 variations per input
            rows.append({
                "topic": row["topic"],
                "question": f"Question {i+1} about {row['topic']}",
                "variation_id": i,
            })
    return pd.DataFrame(rows)

dd.CustomColumnConfig(
    name="question",
    generator_function=expand_topics,
    generation_strategy=dd.GenerationStrategy.FULL_COLUMN,
    allow_resize=True,
)
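The example above expands the batch. For the opposite direction, a FULL_COLUMN function can return fewer rows than it received. The sketch below deduplicates topics with plain pandas. `dedupe_topics` and the `topic_key` column are hypothetical, and the function is shown without the decorator so that it runs standalone.

```python
import pandas as pd


def dedupe_topics(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # The new column holds a normalized key; rows sharing a key are duplicates
    out["topic_key"] = out["topic"].str.strip().str.lower()
    # Keep the first occurrence of each key, then renumber the surviving rows
    return out.drop_duplicates(subset="topic_key", keep="first").reset_index(drop=True)


batch = pd.DataFrame({"topic": ["Rust", "rust ", "Go", "Python", "go"]})
result = dedupe_topics(batch)
print(len(batch), "->", len(result))  # 5 -> 3
```

Registered with generation_strategy=dd.GenerationStrategy.FULL_COLUMN and allow_resize=True, a function like this would shrink each batch it processes.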
CELL_BY_CELL: With allow_resize=True, your function may return a single row (dict) or multiple rows (list[dict]). Return [] to drop that input row.
@dd.custom_column_generator(required_columns=["id"])
def expand_row(row: dict) -> list[dict]:
    return [
        {**row, "variant": "a"},
        {**row, "variant": "b"},
    ]

dd.CustomColumnConfig(
    name="variant",
    generator_function=expand_row,
    generation_strategy=dd.GenerationStrategy.CELL_BY_CELL,
    allow_resize=True,
)
Use cases:
- Expansion (1:N): Generate multiple variations per input
- Retraction (N:1): Filter, aggregate, or deduplicate records (FULL_COLUMN), or return [] per row (CELL_BY_CELL)
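The CELL_BY_CELL return conventions can be illustrated in plain Python. This sketch only models the semantics described above (a dict is one row, a list is zero or more rows, [] drops the input). `flatten` is a hypothetical helper, not the framework's implementation, and `split_or_drop` is undecorated so that it runs on its own.

```python
def split_or_drop(row: dict) -> list[dict]:
    """Drop rows with odd ids; emit two variants for rows with even ids."""
    if row["id"] % 2:
        return []  # an empty list drops this input row
    return [{**row, "variant": v} for v in ("a", "b")]


def flatten(results):
    """Hypothetical helper: a dict counts as one row, a list as zero or more rows."""
    rows = []
    for result in results:
        rows.extend([result] if isinstance(result, dict) else result)
    return rows


batch = [{"id": i} for i in range(1, 5)]
resized = flatten(split_or_drop(row) for row in batch)
print(len(batch), "->", len(resized))  # 4 -> 4
print([r["id"] for r in resized])      # [2, 2, 4, 4]
```

Here the row count happens to stay the same, but the rows do not: a single function can expand and retract within one batch.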
Multi-Turn Example
@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["draft", "critique"],
    model_aliases=["writer", "editor"],
)
def writer_editor(row: dict, generator_params: None, models: dict) -> dict:
    draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'")
    critique, _ = models["editor"].generate(prompt=f"Critique: {draft}")
    revised, _ = models["writer"].generate(prompt=f"Revise based on: {critique}\n\nOriginal: {draft}")
    row["final_text"] = revised
    row["draft"] = draft
    row["critique"] = critique
    return row
Development Testing
Test generators with real LLM calls without running the full pipeline:
data_designer = DataDesigner()
models = data_designer.get_models(["my-model"])
result = my_generator({"name": "Alice"}, None, models)
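For fast unit tests that skip the network entirely, a generator can also be called with stand-in models. The sketch below makes two assumptions: `StubModel` is hypothetical and only mimics the `(response, trace)` return shape used in the examples on this page, and `writer_editor` is repeated without its decorator so that the snippet runs on its own.

```python
class StubModel:
    """Hypothetical stand-in for a ModelFacade: records prompts, returns canned text."""

    def __init__(self, reply: str):
        self.reply = reply
        self.prompts = []

    def generate(self, prompt, **kwargs):
        self.prompts.append(prompt)
        # Number each reply so the call order is visible in the output
        return f"{self.reply} #{len(self.prompts)}", None  # (response, trace)


def writer_editor(row: dict, generator_params: None, models: dict) -> dict:
    draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'")
    critique, _ = models["editor"].generate(prompt=f"Critique: {draft}")
    revised, _ = models["writer"].generate(
        prompt=f"Revise based on: {critique}\n\nOriginal: {draft}"
    )
    row["final_text"] = revised
    row["draft"] = draft
    row["critique"] = critique
    return row


models = {"writer": StubModel("draft"), "editor": StubModel("critique")}
result = writer_editor({"topic": "tides"}, None, models)
print(result["final_text"])  # draft #2
```

Recording the prompts makes it possible to assert on the call order and on how each step's output feeds the next, without spending tokens.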