Custom Columns
Custom columns let you implement your own generation logic using Python functions. Use them for multi-step LLM workflows, external API integration, or any scenario requiring full programmatic control. For reusable, distributable components, see Plugins instead.
Quick Start
import data_designer.config as dd

@dd.custom_column_generator(required_columns=["name"])
def create_greeting(row: dict) -> dict:
    row["greeting"] = f"Hello, {row['name']}!"
    return row

# config_builder is assumed to be an existing config builder instance.
config_builder.add_column(
    dd.CustomColumnConfig(
        name="greeting",
        generator_function=create_greeting,
    )
)
Function Signatures
Three signatures are supported. Parameter names are validated, so the arguments must be named exactly as shown:
| Args | Signature | Use Case |
|---|---|---|
| 1 | fn(row) -> dict | Simple transforms |
| 2 | fn(row, generator_params) -> dict | With typed params |
| 3 | fn(row, generator_params, models) -> dict | LLM access via models dict |
For full_column strategy, use df instead of row.
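The two-argument form could look like the sketch below. In the library, generator_params is a Pydantic BaseModel (see Configuration); a frozen dataclass stands in here so the snippet runs with the standard library alone, and the decorator is left out for the same reason. GreetingParams and greet_with_params are hypothetical names.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a Pydantic BaseModel holding typed params.
@dataclass(frozen=True)
class GreetingParams:
    salutation: str = "Hello"
    punctuation: str = "!"


# Two-argument signature: the row dict plus the typed params object.
# In real use this function would carry @dd.custom_column_generator(...).
def greet_with_params(row: dict, generator_params: GreetingParams) -> dict:
    row["greeting"] = (
        f"{generator_params.salutation}, {row['name']}{generator_params.punctuation}"
    )
    return row


result = greet_with_params({"name": "Alice"}, GreetingParams(salutation="Welcome"))
print(result["greeting"])
```

Keeping tunable values in the params object rather than in module-level constants lets the same function be reused across columns with different settings.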
For LLM access without params, use generator_params: None:
@dd.custom_column_generator(required_columns=["name"], model_aliases=["my-model"])
def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row
Model aliases are validated before generation starts. If an alias doesn't exist in your config, an error is raised during the health check.
Generation Strategies
| Strategy | Input | Use Case |
|---|---|---|
| cell_by_cell (default) | row: dict | LLM calls, row-by-row logic |
| full_column | df: DataFrame | Vectorized DataFrame operations |
Recommendation: Use cell_by_cell for LLM calls. The framework handles parallelization automatically. Use full_column only for vectorized operations that don't involve LLM calls.
For full_column, set generation_strategy=dd.GenerationStrategy.FULL_COLUMN.
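To illustrate why cell_by_cell suits LLM calls: each row is independent, so requests can overlap while waiting on the network. The sketch below is a conceptual stand-in built on a thread pool, not Data Designer's actual scheduler. slow_generator and run_cell_by_cell are hypothetical names, and time.sleep stands in for an LLM request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_generator(row: dict) -> dict:
    """Row-level generator; the sleep stands in for a network-bound LLM call."""
    time.sleep(0.05)
    return {**row, "greeting": f"Hello, {row['name']}!"}


def run_cell_by_cell(rows: list[dict], max_workers: int = 4) -> list[dict]:
    """Apply the generator to every row concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_generator, rows))


rows = [{"name": n} for n in ["Alice", "Bob", "Carol", "Dan"]]
start = time.perf_counter()
results = run_cell_by_cell(rows)
elapsed = time.perf_counter() - start
print([r["greeting"] for r in results], f"{elapsed:.2f}s")
```

A full_column function is handed a DataFrame rather than a single row, so LLM calls placed inside it would presumably not benefit from this kind of per-row fan-out.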
The Decorator
@dd.custom_column_generator(
    required_columns=["col1"],  # DAG ordering
    side_effect_columns=["extra"],  # Additional columns created
    model_aliases=["model1"],  # Required for LLM access
)
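As a rough illustration of what DAG ordering means for required_columns: a column can only be generated after the columns it depends on. The sketch below uses the standard library's graphlib to derive one valid order. The column names and the dependencies mapping are made up, and this is not necessarily how Data Designer resolves the graph internally.

```python
from graphlib import TopologicalSorter

# Hypothetical columns mapped to the columns they require.
dependencies = {
    "name": set(),
    "topic": set(),
    "greeting": {"name"},
    "final_text": {"topic", "greeting"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Leaving a dependency out of required_columns would remove the corresponding edge, so nothing would guarantee the input column exists when the function runs.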
Models Dict
The third argument is a dict of ModelFacade instances, keyed by alias. Declare every model your custom column generator uses in model_aliases: this populates the models dict and enables health checks before generation starts.
@dd.custom_column_generator(model_aliases=["my-model"])
def my_generator(row: dict, generator_params: None, models: dict) -> dict:
    model = models["my-model"]
    response, trace = model.generate(
        prompt="...",
        parser=my_custom_parser,  # optional, defaults to identity
        system_prompt="...",
        max_correction_steps=3,
    )
    row["result"] = response
    return row
This gives you direct access to all ModelFacade capabilities: custom parsers, correction loops, structured output, tool use, etc.
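A custom parser might look like the following sketch. It assumes the parser is called with the raw response text and that its return value becomes the first element of the tuple returned by generate. That contract is inferred from the identity default rather than confirmed here, and parse_json_reply is a hypothetical name.

```python
import json
import re


def parse_json_reply(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a reply as JSON."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    return json.loads(match.group(0))


reply = 'Sure, here it is: {"sentiment": "positive", "score": 0.9} Hope that helps.'
parsed = parse_json_reply(reply)
print(parsed["sentiment"])
```

Defining the parser at module level rather than as a lambda keeps it pickleable, which matters if generation runs across processes.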
Configuration
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | Yes | Column name |
| generator_function | Callable | Yes | Decorated function |
| generation_strategy | GenerationStrategy | No | CELL_BY_CELL or FULL_COLUMN |
| generator_params | BaseModel | No | Typed params passed to function |
Multi-Turn Example
@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["draft", "critique"],
    model_aliases=["writer", "editor"],
)
def writer_editor(row: dict, generator_params: None, models: dict) -> dict:
    draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'")
    critique, _ = models["editor"].generate(prompt=f"Critique: {draft}")
    revised, _ = models["writer"].generate(prompt=f"Revise based on: {critique}\n\nOriginal: {draft}")
    row["final_text"] = revised
    row["draft"] = draft
    row["critique"] = critique
    return row
Development Testing
Test generators with real LLM calls without running the full pipeline:
data_designer = DataDesigner()
models = data_designer.get_models(["my-model"])
result = my_generator({"name": "Alice"}, None, models)
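For logic checks that should not hit a real endpoint, a stub can take the place of the models dict. The sketch below assumes only what the examples above show: that generate accepts a prompt keyword and returns a (response, trace) pair. StubModel is a hypothetical class, not part of Data Designer, and the generator is left undecorated so the snippet runs without the library.

```python
class StubModel:
    """Hypothetical test double that mimics the generate() call shape."""

    def __init__(self, canned_response: str) -> None:
        self.canned_response = canned_response
        self.prompts: list[str] = []

    def generate(self, prompt: str, **kwargs) -> tuple[str, list]:
        # Record the prompt so tests can assert on it; return an empty trace.
        self.prompts.append(prompt)
        return self.canned_response, []


def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row


stub = StubModel("Hi Alice!")
result = generate_message({"name": "Alice"}, None, {"my-model": stub})
print(result, stub.prompts)
```

Recording prompts on the stub makes it possible to assert on what the generator sent as well as on what it wrote back to the row.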