mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

feat: Add CustomColumnGenerator for user-defined column generation (#254)

* first attempt

* iterating a bit

* some improvements + multiturn example

* adapting to new monorepo structure

* refining

* fixed test

* fixing license headers

* adding docs

* adding test for failed generation

* allowing strategy to be picked

* renaming argument

* lint

* remove recommendation

* renaming for consistency

* addressing comments pt1

* addressing comments pt2

* addressing comments pt3

* adding a mock for development

* addressing greptile comments

* revamping

* docs: streamline custom columns documentation

* docs: simplify CustomColumnConfig docstring

Remove verbose code example and detailed function signatures from
docstring to match the pattern of other config classes in the file.

* test: clean up custom column tests

- Remove tests for private _custom_column_metadata attribute
- Combine redundant generator creation tests
- Reuse stub_resource_provider and stub_model_facade fixtures

* test: consolidate custom column tests

Reduce from 26 to 11 tests while maintaining coverage:
- Combine redundant config/decorator/creation tests
- Use parametrized tests for error conditions
- Remove duplicate validation tests for full_column strategy
- Simplify section headers

* refactor: deduplicate CustomColumnGenerator logic

Merge cell-by-cell and full-column code paths:
- _generate_cell_by_cell + _generate_full_column -> _generate
- _validate_output_columns + _validate_output_columns_df -> _validate_output

* chore: merge example files into single notebook-style example.py

Combine example.py, example_multiturn.py, and example_benchmark_strategies.py
into a single file with #%% cell markers for Jupyter/VS Code notebook mode.

* addressing greptile comments

* refactor: reuse generate_text in generate_text_batch

* refactor: replace CustomColumnContext with models dict

- Remove CustomColumnContext class; users now receive models dict directly
- Add DataDesigner.get_models() for experimentation outside pipeline
- Make parser optional in ModelFacade.generate() (defaults to identity)
- Validate parameter names: row/df, generator_params, models
- Update examples, tests, and docs for new API

* fix: address PR review comments from Nabin and greptile

- Make decorator metadata public (custom_column_metadata)
- Simplify get_generation_strategy() to directly return config value
- Use !r formatting in error messages
- Use lazy imports pattern for pandas (TYPE_CHECKING + lazy_heavy_imports)
- Remove redundant error logging before re-raise
- Validate max 3 positional parameters
- Use GenerationStrategy enum in example instead of string

* fix: replace lambda with module-level identity function in facade

Use pickleable _identity function instead of lambda x: x for the
default parser argument, ensuring compatibility with multiprocessing.

* fix: restore inherited attributes in LLM column docstrings

Restores the "Inherited Attributes" sections that were unintentionally
removed from LLMCodeColumnConfig, LLMStructuredColumnConfig, and
LLMJudgeColumnConfig docstrings.

* docs: clarify model_aliases is required for LLM access

Updated documentation and docstrings to clarify that model_aliases
populates the models dict (not just health checks).

* fix: address PR review comments from nabinchha

- clarify model_aliases requirement in docs
- add note about model alias validation during health check
- combine two loops into one in _run_model_health_check_if_needed
- add signature validation at decoration time
- enforce decorated functions in CustomColumnConfig validator
- simplify generator to only validate strategy-specific first param

* fix: address remaining PR review comments

- remove example.py (development artifact)
- fix get_models return type to dict[str, ModelFacade]

* test: update tests for decoration-time validation

- expect ValidationError instead of InvalidConfigError for non-callable
- split param validation test into decoration-time and runtime tests

2026-02-03 19:23:39 -03:00


Custom Columns

Custom columns let you implement your own generation logic using Python functions. Use them for multi-step LLM workflows, external API integration, or any scenario requiring full programmatic control. For reusable, distributable components, see Plugins instead.

Quick Start

import data_designer.config as dd

@dd.custom_column_generator(required_columns=["name"])
def create_greeting(row: dict) -> dict:
    row["greeting"] = f"Hello, {row['name']}!"
    return row

config_builder.add_column(
    dd.CustomColumnConfig(
        name="greeting",
        generator_function=create_greeting,
    )
)

Function Signatures

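Three signatures are supported, taking one, two, or three positional parameters. Parameter names are validated and must be row (or df), generator_params, and models, in that order: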

Args	Signature	Use Case
1	`fn(row) -> dict`	Simple transforms
2	`fn(row, generator_params) -> dict`	With typed params
3	`fn(row, generator_params, models) -> dict`	LLM access via models dict

For full_column strategy, use df instead of row.
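
To make the naming rule concrete, here is a minimal sketch of how such a check could be written with the standard library's inspect module. The helper check_signature and its error messages are hypothetical and only illustrate the rule described above; this is not Data Designer's own validation code.

```python
import inspect

# Allowed names per position; the first depends on the generation strategy.
ALLOWED_FIRST = {"row", "df"}
ALLOWED_REST = ("generator_params", "models")


def check_signature(fn) -> list[str]:
    """Return the parameter names if they follow the documented convention.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    names = list(inspect.signature(fn).parameters)
    if not 1 <= len(names) <= 3:
        raise TypeError(f"{fn.__name__!r} must take 1 to 3 parameters, got {len(names)}")
    if names[0] not in ALLOWED_FIRST:
        raise TypeError(f"first parameter must be 'row' or 'df', got {names[0]!r}")
    # Any further parameters must be generator_params, then models.
    for got, expected in zip(names[1:], ALLOWED_REST):
        if got != expected:
            raise TypeError(f"expected parameter {expected!r}, got {got!r}")
    return names


def ok(row, generator_params, models):
    return row


def bad(record):
    return record


print(check_signature(ok))
try:
    check_signature(bad)
except TypeError as exc:
    print(f"rejected: {exc}")
```

A function that skips generator_params, such as fn(row, models), would be rejected by this sketch, which is why the next example keeps the parameter and annotates it as None.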

To access LLMs without typed params, keep the generator_params parameter and annotate it as None:

@dd.custom_column_generator(required_columns=["name"], model_aliases=["my-model"])
def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row

Model aliases are validated before generation starts. If an alias doesn't exist in your config, an error is raised during the health check.

Generation Strategies

Strategy	Input	Use Case
`cell_by_cell` (default)	`row: dict`	LLM calls, row-by-row logic
`full_column`	`df: DataFrame`	Vectorized DataFrame operations

Recommendation: Use cell_by_cell for LLM calls. The framework handles parallelization automatically. Use full_column only for vectorized operations that don't involve LLM calls.

For full_column, set generation_strategy=dd.GenerationStrategy.FULL_COLUMN.
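
The difference between the two strategies can be sketched without the framework. In the example below, the row-wise function stands in for a cell_by_cell generator that waits on an LLM, and the batch function stands in for a full_column generator. Two assumptions to note: a list of dicts replaces the DataFrame so the sketch needs no third-party packages, and the thread pool is only an illustration of why independent rows parallelize well, not a description of Data Designer's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_greeting(row: dict) -> dict:
    """Stand-in for a cell_by_cell generator that waits on an LLM call."""
    time.sleep(0.05)  # simulated network latency
    return {**row, "greeting": f"Hello, {row['name']}!"}


def add_name_length(rows: list[dict]) -> list[dict]:
    """Stand-in for a full_column generator: one call sees every row."""
    return [{**row, "name_length": len(row["name"])} for row in rows]


rows = [{"name": n} for n in ("Alice", "Bob", "Carol", "Dave")]

# Row-wise work is independent, so it can be dispatched concurrently.
# pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    greeted = list(pool.map(slow_greeting, rows))

# Whole-column work runs once over the full batch.
measured = add_name_length(greeted)
print(measured[0])
```

With latency-bound calls, the row-wise version finishes in roughly the time of one call rather than four, while the batch version gains nothing from concurrency because it is a single cheap pass.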

The Decorator

@dd.custom_column_generator(
    required_columns=["col1"],        # DAG ordering
    side_effect_columns=["extra"],    # Additional columns created
    model_aliases=["model1"],         # Required for LLM access
)
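
The "DAG ordering" comment indicates that required_columns tells the framework which columns must exist before this one is generated. As a rough illustration of that idea, the sketch below orders a few hypothetical columns with the standard library's graphlib. The column names and the mapping are invented for the example, and the framework's own ordering logic may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical columns: each maps to the columns it requires.
required_columns = {
    "name": [],
    "topic": [],
    "greeting": ["name"],
    "summary": ["greeting", "topic"],
}

# TopologicalSorter expects {node: predecessors}, which matches the mapping above.
order = list(TopologicalSorter(required_columns).static_order())
print(order)
```

Any valid order places name before greeting, and both greeting and topic before summary. Omitting a dependency from required_columns would remove that guarantee.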

Models Dict

The third argument is a dict of ModelFacade instances, keyed by alias. Declare every model your generator uses in model_aliases: this populates the models dict and enables health checks before generation starts.

@dd.custom_column_generator(model_aliases=["my-model"])
def my_generator(row: dict, generator_params: None, models: dict) -> dict:
    model = models["my-model"]
    response, trace = model.generate(
        prompt="...",
        parser=my_custom_parser,  # optional, defaults to identity
        system_prompt="...",
        max_correction_steps=3,
    )
    row["result"] = response
    return row

This gives you direct access to all ModelFacade capabilities: custom parsers, correction loops, structured output, tool use, etc.
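
As an example of what a custom parser might look like, the sketch below pulls the first JSON object out of a free-form reply. It assumes the parser is called with the raw response text and that its return value becomes the response; the name parse_json_block is hypothetical, and how parser errors interact with max_correction_steps is not covered here.

```python
import json
import re


def parse_json_block(response: str) -> dict:
    """Extract the first JSON object from a model response.

    Hypothetical parser; assumes it receives the raw response text.
    """
    # Greedy match from the first "{" to the last "}" in the text.
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))


reply = 'Sure! Here it is:\n{"sentiment": "positive", "score": 0.9}\nAnything else?'
print(parse_json_block(reply))
```

Under those assumptions it would be passed as parser=parse_json_block in the generate() call above, so that row["result"] holds a dict rather than a string.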

Configuration

Parameter	Type	Required	Description
`name`	str	Yes	Column name
`generator_function`	Callable	Yes	Decorated function
`generation_strategy`	GenerationStrategy	No	`CELL_BY_CELL` or `FULL_COLUMN`
`generator_params`	BaseModel	No	Typed params passed to function

Multi-Turn Example

@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["draft", "critique"],
    model_aliases=["writer", "editor"],
)
def writer_editor(row: dict, generator_params: None, models: dict) -> dict:
    draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'")
    critique, _ = models["editor"].generate(prompt=f"Critique: {draft}")
    revised, _ = models["writer"].generate(prompt=f"Revise based on: {critique}\n\nOriginal: {draft}")

    row["final_text"] = revised
    row["draft"] = draft
    row["critique"] = critique
    return row

Development Testing

Test generators with real LLM calls without running the full pipeline:

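data_designer = DataDesigner()
# Returns a dict of ModelFacade instances keyed by alias
models = data_designer.get_models(["my-model"])
# Call the decorated generator directly: (row, generator_params, models)
result = my_generator({"name": "Alice"}, None, models)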
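
For unit tests that should not reach a real endpoint, the models dict can also be filled with a stand-in object. The sketch below is one way to do that, assuming the generator only relies on generate() returning a (response, trace) pair as in the examples above. FakeModel is hypothetical and not part of Data Designer, and the generator is shown undecorated so the snippet runs without the library installed.

```python
class FakeModel:
    """Hypothetical stand-in exposing only generate(), returning (response, trace)."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def generate(self, prompt: str, **kwargs) -> tuple[str, list]:
        self.prompts.append(prompt)  # record prompts for later assertions
        return self.reply, []


def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    # Same body as the earlier LLM example, without the decorator.
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row


fake = FakeModel("Hello, Alice!")
result = generate_message({"name": "Alice"}, None, {"my-model": fake})
print(result)
print(fake.prompts)
```

Recording the prompts makes it possible to assert on what the generator asked for, not only on what it returned.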
