mirror of https://github.com/NVIDIA-NeMo/DataDesigner synced 2026-05-24 09:48:29 +00:00

feat: add allow_resize for 1:N and N:1 generation patterns (#286)

* feat: add allow_resize for 1:N and N:1 generation patterns

Adds support for generators that produce a different number of records
than the input (expansion or retraction). This addresses GitHub issue #265.

Changes:
- Add `allow_resize` parameter to `update_records()` in DatasetBatchManager
- Add `allow_resize` field to CustomColumnConfig
- Add validation requiring FULL_COLUMN strategy when allow_resize=True
- Track and report actual_num_records in metadata (may differ from target)
- Add logging when batch size changes
- Add example_allow_resize.py demonstrating the feature
- Add comprehensive tests

* docs: add allow_resize to custom columns documentation

* refactor: consolidate buffer API and elevate allow_resize to base config

- Merge update_records and replace_buffer into a single replace_buffer
  method with allow_resize parameter on DatasetBatchManager
- Move allow_resize field from CustomColumnConfig to SingleColumnConfig
  so plugins inherit it without needing a mixin
- Align example and logging with final CustomColumn API
- Parametrize resize tests and extract shared stub in test_columns

* test: add chained resize and multi-batch integration tests

- Add expand->retract->expand chaining test (single batch)
- Add multi-batch resize test verifying combined parquet output
- Update example to chain expand/retract/expand with preview+build
- Use 💥/✂️ emojis for resize logging (expand/retract)

* extend allow_resize to cell-by-cell (return dict or list[dict])

- Config: allow allow_resize with CELL_BY_CELL; relax validator
- Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize;
  validate per row via _validate_cell_output
- Builder: collect results by index when cell allow_resize, flatten and
  replace_buffer; add _log_resize_if_changed and _column_display_name
- Docs: ALL_CAPS for strategies, simplify allow_resize table text
- Tests: parametrized preview and multibatch; factories with n param;
  _RESIZE_SPECS with inline factory calls; ids ordered like specs

* reorder allow_resize specs and add edge-case tests

- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd,
  cell_drop_all to _RESIZE_SPECS
- Stubs before specs: _resize_full_keep_first, _resize_cell_expand,
  _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories
- Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS
- Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3
  multibatch cases (5_2, 4_2) first
- Handle all-batches-skipped in multibatch test (empty df when path missing)
- test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)

* tidy allow_resize: drop validator, shared stub, explicit flag

- Remove validate_allow_resize_requires_full_column from CustomColumnConfig
- Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns
- Pass allow_resize=False in _write_processed_batch replace_buffer call

* fix: add missing f prefix to error message in custom.py

* docs(plugins): add section on setting allow_resize=True for resize plugins

* fix: address PR review comments on allow_resize

- Replace getattr with direct attribute access where config is always
  SingleColumnConfig (custom.py, cell-by-cell path in builder)
- Keep getattr in _run_full_column_generator which also handles
  multi-column configs without allow_resize
- Restructure allow_resize validation branching in CustomColumnGenerator
- Fix error message wording: "key" -> "column"

* fix: remove duplicate tool_alias log, fix test docstring

- Remove tool_alias log from _setup_fan_out (callers already log it)
- Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory

* fix: avoid duplicate undeclared-column warning in _validate_output

Inline the strip instead of delegating to _validate_cell_output,
which would log the same warning a second time.

* fix: use lazy.pd instead of pd for runtime pandas usage in tests

The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.

2026-02-18 18:39:31 -03:00


Custom Columns

Custom columns let you implement your own generation logic using Python functions. Use them for multi-step LLM workflows, external API integration, or any scenario requiring full programmatic control. For reusable, distributable components, see Plugins instead.

Quick Start

import data_designer.config as dd

@dd.custom_column_generator(required_columns=["name"])
def create_greeting(row: dict) -> dict:
    row["greeting"] = f"Hello, {row['name']}!"
    return row

config_builder.add_column(
    dd.CustomColumnConfig(
        name="greeting",
        generator_function=create_greeting,
    )
)

Function Signatures

Three signatures are supported. Parameter names are validated:

Args	Signature	Use Case
1	`fn(row) -> dict`	Simple transforms
2	`fn(row, generator_params) -> dict`	With typed params
3	`fn(row, generator_params, models) -> dict`	LLM access via models dict

For the full_column strategy, name the first parameter df instead of row; it receives the whole batch as a DataFrame.
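To make "parameter names are validated" concrete, here is a minimal sketch of what such a check could look like. It is not the library's code: `check_signature` and `ALLOWED_NAMES` are hypothetical names, and the real validation may differ in both rules and error messages.

```python
import inspect

# Allowed parameter names by position, taken from the signature table.
# Hypothetical: the library's own validation may be stricter or looser.
ALLOWED_NAMES = [("row", "df"), ("generator_params",), ("models",)]


def check_signature(fn) -> None:
    """Raise TypeError if fn does not match one of the documented shapes."""
    names = list(inspect.signature(fn).parameters)
    if not 1 <= len(names) <= 3:
        raise TypeError(f"{fn.__name__}: expected 1 to 3 parameters, got {len(names)}")
    for position, name in enumerate(names):
        if name not in ALLOWED_NAMES[position]:
            raise TypeError(
                f"{fn.__name__}: parameter {position + 1} must be one of "
                f"{ALLOWED_NAMES[position]}, got {name!r}"
            )


def good(row, generator_params, models):
    return row


def bad(row, params):  # second parameter is misnamed
    return row


check_signature(good)  # passes silently
try:
    check_signature(bad)
    signature_error = None
except TypeError as exc:
    signature_error = str(exc)
```

The practical consequence is that a generator with a misnamed second or third parameter should be expected to fail early rather than at generation time.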

For LLM access without params, keep the three-argument signature and annotate generator_params as None:

@dd.custom_column_generator(required_columns=["name"], model_aliases=["my-model"])
def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row

Model aliases are validated before generation starts. If an alias doesn't exist in your config, an error is raised during the health check.
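The fail-fast behavior described above can be pictured with a small sketch. `find_missing_aliases` is a hypothetical helper, and the actual health check presumably does more than a name lookup (for example, probing each model), so treat this only as an illustration of the ordering: aliases are checked before any row is generated.

```python
def find_missing_aliases(declared: list[str], configured: set[str]) -> list[str]:
    """Return declared aliases that are absent from the configuration."""
    return [alias for alias in declared if alias not in configured]


# Hypothetical configuration containing two model aliases.
configured_aliases = {"my-model", "writer"}

# A generator that declares one known and one unknown alias.
missing = find_missing_aliases(["my-model", "editor"], configured_aliases)

# A real pipeline would raise here, before generation starts.
error_message = f"Unknown model aliases: {missing}" if missing else None
```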

Generation Strategies

Strategy	Input	Use Case
`cell_by_cell` (default)	`row: dict`	LLM calls, row-by-row logic
`full_column`	`df: DataFrame`	Vectorized DataFrame operations

Recommendation: Use cell_by_cell for LLM calls. The framework handles parallelization automatically. Use full_column only for vectorized operations that don't involve LLM calls.
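The reasoning behind this recommendation is that per-row functions are independent of each other, so a framework can run many of them at once while each waits on its LLM call. The sketch below illustrates the idea with a thread pool; the source does not say how the framework actually schedules work, so `run_cell_by_cell` and the choice of `ThreadPoolExecutor` are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_cell_generator(row: dict) -> dict:
    # Stand-in for a function that would call an LLM for one row.
    return {**row, "greeting": f"Hello, {row['name']}!"}


def run_cell_by_cell(rows: list[dict], fn, max_workers: int = 4) -> list[dict]:
    # executor.map preserves input order, so results line up with rows.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fn, rows))


rows = [{"name": "Alice"}, {"name": "Bob"}]
results = run_cell_by_cell(rows, slow_cell_generator)
```

A full_column function, by contrast, is called once per batch, so any LLM calls inside it run one after another unless you parallelize them yourself.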

For full_column, set generation_strategy=dd.GenerationStrategy.FULL_COLUMN.

The Decorator

@dd.custom_column_generator(
    required_columns=["col1"],        # DAG ordering
    side_effect_columns=["extra"],    # Additional columns created
    model_aliases=["model1"],         # Required for LLM access
)
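One way to read side_effect_columns is as a declaration of which extra keys a generator is allowed to add. The following sketch assumes that undeclared keys are dropped with a warning and that a missing primary column is an error; `validate_cell_output` here is a hypothetical stand-in, not the library's function of the same name.

```python
import logging

logger = logging.getLogger("custom_column_sketch")


def validate_cell_output(
    row_in: dict, row_out: dict, name: str, side_effects: list[str]
) -> dict:
    """Keep input columns, the named column, and declared side effects."""
    if name not in row_out:
        raise ValueError(f"Generator did not produce the column {name!r}")
    allowed = set(row_in) | {name} | set(side_effects)
    undeclared = [key for key in row_out if key not in allowed]
    if undeclared:
        # Assumed behavior: warn once and strip the undeclared keys.
        logger.warning("Dropping undeclared columns: %s", undeclared)
    return {key: value for key, value in row_out.items() if key in allowed}


cleaned = validate_cell_output(
    {"col1": 1},
    {"col1": 1, "result": 2, "extra": 3, "stray": 4},
    name="result",
    side_effects=["extra"],
)
```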

Models Dict

The third argument is a dict of ModelFacade instances, keyed by alias. You must declare every model your generator uses in model_aliases; this populates the models dict and enables health checks before generation starts.

@dd.custom_column_generator(model_aliases=["my-model"])
def my_generator(row: dict, generator_params: None, models: dict) -> dict:
    model = models["my-model"]
    response, trace = model.generate(
        prompt="...",
        parser=my_custom_parser,  # optional, defaults to identity
        system_prompt="...",
        max_correction_steps=3,
    )
    row["result"] = response
    return row

This gives you direct access to all ModelFacade capabilities: custom parsers, correction loops, structured output, tool use, etc.

Configuration

Parameter	Type	Required	Description
`name`	str	Yes	Column name
`generator_function`	Callable	Yes	Decorated function
`generation_strategy`	GenerationStrategy	No	`CELL_BY_CELL` or `FULL_COLUMN`
`generator_params`	BaseModel	No	Typed params passed to function
`allow_resize`	bool	No	Allow 1:N or N:1 generation
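The two-argument signature has no example above, so here is a sketch of a generator that reads typed params. To stay self-contained it uses a frozen dataclass as a stand-in; the table states that generator_params is a BaseModel, so in real use the class would presumably subclass pydantic's BaseModel. `GreetingParams` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GreetingParams:
    # Stand-in for a BaseModel subclass; field names are illustrative only.
    salutation: str = "Hello"
    punctuation: str = "!"


def create_greeting(row: dict, generator_params: GreetingParams) -> dict:
    # Read settings from the typed params object instead of hard-coding them.
    row["greeting"] = (
        f"{generator_params.salutation}, {row['name']}{generator_params.punctuation}"
    )
    return row


result = create_greeting({"name": "Alice"}, GreetingParams(salutation="Welcome"))
```

In the library, the function would additionally carry the custom_column_generator decorator and the params instance would be passed through the generator_params field of the config.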

Resizing (1:N and N:1)

FULL_COLUMN: Set allow_resize=True and return a DataFrame with more or fewer rows than the input:

import pandas as pd

@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["variation_id"],
)
def expand_topics(df: pd.DataFrame, generator_params: None, models: dict) -> pd.DataFrame:
    # Build the output row by row; each input topic yields three output rows.
    rows = []
    for _, row in df.iterrows():
        for i in range(3):  # Generate 3 variations per input
            rows.append({
                "topic": row["topic"],
                "question": f"Question {i+1} about {row['topic']}",
                "variation_id": i,
            })
    return pd.DataFrame(rows)

dd.CustomColumnConfig(
    name="question",
    generator_function=expand_topics,
    generation_strategy=dd.GenerationStrategy.FULL_COLUMN,
    allow_resize=True,
)

CELL_BY_CELL: With allow_resize=True, your function may return a single row (dict) or multiple rows (list[dict]). Return [] to drop that input row.

@dd.custom_column_generator(required_columns=["id"])
def expand_row(row: dict) -> list[dict]:
    return [
        {**row, "variant": "a"},
        {**row, "variant": "b"},
    ]

dd.CustomColumnConfig(
    name="variant",
    generator_function=expand_row,
    generation_strategy=dd.GenerationStrategy.CELL_BY_CELL,
    allow_resize=True,
)
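The three return shapes (dict, list[dict], and []) can be understood as one rule: every per-row result is normalized to a list and the lists are concatenated. The sketch below shows that rule in plain Python. `normalize` and `run_with_resize` are hypothetical helpers, not the builder's actual implementation.

```python
def normalize(result) -> list[dict]:
    """Turn a dict or list[dict] into a list of rows."""
    if isinstance(result, dict):
        return [result]
    if isinstance(result, list) and all(isinstance(item, dict) for item in result):
        return result
    raise TypeError(f"Expected dict or list[dict], got {type(result).__name__}")


def run_with_resize(rows: list[dict], fn) -> list[dict]:
    """Apply fn to each row and flatten the per-row results in input order."""
    flattened: list[dict] = []
    for row in rows:
        flattened.extend(normalize(fn(dict(row))))  # copy so fn cannot mutate the input
    return flattened


def expand_row(row: dict) -> list[dict]:
    return [{**row, "variant": "a"}, {**row, "variant": "b"}]


def drop_odd(row: dict):
    # Returning [] drops the row; returning a dict keeps it 1:1.
    return row if row["id"] % 2 == 0 else []


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
expanded = run_with_resize(rows, expand_row)  # 3 rows in, 6 rows out
filtered = run_with_resize(rows, drop_odd)    # 3 rows in, 1 row out
```

If every row returns [], the result is an empty batch, which is worth handling in downstream columns.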

Use cases:

Expansion (1:N): Generate multiple variations per input
Retraction (N:1): Filter, aggregate, or deduplicate records with FULL_COLUMN, or return [] for each row to drop with CELL_BY_CELL
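Deduplication and aggregation need to see the whole batch, which is why they belong to FULL_COLUMN rather than CELL_BY_CELL. The sketch below shows a first-occurrence dedup over a list of records, used here as a stand-in for a DataFrame so the example has no dependencies; `deduplicate_by` is a hypothetical helper.

```python
def deduplicate_by(records: list[dict], key: str) -> list[dict]:
    """Keep the first record seen for each distinct value of key."""
    seen = set()
    unique: list[dict] = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


batch = [
    {"topic": "python", "question": "Q1"},
    {"topic": "rust", "question": "Q2"},
    {"topic": "python", "question": "Q3"},
]
deduplicated = deduplicate_by(batch, "topic")  # 3 records in, 2 records out
```

Inside a real FULL_COLUMN generator the same effect is available through pandas, for example `df.drop_duplicates(subset=["topic"])`, which also keeps the first occurrence by default.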

Multi-Turn Example

@dd.custom_column_generator(
    required_columns=["topic"],
    side_effect_columns=["draft", "critique"],
    model_aliases=["writer", "editor"],
)
def writer_editor(row: dict, generator_params: None, models: dict) -> dict:
    draft, _ = models["writer"].generate(prompt=f"Write about '{row['topic']}'")
    critique, _ = models["editor"].generate(prompt=f"Critique: {draft}")
    revised, _ = models["writer"].generate(prompt=f"Revise based on: {critique}\n\nOriginal: {draft}")

    row["final_text"] = revised
    row["draft"] = draft
    row["critique"] = critique
    return row

Development Testing

Test generators with real LLM calls without running the full pipeline:

data_designer = DataDesigner()
models = data_designer.get_models(["my-model"])
result = my_generator({"name": "Alice"}, None, models)
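For quick unit tests that should not hit a real endpoint, a stub can stand in for the models dict. The sketch below assumes only what the examples in this document show: that a model exposes generate(prompt=...) and returns a (response, trace) pair. `StubModel` is hypothetical and the generator is left undecorated so the block runs without the library.

```python
class StubModel:
    """Minimal stand-in that mimics generate(prompt=...) -> (response, trace)."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.prompts: list[str] = []

    def generate(self, prompt: str, **kwargs) -> tuple[str, list]:
        # Record the prompt so tests can assert on what was sent.
        self.prompts.append(prompt)
        return self.reply, []


def generate_message(row: dict, generator_params: None, models: dict) -> dict:
    response, _ = models["my-model"].generate(prompt=f"Greet {row['name']}")
    row["greeting"] = response
    return row


stub = StubModel("Hi Alice!")
result = generate_message({"name": "Alice"}, None, {"my-model": stub})
```

This checks the prompt construction and row handling cheaply; the real-model test above remains the way to validate output quality.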
