Commit graph

9 commits

Author SHA1 Message Date
Andre Manoel
70dc48884e
feat: add allow_resize for 1:N and N:1 generation patterns (#286)
* feat: add allow_resize for 1:N and N:1 generation patterns

Adds support for generators that produce a different number of records
than the input (expansion or retraction). This addresses GitHub issue #265.

Changes:
- Add `allow_resize` parameter to `update_records()` in DatasetBatchManager
- Add `allow_resize` field to CustomColumnConfig
- Add validation requiring FULL_COLUMN strategy when allow_resize=True
- Track and report actual_num_records in metadata (may differ from target)
- Add logging when batch size changes
- Add example_allow_resize.py demonstrating the feature
- Add comprehensive tests
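
The resize guard described above can be sketched as follows. This is a minimal illustration, not the project's code: `BatchBuffer`, `expand_x2`, and `keep_even` are hypothetical names, and only the `update_records` / `allow_resize` names come from the commit message.

```python
class BatchBuffer:
    """Hypothetical stand-in for a batch manager's in-memory record buffer."""

    def __init__(self, records: list[dict]) -> None:
        self.records = list(records)

    def update_records(self, new_records: list[dict], allow_resize: bool = False) -> None:
        # Without allow_resize, the generator must return exactly one record per input.
        if not allow_resize and len(new_records) != len(self.records):
            raise ValueError(
                f"expected {len(self.records)} records, got {len(new_records)}"
            )
        self.records = list(new_records)


def expand_x2(records: list[dict]) -> list[dict]:
    """1:N pattern (expansion): emit two variants per input record."""
    return [{**record, "variant": i} for record in records for i in range(2)]


def keep_even(records: list[dict]) -> list[dict]:
    """Retraction: drop every record whose id is odd."""
    return [record for record in records if record["id"] % 2 == 0]


buffer = BatchBuffer([{"id": 0}, {"id": 1}, {"id": 2}])
buffer.update_records(expand_x2(buffer.records), allow_resize=True)  # 3 -> 6 records
buffer.update_records(keep_even(buffer.records), allow_resize=True)  # 6 -> 4 records
```

Making the size change opt-in keeps the default 1:1 contract strict, so a generator that drops or duplicates rows by accident still fails loudly.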

* docs: add allow_resize to custom columns documentation

* refactor: consolidate buffer API and elevate allow_resize to base config

- Merge update_records and replace_buffer into a single replace_buffer
  method with allow_resize parameter on DatasetBatchManager
- Move allow_resize field from CustomColumnConfig to SingleColumnConfig
  so plugins inherit it without needing a mixin
- Align example and logging with final CustomColumn API
- Parametrize resize tests and extract shared stub in test_columns

* test: add chained resize and multi-batch integration tests

- Add expand->retract->expand chaining test (single batch)
- Add multi-batch resize test verifying combined parquet output
- Update example to chain expand/retract/expand with preview+build
- Use 💥/✂️ emojis for resize logging (expand/retract)

* extend allow_resize to cell-by-cell (return dict or list[dict])

- Config: allow allow_resize with CELL_BY_CELL; relax validator
- Custom generator: accept dict | list[dict] when cell_by_cell + allow_resize;
  validate per row via _validate_cell_output
- Builder: collect results by index when cell allow_resize, flatten and
  replace_buffer; add _log_resize_if_changed and _column_display_name
- Docs: ALL_CAPS for strategies, simplify allow_resize table text
- Tests: parametrized preview and multibatch; factories with n param;
  _RESIZE_SPECS with inline factory calls; ids ordered like specs
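
A rough sketch of the "collect results by index, then flatten" step for the cell-by-cell case. The function name `flatten_cell_results` and its input shape are assumptions made for illustration; the builder's actual internals are not shown in this log.

```python
from typing import Union

# A cell-level generator may return one row (dict) or several rows (list of dicts).
CellResult = Union[dict, list[dict]]


def flatten_cell_results(results_by_index: dict[int, CellResult]) -> list[dict]:
    """Flatten per-row results into one list, ordered by input row index.

    Results may be stored in any order (for example, when rows finish
    concurrently), so the keys are sorted before flattening.
    """
    rows: list[dict] = []
    for index in sorted(results_by_index):
        result = results_by_index[index]
        if isinstance(result, dict):
            rows.append(result)   # 1:1, a single row came back
        else:
            rows.extend(result)   # 1:N, or an empty list that drops the row
    return rows


results = {
    2: [],                                          # row 2 is dropped
    0: {"id": 0},                                   # row 0 stays as one row
    1: [{"id": 1, "v": 0}, {"id": 1, "v": 1}],      # row 1 expands to two rows
}
flat = flatten_cell_results(results)
```

The flattened list can then replace the buffer with resizing enabled, since its length may differ from the number of input rows.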

* reorder allow_resize specs and add edge-case tests

- Rename specs: full_x3, cell_x2, cell_plus_full_chain; add cell_filter_odd,
  cell_drop_all to _RESIZE_SPECS
- Stubs before specs: _resize_full_keep_first, _resize_cell_expand,
  _resize_cell_filter_odd, _resize_cell_drop_all; drop cell factories
- Remove FULL/CELL constants; use GenerationStrategy.* in _RESIZE_SPECS
- Preview/multibatch parametrize: _preview and _multibatch ids; two full_x3
  multibatch cases (5_2, 4_2) first
- Handle all-batches-skipped in multibatch test (empty df when path missing)
- test_custom: add test_cell_by_cell_allow_resize_return_list_single (1:1 via list)

* tidy allow_resize: drop validator, shared stub, explicit flag

- Remove validate_allow_resize_requires_full_column from CustomColumnConfig
- Rename StubColumnConfigWithoutEmoji to StubColumnConfig in test_columns
- Pass allow_resize=False in _write_processed_batch replace_buffer call

* fix: add missing f prefix to error message in custom.py

* docs(plugins): add section on setting allow_resize=True for resize plugins

* fix: address PR review comments on allow_resize

- Replace getattr with direct attribute access where config is always
  SingleColumnConfig (custom.py, cell-by-cell path in builder)
- Keep getattr in _run_full_column_generator which also handles
  multi-column configs without allow_resize
- Restructure allow_resize validation branching in CustomColumnGenerator
- Fix error message wording: "key" -> "column"

* fix: remove duplicate tool_alias log, fix test docstring

- Remove tool_alias log from _setup_fan_out (callers already log it)
- Fix docstring: CELL_BY_CELL -> FULL_COLUMN in resize test factory

* fix: avoid duplicate undeclared-column warning in _validate_output

Inline the strip instead of delegating to _validate_cell_output,
which would log the same warning a second time.

* fix: use lazy.pd instead of pd for runtime pandas usage in tests

The pd import is under TYPE_CHECKING, so runtime calls need lazy.pd.
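
The failure mode behind this fix can be illustrated with a small sketch. It assumes a hypothetical `_Lazy` namespace and uses the standard-library `decimal` module in place of pandas so that it runs anywhere; the project's real `lazy` object may be implemented differently.

```python
import importlib
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; at runtime this name is never bound.
    import decimal


class _Lazy:
    """Hypothetical lazy-import namespace: imports a module on first access."""

    def __getattr__(self, name: str):
        module = importlib.import_module(name)
        setattr(self, name, module)  # cache so __getattr__ is not hit again
        return module


lazy = _Lazy()


def total(values: "list[decimal.Decimal]") -> "decimal.Decimal":
    # The quoted annotations are never evaluated, so they may use the
    # TYPE_CHECKING-only name. The function body runs at runtime and must
    # reach the module through the lazy namespace instead.
    return sum(values, lazy.decimal.Decimal(0))


result = total([lazy.decimal.Decimal("1.5"), lazy.decimal.Decimal("2.5")])
```

Calling `decimal.Decimal(0)` directly in the function body would raise `NameError` at runtime, which is the same class of bug the commit fixes for `pd`.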
2026-02-18 18:39:31 -03:00
Johnny Greco
11143c788f
docs: restructure plugin docs with multi-file layout and seed reader type (#302)
* docs: restructure plugin docs with multi-file layout and seed reader type

- Update plugin overview to document both column generator and seed
  reader plugin types
- Restructure example plugin to use separate config.py, impl.py, and
  plugin.py files instead of a single-file approach
- Add sections for plugin validation and multiple plugins per package
- Document required config class methods (get_column_emoji,
  required_columns, side_effect_columns)

* docs: clarify benefits of multi-file plugin structure

Expand explanation to mention circular dependency prevention
as a key reason for separating config, impl, and plugin modules.

* docs: fix import ordering in plugin example

* import spacing

* better example column name

* add a bit to the comment

* Updated plugin docs

* update plugin overview call-to-action wording

---------

Co-authored-by: Kirit93 <kthadaka@nvidia.com>
2026-02-09 16:03:56 -05:00
Johnny Greco
87119a545b
refactor: move SingleColumnConfig to config.base module (#287)
* create top-level base file

* add note

* update license header

* move exportable config and move base to config module

* update references in docs

* do not include single column config in init

* add inverse import order e2e test
2026-02-03 14:04:04 -05:00
Johnny Greco
ae0665fa16
refactor: slim package refactor into three subpackages (#240)
* remove old structure

* major shuffle

* streamline project configs

* update make commands

* updates to make commands

* remove essentials

* initialize logger in interface

* uv lock

* ignore notepad

* update workflows

* fix e2e project config

* generate colab notebooks

* resolve default model settings in interface

* fix build commands

* update perf import make command

* cleaning up some slop

* update recipes

* move conftest files to tests/

* update subpackage readmes

* streamline config_logging

* use exports

* update perf import usage pattern

* update for IDE behavior with ruff

* remove engine's fixtures file

* add note about lazy imports

* update dependencies

* update docs

* doc fixes

* uv lock

* updates to catch up with main

* clean up makefile

* remove package gitignores

* define deps only once

* isolate tests

* add test for protection rule

* create temp dirs for isolated tests

* catch up to main

* update headers

* re apply changes

* better result summaries for isolated tests

* move exports into top-level init

* fix client importlib version syntax

* catch up with main
2026-01-27 13:53:20 -05:00
Johnny Greco
3d9f5185d7
refactor: remove task metadata property (#216)
* remove metadata

* docs and tests

* don't need that test

* use static method for generation strategy

* update docs

* add docstring
2026-01-15 14:12:11 -05:00
Johnny Greco
69cd989285
refactor: update required resources treatment and use subclasses over mixins (#184)
* removing required resources

* fix tests

* add get required resources method to base column generator

* move classification functions to engine; remove required resources

* drop single from subclass names

* update model config logging

* fix unit test

* typo

* update type hint

* move tests
2026-01-09 14:42:09 -05:00
Mike Knepper
8e69ab0336
refactor: Plugins rename task to impl (#189) 2026-01-08 13:34:05 -06:00
Mike Knepper
36a174af04
refactor: plugin system updates (#168) 2026-01-06 10:29:47 -06:00
Johnny Greco
48fdc8c838
docs: add initial plugin documentation (#107)
* add docstrings

* add analysis modules

* include toc for plugins section

* add plugin docs

* remove scope creep

* Update docs/plugins/example.md

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>

* address feedback

---------

Co-authored-by: Nabin Mulepati <nmulepati@nvidia.com>
2025-12-11 16:05:11 -05:00