Commit graph

10 commits

Author SHA1 Message Date
Eric W. Tramel
c0a4dcbb85
feat: implement async scheduling admission control (#661)
Some checks are pending
CI / End to end test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Config (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / End to end test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Engine (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test Interface (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
CI / Coverage Check (Python 3.11) (push) Blocked by required conditions
CI / Test (Python 3.10 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.10 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on macos-latest) (push) Blocked by required conditions
CI / Test (Python 3.11 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.12 on ubuntu-latest) (push) Blocked by required conditions
CI / Test (Python 3.13 on ubuntu-latest) (push) Blocked by required conditions
2026-05-20 20:58:05 -04:00
Andre Manoel
7e812630cf
feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder (#429)
* feat: wire async task-queue scheduler into ColumnWiseDatasetBuilder

* chore: add async benchmark notebook and demo scripts

* fix: address all PR review comments on async builder integration

- Wire on_batch_complete through on_row_group_complete callback
- Mark trailing slots as dropped in replace_dataframe when processor filters rows
- Ensure checkpoint still runs when on_before_checkpoint raises
- Gate non-seed task dispatch on pre-batch completion
- Add public run_pre_batch_on_df to ProcessorRunner (replaces private _run_stage call)
- Add public is_column_complete_for_rg to CompletionTracker (replaces private _completed access)
- Type task_traces as list[TaskTrace] in results.py
- Add async_trace docstring to RunConfig
- Move module-level log into _build_async
- Add replace_dataframe unit tests (same-size, dropped rows, fewer rows)
- Assert on public outcomes in scheduler tests instead of private attributes
- Parametrize allow_resize validation tests
- Cache seed_cols before main loop
- Remove redundant disable_early_shutdown from AsyncTaskScheduler

* style: fix ruff format for lambda expression

* fix: address open review issues on async scheduler

- Flush completed row groups before breaking on early shutdown (data loss)
- Change error rate check from >= to > so disable_early_shutdown sentinel
  (1.0) never triggers at 100% failure rate
- Extract seeds-complete check into helper and call it in salvage rounds
  via _drain_frontier, with pre-batch gating, so pre-batch processor runs
  even when seed tasks succeed only after retry
- Fix is_column_complete_for_rg to check _batch_complete first, then verify
  all non-dropped rows for CELL_BY_CELL columns
- Replace O(|in-flight|) scan in _in_flight_for_rg with per-RG counter

* fix: sync pre-batch row drops to CompletionTracker and restore stderr safely

Pre-batch processors that filter rows marked them as dropped in
RowGroupBufferManager but not in CompletionTracker, causing unnecessary
LLM calls for rows that would be discarded at checkpoint time.

Also wrap the benchmark warmup stderr redirect in try/finally so stderr
is restored if _run_once raises.

* fix: prune _admitted_rg_ids on row group checkpoint

Prevents unbounded growth of the admission set across large runs.

* chore: remove demo/async files from PR

Dev-time benchmarks and manual test scripts - kept locally, not needed in the PR.

* fix: wire disable_early_shutdown into AsyncTaskScheduler

RunConfig.disable_early_shutdown was forwarded to the sync executor
but silently ignored in the async path. Now passed through to the
scheduler's _check_error_rate.

* test: add e2e test for async engine concurrency

Verifies the async scheduler dispatches independent LLM columns
concurrently by checking for overlapping task trace intervals.
Uses a wide DAG (sampler -> 2 parallel LLM columns) with 2 records.
Requires NVIDIA_API_KEY.

* fix: drop row group on on_before_checkpoint failure instead of writing unprocessed data

Matches on_seeds_complete failure behavior and avoids silently
checkpointing unfiltered rows when a post-batch processor fails.

* fix: skip on_before_checkpoint when no POST_BATCH processors configured

Avoids unnecessary DataFrame round-trip for every row group in the
common case where no post-batch processors exist.

* fix: address remaining review nits from nabinchha and greptile summary

- Gate on_seeds_complete on PRE_BATCH processors (matches on_before_checkpoint pattern)
- Cache seed_cols as instance attr instead of recomputing in _dispatch_seeds
- Iterate list(self._active_rgs) snapshot in _run_seeds_complete_check
- Add logger.debug to telemetry except block
- Add design comment on on_before_checkpoint failure drop behavior
- Rename row_group param to row_group_index in is_column_complete_for_rg
- Document rg_id as current_batch_number equivalence
- Use mock.patch.object in e2e test instead of direct mutation
- Add max(0, ...) floor guard on _in_flight_counts decrement
- Rename _ensure_async_engine_loop to public ensure_async_engine_loop
- Move AsyncTaskScheduler import to module level in integration tests

* fix: preserve async callback contract and e2e setup

* fix: prune _seeds_dispatched_rgs and _pre_batch_done_rgs on checkpoint

* refactor: consolidate per-RG state into _RowGroupState dataclass

Replace 5 independent collections (_active_rgs, _admitted_rg_ids,
_seeds_dispatched_rgs, _pre_batch_done_rgs, _in_flight_counts) with a
single _rg_states dict keyed by row group ID. Cleanup is now a single
`del` instead of N separate discards, eliminating the class of bugs
where one collection is missed during row group teardown.

* fix: skip checkpoint and callbacks when on_before_checkpoint fails

When on_before_checkpoint raises and all rows are dropped, the code
previously fell through to checkpoint_row_group and on_row_group_complete,
sending a spurious progress notification for a batch with zero records.
Now gates both on a `dropped` flag so they are skipped after failure.

* fix: snapshot dropped rows before await in _run_batch and sync tracker on checkpoint failure

Two fixes:
- _run_batch: snapshot dropped rows before `await agenerate` so the
  row-count expectation matches batch_df. Concurrent tasks can drop rows
  during the await, causing a spurious ValueError that would drop the
  entire row group. Write-back now re-checks is_dropped to skip rows
  dropped mid-flight.
- _checkpoint_completed_row_groups: add tracker.drop_row alongside
  buffer_manager.drop_row when on_before_checkpoint fails, keeping
  both in sync.

* feat: sliding window error rate and out-of-order row group completion test

Replace cumulative error counters with a deque-based sliding window so
that early transient failures do not permanently inflate the error rate
in long-running jobs. Add tests for the sliding window recovery path
and for deterministic out-of-order row group checkpoint ordering.

* fix: use real time delays in out-of-order completion test

asyncio.sleep(0) interleaving is not deterministic across Python
versions. Switch to asyncio.sleep(num_records * 0.02) so the smaller
row group genuinely finishes seeds first regardless of event loop
scheduling.

* fix: prevent ZeroDivisionError when shutdown_error_window is 0

Change RunConfig.shutdown_error_window constraint from ge=0 to ge=1
so the sliding window denominator is never zero.

* fix: address Greptile review nits in async_scheduler

- Move del _rg_states inside try/finally so semaphore is always released
- Add exc_info=True to pre-batch failure log for consistent tracebacks
- Short-circuit _check_error_rate when _early_shutdown already set

* fix: address Greptile summary findings

- Remove duplicate async engine log in build() (kept in _build_async)
- Guard _run_seeds_complete_check with has_pre_batch at both call sites
- Change error rate comparison from > to >= to match sync path semantics
2026-03-20 13:05:09 -03:00
Eric W. Tramel
28c8345909
feat: add built-in filesystem seed readers (#421) 2026-03-16 17:40:27 -04:00
Andre Manoel
982ce79ca9
feat: add processor plugin support (#299)
* feat: add processor plugin support

Add PluginType.PROCESSOR to the plugin system, enabling third-party
processor plugins via entry points. Includes a demo plugin package
with RegexFilterProcessor (process_before_batch) and
SemanticDedupProcessor (process_after_generation).

- Add PluginType.PROCESSOR with processor_type discriminator
- Create processor_types.py for ProcessorConfigT with plugin injection
- Register plugin processors in engine ProcessorRegistry
- Use RLock in PluginRegistry to prevent deadlocks during discovery
- Add demo package: data-designer-demo-processors
- Update processor and plugin documentation

* test: add processor plugin registration test

Verify that processor plugins from PluginRegistry are picked up
by create_default_processor_registry and registered correctly.

* test: simplify processor plugin registration test

* move ProcessorConfig to base and convert demo to e2e test

- Move ProcessorConfig from processors.py to config.base to guard
  against circular deps (alongside SingleColumnConfig)
- Delete demo/ directory with regex_filter and semantic_dedup plugins
- Add regex_filter as an e2e processor plugin test in tests_e2e/

* move plan to plans/299/
2026-02-25 16:40:01 -03:00
Johnny Greco
87119a545b
refactor: move SingleColumnConfig to config.base module (#287)
* create top-level base file

* add note

* update license header

* move exportable config and move base to config module

* update references in docs

* do not include single column config in init

* add inverse import order e2e test
2026-02-03 14:04:04 -05:00
Eric W. Tramel
5430bcbe99
Remove debug_trace_override (#290) 2026-02-03 12:09:30 -05:00
Eric W. Tramel
510761107b
feat: Add TraceType enum for granular trace control (#284) 2026-02-02 19:43:51 -05:00
Eric W. Tramel
e6e58e692e
feat: MCP (Model Context Protocol) tool calling integration for LLM columns (#248) 2026-02-02 09:41:58 -05:00
Johnny Greco
ae0665fa16
refactor: slim package refactor into three subpackages (#240)
* remove old structure

* major shuffle

* streamline project configs

* update make commands

* updates to make commands

* remove essentials

* initialize logger in interface

* uv lock

* ignore notepad

* update workflows

* fix e2e project config

* generate colab notebooks

* resolve default model settings in interface

* fix build commands

* update perf import make command

* cleaning up some slop

* update recipes

* move conftest files to tests/

* update subpackage readmes

* streamline config_logging

* use exports

* update perf import usage pattern

* update for IDE behavior with ruff

* remove engine's fixtures file

* add note to about lazy imports

* update dependencies

* update docs

* doc fixes

* uv lock

* updates to catch up with main

* clean up makefile

* remove package gitignores

* define deps only once

* isolate tests

* add test for protetion rule

* create temp dirs for isolated tests

* catch up to main

* update headers

* re apply changes

* better result summaries for isolated tests

* move exports into top-level init

* fix client importlib version syntax

* catch up with main
2026-01-27 13:53:20 -05:00
Johnny Greco
367de1a063
rename (#214) 2026-01-14 15:26:46 -05:00