mirror of
https://github.com/NVIDIA-NeMo/DataDesigner
synced 2026-05-24 09:48:29 +00:00
* feat: add AsyncTaskScheduler and RowGroupBufferManager for async engine
* test: add retryable salvage and eager row-drop scheduler tests
Add two missing test cases for AsyncTaskScheduler:
- Transient 503 failure deferred to salvage round and recovered
- Downstream column never dispatched for rows dropped by upstream failure
Update plan checkboxes to reflect PR 3 completion status.
* fix: address Greptile review findings on scheduler and buffer manager
- Non-retryable batch/from_scratch failure now drops all rows in the
row group so it reaches a terminal state and gets checkpointed
- Salvage rounds re-dispatch from_scratch tasks directly instead of
relying on the frontier (which never contains them)
- Log error if scheduler exits with unfinished row groups
- update_batch raises ValueError on size mismatch
- Free _row_group_sizes entry on checkpoint
- Add row-count mismatch warning in _run_batch writeback
- Document intentional stateful lock ordering in _dispatch_seeds
* refactor: remove dead _seed_frontier method from CompletionTracker
* refactor: restore seed_frontier as public opt-in method on CompletionTracker
Keep the tracker self-contained for static introspection (capacity
planning, task enumeration) without requiring a scheduler. The scheduler
does not call it - it manages root dispatch directly for stateful locks
and multi-column dedup.
* fix: complete salvage path - stateful locks, frontier drain, checkpoint sweep
- Acquire stateful lock before re-dispatching from_scratch tasks in
salvage (prevents RuntimeError on lock release)
- Replace single-pass salvage drain with _drain_frontier() that
re-checks the frontier after each task completes (ensures downstream
tasks enqueued during retry are dispatched)
- Add post-salvage checkpoint sweep via _checkpoint_completed_row_groups
(row groups completed during salvage are now written to parquet)
- Extract _checkpoint_completed_row_groups helper, reused by both the
main loop and post-salvage sweep
* fix: checkpoint per salvage round, simplify deferred list, fire on_complete for empty groups
- Move _checkpoint_completed_row_groups inside the salvage loop so
completed row groups are flushed promptly between rounds
- Simplify _deferred from list[tuple[Task, int]] to list[Task] since
the attempt count was unused (salvage_max_rounds bounds retries)
- Fire on_complete(None) when all rows in a row group are dropped so
callers always receive a completion signal
* fix: address PR review - _is_retryable, semaphore safety, dispatched pruning
Replace string-matching _is_retryable with isinstance checks against
typed ModelXxxError exceptions. Add try/finally around checkpoint to
protect _rg_semaphore.release() from deadlocks. Prune completed tasks
from _dispatched to prevent unbounded growth. Re-register batch_alias
in salvage dispatch. Replace assert with ValueError in _run_cell.
Parameterize on_complete Callable signature. Add test for
on_complete(None) when all rows dropped.
* fix: guard against dispatching tasks for checkpointed row groups
When a from_scratch task fails non-retryably and all rows are dropped,
downstream full_column tasks become vacuously ready and enter the
frontier. If dispatched in the same loop iteration that checkpoints
the row group, the task's coroutine runs against a freed buffer
(KeyError). The guard in _execute_task_inner skips such tasks early,
and a `skipped` flag keeps them in _dispatched to prevent re-dispatch.
Also addresses remaining review feedback:
- agenerate({}) fallback now passes empty DataFrame
- dict comprehensions simplified to dict(row_groups)
- get_row return type annotated as dict[str, Any]
- configs parameter fully typed in test helper
- mock generators use self.config.name
* fix: raise on batch size mismatch, isolate checkpoint failures
- _run_batch: raise ValueError on row count mismatch instead of warning,
matching the update_batch path and preventing silent data corruption.
- _checkpoint_completed_row_groups: catch per-RG exceptions so one failed
checkpoint doesn't block subsequent row groups or leak semaphore slots.
|
||
|---|---|---|
| .. | ||
| async-generators-and-task-queue.md | ||
| code-sketches.md | ||
| diagrams.md | ||