mirror of
https://github.com/coleam00/Archon
synced 2026-04-21 13:37:41 +00:00
* Fix: Add stale workflow cleanup and defense-in-depth error handling Problem: Workflows could get stuck in "running" state indefinitely when the async generator disconnected but the AI subprocess continued working. This blocked new workflow invocations with "Workflow already running" errors. Root cause: No cleanup mechanism existed for workflows that failed to complete due to disconnection between the executor and the Claude SDK. Solution (defense-in-depth): 1. Activity-based staleness detection: Workflows inactive for 15+ minutes are auto-failed when a new workflow is triggered on the same conversation 2. Top-level error handling: All errors in workflow execution are caught and the workflow is properly marked as failed (prevents stuck state) 3. Manual cancel command: /workflow cancel lets users force-fail stuck workflows immediately Changes: - Add last_activity_at column via migration for staleness tracking - Add updateWorkflowActivity() to track activity during execution - Add staleness check before blocking concurrent workflows - Wrap workflow execution in try-catch to ensure failure is recorded - Add /workflow cancel subcommand to command handler - Update test to match new error handling behavior Fixes #232 * docs: Add /workflow cancel command to documentation * Improve error handling and add comprehensive tests for stale workflow cleanup Error handling improvements: - Add workflow ID and error context to updateWorkflowActivity logs - Add stack trace, error name, and cause to top-level catch block - Separate DB failure recording from file logging for clearer error messages - Add try-catch around staleness cleanup with user-facing error message - Check sendCriticalMessage return value and log when user not notified Test coverage additions: - Add staleness detection tests (stale vs non-stale, fallback to started_at) - Add /workflow cancel command tests - Add updateWorkflowActivity function tests (including non-throwing behavior) All 845 tests pass, type-check clean, lint clean.
16 lines
698 B
SQL
16 lines
698 B
SQL
-- Migration: Add last_activity_at column for staleness detection
|
|
-- This enables activity-based staleness detection for stuck workflows
|
|
|
|
-- Add last_activity_at column
|
|
ALTER TABLE remote_agent_workflow_runs
|
|
ADD COLUMN IF NOT EXISTS last_activity_at TIMESTAMP WITH TIME ZONE DEFAULT NOW();
|
|
|
|
-- Backfill existing rows: use completed_at if available, otherwise started_at
|
|
UPDATE remote_agent_workflow_runs
|
|
SET last_activity_at = COALESCE(completed_at, started_at)
|
|
WHERE last_activity_at IS NULL;
|
|
|
|
-- Partial index for efficient staleness queries on running workflows
|
|
CREATE INDEX IF NOT EXISTS idx_workflow_runs_last_activity
|
|
ON remote_agent_workflow_runs(last_activity_at)
|
|
WHERE status = 'running';
|