
Authoring Workflows for Archon

This guide explains how to create workflows that orchestrate multiple commands into automated pipelines. Read Authoring Commands first - workflows are built from commands.

What is a Workflow?

A workflow is a YAML file that defines a directed acyclic graph (DAG) of nodes to execute. Workflows enable:

  • Multi-node automation: Chain multiple AI agents together with dependency edges
  • Parallel execution: Independent nodes in the same topological layer run concurrently
  • Conditional branching: when: conditions and trigger_rule control which nodes run
  • Autonomous loops: Loop nodes iterate until a condition is met
A minimal example:

name: fix-github-issue
description: Investigate and fix a GitHub issue end-to-end
nodes:
  - id: investigate
    command: investigate-issue
  - id: implement
    command: implement-issue
    depends_on: [investigate]
    context: fresh

File Location

Workflows live in .archon/workflows/ relative to the working directory:

.archon/
├── workflows/
│   ├── my-workflow.yaml
│   └── review/
│       └── full-review.yaml    # Subdirectories work
└── commands/
    └── [commands used by workflows]

Archon discovers workflows recursively - subdirectories are fine. If a workflow file fails to load (syntax error, validation failure), it's skipped and the error is reported via /workflow list.

CLI vs Server: The CLI reads workflow files from wherever you run it (sees uncommitted changes). The server reads from the workspace clone at ~/.archon/workspaces/owner/repo/, which only syncs from the remote before worktree creation. If you edit a workflow locally but don't push, the server won't see it.


Workflow Schema

Workflows use a nodes: format where nodes declare explicit dependency edges. Independent nodes in the same topological layer run concurrently via Promise.allSettled. Nodes skipped by a false when: condition or an unmet trigger_rule propagate their skipped state to dependants.
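
The layering works like a breadth-first pass over the dependency graph. As an illustrative sketch (TypeScript, not Archon's actual executor code), nodes can be grouped into layers by repeatedly taking every node whose dependencies are already placed:

```typescript
interface NodeSpec {
  id: string;
  depends_on?: string[];
}

// Group nodes into topological layers: layer 0 has no dependencies;
// each later layer depends only on nodes in earlier layers.
function topologicalLayers(nodes: NodeSpec[]): string[][] {
  const placed = new Set<string>();
  let remaining = [...nodes];
  const layers: string[][] = [];
  while (remaining.length > 0) {
    const ready = remaining.filter((n) =>
      (n.depends_on ?? []).every((dep) => placed.has(dep))
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    layers.push(ready.map((n) => n.id));
    for (const n of ready) placed.add(n.id);
    remaining = remaining.filter((n) => !placed.has(n.id));
  }
  return layers;
}
```

A setup node fanning out to three review nodes and joining into a synthesize node, for instance, produces three layers, with all three reviews in the middle layer.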

Example: Conditional Branching

name: classify-and-fix
description: Classify issue type, then run the appropriate fix path

nodes:
  - id: classify
    command: classify-issue
    output_format:
      type: object
      properties:
        type:
          type: string
          enum: [BUG, FEATURE]
      required: [type]

  - id: investigate
    command: investigate-bug
    depends_on: [classify]
    when: "$classify.output.type == 'BUG'"

  - id: plan
    command: plan-feature
    depends_on: [classify]
    when: "$classify.output.type == 'FEATURE'"

  - id: implement
    command: implement-changes
    depends_on: [investigate, plan]
    trigger_rule: none_failed_min_one_success

Full Workflow Schema

# Required
name: workflow-name
description: |
  What this workflow does.  

# Optional
provider: claude
model: sonnet
modelReasoningEffort: medium     # Codex only
webSearchMode: live              # Codex only

# Required
nodes:
  - id: classify                 # Unique node ID (used for dependency refs and $id.output)
    command: classify-issue      # Loads from .archon/commands/classify-issue.md
    output_format:               # Optional: enforce structured JSON output (Claude + Codex)
      type: object
      properties:
        type:
          type: string
          enum: [BUG, FEATURE]
      required: [type]

  - id: investigate
    command: investigate-bug
    depends_on: [classify]       # Wait for classify to complete
    when: "$classify.output.type == 'BUG'"  # Skip if condition is false

  - id: plan
    command: plan-feature
    depends_on: [classify]
    when: "$classify.output.type == 'FEATURE'"

  - id: implement
    command: implement-changes
    depends_on: [investigate, plan]
    trigger_rule: none_failed_min_one_success  # Run if at least one dep succeeded

  - id: inline-node
    prompt: "Summarize the changes made in $implement.output"  # Inline prompt (no command file)
    depends_on: [implement]
    context: fresh               # Force fresh session for this node
    provider: claude             # Per-node provider override
    model: haiku                 # Per-node model override
    # hooks:                     # Optional: per-node SDK hook callbacks (Claude only) — see docs/hooks.md
    # mcp: .archon/mcp/servers.json  # Optional: per-node MCP servers (Claude only)
    # skills: [remotion-best-practices]  # Optional: per-node skills (Claude only) — see docs/skills.md

Node Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| id | string | required | Unique node identifier. Used in depends_on, when:, and $id.output substitution |
| command | string | | Command name to load from .archon/commands/. Mutually exclusive with prompt |
| prompt | string | | Inline prompt string. Mutually exclusive with command |
| depends_on | string[] | [] | Node IDs that must complete before this node runs |
| when | string | | Condition expression. Node is skipped if false |
| trigger_rule | string | all_success | Join semantics when multiple upstreams exist |
| output_format | object | | JSON Schema for structured output. Supported for Claude and Codex nodes |
| context | 'fresh' | | Force a fresh AI session for this node |
| provider | 'claude' \| 'codex' | inherited | Per-node provider override |
| model | string | inherited | Per-node model override |
| allowed_tools | string[] | | Whitelist of built-in tools for this node. [] disables all built-in tools (MCP-only mode). Claude only — Codex nodes emit a warning and ignore this field |
| denied_tools | string[] | | Blacklist of built-in tools to remove from this node. Applied after allowed_tools if both are set. Claude only — Codex nodes emit a warning and ignore this field |
| retry | object | | Per-node retry configuration. See Retry Configuration. Omit to use the automatic default (2 retries, 3 s base delay, transient errors only) |
| hooks | object | | Per-node SDK hook callbacks. Claude only — Codex nodes emit a warning and ignore this field. See docs/hooks.md |
| mcp | string | | Path to MCP server config JSON file (relative to cwd or absolute). Environment variables ($VAR_NAME) in env/headers values are expanded from process.env at execution time. Claude only — Codex nodes emit a warning and ignore this field. See docs/mcp-servers.md |
| skills | string[] | | Skill names to preload into this node's agent context. Skills must be installed in .claude/skills/. The node is wrapped in an AgentDefinition with these skills + Skill auto-added to allowedTools. Claude only — Codex nodes emit a warning and ignore this field. See docs/skills.md |

trigger_rule Values

| Value | Behavior |
| --- | --- |
| all_success | Run only if all upstream deps completed successfully (default) |
| one_success | Run if at least one upstream dep completed successfully |
| none_failed_min_one_success | Run if no deps failed AND at least one succeeded (skipped deps are ok) |
| all_done | Run when all deps are in a terminal state (completed, failed, or skipped) |
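
These join semantics can be sketched as a small decision function (illustrative TypeScript, not Archon's internals), evaluated once every upstream dependency has reached a terminal state:

```typescript
type DepStatus = "completed" | "failed" | "skipped";
type TriggerRule =
  | "all_success"
  | "one_success"
  | "none_failed_min_one_success"
  | "all_done";

// Decide whether a node should run, given its rule and the terminal
// statuses of all its upstream dependencies.
function shouldTrigger(rule: TriggerRule, deps: DepStatus[]): boolean {
  const succeeded = deps.filter((s) => s === "completed").length;
  const failed = deps.filter((s) => s === "failed").length;
  switch (rule) {
    case "all_success":
      return succeeded === deps.length;
    case "one_success":
      return succeeded >= 1;
    case "none_failed_min_one_success":
      return failed === 0 && succeeded >= 1;
    case "all_done":
      return true; // all deps are already terminal when this is evaluated
  }
}
```

Note how none_failed_min_one_success tolerates skipped deps (useful after when: branches), while all_success does not.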

when: Condition Syntax

Conditions use string equality against upstream node outputs:

when: "$nodeId.output == 'VALUE'"
when: "$nodeId.output != 'VALUE'"
when: "$nodeId.output.field == 'VALUE'"    # JSON dot notation for output_format nodes
  • Uses $nodeId.output to reference the full output string of a completed node
  • Use $nodeId.output.field to access a JSON field (for output_format nodes)
  • Invalid expressions default to true (fail open — node runs rather than silently skipping)
  • Skipped nodes propagate their skipped state to dependants
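
A hypothetical evaluator for this syntax might look like the following (illustrative TypeScript; the regex and field handling are assumptions, but the fail-open behavior matches the rules above):

```typescript
// Evaluate a `when:` condition against upstream node outputs.
// Invalid expressions and unparseable outputs fail open (return true).
function evaluateWhen(expr: string, outputs: Record<string, string>): boolean {
  const m = expr.match(/^\$(\w+)\.output((?:\.\w+)*)\s*(==|!=)\s*'([^']*)'$/);
  if (!m) return true; // invalid expression: fail open, node runs
  const [, nodeId, fieldPath, op, expected] = m;
  let value: unknown = outputs[nodeId];
  if (fieldPath) {
    try {
      // JSON dot notation, e.g. $classify.output.type
      value = fieldPath
        .slice(1)
        .split(".")
        .reduce((obj: any, key) => obj?.[key], JSON.parse(outputs[nodeId]));
    } catch {
      return true; // unparseable output: fail open
    }
  }
  return op === "==" ? value === expected : value !== expected;
}
```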

$node_id.output Substitution

In node prompts and commands, reference the output of any upstream node:

nodes:
  - id: classify
    command: classify-issue

  - id: fix
    command: implement-fix
    depends_on: [classify]
    # The command file can use $classify.output or $classify.output.field

Variable substitution order:

  1. Standard variables ($WORKFLOW_ID, $USER_MESSAGE, $ARTIFACTS_DIR, etc.)
  2. Node output references ($nodeId.output, $nodeId.output.field)
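
As an illustrative TypeScript sketch of the two-pass order (naive string replacement; Archon's actual substitution handles more cases):

```typescript
// Pass 1: standard variables ($WORKFLOW_ID, $USER_MESSAGE, ...).
// Pass 2: upstream node output references ($nodeId.output).
function substitute(
  prompt: string,
  vars: Record<string, string>,
  nodeOutputs: Record<string, string>
): string {
  let result = prompt;
  for (const [name, value] of Object.entries(vars)) {
    result = result.split(`$${name}`).join(value);
  }
  for (const [nodeId, output] of Object.entries(nodeOutputs)) {
    result = result.split(`$${nodeId}.output`).join(output);
  }
  return result;
}
```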

output_format for Structured JSON

Use output_format to enforce JSON output from an AI node. For Claude, the schema is passed via the SDK's outputFormat option and structured_output is used directly. For Codex (v0.116.0+), the schema is passed via TurnOptions.outputSchema and the agent's inline JSON response is used. Both ensure clean JSON for when: conditions and $nodeId.output substitution:

nodes:
  - id: classify
    command: classify-issue
    output_format:
      type: object
      properties:
        type:
          type: string
          enum: [BUG, FEATURE]
        severity:
          type: string
          enum: [low, medium, high]
      required: [type]
  • The output is captured as a JSON string and available via $classify.output (full JSON) or $classify.output.type (field access)
  • Use output_format when downstream nodes need to branch on specific values via when:

allowed_tools and denied_tools for Tool Restrictions

Restrict which built-in tools a node can use without relying on prompt instructions. Restrictions are enforced at the Claude SDK level.

nodes:
  - id: review
    command: code-review
    allowed_tools: [Read, Grep, Glob]   # whitelist — only these tools available

  - id: implement
    command: implement-feature
    denied_tools: [WebSearch, WebFetch] # blacklist — remove these tools

  - id: mcp-only
    command: mcp-command
    allowed_tools: []                   # empty list = disable all built-in tools
  • allowed_tools: [] disables all built-in tools (useful for MCP-only nodes). Use the mcp field on a node to attach per-node MCP servers — see Node Fields
  • If both are set, denied_tools is applied after allowed_tools
  • undefined (field absent) and [] have different semantics — absent means use default tool set, [] means no tools
  • Claude only — Codex nodes emit a warning and continue (Codex doesn't support per-call tool restrictions)
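
The resolution order can be sketched as follows (illustrative TypeScript; DEFAULT_TOOLS is a hypothetical subset, not the SDK's real default set):

```typescript
// Hypothetical default tool set, for illustration only.
const DEFAULT_TOOLS = ["Read", "Write", "Edit", "Bash", "Grep", "Glob", "WebSearch"];

// Apply allowed_tools first (undefined = default set, [] = no built-in
// tools), then remove anything listed in denied_tools.
function resolveTools(allowed?: string[], denied?: string[]): string[] {
  const base = allowed ?? DEFAULT_TOOLS;
  return base.filter((t) => !(denied ?? []).includes(t));
}
```

The `??` (rather than `||`) is what preserves the distinction between an absent field and an empty list.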

Retry Configuration

Every node automatically retries on transient errors (SDK subprocess crashes, rate limits, network timeouts) using a default configuration: 2 retries, 3 s base delay with exponential backoff. You will see a platform notification before each retry attempt.

To opt out or customise, add a retry: block:

nodes:
  - id: flaky-node
    command: flaky-command
    retry:
      max_attempts: 3      # Total attempts including the first (1-5)
      delay_ms: 5000       # Base delay before first retry in ms (1000-60000, default: 3000)
      on_error: transient  # 'transient' (default) | 'all'

  - id: no-retry-node
    command: stable-command
    retry:
      max_attempts: 1      # Effectively disables retry

  - id: aggressive-retry
    prompt: "Summarise the output"
    retry:
      max_attempts: 4
      on_error: all        # Retry even non-transient errors (use with caution)

Retry Fields

| Field | Type | Default | Constraints | Description |
| --- | --- | --- | --- | --- |
| max_attempts | number | 3 | 1-5 | Total attempts including the first. 1 disables retry |
| delay_ms | number | 3000 | 1000-60000 | Base delay in ms before the first retry. Doubles each attempt (exponential backoff) |
| on_error | 'transient' \| 'all' | 'transient' | | Which errors trigger a retry. 'transient' = SDK crashes, rate limits, network timeouts only. 'all' = any error including unknown errors (FATAL errors such as auth failures are never retried regardless) |
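
The doubling schedule is simple arithmetic; as an illustrative sketch:

```typescript
// Delay before the Nth retry (1-based): the base delay doubles each attempt.
function retryDelayMs(baseDelayMs: number, retryNumber: number): number {
  return baseDelayMs * 2 ** (retryNumber - 1);
}
```

With the default delay_ms of 3000, successive retries wait 3 s, 6 s, then 12 s.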

Error Classification

Archon classifies errors into three buckets before deciding whether to retry:

| Class | Examples | Retried by default? |
| --- | --- | --- |
| FATAL | Auth failure, permission denied, credit balance exhausted | Never (even with on_error: all) |
| TRANSIENT | Process crashed (exited with code), rate limit, network timeout | Yes |
| UNKNOWN | Unrecognised error messages | No (unless on_error: all) |
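
Combining the classification with on_error gives the retry decision (illustrative TypeScript sketch of the rules in the table above):

```typescript
type ErrorClass = "FATAL" | "TRANSIENT" | "UNKNOWN";

// FATAL errors are never retried, TRANSIENT errors always are, and
// UNKNOWN errors only when on_error is 'all'.
function shouldRetry(errClass: ErrorClass, onError: "transient" | "all"): boolean {
  if (errClass === "FATAL") return false;
  if (errClass === "TRANSIENT") return true;
  return onError === "all";
}
```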

Retry Notifications

Before each retry the platform receives a message like:

⚠️ Node `node-name` failed with transient error (attempt 1/3). Retrying in 3s...

Two-Layer Retry Stack

Archon uses two independent retry layers:

SDK subprocess retry (claude.ts)  — 3 total attempts, 2 s base backoff
    ↓ only if all SDK retries exhausted
Node retry (dag-executor)  — default 2 retries, 3 s base backoff
    ↓ only if all node retries exhausted
Workflow fails → next invocation auto-resumes

This means a single transient crash may trigger up to 3 SDK retries before a single node retry attempt is consumed.

Resume: Resume is automatic — the next invocation detects the prior failed run and skips already-completed nodes. No --resume flag is needed. See Resume on Failure below.


Resume on Failure

When a workflow fails, the next invocation automatically resumes from where it left off — no --resume flag required.

How it works:

  1. On each invocation, Archon checks for a prior failed run of the same workflow in the same conversation.
  2. If found, it loads the node_completed events from that run to determine which nodes finished successfully.
  3. Completed nodes are skipped; only failed and not-yet-run nodes are executed.
  4. You receive a platform message like: ▶️ Resuming workflow — skipping 3 already-completed node(s).

Known limitation: AI session context from prior nodes is not restored. If a downstream node relies on in-context knowledge from a prior run's session (rather than artifacts), it may need to re-read those artifacts explicitly.

Fresh start: If zero nodes completed in the prior run, Archon starts fresh (no nodes to skip).
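
The skip computation can be sketched as follows (illustrative TypeScript; the event shape is an assumption based on the node_completed events described above):

```typescript
interface RunEvent {
  type: string; // e.g. "node_completed", "node_failed"
  nodeId?: string;
}

// From a prior failed run's event log, collect completed node IDs and
// return the nodes that still need to execute on resume.
function nodesToRun(allNodeIds: string[], priorEvents: RunEvent[]): string[] {
  const done = new Set(
    priorEvents
      .filter((e) => e.type === "node_completed" && e.nodeId)
      .map((e) => e.nodeId!)
  );
  return allNodeIds.filter((id) => !done.has(id));
}
```

If no node_completed events exist, the set of completed nodes is empty and every node runs, matching the fresh-start behavior.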


Parallel Execution

Nodes without dependencies (or whose dependencies have all completed) run concurrently in the same topological layer:

nodes:
  - id: setup
    command: setup-scope            # Creates shared context

  - id: review-code
    command: review-code
    depends_on: [setup]             # These three run in parallel

  - id: review-comments
    command: review-comments
    depends_on: [setup]

  - id: review-security
    command: review-security
    depends_on: [setup]

  - id: synthesize
    command: synthesize-reviews     # Waits for all three reviews
    depends_on: [review-code, review-comments, review-security]
    context: fresh

Parallel Execution Rules

  1. Each node gets its own session - no context sharing (use context: fresh for explicit control)
  2. All nodes in a layer must complete before the next layer runs
  3. All failures are reported - not just the first one
  4. Shared state via artifacts - nodes read/write to known paths

Pattern: Coordinator + Parallel Agents

name: comprehensive-review
nodes:
  - id: scope
    command: create-review-scope

  - id: code-review
    command: code-review-agent
    depends_on: [scope]

  - id: comment-quality
    command: comment-quality-agent
    depends_on: [scope]

  - id: test-coverage
    command: test-coverage-agent
    depends_on: [scope]

  - id: synthesize
    command: synthesize-review
    depends_on: [code-review, comment-quality, test-coverage]
    context: fresh

The coordinator writes to .archon/artifacts/reviews/pr-{n}/scope.md. Each agent reads scope, writes to {category}-findings.md. The synthesizer reads all findings and produces final output.


The Artifact Chain

Workflows work because artifacts pass data between nodes:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ investigate     │     │ implement       │     │ create-pr       │
│                 │     │                 │     │                 │
│ Reads: input    │     │ Reads: artifact │     │ Reads: git diff │
│ Writes: artifact│────▶│ Writes: code    │────▶│ Writes: PR      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │
         ▼                       ▼
  .archon/artifacts/      src/feature.ts
  issues/issue-123.md     src/feature.test.ts

Designing Artifact Flow

When creating a workflow, plan the artifact chain:

| Node | Reads | Writes |
| --- | --- | --- |
| investigate-issue | GitHub issue via gh | .archon/artifacts/issues/issue-{n}.md |
| implement-issue | Artifact from investigate | Code files, tests |
| create-pr | Git diff | GitHub PR |

Each command must know:

  • Where to find its input
  • Where to write its output
  • What format to use

Model Configuration

Workflows can configure AI models and provider-specific options at the workflow level.

Configuration Priority

Model and options are resolved in this order:

  1. Workflow-level - Explicit settings in the workflow YAML
  2. Config defaults - assistants.* in .archon/config.yaml
  3. SDK defaults - Built-in defaults from Claude/Codex SDKs
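
This precedence is a straightforward nullish-coalescing chain; as an illustrative TypeScript sketch (not Archon's actual resolution code):

```typescript
interface ModelSettings {
  model?: string;
}

// Resolve the effective model: explicit workflow setting wins, then the
// config default from .archon/config.yaml, then the SDK built-in default.
function resolveModel(
  workflow: ModelSettings,
  configDefaults: ModelSettings,
  sdkDefault: string
): string {
  return workflow.model ?? configDefaults.model ?? sdkDefault;
}
```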

Provider and Model

name: my-workflow
provider: claude     # 'claude' or 'codex' (default: from config)
model: sonnet        # Model override (default: from config assistants.claude.model)

Claude models:

  • sonnet - Fast, balanced (recommended)
  • opus - Powerful, expensive
  • haiku - Fast, lightweight
  • claude-* - Full model IDs (e.g., claude-3-5-sonnet-20241022)
  • inherit - Use model from previous session

Codex models:

  • Any OpenAI model ID (e.g., gpt-5.3-codex, o5-pro)
  • Cannot use Claude model aliases

Codex-Specific Options

name: my-workflow
provider: codex
model: gpt-5.3-codex
modelReasoningEffort: medium    # 'minimal' | 'low' | 'medium' | 'high' | 'xhigh'
webSearchMode: live             # 'disabled' | 'cached' | 'live'
additionalDirectories:
  - /absolute/path/to/other/repo
  - /path/to/shared/library

Model reasoning effort:

  • minimal, low - Fast, cheaper
  • medium - Balanced (default)
  • high, xhigh - More thorough, expensive

Web search mode:

  • disabled - No web access (default)
  • cached - Use cached search results
  • live - Real-time web search

Additional directories:

  • Codex can access files outside the codebase
  • Useful for shared libraries, documentation repos
  • Must be absolute paths

Model Validation

Workflows are validated at load time:

  • Provider/model compatibility checked
  • Invalid combinations fail with clear error messages
  • Validation errors shown in /workflow list

Example validation error:

Model "sonnet" is not compatible with provider "codex"

Example: Config Defaults + Workflow Override

.archon/config.yaml:

assistants:
  claude:
    model: haiku  # Fast model for most tasks
  codex:
    model: gpt-5.3-codex
    modelReasoningEffort: low
    webSearchMode: disabled

Workflow with override:

name: complex-analysis
description: Deep code analysis requiring powerful model
provider: claude
model: opus  # Override config default (haiku) for this workflow
nodes:
  - id: analyze
    command: analyze-architecture
  - id: report
    command: generate-report
    depends_on: [analyze]

The workflow uses opus instead of the config default haiku, but other settings inherit from config.


Workflow Description Best Practices

Write descriptions that help with routing and user understanding:

description: |
  Investigate and fix a GitHub issue end-to-end.

  **Use when**: User provides a GitHub issue number or URL
  **NOT for**: Feature requests, refactoring, documentation

  **Produces**:
  - Investigation artifact
  - Code changes
  - Pull request linked to issue

  **Steps**:
  1. Investigate root cause
  2. Implement fix with tests
  3. Create PR  

Good descriptions include:

  • What the workflow does
  • When to use it (and when NOT to)
  • What it produces
  • High-level steps

Variable Substitution

All workflows support these variables in prompts and commands:

| Variable | Description |
| --- | --- |
| $WORKFLOW_ID | Unique ID for this workflow run |
| $USER_MESSAGE | Original message that triggered workflow |
| $ARGUMENTS | Same as $USER_MESSAGE |
| $ARTIFACTS_DIR | Pre-created artifacts directory for this workflow run |
| $BASE_BRANCH | Base branch; auto-detected from git when worktree.baseBranch is not set. Fails only if referenced and detection fails |
| $CONTEXT | GitHub issue/PR context (if available) |
| $EXTERNAL_CONTEXT | Same as $CONTEXT |
| $ISSUE_CONTEXT | Same as $CONTEXT |
| $nodeId.output | Output of a completed upstream DAG node |
| $nodeId.output.field | JSON field from a structured upstream node output |

Example:

prompt: |
  Workflow: $WORKFLOW_ID
  Original request: $USER_MESSAGE

  GitHub context:
  $CONTEXT

  [Instructions...]  

Example Workflows

Simple Two-Node

name: quick-fix
description: |
  Fast bug fix without full investigation.
  Use when: Simple, obvious bugs.
  NOT for: Complex issues needing root cause analysis.  

nodes:
  - id: fix
    command: analyze-and-fix
  - id: pr
    command: create-pr
    depends_on: [fix]
    context: fresh

Investigation Pipeline

name: fix-github-issue
description: |
  Full investigation and fix for GitHub issues.

  Use when: User provides issue number/URL
  Produces: Investigation artifact, code fix, PR  

nodes:
  - id: investigate
    command: investigate-issue    # Creates .archon/artifacts/issues/issue-{n}.md
  - id: implement
    command: implement-issue      # Reads artifact, implements fix
    depends_on: [investigate]
    context: fresh

Parallel Review

name: comprehensive-pr-review
description: |
  Multi-agent PR review covering code, comments, tests, and security.

  Use when: Reviewing PRs before merge
  Produces: Review findings, synthesized summary  

nodes:
  - id: scope
    command: create-review-scope

  - id: code-review
    command: code-review-agent
    depends_on: [scope]
  - id: comment-review
    command: comment-quality-agent
    depends_on: [scope]
  - id: test-review
    command: test-coverage-agent
    depends_on: [scope]
  - id: security-review
    command: security-review-agent
    depends_on: [scope]

  - id: synthesize
    command: synthesize-reviews
    depends_on: [code-review, comment-review, test-review, security-review]
    context: fresh

Loop Node

Loop nodes iterate until a completion signal is detected. Use them within a DAG for autonomous iteration:

name: implement-prd
description: |
  Autonomously implement a PRD, iterating until all stories pass.

  Use when: Full PRD implementation
  Requires: PRD file at .archon/prd.md  

nodes:
  - id: implement
    loop:
      until: COMPLETE
      max_iterations: 15
      fresh_context: true       # Progress tracked in files
    prompt: |
      # PRD Implementation Loop

      Workflow: $WORKFLOW_ID

      ## Instructions

      1. Read PRD from `.archon/prd.md`
      2. Read progress from `.archon/progress.json`
      3. Find the next incomplete story
      4. Implement it with tests
      5. Run validation: `bun run validate`
      6. Update progress file
      7. If ALL stories complete and validated:
         Output: <promise>COMPLETE</promise>

      ## Important

      - Implement ONE story per iteration
      - Always run validation after changes
      - Update progress file before ending iteration      

Classify and Route

name: classify-and-fix
description: |
  Classify issue type and run the appropriate path in parallel.

  Use when: User reports a bug or requests a feature
  Produces: Code fix (bug path) or feature plan (feature path), then PR  

nodes:
  - id: classify
    command: classify-issue
    output_format:
      type: object
      properties:
        type:
          type: string
          enum: [BUG, FEATURE]
      required: [type]

  - id: investigate
    command: investigate-bug
    depends_on: [classify]
    when: "$classify.output.type == 'BUG'"

  - id: plan
    command: plan-feature
    depends_on: [classify]
    when: "$classify.output.type == 'FEATURE'"

  - id: implement
    command: implement-changes
    depends_on: [investigate, plan]
    trigger_rule: none_failed_min_one_success

  - id: create-pr
    command: create-pr
    depends_on: [implement]
    context: fresh

Test-Fix Loop

name: fix-until-green
description: |
  Keep fixing until all tests pass.
  Use when: Tests are failing and need automated fixing.  

nodes:
  - id: fix-loop
    loop:
      until: ALL_TESTS_PASS
      max_iterations: 5
      fresh_context: false      # Remember what we've tried
    prompt: |
      # Fix Until Green

      ## Instructions

      1. Run tests: `bun test`
      2. If all pass: <promise>ALL_TESTS_PASS</promise>
      3. If failures:
         - Analyze the failure
         - Fix the code (not the test, unless test is wrong)
         - Run tests again

      ## Rules

      - Don't skip or delete failing tests
      - Don't modify test expectations unless they're wrong
      - Each iteration should fix at least one failure      

Common Patterns

Pattern: Gated Execution

Run different paths based on conditions using when::

name: smart-fix
description: Route to appropriate fix strategy based on issue complexity

nodes:
  - id: analyze
    command: analyze-complexity
    output_format:
      type: object
      properties:
        complexity:
          type: string
          enum: [simple, complex]
      required: [complexity]

  - id: quick-fix
    command: quick-fix-strategy
    depends_on: [analyze]
    when: "$analyze.output.complexity == 'simple'"

  - id: deep-fix
    command: deep-fix-strategy
    depends_on: [analyze]
    when: "$analyze.output.complexity == 'complex'"

Pattern: Checkpoint and Resume

For long workflows, save checkpoints. Resume is automatic on re-invocation — completed nodes are skipped:

name: large-migration
description: Multi-file migration with checkpoint recovery

nodes:
  - id: plan
    command: create-migration-plan

  - id: batch-1
    command: migrate-batch-1
    depends_on: [plan]
    context: fresh

  - id: batch-2
    command: migrate-batch-2
    depends_on: [batch-1]
    context: fresh

  - id: validate
    command: validate-migration
    depends_on: [batch-2]
    context: fresh

Each batch command saves progress to an artifact. If the workflow fails mid-way, re-invoking it skips already-completed nodes.

Pattern: Human-in-the-Loop

Pause for human approval:

name: careful-refactor
description: Refactor with human approval at each stage

nodes:
  - id: propose
    command: propose-refactor         # Creates proposal artifact
  # Workflow pauses here - human reviews proposal
  # Human triggers next workflow to continue:

Then a separate workflow to continue:

name: execute-refactor
nodes:
  - id: execute
    command: execute-approved-refactor
  - id: pr
    command: create-pr
    depends_on: [execute]
    context: fresh

Debugging Workflows

Check Workflow Discovery

bun run cli workflow list

Run with Verbose Output

bun run cli workflow run {name} "test input"

Watch the streaming output to see each node.

Check Artifacts

After a workflow runs, check the artifacts:

ls -la .archon/artifacts/
cat .archon/artifacts/issues/issue-*.md

Check Logs

Workflow execution logs to:

.archon/logs/{workflow-id}.jsonl

Each line is a JSON event (node start, AI response, tool call, etc.).


Workflow Validation

Before deploying a workflow:

  1. Test each command individually

    bun run cli workflow run {workflow} "test input"
    
  2. Verify artifact flow

    • Does each node produce what downstream nodes expect?
    • Are paths correct?
    • Is the format complete?
  3. Test edge cases

    • What if the input is invalid?
    • What if a node fails?
    • What if an artifact is missing?
  4. Check iteration limits (for loops)

    • Is max_iterations reasonable?
    • What happens when limit is hit?

Summary

  1. Workflows orchestrate commands - YAML files that define a DAG of nodes
  2. Nodes with dependencies - depends_on edges control execution order; independent nodes run in parallel
  3. Artifacts are the glue - Commands communicate via files, not memory
  4. context: fresh - Fresh session for a node, works from artifacts
  5. Parallel execution - Nodes in the same topological layer run concurrently
  6. Loop nodes - loop: on a node iterates until <promise>COMPLETE</promise> signal
  7. Conditional branching - when: conditions and trigger_rule control which nodes run
  8. output_format - Enforce structured JSON output from AI nodes for reliable branching
  9. allowed_tools / denied_tools - Restrict which tools a node can use (Claude only, enforced at SDK level)
  10. retry: - All nodes auto-retry transient errors (default: 2 retries, 3 s backoff); configure per-node with retry: block
  11. hooks — Attach static SDK hook callbacks to individual Claude nodes for tool control and context injection (see docs/hooks.md)
  12. mcp: — Attach per-node MCP servers via a JSON config file path (Claude only; env vars expanded at execution time); use with allowed_tools: [] for MCP-only nodes
  13. skills: — Preload named skills into individual Claude nodes for domain expertise (Claude only; see docs/skills.md)
  14. Test thoroughly - Each command, the artifact flow, and edge cases