ring/dev-team/skills/dev-chaos-testing/SKILL.md
Fred Amaral f862c50a56
feat(dev-cycle): reclassify gate cadence for ~40-50% speedup
Reclassify gates 1,2,4,5,6w,7w,8 (backend) and 1,2,4,5,6,7 (frontend) from subtask to task cadence. Gates 0, 3, 9 remain subtask-level. All 8 reviewers still run, all quality thresholds preserved.

Additional changes: standards pre-cache at Step 1.5 (cached_standards in state); Gate 0.5 merged into Gate 0 exit criteria via ring:dev-implementation Step 7; dev-report aggregates cycle-wide via accumulated_metrics (single cycle-end dispatch); dev-refactor clusters findings by (file, pattern_category) with findings:[] traceability array; read-after-write state verification removed; per-subtask visual reports opt-in only. State schema v1.1.0 (additive - backward compatible).

New shared patterns: standards-cache-protocol.md, gate-cadence-classification.md.

X-Lerian-Ref: 0x1
2026-04-17 21:25:17 -03:00


---
name: ring:dev-chaos-testing
description: |
  Gate 7 of development cycle - ensures chaos tests exist using Toxiproxy to verify
  graceful degradation under connection loss, latency, and partitions.
  Runs at TASK cadence (after all subtasks complete Gate 0 + Gate 3 + Gate 9):
  write mode runs per task, execute mode runs per cycle.

trigger: |
  - After integration testing complete (Gate 6)
  - MANDATORY for all development tasks with external dependencies
  - Verifies system behavior under failure conditions

skip_when: |
  - Not inside a development cycle (ring:dev-cycle)
  - Service has no external dependencies (no database, cache, queue, or external API)
  - Task is documentation-only, configuration-only, or non-code
  - Frontend-only project with no backend service dependencies

NOT_skip_when: |
  - "Infrastructure is reliable" - All infrastructure fails eventually. Be prepared.
  - "Integration tests cover failures" - Integration tests verify happy path. Chaos verifies failures.
  - "Toxiproxy is complex" - One container, 20 minutes setup. Prevents production incidents.

sequence:
  after: [ring:dev-integration-testing]
  before: [ring:codereview]

related:
  complementary: [ring:dev-cycle, ring:dev-integration-testing, ring:qa-analyst]

input_schema:
  required:
    - name: unit_id
      type: string
      description: "TASK identifier (not a subtask id). This skill's write mode runs at TASK cadence, so unit_id is always a task id. Execute mode runs per cycle."
    - name: external_dependencies
      type: array
      items: string
      description: "External services (postgres, redis, rabbitmq, etc.)"
    - name: language
      type: string
      enum: [go, typescript]
      description: "Programming language"
    - name: implementation_files
      type: array
      items: string
      description: "Union of changed files across all subtasks of this task."
    - name: gate0_handoffs
      type: array
      description: "Array of per-subtask implementation handoffs (one entry per subtask). NOT a single gate0_handoff object."
  optional:
    - name: gate6_handoff
      type: object
      description: "Full handoff from Gate 6 (integration testing)"

output_schema:
  format: markdown
  required_sections:
    - name: "Chaos Testing Summary"
      pattern: "^## Chaos Testing Summary"
      required: true
    - name: "Failure Scenarios"
      pattern: "^## Failure Scenarios"
      required: true
    - name: "Handoff to Next Gate"
      pattern: "^## Handoff to Next Gate"
      required: true
  metrics:
    - name: result
      type: enum
      values: [PASS, FAIL]
    - name: dependencies_tested
      type: integer
    - name: scenarios_tested
      type: integer
    - name: recovery_verified
      type: boolean
    - name: iterations
      type: integer

verification:
  automated:
    - command: "grep -rn 'TestIntegration_Chaos_' --include='*_test.go' ."
      description: "Chaos test functions exist"
      success_pattern: "TestIntegration_Chaos_"
    - command: "grep -rn 'CHAOS.*1' --include='*_test.go' ."
      description: "CHAOS env check present"
      success_pattern: "CHAOS"
  manual:
    - "Chaos tests follow TestIntegration_Chaos_{Component}_{Scenario} naming"
    - "All external dependencies have failure scenarios"
    - "Recovery verified after each failure injection"
---

Dev Chaos Testing (Gate 7)

Overview

Ensure code handles failure conditions gracefully by injecting faults with Toxiproxy. Verify that connection loss, latency, and network partitions do not cause crashes.

Core principle: All infrastructure fails. Chaos testing ensures your code handles it gracefully.

<block_condition>

  • No chaos tests = FAIL
  • Any dependency without failure test = FAIL
  • Recovery not verified = FAIL
  • System crashes on failure = FAIL

</block_condition>

CRITICAL: Role Clarification

This skill ORCHESTRATES. QA Analyst Agent (chaos mode) EXECUTES.

| Who | Responsibility |
|-----|----------------|
| This Skill | Gather requirements, dispatch agent, track iterations |
| QA Analyst Agent | Write chaos tests, set up Toxiproxy, verify recovery |

Standards Source (Cache-First Pattern)

This sub-skill reads standards from state.cached_standards, which dev-cycle Step 1.5 populates. If invoked outside a cycle (standalone), it falls back to a direct WebFetch and logs a warning. See shared-patterns/standards-cache-protocol.md for protocol details.

Standards Reference

MANDATORY: Load testing-chaos.md standards via the cache-first pattern below.

URL: https://raw.githubusercontent.com/LerianStudio/ring/main/dev-team/docs/standards/golang/testing-chaos.md

Cache-first loading protocol:

For each required standards URL:
  IF state.cached_standards[url] exists:
    → Read content from state.cached_standards[url].content
    → Log: "Using cached standard: {url} (fetched {state.cached_standards[url].fetched_at})"
  ELSE:
    → WebFetch url (fallback; should not happen if orchestrator ran Step 1.5)
    → Log warning: "Standard {url} was not pre-cached; fetched inline"
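The cache-first lookup can be sketched in Go as follows. This is a minimal illustration, not the orchestrator's implementation: the `CachedStandard` type, the `LoadStandard` function, and the `fetch` callback (a stand-in for WebFetch) are hypothetical names, and the state is reduced to a plain map keyed by URL.

```go
package main

import (
	"fmt"
	"log"
)

// CachedStandard mirrors one entry of state.cached_standards.
// The Go type and field names are assumptions made for this sketch.
type CachedStandard struct {
	Content   string
	FetchedAt string
}

// LoadStandard returns the cached content when the URL is present in the
// cache. Otherwise it calls fetch (a stand-in for WebFetch) and logs a
// warning. The boolean result reports whether the cache was used.
func LoadStandard(cache map[string]CachedStandard, url string, fetch func(string) (string, error)) (string, bool, error) {
	if entry, ok := cache[url]; ok {
		log.Printf("Using cached standard: %s (fetched %s)", url, entry.FetchedAt)
		return entry.Content, true, nil
	}
	content, err := fetch(url)
	if err != nil {
		return "", false, fmt.Errorf("fetch %s: %w", url, err)
	}
	log.Printf("WARNING: Standard %s was not pre-cached; fetched inline", url)
	return content, false, nil
}

func main() {
	const url = "https://raw.githubusercontent.com/LerianStudio/ring/main/dev-team/docs/standards/golang/testing-chaos.md"
	// Placeholder cache entry; real content comes from dev-cycle Step 1.5.
	cache := map[string]CachedStandard{
		url: {Content: "placeholder standards text", FetchedAt: "placeholder-timestamp"},
	}
	fetch := func(string) (string, error) { return "placeholder fetched text", nil }

	content, fromCache, err := LoadStandard(cache, url, fetch)
	fmt.Println(content, fromCache, err)
}
```

Passing the fetcher as a parameter keeps the fallback path testable without network access.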

<fetch_required> https://raw.githubusercontent.com/LerianStudio/ring/main/dev-team/docs/standards/golang/testing-chaos.md </fetch_required>


Step 0: Detect External Dependencies (Auto-Detection)

MANDATORY: When external_dependencies is empty or not provided, scan the codebase to detect them automatically before validation.

if external_dependencies is empty or not provided:

  detected_dependencies = []

  1. Scan docker-compose.yml / docker-compose.yaml for service images:
     - Grep tool: pattern "postgres" in docker-compose* files → add "postgres"
     - Grep tool: pattern "mongo" in docker-compose* files → add "mongodb"
     - Grep tool: pattern "valkey" in docker-compose* files → add "valkey"
     - Grep tool: pattern "redis" in docker-compose* files → add "redis"
     - Grep tool: pattern "rabbitmq" in docker-compose* files → add "rabbitmq"

  2. Scan dependency manifests:
     if language == "go":
       - Grep tool: pattern "github.com/lib/pq" in go.mod → add "postgres"
       - Grep tool: pattern "github.com/jackc/pgx" in go.mod → add "postgres"
       - Grep tool: pattern "go.mongodb.org/mongo-driver" in go.mod → add "mongodb"
       - Grep tool: pattern "github.com/redis/go-redis" in go.mod → add "redis"
       - Grep tool: pattern "github.com/valkey-io/valkey-go" in go.mod → add "valkey"
       - Grep tool: pattern "github.com/rabbitmq/amqp091-go" in go.mod → add "rabbitmq"

     if language == "typescript":
       - Grep tool: pattern "\"pg\"" in package.json → add "postgres"
       - Grep tool: pattern "@prisma/client" in package.json → add "postgres"
       - Grep tool: pattern "\"mongodb\"" in package.json → add "mongodb"
       - Grep tool: pattern "\"mongoose\"" in package.json → add "mongodb"
       - Grep tool: pattern "\"redis\"" in package.json → add "redis"
       - Grep tool: pattern "\"ioredis\"" in package.json → add "redis"
       - Grep tool: pattern "@valkey" in package.json → add "valkey"
       - Grep tool: pattern "\"amqplib\"" in package.json → add "rabbitmq"
       - Grep tool: pattern "amqp-connection-manager" in package.json → add "rabbitmq"

  3. Deduplicate detected_dependencies
  4. Set external_dependencies = detected_dependencies

  Log: "Auto-detected external dependencies: [detected_dependencies]"
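For the Go branch of the manifest scan, a minimal sketch might look like the following. It assumes the go.mod text is already in memory and uses plain substring matching, which is what the Grep steps above amount to; the `goModMarkers` map and `DetectFromGoMod` function are hypothetical names, and the sample go.mod content is illustrative only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// goModMarkers maps the go.mod module paths listed in Step 0 to the
// dependency name each one implies.
var goModMarkers = map[string]string{
	"github.com/lib/pq":              "postgres",
	"github.com/jackc/pgx":           "postgres",
	"go.mongodb.org/mongo-driver":    "mongodb",
	"github.com/redis/go-redis":      "redis",
	"github.com/valkey-io/valkey-go": "valkey",
	"github.com/rabbitmq/amqp091-go": "rabbitmq",
}

// DetectFromGoMod returns the deduplicated, sorted dependency names whose
// marker appears anywhere in the go.mod text.
func DetectFromGoMod(goMod string) []string {
	seen := map[string]bool{}
	for marker, dep := range goModMarkers {
		if strings.Contains(goMod, marker) {
			seen[dep] = true // the set deduplicates pq + pgx into one "postgres"
		}
	}
	deps := make([]string, 0, len(seen))
	for dep := range seen {
		deps = append(deps, dep)
	}
	sort.Strings(deps)
	return deps
}

func main() {
	// Illustrative go.mod content, not taken from a real project.
	goMod := "module example.com/svc\n\nrequire (\n" +
		"\tgithub.com/jackc/pgx/v5 v5.5.0\n" +
		"\tgithub.com/lib/pq v1.10.9\n" +
		"\tgithub.com/redis/go-redis/v9 v9.5.1\n)\n"
	fmt.Println("Auto-detected external dependencies:", DetectFromGoMod(goMod))
}
```

Sorting the result makes the log line stable across runs, since Go map iteration order is randomized.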

<auto_detect_reason> PM team task files often omit external_dependencies. If the codebase uses postgres, mongodb, valkey, redis, or rabbitmq, these are external dependencies that MUST have chaos tests. Auto-detection prevents silent skips. </auto_detect_reason>


Step 1: Validate Input

REQUIRED INPUT:
- unit_id: [TASK id — write mode runs at task cadence, not per subtask]
- external_dependencies: [postgres, mongodb, valkey, redis, rabbitmq, etc.] (from input OR auto-detected in Step 0)
- language: [go|typescript]
- implementation_files: [union of changed files across all subtasks of this task]
- gate0_handoffs: [array of per-subtask Gate 0 handoffs — one entry per subtask]

OPTIONAL INPUT:
- gate6_handoff: [full Gate 6 output]

if any REQUIRED input is missing:
  → STOP and report: "Missing required input: [field]"

if external_dependencies is empty (AFTER auto-detection in Step 0):
  → STOP and report: "No external dependencies found after codebase scan - chaos testing requires dependencies"
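The validation rules above can be sketched as a single function. The `Input` struct and its field names are hypothetical, and the handoffs are simplified to strings; only the error messages come from this step.

```go
package main

import (
	"errors"
	"fmt"
)

// Input mirrors the REQUIRED INPUT list of Step 1.
// The struct and its field names are assumptions made for this sketch.
type Input struct {
	UnitID               string
	ExternalDependencies []string // from input OR auto-detected in Step 0
	Language             string
	ImplementationFiles  []string
	Gate0Handoffs        []string // one entry per subtask; simplified to strings
}

// Validate returns the first STOP condition that applies, or nil.
// Messages mirror the wording in Step 1, hence the capitalized strings.
func Validate(in Input) error {
	switch {
	case in.UnitID == "":
		return errors.New("Missing required input: unit_id")
	case in.Language == "":
		return errors.New("Missing required input: language")
	case len(in.ImplementationFiles) == 0:
		return errors.New("Missing required input: implementation_files")
	case len(in.Gate0Handoffs) == 0:
		return errors.New("Missing required input: gate0_handoffs")
	case len(in.ExternalDependencies) == 0:
		// Checked last: Step 0 auto-detection has already had its chance.
		return errors.New("No external dependencies found after codebase scan - chaos testing requires dependencies")
	}
	return nil
}

func main() {
	in := Input{
		UnitID:              "T-001",
		Language:            "go",
		ImplementationFiles: []string{"internal/repo/account.go"},
		Gate0Handoffs:       []string{"subtask-1 handoff"},
	}
	fmt.Println(Validate(in))
}
```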

Step 2: Dispatch QA Analyst Agent (Chaos Mode)

Task tool:
  subagent_type: "ring:qa-analyst"
  prompt: |
    **MODE:** CHAOS TESTING (Gate 7)

    **Standards:** Load testing-chaos.md

    **Input:**
    - Unit ID: {unit_id}
    - External Dependencies: {external_dependencies}
    - Language: {language}

    **Requirements:**
    1. Setup Toxiproxy infrastructure in tests/utils/chaos/
    2. Create chaos tests (TestIntegration_Chaos_{Component}_{Scenario} naming)
    3. Use dual-gate pattern (CHAOS=1 env + testing.Short())
    4. Test failure scenarios: Connection Loss, High Latency, Network Partition
    5. Verify 5-phase structure: Normal → Inject → Verify → Restore → Recovery

    **Output Sections Required:**
    - ## Chaos Testing Summary
    - ## Failure Scenarios
    - ## Handoff to Next Gate
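Requirements 3 and 5 can be illustrated without a running Toxiproxy. The sketch below is an assumption-laden outline, not the pattern defined in testing-chaos.md: `ShouldRunChaos`, `Phases`, and `RunFivePhases` are hypothetical names, the Toxiproxy calls are hidden behind the `Inject` and `Restore` hooks, and the Verify phase assumes the injected fault makes the check return an error rather than merely slow it down.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnavailable is the error the simulated dependency returns while a
// fault is active.
var ErrUnavailable = errors.New("dependency unavailable")

// ShouldRunChaos is the dual gate: CHAOS=1 must be set and -short must be off.
// In a real test this would read os.Getenv("CHAOS") and testing.Short(),
// and call t.Skip when the gate is closed.
func ShouldRunChaos(chaosEnv string, short bool) bool {
	return chaosEnv == "1" && !short
}

// Phases holds the hooks for one scenario. Inject and Restore would wrap
// Toxiproxy calls in a real test; Check exercises the dependency.
type Phases struct {
	Check   func() error
	Inject  func() error
	Restore func() error
}

// RunFivePhases walks Normal -> Inject -> Verify -> Restore -> Recovery.
func RunFivePhases(p Phases) error {
	if err := p.Check(); err != nil { // 1. Normal: baseline must work
		return fmt.Errorf("normal: %w", err)
	}
	if err := p.Inject(); err != nil { // 2. Inject the fault
		return fmt.Errorf("inject: %w", err)
	}
	verifyErr := p.Check()              // 3. Verify: expect an error, not a crash
	if err := p.Restore(); err != nil { // 4. Restore, even if Verify surprised us
		return fmt.Errorf("restore: %w", err)
	}
	if verifyErr == nil {
		return errors.New("verify: fault had no observable effect")
	}
	if err := p.Check(); err != nil { // 5. Recovery: service works again
		return fmt.Errorf("recovery: %w", err)
	}
	return nil
}

func main() {
	down := false // simulated dependency state
	p := Phases{
		Check: func() error {
			if down {
				return ErrUnavailable
			}
			return nil
		},
		Inject:  func() error { down = true; return nil },
		Restore: func() error { down = false; return nil },
	}
	if !ShouldRunChaos("1", false) {
		fmt.Println("chaos tests skipped")
		return
	}
	fmt.Println("five-phase result:", RunFivePhases(p))
}
```

Restoring before inspecting the Verify result keeps a failed assertion from leaving the fault active for later tests.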

Step 3: Evaluate Results

Parse agent output:

if "Status: PASS" in output:
  → Gate 7 PASSED
  → Return success with metrics

if "Status: FAIL" in output:
  → Dispatch fix to implementation agent
  → Re-run chaos tests (max 3 iterations)
  → If still failing: ESCALATE to user
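The evaluation loop can be sketched as below. Two assumptions are made here and are not stated by the skill: "max 3 iterations" is read as three chaos runs in total, and the status line is parsed tolerantly so that the markdown form `**Status:** PASS` from the Step 4 template matches as well as plain `Status: PASS`. The names `ParseStatus`, `Evaluate`, and `Outcome` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// MaxIterations is the retry budget; assumed to count total chaos runs.
const MaxIterations = 3

// statusLine matches "Status: PASS" with or without markdown bold markers.
var statusLine = regexp.MustCompile(`(?m)^\**Status:\**\s*(PASS|FAIL)\b`)

// ParseStatus returns "PASS" or "FAIL", or "" when no status line is found.
func ParseStatus(output string) string {
	m := statusLine.FindStringSubmatch(output)
	if m == nil {
		return ""
	}
	return m[1]
}

// Outcome summarizes the gate result.
type Outcome struct {
	Passed     bool
	Iterations int
	Escalate   bool
}

// Evaluate runs the chaos agent up to MaxIterations times, dispatching a
// fix between failed attempts. A missing status counts as a failure.
func Evaluate(runChaos func() string, dispatchFix func()) Outcome {
	for i := 1; i <= MaxIterations; i++ {
		if ParseStatus(runChaos()) == "PASS" {
			return Outcome{Passed: true, Iterations: i}
		}
		if i < MaxIterations {
			dispatchFix()
		}
	}
	return Outcome{Iterations: MaxIterations, Escalate: true}
}

func main() {
	attempt := 0
	run := func() string {
		attempt++
		if attempt < 2 {
			return "**Status:** FAIL"
		}
		return "**Status:** PASS"
	}
	fmt.Printf("%+v\n", Evaluate(run, func() {}))
}
```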

Step 4: Generate Output

## Chaos Testing Summary
**Status:** {PASS|FAIL}
**Dependencies Tested:** {count}
**Scenarios Tested:** {count}
**Recovery Verified:** {Yes|No}

## Failure Scenarios
| Component | Scenario | Status | Recovery |
|-----------|----------|--------|----------|
| {component} | {scenario} | {PASS|FAIL} | {Yes|No} |

## Handoff to Next Gate
- Ready for Gate 8 (Code Review): {YES|NO}
- Iterations: {count}
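A report in this shape can be checked against the three required section patterns from output_schema. The following is a minimal sketch; the `requiredSections` table and `MissingSections` function are hypothetical helpers, and the patterns are compiled in multiline mode so that `^` matches at the start of any line.

```go
package main

import (
	"fmt"
	"regexp"
)

// requiredSections pairs each section name with its output_schema pattern.
// (?m) makes ^ match at the start of every line, not only the first.
var requiredSections = []struct {
	Name    string
	Pattern *regexp.Regexp
}{
	{"Chaos Testing Summary", regexp.MustCompile(`(?m)^## Chaos Testing Summary`)},
	{"Failure Scenarios", regexp.MustCompile(`(?m)^## Failure Scenarios`)},
	{"Handoff to Next Gate", regexp.MustCompile(`(?m)^## Handoff to Next Gate`)},
}

// MissingSections returns the names of required sections that do not
// appear in the report, in schema order.
func MissingSections(report string) []string {
	var missing []string
	for _, s := range requiredSections {
		if !s.Pattern.MatchString(report) {
			missing = append(missing, s.Name)
		}
	}
	return missing
}

func main() {
	report := "## Chaos Testing Summary\n**Status:** PASS\n\n" +
		"## Failure Scenarios\n| Component | Scenario | Status | Recovery |\n"
	fmt.Println("missing sections:", MissingSections(report))
	// prints: missing sections: [Handoff to Next Gate]
}
```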

Failure Scenarios by Dependency

| Dependency | Required Scenarios |
|------------|--------------------|
| PostgreSQL | Connection Loss, High Latency, Network Partition |
| MongoDB | Connection Loss, High Latency, Network Partition |
| Valkey | Connection Loss, High Latency, Timeout |
| Redis | Connection Loss, High Latency, Timeout |
| RabbitMQ | Connection Loss, Network Partition, Slow Consumer |
| HTTP APIs | Timeout, 5xx Errors, Connection Refused |
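The table above lends itself to a coverage check: for each declared dependency, list the required scenarios that have no test yet. This sketch assumes lowercase dependency keys matching the external_dependencies values (with "http" standing in for HTTP APIs); `requiredScenarios` and `MissingScenarios` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredScenarios encodes the table of required scenarios per dependency.
// The lowercase keys are an assumption of this sketch.
var requiredScenarios = map[string][]string{
	"postgres": {"Connection Loss", "High Latency", "Network Partition"},
	"mongodb":  {"Connection Loss", "High Latency", "Network Partition"},
	"valkey":   {"Connection Loss", "High Latency", "Timeout"},
	"redis":    {"Connection Loss", "High Latency", "Timeout"},
	"rabbitmq": {"Connection Loss", "Network Partition", "Slow Consumer"},
	"http":     {"Timeout", "5xx Errors", "Connection Refused"},
}

// MissingScenarios returns sorted "dependency: scenario" entries for every
// required scenario that is absent from tested. Dependencies not present
// in requiredScenarios contribute nothing.
func MissingScenarios(deps []string, tested map[string][]string) []string {
	var missing []string
	for _, dep := range deps {
		done := map[string]bool{}
		for _, s := range tested[dep] {
			done[s] = true
		}
		for _, s := range requiredScenarios[dep] {
			if !done[s] {
				missing = append(missing, dep+": "+s)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	tested := map[string][]string{
		"postgres": {"Connection Loss", "High Latency", "Network Partition"},
		"redis":    {"Connection Loss"},
	}
	for _, m := range MissingScenarios([]string{"postgres", "redis"}, tested) {
		fmt.Println("missing:", m)
	}
}
```

A non-empty result corresponds to the "Any dependency without failure test = FAIL" block condition.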

Severity Calibration

| Severity | Criteria | Examples |
|----------|----------|----------|
| CRITICAL | System crashes on failure, data loss | Panic on connection loss, corrupted state on partition |
| HIGH | No recovery, missing dependency tests | System doesn't recover after failure, untested dependency |
| MEDIUM | Partial recovery, missing scenarios | Recovery takes too long, missing latency test |
| LOW | Cleanup issues, documentation | Test artifacts not cleaned, missing chaos docs |

Report all severities. CRITICAL = immediate fix (production risk). HIGH = fix before gate pass. MEDIUM = fix in iteration. LOW = document.


Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Infrastructure is reliable" | AWS, GCP, Azure all have outages. Your code must handle them. | Write chaos tests |
| "Integration tests cover failures" | Integration tests verify happy path. Chaos tests verify failure handling. | Write chaos tests |
| "Toxiproxy is complex" | One container. 20 minutes setup. Prevents production incidents. | Write chaos tests |
| "We have monitoring" | Monitoring detects problems. Chaos testing prevents them. | Write chaos tests |
| "Circuit breakers handle it" | Circuit breakers need testing too. Chaos tests verify they work. | Write chaos tests |