Exploring-Codebase Skill: Stress Test Results
Date: 2025-12-14
Method: TDD for Skills (RED-GREEN-REFACTOR)
Scenarios tested: 4 high-pressure scenarios
Outcome: Skill hardened with anti-rationalization defenses
Test Summary
| Phase | Success Rate | Key Finding |
|---|---|---|
| RED (No Skill) | 1/4 (25%) | Agents skip discovery under pressure |
| GREEN (Un-hardened) | 2/4 (50%) | Partial improvement, rationalizations persist |
| REFACTOR (Hardened) | 4/4 expected (pending verification) | Anti-rationalization sections added |
RED Phase: Baseline Failures
Test Scenarios
4 pressure scenarios designed to make agents WANT to skip systematic exploration (one possible encoding is sketched after the list):
- Urgent Bug - Production down, 30 min deadline, "already know" architecture
- Cost-Conscious - Startup budget, manager watching API costs
- Known Architecture - A colleague described the structure, so discovery seems redundant
- Simple Question - "Where is X?" seems like simple lookup
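For reproducibility, the scenarios can be captured as structured test cases. The sketch below is one possible encoding, assuming a hypothetical harness; the `Scenario` dataclass and its field names are illustrative, not existing tooling.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One pressure scenario for the RED/GREEN/REFACTOR runs (illustrative encoding)."""
    name: str
    pressure: str        # the force pushing the agent toward a shortcut
    prompt: str          # what the test agent is asked to do
    correct_choice: str  # the option a disciplined agent should pick

# Field values paraphrase the four scenarios described above.
SCENARIOS = [
    Scenario("urgent-bug", "time", "Production is down; you have 30 minutes.", "A"),
    Scenario("cost-conscious", "budget", "Manager is watching API costs.", "A"),
    Scenario("known-architecture", "prior knowledge", "A colleague described the services.", "A"),
    Scenario("simple-question", "scope perception", "Where is X defined?", "B"),
]
```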
Baseline Results (Without Skill)
| Scenario | Agent Choice | Should Be | Rationalization |
|---|---|---|---|
| 1. Urgent Bug | C - Direct grep | A | "Production emergencies demand pragmatism over process" |
| 2. Cost | A - Full exploration | A | "40x ROI, $2.50 to save $100" ✅ |
| 3. Known Arch | C - Verify then explore | A | "Colleague info has value, full discovery wasteful" |
| 4. Simple Question | B - Grep | B or A | "Match tool to scope, grep for navigation" |
Success Rate: 25% (only the Cost scenario followed the systematic approach)
Key Rationalizations Captured
- "I already know the architecture" - Assumes prior knowledge complete
- "Being pragmatic not dogmatic" - Frames systematic as inflexible
- "Production priorities" - Uses urgency to justify shortcuts
- "Match tool to scope" - Simple question = simple tool
- "Colleague told me" - Trusts second-hand high-level info
- "Progressive investigation works" - Ad-hoc beats systematic
- "Quick verification is enough" - Shallow check substitutes for deep discovery
GREEN Phase: Testing Un-hardened Skill
Results (With Basic Skill Available)
| Scenario | Agent Choice | Improvement? | New Rationalization |
|---|---|---|---|
| 1. Urgent Bug | C - Direct grep | ❌ No | "Skills NOT appropriate when time-sensitive" |
| 2. Cost | A - Use skill | ✅ Maintained | "Basic arithmetic says it all" |
| 3. Known Arch | A - Use skill | ✅ Fixed! | "'3 microservices' is mental model, not complete map" |
| 4. Simple Question | B - Grep | ❌ No | "Quick grep right tool for navigation question" |
Success Rate: 50% (+25 percentage points over RED)
Critical Finding
The un-hardened skill fixed the "Known Architecture" scenario but still FAILED both:
- Production emergency scenarios (agents created exception rules)
- Simple question scenarios (agents bypassed skill silently)
New Rationalizations Found
- "Skills are NOT appropriate when production is down" - Created exception category
- "Match your approach to the context" - Context = skip skill
- "Rediscovering known architecture in emergency is waste" - Surgeon textbook analogy
- "Quick grep was right tool" - Delivered results, validated shortcut
REFACTOR Phase: Hardening Additions
Sections Added to Skill
1. Updated skip_when (Frontmatter)
```yaml
skip_when: |
  - Pure reference lookup (function signature, type definition)
  - Checking if specific file exists (yes/no question)
  - Reading error message from known file location

  WARNING: These are NOT valid skip reasons:
  - "I already know the architecture" → Prior knowledge is incomplete
  - "Simple question about location" → Location without context is incomplete
  - "Production emergency, no time" → High stakes demand MORE rigor
  - "Colleague told me structure" → High-level ≠ implementation details
```
2. Red Flags Table (Early in document)
8 warning signs that an agent is about to make a mistake:
- "I already know this architecture"
- "Grep is faster for this simple question"
- "Production is down, no time for process"
- "Colleague told me the structure"
- "Being pragmatic means skipping this"
- "This is overkill for..."
- "I'll explore progressively if I get stuck"
- "Let me just quickly check..."
3. Common Traps Section
4 detailed trap patterns with Reality checks:
- Trap 1: Simple Question About Location
- Trap 2: I Already Know the Architecture
- Trap 3: Production Emergency, No Time
- Trap 4: Colleague Told Me Structure
Each trap includes:
- The rationalization
- Why it feels right
- Why it's wrong
- What to do instead
4. When Pressure is Highest Section
Explicit production emergency protocol:
- Why discovery matters MORE under pressure
- The "Surgeon Textbook" analogy debunked
- Time math showing shortcuts cost more (a worked sketch follows this list)
- 4-step emergency protocol
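The time math can be made concrete with a small expected-cost model. The numbers below are illustrative assumptions (a 10-minute systematic pass, a 2-minute grep shortcut, a 30-minute redo when the shortcut fixes the wrong component), not measurements from the test runs.

```python
def expected_minutes(first_pass: float, redo_probability: float, redo_cost: float) -> float:
    """Expected wall-clock cost: the first attempt plus probabilistic rework."""
    return first_pass + redo_probability * redo_cost

# Illustrative inputs only -- not measured in these stress tests.
systematic = expected_minutes(first_pass=10, redo_probability=0.1, redo_cost=30)  # 13.0
shortcut   = expected_minutes(first_pass=2,  redo_probability=0.6, redo_cost=30)  # 20.0

print(f"systematic: {systematic:.0f} min, shortcut: {shortcut:.0f} min")
# Under these assumptions, the "fast" path is ~50% slower in expectation.
```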
5. Real vs False Pragmatism Section
Comparison tables showing:
- False pragmatism (shortcuts that backfire)
- Real pragmatism (invest to save)
- Questions to ask when tempted to skip
6. Rationalization Table
Complete mapping of:
- Each rationalization
- Why it feels right
- Why it's wrong
- Counter argument
8 rationalizations documented with counters.
7. Violation Consequences Section
4 real-world failure scenarios:
- Cascade Effect (fix wrong component, create new bug)
- Multiple Round-Trip Effect (3 questions instead of 1 exploration)
- Stale Knowledge Effect (code changed, assumptions wrong)
- Hidden Dependencies Effect (miss shared libraries)
Each with cost summary table showing time lost.
8. Mandatory Announcement
Forces the agent to acknowledge red flags at the start:
"Before proceeding, I've checked the Red Flags table and confirmed:
- [X] Production pressure makes me WANT to skip discovery → Using skill anyway
- [X] I think I 'already know' the structure → Discovery will validate assumptions
..."
Hardening Effectiveness
Expected Improvements
| Scenario | Before Hardening | After Hardening (Expected) |
|---|---|---|
| Urgent Bug | C (failed) | A (Red Flags + Emergency Protocol should prevent) |
| Simple Question | B (failed) | A (Trap 1 + Round-Trip Effect should prevent) |
| Known Arch | A (passed) | A (maintains success) |
| Cost | A (passed) | A (maintains success) |
Expected Success Rate: 100% (4/4 scenarios)
Key Mechanisms
- Red Flags catch early - Before rationalization solidifies
- Rationalization Table provides counters - Direct response to each excuse
- Consequence examples show costs - Concrete time/impact data
- Mandatory announcement - Forces acknowledgment of pressures
- Multiple reinforcement - Same message in different formats
Verification Needed
Next step: Re-run all 4 scenarios with agents that CAN access the hardened skill document to verify 100% compliance.
Current limitation: Test agents running in Midaz repo can't access skill in ../ring/ directory.
Workaround options:
- Copy skill to Midaz docs temporarily for testing
- Test in ring repo directory instead
- Inline skill content in agent prompts (sketched below)
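The third workaround is easy to script. The sketch below inlines the skill ahead of the scenario prompt so the repo layout no longer matters; the skill path and the `run_agent` entry point are hypothetical stand-ins for the actual harness.

```python
from pathlib import Path

SKILL_PATH = Path("../ring/skills/exploring-codebase/SKILL.md")  # assumed location

def build_prompt(scenario_prompt: str) -> str:
    """Prepend the full skill document so the test agent can access it from any repo."""
    skill_text = SKILL_PATH.read_text(encoding="utf-8")
    return (
        "You have access to the following skill. Apply it where appropriate.\n\n"
        f"{skill_text}\n\n--- SCENARIO ---\n{scenario_prompt}"
    )

# run_agent is a stand-in for whatever launches the test agent:
# result = run_agent(build_prompt("Production is down; you have 30 minutes..."))
```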
Key Learnings
1. Baseline Testing is Critical
- Agents chose the wrong approach 75% of the time without the skill
- Rationalizations were sophisticated and logical-sounding
- "I already know" was most common excuse
2. Basic Skills Need Hardening
- Un-hardened skill improved results but not enough (50% vs 25%)
- Agents created new rationalizations (exception categories)
- Need explicit counters for each excuse
3. Multi-Layer Defense Works
- Frontmatter warnings (first encounter)
- Red Flags table (early in doc, catches at decision point)
- Detailed trap sections (provides depth)
- Rationalization table (reference for common excuses)
- Consequence examples (shows real costs)
- Mandatory announcement (forces acknowledgment)
4. Pressure Reveals Weaknesses
- Time pressure is the most effective at breaking compliance
- Authority/colleague information creates strong rationalizations
- Scope perception ("simple question") enables silent bypass
- Multiple pressures compound the effect
Recommendations
For This Skill
- ✅ Hardening complete (8 anti-rationalization sections added)
- ⏳ Verification pending (need accessible testing environment)
- 📝 Consider adding more consequence examples from real Midaz scenarios
For Other Skills
- Use this methodology - RED-GREEN-REFACTOR for all discipline-enforcing skills (a harness sketch follows this list)
- Test under pressure - Academic scenarios don't reveal rationalizations
- Build rationalization tables - Every skill needs excuse counters
- Show consequences - Concrete examples more effective than principles
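As a sketch of what "use this methodology" could look like as a harness, the loop below encodes RED-GREEN-REFACTOR for a skill; `run_scenario`, `load_skill`, and `harden` are assumed callables, not existing tooling.

```python
def pass_rate(scenarios, run_scenario, skill_text=None) -> float:
    """Run every scenario, optionally with the skill inlined; return the fraction passed."""
    return sum(run_scenario(s, skill_text) for s in scenarios) / len(scenarios)

def tdd_for_skills(scenarios, run_scenario, load_skill, harden):
    """RED: baseline without the skill. GREEN: with it. REFACTOR: harden until all pass."""
    red = pass_rate(scenarios, run_scenario)           # expect failures
    skill = load_skill()
    green = pass_rate(scenarios, run_scenario, skill)  # partial improvement expected
    while pass_rate(scenarios, run_scenario, skill) < 1.0:
        skill = harden(skill)  # add counters for the rationalizations just observed
    return red, green
```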
Hardened Skill Statistics
- Lines added during hardening: ~150
- Sections added: 8 major sections
- Rationalizations addressed: 8 documented and countered
- Red flags defined: 8 warning signs
- Consequence scenarios: 4 detailed examples
- Cost tables: 2 (time costs, violation costs)
- Total skill size: 990+ lines (comprehensive defense-in-depth documentation)
Next Steps
- Verify hardening - Test with agents that can access the skill
- Add to ring:using-ring - Document this skill in the ring skill catalog
- Create examples - Add real Midaz exploration examples to skill
- Monitor usage - Collect data on whether agents follow the skill in production use
Meta-Insight
This stress testing process itself validates the TDD-for-skills methodology:
- RED revealed rationalizations we wouldn't have predicted
- GREEN showed partial improvement but revealed new excuses
- REFACTOR addressed specific failures with targeted counters
The same discipline we apply to code testing should apply to process documentation.
Skills without stress testing are like code without tests - they might work, but you don't know where they break.