# Exploring-Codebase Skill: Stress Test Results

**Date:** 2025-12-14
**Method:** TDD for Skills (RED-GREEN-REFACTOR)
**Scenarios Tested:** 4 high-pressure scenarios
**Outcome:** Skill hardened with anti-rationalization defenses

---

## Test Summary

| Phase | Success Rate | Key Finding |
|-------|--------------|-------------|
| **RED (No Skill)** | 1/4 (25%) | Agents skip discovery under pressure |
| **GREEN (Un-hardened)** | 2/4 (50%) | Partial improvement, rationalizations persist |
| **REFACTOR (Hardened)** | Improved reasoning | Hardening sections added |

---

## RED Phase: Baseline Failures

### Test Scenarios

4 pressure scenarios designed to make agents WANT to skip systematic exploration:

1. **Urgent Bug** - Production down, 30 min deadline, "already know" architecture
2. **Cost-Conscious** - Startup budget, manager watching API costs
3. **Known Architecture** - Colleague told structure, discovery seems redundant
4. **Simple Question** - "Where is X?" seems like simple lookup

### Baseline Results (Without Skill)

| Scenario | Agent Choice | Should Be | Rationalization |
|----------|--------------|-----------|-----------------|
| 1. Urgent Bug | **C - Direct grep** | A | "Production emergencies demand pragmatism over process" |
| 2. Cost | **A - Full exploration** | A | "40x ROI, $2.50 to save $100" ✅ |
| 3. Known Arch | **C - Verify then explore** | A | "Colleague info has value, full discovery wasteful" |
| 4. Simple Question | **B - Grep** | B or A | "Match tool to scope, grep for navigation" |

**Success Rate: 25%** (only cost scenario followed systematic approach)

### Key Rationalizations Captured

1. **"I already know the architecture"** - Assumes prior knowledge complete
2. **"Being pragmatic not dogmatic"** - Frames systematic as inflexible
3. **"Production priorities"** - Uses urgency to justify shortcuts
4. **"Match tool to scope"** - Simple question = simple tool
5. **"Colleague told me"** - Trusts second-hand high-level info
6. **"Progressive investigation works"** - Ad-hoc beats systematic
7. **"Quick verification is enough"** - Shallow check substitutes for deep discovery

---

## GREEN Phase: Testing Un-hardened Skill

### Results (With Basic Skill Available)

| Scenario | Agent Choice | Improvement? | New Rationalization |
|----------|--------------|--------------|---------------------|
| 1. Urgent Bug | **C - Direct grep** | ❌ No | "Skills NOT appropriate when time-sensitive" |
| 2. Cost | **A - Use skill** | ✅ Maintained | "Basic arithmetic says it all" |
| 3. Known Arch | **A - Use skill** | ✅ Fixed! | "'3 microservices' is mental model, not complete map" |
| 4. Simple Question | **B - Grep** | ❌ No | "Quick grep right tool for navigation question" |

**Success Rate: 50%** (+25% improvement from RED)

### Critical Finding

**Un-hardened skill helped with "Known Architecture" scenario but FAILED both:**
- Production emergency scenarios (agents created exception rules)
- Simple question scenarios (agents bypassed skill silently)

### New Rationalizations Found

1. **"Skills are NOT appropriate when production is down"** - Created exception category
2. **"Match your approach to the context"** - Context = skip skill
3. **"Rediscovering known architecture in emergency is waste"** - Surgeon textbook analogy
4. **"Quick grep was right tool"** - Delivered results, validated shortcut

---

## REFACTOR Phase: Hardening Additions

### Sections Added to Skill

#### 1. **Updated skip_when (Frontmatter)**
```yaml
skip_when: |
  - Pure reference lookup (function signature, type definition)
  - Checking if specific file exists (yes/no question)
  - Reading error message from known file location

  WARNING: These are NOT valid skip reasons:
  - "I already know the architecture" → Prior knowledge is incomplete
  - "Simple question about location" → Location without context is incomplete
  - "Production emergency, no time" → High stakes demand MORE rigor
  - "Colleague told me structure" → High-level ≠ implementation details
```

#### 2. **Red Flags Table** (Early in document)
8 warning signs that agent is about to make a mistake:
- "I already know this architecture"
- "Grep is faster for this simple question"
- "Production is down, no time for process"
- "Colleague told me the structure"
- "Being pragmatic means skipping this"
- "This is overkill for..."
- "I'll explore progressively if I get stuck"
- "Let me just quickly check..."

#### 3. **Common Traps Section**
4 detailed trap patterns with Reality checks:
- Trap 1: Simple Question About Location
- Trap 2: I Already Know the Architecture
- Trap 3: Production Emergency, No Time
- Trap 4: Colleague Told Me Structure

Each trap includes:
- The rationalization
- Why it feels right
- Why it's wrong
- What to do instead

#### 4. **When Pressure is Highest** Section
Explicit production emergency protocol:
- Why discovery matters MORE under pressure
- The "Surgeon Textbook" analogy debunked
- Time math showing shortcuts cost more
- 4-step emergency protocol

#### 5. **Real vs False Pragmatism** Section
Comparison tables showing:
- False pragmatism (shortcuts that backfire)
- Real pragmatism (invest to save)
- Questions to ask when tempted to skip

#### 6. **Rationalization Table**
Complete mapping of:
- Each rationalization
- Why it feels right
- Why it's wrong
- Counter argument

8 rationalizations documented with counters.

#### 7. **Violation Consequences** Section
4 real-world failure scenarios:
- Cascade Effect (fix wrong component, create new bug)
- Multiple Round-Trip Effect (3 questions instead of 1 exploration)
- Stale Knowledge Effect (code changed, assumptions wrong)
- Hidden Dependencies Effect (miss shared libraries)

Each with cost summary table showing time lost.

#### 8. **Mandatory Announcement**
Forces agent to acknowledge red flags at start:
```
"Before proceeding, I've checked the Red Flags table and confirmed:
- [X] Production pressure makes me WANT to skip discovery → Using skill anyway
- [X] I think I 'already know' the structure → Discovery will validate assumptions
..."
```

---

## Hardening Effectiveness

### Expected Improvements

| Scenario | Before Hardening | After Hardening (Expected) |
|----------|------------------|----------------------------|
| Urgent Bug | C (failed) | A (Red Flags + Emergency Protocol should prevent) |
| Simple Question | B (failed) | A (Trap 1 + Round-Trip Effect should prevent) |
| Known Arch | A (passed) | A (maintains success) |
| Cost | A (passed) | A (maintains success) |

**Expected Success Rate: 100%** (4/4 scenarios)

### Key Mechanisms

1. **Red Flags catch early** - Before rationalization solidifies
2. **Rationalization Table provides counters** - Direct response to each excuse
3. **Consequence examples show costs** - Concrete time/impact data
4. **Mandatory announcement** - Forces acknowledgment of pressures
5. **Multiple reinforcement** - Same message in different formats

---

## Verification Needed

**Next step:** Re-run all 4 scenarios with agents that CAN access the hardened skill document to verify 100% compliance.

**Current limitation:** Test agents running in Midaz repo can't access skill in `../ring/` directory.

**Workaround options:**
1. Copy skill to Midaz docs temporarily for testing
2. Test in ring repo directory instead
3. Inline skill content in agent prompts

---

## Key Learnings

### 1. Baseline Testing is Critical
- Agents chose wrong approach 75% of the time without skill
- Rationalizations were sophisticated and logical-sounding
- "I already know" was most common excuse

### 2. Basic Skills Need Hardening
- Un-hardened skill improved results but not enough (50% vs 25%)
- Agents created new rationalizations (exception categories)
- Need explicit counters for each excuse

### 3. Multi-Layer Defense Works
- Frontmatter warnings (first encounter)
- Red Flags table (early in doc, catches at decision point)
- Detailed trap sections (provides depth)
- Rationalization table (reference for common excuses)
- Consequence examples (shows real costs)
- Mandatory announcement (forces acknowledgment)

### 4. Pressure Reveals Weaknesses
- Time pressure is most effective at breaking compliance
- Authority/colleague information creates strong rationalization
- Scope perception ("simple question") enables bypass
- Multiple pressures compound effect

---

## Recommendations

### For This Skill
1. ✅ Hardening complete (8 anti-rationalization sections added)
2. ⏳ Verification pending (need accessible testing environment)
3. 📝 Consider adding more consequence examples from real Midaz scenarios

### For Other Skills
1. **Use this methodology** - RED-GREEN-REFACTOR for all discipline-enforcing skills
2. **Test under pressure** - Academic scenarios don't reveal rationalizations
3. **Build rationalization tables** - Every skill needs excuse counters
4. **Show consequences** - Concrete examples more effective than principles

---

## Hardened Skill Statistics

**Lines added during hardening:** ~150 lines
**Sections added:** 8 major sections
**Rationalizations addressed:** 8 documented + countered
**Red flags defined:** 8 warning signs
**Consequence scenarios:** 4 detailed examples
**Cost tables:** 2 (time costs, violation costs)

**Total skill size:** 990+ lines (comprehensive defense-in-depth documentation)

---

## Next Steps

1. **Verify hardening** - Test with agents that can access the skill
2. **Add to ring:using-ring** - Document this skill in the ring skill catalog
3. **Create examples** - Add real Midaz exploration examples to skill
4. **Monitor usage** - Collect data on whether agents follow the skill in production use

---

## Meta-Insight

**This stress testing process itself validates the TDD-for-skills methodology:**

- RED revealed rationalizations we wouldn't have predicted
- GREEN showed partial improvement but revealed new excuses
- REFACTOR addressed specific failures with targeted counters

**The same discipline we apply to code testing should apply to process documentation.**

Skills without stress testing are like code without tests - they might work, but you don't know where they break.