mirror of https://github.com/LerianStudio/ring synced 2026-04-21 21:47:49 +00:00

refactor(skills): remove commands layer and rename skills to match command names

Skills are now the sole invocation mechanism — commands have been eliminated
as an architectural layer. All 33 commands removed, 22 skills renamed to
match the command names teams already knew, and 1 new skill (portfolio-review)
created to replace the last orchestrator command.

Renames: brainstorming→brainstorm, requesting-code-review→codereview,
git-commit→commit, session-handoff→create-handoff, drawing-diagrams→diagram,
executing-plans→execute-plan, exploring-codebase→explore-codebase,
interviewing-user→interview-me, linting-codebase→lint,
release-guide-info→release-guide, visual-explainer→visualize,
using-git-worktrees→worktree, writing-plans→write-plan,
dev-feedback-loop→dev-report, delivery-status-tracking→delivery-status,
delivery-reporting→delivery-report, executive-reporting→executive-summary,
dependency-mapping→dependency-analysis, documentation-review→review-docs,
writing-functional-docs→write-guide, writing-api-docs→write-api

Generated-by: Claude
AI-Model: claude-opus-4-6

2026-04-12 10:14:20 -03:00

9.9 KiB

Raw Blame History

Exploring-Codebase Skill: Stress Test Results

Date: 2025-12-14 Method: TDD for Skills (RED-GREEN-REFACTOR) Scenarios Tested: 4 high-pressure scenarios Outcome: Skill hardened with anti-rationalization defenses

Test Summary

Phase	Success Rate	Key Finding
RED (No Skill)	1/4 (25%)	Agents skip discovery under pressure
GREEN (Un-hardened)	2/4 (50%)	Partial improvement, rationalizations persist
REFACTOR (Hardened)	Improved reasoning	Hardening sections added

RED Phase: Baseline Failures

Test Scenarios

4 pressure scenarios designed to make agents WANT to skip systematic exploration:

Urgent Bug - Production down, 30 min deadline, "already know" architecture
Cost-Conscious - Startup budget, manager watching API costs
Known Architecture - Colleague told structure, discovery seems redundant
Simple Question - "Where is X?" seems like simple lookup

Baseline Results (Without Skill)

Scenario	Agent Choice	Should Be	Rationalization
1. Urgent Bug	C - Direct grep	A	"Production emergencies demand pragmatism over process"
2. Cost	A - Full exploration	A	"40x ROI, $2.50 to save $100" ✅
3. Known Arch	C - Verify then explore	A	"Colleague info has value, full discovery wasteful"
4. Simple Question	B - Grep	B or A	"Match tool to scope, grep for navigation"

Success Rate: 25% (only cost scenario followed systematic approach)

Key Rationalizations Captured

"I already know the architecture" - Assumes prior knowledge complete
"Being pragmatic not dogmatic" - Frames systematic as inflexible
"Production priorities" - Uses urgency to justify shortcuts
"Match tool to scope" - Simple question = simple tool
"Colleague told me" - Trusts second-hand high-level info
"Progressive investigation works" - Ad-hoc beats systematic
"Quick verification is enough" - Shallow check substitutes for deep discovery

GREEN Phase: Testing Un-hardened Skill

Results (With Basic Skill Available)

Scenario	Agent Choice	Improvement?	New Rationalization
1. Urgent Bug	C - Direct grep	❌ No	"Skills NOT appropriate when time-sensitive"
2. Cost	A - Use skill	✅ Maintained	"Basic arithmetic says it all"
3. Known Arch	A - Use skill	✅ Fixed!	"'3 microservices' is mental model, not complete map"
4. Simple Question	B - Grep	❌ No	"Quick grep right tool for navigation question"

Success Rate: 50% (+25% improvement from RED)

Critical Finding

Un-hardened skill helped with "Known Architecture" scenario but FAILED both:

Production emergency scenarios (agents created exception rules)
Simple question scenarios (agents bypassed skill silently)

New Rationalizations Found

"Skills are NOT appropriate when production is down" - Created exception category
"Match your approach to the context" - Context = skip skill
"Rediscovering known architecture in emergency is waste" - Surgeon textbook analogy
"Quick grep was right tool" - Delivered results, validated shortcut

REFACTOR Phase: Hardening Additions

Sections Added to Skill

1. Updated skip_when (Frontmatter)

skip_when: |
  - Pure reference lookup (function signature, type definition)
  - Checking if specific file exists (yes/no question)
  - Reading error message from known file location

  WARNING: These are NOT valid skip reasons:
  - "I already know the architecture" → Prior knowledge is incomplete
  - "Simple question about location" → Location without context is incomplete
  - "Production emergency, no time" → High stakes demand MORE rigor
  - "Colleague told me structure" → High-level ≠ implementation details

2. Red Flags Table (Early in document)

8 warning signs that agent is about to make a mistake:

"I already know this architecture"
"Grep is faster for this simple question"
"Production is down, no time for process"
"Colleague told me the structure"
"Being pragmatic means skipping this"
"This is overkill for..."
"I'll explore progressively if I get stuck"
"Let me just quickly check..."

3. Common Traps Section

4 detailed trap patterns with Reality checks:

Trap 1: Simple Question About Location
Trap 2: I Already Know the Architecture
Trap 3: Production Emergency, No Time
Trap 4: Colleague Told Me Structure

Each trap includes:

The rationalization
Why it feels right
Why it's wrong
What to do instead

4. When Pressure is Highest Section

Explicit production emergency protocol:

Why discovery matters MORE under pressure
The "Surgeon Textbook" analogy debunked
Time math showing shortcuts cost more
4-step emergency protocol

5. Real vs False Pragmatism Section

Comparison tables showing:

False pragmatism (shortcuts that backfire)
Real pragmatism (invest to save)
Questions to ask when tempted to skip

6. Rationalization Table

Complete mapping of:

Each rationalization
Why it feels right
Why it's wrong
Counter argument

8 rationalizations documented with counters.

7. Violation Consequences Section

4 real-world failure scenarios:

Cascade Effect (fix wrong component, create new bug)
Multiple Round-Trip Effect (3 questions instead of 1 exploration)
Stale Knowledge Effect (code changed, assumptions wrong)
Hidden Dependencies Effect (miss shared libraries)

Each with cost summary table showing time lost.

8. Mandatory Announcement

Forces agent to acknowledge red flags at start:

"Before proceeding, I've checked the Red Flags table and confirmed:
- [X] Production pressure makes me WANT to skip discovery → Using skill anyway
- [X] I think I 'already know' the structure → Discovery will validate assumptions
..."

Hardening Effectiveness

Expected Improvements

Scenario	Before Hardening	After Hardening (Expected)
Urgent Bug	C (failed)	A (Red Flags + Emergency Protocol should prevent)
Simple Question	B (failed)	A (Trap 1 + Round-Trip Effect should prevent)
Known Arch	A (passed)	A (maintains success)
Cost	A (passed)	A (maintains success)

Expected Success Rate: 100% (4/4 scenarios)

Key Mechanisms

Red Flags catch early - Before rationalization solidifies
Rationalization Table provides counters - Direct response to each excuse
Consequence examples show costs - Concrete time/impact data
Mandatory announcement - Forces acknowledgment of pressures
Multiple reinforcement - Same message in different formats

Verification Needed

Next step: Re-run all 4 scenarios with agents that CAN access the hardened skill document to verify 100% compliance.

Current limitation: Test agents running in Midaz repo can't access skill in ../ring/ directory.

Workaround options:

Copy skill to Midaz docs temporarily for testing
Test in ring repo directory instead
Inline skill content in agent prompts

Key Learnings

1. Baseline Testing is Critical

Agents chose wrong approach 75% of the time without skill
Rationalizations were sophisticated and logical-sounding
"I already know" was most common excuse

2. Basic Skills Need Hardening

Un-hardened skill improved results but not enough (50% vs 25%)
Agents created new rationalizations (exception categories)
Need explicit counters for each excuse

3. Multi-Layer Defense Works

Frontmatter warnings (first encounter)
Red Flags table (early in doc, catches at decision point)
Detailed trap sections (provides depth)
Rationalization table (reference for common excuses)
Consequence examples (shows real costs)
Mandatory announcement (forces acknowledgment)

4. Pressure Reveals Weaknesses

Time pressure is most effective at breaking compliance
Authority/colleague information creates strong rationalization
Scope perception ("simple question") enables bypass
Multiple pressures compound effect

Recommendations

For This Skill

✅ Hardening complete (8 anti-rationalization sections added)
⏳ Verification pending (need accessible testing environment)
📝 Consider adding more consequence examples from real Midaz scenarios

For Other Skills

Use this methodology - RED-GREEN-REFACTOR for all discipline-enforcing skills
Test under pressure - Academic scenarios don't reveal rationalizations
Build rationalization tables - Every skill needs excuse counters
Show consequences - Concrete examples more effective than principles

Hardened Skill Statistics

Lines added during hardening: ~150 lines Sections added: 8 major sections Rationalizations addressed: 8 documented + countered Red flags defined: 8 warning signs Consequence scenarios: 4 detailed examples Cost tables: 2 (time costs, violation costs)

Total skill size: 990+ lines (comprehensive defense-in-depth documentation)

Next Steps

Verify hardening - Test with agents that can access the skill
Add to ring:using-ring - Document this skill in the ring skill catalog
Create examples - Add real Midaz exploration examples to skill
Monitor usage - Collect data on whether agents follow the skill in production use

Meta-Insight

This stress testing process itself validates the TDD-for-skills methodology:

RED revealed rationalizations we wouldn't have predicted
GREEN showed partial improvement but revealed new excuses
REFACTOR addressed specific failures with targeted counters

The same discipline we apply to code testing should apply to process documentation.

Skills without stress testing are like code without tests - they might work, but you don't know where they break.

9.9 KiB Raw Blame History