Skill Quality Rubric
Used by /skill-test category [name|all] to evaluate skills beyond structural compliance.
Each category defines two to five metrics specific to the skill's job. Each metric is scored PASS, FAIL, or WARN:
- PASS: the skill's written instructions clearly satisfy the criterion.
- FAIL: the instructions are absent, ambiguous, or contradictory.
- WARN: the instructions partially address the criterion.
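The three outcomes can be modelled as a small enum with one record per metric. This is a sketch only: `Outcome`, `MetricResult`, and `summarize` are hypothetical names, and the notes in the sample data are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass(frozen=True)
class MetricResult:
    metric_id: str      # e.g. "G1"
    outcome: Outcome
    note: str = ""      # free-text justification (hypothetical field)


def summarize(results):
    """Count results per outcome; every outcome is listed, even at zero."""
    counts = Counter(r.outcome for r in results)
    return {o.value: counts.get(o, 0) for o in Outcome}


results = [
    MetricResult("G1", Outcome.PASS),
    MetricResult("G2", Outcome.WARN, "parallel spawn implied, not stated"),
    MetricResult("G5", Outcome.FAIL, "no 'May I write' confirmation"),
]
print(summarize(results))
```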
Skill Categories
gate
Skills: gate-check
Gate skills control phase transitions. They must enforce correctness without auto-advancing the stage and must respect the three review modes (full, lean, solo).
| Metric | PASS criteria |
|---|---|
| G1 — Review mode read | Skill reads production/session-state/review-mode.txt (or equivalent) before deciding which directors to spawn |
| G2 — Full mode: all 4 directors spawn | In full mode, the PHASE-GATE prompts of all 4 Tier-1 directors (CD, TD, PR, AD) are invoked in parallel |
| G3 — Lean mode: PHASE-GATE only | In lean mode, only *-PHASE-GATE gates run; inline gates (CD-PILLARS, TD-ARCHITECTURE, etc.) are skipped |
| G4 — Solo mode: no directors | In solo mode, no director gates spawn; each is noted as "skipped — Solo mode" |
| G5 — No auto-advance | Skill never writes production/stage.txt without explicit user confirmation via "May I write" |
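The mode rules in G2 to G4 can be sketched as a function that splits a gate list into gates to run and gates to skip. `plan_gates` is a hypothetical helper, not part of any skill. The four `*-PHASE-GATE` names are assumed from the director abbreviations above, and running inline gates in full mode is inferred from G3 rather than stated outright.

```python
PHASE_GATE_SUFFIX = "-PHASE-GATE"


def plan_gates(mode, gates):
    """Return (run, skipped) gate lists for a review mode."""
    mode = mode.strip().lower()
    if mode == "full":
        # Assumption: full mode runs phase gates and inline gates alike.
        return list(gates), []
    if mode == "lean":
        run = [g for g in gates if g.endswith(PHASE_GATE_SUFFIX)]
        skipped = [g for g in gates if not g.endswith(PHASE_GATE_SUFFIX)]
        return run, skipped
    if mode == "solo":
        # No director gates spawn; each one is reported as skipped.
        return [], list(gates)
    raise ValueError(f"unknown review mode: {mode!r}")


GATES = [
    "CD-PHASE-GATE", "TD-PHASE-GATE", "PR-PHASE-GATE", "AD-PHASE-GATE",
    "CD-PILLARS", "TD-ARCHITECTURE",
]

for mode in ("full", "lean", "solo"):
    run, skipped = plan_gates(mode, GATES)
    print(f"{mode}: run={run} skipped={skipped}")
```

Returning the skipped gates explicitly keeps the "noted as skipped" requirement checkable: the caller has a list to report instead of silently dropping gates.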
review
Skills: design-review, architecture-review, review-all-gdds
Review skills read documents and produce structured verdicts. They are primarily read-only and must not trigger director gates during the analysis phase.
| Metric | PASS criteria |
|---|---|
| R1 — Read-only enforcement | Skill does not modify the reviewed document without explicit user approval; any write operations (review logs, index updates) are gated behind "May I write" |
| R2 — 8-section check | Skill evaluates all 8 required GDD sections (or equivalent architectural sections) explicitly |
| R3 — Correct verdict vocabulary | Verdict is exactly one of: APPROVED / NEEDS REVISION / MAJOR REVISION NEEDED (design) or PASS / CONCERNS / FAIL (architecture) |
| R4 — No director gates during analysis | Skill does not spawn director gates during its analysis phases; post-analysis director review (as in architecture-review) is acceptable when the skill's scope and stakes warrant it |
| R5 — Structured findings | Output contains a per-section status table or checklist before the final verdict |
Exceptions:
- design-review: Has Write, Edit in allowed-tools to support an optional "Revise now" path (all writes gated behind user approval) and to write review logs. R1 is satisfied because the reviewed document is never silently modified.
- architecture-review: Spawns TD-ARCHITECTURE and LP-FEASIBILITY gates after its analysis is complete. This is intentional: architecture review is high-stakes and benefits from director sign-off. R4 is satisfied because the gates run post-analysis, not during it.
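R3 requires an exact match against a fixed vocabulary, which a checker could express as a plain membership test. A minimal sketch, assuming the hypothetical names `VERDICTS` and `is_valid_verdict`; the verdict strings themselves come from the R3 row.

```python
VERDICTS = {
    "design": ("APPROVED", "NEEDS REVISION", "MAJOR REVISION NEEDED"),
    "architecture": ("PASS", "CONCERNS", "FAIL"),
}


def is_valid_verdict(verdict, kind):
    """True only for an exact, case-sensitive match in the given vocabulary."""
    return verdict in VERDICTS[kind]


print(is_valid_verdict("NEEDS REVISION", "design"))
print(is_valid_verdict("approved", "design"))
print(is_valid_verdict("PASS", "design"))
```

The check deliberately does no case folding or trimming, since the rubric says "exactly one of".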
authoring
Skills: design-system, quick-design, architecture-decision, ux-design, ux-review, art-bible, create-architecture
Authoring skills create or update design documents collaboratively. Full GDD/UX authoring skills use a section-by-section cycle; lightweight authoring skills use a single-draft pattern appropriate to their smaller scope.
| Metric | PASS criteria |
|---|---|
| A1 — Section-by-section cycle | Full authoring skills (design-system, ux-design, art-bible) author one section at a time, presenting content for approval before proceeding to the next. Lightweight skills (quick-design, architecture-decision, create-architecture) may draft the complete document then ask for approval — single-draft is acceptable for documents under ~4 hours of implementation scope. |
| A2 — May-I-write per section | Full authoring skills ask "May I write this to [filepath]?" before each section write. Lightweight skills ask once for the complete document. |
| A3 — Retrofit mode | Skill detects if the target file already exists and offers to update specific sections rather than overwriting the whole document. Lightweight skills (quick-design) that always create new files are exempt. |
| A4 — Director gate at correct tier | If a director gate is defined for this skill (e.g., CD-GDD-ALIGN, TD-ADR), it runs at the correct mode threshold (full/lean) — NOT in solo |
| A5 — Skeleton-first | Full authoring skills create a file skeleton with all section headers before filling content, to preserve progress on session interruption. Lightweight skills are exempt. |
- Full authoring skills (must pass all 5 metrics): design-system, ux-design, art-bible
- Lightweight authoring skills (A1, A2, A5 use single-draft pattern; A3 exempt for new-file-only skills): quick-design, architecture-decision, create-architecture
- Review-mode skill (evaluated against review metrics): ux-review
readiness
Skills: story-readiness, story-done
Readiness skills validate stories before or after implementation. They must produce multi-dimensional verdicts and integrate correctly with director gate mode.
| Metric | PASS criteria |
|---|---|
| RD1 — Multi-dimensional check | Skill checks ≥3 independent dimensions (e.g., Design, Architecture, Scope, DoD) and reports each separately |
| RD2 — Three verdict levels | Verdict hierarchy is clearly defined: READY/COMPLETE > NEEDS WORK/COMPLETE WITH NOTES > BLOCKED |
| RD3 — BLOCKED requires external action | BLOCKED verdict is reserved for issues that cannot be fixed by the story author alone (e.g., Proposed ADR, unresolvable dependency) |
| RD4 — Director gate at correct mode | QL-STORY-READY or LP-CODE-REVIEW gate spawns in full mode, skips in lean/solo with a noted skip message |
| RD5 — Next-story handoff | After completion, skill surfaces the next READY story from the active sprint |
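RD1 and RD2 together suggest an ordered verdict type with one verdict per dimension. The sketch below uses the story-readiness labels and takes the worst dimension as the overall verdict. That "worst wins" rule is an assumption, not something the rubric states, and `Verdict` and `overall` are hypothetical names.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Ordered so that a higher value is a better verdict (RD2).
    BLOCKED = 0
    NEEDS_WORK = 1
    READY = 2


def overall(dimensions):
    """Overall verdict for a mapping of dimension name to Verdict.

    Assumption: the worst dimension decides the overall verdict.
    """
    if len(dimensions) < 3:
        raise ValueError("RD1 expects at least 3 independent dimensions")
    return min(dimensions.values())


report = {
    "Design": Verdict.READY,
    "Architecture": Verdict.NEEDS_WORK,
    "Scope": Verdict.READY,
    "DoD": Verdict.READY,
}
for name, verdict in report.items():
    print(f"{name}: {verdict.name}")   # each dimension reported separately
print("Overall:", overall(report).name)
```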
pipeline
Skills: create-epics, create-stories, dev-story, create-control-manifest, propagate-design-change, map-systems
Pipeline skills produce artifacts that other skills consume. They must write files with correct schema, respect layer/priority ordering, and gate before writing.
| Metric | PASS criteria |
|---|---|
| P1 — Correct output schema | Each produced file follows the project template (EPIC.md, story frontmatter, etc.); skill references the template path |
| P2 — Layer/priority ordering | Skills that produce epics or stories respect layer ordering (core → extended → meta) and priority fields |
| P3 — May-I-write before each artifact | Skill asks "May I write [artifact]?" before creating each output file, not batch-approving all files at once |
| P4 — Director gate at correct tier | In-scope gates (PR-EPIC, QL-STORY-READY, LP-CODE-REVIEW, etc.) run in full, skip in lean/solo with noted skip |
| P5 — Reads before writes | Skill reads the relevant GDD/ADR/manifest before producing artifacts to ensure alignment |
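P2's ordering can be expressed as a two-part sort key: layer first, priority second. This is a sketch under stated assumptions: the dictionary shape, the `order_epics` name, the sample epic names, and the convention that a lower `priority` number means more urgent are all hypothetical. Only the core → extended → meta sequence comes from the rubric.

```python
LAYER_ORDER = {"core": 0, "extended": 1, "meta": 2}


def order_epics(epics):
    """Sort by layer (core, extended, meta), then by ascending priority."""
    return sorted(epics, key=lambda e: (LAYER_ORDER[e["layer"]], e["priority"]))


epics = [
    {"name": "achievements", "layer": "meta", "priority": 1},
    {"name": "crafting", "layer": "extended", "priority": 2},
    {"name": "movement", "layer": "core", "priority": 2},
    {"name": "combat", "layer": "core", "priority": 1},
]
print([e["name"] for e in order_epics(epics)])
```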
analysis
Skills: consistency-check, balance-check, content-audit, code-review, tech-debt, scope-check, estimate, perf-profile, asset-audit, security-audit, test-evidence-review, test-flakiness
Analysis skills scan the project and surface findings. They are read-only during analysis and must ask before recommending any file writes.
| Metric | PASS criteria |
|---|---|
| AN1 — Read-only scan | Analysis phase uses only Read/Glob/Grep tools; no Write or Edit during the scan itself |
| AN2 — Structured findings table | Output includes a findings table or checklist (not prose only) with severity/priority per finding |
| AN3 — No auto-write | Any suggested file writes (e.g., tech-debt register, fix patches) are gated behind "May I write" |
| AN4 — No director gates during analysis | Analysis skills do not spawn director gates; they produce findings for human review |
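AN1 amounts to an allow-list check over the tools invoked during the scan. A minimal sketch, assuming a hypothetical `scan_violations` helper that receives the tool names in call order; the allow-list itself is taken from the AN1 row.

```python
READ_ONLY_TOOLS = frozenset({"Read", "Glob", "Grep"})


def scan_violations(tools_used):
    """Return the tools used during the scan that are not read-only."""
    return [tool for tool in tools_used if tool not in READ_ONLY_TOOLS]


print(scan_violations(["Glob", "Read", "Grep", "Read"]))
print(scan_violations(["Read", "Edit", "Grep", "Write"]))
```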
team
Skills: team-combat, team-narrative, team-audio, team-level, team-ui, team-qa, team-release, team-polish, team-live-ops
Team skills orchestrate multiple specialist agents for a department. They must spawn the right agents, run independent ones in parallel, and surface blocks immediately.
| Metric | PASS criteria |
|---|---|
| T1 — Named agent list | Skill explicitly names which agents it spawns and in what order |
| T2 — Parallel where independent | Agents whose inputs don't depend on each other are spawned in parallel (single message, multiple Task calls) |
| T3 — BLOCKED surfacing | If any spawned agent returns BLOCKED or fails, skill surfaces it immediately and halts dependent work — never silently skips |
| T4 — Collect all verdicts before proceeding | Dependent phases wait for all parallel agents to complete before proceeding |
| T5 — Usage error on no argument | If required argument (e.g., feature name) is missing, skill outputs usage hint and stops without spawning agents |
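T2 to T4 describe a fan-out/fan-in pattern: start independent agents together, wait for all of them, then surface any block before dependent work starts. The sketch below shows that shape with `asyncio`. It is illustrative only: `run_phase` and `fake_spawn` are hypothetical, the "DONE" verdict is invented, and real skills spawn agents through Task calls, not coroutines.

```python
import asyncio


async def run_phase(agents, spawn):
    """Run independent agents concurrently and collect every verdict (T2, T4)."""
    outcomes = await asyncio.gather(
        *(spawn(name) for name in agents), return_exceptions=True
    )
    results = dict(zip(agents, outcomes))
    # T3: a BLOCKED verdict or a failure is surfaced, never silently skipped.
    blocked = [
        name for name, outcome in results.items()
        if isinstance(outcome, Exception) or outcome == "BLOCKED"
    ]
    return results, blocked


async def fake_spawn(name):
    # Stand-in for a real agent call; one agent is hard-coded to block.
    await asyncio.sleep(0)
    return "BLOCKED" if name == "sound-designer" else "DONE"


agents = ["gameplay-programmer", "ai-programmer", "sound-designer"]
results, blocked = asyncio.run(run_phase(agents, fake_spawn))

if blocked:
    print("Halting dependent work; blocked agents:", blocked)
else:
    print("All agents finished:", results)
```

Passing `return_exceptions=True` means one failing agent does not hide the verdicts of the others, so the full picture is available when the block is reported.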
sprint
Skills: sprint-plan, sprint-status, milestone-review, retrospective, changelog, patch-notes
Sprint skills read production state and produce reports or planning artifacts. They have a PR-SPRINT or PR-MILESTONE gate at specific mode thresholds.
| Metric | PASS criteria |
|---|---|
| SP1 — Reads sprint/milestone state | Skill reads production/sprints/ or production/milestones/ before producing output |
| SP2 — Correct sprint gate | PR-SPRINT (for planning) or PR-MILESTONE (for milestone review) gate runs in full mode, skips in lean/solo |
| SP3 — Structured output | Output uses a consistent structure (velocity table, risk list, action items) rather than free prose |
| SP4 — No auto-commit | Skill never writes sprint files or milestone records without "May I write" |
utility
Skills: start, help, brainstorm, onboard, adopt, hotfix, prototype, localize, launch-checklist, release-checklist, smoke-check, soak-test, test-setup, test-helpers, regression-suite, qa-plan, bug-triage, bug-report, playtest-report, asset-spec, reverse-document, project-stage-detect, setup-engine, skill-test, skill-improve, day-one-patch, and any other skills not in categories above
Utility skills pass the 7 standard static checks. If they happen to spawn director gates, the gate mode logic must also be correct.
| Metric | PASS criteria |
|---|---|
| U1 — Passes all 7 static checks | /skill-test static [name] returns COMPLIANT with 0 FAILs |
| U2 — Gate mode correct (if applicable) | If the skill spawns any director gate, it reads review-mode and applies full/lean/solo logic correctly |
Agent Categories
Used to validate agent spec files in tests/agents/.
director
Agents: creative-director, technical-director, art-director, producer
| Metric | PASS criteria |
|---|---|
| D1 — Correct verdict vocabulary | Returns APPROVE / CONCERNS / REJECT (or domain equivalent: REALISTIC/CONCERNS/UNREALISTIC for producer) |
| D2 — Domain boundary respected | Does not make binding decisions outside its declared domain |
| D3 — Conflict escalation | When two departments conflict, escalates to correct parent (creative-director or technical-director) rather than unilaterally deciding |
| D4 — Opus model tier | Agent is assigned Opus model per coordination-rules.md |
lead
Agents: lead-programmer, qa-lead, narrative-director, audio-director, game-designer, systems-designer, level-designer
| Metric | PASS criteria |
|---|---|
| L1 — Domain verdict | Returns a domain-specific verdict (e.g., FEASIBLE/INFEASIBLE for lead-programmer, PASS/FAIL for qa-lead) |
| L2 — Escalates to shared parent | Out-of-domain conflicts escalate to creative-director (design) or technical-director (tech) |
| L3 — Sonnet model tier | Agent is assigned Sonnet model (default) per coordination-rules.md |
specialist
Agents: gameplay-programmer, ai-programmer, technical-artist, sound-designer, engine-programmer, tools-programmer, network-programmer, security-engineer, accessibility-specialist, ux-designer, ui-programmer, performance-analyst, prototyper, qa-tester, writer, world-builder
| Metric | PASS criteria |
|---|---|
| S1 — Stays in domain | Explicitly scopes itself to its declared domain; defers out-of-domain requests |
| S2 — No binding cross-domain decisions | Does not unilaterally decide matters owned by another specialist |
| S3 — Defers correctly | Out-of-domain requests are redirected to the correct agent, not refused silently |
engine
Agents: godot-specialist, godot-gdscript-specialist, godot-csharp-specialist, godot-shader-specialist, godot-gdextension-specialist, unity-specialist, unity-ui-specialist, unity-shader-specialist, unity-dots-specialist, unity-addressables-specialist, unreal-specialist, ue-blueprint-specialist, ue-gas-specialist, ue-umg-specialist, ue-replication-specialist
| Metric | PASS criteria |
|---|---|
| E1 — Version-aware | References engine version from docs/engine-reference/ before suggesting API calls; flags post-cutoff risk |
| E2 — File routing | Routes file types to the correct sub-specialist (e.g., .gdshader → godot-shader-specialist, not godot-gdscript-specialist) |
| E3 — Engine-specific patterns | Enforces engine-specific idioms (e.g., GDScript static typing, C# attribute exports, Blueprint function libraries) |
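E2's routing rule is essentially a lookup from file extension to sub-specialist. In the sketch below only the `.gdshader` mapping comes from the rubric. The `.gd` and `.cs` entries, the fallback to `godot-specialist`, and the names `GODOT_ROUTES` and `route` are assumptions for illustration.

```python
from pathlib import PurePosixPath

GODOT_ROUTES = {
    ".gdshader": "godot-shader-specialist",   # from E2
    ".gd": "godot-gdscript-specialist",       # assumed mapping
    ".cs": "godot-csharp-specialist",         # assumed mapping
}


def route(path, default="godot-specialist"):
    """Pick a sub-specialist from the file extension, else the default agent."""
    suffix = PurePosixPath(path).suffix.lower()
    return GODOT_ROUTES.get(suffix, default)


print(route("shaders/water.gdshader"))
print(route("scripts/player.gd"))
print(route("project.godot"))
```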
qa
Agents: qa-tester, qa-lead, security-engineer, accessibility-specialist
| Metric | PASS criteria |
|---|---|
| Q1 — Produces artifacts not code | Primary output is test cases, bug reports, or coverage gaps — not implementation code |
| Q2 — Evidence format | Test cases follow the project's test evidence format (unit/integration/visual/UI per coding-standards.md) |
| Q3 — No scope creep | Does not propose new features; flags gaps for humans to decide |
operations
Agents: devops-engineer, release-manager, live-ops-designer, community-manager, analytics-engineer, economy-designer, localization-lead
| Metric | PASS criteria |
|---|---|
| O1 — Domain ownership clear | Agent description clearly states what it owns (pipeline, releases, economy, etc.) |
| O2 — Defers implementation | Does not write game logic or engine code; delegates to appropriate specialist |
| O3 — Toolset matches role | allowed-tools in frontmatter matches the operational (not coding) nature of the role |