Claude-Code-Game-Studios/CCGS Skill Testing Framework/quality-rubric.md
Donchitos 3614e1dbfb Add v0.6.0: full skill/agent QA pass, 3 new agents tested, template cleanup
Skills fixed: sprint-status (stale escalation, threshold), retrospective
(existing file detection, missing data fallback), changelog (misc category,
task-ref count), patch-notes (BLOCKED on missing changelog, tone/template
paths), story-readiness (Phase 0 mode resolution, QL-STORY-READY gate),
art-bible, brainstorm, design-system, ux-design, dev-story, story-done,
create-architecture, create-control-manifest, map-systems, propagate-design-change,
quick-design, prototype, asset-spec.

Agents fixed: all 4 directors (gate verdict token format), engine-programmer,
ui-programmer, tools-programmer, technical-artist (engine version safety),
gameplay-programmer (ADR compliance), godot-gdextension-specialist (ABI warning),
systems-designer (escalation path to creative-director), accessibility-specialist
(model, tools, WCAG criterion format, findings template), live-ops-designer
(escalation paths, battle pass value language), qa-tester (model, test case
format, evidence routing, ambiguous criteria, regression scope).

Specs updated: smoke-check and adopt specs rewritten to match actual skill
behavior. catalog.yaml reset to blank template state. Removed
session-state marketing research file, removed session-state from gitignore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 17:28:46 +10:00


Skill Quality Rubric

Used by /skill-test category [name|all] to evaluate skills beyond structural compliance. Each category defines 4-5 binary PASS/FAIL metrics specific to the skill's job.

A metric is PASS when the skill's written instructions clearly satisfy the criterion. A metric is FAIL when the instructions are absent, ambiguous, or contradictory. A metric is WARN when the instructions partially address the criterion.


Skill Categories

gate

Skills: gate-check

Gate skills control phase transitions. They must enforce correctness without auto-advancing stage and must respect the three review modes.

| Metric | PASS criteria |
|---|---|
| G1 — Review mode read | Skill reads production/session-state/review-mode.txt (or equivalent) before deciding which directors to spawn |
| G2 — Full mode: all 4 directors spawn | In full mode, the PHASE-GATE prompts of all 4 Tier-1 directors (CD, TD, PR, AD) are invoked in parallel |
| G3 — Lean mode: PHASE-GATE only | In lean mode, only *-PHASE-GATE gates run; inline gates (CD-PILLARS, TD-ARCHITECTURE, etc.) are skipped |
| G4 — Solo mode: no directors | In solo mode, no director gates spawn; each is noted as "skipped — Solo mode" |
| G5 — No auto-advance | Skill never writes production/stage.txt without explicit user confirmation via "May I write" |
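The mode logic in G2-G4 can be sketched as a small lookup. This is a minimal illustration, not the skill's actual implementation: the function name `plan_gates` is hypothetical, the four `*-PHASE-GATE` names are derived from the director abbreviations above, the "skipped — Lean mode" note mirrors the Solo wording by assumption, and the sketch assumes that inline gates also run in full mode.

```python
# Hypothetical sketch of G2-G4: which director gates run in each review mode.
PHASE_GATES = ["CD-PHASE-GATE", "TD-PHASE-GATE", "PR-PHASE-GATE", "AD-PHASE-GATE"]
INLINE_GATES = ["CD-PILLARS", "TD-ARCHITECTURE"]  # the two examples named in G3


def plan_gates(mode):
    """Return the gates to run and a skip note for every gate that does not run."""
    mode = mode.strip().lower()  # tolerate whitespace from a text file such as review-mode.txt
    if mode == "full":
        # Assumption: full mode runs inline gates as well as phase gates.
        return {"run": PHASE_GATES + INLINE_GATES, "skipped": {}}
    if mode == "lean":
        # Assumption: the Lean skip note copies the Solo wording from G4.
        notes = {gate: "skipped — Lean mode" for gate in INLINE_GATES}
        return {"run": list(PHASE_GATES), "skipped": notes}
    if mode == "solo":
        notes = {gate: "skipped — Solo mode" for gate in PHASE_GATES + INLINE_GATES}
        return {"run": [], "skipped": notes}
    raise ValueError(f"unknown review mode: {mode!r}")
```

Raising on an unknown mode, rather than silently falling back to one of the three, is a design choice of this sketch; the rubric itself does not say what a skill should do in that case.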

review

Skills: design-review, architecture-review, review-all-gdds

Review skills read documents and produce structured verdicts. They are primarily read-only and must not trigger director gates during the analysis phase.

| Metric | PASS criteria |
|---|---|
| R1 — Read-only enforcement | Skill does not modify the reviewed document without explicit user approval; any write operations (review logs, index updates) are gated behind "May I write" |
| R2 — 8-section check | Skill evaluates all 8 required GDD sections (or equivalent architectural sections) explicitly |
| R3 — Correct verdict vocabulary | Verdict is exactly one of: APPROVED / NEEDS REVISION / MAJOR REVISION NEEDED (design) or PASS / CONCERNS / FAIL (architecture) |
| R4 — No director gates during analysis | Skill does not spawn director gates during its analysis phases; post-analysis director review (as in architecture-review) is acceptable when the skill's scope and stakes warrant it |
| R5 — Structured findings | Output contains a per-section status table or checklist before the final verdict |

Exceptions:

  • design-review: Has Write, Edit in allowed-tools to support an optional "Revise now" path (all writes gated behind user approval) and to write review logs. R1 is satisfied because the reviewed document is never silently modified.
  • architecture-review: Spawns TD-ARCHITECTURE and LP-FEASIBILITY gates after its analysis is complete. This is intentional — architecture review is high-stakes and benefits from director sign-off. R4 is satisfied because the gates run post-analysis, not during it.
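
As an illustration of R3, a checker could compare a skill's final verdict against the fixed vocabulary for its review type. The sketch below is an assumption about how such a check might look; `VERDICTS` and `is_valid_verdict` are hypothetical names, and only the verdict strings themselves come from the rubric.

```python
# Hypothetical sketch of R3: the verdict must be exactly one allowed token.
VERDICTS = {
    "design": {"APPROVED", "NEEDS REVISION", "MAJOR REVISION NEEDED"},
    "architecture": {"PASS", "CONCERNS", "FAIL"},
}


def is_valid_verdict(review_type, verdict):
    """True only for an exact, case-sensitive match within the review type's vocabulary."""
    return verdict in VERDICTS[review_type]
```

The match is deliberately strict: near misses such as "Approved" or "PASS with notes" are rejected, which is one reading of "exactly one of".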

authoring

Skills: design-system, quick-design, architecture-decision, ux-design, ux-review, art-bible, create-architecture

Authoring skills create or update design documents collaboratively. Full GDD/UX authoring skills use a section-by-section cycle; lightweight authoring skills use a single-draft pattern appropriate to their smaller scope.

| Metric | PASS criteria |
|---|---|
| A1 — Section-by-section cycle | Full authoring skills (design-system, ux-design, art-bible) author one section at a time, presenting content for approval before proceeding to the next. Lightweight skills (quick-design, architecture-decision, create-architecture) may draft the complete document then ask for approval — single-draft is acceptable for documents under ~4 hours of implementation scope. |
| A2 — May-I-write per section | Full authoring skills ask "May I write this to [filepath]?" before each section write. Lightweight skills ask once for the complete document. |
| A3 — Retrofit mode | Skill detects if the target file already exists and offers to update specific sections rather than overwriting the whole document. Lightweight skills (quick-design) that always create new files are exempt. |
| A4 — Director gate at correct tier | If a director gate is defined for this skill (e.g., CD-GDD-ALIGN, TD-ADR), it runs at the correct mode threshold (full/lean) — NOT in solo |
| A5 — Skeleton-first | Full authoring skills create a file skeleton with all section headers before filling content, to preserve progress on session interruption. Lightweight skills are exempt. |

  • Full authoring skills (must pass all 5 metrics): design-system, ux-design, art-bible
  • Lightweight authoring skills (A1, A2, A5 use single-draft pattern; A3 exempt for new-file-only skills): quick-design, architecture-decision, create-architecture
  • Review-mode skill (evaluated against review metrics): ux-review
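
The skeleton-first pattern in A5, combined with the section-by-section cycle in A1, can be sketched as two small text helpers. Everything here is illustrative: the function names, the `[TODO]` placeholder, the Markdown heading levels, and the section names used in the usage note are assumptions, not the project's template.

```python
# Hypothetical sketch of A5 (skeleton-first) and A1 (one section at a time).
PLACEHOLDER = "[TODO]"


def build_skeleton(title, sections):
    """Create the document text with every section header present but unfilled."""
    lines = [f"# {title}", ""]
    for name in sections:
        lines += [f"## {name}", "", PLACEHOLDER, ""]
    return "\n".join(lines)


def fill_section(doc, name, content):
    """Replace one section's placeholder, leaving all other sections untouched."""
    header = f"## {name}\n\n{PLACEHOLDER}"
    if header not in doc:
        raise KeyError(f"section {name!r} is missing or already filled")
    return doc.replace(header, f"## {name}\n\n{content}", 1)
```

For example, `build_skeleton("Combat", ["Overview", "Rules"])` followed by `fill_section(doc, "Overview", "...")` leaves "Rules" as a placeholder, so an interrupted session still has the full outline on disk once the skeleton has been written (after the user approves the write, per A2).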


readiness

Skills: story-readiness, story-done

Readiness skills validate stories before or after implementation. They must produce multi-dimensional verdicts and integrate correctly with director gate mode.

| Metric | PASS criteria |
|---|---|
| RD1 — Multi-dimensional check | Skill checks ≥3 independent dimensions (e.g., Design, Architecture, Scope, DoD) and reports each separately |
| RD2 — Three verdict levels | Verdict hierarchy is clearly defined: READY/COMPLETE > NEEDS WORK/COMPLETE WITH NOTES > BLOCKED |
| RD3 — BLOCKED requires external action | BLOCKED verdict is reserved for issues that cannot be fixed by the story author alone (e.g., Proposed ADR, unresolvable dependency) |
| RD4 — Director gate at correct mode | QL-STORY-READY or LP-CODE-REVIEW gate spawns in full mode, skips in lean/solo with a noted skip message |
| RD5 — Next-story handoff | After completion, skill surfaces the next READY story from the active sprint |
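
One way to combine RD1 and RD2 is to report each dimension separately and take the most severe level as the overall verdict. The rubric does not state how dimensions are combined, so the "worst dimension wins" rule below is an assumption, as are the names `SEVERITY` and `overall_verdict`. The sketch uses the story-readiness vocabulary; story-done would substitute COMPLETE and COMPLETE WITH NOTES.

```python
# Hypothetical sketch of RD1/RD2: per-dimension verdicts rolled up to one overall verdict.
SEVERITY = {"READY": 0, "NEEDS WORK": 1, "BLOCKED": 2}


def overall_verdict(dimensions):
    """dimensions maps a dimension name (e.g. "Design") to its verdict string."""
    if len(dimensions) < 3:
        raise ValueError("RD1 expects at least 3 independent dimensions")
    # Assumption: the most severe dimension decides the overall verdict.
    return max(dimensions.values(), key=SEVERITY.__getitem__)
```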

pipeline

Skills: create-epics, create-stories, dev-story, create-control-manifest, propagate-design-change, map-systems

Pipeline skills produce artifacts that other skills consume. They must write files with correct schema, respect layer/priority ordering, and gate before writing.

| Metric | PASS criteria |
|---|---|
| P1 — Correct output schema | Each produced file follows the project template (EPIC.md, story frontmatter, etc.); skill references the template path |
| P2 — Layer/priority ordering | Skills that produce epics or stories respect layer ordering (core → extended → meta) and priority fields |
| P3 — May-I-write before each artifact | Skill asks "May I write [artifact]?" before creating each output file, not batch-approving all files at once |
| P4 — Director gate at correct tier | In-scope gates (PR-EPIC, QL-STORY-READY, LP-CODE-REVIEW, etc.) run in full, skip in lean/solo with noted skip |
| P5 — Reads before writes | Skill reads the relevant GDD/ADR/manifest before producing artifacts to ensure alignment |
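
The layer ordering in P2 amounts to a two-key sort. The sketch below assumes each epic is a dict with `layer` and `priority` keys and that a lower priority number means "do first"; the rubric names the layer order but not the field names or the priority scale, so both are hypothetical.

```python
# Hypothetical sketch of P2: order epics by layer (core → extended → meta), then priority.
LAYER_ORDER = {"core": 0, "extended": 1, "meta": 2}


def order_epics(epics):
    """Return a new list sorted by layer first, then by ascending priority number."""
    return sorted(epics, key=lambda epic: (LAYER_ORDER[epic["layer"]], epic["priority"]))
```

Because `sorted` is stable, epics that share both layer and priority keep their original relative order.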

analysis

Skills: consistency-check, balance-check, content-audit, code-review, tech-debt, scope-check, estimate, perf-profile, asset-audit, security-audit, test-evidence-review, test-flakiness

Analysis skills scan the project and surface findings. They are read-only during analysis and must ask before recommending any file writes.

| Metric | PASS criteria |
|---|---|
| AN1 — Read-only scan | Analysis phase uses only Read/Glob/Grep tools; no Write or Edit during the scan itself |
| AN2 — Structured findings table | Output includes a findings table or checklist (not prose only) with severity/priority per finding |
| AN3 — No auto-write | Any suggested file writes (e.g., tech-debt register, fix patches) are gated behind "May I write" |
| AN4 — No director gates during analysis | Analysis skills do not spawn director gates; they produce findings for human review |

team

Skills: team-combat, team-narrative, team-audio, team-level, team-ui, team-qa, team-release, team-polish, team-live-ops

Team skills orchestrate multiple specialist agents for a department. They must spawn the right agents, run independent ones in parallel, and surface blocks immediately.

| Metric | PASS criteria |
|---|---|
| T1 — Named agent list | Skill explicitly names which agents it spawns and in what order |
| T2 — Parallel where independent | Agents whose inputs don't depend on each other are spawned in parallel (single message, multiple Task calls) |
| T3 — BLOCKED surfacing | If any spawned agent returns BLOCKED or fails, skill surfaces it immediately and halts dependent work — never silently skips |
| T4 — Collect all verdicts before proceeding | Dependent phases wait for all parallel agents to complete before proceeding |
| T5 — Usage error on no argument | If required argument (e.g., feature name) is missing, skill outputs usage hint and stops without spawning agents |
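
The orchestration rules in T2-T4 map onto a familiar concurrency pattern: run independent work together, wait for all of it, and stop before dependent work if anything is blocked. The sketch below uses Python's `asyncio` as an analogy only; real team skills spawn agents through Task calls, and the names `run_phase`, `run_team`, and the "FAILED", "HALTED", and "DONE" labels are hypothetical.

```python
# Hypothetical sketch of T2-T4 using asyncio as a stand-in for parallel Task calls.
import asyncio

HALTING = {"BLOCKED", "FAILED"}


async def run_phase(agents):
    """Run every agent in the phase concurrently and wait for all of them (T2, T4)."""
    names = list(agents)
    results = await asyncio.gather(
        *(agents[name]() for name in names), return_exceptions=True
    )
    # An agent that raised is recorded as FAILED instead of being dropped.
    verdicts = {
        name: "FAILED" if isinstance(result, Exception) else result
        for name, result in zip(names, results)
    }
    blocked = [name for name in names if verdicts[name] in HALTING]
    return verdicts, blocked


async def run_team(independent, dependent):
    """Run the independent phase, then the dependent phase only if nothing blocked (T3)."""
    verdicts, blocked = await run_phase(independent)
    if not blocked:
        later, blocked = await run_phase(dependent)
        verdicts.update(later)
    status = "HALTED" if blocked else "DONE"
    return {"status": status, "blocked": blocked, "verdicts": verdicts}
```

Each value in `independent` and `dependent` is expected to be a zero-argument coroutine function that returns a verdict string. The returned `blocked` list is what the skill would surface to the user; dependent agents never start when it is non-empty.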

sprint

Skills: sprint-plan, sprint-status, milestone-review, retrospective, changelog, patch-notes

Sprint skills read production state and produce reports or planning artifacts. They have a PR-SPRINT or PR-MILESTONE gate at specific mode thresholds.

| Metric | PASS criteria |
|---|---|
| SP1 — Reads sprint/milestone state | Skill reads production/sprints/ or production/milestones/ before producing output |
| SP2 — Correct sprint gate | PR-SPRINT (for planning) or PR-MILESTONE (for milestone review) gate runs in full mode, skips in lean/solo |
| SP3 — Structured output | Output uses a consistent structure (velocity table, risk list, action items) rather than free prose |
| SP4 — No auto-commit | Skill never writes sprint files or milestone records without "May I write" |

utility

Skills: start, help, brainstorm, onboard, adopt, hotfix, prototype, localize, launch-checklist, release-checklist, smoke-check, soak-test, test-setup, test-helpers, regression-suite, qa-plan, bug-triage, bug-report, playtest-report, asset-spec, reverse-document, project-stage-detect, setup-engine, skill-test, skill-improve, day-one-patch, and any other skills not in categories above

Utility skills pass the 7 standard static checks. If they happen to spawn director gates, the gate mode logic must also be correct.

| Metric | PASS criteria |
|---|---|
| U1 — Passes all 7 static checks | /skill-test static [name] returns COMPLIANT with 0 FAILs |
| U2 — Gate mode correct (if applicable) | If the skill spawns any director gate, it reads review-mode and applies full/lean/solo logic correctly |

Agent Categories

Used to validate agent spec files in tests/agents/.

director

Agents: creative-director, technical-director, art-director, producer

| Metric | PASS criteria |
|---|---|
| D1 — Correct verdict vocabulary | Returns APPROVE / CONCERNS / REJECT (or domain equivalent: REALISTIC/CONCERNS/UNREALISTIC for producer) |
| D2 — Domain boundary respected | Does not make binding decisions outside its declared domain |
| D3 — Conflict escalation | When two departments conflict, escalates to correct parent (creative-director or technical-director) rather than unilaterally deciding |
| D4 — Opus model tier | Agent is assigned Opus model per coordination-rules.md |

lead

Agents: lead-programmer, qa-lead, narrative-director, audio-director, game-designer, systems-designer, level-designer

| Metric | PASS criteria |
|---|---|
| L1 — Domain verdict | Returns a domain-specific verdict (e.g., FEASIBLE/INFEASIBLE for lead-programmer, PASS/FAIL for qa-lead) |
| L2 — Escalates to shared parent | Out-of-domain conflicts escalate to creative-director (design) or technical-director (tech) |
| L3 — Sonnet model tier | Agent is assigned Sonnet model (default) per coordination-rules.md |

specialist

Agents: gameplay-programmer, ai-programmer, technical-artist, sound-designer, engine-programmer, tools-programmer, network-programmer, security-engineer, accessibility-specialist, ux-designer, ui-programmer, performance-analyst, prototyper, qa-tester, writer, world-builder

| Metric | PASS criteria |
|---|---|
| S1 — Stays in domain | Explicitly scopes itself to its declared domain; defers out-of-domain requests |
| S2 — No binding cross-domain decisions | Does not unilaterally decide matters owned by another specialist |
| S3 — Defers correctly | Out-of-domain requests are redirected to the correct agent, not refused silently |

engine

Agents: godot-specialist, godot-gdscript-specialist, godot-csharp-specialist, godot-shader-specialist, godot-gdextension-specialist, unity-specialist, unity-ui-specialist, unity-shader-specialist, unity-dots-specialist, unity-addressables-specialist, unreal-specialist, ue-blueprint-specialist, ue-gas-specialist, ue-umg-specialist, ue-replication-specialist

| Metric | PASS criteria |
|---|---|
| E1 — Version-aware | References engine version from docs/engine-reference/ before suggesting API calls; flags post-cutoff risk |
| E2 — File routing | Routes file types to the correct sub-specialist (e.g., .gdshader → godot-shader-specialist, not godot-gdscript-specialist) |
| E3 — Engine-specific patterns | Enforces engine-specific idioms (e.g., GDScript static typing, C# attribute exports, Blueprint function libraries) |
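
The routing rule in E2 can be pictured as an extension lookup. Only the `.gdshader` mapping comes from the rubric; the `.gd` and `.cs` entries, the `godot-specialist` fallback, and the names `ROUTES` and `route_file` are assumptions made for illustration, using agent names from the list above.

```python
# Hypothetical sketch of E2: route a file to a Godot sub-specialist by extension.
from pathlib import PurePosixPath

ROUTES = {
    ".gdshader": "godot-shader-specialist",  # the example given in E2
    ".gd": "godot-gdscript-specialist",      # assumed mapping
    ".cs": "godot-csharp-specialist",        # assumed mapping
}


def route_file(path, default="godot-specialist"):
    """Pick a specialist from the file's final extension; fall back to the generalist."""
    suffix = PurePosixPath(path).suffix.lower()
    return ROUTES.get(suffix, default)
```

Matching on the whole suffix, not on a prefix, keeps `.gdshader` files from being routed to the GDScript specialist, which is the mistake E2 calls out.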

qa

Agents: qa-tester, qa-lead, security-engineer, accessibility-specialist

| Metric | PASS criteria |
|---|---|
| Q1 — Produces artifacts not code | Primary output is test cases, bug reports, or coverage gaps — not implementation code |
| Q2 — Evidence format | Test cases follow the project's test evidence format (unit/integration/visual/UI per coding-standards.md) |
| Q3 — No scope creep | Does not propose new features; flags gaps for humans to decide |

operations

Agents: devops-engineer, release-manager, live-ops-designer, community-manager, analytics-engineer, economy-designer, localization-lead

| Metric | PASS criteria |
|---|---|
| O1 — Domain ownership clear | Agent description clearly states what it owns (pipeline, releases, economy, etc.) |
| O2 — Defers implementation | Does not write game logic or engine code; delegates to appropriate specialist |
| O3 — Toolset matches role | allowed-tools in frontmatter matches the operational (not coding) nature of the role |