mirror of https://github.com/Donchitos/Claude-Code-Game-Studios synced 2026-04-21 13:27:18 +00:00

Donchitos 3614e1dbfb Add v0.6.0: full skill/agent QA pass, 3 new agents tested, template cleanup

Skills fixed: sprint-status (stale escalation, threshold), retrospective
(existing file detection, missing data fallback), changelog (misc category,
task-ref count), patch-notes (BLOCKED on missing changelog, tone/template
paths), story-readiness (Phase 0 mode resolution, QL-STORY-READY gate),
art-bible, brainstorm, design-system, ux-design, dev-story, story-done,
create-architecture, create-control-manifest, map-systems, propagate-design-change,
quick-design, prototype, asset-spec.

Agents fixed: all 4 directors (gate verdict token format), engine-programmer,
ui-programmer, tools-programmer, technical-artist (engine version safety),
gameplay-programmer (ADR compliance), godot-gdextension-specialist (ABI warning),
systems-designer (escalation path to creative-director), accessibility-specialist
(model, tools, WCAG criterion format, findings template), live-ops-designer
(escalation paths, battle pass value language), qa-tester (model, test case
format, evidence routing, ambiguous criteria, regression scope).

Specs updated: smoke-check and adopt specs rewritten to match actual skill
behavior. catalog.yaml reset to blank template state. Removed
session-state marketing research file, removed session-state from gitignore.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-07 17:28:46 +10:00

Skill Quality Rubric

Used by `/skill-test category [name|all]` to evaluate skills beyond structural compliance. Each category defines a short list of metrics specific to the skill's job: 4–5 for most categories, 2 for `utility`.

A metric is PASS when the skill's written instructions clearly satisfy the criterion. A metric is FAIL when the instructions are absent, ambiguous, or contradictory. A metric is WARN when the instructions partially address the criterion.

Skill Categories

`gate`

Skills: gate-check

Gate skills control phase transitions. They must enforce correctness without auto-advancing the stage, and they must respect the three review modes (`full`, `lean`, `solo`).

Metric	PASS criteria
G1 — Review mode read	Skill reads `production/session-state/review-mode.txt` (or equivalent) before deciding which directors to spawn
G2 — Full mode: all 4 directors spawn	In `full` mode, the PHASE-GATE prompts of all 4 Tier-1 directors (CD, TD, PR, AD) are invoked in parallel
G3 — Lean mode: PHASE-GATE only	In `lean` mode, only `*-PHASE-GATE` gates run; inline gates (CD-PILLARS, TD-ARCHITECTURE, etc.) are skipped
G4 — Solo mode: no directors	In `solo` mode, no director gates spawn; each is noted as "skipped — Solo mode"
G5 — No auto-advance	Skill never writes `production/stage.txt` without explicit user confirmation via "May I write"
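
As a rough illustration of G2 to G4, the mode logic can be sketched as a small selection function. This is only a sketch under stated assumptions: the function name `plan_gates`, the gate names built as `<director>-PHASE-GATE`, and the idea that inline gates also run in `full` mode are inferences from the table above, not something the rubric defines.

```python
from typing import List, Tuple

# The four Tier-1 director prefixes named in G2.
DIRECTORS = ("CD", "TD", "PR", "AD")


def plan_gates(mode: str, inline_gates: List[str]) -> Tuple[List[str], List[str]]:
    """Return (gates_to_run, gates_skipped) for a review mode.

    `inline_gates` are non-phase gates such as "CD-PILLARS".
    Assumption: gate names follow the "<director>-PHASE-GATE" pattern.
    """
    phase_gates = [f"{d}-PHASE-GATE" for d in DIRECTORS]
    mode = mode.strip().lower()  # tolerate whitespace from review-mode.txt
    if mode == "full":
        # Assumption: inline gates run alongside the phase gates in full mode.
        return phase_gates + list(inline_gates), []
    if mode == "lean":
        # G3: only the phase gates run; inline gates are skipped.
        return phase_gates, list(inline_gates)
    if mode == "solo":
        # G4: nothing spawns; every gate is reported as skipped.
        return [], phase_gates + list(inline_gates)
    raise ValueError(f"unknown review mode: {mode!r}")
```

A caller would still need to print a skip note for each entry in the second list (G4 asks for the wording "skipped — Solo mode"), and writing `production/stage.txt` remains a separate, user-confirmed step per G5.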

`review`

Skills: design-review, architecture-review, review-all-gdds

Review skills read documents and produce structured verdicts. They are primarily read-only and must not trigger director gates during the analysis phase.

Metric	PASS criteria
R1 — Read-only enforcement	Skill does not modify the reviewed document without explicit user approval; any write operations (review logs, index updates) are gated behind "May I write"
R2 — 8-section check	Skill evaluates all 8 required GDD sections (or equivalent architectural sections) explicitly
R3 — Correct verdict vocabulary	Verdict is exactly one of: APPROVED / NEEDS REVISION / MAJOR REVISION NEEDED (design) or PASS / CONCERNS / FAIL (architecture)
R4 — No director gates during analysis	Skill does not spawn director gates during its analysis phases; post-analysis director review (as in architecture-review) is acceptable when the skill's scope and stakes warrant it
R5 — Structured findings	Output contains a per-section status table or checklist before the final verdict

Exceptions:

design-review: Has Write, Edit in allowed-tools to support an optional "Revise now" path (all writes gated behind user approval) and to write review logs. R1 is satisfied because the reviewed document is never silently modified.

architecture-review: Spawns TD-ARCHITECTURE and LP-FEASIBILITY gates after its analysis is complete. This is intentional — architecture review is high-stakes and benefits from director sign-off. R4 is satisfied because the gates run post-analysis, not during it.
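
A check for R3 could look like the following minimal sketch. The function name `is_valid_verdict` and the `kind` labels are hypothetical; the verdict strings are taken from R3, and the check assumes an exact, case-sensitive match because the metric says "exactly one of".

```python
# Verdict vocabularies as listed in R3.
DESIGN_VERDICTS = frozenset({"APPROVED", "NEEDS REVISION", "MAJOR REVISION NEEDED"})
ARCHITECTURE_VERDICTS = frozenset({"PASS", "CONCERNS", "FAIL"})

_VOCABULARY = {
    "design": DESIGN_VERDICTS,
    "architecture": ARCHITECTURE_VERDICTS,
}


def is_valid_verdict(kind: str, verdict: str) -> bool:
    """Return True if `verdict` is an allowed verdict for the review `kind`.

    No normalization is applied: "approved" or "PASS " would be rejected.
    """
    if kind not in _VOCABULARY:
        raise ValueError(f"unknown review kind: {kind!r}")
    return verdict in _VOCABULARY[kind]
```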

`authoring`

Skills: design-system, quick-design, architecture-decision, ux-design, ux-review, art-bible, create-architecture

Authoring skills create or update design documents collaboratively. Full GDD/UX authoring skills use a section-by-section cycle; lightweight authoring skills use a single-draft pattern appropriate to their smaller scope.

Metric	PASS criteria
A1 — Section-by-section cycle	Full authoring skills (design-system, ux-design, art-bible) author one section at a time, presenting content for approval before proceeding to the next. Lightweight skills (quick-design, architecture-decision, create-architecture) may draft the complete document then ask for approval — single-draft is acceptable for documents under ~4 hours of implementation scope.
A2 — May-I-write per section	Full authoring skills ask "May I write this to [filepath]?" before each section write. Lightweight skills ask once for the complete document.
A3 — Retrofit mode	Skill detects if the target file already exists and offers to update specific sections rather than overwriting the whole document. Lightweight skills (quick-design) that always create new files are exempt.
A4 — Director gate at correct tier	If a director gate is defined for this skill (e.g., CD-GDD-ALIGN, TD-ADR), it runs at the correct mode threshold (full/lean) — NOT in solo
A5 — Skeleton-first	Full authoring skills create a file skeleton with all section headers before filling content, to preserve progress on session interruption. Lightweight skills are exempt.

Full authoring skills (must pass all 5 metrics): design-system, ux-design, art-bible

Lightweight authoring skills (A1, A2, A5 use single-draft pattern; A3 exempt for new-file-only skills): quick-design, architecture-decision, create-architecture

Review-mode skill (evaluated against review metrics): ux-review

`readiness`

Skills: story-readiness, story-done

Readiness skills validate stories before or after implementation. They must produce multi-dimensional verdicts and integrate correctly with director gate mode.

Metric	PASS criteria
RD1 — Multi-dimensional check	Skill checks ≥3 independent dimensions (e.g., Design, Architecture, Scope, DoD) and reports each separately
RD2 — Three verdict levels	Verdict hierarchy is clearly defined: READY/COMPLETE > NEEDS WORK/COMPLETE WITH NOTES > BLOCKED
RD3 — BLOCKED requires external action	BLOCKED verdict is reserved for issues that cannot be fixed by the story author alone (e.g., Proposed ADR, unresolvable dependency)
RD4 — Director gate at correct mode	QL-STORY-READY or LP-CODE-REVIEW gate spawns in `full` mode, skips in `lean`/`solo` with a noted skip message
RD5 — Next-story handoff	After completion, skill surfaces the next READY story from the active sprint
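
To make RD1 and RD2 concrete, here is a hedged sketch of how per-dimension results might roll up into one verdict. The rule that the worst dimension decides the overall verdict is an assumption for illustration, as is the function name `overall_verdict`; the rubric only states the hierarchy itself. The story-readiness labels are used here; story-done would substitute COMPLETE and COMPLETE WITH NOTES.

```python
from typing import Dict

# RD2 hierarchy for story-readiness, ordered best to worst.
HIERARCHY = ["READY", "NEEDS WORK", "BLOCKED"]


def overall_verdict(dimensions: Dict[str, str]) -> str:
    """Combine per-dimension verdicts into one overall verdict.

    Assumption: the worst dimension wins. Raises ValueError when fewer
    than 3 dimensions are reported (RD1) or a verdict is not in HIERARCHY.
    """
    if len(dimensions) < 3:
        raise ValueError("RD1 expects at least 3 independent dimensions")
    # list.index raises ValueError for verdicts outside the vocabulary.
    return max(dimensions.values(), key=HIERARCHY.index)
```

For example, a story that is READY on Design and Scope but NEEDS WORK on Architecture would come out as NEEDS WORK under this rule.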

`pipeline`

Skills: create-epics, create-stories, dev-story, create-control-manifest, propagate-design-change, map-systems

Pipeline skills produce artifacts that other skills consume. They must write files with correct schema, respect layer/priority ordering, and gate before writing.

Metric	PASS criteria
P1 — Correct output schema	Each produced file follows the project template (EPIC.md, story frontmatter, etc.); skill references the template path
P2 — Layer/priority ordering	Skills that produce epics or stories respect layer ordering (core → extended → meta) and priority fields
P3 — May-I-write before each artifact	Skill asks "May I write [artifact]?" before creating each output file, not batch-approving all files at once
P4 — Director gate at correct tier	In-scope gates (PR-EPIC, QL-STORY-READY, LP-CODE-REVIEW, etc.) run in `full`, skip in `lean`/`solo` with noted skip
P5 — Reads before writes	Skill reads the relevant GDD/ADR/manifest before producing artifacts to ensure alignment
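
P3 asks for one confirmation per artifact rather than a single batch approval. The loop below is a minimal sketch of that pattern; `write_artifacts`, `confirm`, and `write` are hypothetical names, with the two callables standing in for however the skill actually prompts the user and writes files.

```python
from typing import Callable, Dict, List


def write_artifacts(
    artifacts: Dict[str, str],
    confirm: Callable[[str], bool],
    write: Callable[[str, str], None],
) -> List[str]:
    """Ask "May I write <path>?" before each artifact and write only on approval.

    Returns the paths that were actually written. A declined artifact is
    skipped without affecting the others.
    """
    written: List[str] = []
    for path, content in artifacts.items():
        if confirm(f"May I write {path}?"):
            write(path, content)
            written.append(path)
    return written
```

Keeping the prompt inside the loop is what distinguishes this from batch approval: the user can accept one epic file and decline the next.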

`analysis`

Skills: consistency-check, balance-check, content-audit, code-review, tech-debt, scope-check, estimate, perf-profile, asset-audit, security-audit, test-evidence-review, test-flakiness

Analysis skills scan the project and surface findings. They are read-only during analysis and must ask before recommending any file writes.

Metric	PASS criteria
AN1 — Read-only scan	Analysis phase uses only Read/Glob/Grep tools; no Write or Edit during the scan itself
AN2 — Structured findings table	Output includes a findings table or checklist (not prose only) with severity/priority per finding
AN3 — No auto-write	Any suggested file writes (e.g., tech-debt register, fix patches) are gated behind "May I write"
AN4 — No director gates during analysis	Analysis skills do not spawn director gates; they produce findings for human review
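
AN1 can be checked mechanically if the tools used during the scan are known. The sketch below assumes such a list is available as plain tool names; the function name `scan_violations` is hypothetical.

```python
from typing import Iterable, List

# Tools that AN1 permits during the analysis phase.
READ_ONLY_TOOLS = frozenset({"Read", "Glob", "Grep"})


def scan_violations(tools_used: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated tools that break the read-only rule."""
    return sorted(set(tools_used) - READ_ONLY_TOOLS)
```

An empty result means the scan stayed read-only; any entry (for example `Write` or `Edit`) would count against AN1.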

`team`

Skills: team-combat, team-narrative, team-audio, team-level, team-ui, team-qa, team-release, team-polish, team-live-ops

Team skills orchestrate multiple specialist agents for a department. They must spawn the right agents, run independent ones in parallel, and surface blocks immediately.

Metric	PASS criteria
T1 — Named agent list	Skill explicitly names which agents it spawns and in what order
T2 — Parallel where independent	Agents whose inputs don't depend on each other are spawned in parallel (single message, multiple Task calls)
T3 — BLOCKED surfacing	If any spawned agent returns BLOCKED or fails, skill surfaces it immediately and halts dependent work — never silently skips
T4 — Collect all verdicts before proceeding	Dependent phases wait for all parallel agents to complete before proceeding
T5 — Usage error on no argument	If required argument (e.g., feature name) is missing, skill outputs usage hint and stops without spawning agents
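
The orchestration rules T2 to T5 can be sketched in ordinary Python, with plain callables standing in for spawned agents. Everything named here (`Agent`, `run_parallel`, `find_blocked`, `orchestrate`, the `"FAILED: ..."` string, the status labels) is a hypothetical illustration; real team skills spawn agents through Task calls rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Agent = Callable[[], str]  # stand-in for a spawned agent; returns its verdict


def run_parallel(agents: Dict[str, Agent]) -> Dict[str, str]:
    """Start every agent at once and wait for all verdicts (T2, T4)."""
    if not agents:
        return {}
    verdicts: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {name: pool.submit(agent) for name, agent in agents.items()}
        for name, future in futures.items():
            try:
                verdicts[name] = future.result()
            except Exception as exc:  # a crashed agent counts as a failure
                verdicts[name] = f"FAILED: {exc}"
    return verdicts


def find_blocked(verdicts: Dict[str, str]) -> List[str]:
    """Names of agents that returned BLOCKED or failed (T3)."""
    return sorted(
        name for name, v in verdicts.items()
        if v == "BLOCKED" or v.startswith("FAILED")
    )


def orchestrate(feature: str, independent: Dict[str, Agent],
                dependent: Dict[str, Agent]) -> dict:
    """Run independent agents first, then dependent ones if nothing is blocked."""
    if not feature:
        # T5: missing argument, so stop before spawning anything.
        return {"status": "usage", "verdicts": {}, "blocked": []}
    verdicts = run_parallel(independent)
    blocked = find_blocked(verdicts)
    if blocked:
        # T3: surface the block and halt dependent work.
        return {"status": "blocked", "verdicts": verdicts, "blocked": blocked}
    verdicts.update(run_parallel(dependent))
    blocked = find_blocked(verdicts)
    status = "blocked" if blocked else "done"
    return {"status": status, "verdicts": verdicts, "blocked": blocked}
```

The dependent phase starts only after every independent verdict has been collected, which is the ordering T4 describes.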

`sprint`

Skills: sprint-plan, sprint-status, milestone-review, retrospective, changelog, patch-notes

Sprint skills read production state and produce reports or planning artifacts. They have a PR-SPRINT or PR-MILESTONE gate at specific mode thresholds.

Metric	PASS criteria
SP1 — Reads sprint/milestone state	Skill reads `production/sprints/` or `production/milestones/` before producing output
SP2 — Correct sprint gate	PR-SPRINT (for planning) or PR-MILESTONE (for milestone review) gate runs in `full` mode, skips in `lean`/`solo`
SP3 — Structured output	Output uses a consistent structure (velocity table, risk list, action items) rather than free prose
SP4 — No auto-commit	Skill never writes sprint files or milestone records without "May I write"

`utility`

Skills: start, help, brainstorm, onboard, adopt, hotfix, prototype, localize, launch-checklist, release-checklist, smoke-check, soak-test, test-setup, test-helpers, regression-suite, qa-plan, bug-triage, bug-report, playtest-report, asset-spec, reverse-document, project-stage-detect, setup-engine, skill-test, skill-improve, day-one-patch, and any other skills not in categories above

Utility skills must pass the 7 standard static checks. If they happen to spawn director gates, the gate mode logic must also be correct.

Metric	PASS criteria
U1 — Passes all 7 static checks	`/skill-test static [name]` returns COMPLIANT with 0 FAILs
U2 — Gate mode correct (if applicable)	If the skill spawns any director gate, it reads review-mode and applies full/lean/solo logic correctly

Agent Categories

Used to validate agent spec files in tests/agents/.

`director`

Agents: creative-director, technical-director, art-director, producer

Metric	PASS criteria
D1 — Correct verdict vocabulary	Returns APPROVE / CONCERNS / REJECT (or domain equivalent: REALISTIC/CONCERNS/UNREALISTIC for producer)
D2 — Domain boundary respected	Does not make binding decisions outside its declared domain
D3 — Conflict escalation	When two departments conflict, escalates to correct parent (creative-director or technical-director) rather than unilaterally deciding
D4 — Opus model tier	Agent is assigned Opus model per coordination-rules.md

`lead`

Agents: lead-programmer, qa-lead, narrative-director, audio-director, game-designer, systems-designer, level-designer

Metric	PASS criteria
L1 — Domain verdict	Returns a domain-specific verdict (e.g., FEASIBLE/INFEASIBLE for lead-programmer, PASS/FAIL for qa-lead)
L2 — Escalates to shared parent	Out-of-domain conflicts escalate to creative-director (design) or technical-director (tech)
L3 — Sonnet model tier	Agent is assigned Sonnet model (default) per coordination-rules.md

`specialist`

Agents: gameplay-programmer, ai-programmer, technical-artist, sound-designer, engine-programmer, tools-programmer, network-programmer, security-engineer, accessibility-specialist, ux-designer, ui-programmer, performance-analyst, prototyper, qa-tester, writer, world-builder

Metric	PASS criteria
S1 — Stays in domain	Explicitly scopes itself to its declared domain; defers out-of-domain requests
S2 — No binding cross-domain decisions	Does not unilaterally decide matters owned by another specialist
S3 — Defers correctly	Out-of-domain requests are redirected to the correct agent, not refused silently

`engine`

Agents: godot-specialist, godot-gdscript-specialist, godot-csharp-specialist, godot-shader-specialist, godot-gdextension-specialist, unity-specialist, unity-ui-specialist, unity-shader-specialist, unity-dots-specialist, unity-addressables-specialist, unreal-specialist, ue-blueprint-specialist, ue-gas-specialist, ue-umg-specialist, ue-replication-specialist

Metric	PASS criteria
E1 — Version-aware	References engine version from `docs/engine-reference/` before suggesting API calls; flags post-cutoff risk
E2 — File routing	Routes file types to the correct sub-specialist (e.g., `.gdshader` → godot-shader-specialist, not godot-gdscript-specialist)
E3 — Engine-specific patterns	Enforces engine-specific idioms (e.g., GDScript static typing, C# attribute exports, Blueprint function libraries)
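
E2's routing rule amounts to a lookup by file extension. In the sketch below only the `.gdshader` mapping comes from the rubric; the `.gd` and `.cs` entries, the `godot-specialist` fallback, and the function name `route` are assumptions added for illustration.

```python
from pathlib import PurePosixPath

# Extension-to-agent routing table for a Godot project.
ROUTES = {
    ".gdshader": "godot-shader-specialist",  # stated in E2
    ".gd": "godot-gdscript-specialist",      # assumed mapping
    ".cs": "godot-csharp-specialist",        # assumed mapping
}
DEFAULT_AGENT = "godot-specialist"           # assumed fallback


def route(path: str) -> str:
    """Pick the sub-specialist for a file based on its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    return ROUTES.get(suffix, DEFAULT_AGENT)
```

Matching on the full suffix keeps `.gdshader` files away from the GDScript specialist, which is the misrouting E2 warns about.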

`qa`

Agents: qa-tester, qa-lead, security-engineer, accessibility-specialist

Metric	PASS criteria
Q1 — Produces artifacts not code	Primary output is test cases, bug reports, or coverage gaps — not implementation code
Q2 — Evidence format	Test cases follow the project's test evidence format (unit/integration/visual/UI per coding-standards.md)
Q3 — No scope creep	Does not propose new features; flags gaps for humans to decide

`operations`

Agents: devops-engineer, release-manager, live-ops-designer, community-manager, analytics-engineer, economy-designer, localization-lead

Metric	PASS criteria
O1 — Domain ownership clear	Agent description clearly states what it owns (pipeline, releases, economy, etc.)
O2 — Defers implementation	Does not write game logic or engine code; delegates to appropriate specialist
O3 — Toolset matches role	`allowed-tools` in frontmatter matches the operational (not coding) nature of the role
