mirror of https://github.com/Donchitos/Claude-Code-Game-Studios synced 2026-04-21 13:27:18 +00:00

Donchitos a73ff759c9 Add v0.5.0: CCGS Skill Testing Framework, skill-improve, 4 new skills, director gate path fixes

- Add CCGS Skill Testing Framework: self-contained QA layer with 72 skill specs,
  49 agent specs, catalog.yaml, quality-rubric.md, templates, README, CLAUDE.md
- Add /skill-improve: test-fix-retest loop covering static + category checks
- Add 4 missing skills: /art-bible, /asset-spec, /day-one-patch, /security-audit
- Add /skill-test category mode (Phase 2D) with quality rubric evaluation
- Extend /skill-test audit to cover agent specs alongside skill specs
- Update all skill-test and skill-improve path refs to CCGS Skill Testing Framework/
- Remove stale tests/skills/ directory (superseded by CCGS Skill Testing Framework)
- Add director gate intensity modes (full/lean/solo) to gate-check and related skills

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-06 17:42:32 +10:00

6.4 KiB

Raw Blame History

Agent Test Spec: analytics-engineer

Agent Summary

Domain: Telemetry architecture and event schema design, A/B test framework design, player behavior analysis methodology, analytics dashboard specification, event naming conventions, data pipeline design (schema → ingestion → dashboard)
Does NOT own: Game implementation of event tracking (appropriate programmer), economy design decisions informed by analytics (economy-designer), live ops event design (live-ops-designer)
Model tier: Sonnet
Gate IDs: None; produces schemas and test designs; defers implementation to programmers

Static Assertions (Structural)

description: field is present and domain-specific (references telemetry, A/B testing, event tracking, analytics)
allowed-tools: list matches the agent's role (Read/Write for design/analytics/ and documentation; no game source or CI tools)
Model tier is Sonnet (default for operations specialists)
Agent definition does not claim authority over game implementation, economy design, or live ops scheduling

Test Cases

Case 1: In-domain request — tutorial event tracking design

Input: "Design the analytics event tracking for our tutorial. We want to know where players drop off and which steps they complete." Expected behavior:

Produces a structured event schema for each tutorial step: at minimum, event_name, properties (step_id, step_name, player_id, session_id, timestamp), and trigger_condition (when exactly the event fires — on step start, on step complete, on step skip)
Includes a funnel-completion event and a drop-off event (e.g., tutorial_step_abandoned if the player exits during a step)
Specifies the event naming convention: snake_case, prefixed by domain (e.g., tutorial_step_started, tutorial_step_completed, tutorial_abandoned)
Does NOT produce implementation code — marks implementation as [TO BE IMPLEMENTED BY PROGRAMMER]
Output is a schema table or structured list, not a narrative description

Case 2: Out-of-domain request — implement the event tracking in code

Input: "Now that the event schema is designed, write the GDScript code to fire these events in our Godot tutorial scene." Expected behavior:

Does not produce GDScript or any implementation code
States clearly: "Telemetry implementation in game code is handled by the appropriate programmer (gameplay-programmer or systems-programmer); I provide the event schema and integration requirements"
Optionally produces an integration spec: what the programmer needs to know to implement correctly (event name, properties, when to fire, what analytics SDK or endpoint to use)

Case 3: Domain boundary — A/B test design for a UI change

Input: "We want to A/B test two versions of our HUD: the current version and a minimal version with only a health bar. Design the test." Expected behavior:

Produces a complete A/B test design document:
- Hypothesis: The minimal HUD will increase player engagement (measured by session length) by reducing UI cognitive load
- Primary metric: Average session length per player
- Secondary metrics: Tutorial completion rate, Day 1 retention
- Sample size: Calculated estimate based on expected effect size (or notes that exact calculation requires baseline data) — does NOT skip this field
- Duration: Minimum duration (e.g., "at least 2 weeks to capture weekly player behavior patterns")
- Randomization unit: Player ID (not session ID, to prevent players seeing both versions)
Output is structured as a formal test design, not a bullet list of ideas

Case 4: Conflict — overlapping A/B test player segments

Input: "We have two A/B tests running simultaneously: Test A (HUD variants) affects all players, and Test B (tutorial variants) also affects all players." Expected behavior:

Flags the overlap as a mutual exclusion violation: if both tests affect the same player, their results are confounded — neither test produces clean data
Identifies the problem precisely: players in both tests will have HUD and tutorial variants interacting, making it impossible to attribute outcome differences to either variable alone
Proposes resolution options: (a) run tests sequentially, (b) split the player population into exclusive segments (50% in Test A, 50% in Test B, 0% in both), or (c) run a factorial design if the interaction effect is also of interest (more complex, requires larger sample)
Does NOT recommend continuing both tests on overlapping populations

Case 5: Context pass — new events consistent with existing schema

Input context: Existing event schema uses the naming convention: [domain]_[object]_[action] in snake_case. Example events: combat_enemy_killed, inventory_item_equipped, tutorial_step_completed. Input: "Design event tracking for our new crafting system: players gather materials, open the crafting menu, and craft items." Expected behavior:

Produces events following the exact naming convention from the provided schema: crafting_material_gathered, crafting_menu_opened, crafting_item_crafted
Does NOT invent a different naming pattern (e.g., gatherMaterial, craftingOpened) even if it might seem natural
Properties follow the same structure as existing events: player_id, session_id, timestamp as standard fields; domain-specific fields (material_type, item_id, crafting_time_seconds) as additional properties
Output explicitly references the provided naming convention as the standard being followed

Protocol Compliance

Stays within declared domain (event schema design, A/B test design, analytics methodology)
Redirects implementation requests to appropriate programmers with an integration spec, not code
Produces complete A/B test designs (hypothesis, metric, sample size, duration, randomization unit) — never partial
Flags mutual exclusion violations in overlapping A/B tests as data quality blockers
Follows provided naming conventions exactly; does not invent alternative conventions

Coverage Notes

Case 3 (A/B test design completeness) is a quality gate — an incomplete test design wastes experiment budget
Case 4 (mutual exclusion) is a data integrity test — overlapping tests produce unusable results; this must be caught
Case 5 is the most important context-awareness test; naming convention drift across schemas causes dashboard breakage
No automated runner; review manually or via /skill-test

6.4 KiB Raw Blame History