diff --git a/.gemini/commands/fix-behavioral-eval.toml b/.gemini/commands/fix-behavioral-eval.toml deleted file mode 100644 index d2f1c5b3ed..0000000000 --- a/.gemini/commands/fix-behavioral-eval.toml +++ /dev/null @@ -1,60 +0,0 @@ -description = "Check status of nightly evals, fix failures for key models, and re-run." -prompt = """ -You are an expert at fixing behavioral evaluations. - -1. **Investigate**: - - Use 'gh' cli to fetch the results from the latest run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml. - - DO NOT push any changes or start any runs. The rest of your evaluation will be local. - - Evals are in evals/ directory and are documented by evals/README.md. - - The test case trajectory logs will be logged to evals/logs. - - You should also enable and review the verbose agent logs by setting the GEMINI_DEBUG_LOG_FILE environment variable. - - Identify the relevant test. Confine your investigation and validation to just this test. - - Proactively add logging that will aid in gathering information or validating your hypotheses. - -2. **Fix**: - - If a relevant test is failing, locate the test file and the corresponding prompt/code. - - It's often helpful to make an extreme, brute force change to see if you are changing the right place to make an improvement and then scope it back iteratively. - - Your **final** change should be **minimal and targeted**. - - Keep in mind the following: - - The prompt has multiple configurations and pieces. Take care that your changes - end up in the final prompt for the selected model and configuration. - - The prompt chosen for the eval is intentional. It's often vague or indirect - to see how the agent performs with ambiguous instructions. Changing it should - be a last resort. - - When changing the test prompt, carefully consider whether the prompt still tests - the same scenario. We don't want to lose test fidelity by making the prompts too - direct (i.e.: easy). 
- - Your primary mechanism for improving the agent's behavior is to make changes to - tool instructions, system prompt (snippets.ts), and/or modules that contribute to the prompt. - - If prompt and description changes are unsuccessful, use logs and debugging to - confirm that everything is working as expected. - - If unable to fix the test, you can make recommendations for architecture changes - that might help stablize the test. Be sure to THINK DEEPLY if offering architecture guidance. - Some facts that might help with this are: - - Agents may be composed of one or more agent loops. - - AgentLoop == 'context + toolset + prompt'. Subagents are one type of agent loop. - - Agent loops perform better when: - - They have direct, unambiguous, and non-contradictory prompts. - - They have fewer irrelevant tools. - - They have fewer goals or steps to perform. - - They have less low value or irrelevant context. - - You may suggest compositions of existing primitives, like subagents, or - propose a new one. - - These recommendations should be high confidence and should be grounded - in observed deficient behaviors rather than just parroting the facts above. - Investigate as needed to ground your recommendations. - -3. **Verify**: - - Run just that one test if needed to validate that it is fixed. Be sure to run vitest in non-interactive mode. - - Running the tests can take a long time, so consider whether you can diagnose via other means or log diagnostics before committing the time. You must minimize the number of test runs needed to diagnose the failure. - - After the test completes, check whether it seems to have improved. - - You will need to run the test 3 times for Gemini 3.0, Gemini 3 flash, and Gemini 2.5 pro to ensure that it is truly stable. Run these runs in parallel, using scripts if needed. - - Some flakiness is expected; if it looks like a transient issue or the test is inherently unstable but passes 2/3 times, you might decide it cannot be improved. - -4. 
**Report**: - - Provide a summary of the test success rate for each of the tested models. - - Success rate is calculated based on 3 runs per model (e.g., 3/3 = 100%). - - If you couldn't fix it due to persistent flakiness, explain why. - -{{args}} -""" \ No newline at end of file diff --git a/.gemini/commands/promote-behavioral-eval.toml b/.gemini/commands/promote-behavioral-eval.toml deleted file mode 100644 index 9893e9b02b..0000000000 --- a/.gemini/commands/promote-behavioral-eval.toml +++ /dev/null @@ -1,29 +0,0 @@ -description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs." -prompt = """ -You are an expert at analyzing and promoting behavioral evaluations. - -1. **Investigate**: - - Use 'gh' cli to fetch the results from the most recent run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml. - - DO NOT push any changes or start any runs. The rest of your evaluation will be local. - - Evals are in evals/ directory and are documented by evals/README.md. - - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row. - - NOTE: the results summary from the most recent run contains the last 7 runs test results. 100% means the test passed 3/3 times for that model and run. - - If a test meets this criteria, it is a candidate for promotion. - -2. **Promote**: - - For each candidate test, locate the test file in the evals/ directory. - - Promote the test according to the project's standard promotion process (e.g., moving it to a stable suite, updating its tags, or removing skip/flaky annotations). - - Ensure you follow any guidelines in evals/README.md for stable tests. - - Your **final** change should be **minimal and targeted** to just promoting the test status. - -3. **Verify**: - - Run the promoted tests locally to validate that they still execute correctly. Be sure to run vitest in non-interactive mode. 
- - Check that the test is now part of the expected standard or stable test suites. - -4. **Report**: - - Provide a summary of the tests that were promoted. - - Include the success rate evidence (7/7 runs passed for all models) for each promoted test. - - If no tests met the criteria for promotion, clearly state that and summarize the closest candidates. - -{{args}} -""" diff --git a/.gemini/skills/behavioral-evals/SKILL.md b/.gemini/skills/behavioral-evals/SKILL.md new file mode 100644 index 0000000000..f60fb04832 --- /dev/null +++ b/.gemini/skills/behavioral-evals/SKILL.md @@ -0,0 +1,56 @@ +--- +name: behavioral-evals +description: Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures or prompt steering, or adding workspace regression tests. +--- + +# Behavioral Evals + +## Overview + +Behavioral evaluations (evals) are tests that validate the **agent's decision-making** (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions. + +> [!NOTE] +> **Single Source of Truth**: For core concepts, policies, running tests, and general best practices, always refer to **[evals/README.md](../../../evals/README.md)**. + +--- + +## πŸ”„ Workflow Decision Tree + +1. **Does a prompt/tool change need validation?** + * *No* -> Normal integration tests. + * *Yes* -> Continue below. +2. **Is it UI/Interaction heavy?** + * *Yes* -> Use `appEvalTest` (`AppRig`). See **[creating.md](references/creating.md)**. + * *No* -> Use `evalTest` (`TestRig`). See **[creating.md](references/creating.md)**. +3. **Is it a new test?** + * *Yes* -> Set policy to `USUALLY_PASSES`. + * *No* -> `ALWAYS_PASSES` (locks in the behavior as a regression test). +4. **Are you fixing a failure or promoting a test?** + * *Fixing* -> See **[fixing.md](references/fixing.md)**. 
* *Promoting* -> See **[promoting.md](references/promoting.md)**.

---

## πŸ“‹ Quick Checklist

### 1. Setup Workspace
Seed the workspace with necessary files using the `files` object to simulate a realistic scenario (e.g., NodeJS project with `package.json`).
* *Details in **[creating.md](references/creating.md)***

### 2. Write Assertions
Audit agent decisions using `rig.setBreakpoint()` (AppRig only) or index verification on `rig.readToolLogs()`.
* *Details in **[creating.md](references/creating.md)***

### 3. Verify
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
* *See **[evals/README.md](../../../evals/README.md)** for running commands.*

---

## πŸ“¦ Bundled Resources

Detailed procedural guides:
* **[creating.md](references/creating.md)**: Assertion strategies, Rig selection, Mock MCPs.
* **[fixing.md](references/fixing.md)**: Step-by-step automated investigation, architecture diagnosis guidelines.
* **[promoting.md](references/promoting.md)**: Candidate identification criteria and threshold guidelines.

diff --git a/.gemini/skills/behavioral-evals/assets/interactive_eval.ts.txt b/.gemini/skills/behavioral-evals/assets/interactive_eval.ts.txt new file mode 100644 index 0000000000..2d2b7433dc --- /dev/null +++ b/.gemini/skills/behavioral-evals/assets/interactive_eval.ts.txt @@ -0,0 +1,27 @@ +import { describe, expect } from 'vitest';
+import { appEvalTest } from './app-test-helper.js';
+
+describe('interactive_feature', () => {
+  // New tests MUST start as USUALLY_PASSES
+  appEvalTest('USUALLY_PASSES', {
+    name: 'should pause for user confirmation',
+    files: {
+      'package.json': JSON.stringify({ name: 'app' })
+    },
+    prompt: 'Task description here requiring approval',
+    timeout: 60000,
+    setup: async (rig) => {
+      // ⚠️ Breakpoints are ONLY safe in appEvalTest
+      rig.setBreakpoint(['ask_user']);
+    },
+    assert: async (rig) => {
+      // 1. 
Wait for the breakpoint to trigger + const confirmation = await rig.waitForPendingConfirmation('ask_user'); + expect(confirmation).toBeDefined(); + + // 2. Resolve it so the test can finish + await rig.resolveTool(confirmation); + await rig.waitForIdle(); + }, + }); +}); diff --git a/.gemini/skills/behavioral-evals/assets/standard_eval.ts.txt b/.gemini/skills/behavioral-evals/assets/standard_eval.ts.txt new file mode 100644 index 0000000000..3e666dfc37 --- /dev/null +++ b/.gemini/skills/behavioral-evals/assets/standard_eval.ts.txt @@ -0,0 +1,30 @@ +import { describe, expect } from 'vitest'; +import { evalTest } from './test-helper.js'; + +describe('core_feature', () => { + // New tests MUST start as USUALLY_PASSES + evalTest('USUALLY_PASSES', { + name: 'should perform expected agent action', + setup: async (rig) => { + // For mocking offline MCP: + // rig.addMockMcpServer('workspace-server', 'google-workspace'); + }, + files: { + 'src/app.ts': '// some code', + }, + prompt: 'Task description here', + timeout: 60000, // 1 minute safety limit + assert: async (rig, result) => { + // 1. Audit the trajectory (Safe for standard evalTest) + const logs = rig.readToolLogs(); + const hasTool = logs.some((l) => l.toolRequest.name === 'read_file'); + expect(hasTool, 'Agent should have read the file').toBe(true); + + // 2. Assert efficiency (Cost/Turn) + expect(logs.length).toBeLessThan(5); + + // 3. 
Assert final output + expect(result).toContain('Expected Keyword'); + }, + }); +}); diff --git a/.gemini/skills/behavioral-evals/references/creating.md b/.gemini/skills/behavioral-evals/references/creating.md new file mode 100644 index 0000000000..bcc1baff06 --- /dev/null +++ b/.gemini/skills/behavioral-evals/references/creating.md @@ -0,0 +1,151 @@ +# Creating Behavioral Evals + +## πŸ”¬ Rig Selection + +| Rig Type | Import From | Architecture | Use When | +| :---------------- | :--------------------- | :------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------- | +| **`evalTest`** | `./test-helper.js` | **Subprocess**. Runs the CLI in a separate process + waits for exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer. | +| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. | + +--- + +## πŸ—οΈ Scenario Design + +Evals must simulate realistic agent environments to effectively test +decision-making. + +- **Workspace State**: Seed with standard project anchors if testing general + capabilities: + - `package.json` for NodeJS environments. + - Minimal configuration files (`tsconfig.json`, `GEMINI.md`). +- **Structural Complexity**: Provide enough files to force the agent to _search_ + or _navigate_, rather than giving the answer directly. Avoid trivial one-file + tests unless testing exact prompt steering. + +--- + +## ❌ Fail First Principle + +Before asserting a new capability or locking in a fix, **verify that the test +fails first**. + +- It is easy to accidentally write an eval that asserts behaviors that are + already met or pass by default. +- **Process**: reproduce failure with test -> apply fix (prompt/tool) -> verify + test passes. + +--- + +## βœ‹ Testing Patterns + +### 1. 
Breakpoints + +Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for +interactive prompts or safety checks. + +```typescript +// ⚠️ Only works with appEvalTest (AppRig) +setup: async (rig) => { + rig.setBreakpoint(['ask_user']); +}, +assert: async (rig) => { + const confirmation = await rig.waitForPendingConfirmation('ask_user'); + expect(confirmation).toBeDefined(); +} +``` + +### 2. Tool Confirmation Race + +When asserting multiple triggers (e.g., "enters plan mode then asks question"): + +```typescript +assert: async (rig) => { + let confirmation = await rig.waitForPendingConfirmation([ + 'enter_plan_mode', + 'ask_user', + ]); + + if (confirmation?.name === 'enter_plan_mode') { + rig.acceptConfirmation('enter_plan_mode'); + confirmation = await rig.waitForPendingConfirmation('ask_user'); + } + expect(confirmation?.toolName).toBe('ask_user'); +}; +``` + +### 3. Audit Tool Logs + +Audit exact operations to ensure efficiency (e.g., no redundant reads). + +```typescript +assert: async (rig, result) => { + await rig.waitForTelemetryReady(); + const toolLogs = rig.readToolLogs(); + + const writeCall = toolLogs.find( + (log) => log.toolRequest.name === 'write_file', + ); + expect(writeCall).toBeDefined(); +}; +``` + +### 4. Mock MCP Facades + +To evaluate tools connected via MCP without hitting live endpoints, load a mock +server configuration in the `setup` hook. + +```typescript +setup: async (rig) => { + rig.addMockMcpServer('workspace-server', 'google-workspace'); +}, +assert: async (rig) => { + await rig.waitForTelemetryReady(); + const toolLogs = rig.readToolLogs(); + const workspaceCall = toolLogs.find( + (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText' + ); + expect(workspaceCall).toBeDefined(); +}; +``` + +--- + +## ⚠️ Safety & Efficiency Guardrails + +### 1. Breakpoint Deadlocks + +Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`, +`rig.run()` waits for the process to exit _before_ assertions run. 
**This will
hang indefinitely.**

- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a budget boundary in the `EvalCase` to prevent runaway loops on
quota:

```typescript
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn limits)

Check if a tool is called _early_ using index checks:

```typescript
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // Called within first 5 turns
};
```
diff --git a/.gemini/skills/behavioral-evals/references/fixing.md b/.gemini/skills/behavioral-evals/references/fixing.md new file mode 100644 index 0000000000..fc78870515 --- /dev/null +++ b/.gemini/skills/behavioral-evals/references/fixing.md @@ -0,0 +1,71 @@ +# Fixing Behavioral Evals

Use this guide when asked to debug, troubleshoot, or fix a failing behavioral
evaluation.

---

## 1. πŸ” Investigate

1. **Fetch Nightly Results**: Use the `gh` CLI to inspect the latest run from
   `evals-nightly.yml` if applicable.
   - _Example view URL_:
     `https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`
2. **Isolate**: DO NOT push changes or start remote runs. Confine investigation
   to the local workspace.
3. **Read Logs**:
   - Eval trajectory logs live in the `evals/logs/` directory.
   - Enable verbose debugging via `export GEMINI_DEBUG_LOG_FILE="debug.log"`.
4. **Diagnose**: Audit tool logs and telemetry. Note whether the failure stems
   from the test setup, the assertions, or the agent's behavior.
   - **Tip**: Proactively add custom logging/diagnostics to check hypotheses.

---

## 2. πŸ› οΈ Fix Strategy

1. **Targeted Location**: Locate the test case and the corresponding
   prompt/code.
2. 
**Iterative Scope**: Make an extreme change first to confirm you are editing
   the right place, then refine it to a minimal, targeted change.
3. **Assertion Fidelity**:
   - Changing the test prompt is a **last resort** (prompts are often vague by
     design).
   - **Warning**: Do not lose test fidelity by making prompts too direct/easy.
   - **Primary Mechanism**: Adjust tool descriptions, system prompts
     (`snippets.ts`), or **modules that contribute to the prompt template**.
   - **Warning**: Prompts have multiple configurations; ensure your fix targets
     the correct config for the model in question.
4. **Architecture Options**: If prompt or instruction tuning yields no
   improvement, analyze loop composition.
   - **AgentLoop**: Defined by `context + toolset + prompt`.
   - **Enhancements**: Loops perform best with direct prompts, fewer irrelevant
     tools, low goal density, and minimal low-value/irrelevant context.
   - **Modifications**: Compose subagents or isolate tools. Ground any
     recommendation in observed traces.
   - **Warning**: Think deeply before offering recommendations; avoid parroting
     abstract design guidelines.

---

## 3. βœ… Verify

1. **Run Local**: Run Vitest in non-interactive mode on just the failing file.
2. **Log Audit**: Prioritize diagnosing failures via log comparison before
   triggering heavy test runs.
3. **Stability Check**: Run the test **3 times** locally on each key model (use
   scripts to run in parallel for speed):
   - **Gemini 3.0**
   - **Gemini 3 Flash**
   - **Gemini 2.5 Pro**
4. **Flakiness Rule**: If the test passes 2/3 times, the remaining failures may
   be inherent noise that cannot be improved without structural changes.

---

## 4. πŸ“Š Report

Provide a summary of:

- Test success rate for each tested model (e.g., 3/3 = 100%).
- The root cause and the applied fix.
- If unfixed, high-confidence architecture recommendations. 
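The Verify and Report steps above can be tied together with a small tally script. A minimal sketch, assuming your parallel runner appends one `model result` line per run to a results file — the file name and log format here are illustrative, not something the eval harness itself produces:

```shell
# Hypothetical results file: one "<model> <pass|fail>" line per run.
# A runner script would append these as each of the 3 runs completes.
cat > /tmp/eval-results.txt <<'EOF'
gemini-3.0 pass
gemini-3.0 pass
gemini-3.0 pass
gemini-3-flash pass
gemini-3-flash fail
gemini-3-flash pass
gemini-2.5-pro pass
gemini-2.5-pro pass
gemini-2.5-pro pass
EOF

# Tally per-model success rates for the report (output order is unspecified).
awk '{ total[$1]++; if ($2 == "pass") ok[$1]++ }
     END { for (m in total) printf "%s: %d/%d\n", m, ok[m] + 0, total[m] }' /tmp/eval-results.txt
```

A model at 3/3 is stable; a model at 2/3 may be inherent flakiness worth reporting rather than fixing.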
diff --git a/.gemini/skills/behavioral-evals/references/promoting.md b/.gemini/skills/behavioral-evals/references/promoting.md new file mode 100644 index 0000000000..d3d3eaf88f --- /dev/null +++ b/.gemini/skills/behavioral-evals/references/promoting.md @@ -0,0 +1,55 @@ +# Promoting Behavioral Evals

Use this guide when asked to analyze nightly results and promote incubated tests
to stable suites.

---

## 1. πŸ” Investigate Candidates

1. **Audit Nightly Logs**: Use the `gh` CLI to fetch results from
   `evals-nightly.yml` (Direct URL:
   `https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`).
   - **Tip**: The aggregate summary from the most recent run integrates the
     last 7 runs of history automatically.
   - **Safety**: DO NOT push changes or start remote runs. All verification is
     local.
2. **Assess Stability**: Identify tests that pass **100% of the time** across
   ALL enabled models over the **last 7 consecutive nightly runs**.
   - _100% means the test passed 3/3 times for every model and run._
3. **Promotion Targets**: Tests meeting these criteria are candidates for
   promotion from `USUALLY_PASSES` to `ALWAYS_PASSES`.

---

## 2. πŸš₯ Promotion Steps

1. **Locate File**: Locate the eval file in the `evals/` directory.
2. **Update Policy**: Modify the policy argument to `ALWAYS_PASSES`.
   ```typescript
   evalTest('ALWAYS_PASSES', { ... })
   ```
3. **Targeting**: Follow guidelines in `evals/README.md` regarding stable suite
   organization.
4. **Constraint**: Your final change must be **minimal and targeted** strictly
   to promoting the test status. Do not refactor the test or setup fixtures.

---

## 3. βœ… Verify

1. **Run Promoted Tests**: Run the promoted test locally using non-interactive
   Vitest to confirm it still executes correctly.
2. **Verify Suite Inclusion**: Check that the test is now picked up by the
   standard stable test suites.

---

## 4. 
πŸ“Š Report

Provide a summary of:

- The tests that were promoted.
- The success rate evidence (e.g., 7/7 runs passed for all models).
- If no candidates qualified, the next closest candidates and their current
  pass rates.
diff --git a/.gemini/skills/behavioral-evals/references/running.md b/.gemini/skills/behavioral-evals/references/running.md new file mode 100644 index 0000000000..cf8c46a8d6 --- /dev/null +++ b/.gemini/skills/behavioral-evals/references/running.md @@ -0,0 +1,95 @@ +# Running & Promoting Evals

## πŸ› οΈ Prerequisites

Behavioral evals run against the compiled binary. You **must** build and bundle
the project first after making changes:

```bash
npm run build && npm run bundle
```

---

## πŸƒβ€β™‚οΈ Running Tests

### 1. Configure Environment Variables

Evals require a standard API key. If your `.env` file has multiple keys or
comments, use this precise extraction setup:

```bash
export GEMINI_API_KEY=$(grep '^GEMINI_API_KEY=' .env | cut -d '=' -f2) && RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts
```

### 2. Commands

| Command | Scope | Description |
| :---------------------------------- | :-------------- | :------------------------------------------------- |
| `npm run test:always_passing_evals` | `ALWAYS_PASSES` | Fast feedback, runs in CI. |
| `npm run test:all_evals` | All | Runs nightly incubation tests. Sets `RUN_EVALS=1`. |

### 3. Target a Specific File

_Note: `RUN_EVALS=1` is required for incubated (`USUALLY_PASSES`) tests._

```bash
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts
```

---

## 🐞 Debugging and Logs

If a test fails, check:

- **Tool Trajectory Logs**: Sequence of calls in the `evals/logs/` directory. 
- **Verbose Reasoning**: Capture raw buffer traces by setting
  `GEMINI_DEBUG_LOG_FILE`:
  ```bash
  export GEMINI_DEBUG_LOG_FILE="debug.log"
  ```

---

### 🎯 Verify Model Targeting

- **Tip:** Standard evals benchmark against model variations. If a test passes
  on Flash but fails on Pro (or vice versa), the issue is usually in the **tool
  description**, not the prompt definition. Flash is sensitive to "instruction
  bloat," while Pro is sensitive to "ambiguous intent."

---

## πŸš₯ Deflaking & Promotion

To maintain CI stability, all new evals follow a strict incubation period.

### 1. Incubation (`USUALLY_PASSES`)

New tests must be created with the `USUALLY_PASSES` policy.

```typescript
evalTest('USUALLY_PASSES', { ... })
```

They run in the **Evals: Nightly** workflow and do not block PR merges.

### 2. Investigate Failures

If a nightly eval regresses, ask the agent to investigate it using the
**behavioral-evals skill**; see [fixing.md](fixing.md) for the full procedure.

### 3. Promotion (`ALWAYS_PASSES`)

Once a test scores 100% consistency over multiple nightly cycles, ask the agent
to promote it following [promoting.md](promoting.md).

_Do not promote manually._ The promotion procedure verifies the nightly run
history before updating the file policy.
diff --git a/evals/README.md b/evals/README.md index 6cfecbad07..9e3697a6b8 100644 --- a/evals/README.md +++ b/evals/README.md @@ -6,6 +6,10 @@ for changes to system prompts, tool definitions, and other model-steering mechanisms, and as a tool for assessing feature reliability by model, and preventing regressions. +> [!TIP]
+> **Agent Automation**: If you are pair-programming with Gemini CLI, you
+> can leverage the **behavioral-evals skill** to automate fixing failing tests
+> or promoting incubation candidates.
+
 ## Why Behavioral Evals? 
Unlike traditional **integration tests** which verify that the system functions @@ -121,7 +125,7 @@ import { describe, expect } from 'vitest'; import { evalTest } from './test-helper.js'; describe('my_feature', () => { - // New tests MUST start as USUALLY_PASSES and be promoted via /promote-behavioral-eval + // New tests MUST start as USUALLY_PASSES and be promoted based on consistency metrics evalTest('USUALLY_PASSES', { name: 'should do something', prompt: 'do it', @@ -183,12 +187,10 @@ mandatory deflaking process. 1. **Incubation**: You must create all new tests with the `USUALLY_PASSES` policy. This lets them be monitored in the nightly runs without blocking PRs. -2. **Monitoring**: The test must complete at least 10 nightly runs across all +2. **Monitoring**: The test must complete at least 7 nightly runs across all supported models. -3. **Promotion**: Promotion to `ALWAYS_PASSES` happens exclusively through the - `/promote-behavioral-eval` slash command. This command verifies the 100% - success rate requirement is met across many runs before updating the test - policy. +3. **Promotion**: Promotion to `ALWAYS_PASSES` is conducted by the agent after + verifying the 100% success rate requirement is met across many runs. This promotion process is essential for preventing the introduction of flaky evaluations into the CI. @@ -225,42 +227,21 @@ tool definition has made the model's behavior less reliable. ## Fixing Evaluations -If an evaluation is failing or has a regressed pass rate, you can use the -`/fix-behavioral-eval` command within Gemini CLI to help investigate and fix the -issue. - -### `/fix-behavioral-eval` - -This command is designed to automate the investigation and fixing process for -failing evaluations. It will: +If an evaluation is failing or has a regressed pass rate, ask the agent to +investigate and fix the issue using the **behavioral-evals skill**. The agent +will automate the following process: 1. 
**Investigate**: Fetch the latest results from the nightly workflow using the `gh` CLI, identify the failing test, and review test trajectory logs in `evals/logs`. 2. **Fix**: Suggest and apply targeted fixes to the prompt or tool definitions. - It prioritizes minimal changes to `prompt.ts`, tool instructions, and - modules that contribute to the prompt. It generally tries to avoid changing - the test itself. -3. **Verify**: Re-run the test 3 times across multiple models (e.g., Gemini - 3.0, Gemini 3 Flash, Gemini 2.5 Pro) to ensure stability and calculate a - success rate. -4. **Report**: Provide a summary of the success rate for each model and details - on the applied fixes. + It prioritizes minimal changes to `prompt.ts` and tool instructions, + avoiding changing the test itself unless necessary. +3. **Verify**: Re-run the test locally across multiple models to ensure + stability. +4. **Report**: Provide a summary of the success rate. -To use it, run: - -```bash -gemini /fix-behavioral-eval -``` - -You can also provide a link to a specific GitHub Action run or the name of a -specific test to focus the investigation: - -```bash -gemini /fix-behavioral-eval https://github.com/google-gemini/gemini-cli/actions/runs/123456789 -``` - -When investigating failures manually, you can also enable verbose agent logs by +When investigating failures manually, you can enable verbose agent logs by setting the `GEMINI_DEBUG_LOG_FILE` environment variable. ### Best practices @@ -273,25 +254,14 @@ instrospecting on its prompt when asked the right questions. ## Promoting evaluations -Evaluations must be promoted from `USUALLY_PASSES` to `ALWAYS_PASSES` -exclusively using the `/promote-behavioral-eval` slash command. Manual promotion -is not allowed to ensure that the 100% success rate requirement is empirically -met. +Evaluations must be promoted from `USUALLY_PASSES` to `ALWAYS_PASSES` by the +agent to ensure that the 100% success rate requirement is empirically met. 
-### `/promote-behavioral-eval` - -This command automates the promotion of stable tests by: +The agent automates the promotion by: 1. **Investigating**: Analyzing the results of the last 7 nightly runs on the - `main` branch using the `gh` CLI. -2. **Criteria Check**: Identifying tests that have passed 100% of the time for - ALL enabled models across the entire 7-run history. -3. **Promotion**: Updating the test file's policy from `USUALLY_PASSES` to - `ALWAYS_PASSES`. + `main` branch. +2. **Criteria Check**: Ensuring tests passed 100% of the time for ALL enabled + models. +3. **Promotion**: Updating the test file's policy to `ALWAYS_PASSES`. 4. **Verification**: Running the promoted test locally to ensure correctness. - -To run it: - -```bash -gemini /promote-behavioral-eval -```