feat(skills): add behavioral-evals skill with fixing and promoting guides (#23349)

Abhi 2026-03-23 17:06:43 -04:00 committed by GitHub
parent fbf38361ad
commit db14cdf92b
GPG key ID: B5690EEEBB952194
10 changed files with 509 additions and 143 deletions


@ -1,60 +0,0 @@
description = "Check status of nightly evals, fix failures for key models, and re-run."
prompt = """
You are an expert at fixing behavioral evaluations.
1. **Investigate**:
- Use 'gh' cli to fetch the results from the latest run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
- DO NOT push any changes or start any runs. The rest of your evaluation will be local.
- Evals are in evals/ directory and are documented by evals/README.md.
- The test case trajectory logs will be logged to evals/logs.
- You should also enable and review the verbose agent logs by setting the GEMINI_DEBUG_LOG_FILE environment variable.
- Identify the relevant test. Confine your investigation and validation to just this test.
- Proactively add logging that will aid in gathering information or validating your hypotheses.
2. **Fix**:
- If a relevant test is failing, locate the test file and the corresponding prompt/code.
- It's often helpful to make an extreme, brute force change to see if you are changing the right place to make an improvement and then scope it back iteratively.
- Your **final** change should be **minimal and targeted**.
- Keep in mind the following:
- The prompt has multiple configurations and pieces. Take care that your changes
end up in the final prompt for the selected model and configuration.
- The prompt chosen for the eval is intentional. It's often vague or indirect
to see how the agent performs with ambiguous instructions. Changing it should
be a last resort.
- When changing the test prompt, carefully consider whether the prompt still tests
the same scenario. We don't want to lose test fidelity by making the prompts too
direct (i.e., too easy).
- Your primary mechanism for improving the agent's behavior is to make changes to
tool instructions, system prompt (snippets.ts), and/or modules that contribute to the prompt.
- If prompt and description changes are unsuccessful, use logs and debugging to
confirm that everything is working as expected.
- If unable to fix the test, you can make recommendations for architecture changes
that might help stabilize the test. Be sure to THINK DEEPLY if offering architecture guidance.
Some facts that might help with this are:
- Agents may be composed of one or more agent loops.
- AgentLoop == 'context + toolset + prompt'. Subagents are one type of agent loop.
- Agent loops perform better when:
- They have direct, unambiguous, and non-contradictory prompts.
- They have fewer irrelevant tools.
- They have fewer goals or steps to perform.
- They have less low value or irrelevant context.
- You may suggest compositions of existing primitives, like subagents, or
propose a new one.
- These recommendations should be high confidence and should be grounded
in observed deficient behaviors rather than just parroting the facts above.
Investigate as needed to ground your recommendations.
3. **Verify**:
- Run just that one test if needed to validate that it is fixed. Be sure to run vitest in non-interactive mode.
- Running the tests can take a long time, so consider whether you can diagnose via other means or log diagnostics before committing the time. You must minimize the number of test runs needed to diagnose the failure.
- After the test completes, check whether it seems to have improved.
- You will need to run the test 3 times each for Gemini 3.0, Gemini 3 Flash, and Gemini 2.5 Pro to ensure that it is truly stable. Run these runs in parallel, using scripts if needed.
- Some flakiness is expected; if it looks like a transient issue or the test is inherently unstable but passes 2/3 times, you might decide it cannot be improved.
4. **Report**:
- Provide a summary of the test success rate for each of the tested models.
- Success rate is calculated based on 3 runs per model (e.g., 3/3 = 100%).
- If you couldn't fix it due to persistent flakiness, explain why.
{{args}}
"""


@ -1,29 +0,0 @@
description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs."
prompt = """
You are an expert at analyzing and promoting behavioral evaluations.
1. **Investigate**:
- Use 'gh' cli to fetch the results from the most recent run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
- DO NOT push any changes or start any runs. The rest of your evaluation will be local.
- Evals are in evals/ directory and are documented by evals/README.md.
- Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row.
- NOTE: the results summary from the most recent run contains the last 7 runs' test results. 100% means the test passed 3/3 times for that model and run.
- If a test meets these criteria, it is a candidate for promotion.
2. **Promote**:
- For each candidate test, locate the test file in the evals/ directory.
- Promote the test according to the project's standard promotion process (e.g., moving it to a stable suite, updating its tags, or removing skip/flaky annotations).
- Ensure you follow any guidelines in evals/README.md for stable tests.
- Your **final** change should be **minimal and targeted** to just promoting the test status.
3. **Verify**:
- Run the promoted tests locally to validate that they still execute correctly. Be sure to run vitest in non-interactive mode.
- Check that the test is now part of the expected standard or stable test suites.
4. **Report**:
- Provide a summary of the tests that were promoted.
- Include the success rate evidence (7/7 runs passed for all models) for each promoted test.
- If no tests met the criteria for promotion, clearly state that and summarize the closest candidates.
{{args}}
"""


@ -0,0 +1,56 @@
---
name: behavioral-evals
description: Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures or prompt steering, or adding workspace regression tests.
---
# Behavioral Evals
## Overview
Behavioral evaluations (evals) are tests that validate the **agent's decision-making** (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
> [!NOTE]
> **Single Source of Truth**: For core concepts, policies, running tests, and general best practices, always refer to **docs/evals/README.md**.
---
## 🔄 Workflow Decision Tree
1. **Does a prompt/tool change need validation?**
* *No* -> Normal integration tests.
* *Yes* -> Continue below.
2. **Is it UI/Interaction heavy?**
* *Yes* -> Use `appEvalTest` (`AppRig`). See **[creating.md](references/creating.md)**.
* *No* -> Use `evalTest` (`TestRig`). See **[creating.md](references/creating.md)**.
3. **Is it a new test?**
* *Yes* -> Set policy to `USUALLY_PASSES`.
* *No* -> `ALWAYS_PASSES` (locks in regression).
4. **Are you fixing a failure or promoting a test?**
* *Fixing* -> See **[fixing.md](references/fixing.md)**.
* *Promoting* -> See **[promoting.md](references/promoting.md)**.
---
## 📋 Quick Checklist
### 1. Setup Workspace
Seed the workspace with necessary files using the `files` object to simulate a realistic scenario (e.g., NodeJS project with `package.json`).
* *Details in **[creating.md](references/creating.md)***
### 2. Write Assertions
Audit agent decisions using `rig.setBreakpoint()` (AppRig only) or index verification on `rig.readToolLogs()`.
* *Details in **[creating.md](references/creating.md)***
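The "index verification" above can be sketched as follows. This is a minimal, self-contained stand-in: the log shape is assumed from the `evalTest` examples bundled with this skill, and in a real `assert` hook the array would come from `rig.readToolLogs()`.

```typescript
// Stubbed tool log shape, assumed from this skill's evalTest examples.
type ToolLog = { toolRequest: { name: string } };

// In a real assert hook this array would come from rig.readToolLogs().
const logs: ToolLog[] = [
  { toolRequest: { name: 'glob' } },
  { toolRequest: { name: 'read_file' } },
  { toolRequest: { name: 'write_file' } },
];

// Index verification: the tool was called, and called early in the trajectory.
const writeIndex = logs.findIndex((l) => l.toolRequest.name === 'write_file');
const calledEarly = writeIndex > -1 && writeIndex < 5;
```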
### 3. Verify
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
* *See **docs/evals/README.md** for running commands.*
---
## 📦 Bundled Resources
Detailed procedural guides:
* **[creating.md](references/creating.md)**: Assertion strategies, Rig selection, Mock MCPs.
* **[fixing.md](references/fixing.md)**: Step-by-step automated investigation, architecture diagnosis guidelines.
* **[promoting.md](references/promoting.md)**: Candidate identification criteria and threshold guidelines.


@ -0,0 +1,27 @@
import { describe, expect } from 'vitest';
import { appEvalTest } from './app-test-helper.js';
describe('interactive_feature', () => {
// New tests MUST start as USUALLY_PASSES
appEvalTest('USUALLY_PASSES', {
name: 'should pause for user confirmation',
files: {
'package.json': JSON.stringify({ name: 'app' })
},
prompt: 'Task description here requiring approval',
timeout: 60000,
setup: async (rig) => {
// ⚠️ Breakpoints are ONLY safe in appEvalTest
rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
// 1. Wait for the breakpoint to trigger
const confirmation = await rig.waitForPendingConfirmation('ask_user');
expect(confirmation).toBeDefined();
// 2. Resolve it so the test can finish
await rig.resolveTool(confirmation);
await rig.waitForIdle();
},
});
});


@ -0,0 +1,30 @@
import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';
describe('core_feature', () => {
// New tests MUST start as USUALLY_PASSES
evalTest('USUALLY_PASSES', {
name: 'should perform expected agent action',
setup: async (rig) => {
// For mocking offline MCP:
// rig.addMockMcpServer('workspace-server', 'google-workspace');
},
files: {
'src/app.ts': '// some code',
},
prompt: 'Task description here',
timeout: 60000, // 1 minute safety limit
assert: async (rig, result) => {
// 1. Audit the trajectory (Safe for standard evalTest)
const logs = rig.readToolLogs();
const hasTool = logs.some((l) => l.toolRequest.name === 'read_file');
expect(hasTool, 'Agent should have read the file').toBe(true);
// 2. Assert efficiency (Cost/Turn)
expect(logs.length).toBeLessThan(5);
// 3. Assert final output
expect(result).toContain('Expected Keyword');
},
});
});


@ -0,0 +1,151 @@
# Creating Behavioral Evals
## 🔬 Rig Selection
| Rig Type | Import From | Architecture | Use When |
| :---------------- | :--------------------- | :------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------- |
| **`evalTest`** | `./test-helper.js` | **Subprocess**. Runs the CLI in a separate process + waits for exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer. |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |
---
## 🏗️ Scenario Design
Evals must simulate realistic agent environments to effectively test
decision-making.
- **Workspace State**: Seed with standard project anchors if testing general
capabilities:
- `package.json` for NodeJS environments.
- Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
or _navigate_, rather than giving the answer directly. Avoid trivial one-file
tests unless testing exact prompt steering.
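As a sketch of the seeding above, a `files` object for a NodeJS scenario might look like this. Every file name and content here is illustrative, not a repo convention; the point is enough structure to force navigation:

```typescript
// Illustrative workspace seed: standard anchors plus enough source files
// that the agent must search/navigate rather than being handed the answer.
const files = {
  'package.json': JSON.stringify({ name: 'demo-app', version: '1.0.0' }),
  'tsconfig.json': JSON.stringify({ compilerOptions: { strict: true } }),
  'GEMINI.md': '# Conventions\nUse strict TypeScript.',
  'src/index.ts': "export { fmt } from './utils/format.js';",
  'src/utils/format.ts': 'export const fmt = (s: string) => s.trim();',
};
```

An object like this is passed as the `files` field of the eval case, which seeds the temporary workspace before the prompt runs.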
---
## ❌ Fail First Principle
Before asserting a new capability or locking in a fix, **verify that the test
fails first**.
- It is easy to accidentally write an eval that asserts behaviors that are
already met or pass by default.
- **Process**: reproduce failure with test -> apply fix (prompt/tool) -> verify
test passes.
---
## ✋ Testing Patterns
### 1. Breakpoints
Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.
```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
const confirmation = await rig.waitForPendingConfirmation('ask_user');
expect(confirmation).toBeDefined();
}
```
### 2. Tool Confirmation Race
When asserting multiple triggers (e.g., "enters plan mode then asks question"):
```typescript
assert: async (rig) => {
let confirmation = await rig.waitForPendingConfirmation([
'enter_plan_mode',
'ask_user',
]);
if (confirmation?.name === 'enter_plan_mode') {
rig.acceptConfirmation('enter_plan_mode');
confirmation = await rig.waitForPendingConfirmation('ask_user');
}
expect(confirmation?.name).toBe('ask_user');
};
```
### 3. Audit Tool Logs
Audit exact operations to ensure efficiency (e.g., no redundant reads).
```typescript
assert: async (rig, result) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const writeCall = toolLogs.find(
(log) => log.toolRequest.name === 'write_file',
);
expect(writeCall).toBeDefined();
};
```
### 4. Mock MCP Facades
To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.
```typescript
setup: async (rig) => {
rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const workspaceCall = toolLogs.find(
(log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText'
);
expect(workspaceCall).toBeDefined();
};
```
---
## ⚠️ Safety & Efficiency Guardrails
### 1. Breakpoint Deadlocks
Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run. **This will
hang indefinitely.**
- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.
### 2. Runaway Timeout
Always set a budget boundary in the `EvalCase` to prevent runaway loops from
burning quota:
```typescript
evalTest('USUALLY_PASSES', {
name: '...',
timeout: 60000, // 1 minute safety limit
// ...
});
```
### 3. Efficiency Assertion (Turn limits)
Check if a tool is called _early_ using index checks:
```typescript
assert: async (rig) => {
const toolLogs = rig.readToolLogs();
const toolCallIndex = toolLogs.findIndex(
(log) => log.toolRequest.name === 'cli_help',
);
expect(toolCallIndex).toBeGreaterThan(-1);
expect(toolCallIndex).toBeLessThan(5); // Called within first 5 turns
};
```


@ -0,0 +1,71 @@
# Fixing Behavioral Evals
Use this guide when asked to debug, troubleshoot, or fix a failing behavioral
evaluation.
---
## 1. 🔍 Investigate
1. **Fetch Nightly Results**: Use the `gh` CLI to inspect the latest run from
`evals-nightly.yml` if applicable.
- _Example view URL_:
`https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`
2. **Isolate**: DO NOT push changes or start remote runs. Confine investigation
to the local workspace.
3. **Read Logs**:
- Eval logs live in `evals/logs/<test_name>.log`.
- Enable verbose debugging via `export GEMINI_DEBUG_LOG_FILE="debug.log"`.
4. **Diagnose**: Audit tool logs and telemetry. Note whether the failure stems from the test setup or the assertions.
- **Tip**: Proactively add custom logging/diagnostics to check hypotheses.
---
## 2. 🛠️ Fix Strategy
1. **Targeted Location**: Locate the test case and the corresponding
prompt/code.
2. **Iterative Scope**: Make an extreme change first to verify you are changing
   the right place, then refine to a minimal, targeted change.
3. **Assertion Fidelity**:
- Changing the test prompt is a **last resort** (prompts are often vague by
design).
- **Warning**: Do not lose test fidelity by making prompts too direct/easy.
   - **Primary Fix Mechanism**: Adjust tool descriptions, system prompts
     (`snippets.ts`), or **modules that contribute to the prompt template**.
- **Warning**: Prompts have multiple configurations; ensure your fix targets
the correct config for the model in question.
4. **Architecture Options**: If prompt or instruction tuning yields no
   improvement, analyze loop composition.
- **AgentLoop**: Defined by `context + toolset + prompt`.
- **Enhancements**: Loops perform best with direct prompts, fewer irrelevant
tools, low goal density, and minimal low-value/irrelevant context.
- **Modifications**: Compose subagents or isolate tools. Ground in observed
traces.
- **Warning**: Think deeply before offering recommendations; avoid parroting
abstract design guidelines.
---
## 3. ✅ Verify
1. **Run Local**: Run Vitest in non-interactive mode on just the file.
2. **Log Audit**: Prioritize diagnosing failures via log comparison before
triggering heavy test runs.
3. **Stability Limit**: Run the test **3 times** locally on key models (can use
scripts to run in parallel for speed):
- **Gemini 3.0**
- **Gemini 3 Flash**
- **Gemini 2.5 Pro**
4. **Flakiness Rule**: If it passes 2/3 times, it may be inherently noisy and
   difficult to improve without a structural change.
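The three-runs-per-model tally above can be sketched with a small helper. The result shape and model identifiers are assumptions for illustration, not the repo's actual log format:

```typescript
// Hypothetical per-run result; only to illustrate the 3-runs-per-model rule.
type RunResult = { model: string; passed: boolean };

// Returns a "passed/total" string for one model's runs.
function successRate(results: RunResult[], model: string): string {
  const subset = results.filter((r) => r.model === model);
  const passed = subset.filter((r) => r.passed).length;
  return `${passed}/${subset.length}`;
}

const runs: RunResult[] = [
  { model: 'gemini-3.0', passed: true },
  { model: 'gemini-3.0', passed: true },
  { model: 'gemini-3.0', passed: true },
  { model: 'gemini-2.5-pro', passed: true },
  { model: 'gemini-2.5-pro', passed: false },
  { model: 'gemini-2.5-pro', passed: true },
];
```

Here `gemini-3.0` scores 3/3 while `gemini-2.5-pro` scores 2/3, which would trip the flakiness rule above.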
---
## 4. 📊 Report
Provide a summary of:
- Test success rate for each tested model (e.g., 3/3 = 100%).
- Root cause identification and fix explanation.
- If unfixed, high-confidence architecture recommendations.


@ -0,0 +1,55 @@
# Promoting Behavioral Evals
Use this guide when asked to analyze nightly results and promote incubated tests
to stable suites.
---
## 1. 🔍 Investigate candidates
1. **Audit Nightly Logs**: Use the `gh` CLI to fetch results from
`evals-nightly.yml` (Direct URL:
`https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`).
- **Tip**: The aggregate summary from the most recent run integrates the
last 7 runs of history automatically.
- **Safety**: DO NOT push changes or start remote runs. All verification is
local.
2. **Assess Stability**: Identify tests that pass **100% of the time** across
ALL enabled models over the **last 7 nightly runs** in a row.
- _100% means the test passed 3/3 times for every model and run._
3. **Promotion Targets**: Tests meeting these criteria are candidates for
   promotion from `USUALLY_PASSES` to `ALWAYS_PASSES`.
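The candidate filter reduces to a simple reduction over the run history. The history shape below is a hypothetical stand-in; the real nightly summary may be formatted differently:

```typescript
// test name -> model -> pass/fail per nightly run (most recent last).
// This shape is assumed for illustration only.
type NightlyHistory = Record<string, Record<string, boolean[]>>;

// A test qualifies only if every enabled model passed all of the
// last `requiredRuns` nightly runs.
function promotionCandidates(history: NightlyHistory, requiredRuns = 7): string[] {
  return Object.entries(history)
    .filter(([, models]) =>
      Object.values(models).every(
        (runs) =>
          runs.length >= requiredRuns &&
          runs.slice(-requiredRuns).every(Boolean),
      ),
    )
    .map(([test]) => test);
}

const history: NightlyHistory = {
  stable_test: {
    'gemini-3.0': [true, true, true, true, true, true, true],
    'gemini-2.5-pro': [true, true, true, true, true, true, true],
  },
  flaky_test: {
    'gemini-3.0': [true, false, true, true, true, true, true],
    'gemini-2.5-pro': [true, true, true, true, true, true, true],
  },
};
```

In this example only `stable_test` qualifies: a single failed run for any model disqualifies a candidate.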
---
## 2. 🚥 Promotion Steps
1. **Locate File**: Locate the eval file in the `evals/` directory.
2. **Update Policy**: Modify the policy argument to `ALWAYS_PASSES`.
```typescript
evalTest('ALWAYS_PASSES', { ... })
```
3. **Targeting**: Follow guidelines in `evals/README.md` regarding stable suite
organization.
4. **Constraint**: Your final change must be **minimal and targeted** strictly
to promoting the test status. Do not refactor the test or setup fixtures.
---
## 3. ✅ Verify
1. **Run Promoted Tests**: Run the promoted test locally using non-interactive
   Vitest to confirm it still executes correctly.
2. **Verify Suite Inclusion**: Check that the test is now picked up by the
   standard stable test suites.
---
## 4. 📊 Report
Provide a summary of:
- Which tests were promoted.
- The success rate evidence (e.g., 7/7 runs passed for all models).
- If no candidates qualified, list the next closest candidates and their current
pass rate.


@ -0,0 +1,95 @@
# Running & Promoting Evals
## 🛠️ Prerequisites
Behavioral evals run against the compiled binary. You **must** build and bundle
the project after making changes:
```bash
npm run build && npm run bundle
```
---
## 🏃‍♂️ Running Tests
### 1. Configure Environment Variables
Evals require a standard API key. If your `.env` file has multiple keys or
comments, use this precise extraction setup:
```bash
export GEMINI_API_KEY=$(grep '^GEMINI_API_KEY=' .env | cut -d '=' -f2) && RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts <file_name>
```
### 2. Commands
| Command | Scope | Description |
| :---------------------------------- | :-------------- | :------------------------------------------------- |
| `npm run test:always_passing_evals` | `ALWAYS_PASSES` | Fast feedback, runs in CI. |
| `npm run test:all_evals` | All | Runs nightly incubation tests. Sets `RUN_EVALS=1`. |
### Target Specific File
_Note: `RUN_EVALS=1` is required for incubated (`USUALLY_PASSES`) tests._
```bash
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts
```
---
## 🐞 Debugging and Logs
If a test fails, verify:
- **Tool Trajectory Logs**: Sequence of calls in `evals/logs/<test_name>.log`.
- **Verbose Reasoning**: Capture raw buffer traces by setting
`GEMINI_DEBUG_LOG_FILE`:
```bash
export GEMINI_DEBUG_LOG_FILE="debug.log"
```
---
### 🎯 Verify Model Targeting
- **Tip:** Standard evals benchmark against model variations. If a test passes
on Flash but fails on Pro (or vice versa), the issue is usually in the **tool
description**, not the prompt definition. Flash is sensitive to "instruction
bloat," while Pro is sensitive to "ambiguous intent."
---
## 🚥 Deflaking & Promotion
To maintain CI stability, all new evals follow a strict incubation period.
### 1. Incubation (`USUALLY_PASSES`)
New tests must be created with the `USUALLY_PASSES` policy.
```typescript
evalTest('USUALLY_PASSES', { ... })
```
They run in **Evals: Nightly** workflows and do not block PR merges.
### 2. Investigate Failures
If a nightly eval regresses, investigate via agent:
```bash
gemini /fix-behavioral-eval [optional-run-uri]
```
### 3. Promotion (`ALWAYS_PASSES`)
Once a test scores 100% consistency over multiple nightly cycles:
```bash
gemini /promote-behavioral-eval
```
_Do not promote manually._ The command verifies trajectory logs before updating
the file policy.


@ -6,6 +6,10 @@ for changes to system prompts, tool definitions, and other model-steering
mechanisms, and as a tool for assessing feature reliability by model, and
preventing regressions.
> [!TIP]
> **Agent Automation**: If you are pair-programming with Gemini CLI, you
> can leverage the **behavioral-evals skill** to automate fixing failing tests
> or promoting incubation candidates.
## Why Behavioral Evals?
Unlike traditional **integration tests** which verify that the system functions
@ -121,7 +125,7 @@ import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';
describe('my_feature', () => {
// New tests MUST start as USUALLY_PASSES and be promoted via /promote-behavioral-eval
// New tests MUST start as USUALLY_PASSES and be promoted based on consistency metrics
evalTest('USUALLY_PASSES', {
name: 'should do something',
prompt: 'do it',
@ -183,12 +187,10 @@ mandatory deflaking process.
1. **Incubation**: You must create all new tests with the `USUALLY_PASSES`
policy. This lets them be monitored in the nightly runs without blocking PRs.
2. **Monitoring**: The test must complete at least 10 nightly runs across all
2. **Monitoring**: The test must complete at least 7 nightly runs across all
supported models.
3. **Promotion**: Promotion to `ALWAYS_PASSES` happens exclusively through the
`/promote-behavioral-eval` slash command. This command verifies the 100%
success rate requirement is met across many runs before updating the test
policy.
3. **Promotion**: Promotion to `ALWAYS_PASSES` is conducted by the agent after
verifying the 100% success rate requirement is met across many runs.
This promotion process is essential for preventing the introduction of flaky
evaluations into the CI.
@ -225,42 +227,21 @@ tool definition has made the model's behavior less reliable.
## Fixing Evaluations
If an evaluation is failing or has a regressed pass rate, you can use the
`/fix-behavioral-eval` command within Gemini CLI to help investigate and fix the
issue.
### `/fix-behavioral-eval`
This command is designed to automate the investigation and fixing process for
failing evaluations. It will:
If an evaluation is failing or has a regressed pass rate, ask the agent to
investigate and fix the issue using the **behavioral-evals skill**. The agent
will automate the following process:
1. **Investigate**: Fetch the latest results from the nightly workflow using
the `gh` CLI, identify the failing test, and review test trajectory logs in
`evals/logs`.
2. **Fix**: Suggest and apply targeted fixes to the prompt or tool definitions.
It prioritizes minimal changes to `prompt.ts`, tool instructions, and
modules that contribute to the prompt. It generally tries to avoid changing
the test itself.
3. **Verify**: Re-run the test 3 times across multiple models (e.g., Gemini
3.0, Gemini 3 Flash, Gemini 2.5 Pro) to ensure stability and calculate a
success rate.
4. **Report**: Provide a summary of the success rate for each model and details
on the applied fixes.
It prioritizes minimal changes to `prompt.ts` and tool instructions,
avoiding changing the test itself unless necessary.
3. **Verify**: Re-run the test locally across multiple models to ensure
stability.
4. **Report**: Provide a summary of the success rate.
To use it, run:
```bash
gemini /fix-behavioral-eval
```
You can also provide a link to a specific GitHub Action run or the name of a
specific test to focus the investigation:
```bash
gemini /fix-behavioral-eval https://github.com/google-gemini/gemini-cli/actions/runs/123456789
```
When investigating failures manually, you can also enable verbose agent logs by
When investigating failures manually, you can enable verbose agent logs by
setting the `GEMINI_DEBUG_LOG_FILE` environment variable.
### Best practices
@ -273,25 +254,14 @@ introspecting on its prompt when asked the right questions.
## Promoting evaluations
Evaluations must be promoted from `USUALLY_PASSES` to `ALWAYS_PASSES`
exclusively using the `/promote-behavioral-eval` slash command. Manual promotion
is not allowed to ensure that the 100% success rate requirement is empirically
met.
Evaluations must be promoted from `USUALLY_PASSES` to `ALWAYS_PASSES` by the
agent to ensure that the 100% success rate requirement is empirically met.
### `/promote-behavioral-eval`
This command automates the promotion of stable tests by:
The agent automates the promotion by:
1. **Investigating**: Analyzing the results of the last 7 nightly runs on the
`main` branch using the `gh` CLI.
2. **Criteria Check**: Identifying tests that have passed 100% of the time for
ALL enabled models across the entire 7-run history.
3. **Promotion**: Updating the test file's policy from `USUALLY_PASSES` to
`ALWAYS_PASSES`.
`main` branch.
2. **Criteria Check**: Ensuring tests passed 100% of the time for ALL enabled
models.
3. **Promotion**: Updating the test file's policy to `ALWAYS_PASSES`.
4. **Verification**: Running the promoted test locally to ensure correctness.
To run it:
```bash
gemini /promote-behavioral-eval
```