mirror of https://github.com/n8n-io/n8n synced 2026-04-21 15:47:20 +00:00

README.md

Workflow evaluation framework

Tests whether workflows built by Instance AI actually work by executing them with LLM-generated mock HTTP responses.

Running evals

CLI

# From packages/@n8n/instance-ai/, with n8n running via pnpm dev:ai

# Run all test cases
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --verbose

# Run a single test case
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --verbose

# Keep built workflows for inspection
dotenvx run -f ../../../.env.local -- pnpm eval:instance-ai --filter contact-form --keep-workflows --verbose

Results are printed to the console and written to eval-results.json.

Docker (without pnpm dev:ai)

# Build the Docker image
INCLUDE_TEST_CONTROLLER=true pnpm build:docker

# Start a container
docker run -d --name n8n-eval \
  -e E2E_TESTS=true \
  -e N8N_ENABLED_MODULES=instance-ai \
  -e N8N_AI_ENABLED=true \
  -e N8N_INSTANCE_AI_MODEL_API_KEY=your-key \
  -p 5678:5678 \
  n8nio/n8n:local

# Run evals against it
pnpm eval:instance-ai --base-url http://localhost:5678 --verbose

CI

Evals run automatically on PRs that change Instance AI code (path-filtered). The CI workflow starts a single Docker container and runs the CLI against it. See .github/workflows/test-evals-instance-ai.yml.

The eval job is non-blocking. Results are posted as a PR comment and uploaded as artifacts.

Environment variables

Set these in .env.local:

| Variable | Required | Description |
| --- | --- | --- |
| `N8N_INSTANCE_AI_MODEL_API_KEY` | Yes | Anthropic API key for the Instance AI agent, mock generation, and verification |
| `N8N_EVAL_EMAIL` | No | n8n login email (defaults to E2E test owner) |
| `N8N_EVAL_PASSWORD` | No | n8n login password (defaults to E2E test owner) |
| `CONTEXT7_API_KEY` | No | Context7 API key for higher rate limits on API doc lookups. Free tier is 1,000 req/month |
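
As a rough illustration of the table above, the sketch below reads the four variables from a plain key/value map and fails fast when the required key is missing. `readEvalEnv` and `EvalEnv` are hypothetical names used only for this example; the CLI's actual parsing may differ.

```typescript
// Hypothetical helper, not part of the eval framework's API.
interface EvalEnv {
  apiKey: string;
  email: string | undefined;
  password: string | undefined;
  context7ApiKey: string | undefined;
}

// Takes a plain map so the sketch does not depend on Node typings.
function readEvalEnv(env: Record<string, string | undefined>): EvalEnv {
  const apiKey = env['N8N_INSTANCE_AI_MODEL_API_KEY'];
  if (!apiKey) {
    // The only variable the table marks as required.
    throw new Error('N8N_INSTANCE_AI_MODEL_API_KEY is required');
  }
  return {
    apiKey,
    // Optional: the README says these default to the E2E test owner when unset.
    email: env['N8N_EVAL_EMAIL'],
    password: env['N8N_EVAL_PASSWORD'],
    // Optional: only raises Context7 rate limits.
    context7ApiKey: env['CONTEXT7_API_KEY'],
  };
}
```

In a Node script this could be called as `readEvalEnv(process.env)` after `dotenvx` has loaded `.env.local`.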

How it works

Each test run:

Build — sends the test case prompt to Instance AI, which builds a workflow
Phase 1 — analyzes the workflow and generates consistent mock data hints (one Sonnet call per scenario)
Phase 2 — executes the workflow with all HTTP requests intercepted. Each request goes to an LLM that generates a realistic API response using the node's configuration and API documentation from Context7
Verify — an LLM evaluates whether the scenario's success criteria were met and categorizes any failure by root cause (see Failure categories below)
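
The four steps above can be sketched as a small orchestration function. This is a minimal sketch, not the framework's implementation: `EvalSteps`, `runTestCase`, and all method names are hypothetical, and it assumes the workflow is built once per test case and then executed once per scenario.

```typescript
interface PipelineScenario {
  name: string;
  dataSetup: string;
  successCriteria: string;
}

interface Verdict {
  passed: boolean;
  failureCategory?: string;
}

// Each step is injected so the sketch stays independent of the real services.
interface EvalSteps {
  build(prompt: string): Promise<string>; // Build: returns a workflow ID
  generateHints(workflowId: string, scenario: PipelineScenario): Promise<string>; // Phase 1
  executeWithMocks(workflowId: string, hints: string): Promise<string>; // Phase 2: returns a trace
  verify(trace: string, scenario: PipelineScenario): Promise<Verdict>; // Verify
}

async function runTestCase(
  prompt: string,
  scenarios: PipelineScenario[],
  steps: EvalSteps,
): Promise<Map<string, Verdict>> {
  // Assumption: one build per test case, shared by all scenarios.
  const workflowId = await steps.build(prompt);
  const results = new Map<string, Verdict>();
  for (const scenario of scenarios) {
    const hints = await steps.generateHints(workflowId, scenario);
    const trace = await steps.executeWithMocks(workflowId, hints);
    results.set(scenario.name, await steps.verify(trace, scenario));
  }
  return results;
}
```

Injecting the steps keeps the control flow testable with stubs, without calling an LLM or a running n8n instance.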

What gets mocked

Mocked nodes — any node that makes HTTP requests (Gmail, Slack, Google Sheets, HTTP Request, etc.). The request is intercepted before it leaves the process. An LLM generates the response.
Pinned nodes — trigger/start nodes get LLM-generated input data injected as pin data
Real nodes — logic nodes (Code, Set, Merge, Filter, Sort, IF, Switch) execute their actual code on the mocked/pinned data

No real credentials or API connections are needed.
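
The three categories can be pictured as a simple classification. The sketch below is illustrative only: `classifyNode`, `NodeInfo`, and the `kind` strings are hypothetical, and the fallback to "mocked" is an assumption that ignores nodes that bypass the HTTP layer.

```typescript
type NodeHandling = 'pinned' | 'real' | 'mocked';

// Logic nodes named in this README; the framework may decide this differently.
const LOGIC_NODES = new Set(['Code', 'Set', 'Merge', 'Filter', 'Sort', 'IF', 'Switch']);

interface NodeInfo {
  kind: string; // hypothetical display name of the node type
  isTrigger: boolean;
}

function classifyNode(node: NodeInfo): NodeHandling {
  if (node.isTrigger) return 'pinned'; // trigger/start nodes receive pin data
  if (LOGIC_NODES.has(node.kind)) return 'real'; // logic nodes run their actual code
  return 'mocked'; // assumption: every remaining node makes HTTP requests
}
```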

Adding test cases

Test cases live in evaluations/data/workflows/*.json:

{
  "prompt": "Create a workflow that...",
  "complexity": "medium",
  "tags": ["build", "webhook", "gmail"],
  "triggerType": "webhook",
  "scenarios": [
    {
      "name": "happy-path",
      "description": "Normal operation",
      "dataSetup": "The webhook receives a submission from Jane (jane@example.com)...",
      "successCriteria": "The workflow executes without errors. An email is sent to jane@example.com..."
    }
  ]
}
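
A structural check can catch malformed test case files before a run spends API calls on them. The sketch below assumes every field shown in the example is required; the actual types live in the framework's `types.ts` and may be looser. `validateTestCase` and `isRecord` are hypothetical names.

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Returns a list of problems; an empty list means the shape looks right.
function validateTestCase(raw: unknown): string[] {
  if (!isRecord(raw)) return ['test case must be a JSON object'];
  const problems: string[] = [];
  for (const key of ['prompt', 'complexity', 'triggerType']) {
    if (typeof raw[key] !== 'string' || raw[key] === '') {
      problems.push(`"${key}" must be a non-empty string`);
    }
  }
  const tags = raw['tags'];
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === 'string')) {
    problems.push('"tags" must be an array of strings');
  }
  const scenarios = raw['scenarios'];
  if (!Array.isArray(scenarios) || scenarios.length === 0) {
    problems.push('"scenarios" must be a non-empty array');
    return problems;
  }
  scenarios.forEach((scenario: unknown, i: number) => {
    for (const key of ['name', 'description', 'dataSetup', 'successCriteria']) {
      if (!isRecord(scenario) || typeof scenario[key] !== 'string') {
        problems.push(`scenarios[${i}].${key} must be a string`);
      }
    }
  });
  return problems;
}
```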

Writing good test cases

Prompt tips:

Be specific about node configuration — include document IDs, sheet names, channel names, chat IDs. The agent won't ask for these in eval mode (no multi-turn support yet).
Say "Configure all nodes as completely as possible and don't ask me for credentials, I'll set them up later."
If a built-in node doesn't expose the fields you need (e.g., Linear node doesn't query creator.email), tell the agent to use an HTTP Request node with a custom API call instead.

Scenario tips:

Don't specify exact counts that depend on mock data (e.g., "exactly 7 posts remain"). The LLM generates data non-deterministically. Instead say "some posts are filtered out — fewer remain than the original 10."
The dataSetup field steers the mock data generation. Describe what each service should return, not the exact JSON.
For error scenarios, describe the error condition: "The Telegram node returns an error indicating the chat was not found."
The successCriteria is what the verification LLM checks. Be specific about what constitutes success — "None of the titles in the Slack message should contain the word 'qui'."
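
The advice about exact counts can be turned into a quick lint over scenario files. This is a heuristic sketch with hypothetical names (`findBrittleScenarios`, `EXACT_COUNT_PATTERN`); it only catches the literal phrasing "exactly N" and is not something the framework is known to run.

```typescript
// Flags phrases like "exactly 7", which depend on non-deterministic mock data.
const EXACT_COUNT_PATTERN = /\bexactly\s+\d+\b/i;

interface CriteriaHolder {
  name: string;
  successCriteria: string;
}

// Returns the names of scenarios whose success criteria look brittle.
function findBrittleScenarios(scenarios: CriteriaHolder[]): string[] {
  return scenarios
    .filter((scenario) => EXACT_COUNT_PATTERN.test(scenario.successCriteria))
    .map((scenario) => scenario.name);
}
```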

Scenarios to include:

happy-path — everything works as expected
Edge cases — empty data, missing fields, single vs multiple items
Error scenarios only if the workflow is expected to handle them gracefully. Most agent-built workflows don't include error handling, so testing "the workflow crashes on invalid input" is a legitimate finding, not a test case failure.

Failure categories

When a scenario fails, the verifier categorizes the root cause:

builder_issue: the agent misconfigured a node, chose the wrong node type, or built a workflow structure that doesn't match what was asked
mock_issue: the LLM mock returned incorrect data (e.g., _evalMockError, wrong response shape)
framework_issue: Phase 1 failed (e.g., empty trigger content) or errors cascaded from the eval framework itself
verification_failure: the LLM verifier couldn't produce a valid result
build_failure: Instance AI failed to build the workflow or a scenario timed out
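
When reading results across many scenarios, it can help to tally failures by category to see whether the builder or the eval tooling is the main source of noise. The category strings below come from the list above; the helper names (`isFailureCategory`, `tallyFailures`) are hypothetical and the shape of the real results file is not assumed.

```typescript
const FAILURE_CATEGORIES = [
  'builder_issue',
  'mock_issue',
  'framework_issue',
  'verification_failure',
  'build_failure',
] as const;

type FailureCategory = (typeof FAILURE_CATEGORIES)[number];

// Narrows an arbitrary string (e.g. from parsed JSON) to a known category.
function isFailureCategory(value: string): value is FailureCategory {
  return (FAILURE_CATEGORIES as readonly string[]).indexOf(value) !== -1;
}

// Counts how often each category occurs; categories with no failures stay at 0.
function tallyFailures(categories: FailureCategory[]): Record<FailureCategory, number> {
  const counts: Record<FailureCategory, number> = {
    builder_issue: 0,
    mock_issue: 0,
    framework_issue: 0,
    verification_failure: 0,
    build_failure: 0,
  };
  for (const category of categories) {
    counts[category] += 1;
  }
  return counts;
}
```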

Architecture

evaluations/
├── index.ts              # Public API
├── cli/                  # CLI entry point and args parsing
├── clients/              # n8n REST + SSE clients
├── checklist/            # LLM verification with retry
├── credentials/          # Test credential seeding
├── data/workflows/       # Test case JSON files
├── harness/              # Runner: buildWorkflow, executeScenario, cleanupBuild
├── outcome/              # SSE event parsing, workflow discovery
└── system-prompts/       # LLM prompts for verification

packages/cli/src/modules/instance-ai/eval/
├── execution.service.ts  # Phase 1 + Phase 2 orchestration
├── workflow-analysis.ts  # Hint generation (Phase 1)
├── mock-handler.ts       # Per-request mock generation (Phase 2)
├── api-docs.ts           # Context7 API doc fetcher
├── node-config.ts        # Node config serializer
└── pin-data-generator.ts # LLM pin data for bypass nodes (Phase 1.5)

Known limitations

LangChain/AI nodes: these use their own SDKs, so their requests are not intercepted by the HTTP mock layer and the nodes fail with credential errors. Use pin data for them instead.
GraphQL APIs: the response shape depends on the query, not just the endpoint, so mock quality depends on the LLM knowing the API schema.
Context7 quota: the free tier allows 1,000 requests/month and 60/hour. A full suite run uses ~100 requests. When the quota is exceeded, the LLM falls back to its training data.
Non-determinism: the agent builds different workflows each run, so pass rates vary between 40% and 65%.
Large workflows: the verification artifact includes full execution traces, which can hit token limits for complex workflows (12+ nodes). See TRUST-43 for the tool-based verifier approach.
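
The Context7 numbers above imply a small budget on the free tier. The arithmetic below is a sketch using the approximate figures from this README; `context7Budget` is a hypothetical helper, and the hourly check assumes a full run finishes within one hour, which the README does not state.

```typescript
interface QuotaBudget {
  fullRunsPerMonth: number;
  // True when one run needs more requests than the hourly quota allows.
  exceedsHourlyLimitIfRunWithinAnHour: boolean;
}

function context7Budget(
  requestsPerRun: number,
  monthlyQuota = 1000, // free tier, per this README
  hourlyQuota = 60, // free tier, per this README
): QuotaBudget {
  return {
    fullRunsPerMonth: Math.floor(monthlyQuota / requestsPerRun),
    exceedsHourlyLimitIfRunWithinAnHour: requestsPerRun > hourlyQuota,
  };
}

// ~100 requests per full suite run: 1000 / 100 = 10 runs, and 100 > 60.
const freeTierBudget = context7Budget(100);
```

Under these assumptions the free tier covers about 10 full suite runs per month, and a single full run completed inside an hour would already pass the 60/hour limit, at which point the LLM falls back to its training data.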