fleet/tools/signoz/host_cache_dashboard.json
Victor Lyuboslavsky de86536f42
Redis-backed cache for host-by-key lookups (#43936)
**Related issue:** Resolves #43928 

This PR adds a Redis-backed cache in front of the two host-by-key
lookups on the agent auth paths.

Docs: https://github.com/fleetdm/fleet/pull/44504

## What changes

**Read path (osquery/orbit auth):**

- `LoadHostByNodeKey` and `LoadHostByOrbitNodeKey` now check Redis
  before falling through to MySQL.
- Successful lookups are cached for 60s ± 10% jitter (configurable via
  `FLEET_REDIS_HOST_CACHE_TTL`).
- `NotFound` results are cached for 5s as a negative entry, dampening
  repeated probes for keys that do not exist (deleted hosts whose agents
  are still polling, attacker scans, retry storms).
- Concurrent lookups for the same key collapse into one DB query via
  `singleflight`. The shared query runs under a context detached from
  any one caller's deadline, so the leader giving up does not abort the
  work for joiners. The shared query is itself bounded by a 30s timeout
  so a wedged DB call cannot pin the singleflight slot indefinitely.
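
The collapse-and-detach behavior in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `group`, `call`, `loadFunc`, `lookup`, and `demo` are hypothetical names, the hand-rolled `group` stands in for `golang.org/x/sync/singleflight` so the example needs only the standard library, and `context.WithoutCancel` assumes Go 1.21 or newer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// call is one in-flight load shared by every caller asking for the same key.
type call struct {
	done chan struct{}
	val  string
	err  error
}

// group is a minimal stand-in for golang.org/x/sync/singleflight.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// sharedTimeout bounds the shared query, mirroring the 30s limit described above.
const sharedTimeout = 30 * time.Second

type loadFunc func(ctx context.Context, key string) (string, error)

// lookup collapses concurrent loads of key into a single call to load.
func (g *group) lookup(ctx context.Context, key string, load loadFunc) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	c, inFlight := g.calls[key]
	if !inFlight {
		c = &call{done: make(chan struct{})}
		g.calls[key] = c
		go func() {
			// Keep the caller's context values but drop its cancellation,
			// then bound the shared work with its own timeout.
			dctx, cancel := context.WithTimeout(context.WithoutCancel(ctx), sharedTimeout)
			defer cancel()
			c.val, c.err = load(dctx, key)
			g.mu.Lock()
			delete(g.calls, key)
			g.mu.Unlock()
			close(c.done)
		}()
	}
	g.mu.Unlock()

	select {
	case <-c.done:
		return c.val, c.err
	case <-ctx.Done():
		return "", ctx.Err() // this caller gave up; the shared load continues
	}
}

// demo has a leader give up immediately and checks that a later caller
// for the same key still receives a successful result.
func demo() (leaderCanceled bool, joined string, err error) {
	var g group
	release := make(chan struct{})
	load := func(ctx context.Context, key string) (string, error) {
		<-release // simulated slow MySQL query
		if ctx.Err() != nil {
			return "", ctx.Err()
		}
		return "host-for-" + key, nil
	}

	canceled, cancel := context.WithCancel(context.Background())
	cancel()
	_, leaderErr := g.lookup(canceled, "node-key-1", load)
	leaderCanceled = errors.Is(leaderErr, context.Canceled)

	go func() {
		time.Sleep(20 * time.Millisecond) // give the joiner time to attach
		close(release)
	}()
	joined, err = g.lookup(context.Background(), "node-key-1", load)
	return leaderCanceled, joined, err
}

func main() {
	fmt.Println(demo())
}
```

Running the load in its own goroutine is what lets the leader return early without cancelling the query; the timeout on the detached context is what eventually frees the slot if the query never returns.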

**Write path (invalidations):**

- These methods now invalidate the cache after a successful inner call:
  `UpdateHost`, `SerialUpdateHost`, `UpdateHostOsqueryIntervals`,
  `UpdateHostRefetchRequested`, `UpdateHostRefetchCriticalQueriesUntil`,
  `UpdateHostIdentityCertHostIDBySerial`, `EnrollOsquery`,
  `EnrollOrbit`, `NewHost`, `DeleteHost`, `DeleteHosts`,
  `CleanupExpiredHosts`, `CleanupIncomingHosts`, `AddHostsToTeam`.
- `AddHostsToTeam`, `DeleteHosts`, `CleanupExpiredHosts`, and
  `CleanupIncomingHosts` use a pipelined batch invalidator so 10k-host
  operations stay in the millisecond range instead of taking minutes
  of sequential round-trips.
- Inner-call errors are not invalidations: a failing write leaves cached
state intact.
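
The invalidate-after-success rule can be illustrated with a small wrapper. This is a sketch under simplifying assumptions: `Host`, `store`, `cachedStore`, and `fakeStore` are hypothetical types with reduced signatures, and an in-memory map stands in for Redis.

```go
package main

import (
	"errors"
	"fmt"
)

// Host is a reduced, hypothetical host record.
type Host struct {
	ID      uint
	NodeKey string
}

// store is a hypothetical stand-in for the inner (MySQL) datastore.
type store interface {
	UpdateHost(h Host) error
}

// cachedStore wraps a store and drops the cached entry after successful writes.
type cachedStore struct {
	inner store
	cache map[string]Host // stand-in for Redis, keyed by node key
}

// UpdateHost forwards the write and invalidates only if the write succeeded.
func (c *cachedStore) UpdateHost(h Host) error {
	if err := c.inner.UpdateHost(h); err != nil {
		// A failing write is not an invalidation: cached state stays intact.
		return err
	}
	delete(c.cache, h.NodeKey)
	return nil
}

// fakeStore lets the example force a write failure.
type fakeStore struct{ fail bool }

func (f *fakeStore) UpdateHost(Host) error {
	if f.fail {
		return errors.New("write failed")
	}
	return nil
}

func main() {
	h := Host{ID: 1, NodeKey: "k1"}
	c := &cachedStore{inner: &fakeStore{fail: true}, cache: map[string]Host{"k1": h}}

	err := c.UpdateHost(h)
	_, cached := c.cache["k1"]
	fmt.Println(err != nil, cached) // failed write: entry is still cached

	c.inner = &fakeStore{}
	err = c.UpdateHost(h)
	_, cached = c.cache["k1"]
	fmt.Println(err != nil, cached) // successful write: entry was invalidated
}
```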

**Configuration:**

- New flags `FLEET_REDIS_HOST_CACHE_ENABLED` (default `true`) and
  `FLEET_REDIS_HOST_CACHE_TTL` (default `60s`).
- The server refuses to start if the cache is enabled with `TTL <= 0`.
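
As a rough sketch of how the startup check and the ±10% TTL jitter could fit together: `hostCacheConfig`, `validate`, `jitteredTTL`, and `jitterInBounds` are hypothetical names, and the uniform jitter distribution is an assumption, since the description only states the ±10% range.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// hostCacheConfig is a hypothetical holder for the two new flags.
type hostCacheConfig struct {
	Enabled bool          // FLEET_REDIS_HOST_CACHE_ENABLED
	TTL     time.Duration // FLEET_REDIS_HOST_CACHE_TTL
}

// validate rejects an enabled cache with a non-positive TTL.
func (c hostCacheConfig) validate() error {
	if c.Enabled && c.TTL <= 0 {
		return errors.New("host cache TTL must be > 0 when the cache is enabled")
	}
	return nil
}

// jitteredTTL scales base by a random factor in [0.9, 1.1) so entries
// written at the same moment do not all expire together.
func jitteredTTL(base time.Duration, rnd *rand.Rand) time.Duration {
	factor := 0.9 + 0.2*rnd.Float64()
	return time.Duration(float64(base) * factor)
}

// jitterInBounds samples n TTLs around 60s and reports whether every
// sample stayed within 54s..66s.
func jitterInBounds(n int) bool {
	rnd := rand.New(rand.NewSource(1))
	for i := 0; i < n; i++ {
		ttl := jitteredTTL(60*time.Second, rnd)
		if ttl < 54*time.Second || ttl > 66*time.Second {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(hostCacheConfig{Enabled: true, TTL: 0}.validate())
	fmt.Println(hostCacheConfig{Enabled: true, TTL: 60 * time.Second}.validate())
	fmt.Println(jitterInBounds(1000))
}
```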

**Observability:**

- Three new OTEL counters under the `fleet` meter:
  - `fleet.host_cache.lookups{result=hit|negative_hit|miss}`
  - `fleet.host_cache.errors{op=get|set|del}`
  - `fleet.host_cache.invalidations{reason=update|enroll|team|delete|cert}`
- A pre-built SigNoz dashboard ships in
`tools/signoz/host_cache_dashboard.json`.
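
The dashboard's hit-rate panel divides the rate of `hit` plus `negative_hit` lookups by the rate of all lookups. The arithmetic can be sketched as below; `lookupCounts` and `hitRate` are hypothetical names and the sample numbers are made up for illustration.

```go
package main

import "fmt"

// lookupCounts mirrors the result label values of fleet.host_cache.lookups.
type lookupCounts struct {
	Hit, NegativeHit, Miss float64
}

// hitRate returns the share of lookups served from Redis. Negative hits
// count as served, because they also avoid a MySQL query.
func hitRate(c lookupCounts) float64 {
	total := c.Hit + c.NegativeHit + c.Miss
	if total == 0 {
		return 0 // no traffic yet; avoid dividing by zero
	}
	return (c.Hit + c.NegativeHit) / total
}

func main() {
	c := lookupCounts{Hit: 850, NegativeHit: 50, Miss: 100}
	fmt.Printf("hit rate: %.2f\n", hitRate(c)) // prints "hit rate: 0.90"
}
```

With these sample counts the rate sits above the 0.8 threshold line drawn on the panel.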

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.
- [x] Timeouts are implemented and retries are limited to avoid infinite
loops

## Testing

- [x] Added/updated automated tests
- [x] Where appropriate, [automated tests simulate multiple hosts and
test for host
isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing)
(updates to one host's records do not affect another)

- [x] QA'd all new/changed functionality manually


## Summary by CodeRabbit

* **New Features**
* Optional Redis-backed host lookup cache for osquery and orbit auth,
with automatic invalidation and metrics/monitoring dashboard.

* **Bug Fixes**
* Fixed host-removal batching so cache-related removals use correct
chunks.

* **Tests**
* Added comprehensive host-cache unit tests covering hits, negative
cache, invalidation, concurrency, and JSON round-trips.

* **Chores**
* New config flags to enable the cache and set TTL (default 60s ±10%
jitter).
2026-05-01 12:06:16 -05:00

228 lines
7.7 KiB
JSON

{
"title": "Fleet host cache",
"description": "Observability for the Redis-backed host lookup cache fronting LoadHostByNodeKey and LoadHostByOrbitNodeKey. Shows hit rate, lookup volume by result, error volume by operation, and invalidation volume by write-path reason.",
"tags": ["redis", "cache", "host-cache"],
"layout": [
{ "i": "row-overview", "x": 0, "y": 0, "w": 12, "h": 1 },
{ "i": "hit-rate", "x": 0, "y": 1, "w": 4, "h": 4 },
{ "i": "lookups", "x": 4, "y": 1, "w": 8, "h": 4 },
{ "i": "errors", "x": 0, "y": 5, "w": 6, "h": 4 },
{ "i": "invalidations", "x": 6, "y": 5, "w": 6, "h": 4 }
],
"widgets": [
{
"id": "row-overview",
"panelTypes": "row",
"title": "Host cache overview",
"query": { "queryType": "builder", "promql": [], "clickhouse_sql": [], "builder": { "queryData": [], "queryFormulas": [] } },
"selectedLogFields": [],
"selectedTracesFields": [],
"thresholds": [],
"contextLinks": { "linksData": [] }
},
{
"id": "hit-rate",
"panelTypes": "graph",
"title": "Hit rate",
"description": "A/B where A = rate of hits + negative_hits (both Redis-served, both avoid MySQL), B = rate of all lookups. Target: >= 80% at steady state once the cache warms. Watch for drops during invalidation storms (mass team transfers, re-enrollments).",
"yAxisUnit": "percentunit",
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"disabled": true,
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.lookups",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "result IN ['hit', 'negative_hit']" },
"groupBy": [],
"orderBy": [],
"selectColumns": [],
"functions": []
},
{
"queryName": "B",
"dataSource": "metrics",
"expression": "B",
"disabled": true,
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.lookups",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [],
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": [
{
"queryName": "F1",
"expression": "A / B",
"legend": "hit rate"
}
]
}
},
"thresholds": [
{ "index": "1", "keyIndex": 0, "thresholdColor": "Orange", "thresholdFormat": "Line", "thresholdOperator": "<", "thresholdUnit": "percentunit", "thresholdValue": 0.8 }
],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
},
{
"id": "lookups",
"panelTypes": "graph",
"title": "Lookups/sec by result",
"description": "Stacked area of cache reads split by outcome. hit = served from Redis; negative_hit = cached NotFound; miss = fell through to MySQL.",
"yAxisUnit": "cps",
"isStacked": true,
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.lookups",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [
{ "key": "result", "dataType": "string", "type": "tag" }
],
"legend": "{{result}}",
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": []
}
},
"thresholds": [],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
},
{
"id": "errors",
"panelTypes": "graph",
"title": "Errors/sec by op",
"description": "Redis / JSON errors on the cache path, labeled by operation (get | set | del). Should be flat-zero in steady state; spikes indicate Redis flake or poisoned cache entries.",
"yAxisUnit": "cps",
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.errors",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [
{ "key": "op", "dataType": "string", "type": "tag" }
],
"legend": "{{op}}",
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": []
}
},
"thresholds": [],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
},
{
"id": "invalidations",
"panelTypes": "graph",
"title": "Invalidations/sec by reason",
"description": "Cache invalidations on write paths. update = UpdateHost/SerialUpdateHost/osquery intervals/refetch; enroll = NewHost/EnrollOsquery/EnrollOrbit; team = AddHostsToTeam; delete = DeleteHost*/CleanupExpiredHosts/CleanupIncomingHosts; cert = UpdateHostIdentityCertHostIDBySerial.",
"yAxisUnit": "cps",
"isStacked": true,
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.invalidations",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [
{ "key": "reason", "dataType": "string", "type": "tag" }
],
"legend": "{{reason}}",
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": []
}
},
"thresholds": [],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
}
]
}