fleet/tools/signoz/host_cache_dashboard.json


Redis-backed cache for host-by-key lookups (#43936)

**Related issue:** Resolves #43928

This PR adds a Redis-backed cache in front of the two host-by-key lookups on the agent auth paths.

Docs: https://github.com/fleetdm/fleet/pull/44504

## What changes

**Read path (osquery/orbit auth):**

- `LoadHostByNodeKey` and `LoadHostByOrbitNodeKey` now check Redis before falling through to MySQL.
- Successful lookups are cached for 60s ± 10% jitter (configurable via `FLEET_REDIS_HOST_CACHE_TTL`).
- `NotFound` results are cached for 5s as a negative entry, dampening repeated probes for keys that do not exist (deleted hosts whose agents are still polling, attacker scans, retry storms).
- Concurrent lookups for the same key collapse into one DB query via `singleflight`. The shared query runs under a context detached from any one caller's deadline, so the leader giving up does not abort the work for joiners. The shared query is itself bounded by a 30s timeout, so a wedged DB call cannot pin the singleflight slot indefinitely.

**Write path (invalidations):**

- These methods now invalidate the cache after a successful inner call: `UpdateHost`, `SerialUpdateHost`, `UpdateHostOsqueryIntervals`, `UpdateHostRefetchRequested`, `UpdateHostRefetchCriticalQueriesUntil`, `UpdateHostIdentityCertHostIDBySerial`, `EnrollOsquery`, `EnrollOrbit`, `NewHost`, `DeleteHost`, `DeleteHosts`, `CleanupExpiredHosts`, `CleanupIncomingHosts`, `AddHostsToTeam`.
- `AddHostsToTeam`, `DeleteHosts`, `CleanupExpiredHosts`, and `CleanupIncomingHosts` use a pipelined batch invalidator so 10k-host operations stay in the millisecond range instead of taking minutes of sequential round-trips.
- Inner-call errors do not trigger invalidation: a failing write leaves cached state intact.

**Configuration:**

- New flags `FLEET_REDIS_HOST_CACHE_ENABLED` (default `true`) and `FLEET_REDIS_HOST_CACHE_TTL` (default `60s`).
- Server refuses to start if the cache is enabled with `TTL <= 0`.

**Observability:**

- Three new OTEL counters under the `fleet` meter:
  - `fleet.host_cache.lookups{result=hit|negative_hit|miss}`
  - `fleet.host_cache.errors{op=get|set|del}`
  - `fleet.host_cache.invalidations{reason=update|enroll|team|delete|cert}`
- A pre-built SigNoz dashboard ships in `tools/signoz/host_cache_dashboard.json`.

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`. See [Changes files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files) for more information.
- [x] Timeouts are implemented and retries are limited to avoid infinite loops

## Testing

- [x] Added/updated automated tests
- [x] Where appropriate, [automated tests simulate multiple hosts and test for host isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing) (updates to one host's records do not affect another)
- [x] QA'd all new/changed functionality manually

## Summary by CodeRabbit

* **New Features**
  * Optional Redis-backed host lookup cache for osquery and orbit auth, with automatic invalidation and metrics/monitoring dashboard.
* **Bug Fixes**
  * Fixed host-removal batching so cache-related removals use correct chunks.
* **Tests**
  * Added comprehensive host-cache unit tests covering hits, negative cache, invalidation, concurrency, and JSON round-trips.
* **Chores**
  * New config flags to enable the cache and set TTL (default 60s ±10% jitter).
2026-05-01 17:06:16 +00:00
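
The TTL rules in the commit message (60s ± 10% jitter for found hosts, 5s for `NotFound`, startup refusal on `TTL <= 0`) can be sketched as below. This is a Python illustration, not Fleet's Go implementation: the names `jittered_ttl` and `ttl_for` are hypothetical, the uniform jitter distribution is an assumption, and so is leaving the negative TTL unjittered.

```python
import random

POSITIVE_TTL_S = 60.0    # default FLEET_REDIS_HOST_CACHE_TTL, per the commit message
NEGATIVE_TTL_S = 5.0     # lifetime of a cached NotFound entry, per the commit message
JITTER_FRACTION = 0.10   # the "± 10%" from the commit message


def jittered_ttl(base_s, jitter=JITTER_FRACTION, rng=random):
    """Scale base_s by a factor drawn uniformly from [1 - jitter, 1 + jitter].

    The uniform distribution is an assumption; the source only says "± 10%".
    """
    if base_s <= 0:
        # Loosely mirrors the documented behavior of refusing TTL <= 0.
        raise ValueError("TTL must be positive")
    return base_s * (1.0 + rng.uniform(-jitter, jitter))


def ttl_for(found, rng=random):
    """Pick the TTL for a cache entry: jittered for hits, fixed for NotFound."""
    if found:
        return jittered_ttl(POSITIVE_TTL_S, rng=rng)
    return NEGATIVE_TTL_S  # assumed fixed; the source does not mention jitter here
```

Jitter spreads expirations so that entries written at the same moment (for example, after a mass re-enrollment) do not all expire together and hit MySQL in the same second.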
{
"title": "Fleet host cache",
"description": "Observability for the Redis-backed host lookup cache fronting LoadHostByNodeKey and LoadHostByOrbitNodeKey. Shows hit rate, lookup volume by result, error volume by operation, and invalidation volume by write-path reason.",
"tags": ["redis", "cache", "host-cache"],
"layout": [
{ "i": "row-overview", "x": 0, "y": 0, "w": 12, "h": 1 },
{ "i": "hit-rate", "x": 0, "y": 1, "w": 4, "h": 4 },
{ "i": "lookups", "x": 4, "y": 1, "w": 8, "h": 4 },
{ "i": "errors", "x": 0, "y": 5, "w": 6, "h": 4 },
{ "i": "invalidations", "x": 6, "y": 5, "w": 6, "h": 4 }
],
"widgets": [
{
"id": "row-overview",
"panelTypes": "row",
"title": "Host cache overview",
"query": { "queryType": "builder", "promql": [], "clickhouse_sql": [], "builder": { "queryData": [], "queryFormulas": [] } },
"selectedLogFields": [],
"selectedTracesFields": [],
"thresholds": [],
"contextLinks": { "linksData": [] }
},
{
"id": "hit-rate",
"panelTypes": "graph",
"title": "Hit rate",
"description": "A/B where A = rate of hits + negative_hits (both Redis-served, both avoid MySQL), B = rate of all lookups. Target: >= 80% at steady state once the cache warms. Watch for drops during invalidation storms (mass team transfers, re-enrollments).",
"yAxisUnit": "percentunit",
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"disabled": true,
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.lookups",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "result IN ['hit', 'negative_hit']" },
"groupBy": [],
"orderBy": [],
"selectColumns": [],
"functions": []
},
{
"queryName": "B",
"dataSource": "metrics",
"expression": "B",
"disabled": true,
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.lookups",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [],
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": [
{
"queryName": "F1",
"expression": "A / B",
"legend": "hit rate"
}
]
}
},
"thresholds": [
{ "index": "1", "keyIndex": 0, "thresholdColor": "Orange", "thresholdFormat": "Line", "thresholdOperator": "<", "thresholdUnit": "percentunit", "thresholdValue": 0.8 }
],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
},
{
"id": "lookups",
"panelTypes": "graph",
"title": "Lookups/sec by result",
"description": "Stacked area of cache reads split by outcome. hit = served from Redis; negative_hit = cached NotFound; miss = fell through to MySQL.",
"yAxisUnit": "cps",
"isStacked": true,
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.lookups",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [
{ "key": "result", "dataType": "string", "type": "tag" }
],
"legend": "{{result}}",
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": []
}
},
"thresholds": [],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
},
{
"id": "errors",
"panelTypes": "graph",
"title": "Errors/sec by op",
"description": "Redis / JSON errors on the cache path, labeled by operation (get | set | del). Should be flat-zero in steady state; spikes indicate Redis flake or poisoned cache entries.",
"yAxisUnit": "cps",
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.errors",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [
{ "key": "op", "dataType": "string", "type": "tag" }
],
"legend": "{{op}}",
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": []
}
},
"thresholds": [],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
},
{
"id": "invalidations",
"panelTypes": "graph",
"title": "Invalidations/sec by reason",
"description": "Cache invalidations on write paths. update = UpdateHost/SerialUpdateHost/osquery intervals/refetch; enroll = NewHost/EnrollOsquery/EnrollOrbit; team = AddHostsToTeam; delete = DeleteHost*/CleanupExpiredHosts/CleanupIncomingHosts; cert = UpdateHostIdentityCertHostIDBySerial.",
"yAxisUnit": "cps",
"isStacked": true,
"legendPosition": "bottom",
"query": {
"queryType": "builder",
"promql": [],
"clickhouse_sql": [],
"builder": {
"queryData": [
{
"queryName": "A",
"dataSource": "metrics",
"expression": "A",
"stepInterval": 60,
"aggregations": [
{
"metricName": "fleet.host_cache.invalidations",
"temporality": "Cumulative",
"timeAggregation": "rate",
"spaceAggregation": "sum"
}
],
"filter": { "expression": "" },
"groupBy": [
{ "key": "reason", "dataType": "string", "type": "tag" }
],
"legend": "{{reason}}",
"orderBy": [],
"selectColumns": [],
"functions": []
}
],
"queryFormulas": []
}
},
"thresholds": [],
"selectedLogFields": [],
"selectedTracesFields": [],
"contextLinks": { "linksData": [] }
}
]
}
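
The `hit-rate` panel divides the rate of `hit` + `negative_hit` lookups (query A) by the rate of all lookups (query B). The Python sketch below reproduces that arithmetic from two samples of the cumulative `fleet.host_cache.lookups` counter taken one `stepInterval` (60s) apart. The function names and sample numbers are hypothetical, and treating a decrease as a counter reset is an assumption of this sketch, not a statement about how SigNoz computes rates.

```python
STEP_S = 60          # matches "stepInterval": 60 in the panel queries
TARGET = 0.8         # matches the panel's thresholdValue


def rate(prev, curr, step_s=STEP_S):
    """Per-second rate between two cumulative counter samples.

    A decrease is treated as a counter reset (assumption of this sketch).
    """
    delta = curr - prev if curr >= prev else curr
    return delta / step_s


def hit_rate(prev, curr, step_s=STEP_S):
    """prev and curr map the `result` label to its cumulative count.

    Returns A / B, or None when there were no lookups in the interval.
    """
    served = sum(rate(prev.get(k, 0), curr.get(k, 0), step_s)
                 for k in ("hit", "negative_hit"))           # query A
    total = sum(rate(prev.get(k, 0), curr.get(k, 0), step_s)
                for k in curr)                               # query B
    return served / total if total > 0 else None


# Hypothetical samples, 60 seconds apart.
prev = {"hit": 1000, "negative_hit": 50, "miss": 200}
curr = {"hit": 1540, "negative_hit": 80, "miss": 290}

ratio = hit_rate(prev, curr)
print(round(ratio, 3), "below target" if ratio < TARGET else "ok")  # 0.864 ok
```

Counting `negative_hit` in the numerator follows the panel description: both outcomes are served from Redis and avoid a MySQL query.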