**Related issue:** Resolves #43928

This PR adds a Redis-backed cache in front of the two host-by-key lookups on the agent auth paths.

Docs: https://github.com/fleetdm/fleet/pull/44504

## What changes

**Read path (osquery/orbit auth):**

- `LoadHostByNodeKey` and `LoadHostByOrbitNodeKey` now check Redis before falling through to MySQL.
- Successful lookups are cached for 60s ± 10% jitter (configurable via `FLEET_REDIS_HOST_CACHE_TTL`).
- `NotFound` results are cached for 5s as a negative entry, dampening repeated probes for keys that do not exist (deleted hosts whose agents are still polling, attacker scans, retry storms).
- Concurrent lookups for the same key collapse into one DB query via `singleflight`. The shared query runs under a context detached from any one caller's deadline, so the leader giving up does not abort the work for joiners. The shared query is itself bounded by a 30s timeout, so a wedged DB call cannot pin the singleflight slot indefinitely.

**Write path (invalidations):**

- These methods now invalidate the cache after a successful inner call: `UpdateHost`, `SerialUpdateHost`, `UpdateHostOsqueryIntervals`, `UpdateHostRefetchRequested`, `UpdateHostRefetchCriticalQueriesUntil`, `UpdateHostIdentityCertHostIDBySerial`, `EnrollOsquery`, `EnrollOrbit`, `NewHost`, `DeleteHost`, `DeleteHosts`, `CleanupExpiredHosts`, `CleanupIncomingHosts`, `AddHostsToTeam`.
- `AddHostsToTeam`, `DeleteHosts`, `CleanupExpiredHosts`, and `CleanupIncomingHosts` use a pipelined batch invalidator so 10k-host operations stay in the millisecond range instead of taking minutes of sequential round-trips.
- Inner-call errors do not trigger invalidation: a failing write leaves cached state intact.

**Configuration:**

- New flags `FLEET_REDIS_HOST_CACHE_ENABLED` (default `true`) and `FLEET_REDIS_HOST_CACHE_TTL` (default `60s`).
- Server refuses to start if the cache is enabled with `TTL <= 0`.

**Observability:**

- Three new OTEL counters under the `fleet` meter:
  - `fleet.host_cache.lookups{result=hit|negative_hit|miss}`
  - `fleet.host_cache.errors{op=get|set|del}`
  - `fleet.host_cache.invalidations{reason=update|enroll|team|delete|cert}`
- A pre-built SigNoz dashboard ships in `tools/signoz/host_cache_dashboard.json`.

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`. See [Changes files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files) for more information.
- [x] Timeouts are implemented and retries are limited to avoid infinite loops

## Testing

- [x] Added/updated automated tests
- [x] Where appropriate, [automated tests simulate multiple hosts and test for host isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing) (updates to one host's records do not affect another)
- [x] QA'd all new/changed functionality manually

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Optional Redis-backed host lookup cache for osquery and orbit auth, with automatic invalidation and metrics/monitoring dashboard.
* **Bug Fixes**
  * Fixed host-removal batching so cache-related removals use correct chunks.
* **Tests**
  * Added comprehensive host-cache unit tests covering hits, negative cache, invalidation, concurrency, and JSON round-trips.
* **Chores**
  * New config flags to enable the cache and set TTL (default 60s ±10% jitter).

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
# Running SigNoz locally with Fleet

SigNoz is an open-source observability platform that provides traces, metrics, and logs in a single UI. This guide explains how to run SigNoz locally for Fleet development with optimized settings for reduced latency.

## Prerequisites

- Docker and Docker Compose
- A locally-built Fleet server (see Testing and local development)
## Setup

- Clone the SigNoz repository at a specific release:

  ```bash
  git clone --branch v0.110.1 --depth 1 https://github.com/SigNoz/signoz.git
  cd signoz/deploy
  ```

- Modify the SigNoz UI port to avoid a conflict with Fleet (which uses port 8080). In `docker/docker-compose.yaml`, change the `signoz` service port mapping:

  ```yaml
  services:
    signoz:
      ports:
        - "8085:8080" # Changed from 8080:8080 to avoid conflict with Fleet
  ```

- (Optional) For reduced latency during development, modify `docker/otel-collector-config.yaml`:

  ```yaml
  processors:
    batch:
      send_batch_size: 10000
      send_batch_max_size: 11000
      timeout: 200ms # reduced from 10s for dev
    # ...
    signozspanmetrics/delta:
      # ...
      metrics_flush_interval: 5s # reduced from 60s for dev
  ```

- Start SigNoz:

  ```bash
  cd docker
  docker compose up -d
  ```

  Give it a minute for all services to initialize. The SigNoz UI will be available at http://localhost:8085.
## Configuring Fleet

Start the Fleet server with OpenTelemetry tracing and logging enabled:

```bash
export FLEET_LOGGING_TRACING_ENABLED=true
export FLEET_LOGGING_OTEL_LOGS_ENABLED=true
export OTEL_SERVICE_NAME=fleet
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
./build/fleet serve
```

Note: All log levels (including debug) are always sent to SigNoz regardless of the `--logging_debug` flag. That flag only controls stderr output.
### Low-latency configuration (optional)

For faster feedback during development, you can reduce the batch processing delays on the Fleet side:

```bash
# Batch span processor delay (default 5000ms)
export OTEL_BSP_SCHEDULE_DELAY=1000
# Log batch processor settings
export OTEL_BLRP_EXPORT_TIMEOUT=1000
export OTEL_BLRP_SCHEDULE_DELAY=500
export OTEL_BLRP_MAX_EXPORT_BATCH_SIZE=1
./build/fleet serve
```
## Using SigNoz

After starting Fleet with the above configuration, traces, logs, and metrics should start appearing in the SigNoz UI at http://localhost:8085.
### Pre-canned dashboards

JSON exports of Fleet-specific SigNoz dashboards live alongside this README. Import them from the SigNoz UI via Dashboards → New dashboard → Import JSON (top-right dropdown).

- `database_custom_dashboard.json`: MySQL query metrics (RPS, latency, slow queries) derived from `db.sql.*` instrumentation.
- `host_cache_dashboard.json`: Redis-backed host lookup cache (`LoadHostByNodeKey` / `LoadHostByOrbitNodeKey`). Shows hit rate over time, lookups/sec by result, errors/sec by op, and invalidations/sec by write-path reason. Requires `FLEET_REDIS_HOST_CACHE_ENABLED=true` (default on).
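
As a rough illustration of how a hit-rate figure can be derived from the `fleet.host_cache.lookups` counter, the Go sketch below computes it from per-result totals. The `lookupCounts` type and `hitRate` function are hypothetical, and treating `negative_hit` as a hit is an assumption; the shipped dashboard may define the ratio differently.

```go
package main

import "fmt"

// lookupCounts holds totals of fleet.host_cache.lookups split by the
// result label (hit, negative_hit, miss). The type is hypothetical.
type lookupCounts struct {
	Hit         uint64
	NegativeHit uint64
	Miss        uint64
}

// hitRate returns the fraction of lookups answered from Redis.
// Assumption: negative hits count as hits, since they also avoid a MySQL query.
func hitRate(c lookupCounts) float64 {
	total := c.Hit + c.NegativeHit + c.Miss
	if total == 0 {
		return 0 // no traffic yet; avoid dividing by zero
	}
	return float64(c.Hit+c.NegativeHit) / float64(total)
}

func main() {
	c := lookupCounts{Hit: 900, NegativeHit: 50, Miss: 50}
	fmt.Printf("hit rate: %.2f\n", hitRate(c)) // prints "hit rate: 0.95"
}
```

A sustained drop in this ratio alongside a rise in `fleet.host_cache.errors` would point at Redis trouble, while a drop paired with more invalidations would point at write-path churn.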