fleet/tools/signoz/README.md
Victor Lyuboslavsky de86536f42
Redis-backed cache for host-by-key lookups (#43936)
<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #43928 

This PR adds a Redis-backed cache in front of the two host-by-key
lookups on the agent auth paths.

Docs: https://github.com/fleetdm/fleet/pull/44504

## What changes

**Read path (osquery/orbit auth):**

- `LoadHostByNodeKey` and `LoadHostByOrbitNodeKey` now check Redis
before falling through to MySQL.
- Successful lookups are cached for 60s ± 10% jitter (configurable via
`FLEET_REDIS_HOST_CACHE_TTL`).
- `NotFound` results are cached for 5s as a negative entry, dampening
repeated probes for keys that
do not exist (deleted hosts whose agents are still polling, attacker
scans, retry storms).
- Concurrent lookups for the same key collapse into one DB query via
`singleflight`. The shared
query runs under a context detached from any one caller's deadline so
the leader giving up does
not abort the work for joiners. The shared query is itself bounded by a
30s timeout so a wedged
  DB call cannot pin the singleflight slot indefinitely.

**Write path (invalidations):**

- These methods now invalidate the cache after a successful inner call:
`UpdateHost`, `SerialUpdateHost`, `UpdateHostOsqueryIntervals`,
`UpdateHostRefetchRequested`,
`UpdateHostRefetchCriticalQueriesUntil`,
`UpdateHostIdentityCertHostIDBySerial`, `EnrollOsquery`,
`EnrollOrbit`, `NewHost`, `DeleteHost`, `DeleteHosts`,
`CleanupExpiredHosts`,
  `CleanupIncomingHosts`, `AddHostsToTeam`.
- `AddHostsToTeam`, `DeleteHosts`, `CleanupExpiredHosts`, and
`CleanupIncomingHosts` use a pipelined
batch invalidator so 10k-host operations stay in the millisecond range
instead of taking minutes
  of sequential round-trips.
- Inner-call errors are not invalidations: a failing write leaves cached
state intact.

**Configuration:**

- New flags `FLEET_REDIS_HOST_CACHE_ENABLED` (default `true`) and
`FLEET_REDIS_HOST_CACHE_TTL`
  (default `60s`).
- Server refuses to start if the cache is enabled with `TTL <= 0`.

**Observability:**

- Three new OTEL counters under the `fleet` meter:
  - `fleet.host_cache.lookups{result=hit|negative_hit|miss}`
  - `fleet.host_cache.errors{op=get|set|del}`
-
`fleet.host_cache.invalidations{reason=update|enroll|team|delete|cert}`
- A pre-built SigNoz dashboard ships in
`tools/signoz/host_cache_dashboard.json`.

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.
- [x] Timeouts are implemented and retries are limited to avoid infinite
loops

## Testing

- [x] Added/updated automated tests
- [x] Where appropriate, [automated tests simulate multiple hosts and
test for host
isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing)
(updates to one hosts's records do not affect another)

- [x] QA'd all new/changed functionality manually


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Optional Redis-backed host lookup cache for osquery and orbit auth,
with automatic invalidation and metrics/monitoring dashboard.

* **Bug Fixes**
* Fixed host-removal batching so cache-related removals use correct
chunks.

* **Tests**
* Added comprehensive host-cache unit tests covering hits, negative
cache, invalidation, concurrency, and JSON round-trips.

* **Chores**
* New config flags to enable the cache and set TTL (default 60s ±10%
jitter).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2026-05-01 12:06:16 -05:00

3 KiB

Running SigNoz locally with Fleet

SigNoz is an open-source observability platform that provides traces, metrics, and logs in a single UI. This guide explains how to run SigNoz locally for Fleet development with optimized settings for reduced latency.

Prerequisites

Setup

  1. Clone the SigNoz repository at a specific release:
git clone --branch v0.110.1 --depth 1 https://github.com/SigNoz/signoz.git
cd signoz/deploy
  1. Modify the SigNoz UI port to avoid conflict with Fleet (which uses port 8080):

In docker/docker-compose.yaml, change the signoz service port mapping:

services:
  signoz:
    ports:
      - "8085:8080"  # Changed from 8080:8080 to avoid conflict with Fleet
  1. (Optional) For reduced latency during development, modify docker/otel-collector-config.yaml:
processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 200ms  # reduced from 10s for dev
  # ...
  signozspanmetrics/delta:
    # ...
    metrics_flush_interval: 5s  # reduced from 60s for dev
  1. Start SigNoz:
cd docker
docker compose up -d

Give it a minute for all services to initialize. The SigNoz UI will be available at http://localhost:8085.

Configuring Fleet

Start the Fleet server with OpenTelemetry tracing and logging enabled:

export FLEET_LOGGING_TRACING_ENABLED=true
export FLEET_LOGGING_OTEL_LOGS_ENABLED=true
export OTEL_SERVICE_NAME=fleet
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

./build/fleet serve

Note: All log levels (including debug) are always sent to SigNoz regardless of the --logging_debug flag. That flag only controls stderr output.

Low-latency configuration (optional)

For faster feedback during development, you can reduce the batch processing delays on the Fleet side:

# Batch span processor delay (default 5000ms)
export OTEL_BSP_SCHEDULE_DELAY=1000

# Log batch processor settings
export OTEL_BLRP_EXPORT_TIMEOUT=1000
export OTEL_BLRP_SCHEDULE_DELAY=500
export OTEL_BLRP_MAX_EXPORT_BATCH_SIZE=1

./build/fleet serve

Using SigNoz

After starting Fleet with the above configuration, you should start seeing traces, logs, and metrics in SigNoz UI at http://localhost:8085.

Pre-canned dashboards

JSON exports of Fleet-specific SigNoz dashboards live alongside this README. Import them from the SigNoz UI via Dashboards → New dashboard → Import JSON (top-right dropdown).

  • database_custom_dashboard.json — MySQL query metrics (RPS, latency, slow queries) derived from db.sql.* instrumentation.
  • host_cache_dashboard.json — Redis-backed host lookup cache (LoadHostByNodeKey / LoadHostByOrbitNodeKey). Shows hit rate over time, lookups/sec by result, errors/sec by op, and invalidations/sec by write-path reason. Requires FLEET_REDIS_HOST_CACHE_ENABLED=true (default on).