fleet

Elgato_dark/fleet

Fork 0

mirror of https://github.com/fleetdm/fleet synced 2026-05-24 09:28:54 +00:00

Commit graph

Author	SHA1	Message	Date
Victor Lyuboslavsky	de86536f42	Redis-backed cache for host-by-key lookups (#43936 ) <!-- Add the related story/sub-task/bug number, like Resolves #123, or remove if NA --> Related issue: Resolves #43928 This PR adds a Redis-backed cache in front of the two host-by-key lookups on the agent auth paths. Docs: https://github.com/fleetdm/fleet/pull/44504 ## What changes Read path (osquery/orbit auth): - `LoadHostByNodeKey` and `LoadHostByOrbitNodeKey` now check Redis before falling through to MySQL. - Successful lookups are cached for 60s ± 10% jitter (configurable via `FLEET_REDIS_HOST_CACHE_TTL`). - `NotFound` results are cached for 5s as a negative entry, dampening repeated probes for keys that do not exist (deleted hosts whose agents are still polling, attacker scans, retry storms). - Concurrent lookups for the same key collapse into one DB query via `singleflight`. The shared query runs under a context detached from any one caller's deadline so the leader giving up does not abort the work for joiners. The shared query is itself bounded by a 30s timeout so a wedged DB call cannot pin the singleflight slot indefinitely. Write path (invalidations): - These methods now invalidate the cache after a successful inner call: `UpdateHost`, `SerialUpdateHost`, `UpdateHostOsqueryIntervals`, `UpdateHostRefetchRequested`, `UpdateHostRefetchCriticalQueriesUntil`, `UpdateHostIdentityCertHostIDBySerial`, `EnrollOsquery`, `EnrollOrbit`, `NewHost`, `DeleteHost`, `DeleteHosts`, `CleanupExpiredHosts`, `CleanupIncomingHosts`, `AddHostsToTeam`. - `AddHostsToTeam`, `DeleteHosts`, `CleanupExpiredHosts`, and `CleanupIncomingHosts` use a pipelined batch invalidator so 10k-host operations stay in the millisecond range instead of taking minutes of sequential round-trips. - Inner-call errors are not invalidations: a failing write leaves cached state intact. Configuration: - New flags `FLEET_REDIS_HOST_CACHE_ENABLED` (default `true`) and `FLEET_REDIS_HOST_CACHE_TTL` (default `60s`). - Server refuses to start if the cache is enabled with `TTL <= 0`. Observability: - Three new OTEL counters under the `fleet` meter: - `fleet.host_cache.lookups{result=hit\|negative_hit\|miss}` - `fleet.host_cache.errors{op=get\|set\|del}` - `fleet.host_cache.invalidations{reason=update\|enroll\|team\|delete\|cert}` - A pre-built SigNoz dashboard ships in `tools/signoz/host_cache_dashboard.json`. # Checklist for submitter If some of the following don't apply, delete the relevant line. - [x] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`. See [Changes files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files) for more information. - [x] Timeouts are implemented and retries are limited to avoid infinite loops ## Testing - [x] Added/updated automated tests - [x] Where appropriate, [automated tests simulate multiple hosts and test for host isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing) (updates to one hosts's records do not affect another) - [x] QA'd all new/changed functionality manually <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * New Features * Optional Redis-backed host lookup cache for osquery and orbit auth, with automatic invalidation and metrics/monitoring dashboard. * Bug Fixes * Fixed host-removal batching so cache-related removals use correct chunks. * Tests * Added comprehensive host-cache unit tests covering hits, negative cache, invalidation, concurrency, and JSON round-trips. * Chores * New config flags to enable the cache and set TTL (default 60s ±10% jitter). <!-- end of auto-generated comment: release notes by coderabbit.ai -->	2026-05-01 12:06:16 -05:00
Konstantin Sykulev	6ed3ba6801	Added OTEL DB stats metrics, renamed trace attributes to expected OTEL names (#42097 ) 1. Added DB metrics via otelsql.RegisterDBStatsMetrics() `db.sql.connection.open` `db.sql.connection.max_open` `db.sql.connection.wait` `db.sql.connection.wait_duration` `db.sql.connection.closed_max_idle` `db.sql.connection.closed_max_idle_time` `db.sql.latency.` 2. renamed these metrics to signoz convention/expected names `db.sql.connection.open` -> `db.client.connection.usage` `db.sql.connection.max_open` -> `db.client.connection.max` `db.sql.connection.wait` -> `db.client.connection.wait_count` `db.sql.connection.wait_duration` -> `db.client.connection.wait_time` `db.sql.connection.closed_max_idle` -> `db.client.connection.idle.max` `db.sql.connection.closed_max_idle_time` -> `db.client.connection.idle.min` 3. created custom dashboard to display these metrics, (import via json) <img width="1580" height="906" alt="Screenshot 2026-03-19 at 2 44 43 PM" src="https://github.com/user-attachments/assets/f1b64ed6-e534-4490-8955-bc1205dd21d4" /> 4. Fixed metrics for service db dashboards Signoz expects `db.system` : Identifies the database type (e.g., postgresql, mysql, mongodb). `db.statement` : The actual query being executed (e.g., SELECT FROM users). `db.operation` : The type of operation (e.g., SELECT, INSERT). `service.name` : The name of the service making the call. We needed to set the `db.system` attribute explicitly. `db.operation` is missing because otelsql doesn't capture this by default. Decided not to add this for now as the dashboards work without. Can be a future enhancement. <img width="1563" height="487" alt="Screenshot 2026-03-19 at 2 45 18 PM" src="https://github.com/user-attachments/assets/51028e16-ee2c-45a9-9025-26f17b0db67a" /> # Checklist for submitter ## Testing - [x] QA'd all new/changed functionality manually <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * New Features * Added a new observability dashboard for database and connection performance metrics, including RPS, latency, connection pool saturation, and queue statistics. * Enhanced database metrics collection with automatic registration of connection and query performance indicators. * Standardized OpenTelemetry metric naming to align with industry conventions for improved observability compatibility. <!-- end of auto-generated comment: release notes by coderabbit.ai -->	2026-03-20 11:07:58 -05:00
Victor Lyuboslavsky	ac508b9a40	Added contributor docs for SigNoz. (#39402 ) <!-- Add the related story/sub-task/bug number, like Resolves #123, or remove if NA --> Related issue: Resolves #38607	2026-02-09 15:28:28 -06:00

Author

SHA1

Message

Date

Victor Lyuboslavsky

de86536f42

Redis-backed cache for host-by-key lookups (#43936 )

<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #43928 

This PR adds a Redis-backed cache in front of the two host-by-key
lookups on the agent auth paths.

Docs: https://github.com/fleetdm/fleet/pull/44504

## What changes

**Read path (osquery/orbit auth):**

- `LoadHostByNodeKey` and `LoadHostByOrbitNodeKey` now check Redis
before falling through to MySQL.
- Successful lookups are cached for 60s ± 10% jitter (configurable via
`FLEET_REDIS_HOST_CACHE_TTL`).
- `NotFound` results are cached for 5s as a negative entry, dampening
repeated probes for keys that
do not exist (deleted hosts whose agents are still polling, attacker
scans, retry storms).
- Concurrent lookups for the same key collapse into one DB query via
`singleflight`. The shared
query runs under a context detached from any one caller's deadline so
the leader giving up does
not abort the work for joiners. The shared query is itself bounded by a
30s timeout so a wedged
  DB call cannot pin the singleflight slot indefinitely.

**Write path (invalidations):**

- These methods now invalidate the cache after a successful inner call:
`UpdateHost`, `SerialUpdateHost`, `UpdateHostOsqueryIntervals`,
`UpdateHostRefetchRequested`,
`UpdateHostRefetchCriticalQueriesUntil`,
`UpdateHostIdentityCertHostIDBySerial`, `EnrollOsquery`,
`EnrollOrbit`, `NewHost`, `DeleteHost`, `DeleteHosts`,
`CleanupExpiredHosts`,
  `CleanupIncomingHosts`, `AddHostsToTeam`.
- `AddHostsToTeam`, `DeleteHosts`, `CleanupExpiredHosts`, and
`CleanupIncomingHosts` use a pipelined
batch invalidator so 10k-host operations stay in the millisecond range
instead of taking minutes
  of sequential round-trips.
- Inner-call errors are not invalidations: a failing write leaves cached
state intact.

**Configuration:**

- New flags `FLEET_REDIS_HOST_CACHE_ENABLED` (default `true`) and
`FLEET_REDIS_HOST_CACHE_TTL`
  (default `60s`).
- Server refuses to start if the cache is enabled with `TTL <= 0`.

**Observability:**

- Three new OTEL counters under the `fleet` meter:
  - `fleet.host_cache.lookups{result=hit|negative_hit|miss}`
  - `fleet.host_cache.errors{op=get|set|del}`
-
`fleet.host_cache.invalidations{reason=update|enroll|team|delete|cert}`
- A pre-built SigNoz dashboard ships in
`tools/signoz/host_cache_dashboard.json`.

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.
- [x] Timeouts are implemented and retries are limited to avoid infinite
loops

## Testing

- [x] Added/updated automated tests
- [x] Where appropriate, [automated tests simulate multiple hosts and
test for host
isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing)
(updates to one hosts's records do not affect another)

- [x] QA'd all new/changed functionality manually


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Optional Redis-backed host lookup cache for osquery and orbit auth,
with automatic invalidation and metrics/monitoring dashboard.

* **Bug Fixes**
* Fixed host-removal batching so cache-related removals use correct
chunks.

* **Tests**
* Added comprehensive host-cache unit tests covering hits, negative
cache, invalidation, concurrency, and JSON round-trips.

* **Chores**
* New config flags to enable the cache and set TTL (default 60s ±10%
jitter).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

2026-05-01 12:06:16 -05:00

Konstantin Sykulev

6ed3ba6801

Added OTEL DB stats metrics, renamed trace attributes to expected OTEL names (#42097 )

1. Added DB metrics via otelsql.RegisterDBStatsMetrics()
`db.sql.connection.open`
`db.sql.connection.max_open`
`db.sql.connection.wait`
`db.sql.connection.wait_duration`
`db.sql.connection.closed_max_idle`
`db.sql.connection.closed_max_idle_time`
`db.sql.latency.*`
2. renamed these metrics to signoz convention/expected names
`db.sql.connection.open` -> `db.client.connection.usage`
`db.sql.connection.max_open` -> `db.client.connection.max`
`db.sql.connection.wait` -> `db.client.connection.wait_count`
`db.sql.connection.wait_duration` -> `db.client.connection.wait_time`
`db.sql.connection.closed_max_idle` -> `db.client.connection.idle.max`
`db.sql.connection.closed_max_idle_time` ->
`db.client.connection.idle.min`
3. created custom dashboard to display these metrics, (import via json)
<img width="1580" height="906" alt="Screenshot 2026-03-19 at 2 44 43 PM"
src="https://github.com/user-attachments/assets/f1b64ed6-e534-4490-8955-bc1205dd21d4"
/>
4. Fixed metrics for service db dashboards
Signoz expects

`db.system` : Identifies the database type (e.g., postgresql, mysql,
mongodb).
`db.statement` : The actual query being executed (e.g., SELECT * FROM
users).
`db.operation` : The type of operation (e.g., SELECT, INSERT).
`service.name` : The name of the service making the call.

We needed to set the `db.system` attribute explicitly.

`db.operation` is missing because otelsql doesn't capture this by
default. Decided not to add this for now as the dashboards work without.
Can be a future enhancement.

<img width="1563" height="487" alt="Screenshot 2026-03-19 at 2 45 18 PM"
src="https://github.com/user-attachments/assets/51028e16-ee2c-45a9-9025-26f17b0db67a"
/>


# Checklist for submitter

## Testing
- [x] QA'd all new/changed functionality manually

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added a new observability dashboard for database and connection
performance metrics, including RPS, latency, connection pool saturation,
and queue statistics.
* Enhanced database metrics collection with automatic registration of
connection and query performance indicators.
* Standardized OpenTelemetry metric naming to align with industry
conventions for improved observability compatibility.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

2026-03-20 11:07:58 -05:00

Victor Lyuboslavsky

ac508b9a40

Added contributor docs for SigNoz. (#39402 )

<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #38607

2026-02-09 15:28:28 -06:00

3 commits