9 commits

Author SHA1 Message Date
Victor Lyuboslavsky
de86536f42
Redis-backed cache for host-by-key lookups (#43936)
**Related issue:** Resolves #43928 

This PR adds a Redis-backed cache in front of the two host-by-key
lookups on the agent auth paths.

Docs: https://github.com/fleetdm/fleet/pull/44504

## What changes

**Read path (osquery/orbit auth):**

- `LoadHostByNodeKey` and `LoadHostByOrbitNodeKey` now check Redis
  before falling through to MySQL.
- Successful lookups are cached for 60s ± 10% jitter (configurable via
  `FLEET_REDIS_HOST_CACHE_TTL`).
- `NotFound` results are cached for 5s as a negative entry, dampening
  repeated probes for keys that do not exist (deleted hosts whose agents
  are still polling, attacker scans, retry storms).
- Concurrent lookups for the same key collapse into one DB query via
  `singleflight`. The shared query runs under a context detached from any
  one caller's deadline so the leader giving up does not abort the work
  for joiners. The shared query is itself bounded by a 30s timeout so a
  wedged DB call cannot pin the singleflight slot indefinitely.
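
A minimal, stdlib-only sketch of the read-path ideas above: TTL selection with jitter and a short negative TTL, plus a hand-rolled stand-in for `singleflight` whose shared query runs under a detached, time-bounded context. All identifiers (`ttlFor`, `group`, `errNotFound`, the constants) are hypothetical and chosen for illustration; this is not the code from the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Hypothetical names for illustration; the values mirror the description above.
var errNotFound = errors.New("host not found")

const (
	baseTTL      = 60 * time.Second // default positive TTL
	negativeTTL  = 5 * time.Second  // TTL for NotFound entries
	queryTimeout = 30 * time.Second // bound on the shared DB query
)

// ttlFor returns the short negative TTL for NotFound results and the
// base TTL scaled by a random factor in [0.9, 1.1) otherwise.
func ttlFor(err error, base time.Duration) time.Duration {
	if errors.Is(err, errNotFound) {
		return negativeTTL
	}
	return time.Duration(float64(base) * (0.9 + 0.2*rand.Float64()))
}

// call is one in-flight lookup shared by every caller asking for the same key.
type call struct {
	done chan struct{}
	host string
	err  error
}

// group is a tiny hand-rolled stand-in for singleflight.Group.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// do runs load at most once per key at a time. The query context is detached
// from the first caller's cancellation and bounded by queryTimeout.
func (g *group) do(ctx context.Context, key string, load func(context.Context) (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	c, inFlight := g.calls[key]
	if !inFlight {
		c = &call{done: make(chan struct{})}
		g.calls[key] = c
		go func() {
			qctx, cancel := context.WithTimeout(context.WithoutCancel(ctx), queryTimeout)
			defer cancel()
			c.host, c.err = load(qctx)
			g.mu.Lock()
			delete(g.calls, key)
			g.mu.Unlock()
			close(c.done) // publishes c.host and c.err to all waiters
		}()
	}
	g.mu.Unlock()

	select {
	case <-c.done:
		return c.host, c.err
	case <-ctx.Done():
		return "", ctx.Err() // this caller gave up; the shared query keeps running
	}
}

func main() {
	var g group
	host, err := g.do(context.Background(), "node-key-1", func(context.Context) (string, error) {
		return "host-42", nil
	})
	fmt.Println(host, err, ttlFor(err, baseTTL))
	fmt.Println(ttlFor(errNotFound, baseTTL))
}
```

`context.WithoutCancel` requires Go 1.21 or later. A caller whose own context expires simply stops waiting, while joiners still receive the result of the shared query.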

**Write path (invalidations):**

- These methods now invalidate the cache after a successful inner call:
  `UpdateHost`, `SerialUpdateHost`, `UpdateHostOsqueryIntervals`,
  `UpdateHostRefetchRequested`, `UpdateHostRefetchCriticalQueriesUntil`,
  `UpdateHostIdentityCertHostIDBySerial`, `EnrollOsquery`, `EnrollOrbit`,
  `NewHost`, `DeleteHost`, `DeleteHosts`, `CleanupExpiredHosts`,
  `CleanupIncomingHosts`, `AddHostsToTeam`.
- `AddHostsToTeam`, `DeleteHosts`, `CleanupExpiredHosts`, and
  `CleanupIncomingHosts` use a pipelined batch invalidator so 10k-host
  operations stay in the millisecond range instead of taking minutes of
  sequential round-trips.
- Inner-call errors do not trigger invalidation: a failing write leaves
  cached state intact.
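
The write-path rule above (invalidate only after the inner call succeeds, and invalidate in batches) can be sketched as follows. The `keyDeleter` interface, `memCache`, and `invalidateAfter` are hypothetical names; a Redis-backed implementation would presumably send each batch as one pipeline, which this in-memory stand-in only counts as one round trip. How the real code reacts to a failed invalidation is not stated in the description, so returning the error here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errWriteFailed is a hypothetical error standing in for a failed datastore write.
var errWriteFailed = errors.New("write failed")

// keyDeleter is a hypothetical cache abstraction; each DeleteBatch call
// models one pipelined round trip.
type keyDeleter interface {
	DeleteBatch(keys []string) error
}

// memCache is an in-memory stand-in that counts round trips.
type memCache struct {
	entries    map[string]string
	roundTrips int
}

func (m *memCache) DeleteBatch(keys []string) error {
	m.roundTrips++
	for _, k := range keys {
		delete(m.entries, k)
	}
	return nil
}

// invalidateAfter runs the inner write and invalidates keys in chunks
// only if the write succeeded.
func invalidateAfter(c keyDeleter, keys []string, chunkSize int, write func() error) error {
	if err := write(); err != nil {
		return err // failed write: cached state stays intact
	}
	if chunkSize < 1 {
		chunkSize = 1
	}
	for start := 0; start < len(keys); start += chunkSize {
		end := start + chunkSize
		if end > len(keys) {
			end = len(keys)
		}
		if err := c.DeleteBatch(keys[start:end]); err != nil {
			return err // assumption: surface invalidation errors to the caller
		}
	}
	return nil
}

func main() {
	c := &memCache{entries: map[string]string{"a": "1", "b": "2", "c": "3"}}
	keys := []string{"a", "b", "c"}

	err := invalidateAfter(c, keys, 2, func() error { return errWriteFailed })
	fmt.Println(err, len(c.entries), c.roundTrips)

	err = invalidateAfter(c, keys, 2, func() error { return nil })
	fmt.Println(err, len(c.entries), c.roundTrips)
}
```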

**Configuration:**

- New flags `FLEET_REDIS_HOST_CACHE_ENABLED` (default `true`) and
  `FLEET_REDIS_HOST_CACHE_TTL` (default `60s`).
- Server refuses to start if the cache is enabled with `TTL <= 0`.
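
As a rough illustration of the startup check, here is a sketch with a hypothetical `hostCacheConfig` struct and `validate` method. The real config type and error text are not shown in this description, so these names are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// hostCacheConfig is a hypothetical mirror of the two new flags.
type hostCacheConfig struct {
	Enabled bool          // FLEET_REDIS_HOST_CACHE_ENABLED
	TTL     time.Duration // FLEET_REDIS_HOST_CACHE_TTL
}

var errInvalidTTL = errors.New("host cache TTL must be positive when the cache is enabled")

// defaultConfig returns the documented defaults: enabled, 60s TTL.
func defaultConfig() hostCacheConfig {
	return hostCacheConfig{Enabled: true, TTL: 60 * time.Second}
}

// validate rejects an enabled cache with a zero or negative TTL.
func (c hostCacheConfig) validate() error {
	if c.Enabled && c.TTL <= 0 {
		return errInvalidTTL
	}
	return nil
}

func main() {
	fmt.Println(defaultConfig().validate())
	fmt.Println(hostCacheConfig{Enabled: true, TTL: 0}.validate())
	fmt.Println(hostCacheConfig{Enabled: false, TTL: 0}.validate())
}
```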

**Observability:**

- Three new OTEL counters under the `fleet` meter:
  - `fleet.host_cache.lookups{result=hit|negative_hit|miss}`
  - `fleet.host_cache.errors{op=get|set|del}`
  - `fleet.host_cache.invalidations{reason=update|enroll|team|delete|cert}`
- A pre-built SigNoz dashboard ships in
`tools/signoz/host_cache_dashboard.json`.

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.
- [x] Timeouts are implemented and retries are limited to avoid infinite
loops

## Testing

- [x] Added/updated automated tests
- [x] Where appropriate, [automated tests simulate multiple hosts and
test for host
isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing)
(updates to one host's records do not affect another)

- [x] QA'd all new/changed functionality manually


## Summary by CodeRabbit

* **New Features**
  * Optional Redis-backed host lookup cache for osquery and orbit auth,
    with automatic invalidation and metrics/monitoring dashboard.

* **Bug Fixes**
  * Fixed host-removal batching so cache-related removals use correct
    chunks.

* **Tests**
  * Added comprehensive host-cache unit tests covering hits, negative
    cache, invalidation, concurrency, and JSON round-trips.

* **Chores**
  * New config flags to enable the cache and set TTL (default 60s ±10%
    jitter).
2026-05-01 12:06:16 -05:00
Victor Lyuboslavsky
9628f49cb8
Improved the performance of Windows MDM profile reconciliation (#44075)
**Related issue:** Resolves #44052 

Improve performance by shortening the synchronous API call that updates
profiles or switches teams, and by spreading out the application of
profiles: the reconciler processes 2000 hosts every 30 seconds.

1. **Windows profile reconciliation is no longer synchronous to
bulk-set.**
Apple, Android, and Apple-declaration paths still write their pending
state inside the bulk-set transaction. The Windows path commits the
transactional inputs and lets the existing `mdm_windows_profile_manager`
cron pick the work up on its next tick. The visible effect is that
`host_mdm_windows_profiles` is no longer guaranteed to be populated by
the time bulk-set returns; it converges within one cron interval.

2. **The Windows reconciler now processes hosts in bounded batches, with
a persisted cursor.**
Previous behavior was "scan the universe of pending Windows hosts on
every tick." New behavior is a host-window query bounded by batch size
and a `host_uuid` cursor, advanced after the batch commits successfully
and persisted across ticks. A failed tick leaves the cursor untouched so
the same window is retried.

3. **Two replication races are now explicitly handled.**
- Admin-delete vs reconcile: the existence check the reconciler uses to
avoid touching a just-deleted profile reads from the primary, not a
replica.
- Insert lag in the reconciler's own listings: hosts that appear in the
cursor query but are not yet visible in the scoped listings advance the
cursor instead of jamming the loop.

4. **`updates.WindowsConfigProfile` from `BulkSetPendingMDMHostProfiles`
is now always false in production.**
The only consumer ORs it with the transactional signal from
`BatchSetMDMProfiles`, which is the accurate source. The bulk-set call
no longer attempts to compute or return that activity signal itself.

5. **Tests opt in to the old synchronous behavior via a named hook.**
Default test behavior matches production (deferred). Legacy tests whose
assertions require Windows rows immediately after bulk-set call an
explicit enable-hook and rely on `t.Cleanup` to restore.
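
The bounded-batch reconciliation with a persisted cursor from item 2 can be sketched in memory as follows. The `reconciler` type, its fields, and the wrap-around when the window is empty are assumptions made for illustration; the real implementation queries MySQL and persists the `host_uuid` cursor across cron ticks.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errApply is a hypothetical error standing in for a failed batch commit.
var errApply = errors.New("batch failed")

// reconciler is a hypothetical in-memory model of the cron's state.
type reconciler struct {
	pending   []string // sorted host UUIDs awaiting reconciliation
	cursor    string   // last host_uuid whose batch committed
	batchSize int
}

// nextBatch returns up to batchSize UUIDs strictly greater than the cursor.
func (r *reconciler) nextBatch() []string {
	start := sort.Search(len(r.pending), func(i int) bool {
		return r.pending[i] > r.cursor
	})
	end := start + r.batchSize
	if end > len(r.pending) {
		end = len(r.pending)
	}
	return r.pending[start:end]
}

// tick processes one window. The cursor advances only after apply succeeds,
// so a failed tick retries the same window next time.
func (r *reconciler) tick(apply func(batch []string) error) error {
	batch := r.nextBatch()
	if len(batch) == 0 {
		r.cursor = "" // assumption: wrap around and rescan from the start
		return nil
	}
	if err := apply(batch); err != nil {
		return err // cursor untouched
	}
	r.cursor = batch[len(batch)-1]
	return nil
}

func main() {
	r := &reconciler{pending: []string{"a", "b", "c", "d", "e"}, batchSize: 2}
	for i := 0; i < 3; i++ {
		err := r.tick(func(batch []string) error {
			fmt.Println("reconciling", batch)
			return nil
		})
		fmt.Println("cursor:", r.cursor, "err:", err)
	}
}
```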

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.

## Testing

- [x] Added/updated automated tests
- [x] QA'd all new/changed functionality manually


## Summary by CodeRabbit

* **New Features**
  * Windows MDM profile reconciliation batching improvements enable large
    team transfers and bulk profile change operations to complete faster,
    with profile updates rolling out in the background without blocking host
    check-ins or other MDM activity.
2026-04-28 15:37:43 -05:00
Zach Wasserman
0cdde239b9
Add activity feed entries for host deletion and expiration (#34720)
**Related issue:** Resolves #33513 

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.

## Testing

- [x] Added/updated automated tests
- [x] QA'd all new/changed functionality manually
2025-10-31 09:37:31 -07:00
Lucas Manuel Rodriguez
33a15831c0
Add missing platform_like during orbit enrollment (#32671)
#30877

We need to send `platform_like` during orbit enrollment for a proper
setup experience on Linux.

If some of the following don't apply, delete the relevant line.

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.

## Testing

- [X] Added/updated automated tests

- [X] QA'd all new/changed functionality manually

## fleetd/orbit/Fleet Desktop

- [X] Verified compatibility with the latest released version of Fleet
(see [Must
rule](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/workflows/fleetd-development-and-release-strategy.md))
- [ ] Verified that fleetd runs on macOS, Linux and Windows
- [x] Verified auto-update works from the released version of component
to the new version (see [tools/tuf/test](../tools/tuf/test/README.md))
2025-09-05 16:05:19 -03:00
Victor Lyuboslavsky
85a98d83dd
Refactor EnrollOrbit/EnrollHost (#30872)
Fixes #30473 

Refactor the `Datastore.EnrollHost` and `Datastore.EnrollOrbit` methods
to use functional options. This refactor comes before adding new options
to those methods and should make the code more maintainable and easier
to understand.

No functional changes here. Just refactoring.
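
For readers unfamiliar with the pattern, here is a generic sketch of functional options in Go. The option names (`withNodeKey`, `withHardwareSerial`, `withTeamID`) and the `enrollConfig` struct are hypothetical and do not reflect the actual signatures introduced by this PR.

```go
package main

import "fmt"

// enrollConfig collects the parameters an enrollment call might need.
// Field names are hypothetical.
type enrollConfig struct {
	nodeKey        string
	hardwareSerial string
	teamID         *uint
}

// enrollOption mutates the config; each option sets one parameter.
type enrollOption func(*enrollConfig)

func withNodeKey(key string) enrollOption {
	return func(c *enrollConfig) { c.nodeKey = key }
}

func withHardwareSerial(serial string) enrollOption {
	return func(c *enrollConfig) { c.hardwareSerial = serial }
}

func withTeamID(id uint) enrollOption {
	return func(c *enrollConfig) { c.teamID = &id }
}

// enrollOrbit applies the options in order and returns the resulting
// config; a real method would go on to write the host record.
func enrollOrbit(opts ...enrollOption) enrollConfig {
	var cfg enrollConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := enrollOrbit(withNodeKey("abc123"), withTeamID(7))
	fmt.Println(cfg.nodeKey, *cfg.teamID, cfg.hardwareSerial == "")
}
```

Adding a new parameter later means adding one more option function, without touching existing call sites.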

# Checklist for submitter

- [x] Added/updated automated tests



## Summary by CodeRabbit

* **Refactor**
  * Streamlined host and Orbit enrollment methods to use a flexible
    options-based pattern instead of fixed parameter lists.
  * Updated related tests and service logic to use the new options
    approach, improving clarity and extensibility for enrollment operations.

* **New Features**
  * Introduced configuration options for host and Orbit enrollment,
    allowing more explicit and customizable parameter setting during
    enrollment.

2025-07-15 17:22:02 -03:00
Scott Gress
59f96651b6
Update to Go 1.24.1 (#27506)
For #26713 

# Details

This PR updates Fleet and its related tools and binaries to use Go
version 1.24.1.

Scanning through the changelog, I didn't see anything relevant to Fleet
that requires action. The only possible breaking change I spotted was:

> As [announced](https://tip.golang.org/doc/go1.23#linux) in the Go 1.23
release notes, Go 1.24 requires Linux kernel version 3.2 or later.

Linux kernel 3.2 was released in January of 2012, so I think we can
commit to dropping support for earlier kernel versions.

The new [tools directive](https://tip.golang.org/doc/go1.24#tools) is
interesting as it means we can move away from using `tools.go` files,
but it's not a required update.

# Checklist for submitter

If some of the following don't apply, delete the relevant line.


- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [x] Manual QA for all new/changed functionality
- For Orbit and Fleet Desktop changes:
  - [X] Make sure fleetd is compatible with the latest released version of
    Fleet
  - [x] Orbit runs on macOS, Linux and Windows.
  - [x] Manual QA must be performed in the three main OSs, macOS,
    Windows and Linux.
2025-03-31 11:14:09 -05:00
Martin Angers
e3ddb5f3ce
Support matching a host in orbit enrollment using the serial number (#9612)
2023-02-28 12:55:04 -05:00
gillespi314
94dd1c3745
Ingest pending MDM hosts (#9065)
Co-authored-by @roperzh
2022-12-26 15:32:39 -06:00
Martin Angers
81f0e0ccfa
Track active hosts count and enforce limit (#6099)
2022-06-13 16:29:32 -04:00