Commit graph

29 commits

Author SHA1 Message Date
Victor Lyuboslavsky
70ffac6341
Incremental migration to slog (#40120)
**Related issue:** Resolves #40054 

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [ ] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
  - Already added in previous PR

## Testing

- [x] Added/updated automated tests
- [x] QA'd all new/changed functionality manually


## Summary by CodeRabbit

* **Refactor**
  * Updated internal logging infrastructure across multiple server
    components to use standardized logging methods and improved context
    propagation.

2026-02-19 15:35:35 -06:00
Scott Gress
9a6a366b3b
Improve performance when recording schedule query results (#38524)
**Related issue:** Resolves #35603

# Details

This PR aims to optimize the system for recording scheduled query
results in the database. Previously, each time a result set was received
from a host, the Fleet server would count all of the current result rows
in the db for that query before deciding whether to save more. This
count becomes more expensive as the DB size grows, until it becomes the
long pole in the recording process. With this PR, the system changes
in the following ways:

* When result rows are received from the host, no count is immediately
taken. Instead, a Redis key is checked which holds a current approximate
count of rows in the table. If the count is over the configured row
limit, no rows are saved. Otherwise, rows are saved and the count is
adjusted accordingly (it can go down, e.g. if a host previously returned
5 rows for a query and now returns 3). Keep in mind that we only store
one set of results per host for a scheduled query; when a host reports
results for a query, we delete that host's previous results and write the
new ones if there's room.
* As an additional failsafe against runaway queries, if a result set
contains more than 1000 rows, it is rejected.
* Once a minute, a cron job runs which deletes all rows over the limit
for each query and resets the counter for all queries to the actual
number of rows in the table.
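
A minimal sketch of the counting flow described in the list above. To keep it runnable on its own, the Redis key is replaced by an in-memory map; `approxCounter`, `shouldSave`, and `reset` are hypothetical names rather than identifiers from this PR, and treating a count that is exactly at the limit as "no room" is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// maxRowsPerResultSet models the failsafe from the description: result sets
// with more than 1000 rows are rejected outright.
const maxRowsPerResultSet = 1000

// approxCounter is a hypothetical in-memory stand-in for the Redis key that
// holds the approximate row count per query.
type approxCounter struct {
	mu     sync.Mutex
	counts map[uint]int // query ID -> approximate number of stored rows
}

func newApproxCounter() *approxCounter {
	return &approxCounter{counts: make(map[uint]int)}
}

// shouldSave decides whether a host's new result set may be written.
// prevRows is the number of rows this host previously stored for the query.
func (c *approxCounter) shouldSave(queryID uint, prevRows, newRows, limit int) bool {
	if newRows > maxRowsPerResultSet {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	// Assumption: a count exactly at the limit is treated as "no room".
	if c.counts[queryID] >= limit {
		return false
	}
	// The host's old rows are replaced, so the adjustment can be negative.
	c.counts[queryID] += newRows - prevRows
	return true
}

// reset models the once-a-minute cron job, which sets the counter to the
// actual number of rows left in the table after cleanup.
func (c *approxCounter) reset(queryID uint, actualRows int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[queryID] = actualRows
}

func main() {
	c := newApproxCounter()
	const limit = 8
	fmt.Println(c.shouldSave(1, 0, 5, limit))    // first report: 5 rows
	fmt.Println(c.shouldSave(1, 5, 3, limit))    // same host now returns 3 rows
	fmt.Println(c.shouldSave(1, 0, 2000, limit)) // oversized result set
	c.reset(1, 3)                                // cron job resyncs the counter
}
```

Because the check and the adjustment are not coordinated with the database write, the counter is only approximate, which is why the periodic reset is needed.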

The end result is:

* No more expensive counts on every distributed write request for
scheduled queries
* Results for a single query can burst to over the limit for a short
time, but will get cleaned up after a minute
* Because of concurrency and race issues where multiple hosts might get
the same count from Redis before inserting rows, the actual number of
results in the db can burst higher than the limit. In testing with
osquery-perf with 1000 hosts started simultaneously, sending 500 rows at
a time, a 50,000 row limit and a query running every 10 seconds, I saw
the table get up to 60,000 rows at times before being cleaned up. This is
a very bad case; in the real world we'd have a lot more jitter in the
reporting, and queries would not typically return this many rows.

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.

- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)

## Testing

- [X] Added/updated automated tests
Added a new test to verify that results are still discarded if table
size is > limit, updated existing tests.
- [X] Where appropriate, [automated tests simulate multiple hosts and
test for host
isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing)
(updates to one host's records do not affect another)

- [X] QA'd all new/changed functionality manually
Ran osquery-perf with 1000 hosts and a 50,000 row limit per query, using
queries that returned 1, 500 and 1000 rows at a time. Verified that the
limits were respected (subject to the amount of flex discussed above).
I'm doing some A/B tests now using local MySQL metrics and will report
back.


## Summary by CodeRabbit

* **New Features**
  * Automated periodic cleanup of excess query results to retain recent
    data and free storage
  * Redis-backed query result counting to track per-query result volumes

* **Performance Improvements**
  * Optimized recording of scheduled query results for reduced overhead
  * Cleanup runs in configurable batches to lower database contention and
    balance storage use

2026-01-27 10:33:47 -06:00
Lucas Manuel Rodriguez
b5626e17e6
Fix lingering live queries keys in Redis (#33928)
Resolves #33254

This can be reproduced locally by running the following "high load"
test:

Run 500 hosts using osquery-perf:
```
go run ./cmd/osquery-perf --enroll_secret ... \
  --host_count 500 \
  --server_url https://localhost:8080 \
  --live_query_fail_prob 0.0 \
  --live_query_no_results_prob 0.0 \
  --orbit_prob 0.0 \
  --http_message_signature_prob 0.0
```

Run `stress_test_live_queries.sh`:
```
#!/bin/bash

while true; do
        curl -v -k -X POST -H "Authorization: Bearer $TEST_TOKEN" https://localhost:8080/api/latest/fleet/queries/$SAVED_QUERY_ID/run -d '{"host_ids": [<500 comma-separated host ids>]}'
done
```

Use "Redis Insight" or the like and you will start to see
`livequery:{$CAMPAIGN_ID}` keys with `No limit` (which is the bug):

<img width="1380" height="227" alt="Screenshot 2025-10-07 at 3 10 26 PM"
src="https://github.com/user-attachments/assets/30434348-3217-40c4-8ebc-bab5ceb4daa9"
/>

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.

## Testing

- [x] Added/updated automated tests

- [x] QA'd all new/changed functionality manually


## Summary by CodeRabbit

- Bug Fixes
  - Prevent lingering Redis keys for live queries by ensuring keys are
    cleaned up and not recreated when completing/canceling non-existent
    queries.
  - Improves resource usage and avoids stale state in live query
    processing.

- Tests
  - Added tests verifying proper retrieval/completion behavior and that no
    Redis key is created for non-existent live queries.


---------

Co-authored-by: Ian Littman <iansltx@gmail.com>
2025-10-08 06:36:38 -03:00
Tim Lee
9a09b52201
Fix flakey livequery test (#21666) 2024-08-29 10:03:45 -06:00
Tim Lee
52cbb3e10f
17379 cache live queries (#21387) 2024-08-26 10:32:57 -06:00
Martin Angers
6c0e56ea73
Address multiple redis-related issues observed with live queries (#16855)
#16331 

Doc updates in a separate PR:
https://github.com/fleetdm/fleet/pull/17214

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [x] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [x] Added/updated tests
- [x] Manual QA for all new/changed functionality (smoke-tested locally
with osquery-perf simulating 100 hosts, ran a live query, a saved live
query, stopped naturally and stopped before the end, and again via
fleetctl)

---------

Co-authored-by: Victor Lyuboslavsky <victor@fleetdm.com>
Co-authored-by: Victor Lyuboslavsky <victor.lyuboslavsky@gmail.com>
2024-02-27 19:35:27 -06:00
Tomas Touceda
8457e55b53
Bump go to 1.19.1 (#7690)
* Bump go to 1.19.1

* Bump remaining go-version to the 1.19.1

* Add extra paths for test-go

* Oops, putting the right path in the right place

* gofmt file

* gofmt ALL THE THINGS

* Moar changes

* Actually, go.mod doesn't like minor versions
2022-09-12 20:32:43 -03:00
Martin Angers
81f0e0ccfa
Track active hosts count and enforce limit (#6099) 2022-06-13 16:29:32 -04:00
Lucas Manuel Rodriguez
11af33e9a1
Allow troubleshooting of mocked live query store (#6197) 2022-06-13 10:18:03 -03:00
Martin Angers
afb3310937
Migrate team-related endpoints to new pattern (#3740) 2022-01-19 10:52:14 -05:00
Martin Angers
4143a37056
Fix redis scan keys issue for live queries (#3107) 2021-12-14 16:30:26 -05:00
Martin Angers
69a4985cac
Use new error handling approach in other packages (#2954) 2021-11-22 09:13:26 -05:00
Martin Angers
a8735d55bb
Implement async processing of hosts for label queries (#2288) 2021-11-01 14:13:16 -04:00
Martin Angers
057d4e8b2e
Add configuration and support for Redis to read from replicas (#2509) 2021-10-18 09:32:17 -04:00
Martin Angers
1fa5ce16b8
Add configurable Redis connection retries and following of cluster redirections (#2045)
Closes #1969
2021-09-15 08:50:32 -04:00
Tomas Touceda
b2efc9f51c
Make redis conn timeout and keep alive configurable (#1968)
* Make redis conn timeout and keep alive configurable

* Document new configs

* Correct config name
2021-09-08 17:55:12 -03:00
Martin Angers
9a0871a2f1
Address issues related to Redis Cluster support (#1885)
Closes #1847.
2021-09-01 16:32:57 -04:00
Tomas Touceda
3d8a766ca1
Make receive calls to redis conn thread safe (#1641)
* Make receive calls to redis conn thread safe

Also removes REDIS_TEST env var. Redis is lightweight and fast, no need
to skip these tests.

* No need to increase the wait
2021-08-11 17:34:35 -03:00
Zach Wasserman
c5280c0517
Add v4 suffix in go.mod (#1224) 2021-06-25 21:46:51 -07:00
Zach Wasserman
2ad557e3b3 Merge branch 'main' into teams 2021-06-18 09:42:20 -07:00
dsbaha
47b423ee29
Add Redis cluster support (#1045)
This should support Redis in both cluster and non-cluster modes.

Updates were made separately to github.com/throttled/throttled to support the slight changes in types.

Co-authored-by: Joseph Macaulay <joseph.macaulay@uber.com>
Co-authored-by: Zach Wasserman <zach@fleetdm.com>
2021-06-18 08:51:47 -07:00
Zach Wasserman
fb32f0cf40
Remove kolide types and packages from backend (#974)
Generally renamed `kolide` -> `fleet`
2021-06-06 15:07:29 -07:00
Zach Wasserman
9f71fcf440
Speed up MySQL tests (#585)
Improves MySQL test time (on my 2020 MBP) to ~18s from ~125s.

- Use separate databases for each test to allow parallelization.
- Run migrations only once at beginning of tests and then reload
  generated schema.
- Add `--innodb-file-per-table=OFF` for ~20% additional speedup.
2021-04-03 11:42:27 -07:00
Mike Arpaia
af96e52a00
Update the Go import paths to new repo name (#27) 2020-11-11 09:59:12 -08:00
Kilian
c61ba759dd
Add redis use_tls cfg (#2311)
Adding config parameter 'redis.use_tls' to enable tls communications with redis e.g. AWS ElastiCache

Closes #2247
2020-10-01 16:25:48 -07:00
Stephan Miehe
cf4d8ecfee
Add redis database number support (#2269)
Fixes #2268
2020-07-30 08:57:25 -07:00
Zachary Wasserman
7494513400 Clean up and comments before merge. 2020-07-21 14:05:46 -07:00
Zachary Wasserman
7f757d3144 Extract functionName into helper
Cleans up some repetition in tests.
2020-07-21 14:05:46 -07:00
Zachary Wasserman
0502412e15 Move live query operations from MySQL to Redis
This change optimizes live queries by pushing the computation of query
targets to the creation time of the query, and efficiently caching the
targets in Redis. This results in a huge performance improvement both at
steady state and when running live queries.

- Live queries are stored using a bitfield in Redis, and take
advantage of bitfield operations to be extremely efficient.

- Only run Redis live query test when REDIS_TEST is set in environment

- Ensure that live queries are only sent to hosts when there is a client
listening for results. Addresses an existing issue in Fleet along with
appropriate cleanup for the refactored live query backend.
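
A minimal sketch of the bitfield idea mentioned in the list above, modeled in memory so it runs without a Redis server. The commit message does not say how bits map to hosts or which Redis commands are used; the mapping "bit N is set when host ID N is targeted" and the names `hostBitfield`, `set`, and `has` are assumptions for illustration. The bit order follows Redis string bit addressing, where offset 0 is the most significant bit of the first byte.

```go
package main

import "fmt"

// hostBitfield is a hypothetical in-memory model of a Redis string value
// addressed bit by bit. Assumption: bit N is set when host ID N is targeted.
type hostBitfield []byte

// set marks hostID as targeted, growing the byte slice as needed.
func (b *hostBitfield) set(hostID uint) {
	idx := int(hostID / 8)
	for len(*b) <= idx {
		*b = append(*b, 0)
	}
	// Offset 0 is the most significant bit of the first byte.
	(*b)[idx] |= byte(1) << (7 - hostID%8)
}

// has reports whether hostID is targeted; bits past the end read as unset.
func (b hostBitfield) has(hostID uint) bool {
	idx := int(hostID / 8)
	if idx >= len(b) {
		return false
	}
	return b[idx]&(byte(1)<<(7-hostID%8)) != 0
}

func main() {
	var targets hostBitfield
	for _, id := range []uint{3, 10, 500} {
		targets.set(id)
	}
	// Three targeted hosts with IDs up to 500 fit in 63 bytes.
	fmt.Println(targets.has(10), targets.has(11), len(targets)) // true false 63
}
```

Storing one bit per host keeps the target set compact, and a membership check is a single bit lookup rather than a database query.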
2020-07-21 14:05:46 -07:00