From 03ce7dd940348287bbd2e3a0a9d96943128c6966 Mon Sep 17 00:00:00 2001
From: Lucas Manuel Rodriguez
Date: Thu, 1 Jun 2023 14:27:58 -0300
Subject: [PATCH] Add guide to help troubleshoot live queries (#12082)

This guide collects the lessons learned while troubleshooting #10957. It
attempts to reduce pain for future oncall issues with live queries.

PS: AFAICS, this should close https://github.com/fleetdm/fleet/issues/6141.
---
 .../Troubleshooting-live-queries.md           | 140 ++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 docs/Using-Fleet/Troubleshooting-live-queries.md

diff --git a/docs/Using-Fleet/Troubleshooting-live-queries.md b/docs/Using-Fleet/Troubleshooting-live-queries.md
new file mode 100644
index 0000000000..d2737662ea
--- /dev/null
+++ b/docs/Using-Fleet/Troubleshooting-live-queries.md
@@ -0,0 +1,140 @@
+# Troubleshooting live queries
+
+## How do live queries work?
+
+The following is the lifecycle of a live query in Fleet. For simplicity, we'll assume two Fleet instances (0 and 1) and two devices (0 and 1).
+
+```mermaid
+
+sequenceDiagram
+    participant browser as Browser/fleetctl;
+    participant fleet as Fleet 0;
+    participant fleet2 as Fleet 1;
+    participant mysql as MySQL;
+    participant redis as Redis;
+    participant device0 as Device 0;
+    participant device1 as Device 1;
+
+    # Start live query campaign (stage 1)
+    browser-->>fleet: POST /api/latest/fleet/queries/run<br>query: "SELECT version from osquery_info#59;"<br>targets: Device 0, Device 1;
+    fleet-->>mysql: Create live query campaign;
+    mysql-->>fleet: Created campaign with ID 42;
+    fleet-->>redis: Store query: "SELECT version from osquery_info#59;"<br>targets: Device 0, Device 1;
+    fleet-->>browser: Campaign created with ID 42;
+
+    # Subscribe for live query campaign (stage 2)
+    browser-->>fleet: GET /api/latest/fleet/results<br>campaign with ID 42 (Upgrade websocket);
+    fleet-->>browser: Upgraded: websocket;
+    fleet-->>redis: Subscribe to live query campaign 42;
+
+    # Device 0 checks in, runs the query, and sends results back (stage 3)
+    device0-->>fleet: distributed/read (check in);
+    fleet-->>redis: Get live queries for device 0;
+    redis-->>fleet: Return "SELECT version from osquery_info#59;";
+    fleet-->>device0: "SELECT version from osquery_info#59;";
+    note right of device0: Execute<br>"SELECT version from osquery_info#59;";
+    device0-->>fleet: distributed/write results=[{"version": "5.8.2"}];
+    fleet-->>redis: Store results<br>[{"version": "5.8.2"}] for device 0, campaign 42;
+
+    redis-->>fleet: Receive results<br>[{"version": "5.8.2"}] of device 0 from subscription, campaign 42;
+    fleet-->browser: Stream websocket message with results<br>[{"version": "5.8.2"}] for device 0;
+    note left of browser: Render results<br>[{"version": "5.8.2"}] for device 0;
+
+    # Device 1 checks in, runs the query, and sends results back (stage 3)
+    device1-->>fleet2: distributed/read (check in);
+    fleet2-->>redis: Get live queries for device 1;
+    redis-->>fleet2: Return "SELECT version from osquery_info#59;";
+    fleet2-->>device1: "SELECT version from osquery_info#59;";
+    note right of device1: Execute<br>"SELECT version from osquery_info#59;";
+    device1-->>fleet2: distributed/write results=[{"version": "5.7.0"}];
+    fleet2-->>redis: Store results<br>[{"version": "5.7.0"}] for device 1, campaign 42;
+
+    redis-->>fleet: Receive results<br>[{"version": "5.7.0"}] of device 1 from subscription, campaign 42;
+    fleet-->browser: Stream websocket message with results<br>[{"version": "5.7.0"}] for device 1;
+    note left of browser: Render results<br>[{"version": "5.7.0"}] for device 1;
+```
+
+Notes:
+- Multiple Fleet instances collect results from devices and store them in Redis, but the browser or fleetctl receives results over a websocket connected to a single Fleet instance.
+
+## Troubleshooting
+
+From the diagram above, we can see that live queries have a lot of moving parts.
+Below we'll look at things that can fail when attempting to run live queries on thousands of devices.
+
+## 1. Redis
+
+Redis is used to store the results of live queries, so if live queries are not working as expected, the first thing to check is Redis.
+
+1. Check CPU and memory of the Redis instances during a live query campaign.
+2. Fleet connects to Redis as a pubsub client to retrieve query results. Redis buffers the results up to a limit, which defaults to `client-output-buffer-limit pubsub 32mb 8mb 60`.
+Change that setting in Redis to `client-output-buffer-limit pubsub 0 0 0` to remove the limits (see https://redis.io/docs/management/config-file/).
+Note: AWS ElastiCache for Redis uses different names for these settings: `client-output-buffer-limit-pubsub-hard-limit`, `client-output-buffer-limit-pubsub-soft-limit` and `client-output-buffer-limit-pubsub-soft-seconds`.
+
+## 2. Fleet
+
+Check CPU and memory of the Fleet instances during a live query campaign.
+You might need to scale Fleet vertically or horizontally if your device count is high.
+
+## 3. Network
+
+When it comes to live queries, there are multiple network connections to check:
+- Target devices connecting to Fleet.
+- Fleet connection to Redis.
+- Fleet connection to MySQL.
+- Browser websocket connection to Fleet.
+
+To verify that all these connections are working as expected, run the following dummy query:
+```sql
+SELECT 1 WHERE 1 = 0;
+```
+
+This query returns no results, but if you see "(100% responded)", all connections appear to be working.
+
+### 3.1 Websockets
+
+Live queries use websockets to stream results back to the browser.
+If the dummy query above didn't work, then your infrastructure may not be allowing websocket connections.
+One way to rule this out is to use the synchronous live query API.
+The synchronous API is a simplified implementation of live queries that does not use websockets. (It's not designed to run live queries on thousands of devices.)
+```sh
+curl \
+  -X GET \
+  -H "Authorization: Bearer $API_TOKEN" \
+  https://fleet.example.com/api/latest/fleet/queries/run \
+  -d '{"query_ids": [340], "host_ids": [375]}'
+```
+This API waits ~100 seconds by default and collects results from the hosts that checked in and successfully ran the query.
+
+## 4. Problematic query
+
+If the infrastructure is working correctly but the query is hanging or crashing osquery on devices, then results may never reach Fleet.
+
+To rule this out, run the dummy query `SELECT 1 WHERE 1 = 0;`.
+If you see "(100% responded)" with the dummy query but not with your query, then this might be an issue with:
+ - the query crashing osquery on some devices (watchdog killing the osquery process).
+ - the query hanging or taking too long to run on some devices.
+ - the query returning too many results (which may hit network limits). Try reducing the number of results by using `LIMIT N;` on the query.
+
+To troubleshoot hangs or crashes, take a look at the Orbit/osquery logs on the devices.
+
+## 5. Settings
+
+An important setting for live query campaign duration is `distributed_interval`. This value indicates how often devices check in to Fleet to run queries.
+If this value is too high, your live query might time out before getting all results.
+
+PS: At Fleet we recommend setting this between 10 and 30 seconds. (It's a sweet spot that allows quick live query responses without overloading the infrastructure.)
+
+## 6. Try fleetctl or another browser
+
+Try running the same live query with fleetctl (from the same device):
+```sh
+fleetctl query \
+  --query "SELECT version from osquery_info;" \
+  --hosts "device0,device1" \
+  --exit
+```
+If this works but the browser does not, it might be a rendering issue in the browser.
+You should also try running the live query in different browsers.
+
+ 
\ No newline at end of file
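
The fleetctl check in section 6 can be combined with the dummy query from sections 3 and 4. The following is a minimal sketch, assuming `fleetctl` is installed and already logged in. The helper names `run_live_query` and `compare_with_dummy` and the `FLEETCTL` override variable are hypothetical and not part of Fleet; only the `--query`, `--hosts`, and `--exit` flags shown in the guide are used.

```sh
#!/bin/sh
# Hypothetical helper (not part of fleetctl): run one live query from the CLI.
# FLEETCTL can be overridden, e.g. FLEETCTL=echo for a dry run that only
# prints the arguments instead of contacting Fleet.
run_live_query() {
  query="$1"   # SQL to run
  hosts="$2"   # comma-separated host identifiers
  "${FLEETCTL:-fleetctl}" query --query "$query" --hosts "$hosts" --exit
}

# Hypothetical helper: run the dummy query first, then the real query, against
# the same hosts so the two outcomes can be compared.
compare_with_dummy() {
  run_live_query "SELECT 1 WHERE 1 = 0;" "$2"
  run_live_query "$1" "$2"
}
```

For example, `compare_with_dummy "SELECT version from osquery_info;" "device0,device1"` runs both queries. If the dummy query gets responses from all hosts but the real query does not, the query itself is the likely culprit (section 4).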