mirror of
https://github.com/fleetdm/fleet
synced 2026-04-21 21:47:20 +00:00
# Debugging

## Goals of this guide

This is NOT meant to be an exhaustive list of possible issues in Fleet and how to solve them.

This is a guide for going from a vague statement such as "things are not working correctly" to a narrower, more
specific assessment. This doesn't necessarily mean a solution, but with a more specific assessment, it'll be easier for
the Engineering team to help.

Note that even if you follow all these steps, the Engineering team might have follow-up questions.
## Basic data that is needed

While it's not strictly needed 100% of the time, in most cases it's extremely useful to have a clear understanding of
the basic characteristics of the Fleet deployment that is experiencing the issue:

- Total number of hosts.
- Number of online hosts.
- Number of scheduled queries.
- Number and size (CPU/Mem) of the Fleet instances.
- Fleet instances' CPU and memory usage while the issue has been happening.
- MySQL flavor/version in use.
- MySQL server capacity (CPU/Mem).
- MySQL CPU and memory usage while the issue has been happening.
- Are MySQL read replicas configured? If so, how many?
- Redis version and server capacity (CPU/Mem).
- Is Redis running in cluster mode?
- Redis CPU and memory usage while the issue has been happening.
- The output of `fleetctl debug archive`.
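One way to keep track of which items from the list above are still missing is a small checklist structure. The following is only a sketch: the class name, field names, and helper are hypothetical and chosen for this example; they are not part of Fleet or `fleetctl`.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DeploymentFacts:
    """Checklist mirroring the basic data list; all field names are hypothetical."""

    total_hosts: Optional[int] = None
    online_hosts: Optional[int] = None
    scheduled_queries: Optional[int] = None
    fleet_instances: Optional[str] = None         # number and size (CPU/Mem)
    fleet_usage: Optional[str] = None             # CPU/memory usage during the issue
    mysql_version: Optional[str] = None           # flavor/version
    mysql_capacity: Optional[str] = None          # CPU/Mem
    mysql_usage: Optional[str] = None             # CPU/memory usage during the issue
    mysql_read_replicas: Optional[int] = None     # 0 means "none configured"
    redis_version_capacity: Optional[str] = None  # version and CPU/Mem
    redis_cluster_mode: Optional[bool] = None
    redis_usage: Optional[str] = None             # CPU/memory usage during the issue
    debug_archive_path: Optional[str] = None      # where the `fleetctl debug archive` output was saved


def missing_facts(facts: DeploymentFacts) -> List[str]:
    """Return the names of checklist items that have not been filled in yet."""
    return [f.name for f in fields(facts) if getattr(facts, f.name) is None]
```

Calling `missing_facts(DeploymentFacts(total_hosts=5000))` would list every item except `total_hosts`, which gives a quick view of what is still to be collected before reaching out.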
## Triaging the issue

The first step in understanding an issue better is figuring out in which area of the system it is happening. There
are two main areas an issue might fall into: server side or client side.

A server-side issue is one where one of the few pieces of infrastructure on the server encounters a problem. Some of
these pieces are the MySQL database, the load balancer, and a Fleet server instance.

A client-side issue is one where the issue occurs in the software that runs on the hosts (i.e. the machines that run
osquery/orbit/Fleet Desktop).

There are issues that span both areas, but in most cases the issue happens in one area and the other shows more of a
symptom rather than the issue itself. So we'll continue this text with the assumption that multi-area issues are rare
and that, even when facing one, the following should help narrow it down.

While classifying issues as client side or server side is easy, it's also not realistic. So let's expand the
categories a bit more and "mark" each of them with a keyword:

1. Fleet itself (the binary/docker image running the Fleet API): `SERVER`
   1. A specific part of the Fleet UI is slow: `PARTIALSERVER`
2. MySQL: `MYSQL`
3. Redis: `REDIS`
4. Infrastructure: `INFRA`
5. osquery / orbit / Fleet Desktop: `OSQUERY`

With these areas in mind, here's a list of possible issues and the area you should look into for each:

- A specific device (or a handful of devices) is not behaving as expected -> `OSQUERY`
- A specific device appears online but its last fetch time is old -> `OSQUERY`
- The Fleet UI is slow overall -> `SERVER`
- A specific page (or a handful of pages, but not all) in the Fleet UI is slow -> `PARTIALSERVER`
- New devices cannot enroll -> `OSQUERY`
- Live query results come in very slowly -> `REDIS` or `SERVER`
- osquery extensions are not working correctly -> `OSQUERY`
- fleetctl is getting errors when applying YAML files -> `SERVER`
- Migrations are taking too long -> `MYSQL`
- I see connection/network errors in the fleetctl or osquery logs, but not in my Fleet logs -> `INFRA`
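The symptom-to-area list above can be read as a simple lookup table. The sketch below restates it in Python; the short symptom keys and the `areas_for` helper are hypothetical names invented for this example, while the area keywords come from this guide.

```python
from typing import Dict, List

# Symptom -> area(s) to investigate, copied from the list in this guide.
# The short symptom keys are hypothetical labels chosen for this sketch.
TRIAGE_AREAS: Dict[str, List[str]] = {
    "device_misbehaving": ["OSQUERY"],
    "device_online_but_stale_fetch": ["OSQUERY"],
    "ui_slow_overall": ["SERVER"],
    "ui_slow_specific_pages": ["PARTIALSERVER"],
    "devices_cannot_enroll": ["OSQUERY"],
    "live_query_results_slow": ["REDIS", "SERVER"],
    "osquery_extensions_broken": ["OSQUERY"],
    "fleetctl_apply_errors": ["SERVER"],
    "migrations_slow": ["MYSQL"],
    "network_errors_only_on_client": ["INFRA"],
}


def areas_for(symptom: str) -> List[str]:
    """Return the areas to look into, or an empty list for an unknown symptom."""
    return TRIAGE_AREAS.get(symptom, [])
```

For example, `areas_for("live_query_results_slow")` returns both `REDIS` and `SERVER`, matching the one entry in the list that points at two areas.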
### SERVER

When diagnosing a server-side issue, one of the first steps is to look at Fleet itself. In particular, that means
looking at the logs across all running instances. How to look at these logs varies depending on your
deployment. If, for instance, it's an AWS deployment and you're using our Terraform files as guidance, you'd use
CloudWatch.

By default, Fleet logs errors, and those are the first thing to look for. If you have debug logging enabled, you can
find errors by filtering on the keyword `err`.

These logs are the first way to triage a server-side error. For example, if there are timeouts happening in APIs,
you should continue by looking at `MYSQL` and then `REDIS`. Otherwise, if the error is more descriptive, this
would be a good point to reach out with all the information gathered.

If there are no errors in the logs and everything looks normal, check `INFRA`.
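If the logs have been exported to a file, filtering on the `err` keyword can be done with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the sample log lines are made up for illustration, so real Fleet output will look different.

```python
from typing import Iterable, List


def filter_error_lines(lines: Iterable[str], keyword: str = "err") -> List[str]:
    """Keep only the log lines that contain the keyword (case-sensitive substring match)."""
    return [line.rstrip("\n") for line in lines if keyword in line]


# Hypothetical log lines, invented for this example; they are not real Fleet output.
SAMPLE_LOGS = [
    'level=debug msg="request handled" took=12ms',
    'level=error msg="list hosts" err="context deadline exceeded"',
    'level=info msg="listening"',
]
```

Here `filter_error_lines(SAMPLE_LOGS)` keeps only the second line. Because this is a plain substring match, it also matches words such as `error` or `stderr`, which is usually acceptable for a first pass.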
### PARTIALSERVER

Sometimes Fleet operates without any errors, but accessing a specific part of the web UI is slow. As a starting point,
it would be good to get a screenshot of the Network tab in your browser's Developer Tools. The main columns that need
to be visible are Name, Status, and Time (in Chrome's terms).
[Here's how to accomplish this using Google Chrome](https://developer.chrome.com/docs/devtools/network/).

Besides this, it might be good to continue with `MYSQL` and `REDIS`.

Depending on the API, there will likely be follow-up questions about the amount of data, but this would be a good point
to check in with Engineering.
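As a complement to a screenshot, the Network tab can usually export the captured requests as a HAR file, which is JSON. The sketch below assumes such an export and lists the slowest requests with the same three pieces of data (name, status, time). The function names are hypothetical, and the sample URLs are placeholders rather than real Fleet endpoints.

```python
import json
from typing import Any, Dict, List, Tuple


def load_har(path: str) -> Dict[str, Any]:
    """Load a HAR file exported from the browser's Network tab."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def slowest_requests(har: Dict[str, Any], limit: int = 5) -> List[Tuple[str, int, float]]:
    """Return (url, status, time_ms) for the slowest entries, slowest first."""
    rows = [
        (entry["request"]["url"], entry["response"]["status"], float(entry["time"]))
        for entry in har["log"]["entries"]
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


# Hypothetical HAR content, reduced to the fields used above; URLs are placeholders.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {"request": {"url": "https://fleet.example.com/a"}, "response": {"status": 200}, "time": 120.5},
            {"request": {"url": "https://fleet.example.com/b"}, "response": {"status": 200}, "time": 4300.0},
            {"request": {"url": "https://fleet.example.com/c"}, "response": {"status": 500}, "time": 900},
        ]
    }
}
```

With the sample data, `slowest_requests(SAMPLE_HAR, limit=2)` returns the `/b` request first and the `/c` request second. A HAR file can contain cookies and tokens, so review it before sharing.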
### MYSQL

Most of the data needed to understand an issue in MySQL should already have been gathered as part of the basic data
specified at the beginning of this document. However, there is a chance that Fleet is running with a database user that
is not capable of querying the information needed. The following queries are a good first step in gathering
information:

```sql
-- InnoDB internals: transactions, deadlocks, file I/O, and buffer pool usage.
show engine innodb status;
-- Currently running threads and how long each one has been in its current state.
show processlist;
```

If read replicas are configured, another important piece of data is whether any replication lag has been
registered.

With all this gathered, it's a good time to reach out to the Engineering team.
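When the `show processlist` output is long, it can help to narrow it down to the statements that have been running the longest before sharing it. The sketch below assumes the rows have already been fetched as dictionaries keyed by the standard processlist column names (`Id`, `Command`, `Time`, `Info`, and so on); the function name, the 10-second threshold, and the sample rows are assumptions made for this example.

```python
from typing import Any, Dict, List


def long_running(rows: List[Dict[str, Any]], min_seconds: int = 10) -> List[Dict[str, Any]]:
    """Return non-idle processlist rows running at least min_seconds, longest first."""
    busy = [
        row for row in rows
        if row["Command"] != "Sleep" and int(row["Time"]) >= min_seconds
    ]
    return sorted(busy, key=lambda row: int(row["Time"]), reverse=True)


# Hypothetical rows, invented for this example.
SAMPLE_ROWS = [
    {"Id": 1, "Command": "Sleep", "Time": 300, "Info": None},
    {"Id": 2, "Command": "Query", "Time": 45, "Info": "SELECT ..."},
    {"Id": 3, "Command": "Query", "Time": 2, "Info": "SELECT ..."},
]
```

With the sample rows, only the row with `Id` 2 is returned: the idle connection is skipped and the 2-second query is below the threshold.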
### REDIS

In most cases, the data gathered at the beginning of this document should be enough to understand what might be
happening with Redis. However, if more details are needed, running the
[`MONITOR` command](https://redis.io/commands/monitor/) should shed more light on the issue.

**WARNING**: if Redis is suffering from performance issues, running `MONITOR` will only make the problem worse.

A less invasive option, if ElastiCache (or another system with more reporting) is being used, is to check other
metrics. Current connections, replication lag (if applicable), whether one instance is used much more heavily than the
others in cluster mode, and the number of commands per key type could all help identify what is wrong.
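If you can run commands against Redis directly (for example with `redis-cli`), the `INFO` command is another low-overhead source for some of these numbers, such as `connected_clients`. The sketch below parses saved `INFO` output into a dictionary. The function name is hypothetical, and the sample text uses made-up values.

```python
from typing import Dict


def parse_redis_info(text: str) -> Dict[str, str]:
    """Parse the key:value lines of Redis INFO output, skipping blanks and '# Section' headers."""
    stats: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        stats[key] = value
    return stats


# Made-up values in the INFO text format (CRLF-separated lines).
SAMPLE_INFO = (
    "# Clients\r\n"
    "connected_clients:42\r\n"
    "\r\n"
    "# Memory\r\n"
    "used_memory_human:1.50G\r\n"
    "\r\n"
    "# Stats\r\n"
    "instantaneous_ops_per_sec:1200\r\n"
)
```

Values are kept as strings so that nothing is lost in conversion; convert individual fields (for example `int(stats["connected_clients"])`) only where you need to compare them.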
### OSQUERY

Just like with the Fleet server, the best way to understand issues on the client side is to look at logs.

If you are running vanilla osquery on the host, please restart osquery with the `--tls_dump` and `--verbose` flags.
This will allow us to see more details about what's happening in the communication with Fleet (or lack thereof). Check
the [official documentation](https://osquery.readthedocs.io/en/stable/deployment/logging/) for details on how to locate
the logs and other configurations.

If you are running Orbit, you should add `--debug` to the command line options. This will get debug logs for Orbit and
also for osquery automatically. Check the [Orbit README](https://github.com/fleetdm/fleet/blob/main/orbit/README.md#logs)
for more details on where to find Orbit-specific logs.

If you are running Fleet Desktop, no change is needed. You should see the log file in one of the following directories,
depending on the platform:

- Linux: `$XDG_STATE_HOME/Fleet` or `$HOME/.local/state/Fleet`
- macOS: `$HOME/Library/Logs/Fleet`
- Windows: `%LocalAppData%/Fleet`

The log file name is `fleet-desktop.log`.
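The Fleet Desktop log locations above can be turned into a small helper for scripts that collect logs from hosts. This is a sketch under two assumptions: that `$HOME/.local/state` is used on Linux only when `XDG_STATE_HOME` is unset, and that the Windows variable is available under the key `LOCALAPPDATA`. The function name and the platform strings are hypothetical.

```python
import os
from typing import Mapping, Optional

LOG_FILE_NAME = "fleet-desktop.log"


def fleet_desktop_log_path(platform: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build the expected Fleet Desktop log path for 'linux', 'darwin', or 'windows'."""
    env = os.environ if env is None else env
    if platform == "linux":
        # Assumption: prefer $XDG_STATE_HOME and fall back to $HOME/.local/state.
        state_home = env.get("XDG_STATE_HOME") or f"{env['HOME']}/.local/state"
        directory = f"{state_home}/Fleet"
    elif platform == "darwin":
        directory = f"{env['HOME']}/Library/Logs/Fleet"
    elif platform == "windows":
        # Assumption: %LocalAppData% is exposed under the key LOCALAPPDATA.
        directory = f"{env['LOCALAPPDATA']}/Fleet"
    else:
        raise ValueError(f"unsupported platform: {platform}")
    return f"{directory}/{LOG_FILE_NAME}"
```

Passing `env` explicitly makes the helper easy to check without touching the real environment, for example `fleet_desktop_log_path("darwin", {"HOME": "/Users/alice"})`.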

If the issue is related to osquery extensions, the following data would be needed:

- osquery version
- OS it's running on
- What does the extension do?
- How is the extension queried/deployed?
- What language is the extension implemented in?
- What's the nature of the problem? (e.g. the extension keeps respawning, the extension can't connect,
  or the extension is up and working and then dies and can't reconnect)

With this data, it's time to reach out to Engineering.
### INFRA

At this level, what you want to look into are load balancer (LB) logs, errors, and configurations. For instance, does
the LB have a request size limit? If the LB is not terminating TLS, is that configured properly on the Fleet side?

Make sure as well that your cloud provider is not having issues of its own. For instance,
[here](https://health.aws.amazon.com/health/status) is where to check the status of AWS.

<meta name="pageOrderInSection" value="600">