Osquery watchdog
Osquery runs a watcher process to keep track of its child worker process and any managed extensions. What follows describes what happens on each pass of the watcher loop and under what circumstances the child process or managed extensions are terminated.
As a first step, the watcher checks the state of the child worker process, which could be either Alive or Non-existent. If the process is Alive, we make sure the process is within its assigned resource quota, by checking:
- The maximum CPU utilization limit is not exceeded (which is controlled by osquery's `--watchdog_latency_limit` flag).
- The maximum memory limit is not exceeded (which is controlled by osquery's `--watchdog_memory_limit` flag).
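The two checks above can be sketched as a single predicate. This is only an illustration, not osquery's implementation: the `Limits` and `Usage` types, their field names, and the units are hypothetical, and in osquery the actual thresholds come from the watchdog flags.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical limits; in osquery these are derived from watchdog flags."""
    max_cpu_percent: float
    max_memory_mb: float


@dataclass
class Usage:
    """Hypothetical snapshot of a process's resource usage."""
    cpu_percent: float
    memory_mb: float


def within_quota(usage: Usage, limits: Limits) -> bool:
    """Return True when neither the CPU nor the memory limit is exceeded."""
    return (
        usage.cpu_percent <= limits.max_cpu_percent
        and usage.memory_mb <= limits.max_memory_mb
    )
```

A process that fails either check is treated as over quota, which matches the "within its assigned resource quota" wording: both conditions must hold.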
If the child process is within the resource limits, then it is deemed alive and well. Otherwise, we terminate the process by following these steps:
- We send a `SIGUSR1` to the child process.
- We send a `SIGTERM` to the child process.
- After a delay (configured by osquery's `--watchdog_forced_shutdown_delay` flag), we send a `SIGKILL` to the child process.
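The termination sequence above can be sketched as follows, assuming a POSIX system. The function name `terminate_worker` and its parameters are hypothetical. The `send` and `sleep` arguments exist only so the sequence can be exercised without signalling a real process. Whether osquery checks that the process has already exited before sending `SIGKILL` is not covered here.

```python
import os
import signal
import time
from typing import Callable


def terminate_worker(
    pid: int,
    forced_shutdown_delay: float,
    send: Callable[[int, int], None] = os.kill,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send SIGUSR1, then SIGTERM, wait, then SIGKILL (POSIX only)."""
    try:
        send(pid, signal.SIGUSR1)
        send(pid, signal.SIGTERM)
        # Give the process time to shut down gracefully before forcing it.
        sleep(forced_shutdown_delay)
        send(pid, signal.SIGKILL)
    except ProcessLookupError:
        # The process is already gone, so there is nothing left to signal.
        pass
```

Passing recording callables for `send` and `sleep` shows the order of operations without touching a live process.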
If the child process is Non-existent, either because it didn't exist in the first place or because it was terminated, the watcher will try to spawn a new child process. But first, it checks whether the maximum number of allowed process respawns has been reached. If it has, the osquery process shuts down.
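A minimal sketch of the respawn check, under stated assumptions: the `Respawner` class, the `RespawnLimitReached` exception, and the `max_respawns` value are hypothetical names for illustration, and the actual limit and bookkeeping inside osquery may differ.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class RespawnLimitReached(Exception):
    """Raised when no further respawns are allowed (osquery would shut down)."""


@dataclass
class Respawner:
    max_respawns: int  # hypothetical limit, not an osquery flag name
    respawns: int = 0

    def respawn(self, spawn: Callable[[], T]) -> T:
        """Check the limit first, then spawn and count the attempt."""
        if self.respawns >= self.max_respawns:
            raise RespawnLimitReached(
                f"respawn limit of {self.max_respawns} reached"
            )
        self.respawns += 1
        return spawn()
```

The limit is checked before spawning, mirroring the order described above: the watcher only attempts a new spawn while budget remains.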
After checking the state of the child worker, we check the state of every managed extension, which could be Alive or Non-existent.
If the managed extension is Alive, the watcher will check both the CPU utilization and memory consumption (the same checks we perform for the child process). If the managed extension is deemed unstable, we terminate the extension by following these steps:
- We send a `SIGTERM` to the managed extension.
- After a delay (configured by osquery's `--watchdog_forced_shutdown_delay` flag), we send a `SIGKILL` to the managed extension.
If the managed extension is Non-existent (either because it was Non-existent in the first place or because it was terminated due to resource contention), the watcher will try to launch the managed extension. But first, it checks the respawn limit. If the respawn limit has been reached, or if for some reason the extension could not be spawned, the osquery process is shut down.
Lastly, we check the state of the watcher process itself. If it is deemed unhealthy because of resource contention, then the osquery process is shut down.
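Putting the pieces together, one pass of the watcher can be modeled as a pure decision function. This is a simplified sketch rather than osquery's code: the `Proc` type, its fields, and the action strings are hypothetical. It assumes that a terminated process is respawned on a later pass, and it applies the spawn-failure check to the worker as well, even though the description above only mentions it for extensions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    ALIVE = "alive"
    NON_EXISTENT = "non-existent"


@dataclass
class Proc:
    name: str
    state: State
    within_limits: bool = True  # result of the CPU and memory checks
    respawns_left: int = 1      # hypothetical remaining respawn budget
    can_spawn: bool = True      # whether a spawn attempt would succeed


def check(proc: Proc) -> str:
    """Decide what to do with one worker or extension."""
    if proc.state is State.ALIVE:
        return "ok" if proc.within_limits else "terminate"
    if proc.respawns_left <= 0 or not proc.can_spawn:
        return "shutdown"
    return "spawn"


def watcher_pass(
    worker: Proc, extensions: List[Proc], watcher_healthy: bool
) -> List[str]:
    """Return the actions for one pass: worker, then extensions, then watcher."""
    actions: List[str] = []
    for proc in [worker, *extensions]:
        action = check(proc)
        if action == "shutdown":
            return actions + ["shutdown osquery"]
        if action != "ok":
            actions.append(f"{action} {proc.name}")
    if not watcher_healthy:
        actions.append("shutdown osquery")
    return actions
```

For example, a worker that is over its quota and an extension that is missing would yield a terminate action for the worker followed by a spawn action for the extension.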