fleet/server/service/schedule
Scott Gress 6bd9cc8a44
Monitor and alert on errors in cron jobs (#24347)
for #19930 

# Checklist for submitter

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality

# Details

This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.,
the dogfood instance will post to a separate Slack channel).

The actual changes are:

**On the server side:**

- Add an `errors` field to the `cron_stats` table (`json DEFAULT NULL`)
- Add an `errors` var to the `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect `err` from each job into the new `errors` var
- Update `Schedule.updateStats` and `CronStats.UpdateCronStats` to accept
an errors argument
- If provided, update the `errors` field of the `cron_stats` table

**On the monitor side:**

- Add a new SQL query to look for all completed schedules since the last run
with non-null errors
- Send an SNS alert with the job ID, name, and errors

# Testing

New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local MySQL and
LocalStack instances:

```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
}
```

2024-12-19 15:55:29 -06:00

schedule: the Fleet cron job machinery

Fleet has several pieces of functionality that are implemented as cron jobs, which run on a schedule. Package schedule implements the machinery needed for queueing and running these jobs.

List of cron jobs

See server/fleet/cron_schedules.go for a list of the currently implemented cron jobs and information about what they do.

Cron jobs are created and registered in the cmd/fleet package because they have to be run at server start. The cron job logic itself usually lives elsewhere, however, typically in a service layer method (and related datastore methods).

How to add a new cron job

See this PR for a nice example of how to add a simple cron job.

  1. Do you need a new cron job? You can add sub-jobs to an existing cron job; for example, if you're adding some functionality for cleaning up unused data, you might want to implement it as a sub-job in the cleanups_then_aggregation cron.
  2. Add a cron job name. If you determine that you do need a new cron job, create a descriptive name in cron_schedules.go. Make sure you leave a comment explaining what the job does.
  3. Implement your functionality. Do this wherever it makes sense. In the example PR, the functionality exists in the server/mdm/maintainedapps/ingest.go file. However, you'll most likely implement a service layer method and related datastore layer methods.
  4. Add a function that returns a *schedule.Schedule in cmd/fleet/cron.go. This function will be used to register your cron job so it can actually run. This function should call whatever you implemented in step 3. This is also where you can set the interval on which your cron job will run.
  5. Register the cron job in cmd/fleet/serve.go. Use cronSchedules.StartCronSchedule to register the cron job by passing it an anonymous function that calls the function you wrote in step 4.