fleet/server/datastore
Scott Gress 6bd9cc8a44
Monitor and alert on errors in cron jobs (#24347)
for #19930 

# Checklist for submitter

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality

# Details

This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.
dogfood instance will post to a separate slack channel).

The actual changes are:

**On the server side:**

- Add errors field to cron_stats table (json DEFAULT NULL)
- Added errors var to `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect err from job into new errors var
- Update `Schedule.updateStats`and `CronStats.UpdateCronStats`to accept
errors argument
- If provided, update errors field of cron_stats table

**On the monitor side:**

- Add new SQL query to look for all completed schedules since last run
with non-null errors
- send SNS with job ID, name, errors

# Testing

New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local mysql and
localstack instances:

```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
2024-12-19 15:55:29 -06:00
..
cached_mysql Enable staticcheck Go linter. (#23487) 2024-11-05 11:16:24 -06:00
filesystem Feature branch for the Store Bootstrap Packages in S3 story (#21260) 2024-08-13 08:27:10 -04:00
mysql Monitor and alert on errors in cron jobs (#24347) 2024-12-19 15:55:29 -06:00
mysqlredis Support matching a host in orbit enrollment using the serial number (#9612) 2023-02-28 12:55:04 -05:00
redis Updating golangci-lint to 1.61.0 (#22973) 2024-10-18 12:38:26 -05:00
s3 Feature branch for the Store Bootstrap Packages in S3 story (#21260) 2024-08-13 08:27:10 -04:00