Fixes#32331
# Checklist for submitter
- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
## Testing
- [x] QA'd all new/changed functionality manually
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- New Features
- Expanded OpenTelemetry coverage to include scheduled jobs for improved
tracing of cron executions.
- Documentation
- Updated changelog to reflect expanded OpenTelemetry coverage.
- Refactor
- Updated scheduled job processing to use contextual tracing across
operations with improved span lifecycle handling.
- Tests
- Adjusted tests to pass context into scheduled job runs.
- Chores
- Removed obsolete commented debug logs in SCIM middleware.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
For #26657
This PR fixes an issue that causes cron monitoring alerts to be sent
repeatedly after the first instance; that is, if a cron job fails once
then the monitor reports the failure every time it runs until the server
is restarted. This was due to the errors being held in the Schedule
object which persists for the lifetime of the process, rather than being
recreated for each run. The solution is to clear the errors from the
Schedule object before each run.
Added a test that fails on main and passes on this branch.
for #19930
# Checklist for submitter
- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality
# Details
This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.
dogfood instance will post to a separate slack channel).
The actual changes are:
**On the server side:**
- Add errors field to cron_stats table (json DEFAULT NULL)
- Added errors var to `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect err from job into new errors var
- Update `Schedule.updateStats`and `CronStats.UpdateCronStats`to accept
errors argument
- If provided, update errors field of cron_stats table
**On the monitor side:**
- Add new SQL query to look for all completed schedules since last run
with non-null errors
- send SNS with job ID, name, errors
# Testing
New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local mysql and
localstack instances:
```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found. Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
We discussed at backend sync today (2024-11-05) that we'd like to start
adding READMEs in the codebase for very tactical documentation.
This is an inital README for the cron/scheduling machinery.
`go-kit/kit/log` was deprecated and generating warnings
# Checklist for submitter
If some of the following don't apply, delete the relevant line.
<!-- Note that API documentation changes are now addressed by the
product design team. -->
- [x] Manual QA for all new/changed functionality
#16480
# Checklist for submitter
- [x] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [x] Added/updated tests
- [x] Manual QA for all new/changed functionality
#9486
Now cron jobs should recover from a Fleet outage after ~ two hours.
- [X] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or
docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for
new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
- ~For Orbit and Fleet Desktop changes:~
- ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.~
- ~[ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).~
#9515
Sample output after running `fleetctl trigger --name
cleanups_then_aggregation`:
```sh
./build/fleet serve --dev --dev_license 2>&1 | tee ~/fleet.txt
level=info ts=2023-03-09T19:27:17.324691Z component=redis mode=standalone
level=info ts=2023-03-09T19:27:17.360565Z instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA=="
level=info ts=2023-03-09T19:27:17.372767Z msg="started cron schedules: automations, cleanups_then_aggregation, integrations, usage_statistics, vulnerabilities"
ts=2023-03-09T19:27:17.391404Z transport=https address=0.0.0.0:8080 msg=listening
level=error ts=2023-03-09T19:27:19.973841Z query=fleet_detail_query_software_macos message="distributed query is denylisted" hostID=58
level=info ts=2023-03-09T19:27:21.262799Z cron=cleanups_then_aggregation schedule=cleanups_then_aggregation instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA==" status=pending
ts=2023-03-09T19:27:22.218129Z inf="skipping verification of encryption keys as MDM is not fully configured"
level=info ts=2023-03-09T19:27:22.224179Z cron=cleanups_then_aggregation schedule=cleanups_then_aggregation instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA==" status=completed
```
- [X] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or
docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for
new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
- ~For Orbit and Fleet Desktop changes:~
- ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.~
- ~[ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).~