Commit graph

28 commits

Author SHA1 Message Date
Victor Lyuboslavsky
8207c3e01d
Added OTEL support for cron jobs. (#33083)
Fixes #32331 

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.

## Testing

- [x] QA'd all new/changed functionality manually


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- New Features
- Expanded OpenTelemetry coverage to include scheduled jobs for improved
tracing of cron executions.
- Documentation
  - Updated changelog to reflect expanded OpenTelemetry coverage.
- Refactor
- Updated scheduled job processing to use contextual tracing across
operations with improved span lifecycle handling.
- Tests
  - Adjusted tests to pass context into scheduled job runs.
- Chores
  - Removed obsolete commented debug logs in SCIM middleware.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2025-09-17 17:02:38 -05:00
Scott Gress
1a1d7bae78
Clear cron schedule errors before each run (#26775)
For #26657

This PR fixes an issue that causes cron monitoring alerts to be sent
repeatedly after the first instance; that is, if a cron job fails once
then the monitor reports the failure every time it runs until the server
is restarted. This was due to the errors being held in the Schedule
object which persists for the lifetime of the process, rather than being
recreated for each run. The solution is to clear the errors from the
Schedule object before each run.

Added a test that fails on main and passes on this branch.
2025-03-03 16:41:48 -06:00
Scott Gress
6bd9cc8a44
Monitor and alert on errors in cron jobs (#24347)
for #19930 

# Checklist for submitter

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality

# Details

This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.
dogfood instance will post to a separate slack channel).

The actual changes are:

**On the server side:**

- Add errors field to cron_stats table (json DEFAULT NULL)
- Added errors var to `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect err from job into new errors var
- Update `Schedule.updateStats`and `CronStats.UpdateCronStats`to accept
errors argument
- If provided, update errors field of cron_stats table

**On the monitor side:**

- Add new SQL query to look for all completed schedules since last run
with non-null errors
- send SNS with job ID, name, errors

# Testing

New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local mysql and
localstack instances:

```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
2024-12-19 15:55:29 -06:00
Jahziel Villasana-Espinoza
a23980347b
feat: initial readme for cron jobs (#23563)
We discussed at backend sync today (2024-11-05) that we'd like to start
adding READMEs in the codebase for very tactical documentation.

This is an inital README for the cron/scheduling machinery.
2024-11-06 09:13:45 -05:00
Victor Lyuboslavsky
f85b6f776f
Updating golangci-lint to 1.61.0 (#22973) 2024-10-18 12:38:26 -05:00
Tim Lee
1ecdad24ad
Remove panic recovery in CI tests (#22644) 2024-10-09 18:29:14 -06:00
Gabriel Hernandez
dec951f9f6 Merge branch 'main' into feat-fleet-app-library 2024-09-13 13:54:10 +01:00
Victor Lyuboslavsky
3eccbb1bd0
Uninstall migration cron job (#22036) 2024-09-12 20:07:56 -05:00
Martin Angers
b30e765554
Maintained Apps: add cron job, fix ingestion following latest specs (#21959) 2024-09-10 16:27:18 -04:00
Martin Angers
51709eadb6
Bugfix: cron startup scheduling is delayed too long if no prior run exists (#21784) 2024-09-03 15:50:43 -04:00
Victor Lyuboslavsky
fdfc12982b
Improvements to go tests in CI (#21545)
#21546 
Some improvements to overall go test CI run time.
2024-08-26 08:55:53 -05:00
Roberto Dip
1cc13a09fb
🧹 friday cleanup party: substitute deprecated import of go-kit (#19774)
`go-kit/kit/log` was deprecated and generating warnings

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

<!-- Note that API documentation changes are now addressed by the
product design team. -->

- [x] Manual QA for all new/changed functionality
2024-06-17 10:27:31 -03:00
Lucas Manuel Rodriguez
e8ca959888
Add enterprise integration test for calendar events (#17900)
Integration tests for the calendar feature: #17441.

Adding coverage screenshots for the calendar cron and the osquery
distributed/write coverage:

![Screenshot 2024-03-27 at 14 20
44](https://github.com/fleetdm/fleet/assets/2073526/40d394ab-2208-4bec-981b-fe22fae8b5c1)
![Screenshot 2024-03-27 at 14 21
20](https://github.com/fleetdm/fleet/assets/2073526/1e4c8611-21ba-48a6-82f8-a163594f7f01)
2024-04-04 14:58:31 -03:00
Martin Angers
c5b988d600
Fix stack trace of captured errors in Sentry, capture errors in more code paths (#16966)
#16480 

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [x] Added/updated tests
- [x] Manual QA for all new/changed functionality
2024-02-22 15:10:28 -03:00
Lucas Manuel Rodriguez
b0475d998e
Run cleanup of cron_stats outside of the schedule package to prevent outages from breaking cron jobs (#10439)
#9486

Now cron jobs should recover from a Fleet outage after ~ two hours.

- [X] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or
docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for
new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
  - ~For Orbit and Fleet Desktop changes:~
- ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.~
- ~[ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).~
2023-03-13 16:15:30 -03:00
Lucas Manuel Rodriguez
93e150666b
Add instanceID to schedule logging (#10413)
#9515

Sample output after running `fleetctl trigger --name
cleanups_then_aggregation`:
```sh
./build/fleet serve --dev --dev_license 2>&1 | tee ~/fleet.txt
level=info ts=2023-03-09T19:27:17.324691Z component=redis mode=standalone
level=info ts=2023-03-09T19:27:17.360565Z instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA=="
level=info ts=2023-03-09T19:27:17.372767Z msg="started cron schedules: automations, cleanups_then_aggregation, integrations, usage_statistics, vulnerabilities"
ts=2023-03-09T19:27:17.391404Z transport=https address=0.0.0.0:8080 msg=listening
level=error ts=2023-03-09T19:27:19.973841Z query=fleet_detail_query_software_macos message="distributed query is denylisted" hostID=58
level=info ts=2023-03-09T19:27:21.262799Z cron=cleanups_then_aggregation schedule=cleanups_then_aggregation instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA==" status=pending
ts=2023-03-09T19:27:22.218129Z inf="skipping verification of encryption keys as MDM is not fully configured"
level=info ts=2023-03-09T19:27:22.224179Z cron=cleanups_then_aggregation schedule=cleanups_then_aggregation instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA==" status=completed
```

- [X] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or
docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for
new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
  - ~For Orbit and Fleet Desktop changes:~
- ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.~
- ~[ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).~
2023-03-13 15:37:03 -03:00
Roberto Dip
aa7466b819
fix test race in schedule mock (#10309)
This fixes the races that are occurring on tests
([example](https://github.com/fleetdm/fleet/actions/runs/4339799935))
2023-03-06 12:24:40 -03:00
gillespi314
21c6733c1b
Release schedule lock when triggered run spans schedule interval (#10240) 2023-03-03 12:14:10 -06:00
Lucas Manuel Rodriguez
c4733a4792
Fix race on TestTriggerMultipleInstances (#9122) 2022-12-26 12:32:37 -03:00
gillespi314
836553ba60
Fix cron trigger bug (#8950) 2022-12-16 12:00:42 -06:00
gillespi314
6fb3a87ae9
Enable errcheck linter for golangci-lint (#8899) 2022-12-05 16:50:49 -06:00
gillespi314
d5c096fa02
Implement schedule triggers (#8747) 2022-11-28 13:28:06 -06:00
gillespi314
4a73d4a887
Adjust flaky tests (#8811) 2022-11-25 12:09:55 -06:00
gillespi314
b99ce3865b
Adjust cron schedule tests (#8754) 2022-11-18 11:26:51 -06:00
gillespi314
267aaf0dbe
Add holdLock and releaseLock methods to schedule package (#8464) 2022-11-16 15:14:38 -06:00
gillespi314
34688f531a
Refactor webhooks cron to new schedule package (#7840) 2022-09-20 14:26:36 -05:00
gillespi314
6a3d9959fc
Refactor vulnerabilities cron to scheduler package (#7650) 2022-09-16 10:08:51 -05:00
gillespi314
e2194be61c
Add schedule package and refactor cron jobs for cleanups, aggregations, and usage statistics (#6618) 2022-08-10 11:00:56 -05:00