Commit graph

8 commits

Author SHA1 Message Date
Victor Lyuboslavsky
6fc7132350
Trigger vuln processing when it runs on a separate server (#39612)
<!-- Add the related story/sub-task/bug number, like Resolves #123, or
remove if NA -->
**Related issue:** Resolves #35239

Docs PR: #39770

## Remote trigger approach
When FLEET_VULNERABILITIES_DISABLE_SCHEDULE=true, the main Fleet server
registers a RemoteTriggerSchedule instead of the real vulnerability
schedule. When a user runs fleetctl trigger --name=vulnerabilities:
1. Main server: RemoteTriggerSchedule.Trigger() inserts a cron_stats
record with status=queued.
2. Worker server: The vulnerability schedule runs with
WithTriggerPollInterval(60s), which starts a poll goroutine that checks
the DB every 60s for queued records.
3. Pickup: When the poll goroutine finds a queued record, it sends the
stats ID on the trigger channel (non-blocking).
4. Execution: The trigger handler acquires the lock, claims the record
via ClaimCronStats (updating status to pending and instance to the
actual worker ID), runs all jobs, and marks it completed.

Key details:
- The trigger channel carries an int: 0 for in-process triggers, >0 for
DB-polled stats IDs. This lets runWithStats reuse the existing record
instead of inserting a new one.
- Both Schedule.Trigger() and RemoteTriggerSchedule.Trigger() treat
pending and queued as conflicts to prevent duplicate runs.
- Queued records expire after 2 hours via CleanupCronStats, same as
pending records.
- The poll goroutine only signals; it doesn't modify DB state. The
handler claims when ready.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added support for remote trigger execution in vulnerability scheduling
workflows.
* Implemented periodic polling mechanism to detect and process
externally triggered vulnerability scans.

* **Bug Fixes**
* Enhanced trigger status tracking to properly handle queued scan jobs.

* **Improvements**
* Strengthened scheduling system with improved timeout and cancellation
management capabilities.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2026-02-17 09:18:03 -06:00
Scott Gress
6bd9cc8a44
Monitor and alert on errors in cron jobs (#24347)
for #19930 

# Checklist for submitter

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality

# Details

This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.
dogfood instance will post to a separate slack channel).

The actual changes are:

**On the server side:**

- Add errors field to cron_stats table (json DEFAULT NULL)
- Added errors var to `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect err from job into new errors var
- Update `Schedule.updateStats`and `CronStats.UpdateCronStats`to accept
errors argument
- If provided, update errors field of cron_stats table

**On the monitor side:**

- Add new SQL query to look for all completed schedules since last run
with non-null errors
- send SNS with job ID, name, errors

# Testing

New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local mysql and
localstack instances:

```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
2024-12-19 15:55:29 -06:00
Victor Lyuboslavsky
f85b6f776f
Updating golangci-lint to 1.61.0 (#22973) 2024-10-18 12:38:26 -05:00
Roberto Dip
aa7466b819
fix test race in schedule mock (#10309)
This fixes the races that are occurring on tests
([example](https://github.com/fleetdm/fleet/actions/runs/4339799935))
2023-03-06 12:24:40 -03:00
gillespi314
21c6733c1b
Release schedule lock when triggered run spans schedule interval (#10240) 2023-03-03 12:14:10 -06:00
gillespi314
836553ba60
Fix cron trigger bug (#8950) 2022-12-16 12:00:42 -06:00
gillespi314
d5c096fa02
Implement schedule triggers (#8747) 2022-11-28 13:28:06 -06:00
gillespi314
267aaf0dbe
Add holdLock and releaseLock methods to schedule package (#8464) 2022-11-16 15:14:38 -06:00