<!-- Add the related story/sub-task/bug number, like Resolves#123, or
remove if NA -->
**Related issue:** Resolves#38889
PLEASE READ BELOW before looking at file changes
Before converting individual files/packages to slog, we generally need
to make these 2 changes to make the conversion easier:
- Replace uses of `kitlog.With` since they are not fully compatible with
our kitlog adapter
- Directly use the kitlog adapter logger type instead of the kitlog
interface, which will let us have direct access to the underlying slog
logger: `*logging.Logger`
Note: that I did not replace absolutely all uses of `kitlog.Logger`, but
I did remove all uses of `kitlog.With` except for these due to
complexity:
- server/logging/filesystem.go and the other log writers (webhook,
firehose, kinesis, lambda, pubsub, nats)
- server/datastore/mysql/nanomdm_storage.go (adapter pattern)
- server/vulnerabilities/nvd/* (cascades to CLI tools)
- server/service/osquery_utils/queries.go (callback type signatures
cascade broadly)
- cmd/maintained-apps/ (standalone, so can be transitioned later all at
once)
Most of the changes in this PR follow these patterns:
- `kitlog.Logger` type → `*logging.Logger`
- `kitlog.With(logger, ...)` → `logger.With(...)`
- `kitlog.NewNopLogger() → logging.NewNopLogger()`, including similar
variations such as `logging.NewLogfmtLogger(w)` and
`logging.NewJSONLogger(w)`
- removed many now-unused kitlog imports
Unique changes that the PR review should focus on:
- server/platform/logging/kitlog_adapter.go: Core adapter changes
- server/platform/logging/logging.go: New convenience functions
- server/service/integration_logger_test.go: Test changes for slog
# Checklist for submitter
If some of the following don't apply, delete the relevant line.
- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- Was added in previous PR
## Testing
- [x] Added/updated automated tests
- [x] QA'd all new/changed functionality manually
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Refactor**
* Migrated the codebase to a unified internal structured logging system
for more consistent, reliable logs and observability.
* No user-facing functionality changed; runtime behavior and APIs remain
compatible.
* **Tests**
* Updated tests to use the new logging helpers to ensure consistent test
logging and validation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
<!-- Add the related story/sub-task/bug number, like Resolves#123, or
remove if NA -->
**Related issue:** Resolves#35434
- added child span for worker jobs
- removed getTracer since that is no longer needed with the modern OTEL
library, which caches the tracers itself
# Checklist for submitter
## Testing
- [x] QA'd all new/changed functionality manually
Fixes#32331
# Checklist for submitter
- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
## Testing
- [x] QA'd all new/changed functionality manually
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
- New Features
- Expanded OpenTelemetry coverage to include scheduled jobs for improved
tracing of cron executions.
- Documentation
- Updated changelog to reflect expanded OpenTelemetry coverage.
- Refactor
- Updated scheduled job processing to use contextual tracing across
operations with improved span lifecycle handling.
- Tests
- Adjusted tests to pass context into scheduled job runs.
- Chores
- Removed obsolete commented debug logs in SCIM middleware.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
For #26657
This PR fixes an issue that causes cron monitoring alerts to be sent
repeatedly after the first instance; that is, if a cron job fails once
then the monitor reports the failure every time it runs until the server
is restarted. This was due to the errors being held in the Schedule
object which persists for the lifetime of the process, rather than being
recreated for each run. The solution is to clear the errors from the
Schedule object before each run.
Added a test that fails on main and passes on this branch.
for #19930
# Checklist for submitter
- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality
# Details
This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.
dogfood instance will post to a separate slack channel).
The actual changes are:
**On the server side:**
- Add errors field to cron_stats table (json DEFAULT NULL)
- Added errors var to `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect err from job into new errors var
- Update `Schedule.updateStats`and `CronStats.UpdateCronStats`to accept
errors argument
- If provided, update errors field of cron_stats table
**On the monitor side:**
- Add new SQL query to look for all completed schedules since last run
with non-null errors
- send SNS with job ID, name, errors
# Testing
New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local mysql and
localstack instances:
```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found. Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
We discussed at backend sync today (2024-11-05) that we'd like to start
adding READMEs in the codebase for very tactical documentation.
This is an inital README for the cron/scheduling machinery.
`go-kit/kit/log` was deprecated and generating warnings
# Checklist for submitter
If some of the following don't apply, delete the relevant line.
<!-- Note that API documentation changes are now addressed by the
product design team. -->
- [x] Manual QA for all new/changed functionality
#16480
# Checklist for submitter
- [x] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [x] Added/updated tests
- [x] Manual QA for all new/changed functionality
#9486
Now cron jobs should recover from a Fleet outage after ~ two hours.
- [X] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or
docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for
new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
- ~For Orbit and Fleet Desktop changes:~
- ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.~
- ~[ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).~
#9515
Sample output after running `fleetctl trigger --name
cleanups_then_aggregation`:
```sh
./build/fleet serve --dev --dev_license 2>&1 | tee ~/fleet.txt
level=info ts=2023-03-09T19:27:17.324691Z component=redis mode=standalone
level=info ts=2023-03-09T19:27:17.360565Z instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA=="
level=info ts=2023-03-09T19:27:17.372767Z msg="started cron schedules: automations, cleanups_then_aggregation, integrations, usage_statistics, vulnerabilities"
ts=2023-03-09T19:27:17.391404Z transport=https address=0.0.0.0:8080 msg=listening
level=error ts=2023-03-09T19:27:19.973841Z query=fleet_detail_query_software_macos message="distributed query is denylisted" hostID=58
level=info ts=2023-03-09T19:27:21.262799Z cron=cleanups_then_aggregation schedule=cleanups_then_aggregation instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA==" status=pending
ts=2023-03-09T19:27:22.218129Z inf="skipping verification of encryption keys as MDM is not fully configured"
level=info ts=2023-03-09T19:27:22.224179Z cron=cleanups_then_aggregation schedule=cleanups_then_aggregation instanceID="V9mArnX3lPhlIS0enyFau9eWi/dpjUPmOzJ3rwQUkX+l2aU1AMM4lQfdaDFZfeyJSHBwrIt/km1ghmRcyhdWqA==" status=completed
```
- [X] Changes file added for user-visible changes in `changes/` or
`orbit/changes/`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- ~[ ] Documented any API changes (docs/Using-Fleet/REST-API.md or
docs/Contributing/API-for-contributors.md)~
- ~[ ] Documented any permissions changes~
- ~[ ] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)~
- ~[ ] Added support on fleet's osquery simulator `cmd/osquery-perf` for
new osquery data ingestion features.~
- ~[ ] Added/updated tests~
- [X] Manual QA for all new/changed functionality
- ~For Orbit and Fleet Desktop changes:~
- ~[ ] Manual QA must be performed in the three main OSs, macOS, Windows
and Linux.~
- ~[ ] Auto-update manual QA, from released version of component to new
version (see [tools/tuf/test](../tools/tuf/test/README.md)).~