fleet/changes/43910-implement-chart-module
Scott Gress 28908e6083
Dashboard charts backend (#43910)
**Related issue:** For #42812 

# Details

This PR implements a new bounded context, `chart`, with a single
endpoint, `/charts`. The context encompasses a framework for recording,
querying, and aggregating historical data for Fleet hosts, and for
returning that data via the API for charting.

This initial iteration has a full implementation of a dataset called
"uptime", which captures which hosts were online hour by hour ("online"
meaning the host was "seen" at some point during that hour). It has a
partial implementation of a "cve" dataset, which will capture which
hosts were vulnerable to which CVEs during a given day.

### Data storage

Data is stored in an SCD (slowly-changing dimension) format in the
`host_scd_data` table. The main "value" in a row is stored in the
`host_bitmap` column, a `mediumblob` in which each bit encodes a host ID
(bit 1 represents host ID 1, bit 1444 represents host ID 1444, etc.).
The set of bits set on a row represents the hosts for which that dataset
is "on" during the time period bounded by the `valid_from` (inclusive)
and `valid_to` (exclusive) dates. A `valid_to` can have the special
"sentinel" value `9999-12-31T00:00:00.000`, meaning that the row is
still "open" (the value represents everything from `valid_from` to the
present). Additionally, an `entity_id` column can be used for datasets
with multiple dimensions, e.g. CVE exposure or software usage, which
would have entity IDs representing CVEs or software items respectively.
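
To make the bitmap encoding concrete, here is a minimal Python sketch of packing host IDs into a byte string and reading them back. It is not the PR's Go implementation: the function names are hypothetical, and the byte and bit ordering (byte `id // 8`, least-significant bit first) is an assumption, since the description only states that bit N stands for host ID N.

```python
def hosts_to_bitmap(host_ids):
    """Pack host IDs into a bytes bitmap where bit N stands for host ID N."""
    if not host_ids:
        return b""
    bitmap = bytearray(max(host_ids) // 8 + 1)
    for host_id in host_ids:
        # Assumed layout: byte index = id // 8, bit index = id % 8 (LSB first).
        bitmap[host_id // 8] |= 1 << (host_id % 8)
    return bytes(bitmap)


def bitmap_to_hosts(bitmap):
    """Return the host IDs whose bits are set, in ascending order."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bitmap)
        for bit in range(8)
        if byte & (1 << bit)
    ]


def bitmap_or(a, b):
    """Bitwise OR of two bitmaps that may have different lengths."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    out = bytearray(longer)
    for i, byte in enumerate(shorter):
        out[i] |= byte
    return bytes(out)
```

Under this layout, a bitmap covering host IDs up to 1444 occupies 181 bytes, and combining two samples is a byte-wise OR.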

### Data collection

Data is collected via a cron job that runs every 10 minutes. Each
dataset has its own `Collect` method which will sample the data for the
given moment. For example the "uptime" dataset gathers the set of hosts
that are online at the moment, and the "cve" dataset will gather the set
of hosts that are vulnerable to each CVE at that moment. The sample can
then be recorded using one of two strategies:

* `accumulate`: bitwise OR the sample with any data already recorded for
the current hour, or add a new pre-closed row for that hour.
* `snapshot`: if there is no open row, create one with the sample and
`valid_to` set to the sentinel. Otherwise:
  * If the sample has the same value as the current open row, do nothing.
  * If the sample has a different value and the current open row's
`valid_from` is within the same hour, update the current row's value.
  * If the sample has a different value and the current open row's
`valid_from` is not within the same hour, close the current open row and
start a new one with `valid_from` = the start of the current hour.
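
The two strategies above can be sketched against an in-memory list of rows. This is an illustrative Python model, not the PR's Go code: the `Row` class and function names are hypothetical, a Python `int` stands in for the `host_bitmap` blob, and a few details the description leaves open are assumptions (a pre-closed row ends one hour after it starts, the first snapshot row starts at the current hour, and closing a row sets its `valid_to` to the start of the current hour).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Sentinel from the description: an "open" row has this valid_to.
SENTINEL = datetime(9999, 12, 31)


@dataclass
class Row:
    valid_from: datetime
    valid_to: datetime
    bitmap: int  # stand-in for host_bitmap: bit N set means host ID N is "on"


def hour_start(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def accumulate(rows, sample, now):
    """OR the sample into the current hour's row, or add a pre-closed row."""
    start = hour_start(now)
    for row in rows:
        if row.valid_from == start:
            row.bitmap |= sample
            return
    # Assumption: "pre-closed" means valid_to is the end of this hour.
    rows.append(Row(start, start + timedelta(hours=1), sample))


def snapshot(rows, sample, now):
    """Keep a single open row and write only when the sampled value changes."""
    start = hour_start(now)
    open_row = next((r for r in rows if r.valid_to == SENTINEL), None)
    if open_row is None:
        # Assumption: the first row starts at the current hour.
        rows.append(Row(start, SENTINEL, sample))
    elif open_row.bitmap == sample:
        return  # same value: do nothing
    elif hour_start(open_row.valid_from) == start:
        open_row.bitmap = sample  # changed within the same hour: update in place
    else:
        open_row.valid_to = start  # close the old row at the hour boundary
        rows.append(Row(start, SENTINEL, sample))
```

In this model, `accumulate` fits a dataset like "uptime", where a host counts as online if any sample in the hour saw it, while `snapshot` only adds rows when the sampled set actually changes.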

### Data retrieval 

1. Gets the set of host IDs to retrieve data for. This starts with the
set of host IDs in the requested fleet (or all the hosts a user has
access to if no `fleet_id` param was passed to the `/charts` endpoint),
and is further whittled down by any filter options supplied with the
request (labels, platforms, etc.).
2. Finds all `host_scd_data` rows for the requested dataset and date
range (i.e. all rows whose `valid_from` is < the date range end and
`valid_to` is > the date range start).
3. Calculates the date ranges of the "buckets" to return datapoints for.
For the uptime chart we default to 3-hour buckets, so we want 8 buckets
per day.
4. Iterates over each bucket and finds the row or rows from
`host_scd_data` that cover that bucket range. For datasets using the
"accumulate" strategy, the values for those rows are ORed together. For
"snapshot" datasets, we take the row active at the bucket end time to
represent the bucket (e.g. "which hosts had a given CVE at the end of
the day").
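
The retrieval steps can be sketched as follows. This is a hedged Python illustration rather than the PR's Go code: the function names are hypothetical, rows are modeled as `(valid_from, valid_to, bitmap)` tuples with an `int` standing in for `host_bitmap`, and two details are assumptions, namely that the host filter from step 1 is applied by ANDing it with each bucket's bitmap and counting the remaining bits, and that the bucket end is treated as exclusive when picking the active snapshot row.

```python
from datetime import datetime, timedelta


def make_buckets(start, end, size):
    """Split [start, end) into consecutive (bucket_start, bucket_end) ranges."""
    buckets = []
    cursor = start
    while cursor < end:
        buckets.append((cursor, min(cursor + size, end)))
        cursor += size
    return buckets


def overlaps(row, range_start, range_end):
    """Step 2's predicate: valid_from < range end and valid_to > range start."""
    valid_from, valid_to, _ = row
    return valid_from < range_end and valid_to > range_start


def bucket_counts(rows, buckets, host_filter, strategy):
    """Count filtered hosts per bucket for the given recording strategy."""
    counts = []
    for b_start, b_end in buckets:
        value = 0
        if strategy == "accumulate":
            for row in rows:
                if overlaps(row, b_start, b_end):
                    value |= row[2]  # OR every row touching the bucket
        else:
            # "snapshot": take the row active at the bucket end.
            # Assumption: the bucket end itself is exclusive.
            for valid_from, valid_to, bitmap in rows:
                if valid_from < b_end <= valid_to:
                    value = bitmap
        # Assumption: the step 1 host filter is applied as a bitwise AND.
        counts.append(bin(value & host_filter).count("1"))
    return counts
```

For example, `make_buckets(day, day + timedelta(days=1), timedelta(hours=3))` yields the 8 buckets per day mentioned in step 3.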

### Tools

This PR includes two dev tools that don't require deep review:

* **chart-backfill** - used to backfill data to various datasets for
testing
* **charts-collect** - used to collect data from a live server via the
API and store it in a local `host_scd_data` table

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.

- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements), JS
inline code is prevented especially for url redirects, and untrusted
data interpolated into shell scripts/commands is validated against shell
metacharacters.

## Testing

- [X] Added/updated automated tests
- [X] Where appropriate, [automated tests simulate multiple hosts and
test for host
isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing)
(updates to one host's records do not affect another)

- [X] QA'd all new/changed functionality manually
  - With [front-end branch](https://github.com/fleetdm/fleet/pull/43878)
<img width="712" height="434" alt="image"
src="https://github.com/user-attachments/assets/b2ccce49-b5fd-4076-b47f-0eea6a53260c"
/>

## Database migrations

- [X] Checked schema for all modified tables for columns that will
auto-update timestamps during migration.
- [X] Confirmed that updating the timestamps is acceptable, and will not
cause unwanted side effects.
- [X] Ensured the correct collation is explicitly set for character
columns (`COLLATE utf8mb4_unicode_ci`).

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added charting bounded context: HTTP API for metrics (uptime, CVE),
dataset registry, hosted dataset collection, background
collection/cleanup with opt-out env.
  * New utilities: host bitmap operations and string-list/uint-list
parsers.
  * New CLI tools to collect and backfill chart data.

* **Database**
  * Migration and schema to store host time-series SCD chart data.

* **Tests**
  * Extensive unit and integration tests for service, storage, caching,
cron, and utilities.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2026-04-23 12:43:23 -05:00

- Implemented the chart bounded context and schema to support charting capabilities in Fleet