fleet/changes/43910-implement-chart-module


Dashboard charts backend (#43910)

**Related issue:** For #42812

# Details

This PR implements a new bounded context, `chart`, with a single endpoint `/charts`. The context encompasses a framework for recording, querying, and aggregating historical data for Fleet hosts, and for returning that data via the API for the purpose of charting. This initial iteration has a full implementation of a dataset called "uptime", which captures which hosts were online hour by hour (online meaning having been "seen" at some point during that hour). It has a partial implementation of a "cve" dataset, which will capture which hosts were vulnerable to which CVEs during a given day.

### Data storage

Data is stored in an SCD (slowly changing dimension) format in the `host_scd_data` table, where the main "value" in a row is stored in the `host_bitmap` column, a `mediumblob` in which each bit encodes a host ID (bit 1 represents host ID 1, bit 1444 represents host ID 1444, etc.). The set of bits set on a row represents the hosts for which that dataset is "on" during the time period bounded by the `valid_from` (inclusive) and `valid_to` (exclusive) dates. A `valid_to` can have the special "sentinel" value 9999-12-31T00:00:00.000, meaning that the row is still "open" (the value represents everything from `valid_from` to the present). Additionally, an `entity_id` column can be used for datasets with multiple dimensions, e.g. CVE exposure or software usage, which would have entity IDs representing CVEs or software items respectively.

### Data collection

Data is collected via a cron job that runs every 10 minutes. Each dataset has its own `Collect` method, which samples the data for the given moment. For example, the "uptime" dataset gathers the set of hosts that are online at the moment, and the "cve" dataset will gather the set of hosts that are vulnerable to each CVE at that moment.
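As a rough illustration of the `host_bitmap` encoding described under Data storage, the sketch below sets and tests bits by host ID and takes the bitwise OR of two bitmaps, which is the operation used when samples are merged. The `HostBitmap` type and its methods are hypothetical names, not the PR's actual utilities, and the bit order within each byte (least significant bit first) is an assumption that the description does not specify.

```go
package main

import (
	"fmt"
	"math/bits"
)

// HostBitmap is a hypothetical stand-in for the contents of the
// `host_bitmap` column: bit N is set when host ID N is "on".
type HostBitmap []byte

// Set marks hostID as present, growing the slice when needed.
// Assumption: bit N lives in byte N/8 at position N%8 (LSB first).
func (b *HostBitmap) Set(hostID uint) {
	idx := int(hostID / 8)
	if idx >= len(*b) {
		grown := make(HostBitmap, idx+1)
		copy(grown, *b)
		*b = grown
	}
	(*b)[idx] |= 1 << (hostID % 8)
}

// Contains reports whether hostID is present in the bitmap.
func (b HostBitmap) Contains(hostID uint) bool {
	idx := int(hostID / 8)
	return idx < len(b) && b[idx]&(1<<(hostID%8)) != 0
}

// Count returns the number of hosts present in the bitmap.
func (b HostBitmap) Count() int {
	total := 0
	for _, v := range b {
		total += bits.OnesCount8(v)
	}
	return total
}

// Or returns the union of two bitmaps without modifying either input.
func Or(a, b HostBitmap) HostBitmap {
	if len(a) < len(b) {
		a, b = b, a // make a the longer bitmap
	}
	out := make(HostBitmap, len(a))
	copy(out, a)
	for i := range b {
		out[i] |= b[i]
	}
	return out
}

func main() {
	var earlier, later HostBitmap
	earlier.Set(1)
	later.Set(1444)
	merged := Or(earlier, later)
	// prints: true true false 2
	fmt.Println(merged.Contains(1), merged.Contains(1444), merged.Contains(2), merged.Count())
}
```

With this layout a bitmap needs roughly one byte per eight host IDs up to the largest ID present, so the bitmap holding host 1444 above occupies 181 bytes.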
The sample can then be recorded using one of two strategies:

* `accumulate`: bitwise OR the sample with any data already recorded for the current hour, or add a new pre-closed row for that hour.
* `snapshot`: if there is no open row, create one with the sample and `valid_to` set to the sentinel. Otherwise:
  * If the sample has the same value as the current open row, do nothing.
  * If the sample has a different value and the current open row's `valid_from` is within the same hour, update the current row's value.
  * If the sample has a different value and the current open row's `valid_from` is not within the same hour, close the current open row and start a new one with `valid_from` = the start of the current hour.

### Data retrieval

Retrieval proceeds in four steps:

1. Gets the set of host IDs to retrieve data for. This starts with the set of host IDs in the requested fleet (or all the hosts a user has access to if no `fleet_id` param was passed to the `/charts` endpoint), which is further whittled down by any filter options supplied with the request (labels, platforms, etc.).
2. Finds all `host_scd_data` rows for the requested dataset and date range (i.e. all rows whose `valid_from` is < the date range end and `valid_to` is > the date range start).
3. Calculates the date ranges of the "buckets" to return datapoints for. For the uptime chart we default to 3-hour buckets, so we want 8 buckets per day.
4. Iterates over each bucket and finds the row or rows from `host_scd_data` that cover that bucket range. For datasets using the "accumulate" strategy, the values for those rows are ORed together. For "snapshot" datasets, we take the row active at the bucket end time to represent the bucket (e.g. "which hosts had a given CVE at the end of the day").

### Tools

This PR includes two dev tools that don't require deep review:

* **chart-backfill** - used to backfill data to various datasets for testing
* **charts-collect** - used to collect data from a live server via the API and put it into a local `host_scd_data` table

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

- [X] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`. See [Changes files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files) for more information.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL injection is prevented (using placeholders for values in statements), JS inline code is prevented especially for url redirects, and untrusted data interpolated into shell scripts/commands is validated against shell metacharacters.

## Testing

- [X] Added/updated automated tests
- [X] Where appropriate, [automated tests simulate multiple hosts and test for host isolation](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/reference/patterns-backend.md#unit-testing) (updates to one host's records do not affect another)
- [X] QA'd all new/changed functionality manually
  - With [front-end branch](https://github.com/fleetdm/fleet/pull/43878)

<img width="712" height="434" alt="image" src="https://github.com/user-attachments/assets/b2ccce49-b5fd-4076-b47f-0eea6a53260c" />

## Database migrations

- [X] Checked schema for all modified tables for columns that will auto-update timestamps during migration.
- [X] Confirmed that updating the timestamps is acceptable, and will not cause unwanted side effects.
- [X] Ensured the correct collation is explicitly set for character columns (`COLLATE utf8mb4_unicode_ci`).
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added charting bounded context: HTTP API for metrics (uptime, CVE), dataset registry, hosted dataset collection, background collection/cleanup with opt-out env.
  * New utilities: host bitmap operations and string-list/uint-list parsers.
  * New CLI tools to collect and backfill chart data.
* **Database**
  * Migration and schema to store host time-series SCD chart data.
* **Tests**
  * Extensive unit and integration tests for service, storage, caching, cron, and utilities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2026-04-23 17:43:23 +00:00
- Implemented the chart bounded context and schema to support charting capabilities in Fleet