Restructure "Scaling Fleet" handbook page for ease of reference (#16850)

This PR consolidates various subheadings into one list that appears
"above the fold" to make it easier for contributors to find the info
they are looking for on the page. As it was previously, important info
was getting buried under the "Connect to Dogfood" instructions, which
gave the wrong impression about the scope of the page content.
Sarah Gillespie 2024-02-15 22:24:03 -06:00 committed by GitHub
parent 79e6ae6840
commit 64b85f87f7
GPG key ID: B5690EEEBB952194


@@ -1,78 +1,11 @@
# Scaling Fleet
Nowadays, Fleet, as a Go server, scales horizontally very well. It's not very CPU or memory intensive. In terms of load in infrastructure, from highest to lowest are: MySQL, Redis, and Fleet.
In general, we should burn a bit of CPU or memory on the Fleet side if it allows us to reduce the load on MySQL or Redis.
In many, caching helps, but given that we are not doing load balancing based on host id (i.e., make sure that the same host ends up in the same Fleet server). This goes only so far. Caching host-specific data is not done because round-robin LB means all Fleet instances end up circling the total list of hosts.
### How to prevent most of this
The best way we've got so far to prevent any scaling issues is to load test things. **Every new feature must have its corresponding osquery-perf implementation as part of the PR, and it should be tested at a reasonable scale for the feature**.
Besides that, you should consider the answer(s) to the following question: how can I know that the feature I'm working on is working and performing well enough? Add any logs, metrics, or anything that will help us debug and understand what's happening when things unavoidably go wrong or take longer than anticipated.
**HOWEVER** (and forgive this Captain Obvious comment): do NOT optimize before you KNOW you have to. Don't hesitate to take an extra day on your feature/bug work to load test things properly.
## What have we learned so far?
This is a document that evolves and will likely always be incomplete. If you feel like something is missing, either add it or bring it up in any way you consider.
## Connecting to Dogfood MySQL & Redis
### Prerequisites
1. Setup [VPN](https://github.com/fleetdm/confidential/blob/main/vpn/README.md)
2. Configure [SSO](https://github.com/fleetdm/confidential/tree/main/infrastructure/sso#how-to-use-sso)
### Connecting
#### MySQL
Get the database host:
```shell
DB_HOST=$(aws rds describe-db-clusters --filter Name=db-cluster-id,Values=fleet-dogfood --query "DBClusters[0].Endpoint" --output=text)
```
Get the database user:
```shell
DB_USER=$(aws rds describe-db-clusters --filter Name=db-cluster-id,Values=fleet-dogfood --query "DBClusters[0].MasterUsername" --output=text)
```
Get the database password:
```shell
DB_PASSWORD=$(aws secretsmanager get-secret-value --secret-id fleet-dogfood-database-password --query "SecretString" --output=text)
```
Connect:
```shell
mysql -h"${DB_HOST}" -u"${DB_USER}" -p"${DB_PASSWORD}"
```
#### Redis
Get the Redis Host:
```shell
REDIS_HOST=$(aws elasticache describe-replication-groups --replication-group-id fleetdm-redis --query "ReplicationGroups[0].NodeGroups[0].PrimaryEndpoint.Address" --output=text)
```
Connect:
```shell
redis-cli -h "${REDIS_HOST}"
```
## Foreign keys and locking
Among the first things you learn in database data modeling is: that if one table references a row in another, that reference should be a foreign key. This provides a lot of assurances and makes coding basic things much simpler.
However, this database feature doesn't come without a cost. The one to focus on here is locking, and here's a great summary of the issue: https://www.percona.com/blog/2006/12/12/innodb-locking-and-foreign-keys/
The TLDR is: understand very well how a table will be used. If we do bulk inserts/updates, InnoDB might lock more than you anticipate and cause issues. This is not an argument to not do bulk inserts/updates, but to be very careful when you add a foreign key.
In particular, host_id is a foreign key we've been skipping in all the new additional host data tables, which is not something that comes for free, as with that, [we have to keep the data consistent by hand with cleanups](https://github.com/fleetdm/fleet/blob/71a237042a9c39a45bc8f9c76465e5ff6039eba9/server/datastore/mysql/hosts.go#L444).
### In this section
### What have we learned so far?
- [How Fleet scales](#how-fleet-scales)
- [How to prevent most of this](#how-to-prevent-most-of-this)
- [Foreign keys and locking](#foreign-keys-and-locking)
- [Insert on duplicate update](#insert-on-duplicate-update)
- [Host extra data and JOINs](#host-extra-data-and-joins)
- [What DB tables matter more when thinking about performance?](#what-db-tables-matter-more-when-thinking-about-performance)
@@ -82,8 +15,33 @@ In particular, host_id is a foreign key we've been skipping in all the new add
- [Counts and aggregated data](#counts-and-aggregated-data)
- [Caching data such as app config](#caching-data-such-as-app-config)
- [Redis SCAN](#redis-scan)
- [Fleet docs](#fleet-docs)
- [Community support](#community-support)
- [Connecting to Dogfood MySQL & Redis](#connecting-to-dogfood-mysql--redis)
### How Fleet scales
Nowadays, Fleet, as a Go server, scales horizontally very well. It's not very CPU or memory intensive. In terms of infrastructure load, the components rank from highest to lowest as follows: MySQL, Redis, and Fleet.
In general, we should burn a bit of CPU or memory on the Fleet side if it allows us to reduce the load on MySQL or Redis.
In many cases, caching helps, but because we do not load balance based on host ID (i.e., we do not make sure that the same host always ends up on the same Fleet server), it only goes so far. Host-specific data is not cached because round-robin load balancing means every Fleet instance ends up cycling through the full list of hosts.
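The effect of the load-balancing strategy on a per-instance cache can be sketched with a toy simulation. This is only an illustration: the function name, the strategy names, and the host and server counts are made up, and real check-ins do not arrive in a fixed order. It counts how many distinct hosts a single Fleet instance would see under round-robin routing versus routing by host ID.

```shell
# Toy simulation (hypothetical numbers, not a model of Fleet's real traffic).
# Prints how many distinct hosts server 0 sees for a given routing strategy.
distinct_hosts_on_server0() {
  strategy="$1"; hosts="$2"; servers="$3"; rounds="$4"
  awk -v strategy="$strategy" -v hosts="$hosts" \
      -v servers="$servers" -v rounds="$rounds" '
    BEGIN {
      req = 0
      for (r = 0; r < rounds; r++) {
        for (h = 0; h < hosts; h++) {
          # round_robin: the next request goes to the next server in turn.
          # anything else: the host ID decides the server (host affinity).
          if (strategy == "round_robin") target = req % servers
          else target = h % servers
          if (target == 0) seen[h] = 1
          req++
        }
      }
      n = 0
      for (h in seen) n++
      print n
    }'
}

distinct_hosts_on_server0 round_robin 10 3 3    # prints 10
distinct_hosts_on_server0 host_affinity 10 3 3  # prints 4
```

With round-robin routing the single instance ends up seeing every host, so a host-specific cache on that instance would have to hold the whole fleet. With host affinity it would only hold its own share.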
### How to prevent most of this
The best way we've got so far to prevent any scaling issues is to load test things. **Every new feature must have its corresponding osquery-perf implementation as part of the PR, and it should be tested at a reasonable scale for the feature**.
Besides that, you should consider the answer(s) to the following question: how can I know that the feature I'm working on is working and performing well enough? Add any logs, metrics, or anything that will help us debug and understand what's happening when things unavoidably go wrong or take longer than anticipated.
**HOWEVER** (and forgive this Captain Obvious comment): do NOT optimize before you KNOW you have to. Don't hesitate to take an extra day on your feature/bug work to load test things properly.
### Foreign keys and locking
One of the first things you learn in database data modeling is that if one table references a row in another, that reference should be a foreign key. This provides a lot of assurances and makes coding basic things much simpler.
However, this database feature doesn't come without a cost. The one to focus on here is locking, and here's a great summary of the issue: https://www.percona.com/blog/2006/12/12/innodb-locking-and-foreign-keys/
The TLDR is: understand very well how a table will be used. If we do bulk inserts/updates, InnoDB might lock more than you anticipate and cause issues. This is not an argument against bulk inserts/updates, but a reason to be very careful when you add a foreign key.
In particular, host_id is a foreign key we've been skipping in all the new additional host data tables. This does not come for free: without the foreign key, [we have to keep the data consistent by hand with cleanups](https://github.com/fleetdm/fleet/blob/71a237042a9c39a45bc8f9c76465e5ff6039eba9/server/datastore/mysql/hosts.go#L444).
### Insert on duplicate update
@@ -173,7 +131,55 @@ Another place to cache things would be Redis. The improvement here is that all i
### Redis SCAN
Redis has solved many scaling problems in general, but it's not devoid of scaling problems of its own. In particular, we learned that the SCAN command scans the whole key space before it does the filtering. This can be very slow, depending on the state of the system. If Redis is slow, a lot suffers from it.
Redis has solved many scaling problems in general, but it's not devoid of scaling problems of its
own. In particular, we learned that the SCAN command scans the whole key space before it does the
filtering. This can be very slow, depending on the state of the system. If Redis is slow, a lot
suffers from it.
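To get a feel for this cost on a given instance, one option is to compare the total key count with the number of keys a pattern actually matches. The sketch below assumes `redis-cli` is installed. The helper names, the `REDIS_CLI` override, and the key pattern in the usage comment are hypothetical and not part of Fleet.

```shell
# Sketch only: the helper names and the REDIS_CLI override are illustrative.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Print the total number of keys in the current database. SCAN has to walk
# all of them even when the pattern selects only a few, so this number is a
# rough proxy for how expensive a full SCAN pass will be.
keyspace_size() {
  "${REDIS_CLI}" -h "${REDIS_HOST:-localhost}" DBSIZE
}

# List the keys matching a pattern by iterating with SCAN.
# The pattern filters what is returned, not what is visited.
scan_keys() {
  "${REDIS_CLI}" -h "${REDIS_HOST:-localhost}" --scan --pattern "$1"
}

# Usage (the pattern is a made-up example):
#   keyspace_size
#   scan_keys 'example:prefix:*' | wc -l
```

If the first number is much larger than the second, most of the work SCAN does is spent on keys that are filtered out afterwards.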
### Connecting to Dogfood MySQL & Redis
When investigating performance issues, it can be helpful to connect directly to the MySQL and Redis
instances to run queries and inspect data. Below are instructions for connecting to the Dogfood
MySQL and Redis instances.
#### Prerequisites
1. Set up [VPN](https://github.com/fleetdm/confidential/blob/main/vpn/README.md)
2. Configure [SSO](https://github.com/fleetdm/confidential/tree/main/infrastructure/sso#how-to-use-sso)
#### MySQL
Get the database host:
```shell
DB_HOST=$(aws rds describe-db-clusters --filter Name=db-cluster-id,Values=fleet-dogfood --query "DBClusters[0].Endpoint" --output=text)
```
Get the database user:
```shell
DB_USER=$(aws rds describe-db-clusters --filter Name=db-cluster-id,Values=fleet-dogfood --query "DBClusters[0].MasterUsername" --output=text)
```
Get the database password:
```shell
DB_PASSWORD=$(aws secretsmanager get-secret-value --secret-id fleet-dogfood-database-password --query "SecretString" --output=text)
```
Connect:
```shell
mysql -h"${DB_HOST}" -u"${DB_USER}" -p"${DB_PASSWORD}"
```
#### Redis
Get the Redis host:
```shell
REDIS_HOST=$(aws elasticache describe-replication-groups --replication-group-id fleetdm-redis --query "ReplicationGroups[0].NodeGroups[0].PrimaryEndpoint.Address" --output=text)
```
Connect:
```shell
redis-cli -h "${REDIS_HOST}"
```
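If one of the lookups above fails, for example because the VPN or SSO session has expired, the variable ends up empty and the connect command fails with a less obvious error. A small guard can catch this earlier. `require_var` is a hypothetical helper, not something that exists in the Fleet repo. The `None` check assumes the AWS CLI's text output, which prints `None` when a query matches nothing.

```shell
# Hypothetical helper: fail early if a variable is unset, empty, or "None"
# (the AWS CLI prints None in text output when a --query matches nothing).
require_var() {
  name="$1"
  # Indirectly read the variable whose name was passed in.
  eval "value=\${${name}:-}"
  if [ -z "${value}" ] || [ "${value}" = "None" ]; then
    echo "error: ${name} is empty; check VPN, SSO, and AWS credentials" >&2
    return 1
  fi
}

# Usage, after running the lookups above:
#   require_var DB_HOST && require_var DB_USER && require_var DB_PASSWORD
#   require_var REDIS_HOST
```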
<meta name="maintainedBy" value="lukeheath">
<meta name="title" value="Scaling Fleet">