Terraform for Loadtesting Environment
The interface into this code is designed to be minimal. If you require changes beyond what's described here, contact #g-infra.
Deployment sizing
When loadtesting, it is important to size your load test for the number of hosts you plan to use. Please see https://fleetdm.com/docs/deploy/reference-architectures for some examples.
These are set via variables and should be applied to every terraform operation. Below is an example for a modest (~5k) number of hosts:
# When first applying. Assuming tag exists
terraform apply -var tag=hosts-5k-test -var fleet_containers=5 -var db_instance_type=db.t4g.medium -var redis_instance_type=cache.t4g.small
# When adding loadtest containers.
terraform apply -var tag=hosts-5k-test -var fleet_containers=5 -var db_instance_type=db.t4g.medium -var redis_instance_type=cache.t4g.small -var loadtest_containers=10
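Since the same sizing flags must accompany every terraform operation, one optional convenience (my suggestion, not part of the repo tooling; the variable names match those used above) is to keep them in a tfvars file:

```shell
# Keep the shared -var values in one place so every terraform operation uses the same sizing.
cat > loadtest.tfvars <<'EOF'
tag                 = "hosts-5k-test"
fleet_containers    = 5
db_instance_type    = "db.t4g.medium"
redis_instance_type = "cache.t4g.small"
EOF
# Then run, e.g.: terraform apply -var-file=loadtest.tfvars -var loadtest_containers=10
```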
Deploying your code to the loadtesting environment
IMPORTANT:
- We advise using a separate clone of the https://github.com/fleetdm/fleet repository because terraform operations are lengthy. Terraform uses the local files as the configuration files.
- When performing a load test, target a specific branch rather than main (referenced below as $BRANCH_NAME). The main branch changes often, which may trigger rebuilds of the images. The clone you use to run the terraform operations does not need to be on $BRANCH_NAME; $BRANCH_NAME is the Fleet version that will be deployed to the load test environment.
- These scripts were tested with terraform 1.10.4.
- Push your $BRANCH_NAME branch to https://github.com/fleetdm/fleet and trigger a manual run of the Docker publish workflow (make sure to select the branch).
- arm64 (M1/M2/etc) Mac only: run helpers/setup-darwin_arm64.sh to build terraform plugins that lack arm64 builds in the registry. Alternatively, you can use the amd64 terraform binary, which works with Rosetta 2.
- Log into AWS SSO on loadtesting via aws sso login. (If you have multiple profiles, export the AWS_PROFILE variable.) For configuration, see the infrastructure/sso folder's readme in the confidential private repo.
- Initialize your terraform environment with terraform init.
- Select a workspace for your test: terraform workspace new WORKSPACE-NAME; terraform workspace select WORKSPACE-NAME. Ensure your WORKSPACE-NAME is at most 17 characters long and contains only lowercase alphanumeric characters and hyphens, as it is used to generate names for AWS resources.
- Apply terraform with your branch name via terraform apply -var tag=BRANCH_NAME and type yes to approve execution of the plan. This takes a while to complete (many minutes, often more than 30). For a few minutes after terraform apply, the Fleet instances may fail to start with a permission issue (reading a database secret); this resolves automatically after a bit, and ECS will begin starting the Fleet instances. They may still fail due to missing database migrations (this shows up in the instances' logs); at this point you can move on to the next step.
- Run database migrations (see Running migrations). You will get 500 errors and your containers will not run if you do not do this. After running this step, you might need to wait a few minutes until the environment is up and running.
- Perform your tests (see Running a loadtest). Your deployment will be available at https://WORKSPACE-NAME.loadtest.fleetdm.com. Reach out to the infrastructure team to get the credentials to log in.
- For instructions on how to deploy new code changes to Fleet in the environment, see Deploying code changes to Fleet. This is useful to test performance improvements without having to set up a new loadtest environment.
- When you're done, clean up the environment with terraform destroy (it will prompt for the branch name). If a destroy fails, see ECR Cleanup Troubleshooting for the most common reason.
Running migrations
After applying terraform with the commands above and before performing your tests, run the following command:
aws ecs run-task --region us-east-2 --cluster fleet-"$(terraform workspace show)"-backend --task-definition fleet-"$(terraform workspace show)"-migrate:"$(terraform output -raw fleet_migration_revision)" --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets="$(terraform output -raw fleet_migration_subnets)",securityGroups="$(terraform output -raw fleet_migration_security_groups)"}"
MDM
If you need to run a load test with MDM enabled and configured, you will need to add MDM certificates, keys, and tokens to the Fleet configuration.
- Place the files in a known location:
/Users/foobar/mdm/fleet-mdm-apple-scep.crt
/Users/foobar/mdm/fleet-mdm-apple-scep.key
/Users/foobar/mdm/mdmcert.download.push.pem
/Users/foobar/mdm/mdmcert.download.push.key
/Users/foobar/mdm/downloadtoken.p7m
/Users/foobar/mdm/fleet-apple-mdm-bm-public-key.crt
/Users/foobar/mdm/fleet-apple-mdm-bm-private.key
- Then set the fleet_config terraform var as follows (make sure to add any extra configuration you need to this JSON):
export TF_VAR_fleet_config='{"FLEET_DEV_MDM_APPLE_DISABLE_PUSH":"1","FLEET_DEV_MDM_APPLE_DISABLE_DEVICE_INFO_CERT_VERIFY":"1","FLEET_MDM_APPLE_SCEP_CHALLENGE":"foobar","FLEET_MDM_APPLE_SCEP_CERT_BYTES":"'$(cat /Users/foobar/mdm/fleet-mdm-apple-scep.crt | gsed -z 's/\n/\\n/g')'","FLEET_MDM_APPLE_SCEP_KEY_BYTES":"'$(cat /Users/foobar/mdm/fleet-mdm-apple-scep.key | gsed -z 's/\n/\\n/g')'","FLEET_MDM_APPLE_APNS_CERT_BYTES":"'$(cat /Users/foobar/mdm/mdmcert.download.push.pem | gsed -z 's/\n/\\n/g')'","FLEET_MDM_APPLE_APNS_KEY_BYTES":"'$(cat /Users/foobar/mdm/mdmcert.download.push.key | gsed -z 's/\n/\\n/g')'","FLEET_MDM_APPLE_BM_SERVER_TOKEN_BYTES":"'$(cat /Users/foobar/mdm/downloadtoken.p7m | gsed -z 's/\n/\\n/g' | gsed 's/"smime\.p7m"/\\"smime.p7m\\"/g' | tr -d '\r\n')'","FLEET_MDM_APPLE_BM_CERT_BYTES":"'$(cat /Users/foobar/mdm/fleet-apple-mdm-bm-public-key.crt | gsed -z 's/\n/\\n/g')'","FLEET_MDM_APPLE_BM_KEY_BYTES":"'$(cat /Users/foobar/mdm/fleet-apple-mdm-bm-private.key | gsed -z 's/\n/\\n/g')'"}'
- The above is needed because of the newline characters in the certificate/key/token files.
- The value set in FLEET_MDM_APPLE_SCEP_CHALLENGE must match whatever you set in osquery-perf's mdm_scep_challenge argument.
- The above export TF_VAR_fleet_config=... command was tested on bash. It did not work in zsh.
- Note that we are also setting FLEET_DEV_MDM_APPLE_DISABLE_PUSH=1. We don't want to generate push notifications against fake UUIDs (otherwise it may cause Apple to rate limit us due to invalid requests). This has an impact on real devices because they will not be notified of any command to execute (it may take a reboot for them to reach out to Fleet for more commands).
- Note that we are also setting FLEET_DEV_MDM_APPLE_DISABLE_DEVICE_INFO_CERT_VERIFY=1 to skip verification of Apple certificates for OTA enrollments.
- Add the following osquery-perf arguments to loadtesting.tf:
  - -mdm_prob 1.0
  - -mdm_scep_challenge set to the same value as FLEET_MDM_APPLE_SCEP_CHALLENGE above.
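The gsed -z 's/\n/\\n/g' trick in the TF_VAR_fleet_config command above rewrites real newlines as the two characters backslash-n, so a multi-line certificate can live inside a single-line JSON string. A minimal, self-contained demonstration (GNU sed; on Linux plain sed works, on macOS install gnu-sed and call gsed):

```shell
# Demonstrate the newline-escaping used when building TF_VAR_fleet_config.
printf 'line1\nline2\n' > /tmp/demo.pem
escaped=$(sed -z 's/\n/\\n/g' /tmp/demo.pem)
echo "$escaped"   # → line1\nline2\n  (literal backslash-n, safe inside a JSON string)
```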
Enabling Cloudfront
Do not commit your BRANCH_NAME if any files exist in resources/TERRAFORM_WORKSPACE/ without a .encrypted extension. This step assumes that you've already successfully executed terraform apply and have a kms_key_id from the output of the terraform apply command.
- Under the terraform directory, create the directory resources/TERRAFORM_WORKSPACE/. Your TERRAFORM_WORKSPACE value can be retrieved with terraform workspace show.
- Change directory to resources/TERRAFORM_WORKSPACE/.
- Create your keys:
openssl genrsa -out cloudfront.key 2048
openssl rsa -pubout -in cloudfront.key -out cloudfront.pem
- Create encrypt.sh (store the script in the terraform directory, two directories up: ../..):
#!/bin/bash
set -e
function usage() {
cat <<-EOUSAGE
Usage: $(basename ${0}) <KMS_KEY_ID> <SOURCE> <DESTINATION> [AWS_PROFILE]
This script encrypts a plaintext file from SOURCE into an
AWS KMS encrypted DESTINATION file. Optionally you
may provide the AWS_PROFILE you wish to use to run the aws kms
commands.
EOUSAGE
exit 1
}
[ $# -lt 3 ] && usage
if [ -n "${4}" ]; then
export AWS_PROFILE=${4}
fi
aws kms encrypt --key-id "${1:?}" --plaintext fileb://<(cat "${2:?}") --output text --query CiphertextBlob > "${3:?}"
- Make the script executable by running chmod +x ../../encrypt.sh
- Encrypt the objects using encrypt.sh:
for i in *; do ../../encrypt.sh <KMS_KEY_ID> $i $i.encrypted; done
for i in *.encrypted; do rm ${i/.encrypted/}; done
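The second loop above relies on bash parameter expansion rather than an external tool: ${i/.encrypted/} is the value of i with the first occurrence of ".encrypted" removed, which recovers the plaintext filename to delete. For example:

```shell
# ${i/.encrypted/} strips the first occurrence of ".encrypted" from $i.
i="cloudfront.key.encrypted"
plain="${i/.encrypted/}"
echo "$plain"   # → cloudfront.key
```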
- Change back to the terraform directory (two directories up: ../..).
- If the name of your public/private cloudfront key is not cloudfront.pem|.key, update locals.tf:
# Set the following variable value to the base name of your cloudfront key, no extension.
cloudfront_key_basename = "cloudfront"
- Create cloudfront.tf from the template:
cp template/cloudfront.tf.disabled cloudfront.tf
- In locals.tf, uncomment the following line under extra_secrets:
module.cloudfront-software-installers.extra_secrets,
You should end up with something that looks like the following block:
extra_secrets = merge(
module.cloudfront-software-installers.extra_secrets
)
- Initialize terraform and upgrade any necessary dependencies
terraform init -upgrade
- Apply the terraform
terraform apply -var tag=BRANCH_NAME
- In locals.tf, uncomment the following line under extra_execution_iam_policies:
module.cloudfront-software-installers.extra_execution_iam_policies,
You should end up with something that looks like this.
extra_execution_iam_policies = concat(
module.cloudfront-software-installers.extra_execution_iam_policies,
[]
)
- Apply the terraform
terraform apply -var tag=BRANCH_NAME
- Cloudfront should now be enabled.
- If you had previously uploaded any software installers, they need to be re-encrypted by finding and targeting your bucket with the following commands:
# List buckets matching your BRANCH_NAME
aws s3 ls | grep BRANCH_NAME
# Replace <bucket-name> with the software_installers bucket name.
aws s3 cp s3://<bucket-name>/ s3://<bucket-name>/ --recursive
Running a loadtest
We run simulated hosts in containers of 500 at a time. Once the infrastructure is running, you can run the following command:
terraform apply -var tag=BRANCH_NAME -var loadtest_containers=8
With the loadtest_containers variable you specify how many containers of 500 hosts you want to start; the example above runs 4,000 hosts. If the fleet instances need special configuration, you can pass environment variables via the fleet_config terraform variable, which is a map, using the following syntax (note the single quotes around the whole fleet_config variable assignment, and the double quotes inside its map value):
terraform apply -var tag=BRANCH_NAME -var loadtest_containers=8 -var='fleet_config={"FLEET_OSQUERY_ENABLE_ASYNC_HOST_PROCESSING":"host_last_seen=true","FLEET_OSQUERY_ASYNC_HOST_COLLECT_INTERVAL":"host_last_seen=10s"}'
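Because the whole fleet_config map is passed as one JSON string, a stray quote produces confusing terraform errors. An optional sanity check (my suggestion, not part of the repo tooling) is to validate the JSON locally before applying:

```shell
# Validate the fleet_config JSON before handing it to terraform.
FLEET_CONFIG='{"FLEET_OSQUERY_ENABLE_ASYNC_HOST_PROCESSING":"host_last_seen=true","FLEET_OSQUERY_ASYNC_HOST_COLLECT_INTERVAL":"host_last_seen=10s"}'
python3 -c 'import json,sys; json.loads(sys.argv[1]); print("valid JSON")' "$FLEET_CONFIG"
# Then: terraform apply -var tag=BRANCH_NAME -var="fleet_config=$FLEET_CONFIG"
```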
Monitoring the infrastructure
This document covers the load test key metrics to capture or keep an eye on. Results are collected in this spreadsheet for release-specific load tests.
There are a few main places of interest to monitor the load and resource usage:
- The Application Performance Monitoring (APM) dashboard: access it on your Fleet load-testing URL on port :5601 and path /app/apm, e.g. https://loadtest.fleetdm.com:5601/app/apm. Note that to do this without the VPN, you will need to add your public IP address to the load balancer for TCP port 5601. At the time of this writing, this will take you directly to the security group for the load balancer if logged into the Load Testing account.
- The APM dashboard can also be accessed via private IP over the VPN. Use the following one-liner to get the URL: aws ec2 describe-instances --region=us-east-2 | jq -r '.Reservations[].Instances[] | select(.State.Name == "running") | select(.Tags[] | select(.Key == "ansible_playbook_file") | .Value == "elasticsearch.yml") | "http://" + .PrivateIpAddress + ":5601/app/apm"'. This connects directly to the EC2 instance and doesn't use the load balancer.
- To monitor MySQL database load, go to AWS RDS, select "Performance Insights" and the database instance to monitor (you may want to turn off auto-refresh).
- To monitor Redis load, go to Amazon ElastiCache, select the redis cluster to monitor, and go to "Metrics".
Deploying code changes to Fleet
You can deploy new code changes to an environment the following way:
- Push the code changes to the BRANCH_NAME branch, trigger a manual run of the Docker publish workflow (make sure to select the branch) and wait for it to complete.
- Find the docker image IDs corresponding to your branch:
docker images | grep 'BRANCH_NAME' | awk '{print $3}'
- Remove those image IDs with docker rmi $IMAGE_ID.
- Run the following to trigger a re-deploy of the Fleet instances with the new Fleet docker image:
# - You must set `loadtest_containers` to the current count (otherwise it will bring the currently running simulated hosts down)
# - If we don't specify the `-target`s then it will bring the loadtest containers down and re-deploy them with the new image, we don't want that because
# you will end up with twice the hosts enrolled (half online, half offline).
terraform apply -var tag=BRANCH_NAME -var loadtest_containers=XXX -target=aws_ecs_service.fleet -target=aws_ecs_task_definition.backend -target=aws_ecs_task_definition.migration -target=aws_s3_bucket_acl.osquery-results -target=aws_s3_bucket_acl.osquery-status -target=docker_registry_image.fleet
NOTE: When performing a migration test, set -var fleet_containers=0 and -var loadtest_containers=XXX where XXX is the current number of loadtest containers, when running the above command. This will bring down any running fleet containers during the migration, while leaving the loadtest containers up and running.
Once the re-deploy on the new branch is finished, you will need to run migrations again:
aws ecs run-task --region us-east-2 --cluster fleet-"$(terraform workspace show)"-backend --task-definition fleet-"$(terraform workspace show)"-migrate:"$(terraform output -raw fleet_migration_revision)" --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets="$(terraform output -raw fleet_migration_subnets)",securityGroups="$(terraform output -raw fleet_migration_security_groups)"}"
Once the migrations have completed, run the following command to bring the fleet containers back up (substituting in the correct BRANCH_NAME, loadtest_containers and fleet_containers values):
terraform apply -var tag=BRANCH_NAME -var loadtest_containers=XXX -var fleet_containers=XX -target=aws_ecs_service.fleet -target=aws_ecs_task_definition.backend -target=aws_ecs_task_definition.migration -target=aws_s3_bucket_acl.osquery-results -target=aws_s3_bucket_acl.osquery-status -target=docker_registry_image.fleet -target=aws_appautoscaling_target.ecs_target
Using -target=aws_appautoscaling_target.ecs_target will prevent your instance from shutting down prematurely if there are performance issues, to allow for further investigation.
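The -target lists in the commands above are long and easy to mistype. One convenience (a sketch, not part of the repo; the resource addresses are copied from the first command above) is to keep the targets in a bash array and expand them with a -target= prefix:

```shell
# Build the -target flags from an array; ${TARGETS[@]/#/-target=} prefixes each element.
TARGETS=(
  aws_ecs_service.fleet
  aws_ecs_task_definition.backend
  aws_ecs_task_definition.migration
  aws_s3_bucket_acl.osquery-results
  aws_s3_bucket_acl.osquery-status
  docker_registry_image.fleet
)
printf '%s\n' "${TARGETS[@]/#/-target=}"
# Then: terraform apply -var tag=BRANCH_NAME -var loadtest_containers=XXX "${TARGETS[@]/#/-target=}"
```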
Deploying code changes to osquery-perf
Following are the steps to deploy new code changes to osquery-perf (known as loadtest image in ECS) on a running loadtest environment.
The osquery-perf simulator in ECS doesn't keep state, so you cannot change existing hosts to use new osquery-perf code. The following steps add new hosts running new/modified osquery-perf code. (This is needed if, during a load test, the developer realizes there's a bug in osquery-perf or that it's not simulating osquery properly.)
You must push your code changes to the $BRANCH_NAME branch first.
- Bring all loadtest containers to 0 by running terraform apply with loadtest_containers=0.
- Delete all existing hosts (by selecting all on the UI).
- Delete all your local loadtest images; the image tags are of the form loadtest-$BRANCH_NAME-$TAG (these are the loadtest images pushed to ECR). (Use docker image list to get their IMAGE ID and then run docker rmi -f $ID.)
- Delete local images with REPOSITORY=<none> and TAG=<none> that were built recently (these are the builder images). (Use docker image list to get their IMAGE ID and then run docker rmi -f $ID.)
- Log in to Amazon ECR (Elastic Container Registry) and delete the corresponding loadtest image.
- Executing terraform apply with -var loadtest_containers=N will then trigger a rebuild of the loadtest image.
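The local loadtest image cleanup above boils down to a grep/awk pipeline over docker image list output. Shown here on fabricated sample output (the image names and IDs are made up for illustration):

```shell
# Extract the IDs of images whose tag matches loadtest-$BRANCH_NAME-... from sample output.
sample='fleet-test   loadtest-my-branch-abc123   111111111111
fleet-test   latest                      222222222222'
ids=$(echo "$sample" | grep 'loadtest-my-branch' | awk '{print $3}')
echo "$ids"   # → 111111111111
# Each matching ID would then be removed with: docker rmi -f $ids
```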
Troubleshooting
Using a release tag instead of a branch
Since the tag name on Dockerhub doesn't match the tag name on GitHub, deploying a release tag is a special case. Use the optional -var git_branch to specify the corresponding GitHub tag. For example, you would use the following to deploy a loadtest of version 4.28.0:
terraform apply -var tag=v4.28.0 -var git_branch=fleet-v4.28.0 -var loadtest_containers=8
General Troubleshooting
If terraform fails for some reason, you can make it output extra information to stderr by setting the TF_LOG environment variable to "DEBUG" or "TRACE", e.g.:
TF_LOG=DEBUG terraform apply ...
See https://www.terraform.io/internals/debugging for more details.
ECR Cleanup Troubleshooting
In a few instances, it is possible for an ECR repository to still have images left, preventing a full terraform destroy of a Loadtesting instance. Use the following one-liner to clean these up before re-running terraform destroy:
REPOSITORY_NAME=fleet-$(terraform workspace show); aws ecr list-images --repository-name ${REPOSITORY_NAME} --query 'imageIds[*]' --output text | while read digest tag; do aws ecr batch-delete-image --repository-name ${REPOSITORY_NAME} --image-ids imageDigest=${digest}; done
Errors with macOS Docker Desktop
If you are getting the following error when running terraform apply:
│ Error: Error pinging Docker server: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
│
│ with provider["registry.terraform.io/kreuzwerker/docker"],
│ on init.tf line 45, in provider "docker":
│ 45: provider "docker" {
Run:
$ docker context ls
NAME DESCRIPTION DOCKER ENDPOINT ERROR
default Current DOCKER_HOST based configuration unix:///var/run/docker.sock
desktop-linux * Docker Desktop unix:///Users/foobar/.docker/run/docker.sock
Then add the entry marked with *, in this case host = "unix:///Users/foobar/.docker/run/docker.sock", to the docker provider block in infrastructure/loadtesting/terraform/init.tf:
[...]
provider "docker" {
# Configuration options
registry_auth {
address = "${data.aws_caller_identity.current.account_id}.dkr.ecr.us-east-2.amazonaws.com"
username = data.aws_ecr_authorization_token.token.user_name
password = data.aws_ecr_authorization_token.token.password
}
host = "unix:///Users/foobar/.docker/run/docker.sock"
}
[...]
If you are getting the following error when running terraform apply:
│ Error: Error building docker image: 1: The command '/bin/sh -c git clone -b $TAG --depth=1 --no-tags --progress --no-recurse-submodules https://github.com/fleetdm/fleet.git && cd /go/fleet/cmd/osquery-perf/ && go build .' returned a non-zero code: 1
│
│ with docker_registry_image.loadtest,
│ on ecr.tf line 46, in resource "docker_registry_image" "loadtest":
│ 46: resource "docker_registry_image" "loadtest" {
- Check your Docker virtual machine settings. Open Docker Desktop, then open the settings (cmd-, or the gear in the top right of the screen). Scroll down to the "Virtual Machine Options" section.
- If you currently have the "Apple virtualization framework" setting selected, select the "Docker VMM" option instead. Click "Apply & restart" in the bottom right.
Once Docker has restarted, re-run terraform apply and you should be good to go!