When loadtesting, it is important to size your load test for the number of hosts you plan to use. Please see https://fleetdm.com/docs/deploy/reference-architectures for some examples.
These are set via [variables](https://github.com/fleetdm/fleet/blob/main/infrastructure/loadtesting/terraform/variables.tf) and should be applied to every terraform operation. Below is an example for a modest (~5k) number of hosts:
> - We advise to use a separate clone of the https://github.com/fleetdm/fleet repository because `terraform` operations are lengthy. Terraform uses the local files as the configuration files.
> - When performing a load test you target a specific branch and not `main` (referenced below as `$BRANCH_NAME`). The `main` branch changes often and it might trigger rebuilts of the images. The cloned repository that you will use to run the terraform operations doesn't need to be in `$BRANCH_NAME`, such `$BRANCH_NAME` is the Fleet version that will be deployed to the load test environment.
1. Push your `$BRANCH_NAME` branch to https://github.com/fleetdm/fleet and trigger a manual run of the [Docker publish](https://github.com/fleetdm/fleet/actions/workflows/goreleaser-snapshot-fleet.yaml) workflow (make sure to select the branch).
1. arm64 (M1/M2/etc) Mac Only: run `helpers/setup-darwin_arm64.sh` to build terraform plugins that lack arm64 builds in the registry. Alternatively, you can use the amd64 terraform binary, which works with Rosetta 2.
1. Log into AWS SSO on `loadtesting` via `aws sso login`. (If you have multiple profiles, export the `AWS_PROFILE` variable.) For configuration, see `infrastructure/sso` folder's readme in the `confidential` private repo.
1. Select a workspace for your test: `terraform workspace new WORKSPACE-NAME; terraform workspace select WORKSPACE-NAME`. Ensure your `WORKSPACE-NAME` is less than or equal to 17 characters and contains only lowercase alphanumeric characters and hyphens, as it is used to generate names for AWS resources.
1. Apply terraform with your branch name with `terraform apply -var tag=BRANCH_NAME` and type `yes` to approve execution of the plan. This takes a while to complete (many minutes, > ~30m). Note that for a few minutes after `terraform apply`, the Fleet instances may be failing to start with a permission issue (to read a database secret), but this should resolve automatically after a bit and ECS will begin to start the Fleet instances, but they may still fail due to missing database migrations (this will show up in the instances' logs). At this point you can move on to the next step.
1. Run database migrations (see [Running migrations](#running-migrations)). You will get 500 errors and your containers will not run if you do not do this. After running this step, you might need to wait a few minutes until the environment is up and running.
1. Perform your tests (see [Running a loadtest](#running-a-loadtest)). Your deployment will be available at `https://WORKSPACE-NAME.loadtest.fleetdm.com`. Reach out to the infrastructure team to get the credentials to log in.
1. For instructions on how to deploy new code changes to Fleet to the environment, see [Deploying code changes to Fleet](#deploying-code-changes-to-fleet). This is useful to test performance improvements without having to set up a new loadtest environment.
1. When you're done, clean up the environment with `terraform destroy` (it will prompt for the branch name). If A destroy fails, see [ECR Cleanup Troubleshooting](#ecr-cleanup-troubleshooting) for the most common reason.
- The above `export TF_VAR_fleet_config=...` command was tested on `bash`. It did not work in `zsh`.
- Note that we are also setting `FLEET_DEV_MDM_APPLE_DISABLE_PUSH=1`. We don't want to generate push notifications against fake UUIDs (otherwise it may cause Apple to rate limit due to invalid requests).
- Note that we are also setting `FLEET_DEV_MDM_APPLE_DISABLE_DEVICE_INFO_CERT_VERIFY=1` to skip verification of Apple certificates for OTA enrollments.
This has an impact on real devices because they will not be notified of any command to execute (it may take a reboot for them to reach out to Fleet for more commands).
3. Add the following `osquery-perf` arguments to [loadtesting.tf](./loadtesting.tf)
-`-mdm_prob 1.0`
-`-mdm_scep_challenge` set to the same value as `FLEET_MDM_APPLE_SCEP_CHALLENGE` above.
1. Under the terraform directory, create directory `resources/TERRAFORM_WORKSPACE/`. Your `TERRAFORM_WORKSPACE` value can be retrieved with `terraform workspace show`.
15. If you had previously uploaded any software installers, they need to be re-encrypted by finding and targeting your bucket with the following commands.
```
# List buckets matching your BRANCH_NAME
aws s3 ls | grep BRANCH_NAME
# Replace <bucket-name> with the software_instalelrs bucket name.
With the variable `loadtest_containers` you can specify how many containers of 500 hosts you want to start. In the example above, it will run 4000. If the `fleet` instances need special configuration, you can pass them as environment variables to the `fleet_config` terraform variable, which is a map, using the following syntax (note the use of single quotes around the whole `fleet_config` variable assignment, and the use of double quotes inside its map value):
This [document](https://docs.google.com/document/d/1V6QtFzcGDsLnn2PIvGin74DAxdAN_3likjxSssOMMQI/edit?tab=t.0) covers the load test key metrics to capture or keep an eye on. Results are collected in [this spreadsheet](https://docs.google.com/spreadsheets/d/1FOF0ykFVoZ7DJSTfrveip0olfyRQsY9oT1uXCCZmuKc/edit?gid=0#gid=0) for release-specific load tests.
* The Application Performance Monitoring (APM) dashboard: access it on your Fleet load-testing URL on port `:5601` and path `/app/apm`, e.g. `https://loadtest.fleetdm.com:5601/app/apm`. Note to do this without the VPN you will need to add your public IP Address to the load balancer for TCP Port 5601. At the time of this writing, [this](https://us-east-2.console.aws.amazon.com/vpc/home?region=us-east-2#SecurityGroup:groupId=sg-0e67d910a662720f8) will take you directly to the security group for the load balancer if logged into the Load Testing account.
* The APM dashboard can also be accessed via private IP over the VPN. Use the following one-liner to get the URL: `aws ec2 describe-instances --region=us-east-2 | jq -r '.Reservations[].Instances[] | select(.State.Name == "running") | select(.Tags[] | select(.Key == "ansible_playbook_file") | .Value == "elasticsearch.yml") | "http://" + .PrivateIpAddress + ":5601/app/apm"'`. This connects directly to the EC2 instance and doesn't use the load balancer.
* To monitor mysql database load, go to AWS RDS, select "Performance Insights" and the database instance to monitor (you may want to turn off auto-refresh).
* To monitor Redis load, go to Amazon ElastiCache, select the redis cluster to monitor, and go to "Metrics".
1. Push the code changes to the `BRANCH_NAME`, trigger a manual run of the [Docker publish](https://github.com/fleetdm/fleet/actions/workflows/goreleaser-snapshot-fleet.yaml) workflow (make sure to select the branch) and wait for it to complete.
2. Find the docker image IDs corresponding to your branch:
3. Remove such image IDs with `docker rmi $IMAGE_ID`.
4. Run the following to trigger a re-deploy of the Fleet instances with the new Fleet docker image:
```sh
# - You must set `loadtest_containers` to the current count (otherwise it will bring the currently running simulated hosts down)
# - If we don't specify the `-target`s then it will bring the loadtest containers down and re-deploy them with the new image, we don't want that because
# you will end up with twice the hosts enrolled (half online, half offline).
NOTE: When performing a migration test, set `-var fleet_containers=0` and `-var loadtest_containers=XXX` where `XXX` is the current number of loadtest containers, when running the above command. This will bring down any running fleet containers during the migration, while leaving the loadtest containers up and running.
Once the re-deploy on the new branch is finished, you will need to run migrations again:
Once the migrations have completed, run the following command to bring the fleet containers back up (substituing in the correct `BRANCH_NAME`, `loadtest_containers` and `fleet_containers` values):
Using `-target=aws_appautoscaling_target.ecs_target` will prevent your instance from shutting down prematurely if there are performance issues, to allow for further investigation.
1. Delete all your local `loadtest` images, the image tags are of the form: `loadtest-$BRANCH_NAME-$TAG` (these are the `loadtest` images pushed to ECR). (Use `docker image list` to get their `IMAGE ID` and then run `docker rmi -f $ID`.)
1. Delete local images of the form `REPOSITORY=<none>` and `TAG=<none>` that were built recently (these are the builder images). (Use `docker image list` to get their `IMAGE ID` and then run `docker rmi -f $ID`.)
Since the tag name on Dockerhub doesn't match the tag name on GitHub, this presents a special use case when wanting to deploy a release tag. In this case, you can use the optional `-var git_branch` in order to specify the separate tag. For example, you would use the following to deploy a loadtest of version 4.28.0:
If terraform fails for some reason, you can make it output extra information to `stderr` by setting the `TF_LOG` environment variable to "DEBUG" or "TRACE", e.g.:
`TF_LOG=DEBUG terraform apply ...`
See https://www.terraform.io/internals/debugging for more details.
In a few instances, it is possible for an ECR repository to still have images left, preventing a full `terraform destroy` of a Loadtesting instance. Use the following one-liner to clean these up before re-running `terraform destroy`:
`REPOSITORY_NAME=fleet-$(terraform workspace show); aws ecr list-images --repository-name ${REPOSITORY_NAME} --query 'imageIds[*]' --output text | while read digest tag; do aws ecr batch-delete-image --repository-name ${REPOSITORY_NAME} --image-ids imageDigest=${digest}; done`