diff --git a/articles/deploying-fleet-on-aws-with-terraform.md b/articles/deploying-fleet-on-aws-with-terraform.md
index e5494bc79c..1b753ada5e 100644
--- a/articles/deploying-fleet-on-aws-with-terraform.md
+++ b/articles/deploying-fleet-on-aws-with-terraform.md
@@ -47,6 +47,38 @@ Now that the remote state is configured, we can move on to setting up the infras
 ## Infastructure
 
 https://github.com/fleetdm/fleet/tree/main/infrastructure/dogfood/terraform/aws
+![Architecture Diagram](../website/assets/images/articles/fleet-aws-reference-arch-diagram.png)
+
+The infrastructure used in this deployment is available in all regions. The following resources will be created:
+
+- VPC
+    - Subnets
+        - Public
+        - Private
+    - ACLs
+    - Security Groups
+    - Application Load Balancer
+- ECS as the container orchestrator
+    - Fargate for underlying compute
+    - Task roles via IAM
+- RDS Aurora (MySQL 8.X)
+- Elasticache (Redis 6.X)
+- Firehose Delivery Stream (osquery log destination)
+- S3 bucket (the following S3 buckets can house sensitive data, thus are created with zero public access)
+    - firehose destination for osquery logs (https://github.com/fleetdm/fleet/blob/main/infrastructure/dogfood/terraform/aws/firehose.tf#L27)
+    - osquery file carving destination (https://github.com/fleetdm/fleet/blob/main/infrastructure/dogfood/terraform/aws/s3.tf#L29)
+
+### Encryption
+By default, both RDS & Elasticache are encrypted at rest and encrypted in transit. The S3 buckets are also server-side encrypted using AWS managed KMS keys.
+
+### Networking
+For more details on the networking configuration, take a look at https://github.com/terraform-aws-modules/terraform-aws-vpc. In the configuration Fleet provides,
+we are creating public and private subnets in addition to a separate data layer for RDS and Elasticache. The configuration also defaults
+to using a single NAT Gateway.
+
+### Backups
+RDS daily snapshots are enabled by default and retention is set to 30 days. If there is ever a need, a snapshot identifier can be supplied via the terraform variable (`rds_initial_snapshot`)
+in order to create the database from a previous snapshot.
 
 Next, we’ll update the terraform setup in the `/aws` directory's [main.tf](https://github.com/fleetdm/fleet/tree/main/infrastructure/dogfood/terraform/aws/main.tf) to use the S3 Bucket and DynamoDB created above:
 
@@ -84,7 +116,7 @@ osquery_results_s3_bucket = "-fleet-prod-osquery-results-archive"
 osquery_status_s3_bucket = "-fleet-prod-osquery-status-archive"
 ```
 
-Feel free to use whatever values you would like for the `osquery_results_s3_bucket` and `osquery_status_s3_bucket`. Just keep in mind that they need to be unique across AWS. We're setting the initial capacity for `fleet` to `0` to prevent the fleet service from attempting to start until setup is complete.
+Feel free to use whatever values you would like for the `osquery_results_s3_bucket` and `osquery_status_s3_bucket`. Just keep in mind that they need to be unique across AWS. We're setting the initial capacity for `fleet` to `0` to prevent the fleet service from attempting to start until setup is complete. Note that your AWS CLI region should be set to the same region in which you intend to provision the resources. All regions are compatible.
 
 Now we’re ready to apply the terraform. From the `/aws` directory, Run:
 
@@ -92,7 +124,7 @@ Now we’re ready to apply the terraform. From the `/aws` directory, Run:
 2. `terraform workspace new -fleet-prod`
 3. `terraform apply --var-file=prod.tfvars`
 
-You should see the planned output, and you will need to confirm the creation. Review this output, and type `yes` when you are ready.
+You should see the planned output, and you will need to confirm the creation. Review this output, and type `yes` when you are ready. Note that this can take up to 30 minutes to apply.
 
 During this process, terraform will create a `hosted zone` with an `NS` record for your domain and request a certificate from [AWS Certificate Manager (ACM)](https://aws.amazon.com/certificate-manager/). While the process is running, you'll need to add the `NS` records to your domain as well.
 
@@ -190,6 +222,39 @@ Once the process completes, your Fleet instance is ready to use! Check out the d
 Setting up all the required infrastructure to run a dedicated web service in AWS can be a daunting task. The Fleet team’s goal is to provide a solid base to build from. As most AWS environments have their own specific needs and requirements, this base is intended to be modified and tailored to your specific needs.
 
+## Troubleshooting
+
+1. AWS CLI gives the error "cannot find ECS cluster" when trying to run the migration task
+    1. double-check your AWS CLI default region and make sure it is the same region you deployed the ECS cluster in
+    2. the `--cluster ` might be incorrect, verify the name of your ECS cluster that was created
+2. AWS ACM fails to validate and issue certificates
+    1. verify that the NS records created in the new hosted zone are propagated to your nameserver authority
+    2. this might require multiple terraform apply runs
+3. ECS fails to deploy Fleet container image (docker pull request limit exceeded/429 errors)
+    1. if the migration task has not run successfully before the Fleet backend attempts to start it will cause the container to repeatedly fail and this can exceed docker pull request rate limits
+    2. scale down the fleet backend to zero tasks and let the pull request limit reset, this can take from 15 minutes to an hour
+    3. attempt to run migrations and then scale the Fleet backend back up
+4. If Fleet is running, but you are getting a poor experience or feel like something is wrong
+    1. check application logs emitted to AWS Cloudwatch
+    2. check performance metrics (CPU & Memory utilization) in AWS Cloudwatch
+        1. RDS
+        2. Elasticache
+        3. ECS
+### Scaling Limitations
+It is possible to run into multiple AWS scaling limitations depending on the size of the Fleet deployment, frequency of queries, and amount of data returned.
+The Fleet backend is designed to scale horizontally (this is also enabled by default using target-tracking autoscaling policies out-of-the-box).
+
+However, it is still possible to run into AWS scaling limitations such as:
+#### Firehose write throughput provision exceeded errors
+This particular issue would only be encountered for the largest of Fleet deployments and can occur because of high volume of data and/or number of hosts, if you notice these errors in the application logs or from the AWS Firehose console try the following:
+1. Check the service limits https://docs.aws.amazon.com/firehose/latest/dev/limits.html
+2. evaluate the amount of data returned using Fleet's live query feature
+3. reduce the frequency of scheduled queries
+4. reduce the amount of data returned for scheduled queries (Snapshot vs Differential queries https://osquery.readthedocs.io/en/stable/deployment/logging/)
+
+####
+
+More troubleshooting tips can be found here https://fleetdm.com/docs/deploying/faq
diff --git a/infrastructure/dogfood/terraform/aws/ecs-iam.tf b/infrastructure/dogfood/terraform/aws/ecs-iam.tf
index 60af0660b4..e23ba7bc88 100644
--- a/infrastructure/dogfood/terraform/aws/ecs-iam.tf
+++ b/infrastructure/dogfood/terraform/aws/ecs-iam.tf
@@ -5,6 +5,7 @@ data "aws_iam_policy_document" "fleet" {
     resources = ["*"]
   }
 
+  // allow fleet application to obtain the database password from secrets manager
   statement {
     effect  = "Allow"
     actions = ["secretsmanager:GetSecretValue"]
@@ -29,6 +30,7 @@ data "aws_iam_policy_document" "fleet" {
     resources = ["arn:aws:rds-db:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:dbuser:*/${var.database_user}"]
   }
 
+  // allow fleet application to write to kinesis firehose for osquery log destination configuration
   statement {
     effect = "Allow"
     actions = [
@@ -39,6 +41,7 @@ data "aws_iam_policy_document" "fleet" {
     resources = [aws_kinesis_firehose_delivery_stream.osquery_results.arn, aws_kinesis_firehose_delivery_stream.osquery_status.arn]
   }
 
+  // These actions are required for osquery file carving APIs
   // We use wildcards on these actions for buckets that are single-use.
   statement { #tfsec:ignore:aws-iam-no-policy-wildcards
     effect = "Allow"
@@ -73,6 +76,7 @@ data "aws_iam_policy_document" "assume_role" {
 
 resource "aws_iam_role" "main" {
   name               = "fleetdm-role"
+  description        = "IAM role that Fleet application assumes when running in ECS"
   assume_role_policy = data.aws_iam_policy_document.assume_role.json
 }
 
@@ -82,8 +86,9 @@ resource "aws_iam_role_policy_attachment" "role_attachment" {
 }
 
 resource "aws_iam_policy" "main" {
-  name   = "fleet-iam-policy"
-  policy = data.aws_iam_policy_document.fleet.json
+  name        = "fleet-iam-policy"
+  description = "IAM policy that Fleet application uses to define access to AWS resources"
+  policy      = data.aws_iam_policy_document.fleet.json
 }
 
 resource "aws_iam_role_policy_attachment" "attachment" {
diff --git a/infrastructure/dogfood/terraform/aws/main.tf b/infrastructure/dogfood/terraform/aws/main.tf
index d0aa417120..849cecef4c 100644
--- a/infrastructure/dogfood/terraform/aws/main.tf
+++ b/infrastructure/dogfood/terraform/aws/main.tf
@@ -1,7 +1,3 @@
-variable "region" {
-  default = "us-east-2"
-}
-
 provider "aws" {
   region = var.region
 }
diff --git a/infrastructure/dogfood/terraform/aws/redis.tf b/infrastructure/dogfood/terraform/aws/redis.tf
index d34232d609..fccf314c33 100644
--- a/infrastructure/dogfood/terraform/aws/redis.tf
+++ b/infrastructure/dogfood/terraform/aws/redis.tf
@@ -11,7 +11,7 @@ variable "redis_instance" {
   default = "cache.m5.large"
 }
 resource "aws_elasticache_replication_group" "default" {
-  availability_zones = ["us-east-2a", "us-east-2b", "us-east-2c"]
+  availability_zones = var.redis_azs
   engine = "redis"
   parameter_group_name = "default.redis6.x"
   subnet_group_name = module.vpc.elasticache_subnet_group_name
diff --git a/infrastructure/dogfood/terraform/aws/variables.tf b/infrastructure/dogfood/terraform/aws/variables.tf
index 0cc088d78e..d3e2521022 100644
--- a/infrastructure/dogfood/terraform/aws/variables.tf
+++ b/infrastructure/dogfood/terraform/aws/variables.tf
@@ -129,3 +129,18 @@ variable "extra_security_group_cidrs" {
 variable "rds_initial_snapshot" {
   default = null
 }
+
+variable "redis_azs" {
+  default     = ["us-east-2a", "us-east-2b", "us-east-2c"]
+  description = "the availability zones to utilize for redis"
+}
+
+variable "vpc_azs" {
+  default     = ["us-east-2a", "us-east-2b", "us-east-2c"]
+  description = "the availability zones to utilize for vpc creation"
+}
+
+variable "region" {
+  default     = "us-east-2"
+  description = "the AWS region to utilize for infrastructure"
+}
diff --git a/infrastructure/dogfood/terraform/aws/vpc.tf b/infrastructure/dogfood/terraform/aws/vpc.tf
index ed9fa8b4a7..5563fad940 100644
--- a/infrastructure/dogfood/terraform/aws/vpc.tf
+++ b/infrastructure/dogfood/terraform/aws/vpc.tf
@@ -4,7 +4,7 @@ module "vpc" {
 
   name = "fleet-vpc"
   cidr = "10.10.0.0/16"
-  azs = ["us-east-2a", "us-east-2b", "us-east-2c"]
+  azs = var.vpc_azs
   private_subnets = ["10.10.1.0/24", "10.10.2.0/24", "10.10.3.0/24"]
   public_subnets = ["10.10.11.0/24", "10.10.12.0/24", "10.10.13.0/24"]
   database_subnets = ["10.10.21.0/24", "10.10.22.0/24", "10.10.23.0/24"]
diff --git a/website/assets/images/articles/fleet-aws-reference-arch-diagram.png b/website/assets/images/articles/fleet-aws-reference-arch-diagram.png
new file mode 100644
index 0000000000..ad725a52f7
Binary files /dev/null and b/website/assets/images/articles/fleet-aws-reference-arch-diagram.png differ
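
Usage note for the new variables: with `region`, `vpc_azs`, and `redis_azs` now defined in `variables.tf`, a deployment outside `us-east-2` could presumably be selected from the var file rather than by editing the `.tf` files. The snippet below is a minimal sketch, assuming the overrides are added to the `prod.tfvars` file that the article passes to `terraform apply`. The `us-west-2` values are illustrative only and do not come from this change.

```hcl
# Hypothetical overrides in prod.tfvars (values are examples, not defaults).
# The variable names are the ones added to variables.tf in this diff.
region = "us-west-2"

# The AZs must belong to the region chosen above. Three are listed because
# vpc.tf defines three subnets per tier.
vpc_azs   = ["us-west-2a", "us-west-2b", "us-west-2c"]
redis_azs = ["us-west-2a", "us-west-2b", "us-west-2c"]
```

As the updated article notes, the AWS CLI region should match the region being provisioned, so the CLI default would need to change along with these values.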