fleet/terraform
Scott Gress 6bd9cc8a44
Monitor and alert on errors in cron jobs (#24347)
for #19930 

# Checklist for submitter

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [X] Added/updated tests
- [X] If database migrations are included, checked table schema to
confirm autoupdate
- [X] Manual QA for all new/changed functionality

# Details

This PR adds a new feature to the existing monitoring add-on. The add-on
will now send an SNS alert whenever a scheduled job like
"vulnerabilities" or "apple_mdm_apns_pusher" exits early due to errors.
The alert contains the job type and the set of errors (there can be
multiple, since jobs can have multiple sub-jobs). By default the SNS
topic for this new alert is the same as the one for the existing cron
system alerts, but it can be configured to use a separate topic (e.g.
the dogfood instance will post to a separate Slack channel).

The actual changes are:

**On the server side:**

- Add `errors` field to the `cron_stats` table (`json DEFAULT NULL`)
- Add `errors` var to the `Schedule` struct to collect errors from jobs
- In `RunAllJobs`, collect `err` from each job into the new `errors` var
- Update `Schedule.updateStats` and `CronStats.UpdateCronStats` to accept
an errors argument
- If provided, update the `errors` field of the `cron_stats` table

**On the monitor side:**

- Add new SQL query to look for all completed schedules since last run
with non-null errors
- Send an SNS alert with the job ID, name, and errors

# Testing

New automated testing was added for the functional code that gathers and
stores errors from cron runs in the database. To test the actual Lambda,
I added a row in my `cron_stats` table with errors, then compiled and
ran the Lambda executable locally, pointing it to my local mysql and
localstack instances:

```
2024/12/03 14:43:54 main.go:258: Lambda execution environment not found.  Falling back to local execution.
2024/12/03 14:43:54 main.go:133: Connected to database!
2024/12/03 14:43:54 main.go:161: Row vulnerabilities last updated at 2024-11-27 03:30:03 +0000 UTC
2024/12/03 14:43:54 main.go:163: *** 1h hasn't updated in more than vulnerabilities, alerting! (status completed)
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'vulnerabilities' hasn't updated in more than 1h. Last status was 'completed' at 2024-11-27 03:30:03 +0000 UTC.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "260864ff-4cc9-4951-acea-cef883b2de5f"
}
2024/12/03 14:43:54 main.go:198: *** mdm_apple_profile_manager job had errors, alerting! (errors {"something": "wrong"})
2024/12/03 14:43:54 main.go:70: Sending SNS Message
2024/12/03 14:43:54 main.go:74: Sending 'Environment: dev
Message: Fleet cron 'mdm_apple_profile_manager' (last updated 2024-12-03 20:34:14 +0000 UTC) raised errors during its run:
{"something": "wrong"}.' to 'arn:aws:sns:us-east-1:000000000000:topic1'
2024/12/03 14:43:54 main.go:82: {
  MessageId: "5cd085ef-89f6-42c1-8470-d80a22b295f8"
}
```
2024-12-19 15:55:29 -06:00
| Name | Last commit | Date |
|------|-------------|------|
| addons | Monitor and alert on errors in cron jobs (#24347) | 2024-12-19 15:55:29 -06:00 |
| byo-vpc | Adding changes for Fleet v4.61.0 (#24407) (#24904) | 2024-12-19 10:09:22 -06:00 |
| example | Adding changes for Fleet v4.61.0 (#24407) (#24904) | 2024-12-19 10:09:22 -06:00 |
| .gitignore | Add ability to use sidecars (#10287) | 2023-03-03 13:50:48 -05:00 |
| .header.md | terraform module: some fixes for byo-vpc and below (#13553) | 2023-09-01 16:06:45 -05:00 |
| .terraform-docs.yml | Customer terraform (#9136) | 2022-12-29 16:28:50 -05:00 |
| main.tf | Terraform aws provider v5 fixes for terraform modules (#15159) | 2023-11-15 23:50:38 -06:00 |
| outputs.tf | create vuln processing addon (#10526) | 2023-03-29 08:57:10 -04:00 |
| README.md | Increase idle timeout for ALB to 15m across all configs (#23939) | 2024-11-20 10:57:37 -06:00 |
| variables.tf | Adding changes for Fleet v4.61.0 (#24407) (#24904) | 2024-12-19 10:09:22 -06:00 |

This module provides a basic Fleet setup and assumes that you bring nothing to the installation. If you want to bring your own VPC, database, cache nodes, or ECS cluster, use one of the submodules listed below.

To quickly list all available module versions you can run:

git tag | grep '^tf'

The following is the module layout, so you can navigate to the module that you want:

  • Root module (use this to get a Fleet instance ASAP with minimal setup)
    • BYO-VPC (use this if you want to install Fleet inside an existing VPC)
      • BYO-database (use this if you want to use an existing database and cache node)
        • BYO-ECS (use this if you want to bring your own everything but Fleet ECS services)
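
As a minimal sketch of the root module in use, the block below sets only `certificate_arn`, which is the one input the Inputs table marks as required. The module label `fleet`, the GitHub source address, the `<tag>` placeholder, and the certificate ARN are assumptions for illustration, not values taken from this README; substitute a real tag from the `git tag` listing above.

```hcl
# Hypothetical root configuration; all names and values are placeholders.
module "fleet" {
  # Assumed source address: the "terraform" directory of the Fleet repository.
  # Replace <tag> with one of the tags listed by `git tag | grep '^tf'`.
  source = "github.com/fleetdm/fleet//terraform?ref=<tag>"

  # The only required input; every other input falls back to its default.
  certificate_arn = "arn:aws:acm:us-east-2:123456789012:certificate/example"
}
```

The submodules nest in the order shown in the layout, so picking a deeper one (for example BYO-database) means supplying the resources that the outer levels would otherwise create.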

Migrating from existing Dogfood code

The following moved blocks show how to migrate from the existing Dogfood code. The "to" addresses assume that this module is instantiated as module "main":

moved {
  from = module.vpc
  to   = module.main.module.vpc
}

moved {
  from = module.aurora_mysql
  to   = module.main.module.byo-vpc.module.rds
}

moved {
  from = aws_elasticache_replication_group.default
  to   = module.main.module.byo-vpc.module.redis.aws_elasticache_replication_group.default
}

This focuses on the resources that are "heavy" or store data. Note that the ALB cannot be moved like this because Dogfood uses the aws_alb resource and the module uses the aws_lb resource. The resources are aliases of each other, but Terraform can't recognize that.

How to improve this module

If this module doesn't fit your needs, feel free to contact us by opening a ticket or by reaching out to your contact at Fleet. Our goal is for this module to cover all needs within AWS, so we will try to find a solution that works for you.

If you want to make the changes yourself, open a PR against main with your additions. We ask that you define variables as null when no default makes sense, and that you reflect variable changes all the way up the stack (that is, in every parent module up to the root).
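
The variable convention described above could look roughly like the following in a parent module. This is only a sketch: `example_setting` is a hypothetical input that does not exist in this module, and the real `module "byo-vpc"` call passes other arguments that are omitted here.

```hcl
# Parent module (for example, the root module).
# "example_setting" is a hypothetical input used only for illustration.
variable "example_setting" {
  description = "Hypothetical setting with no sensible default."
  type        = string
  default     = null # no default makes sense, so leave it null
}

module "byo-vpc" {
  source = "./byo-vpc"

  # Pass the value down; the child module declares the same variable
  # and forwards it again until it reaches the module that consumes it.
  example_setting = var.example_setting
}
```

Repeating the declaration at every level keeps the new input usable whether a consumer starts from the root module or from one of the nested submodules.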

How to update this readme

Edit .header.md and run terraform-docs markdown . > README.md

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3.8 |

Providers

No providers.

Modules

| Name | Source | Version |
|------|--------|---------|
| byo-vpc | ./byo-vpc | n/a |
| vpc | terraform-aws-modules/vpc/aws | 5.1.2 |

Resources

No resources.

Inputs

Name Description Type Default Required
alb_config n/a
object({
name = optional(string, "fleet")
security_groups = optional(list(string), [])
access_logs = optional(map(string), {})
allowed_cidrs = optional(list(string), ["0.0.0.0/0"])
allowed_ipv6_cidrs = optional(list(string), ["::/0"])
egress_cidrs = optional(list(string), ["0.0.0.0/0"])
egress_ipv6_cidrs = optional(list(string), ["::/0"])
extra_target_groups = optional(any, [])
https_listener_rules = optional(any, [])
tls_policy = optional(string, "ELBSecurityPolicy-TLS-1-2-2017-01")
idle_timeout = optional(number, 905)
})
{} no
certificate_arn n/a string n/a yes
ecs_cluster The config for the terraform-aws-modules/ecs/aws module
object({
autoscaling_capacity_providers = optional(any, {})
cluster_configuration = optional(any, {
execute_command_configuration = {
logging = "OVERRIDE"
log_configuration = {
cloud_watch_log_group_name = "/aws/ecs/aws-ec2"
}
}
})
cluster_name = optional(string, "fleet")
cluster_settings = optional(map(string), {
"name" : "containerInsights",
"value" : "enabled",
})
create = optional(bool, true)
default_capacity_provider_use_fargate = optional(bool, true)
fargate_capacity_providers = optional(any, {
FARGATE = {
default_capacity_provider_strategy = {
weight = 100
}
}
FARGATE_SPOT = {
default_capacity_provider_strategy = {
weight = 0
}
}
})
tags = optional(map(string))
})
{
"autoscaling_capacity_providers": {},
"cluster_configuration": {
"execute_command_configuration": {
"log_configuration": {
"cloud_watch_log_group_name": "/aws/ecs/aws-ec2"
},
"logging": "OVERRIDE"
}
},
"cluster_name": "fleet",
"cluster_settings": {
"name": "containerInsights",
"value": "enabled"
},
"create": true,
"default_capacity_provider_use_fargate": true,
"fargate_capacity_providers": {
"FARGATE": {
"default_capacity_provider_strategy": {
"weight": 100
}
},
"FARGATE_SPOT": {
"default_capacity_provider_strategy": {
"weight": 0
}
}
},
"tags": {}
}
no
fleet_config The configuration object for Fleet itself. Fields that default to null will have their respective resources created if not specified.
object({
task_mem = optional(number, null)
task_cpu = optional(number, null)
mem = optional(number, 4096)
cpu = optional(number, 512)
pid_mode = optional(string, null)
image = optional(string, "fleetdm/fleet:v4.54.1")
family = optional(string, "fleet")
sidecars = optional(list(any), [])
depends_on = optional(list(any), [])
mount_points = optional(list(any), [])
volumes = optional(list(any), [])
extra_environment_variables = optional(map(string), {})
extra_iam_policies = optional(list(string), [])
extra_execution_iam_policies = optional(list(string), [])
extra_secrets = optional(map(string), {})
security_group_name = optional(string, "fleet")
iam_role_arn = optional(string, null)
repository_credentials = optional(string, "")
private_key_secret_name = optional(string, "fleet-server-private-key")
service = optional(object({
name = optional(string, "fleet")
}), {
name = "fleet"
})
database = optional(object({
password_secret_arn = string
user = string
database = string
address = string
rr_address = optional(string, null)
}), {
password_secret_arn = null
user = null
database = null
address = null
rr_address = null
})
redis = optional(object({
address = string
use_tls = optional(bool, true)
}), {
address = null
use_tls = true
})
awslogs = optional(object({
name = optional(string, null)
region = optional(string, null)
create = optional(bool, true)
prefix = optional(string, "fleet")
retention = optional(number, 5)
}), {
name = null
region = null
prefix = "fleet"
retention = 5
})
loadbalancer = optional(object({
arn = string
}), {
arn = null
})
extra_load_balancers = optional(list(any), [])
networking = optional(object({
subnets = optional(list(string), null)
security_groups = optional(list(string), null)
ingress_sources = optional(object({
cidr_blocks = optional(list(string), [])
ipv6_cidr_blocks = optional(list(string), [])
security_groups = optional(list(string), [])
prefix_list_ids = optional(list(string), [])
}), {
cidr_blocks = []
ipv6_cidr_blocks = []
security_groups = []
prefix_list_ids = []
})
}), {
subnets = null
security_groups = null
ingress_sources = {
cidr_blocks = []
ipv6_cidr_blocks = []
security_groups = []
prefix_list_ids = []
}
})
autoscaling = optional(object({
max_capacity = optional(number, 5)
min_capacity = optional(number, 1)
memory_tracking_target_value = optional(number, 80)
cpu_tracking_target_value = optional(number, 80)
}), {
max_capacity = 5
min_capacity = 1
memory_tracking_target_value = 80
cpu_tracking_target_value = 80
})
iam = optional(object({
role = optional(object({
name = optional(string, "fleet-role")
policy_name = optional(string, "fleet-iam-policy")
}), {
name = "fleet-role"
policy_name = "fleet-iam-policy"
})
execution = optional(object({
name = optional(string, "fleet-execution-role")
policy_name = optional(string, "fleet-execution-role")
}), {
name = "fleet-execution-role"
policy_name = "fleet-iam-policy-execution"
})
}), {
name = "fleetdm-execution-role"
})
software_installers = optional(object({
create_bucket = optional(bool, true)
bucket_name = optional(string, null)
bucket_prefix = optional(string, "fleet-software-installers-")
s3_object_prefix = optional(string, "")
}), {
create_bucket = true
bucket_name = null
bucket_prefix = "fleet-software-installers-"
s3_object_prefix = ""
})
})
{
"autoscaling": {
"cpu_tracking_target_value": 80,
"max_capacity": 5,
"memory_tracking_target_value": 80,
"min_capacity": 1
},
"awslogs": {
"create": true,
"name": null,
"prefix": "fleet",
"region": null,
"retention": 5
},
"cpu": 256,
"database": {
"address": null,
"database": null,
"password_secret_arn": null,
"rr_address": null,
"user": null
},
"depends_on": [],
"extra_environment_variables": {},
"extra_execution_iam_policies": [],
"extra_iam_policies": [],
"extra_load_balancers": [],
"extra_secrets": {},
"family": "fleet",
"iam": {
"execution": {
"name": "fleet-execution-role",
"policy_name": "fleet-iam-policy-execution"
},
"role": {
"name": "fleet-role",
"policy_name": "fleet-iam-policy"
}
},
"iam_role_arn": null,
"image": "fleetdm/fleet:v4.54.1",
"loadbalancer": {
"arn": null
},
"mem": 512,
"mount_points": [],
"networking": {
"ingress_sources": {
"cidr_blocks": [],
"ipv6_cidr_blocks": [],
"prefix_list_ids": [],
"security_groups": []
},
"security_groups": null,
"subnets": null
},
"pid_mode": null,
"private_key_secret_name": "fleet-server-private-key",
"redis": {
"address": null,
"use_tls": true
},
"repository_credentials": "",
"security_group_name": "fleet",
"security_groups": null,
"service": {
"name": "fleet"
},
"sidecars": [],
"software_installers": {
"bucket_name": null,
"bucket_prefix": "fleet-software-installers-",
"create_bucket": true,
"s3_object_prefix": ""
},
"task_cpu": null,
"task_mem": null,
"volumes": []
}
no
migration_config The configuration object for Fleet's migration task.
object({
mem = number
cpu = number
})
{
"cpu": 1024,
"mem": 2048
}
no
rds_config The config for the terraform-aws-modules/rds-aurora/aws module
object({
name = optional(string, "fleet")
engine_version = optional(string, "8.0.mysql_aurora.3.07.1")
instance_class = optional(string, "db.t4g.large")
subnets = optional(list(string), [])
allowed_security_groups = optional(list(string), [])
allowed_cidr_blocks = optional(list(string), [])
apply_immediately = optional(bool, true)
monitoring_interval = optional(number, 10)
db_parameter_group_name = optional(string)
db_parameters = optional(map(string), {})
db_cluster_parameter_group_name = optional(string)
db_cluster_parameters = optional(map(string), {})
enabled_cloudwatch_logs_exports = optional(list(string), [])
master_username = optional(string, "fleet")
snapshot_identifier = optional(string)
cluster_tags = optional(map(string), {})
})
{
"allowed_cidr_blocks": [],
"allowed_security_groups": [],
"apply_immediately": true,
"cluster_tags": {},
"db_cluster_parameter_group_name": null,
"db_cluster_parameters": {},
"db_parameter_group_name": null,
"db_parameters": {},
"enabled_cloudwatch_logs_exports": [],
"engine_version": "8.0.mysql_aurora.3.07.1",
"instance_class": "db.t4g.large",
"master_username": "fleet",
"monitoring_interval": 10,
"name": "fleet",
"snapshot_identifier": null,
"subnets": []
}
no
redis_config n/a
object({
name = optional(string, "fleet")
replication_group_id = optional(string)
elasticache_subnet_group_name = optional(string)
allowed_security_group_ids = optional(list(string), [])
subnets = optional(list(string))
availability_zones = optional(list(string))
cluster_size = optional(number, 3)
instance_type = optional(string, "cache.m5.large")
apply_immediately = optional(bool, true)
automatic_failover_enabled = optional(bool, false)
engine_version = optional(string, "6.x")
family = optional(string, "redis6.x")
at_rest_encryption_enabled = optional(bool, true)
transit_encryption_enabled = optional(bool, true)
parameter = optional(list(object({
name = string
value = string
})), [])
log_delivery_configuration = optional(list(map(any)), [])
tags = optional(map(string), {})
})
{
"allowed_security_group_ids": [],
"apply_immediately": true,
"at_rest_encryption_enabled": true,
"automatic_failover_enabled": false,
"availability_zones": null,
"cluster_size": 3,
"elasticache_subnet_group_name": null,
"engine_version": "6.x",
"family": "redis6.x",
"instance_type": "cache.m5.large",
"log_delivery_configuration": [],
"name": "fleet",
"parameter": [],
"replication_group_id": null,
"subnets": null,
"tags": {},
"transit_encryption_enabled": true
}
no
vpc n/a
object({
name = optional(string, "fleet")
cidr = optional(string, "10.10.0.0/16")
azs = optional(list(string), ["us-east-2a", "us-east-2b", "us-east-2c"])
private_subnets = optional(list(string), ["10.10.1.0/24", "10.10.2.0/24", "10.10.3.0/24"])
public_subnets = optional(list(string), ["10.10.11.0/24", "10.10.12.0/24", "10.10.13.0/24"])
database_subnets = optional(list(string), ["10.10.21.0/24", "10.10.22.0/24", "10.10.23.0/24"])
elasticache_subnets = optional(list(string), ["10.10.31.0/24", "10.10.32.0/24", "10.10.33.0/24"])

create_database_subnet_group = optional(bool, false)
create_database_subnet_route_table = optional(bool, true)
create_elasticache_subnet_group = optional(bool, true)
create_elasticache_subnet_route_table = optional(bool, true)
enable_vpn_gateway = optional(bool, false)
one_nat_gateway_per_az = optional(bool, false)
single_nat_gateway = optional(bool, true)
enable_nat_gateway = optional(bool, true)
enable_dns_hostnames = optional(bool, false)
enable_dns_support = optional(bool, true)
enable_flow_log = optional(bool, false)
create_flow_log_cloudwatch_log_group = optional(bool, false)
create_flow_log_cloudwatch_iam_role = optional(bool, false)
flow_log_max_aggregation_interval = optional(number, 600)
flow_log_cloudwatch_log_group_name_prefix = optional(string, "/aws/vpc-flow-log/")
flow_log_cloudwatch_log_group_name_suffix = optional(string, "")
vpc_flow_log_tags = optional(map(string), {})
})
{
"azs": [
"us-east-2a",
"us-east-2b",
"us-east-2c"
],
"cidr": "10.10.0.0/16",
"create_database_subnet_group": false,
"create_database_subnet_route_table": true,
"create_elasticache_subnet_group": true,
"create_elasticache_subnet_route_table": true,
"create_flow_log_cloudwatch_iam_role": false,
"create_flow_log_cloudwatch_log_group": false,
"database_subnets": [
"10.10.21.0/24",
"10.10.22.0/24",
"10.10.23.0/24"
],
"elasticache_subnets": [
"10.10.31.0/24",
"10.10.32.0/24",
"10.10.33.0/24"
],
"enable_dns_hostnames": false,
"enable_dns_support": true,
"enable_flow_log": false,
"enable_nat_gateway": true,
"enable_vpn_gateway": false,
"flow_log_cloudwatch_log_group_name_prefix": "/aws/vpc-flow-log/",
"flow_log_cloudwatch_log_group_name_suffix": "",
"flow_log_max_aggregation_interval": 600,
"name": "fleet",
"one_nat_gateway_per_az": false,
"private_subnets": [
"10.10.1.0/24",
"10.10.2.0/24",
"10.10.3.0/24"
],
"public_subnets": [
"10.10.11.0/24",
"10.10.12.0/24",
"10.10.13.0/24"
],
"single_nat_gateway": true,
"vpc_flow_log_tags": {}
}
no
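
To illustrate how the object-typed inputs above can be partially overridden, here is a hedged sketch that sets a few attributes and leaves the rest to their defaults. The module label, source address, `<tag>` placeholder, certificate ARN, and every overridden value (image tag, capacities, instance class, VPC name) are illustrative assumptions, not recommendations.

```hcl
# Hypothetical configuration; all values are placeholders.
module "fleet" {
  source          = "github.com/fleetdm/fleet//terraform?ref=<tag>" # assumed source address
  certificate_arn = "arn:aws:acm:us-east-2:123456789012:certificate/example"

  # Only the listed attributes change; the others use their optional() defaults.
  vpc = {
    name = "fleet-example"
  }

  fleet_config = {
    image = "fleetdm/fleet:v4.61.0" # example tag only
    autoscaling = {
      min_capacity = 2
      max_capacity = 10
    }
  }

  rds_config = {
    instance_class = "db.r6g.large"
  }
}
```

One Terraform behavior worth keeping in mind: the Default column applies only when an input is omitted entirely. Once `fleet_config` is set at all, omitted attributes take the `optional(...)` defaults from the type (for instance `mem = 4096` and `cpu = 512`) rather than the `"mem": 512` and `"cpu": 256` shown in the Default value.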

Outputs

| Name | Description |
|------|-------------|
| byo-vpc | n/a |
| vpc | n/a |