Commit graph

149 commits

Author SHA1 Message Date
Jorge Falcon
66b2908042
Load test - Enable standard performance insights (#44694)
- Enable `standard` RDS database performance insights

## Summary by CodeRabbit

## Release Notes

* **Chores**
* Enhanced database monitoring capabilities by enabling Database
Insights for load testing infrastructure.

2026-05-04 16:23:38 -04:00
Jorge Falcon
1c95f5c886
Load test terraform fixes (#44678)
- Disable performance insights
- Allow redis instance count >=1
- Properly set ecs_cluster logging config path
- Targeted apply with auto approve for pre-creating fleet and execution
roles

## Summary by CodeRabbit

* **Chores**
* Enhanced ECS cluster logging with CloudWatch integration and extended
log retention to 365 days.
* Adjusted RDS monitoring configuration and disabled performance
insights for operational optimization.
* Reduced minimum Redis instance requirement from 3 to 1 for greater
deployment flexibility.

2026-05-04 13:59:01 -05:00
Jorge Falcon
473fbffff5
Terraform module updates (#43543) 2026-04-29 17:18:40 -05:00
Sharon Katz
6032c137e5
Bump Alpine base image to 3.23.4 to resolve openssl/musl/zlib CVEs (#43671) (#44097)
Resolves #43671.

Bumps the Alpine base image from 3.23.3 to 3.23.4 in the Dockerfiles
that produce published images, picking up patched openssl, musl, and
zlib packages. Follows the same pattern as #38977.

### CVEs resolved
- HIGH: CVE-2026-28388, CVE-2026-28389, CVE-2026-28390, CVE-2026-31790,
CVE-2026-2673, CVE-2026-40200
- MEDIUM: CVE-2026-27171, CVE-2026-6042, CVE-2026-22184

### Test plan
- CI image build passes.
- Trivy/ECR scan on the resulting fleetdm/fleet image confirms the nine
listed CVEs are gone.

## Summary by CodeRabbit

* **Chores**
* Updated Docker base images to Alpine 3.23.4 across infrastructure and
deployment components for improved stability and security.

2026-04-23 23:15:53 -03:00
Lucas Manuel Rodriguez
682202444c
Update go to 1.26.2 and update tooling to update it (#43771)
Golang 1.26.2 has been released. It fixes some CVEs:
https://github.com/golang/go/issues?q=milestone%3AGo1.26.2+label%3ACherryPickApproved

## Summary by CodeRabbit

* **Chores**
* Updated Go toolchain to 1.26.2 across the repository and build
configs.
  * Updated Docker build images to use Go 1.26.2.
* Expanded the set of tracked modules for the Go version update so
additional module files are included in automated updates.
2026-04-20 13:40:57 -03:00
Jorge Falcon
75f79dc866
Loadtest osquery perf workflow wording and enroll.sh remainder updates (#43762)
- Updates wording in `.github/workflows/loadtest-osquery-perf.yml` 
  - `4098` -> `4096`
- Removes: `(should be a multiple of 8, if setting
loadtest_containers_starting_index)`
- Updates `infrastructure/loadtesting/terraform/osquery_perf/enroll.sh`
to handle values that are not multiples of 8. If the value is not a
multiple of 8, logic has been added to apply the remainder.
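
The remainder handling described above can be sketched roughly as follows. This is not the actual `enroll.sh`; the function name `batch_targets` and its three positional arguments (starting index, total container count, step size) are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the real enroll.sh.
# Prints the container count to apply at each scaling step when moving
# from START to TOTAL in increments of STEP.
batch_targets() {
  local start=$1 total=$2 step=$3
  local current=$start
  # Full-size steps while they still fit under the total.
  while [ $((current + step)) -le "$total" ]; do
    current=$((current + step))
    echo "$current"
  done
  # Apply the remainder so the final target count is always reached.
  if [ "$current" -lt "$total" ]; then
    echo "$total"
  fi
}
```

For example, `batch_targets 0 20 8` prints `8`, `16`, and `20`, so a total that is not a multiple of 8 still ends at the requested count.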

## Summary by CodeRabbit

## Release Notes

* **Documentation**
* Updated load testing workflow configuration input descriptions for
improved clarity of parameters and their usage examples.

* **Bug Fixes**
* Fixed container count allocation logic in the load testing process to
ensure the final target count is always properly applied, even when
using increment values that don't divide evenly into the specified total
range.

2026-04-20 12:01:23 -04:00
Jorge Falcon
34cb7ab6d1
Loadtest internal alb logging and osquery-perf scaling updates (#42581)
- Configures internal alb to log to the same bucket as the public alb
- Adds support for osquery-perf task size (cpu/memory) configuration
- Updates defaults for osquery-perf extra_flags
- Updates default enroll.sh loop sleep_time from 60s -> 300s
2026-03-31 11:15:07 -04:00
Jorge Falcon
2d09916f60
Fix loadtest/infra docker_image resource (#42537)
**Related issue:** Resolves # N/A

- Resolves an issue that prevents some locally pulled docker images from
being pushed to ECR.
2026-03-27 01:17:37 -04:00
Jorge Falcon
42b02483d4
Dogfood & Loadtest - Updating mysql engine version to 8.0.mysql_aurora.3.10.3 (#42120)
- Bumps Dogfood and Loadtest environment Aurora MySQL engine version
from `8.0.mysql_aurora.3.08.2` -> `8.0.mysql_aurora.3.10.3`
2026-03-19 21:05:24 -05:00
Jorge Falcon
115e00decd
Configure software_installers defaults in Loadtest terraform (#41207)
- Adds software_installers {} configuration to loadtest terraform
- Modifies template/cloudfront.tf.disabled to use pkcs#8 format for the
private key
2026-03-19 20:17:54 -04:00
Victor Lyuboslavsky
ecee908157
Bumping signoz resources for 100K hosts loadtest. (#41961) 2026-03-19 12:49:36 -05:00
Victor Lyuboslavsky
fbc5b9d8b6
Updated go to 1.26.1 (#42027)
**Related issue:** Resolves #41749

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
2026-03-19 07:01:00 -05:00
Jorge Falcon
45c4e47fab
Dogfood and loadtest - mysql require secure transport on (#40211)
- Adds require_secure_transport for mysql connections to the db_cluster
parameter group for dogfood and loadtest environments.

```
    db_cluster_parameters = {
      require_secure_transport = "ON"
    }
```
2026-02-20 15:57:10 -05:00
Robert Fairburn
dac2ef18f0
Ensure terraform docker compatibility with github actions (#39988)
Co-authored-by: Jorge Falcon <22119513+BCTBB@users.noreply.github.com>
2026-02-17 15:09:50 -05:00
Robert Fairburn
9f60dadae0
Allow gzip responses (#39700) 2026-02-12 10:24:49 -06:00
Jorge Falcon
502351dcde
Add FLEET_MYSQL_READ_REPLICA_TLS_CONFIG environment variable to dogfood and loadtesting (#39692)
- Adds `FLEET_MYSQL_READ_REPLICA_TLS_CONFIG = "custom"` to dogfood and
loadtesting environments.
2026-02-11 13:05:11 -05:00
Ian Littman
d4906dd3d6
Update to Go 1.25.7 (#39584)
- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files)
for more information.
2026-02-09 17:47:51 -06:00
Victor Lyuboslavsky
0ae909fedf
Updated loadtest OTEL config to match dogfood (#38991)
**Related issue:** Resolves #36494

I tried this with loadtest.
2026-01-29 10:18:02 -06:00
Ian Littman
ec06952245
Bump Alpine (to 3.23.3), Go (to 1.25.6) to resolve vulns (#38973) 2026-01-28 18:51:15 -06:00
Jorge Falcon
7ac24d8752
Loadtest (new) - MDM Updates (#37420)
- Adds `FLEET_DEV_MDM_APPLE_DISABLE_PUSH = 1`
- Adds `FLEET_DEV_MDM_APPLE_DISABLE_DEVICE_INFO_CERT_VERIFY = 1`
- Updates osquery_perf/README.md with an example of fetching and
using the MDM SCEP challenge secret.
2025-12-17 17:55:13 -05:00
Ian Littman
62755cbd82
Bump Go to 1.25.5, Alpine to 3.23.0 where relevant, bump Trivy to current version (#36848)
Fixes vulns reported in
https://github.com/fleetdm/fleet/actions/runs/19999992703. We'll
definitely want to at least cherry-pick this.
2025-12-07 20:04:14 -06:00
Victor Lyuboslavsky
6ab79dd5a7
Add more software to loadtest (#35756)
**Related issue:** Resolves #34677 and #35932

Adding ~450K software to the loadtest, including scripts to add more
software in the future.
Software is held in a `software.sql` file, which is used to create a
sqlite DB during osquery perf run/deployment.

# Checklist for submitter

## Testing

- [x] QA'd all new/changed functionality manually

## Summary by CodeRabbit

* **New Features**
* Added support for loading software data from an external SQLite
database via a new `--software_db_path` command-line flag for more
realistic simulation scenarios.
* Added import and SQL generation tools to build and manage custom
software libraries.

* **Documentation**
* Added comprehensive README with setup instructions, tool usage, and
end-to-end workflow guidance for the software library.

2025-11-21 10:42:19 -06:00
Jorge Falcon
e0be06fa76
Loadtesting - osquery deployment session timeout increase (#36097) 2025-11-20 21:08:52 -05:00
Jorge Falcon
fb5c90ad9c
Dogfood and loadtest module updates (#35990)
Updates `module.main` to `1.18.3` (dogfood)
- Adds `memory_tracking_target_value = 70` (dogfood and loadtesting)
- Adds `cpu_tracking_target_value = 70`  (dogfood and loadtesting)

Updates `module.migrations` to `2.2.1` (dogfood)
- Adds `max_capacity`

Updates `module.logging_alb` to `1.6.2` (dogfood)

Updates `module.monitoring` to `1.8.0` (dogfood)
- Adds `log_monitoring` configuration
2025-11-19 22:04:17 -05:00
Victor Lyuboslavsky
f8ce47ec88
Grouping OTEL exceptions by type. (#35794)
**Related issue:** Resolves #34677 

Changing the OTEL setup to group exceptions by type, which is an
OTEL/industry best practice.
2025-11-19 10:24:19 -06:00
George Karr
ca5d02d471
Adding changes for Fleet v4.76.1 (#35760) 2025-11-18 14:35:31 -06:00
Victor Lyuboslavsky
3029a3ac44
Adjusting OTEL resources for high throughput. (#35878) 2025-11-18 06:53:49 -06:00
Jorge Falcon
0b0c67a5d5
Loadtest - osquery_perf scaling fixes (#35798)
- Removes timestamp from osquery_perf image
- Adds `default: 0` to loadtest osquery_perf workflow, `variable:
loadtest_containers_starting_index`
- Adds `variable: sleep_time` to loadtest osquery_perf workflow
- Adds osquery_perf docker repository in ECR
- Adds support for `sleep_time` to `enroll.sh`
- Updates terraform variables to enforce `git_branch` or `git_tag` for
osquery_perf
2025-11-17 10:21:18 -05:00
Robert Fairburn
caf9e83968
Configure osquery-perf memory and cpu (#35786) 2025-11-14 14:35:42 -06:00
Jorge Falcon
776cd67647
Loadtest - Firehose logging removal, adds filesystem logging, and module updates (#35735)
- Removes `firehose` logging from loadtesting environment
- Sets `filesystem` logging in loadtesting environment
- Updates fleet image to 4.76.0 as the default value
- Updates `migrations` and `logging_alb` modules with latest versions
2025-11-13 19:16:00 -05:00
Jorge Falcon
e2085bfd86
Loadtesting documentation - Removes (Coming Soon) from README (#35649)
- Removes `(Coming Soon)` from
`infrastructure/loadtesting/infra/README.md` with regard to deployment
via GitHub Actions
- Moves Signoz steps to `.header.md` to preserve steps in generated
`README.md`
2025-11-12 16:54:14 -05:00
Jorge Falcon
0471b8ce19
Loadtest - osquery_perf - Removal of fleet_image requirement (#35365)
- Adds support for `enroll.sh`, to deploy osquery_perf in batches
- Merges variables `tag` and `git_branch` into `git_tag_branch`. Only
one tag or git_branch should be specified.
  - Still used for osquery_perf to check out the correct tag/branch.
- Removes fleet_image requirement for cutting osquery_perf images

---------

Co-authored-by: Robert Fairburn <8029478+rfairburn@users.noreply.github.com>
2025-11-10 16:16:20 -05:00
Victor Lyuboslavsky
73501e5755
Infra changes after latest loadtest (#35083)
**Related issue:** Resolves #34500

Terraform changes after my latest loadtest.

VPC consolidation: updated (and deployed) shared VPC so that Signoz
backend can now use it

  - Removed eks-vpc/ directory
  - Moved VPC management to shared/vpc.tf
  - Updated shared/init.tf to reflect VPC changes

  Infra improvements

- infra/internal_alb.tf - changed suffix from -internal to -int since I
hit the 32-character maximum

OTEL

- OTEL Collector configuration overrides for production stability
2025-11-03 11:02:15 -06:00
Victor Lyuboslavsky
072ee68eda
Updating to Go 1.25.3 (#35082) 2025-11-03 09:47:07 -06:00
Jorge Falcon
6ea9185c1c
Loadtesting - osquery_perf docker image build fixes (#34901)
- Bumps docker provider from 2.16.0 to 3.6.x
- Moves builds from `docker_registry_image` to new `docker_image`
resource
2025-10-29 08:33:46 -04:00
Robert Fairburn
1fedabe7a8
Update alpine base image to latest (#34864)
Resolves openssl:3.3.3/CVE-2025-9230 in base images.
2025-10-28 11:24:05 -05:00
Robert Fairburn
30c4798ec6
Switch git providers for loadtesting tf (#34180)
Untested end-to-end, but works as a replacement for plans and doesn't
require a local arm64 build to work.

Co-authored-by: Jorge Falcon <22119513+BCTBB@users.noreply.github.com>
2025-10-23 14:53:13 -04:00
Victor Lyuboslavsky
e4e3c3f9ff
Fix issues with OTEL SigNoz deployments for loadtests (#34694)
SigNoz has been converted from a child module to a standalone root module
with independent state.

  **Critical Impact**

  Deployment order is now required:
  1. Deploy infrastructure/loadtesting/terraform/signoz/ FIRST
  2. Then deploy infrastructure/loadtesting/terraform/infra/

  Communication between modules via Terraform remote state.
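
A minimal sketch of the required order, assuming both root modules are applied with the stock `terraform` CLI. The `TF_ROOT` and `RUN` variables and the `deploy_in_order` function are hypothetical conventions for this example, not part of the repository, and workspace selection is left out.

```shell
#!/usr/bin/env bash
# Sketch only: prints (or, with RUN=1, runs) terraform in the required order.
# TF_ROOT and RUN are hypothetical conventions, not part of the repo.
TF_ROOT="${TF_ROOT:-infrastructure/loadtesting/terraform}"

deploy_in_order() {
  local dir
  # signoz must be applied before infra, which reads its remote state.
  for dir in signoz infra; do
    if [ "${RUN:-0}" = "1" ]; then
      terraform -chdir="$TF_ROOT/$dir" init || return 1
      terraform -chdir="$TF_ROOT/$dir" apply || return 1
    else
      # Dry run: only show the command that would be executed.
      echo "terraform -chdir=$TF_ROOT/$dir apply"
    fi
  done
}
```

Calling `deploy_in_order` without `RUN=1` only prints the two `apply` commands, which makes it easy to check the order before touching any state.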

  **Key Configuration Changes**

  - SigNoz creates its own EKS cluster: signoz-${workspace}
- Instance type: t3.xlarge (upgraded from t3.large for resource
headroom)
  - ClickHouse disk: 200Gi (was 20Gi) with 2-day retention
  - Resource limits configured to prevent OOMKills during loadtest
  - wait_for_jobs = false to avoid Helm deployment deadlock


**Related issue:** Resolves #32331
2025-10-23 12:49:36 -05:00
Victor Lyuboslavsky
aef9b8400c
Added terraform files for Signoz OTEL backend. (#34058)
**Related issue:** Resolves #32331 

This PR allows us to run loadtest with SigNoz OTEL backend by adding
`-var=enable_otel=true`
SigNoz is deployed via Helm chart.

Enhancements needed (in future PR):
- put SigNoz UI behind VPN
- combine the new eks-vpc with shared fleet-vpc
- make SigNoz shared, so multiple loadtests use the same instance? (But
what about updating it to the latest version?)

Next steps:
- Enable SigNoz in Dogfood environment
- SigNoz by default [keeps 15 days of logs and
traces](https://signoz.io/docs/userguide/retention-period), which is
quite a bit. How much would that cost us and should we reduce it?

## Summary by CodeRabbit

- New Features
- Optional OpenTelemetry tracing with SigNoz via a new enable_otel flag.
- Conditional deployment of a SigNoz stack (managed EKS, storage,
Helm-based apps) with internal OTLP collector endpoint.
- New outputs to retrieve OTLP endpoint, cluster name, and a kubectl
configuration command.

- Documentation
  - Added guidance for deploying and using SigNoz with load testing.
  - Updated examples to include -var=enable_otel=true.

- Chores
- Introduced required providers to support Helm and Kubernetes
resources.

2025-10-10 21:53:04 -05:00
Jorge Falcon
c0f753cb83
Updated permissions for GHA role - load test environment (#34059)
* Fixes missing STS permission on the load test environment GHA role
2025-10-09 15:10:52 -04:00
Jorge Falcon
e952ef06c0
Loadtesting IAC updates (#32629)
# Github Actions (New)
- New workflow to deploy/destroy loadtest infrastructure with one-click
(Needs to be tested)
- Common inputs drive configuration and deployment of loadtest
infrastructure
    - tag
    - fleet_task_count
    - fleet_task_memory
    - fleet_task_cpu
    - fleet_database_instance_size
    - fleet_database_instance_count
    - fleet_redis_instance_size
    - fleet_redis_instance_count
    - terraform_workspace
    - terraform_action
- New workflow to deploy/destroy osquery-perf to loadtest infrastructure
with one-click (Needs to be tested)
- Common inputs drive configuration and deployment of osquery-perf
resources
    - tag
    - git_branch
    - loadtest_containers
    - extra_flags
    - terraform_workspace
    - terraform_action
- New workflow to deploy shared loadtest resources with one-click (Needs
to be tested)

# Loadtest Infrastructure (New)
- New directory (`infrastructure/loadtesting/terraform/infra`) for
one-click deployment
- Loadtest environment updated to use [fleet-terraform
modules](https://github.com/fleetdm/fleet-terraform)
- [Deployment documentation
updated](0c254bca40/infrastructure/loadtesting/terraform/infra/README.md)
to reflect new steps

# Osquery-perf deployment (New)
- New directory (`infrastructure/loadtesting/terraform/osquery-perf`)
for the deployment of osquery-perf
- osquery-perf updated to use [fleet-terraform
modules](https://github.com/fleetdm/fleet-terraform)
- [Deployment documentation
updated](0c254bca40/infrastructure/loadtesting/terraform/osquery_perf)
to reflect new steps
2025-10-08 15:31:37 -04:00
Konstantin Sykulev
9e5c632c4c
Updating osquery perf loadtest infrastructure (#34003)
Bumping memory and CPU on AWS load test containers. Creating multiple ECS
services, each with a single task. This allows us to specify different
settings per osquery perf container/task.

**Related issue:** No issue.
2025-10-08 13:28:33 -05:00
Victor Lyuboslavsky
abc912bd03
Updated go to 1.25.1 (#32833) 2025-09-11 18:31:39 -05:00
Lucas Manuel Rodriguez
d849e01add
Update Go to 1.24.6 (#31784)
Ran
```
make update-go version=1.24.6
```
And then updated the `sha256`s manually in the Dockerfiles.

Fixes https://nvd.nist.gov/vuln/detail/CVE-2025-47907
```
Cancelling a query (e.g. by cancelling the context passed to one of the query methods) during a call
to the Scan method of the returned Rows can result in unexpected results if other queries are being
made in parallel. This can result in a race condition that may overwrite the expected results with those
of another query, causing the call to Scan to return either unexpected results from the other
query or an error.
```
2025-08-12 08:10:05 -03:00
Jorge Falcon
9618d72b54
Loadtesting MySQL engine_version update (#31351)
- MySQL engine version bumped from 8.0.mysql_aurora.3.07.1 ->
8.0.mysql_aurora.3.08.2
2025-07-29 12:02:49 -04:00
Janis Watts
7085ad2a74
Update enable cloudfront directions (#31152)
Just a couple small changes to help with the instructions
2025-07-22 16:31:12 -05:00
Jorge Falcon
dcf68ccd09
Loadtesting - Cloudfront iam fix (#31145)
- Added the missing IAM permission for tasks to access the cloudfront secret
2025-07-22 15:07:26 -04:00
Jorge Falcon
3a112afdb6
Loadtesting - Enable Cloudfront (#31073)
# Added
- Added kms.tf to support encrypting keys, specifically cloudfront keys.
- Added template/cloudfront.tf.disabled for use in enabling cloudfront.
- Modified ecs-iam.tf to support log-alb.tf, cloudfront.tf policies that
are injected into `local.extra_execution_iam_policies` and `local.iam`.
- Added log-alb.tf to enable logging alb, required by cloudfront.tf.

# Changed
- Modified ecs.tf to support adding of additional secrets from
`local.secrets`.
- Modified firehose.tf to support provider required updates for
deprecated resource configurations.
- Modified init.tf to support `> v5.0` of `hashicorp/aws` provider.
- Modified locals.tf to add `extra_execution_iam_policies`, `iam`,
`software_installers_kms_policy`, `extra_secrets`, `secrets`, and
`cloudfront_key_basename`, to support cloudfront.
- Modified readme.md with instructions on how to enable cloudfront.tf
- Modified redis.tf to support provider required updates for deprecated
resource configurations
- Modified s3.tf to support kms keys and add kms iam.
- Modified terraform version in .github/workflows/tfvalidate.yml - 1.9.0
-> 1.10.4
2025-07-21 16:41:06 -04:00
Jorge Falcon
91cedf039d
Allow Loadtesting environment non-empty s3 bucket cleanup on terraform destroy (#30899)
* Modified resource aws_s3_bucket blocks to include `force_destroy =
true` in firehose.tf and s3.tf.
2025-07-16 12:15:27 -04:00
jacobshandling
555ae5441e
Update Go to 1.24.5 (#30770)
## #30730 
- Update Go version
- Update the docs for this process
- Confirmed `fleet`, `fleetctl`, and related docker images build
successfully
- Note that failing tests are unrelated: see [Slack
thread](https://fleetdm.slack.com/archives/C019WG4GH0A/p1752175318523689)

---------

Co-authored-by: Jacob Shandling <jacob@fleetdm.com>
2025-07-15 10:59:17 -07:00