SigNoz converted from child module to standalone root module with
independent state.
**Critical Impact**
Deployment order is now required:
1. Deploy infrastructure/loadtesting/terraform/signoz/ FIRST
2. Then deploy infrastructure/loadtesting/terraform/infra/
The two modules communicate via Terraform remote state.
**Key Configuration Changes**
- SigNoz creates its own EKS cluster: signoz-${workspace}
- Instance type: t3.xlarge (upgraded from t3.large for resource
headroom)
- ClickHouse disk: 200Gi (was 20Gi) with 2-day retention
- Resource limits configured to prevent OOMKills during loadtest
- `wait_for_jobs = false` to avoid a Helm deployment deadlock (see the sketch below)
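
For illustration only, a minimal sketch of that Helm release pattern; everything except `wait_for_jobs` (the repository URL, resource name, and namespace) is an assumption, not this module's actual code:

```hcl
# Hypothetical sketch -- only wait_for_jobs reflects the change above.
resource "helm_release" "signoz" {
  name       = "signoz"
  repository = "https://charts.signoz.io" # assumed SigNoz chart repo
  chart      = "signoz"
  namespace  = "signoz"

  # Don't block the apply waiting for Helm hook jobs; with
  # wait_for_jobs = true the apply can deadlock on jobs that only
  # complete after later resources exist.
  wait_for_jobs = false
}
```
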
**Related issue:** Resolves #32331
# SigNoz for Fleet Loadtesting

SigNoz provides OpenTelemetry tracing for Fleet loadtest environments. It's deployed as a standalone Terraform root module to ensure it's available before Fleet starts up.

## Architecture

- **EKS Cluster**: Per-workspace (e.g., `signoz-victor-baseline`)
- **Kubernetes**: v1.31
- **Node group**: 2x t3.xlarge nodes (see the sketch after this list)
- **Components**:
  - SigNoz UI (public LoadBalancer on port 8080)
  - OTLP Collector (internal LoadBalancer on port 4317)
  - ClickHouse (200Gi storage)
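
As a rough illustration of the shape above, a hypothetical Terraform sketch; the resource names, IAM roles, and subnet variable are assumptions, not this module's actual code:

```hcl
# Hypothetical sketch of the per-workspace cluster described above.
# IAM roles and subnet IDs are assumed inputs, not this module's code.
resource "aws_eks_cluster" "signoz" {
  name     = "signoz-${terraform.workspace}" # e.g., signoz-victor-baseline
  version  = "1.31"
  role_arn = aws_iam_role.cluster.arn # assumed cluster IAM role

  vpc_config {
    subnet_ids = var.private_subnet_ids # assumed shared-VPC subnets
  }
}

resource "aws_eks_node_group" "signoz" {
  cluster_name    = aws_eks_cluster.signoz.name
  node_group_name = "signoz"
  node_role_arn   = aws_iam_role.nodes.arn # assumed node IAM role
  subnet_ids      = var.private_subnet_ids
  instance_types  = ["t3.xlarge"]

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 2
  }
}
```
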
## Deployment order

**IMPORTANT**: SigNoz must be deployed BEFORE the main Fleet infrastructure to capture telemetry from Fleet's initial bootup.

1. Deploy the shared EKS VPC (one time, shared across workspaces; it should already be deployed)
2. Deploy SigNoz (this directory)
3. Deploy the Fleet infrastructure (`../infra`)

## Usage

```bash
# 1. Initialize and select workspace
cd infrastructure/loadtesting/terraform/signoz
terraform init
terraform workspace new <workspace_name> # Match your infra workspace

# 2. Deploy SigNoz
terraform apply

# 3. Wait for deployment to complete (~10-15 minutes)
# The OTLP collector endpoint will be shown in outputs

# 4. Now deploy Fleet infrastructure
cd ../infra
terraform apply
```

## Accessing SigNoz UI

```bash
# Get the SigNoz UI URL
terraform output -raw get_signoz_ui_url | bash

# Or configure kubectl and access directly
$(terraform output -raw configure_kubectl)
kubectl get svc -n signoz signoz -o jsonpath='http://{.status.loadBalancer.ingress[0].hostname}:8080'
```

|
|
|
|
## Managing storage and retention

**IMPORTANT**: ClickHouse has limited storage. To prevent running out of space:

1. **Reduce trace retention period** in the SigNoz UI:
   - Navigate to Settings → Retention Period
   - Lower the retention period for traces (default may be too long for loadtesting)
   - Consider 1-3 days for active loadtest environments

2. **Monitor ClickHouse storage**:

   ```bash
   # Check ClickHouse pod storage usage
   kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- df -h /var/lib/clickhouse

   # Check database sizes
   kubectl exec -n signoz chi-signoz-clickhouse-cluster-0-0-0 -- clickhouse-client --query "SELECT database, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active GROUP BY database ORDER BY sum(bytes_on_disk) DESC"
   ```

3. **What happens when storage is full**:
   - ClickHouse will reject new writes
   - **New traces will NOT be captured**
   - OTEL collector will log errors about failed writes
   - Fleet will continue running but traces will be lost

## Outputs

The main Fleet infrastructure reads these outputs via remote state (a consumer sketch follows the list):

- `cluster_name`: EKS cluster name
- `otel_collector_endpoint`: Internal OTLP endpoint for Fleet to send traces
- `configure_kubectl`: Command to configure kubectl access
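
For illustration, a hypothetical sketch of how `../infra` might consume these outputs; the backend type, bucket, key, and region are placeholders rather than this module's actual backend configuration:

```hcl
# Hypothetical consumer sketch -- backend settings are placeholders.
data "terraform_remote_state" "signoz" {
  backend   = "s3"
  workspace = terraform.workspace # match the SigNoz workspace

  config = {
    bucket = "example-terraform-state"              # assumed
    key    = "loadtesting/signoz/terraform.tfstate" # assumed
    region = "us-east-1"                            # assumed
  }
}

locals {
  # Internal OTLP endpoint Fleet sends traces to.
  otlp_endpoint = data.terraform_remote_state.signoz.outputs.otel_collector_endpoint
}
```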
## Destroying

```bash
terraform destroy
```