## Summary
Update `packages/otel-collector/builder-config.yaml` to include commonly-used components from the upstream [opentelemetry-collector](https://github.com/open-telemetry/opentelemetry-collector) core and [opentelemetry-collector-contrib](https://github.com/open-telemetry/opentelemetry-collector-contrib) distributions. This gives users more flexibility in their custom OTel configs without pulling in the entire contrib distribution (which causes very long compile times).
Also adds Go module and build cache mounts to the OCB Docker build stage for faster rebuilds, and bumps CI timeouts for integration and smoke test jobs to account for the larger binary.
### Core extensions added (2)
- `memorylimiterextension` — memory-based limiting at the extension level
- `zpagesextension` — zPages debugging endpoints
### Contrib receivers added (4)
- `dockerstatsreceiver` — container metrics from Docker
- `filelogreceiver` — tail log files
- `k8sclusterreceiver` — Kubernetes cluster-level metrics
- `kubeletstatsreceiver` — node/pod/container metrics from kubelet
### Contrib processors added (12)
- `attributesprocessor` — insert/update/delete/hash attributes
- `cumulativetodeltaprocessor` — convert cumulative metrics to delta
- `filterprocessor` — drop unwanted telemetry
- `groupbyattrsprocessor` — reassign resource attributes
- `k8sattributesprocessor` — enrich telemetry with k8s metadata
- `logdedupprocessor` — deduplicate repeated log entries
- `metricstransformprocessor` — rename/aggregate/transform metrics
- `probabilisticsamplerprocessor` — percentage-based sampling
- `redactionprocessor` — mask/remove sensitive data
- `resourceprocessor` — modify resource attributes
- `spanprocessor` — rename spans, extract attributes
- `tailsamplingprocessor` — sample traces based on policies
### Contrib extensions added (1)
- `filestorage` — persistent file-based storage (used by clickhouse exporter sending queue in EE OpAMP controller)
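For reference, each of these components is registered in `builder-config.yaml` as an OCB Go-module entry. A representative excerpt might look like the following (module paths follow the upstream repos; the versions shown are placeholders, not necessarily what this PR pins):
```
# Illustrative excerpt of packages/otel-collector/builder-config.yaml (versions are placeholders)
extensions:
  - gomod: go.opentelemetry.io/collector/extension/zpagesextension v0.147.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage v0.147.0
receivers:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.147.0
processors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.147.0
```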
### Other changes
- **Docker cache mounts**: Added `--mount=type=cache` for Go module and build caches in the OCB builder stage of both `docker/otel-collector/Dockerfile` and `docker/hyperdx/Dockerfile`
- **CI timeouts**: Bumped `integration` and `otel-smoke-test` jobs from 8 to 16 minutes in `.github/workflows/main.yml`
All existing HyperDX-specific components are preserved unchanged.
### How to test locally or on Vercel
1. Build the OTel Collector Docker image — verify OCB resolves all listed modules
2. Provide a custom OTel config that uses one of the newly-added components and verify it loads
3. Verify existing HyperDX OTel pipeline still functions
### References
- Linear Issue: https://linear.app/clickhouse/issue/HDX-4029
- Upstream core builder-config: https://github.com/open-telemetry/opentelemetry-collector/blob/main/cmd/otelcorecol/builder-config.yaml
- Upstream contrib builder-config: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/otelcontribcol/builder-config.yaml
## Summary
Deprecate the `--feature-gates=clickhouse.json` CLI flag (already deprecated upstream) in favor of the per-exporter `json: true` config option, as recommended by the OpenTelemetry ClickHouse exporter v0.149.0.
This introduces a new env var `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE` that controls JSON mode at the exporter config level. The old `OTEL_AGENT_FEATURE_GATE_ARG` env var remains backward-compatible — when it contains `clickhouse.json`, the entrypoint strips that gate, maps it to the new env var, and prints a deprecation warning. Other feature gates are preserved and passed through to the collector.
**Key changes:**
- **`docker/otel-collector/entrypoint.sh`** — Detects `clickhouse.json` in `OTEL_AGENT_FEATURE_GATE_ARG`, strips it, sets `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE=true`, and prints a deprecation warning. Remaining feature gates are still passed through to the collector in both standalone and supervisor modes.
- **`docker/otel-collector/config.standalone.yaml`** — Added `json: ${env:HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE:-false}` to both ClickHouse exporter configs (see the snippet after this list)
- **`packages/api/src/opamp/controllers/opampController.ts`** — Added `json` field to the `CollectorConfig` type and both ClickHouse exporter configs for OpAMP-managed collectors
- **`docker/otel-collector/supervisor_docker.yaml.tmpl`** — Feature gate pass-through preserved for non-`clickhouse.json` gates (entrypoint strips the deprecated gate before supervisor template renders)
- **`smoke-tests/otel-collector/`** — Added a JSON-enabled otel-collector service and smoke tests verifying:
- `ResourceAttributes` and `LogAttributes` columns in `otel_logs` are `JSON` type (not `Map`)
- Log data with various attribute types (string, int, boolean) is inserted and queryable via JSON path access
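For illustration, the relevant part of the standalone ClickHouse exporter config ends up looking roughly like this (surrounding exporter fields are abbreviated and are assumptions; only the `json` line is the new addition):
```
exporters:
  clickhouse:
    # other exporter settings omitted for brevity
    json: ${env:HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE:-false}
```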
### How to test locally or on Vercel
1. Run `yarn dev` to start the dev stack
2. Verify the `otel-collector-json` container starts without errors (the `clickhouse.json` feature gate is stripped, not passed to the collector)
3. Check container logs for the deprecation warning when `OTEL_AGENT_FEATURE_GATE_ARG` contains `clickhouse.json`
4. Verify the non-JSON `otel-collector` service continues to work normally (json defaults to false)
5. Run smoke tests: `cd smoke-tests/otel-collector && bats json-exporter.bats`
### References
- Linear Issue: https://linear.app/hyperdx/issue/HDX-3994
- Upstream deprecation: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter#experimental-json-support
## Problem
High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a `SampleRate` attribute recording N.
Once data is sampled, naive aggregations are wrong: count() returns roughly N times fewer events than actually occurred, sum()/avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable.
### Why Materialized Views Cannot Solve This Alone
A materialized view that pre-aggregates sampled spans is a useful performance optimization for known dashboard queries, but it cannot replace a sampling-aware query engine.
**Fixed dimensions.** A materialized view pre-aggregates by a fixed set of GROUP BY keys (e.g. `ServiceName`, `SpanName`, `StatusCode`, `TimestampBucket`). Trace exploration requires slicing by arbitrary span attributes -- `http.target`, `k8s.pod.name`, custom business tags -- in combinations that cannot be predicted at view creation time. Grouping by a different dimension either requires going back to the raw table or creating a separate materialized view for every possible dimension combination. If you try to work around the fixed-dimensions problem by adding high-cardinality span attributes to the GROUP BY, the materialized table approaches a 1:1 row ratio with the raw table, and you end up doubling storage without meaningful compression.
**Fixed aggregation fields.** A typical MV only aggregates a single numeric column like `Duration`. Users want weighted aggregations over any numeric attribute: request body sizes, queue depths, retry counts, custom metrics attached to spans. Each new field requires adding more `AggregateFunction` columns and recreating the view.
**Industry precedent.** Platforms that rely solely on pre-aggregation (Datadog, Splunk, New Relic, Elastic) get accurate RED dashboards but cannot correct ad-hoc queries over sampled span data. Only query-engine weighting (Honeycomb) produces correct results for arbitrary ad-hoc queries, including weighted percentiles and heatmaps.
A better solution is to make the query engine itself sampling-aware, so that every query, whether from dashboards, alerts, or ad-hoc searches, automatically weights by `SampleRate` regardless of which dimensions or fields the user picks. Materialized views remain a useful complement for accelerating known, fixed-dimension dashboard panels, but they are not a substitute for correct query-time weighting.
## Summary
TraceSourceSchema gets a new optional field `sampleRateExpression` - the ClickHouse expression that evaluates to the per-span sample rate (e.g. `SpanAttributes['SampleRate']`). When not configured, all queries are unchanged.
When set, the query builder rewrites SQL aggregations to weight each span by its sample rate (e.g. a span kept under 1-in-20 sampling carries weight 20 and contributes 20 to `sum(weight)`):
| aggFn | Before | After (sample-corrected) | Overhead |
| :-- | :-- | :-- | :-- |
| count | `count()` | `sum(weight)` | ~1x |
| count + cond | `countIf(cond)` | `sumIf(weight, cond)` | ~1x |
| avg | `avg(col)` | `sum(col * weight) / sum(weight)` | ~2x |
| sum | `sum(col)` | `sum(col * weight)` | ~1x |
| quantile(p) | `quantile(p)(col)` | `quantileTDigestWeighted(p)(col, toUInt32(weight))` | ~1.5x |
| min/max | unchanged | unchanged | 1x |
| count_distinct | unchanged | unchanged (cannot correct) | 1x |
**Types**:
- Add `sampleRateExpression` to `TraceSourceSchema` + Mongoose model
- Add `sampleWeightExpression` to `ChartConfig` schema
**Query builder:**
- `sampleWeightExpression` is wrapped as `greatest(toUInt64OrZero(toString(expr)), 1)`, so spans without a `SampleRate` attribute default to weight 1 (unsampled data produces results identical to the original queries).
- Rewrite `aggFnExpr` in `renderChartConfig.ts` when `sampleWeightExpression` is set, with the same safe default-to-1 wrapping.
**Integration** (propagate sampleWeightExpression to all chart configs):
- ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions
- DBDashboardPage (raw SQL + builder branches)
- AlertPreviewChart
- SessionSubpanel
- ServiceDashboardEndpointPerformanceChart
- ServiceDashboardSlowestEventsTile (p95 query + events table)
- ServiceDashboardEndpointSidePanel (error rate + throughput)
- ServiceDashboardDbQuerySidePanel (total query time + throughput)
- External API v2 charts, AI controller, alerts (index + template)
**UI**:
- Add Sample Rate Expression field to trace source admin form
### Screenshots or video
| Before | After |
| :----- | :---- |
| | |
### How to test locally or on Vercel
1.
2.
3.
### References
- Linear Issue:
- Related PRs:
## Summary
- Bump OpenTelemetry Collector base images (`opentelemetry-collector-contrib` and `opentelemetry-collector-opampsupervisor`) from **0.145.0** to **0.147.0**
- Updated in both `docker/otel-collector/Dockerfile` and `docker/hyperdx/Dockerfile`
For a log line like
```
x-amz-id-2: WxwS/N175wqLyRlzCXLpGZGszCEbQA0f63uFgdQN1qfcPr2IAmwE/P7HF2b1NdZLg18pNLF3ecTw5CrItXJid/uLe+fxh3jMBiJ7UlUxidw=
```
The level will be inferred as `fatal` because the line contains the substring `CrIt`, which is incorrect.
To fix this, we need to add a word boundary at the start of the keyword match.
Ref: HDX-3439
Note: Claude's code review made a mistake here.
```
❌ Test expects "ALERTING" to match "alert" keyword → "ALERTING" won't match with word boundary because "alert" is a substring, not at a word boundary. Expected should be "info",9,"ALERTING system engaged" not "fatal",21.
```
This statement is incorrect: "ALERTING" begins with "alert" at a word boundary, so the keyword still matches.
Currently, users need to set an extra flag to enable it: `HYPERDX_OTEL_EXPORTER_CREATE_LEGACY_SCHEMA=true`. Ideally, the JSON schema should be created whenever the feature gate is enabled: `OTEL_AGENT_FEATURE_GATE_ARG='--feature-gates=clickhouse.json'`.
Ref: HDX-3428
- Upgrade OTel collector-contrib and opampsupervisor from 0.136.0 to 0.145.0 to resolve Go stdlib CVEs from outdated binaries
- Pin the Alpine base to 3.21 with a fresh digest, replacing the stale `alpine:latest` pin
- Add a HEALTHCHECK to both dev and prod stages using the health_check extension on port 13133 (see the config sketch below)
- Fix the Makefile otel-collector build targets to use the repo-root context with the `-f` flag, matching the repo-root-relative COPY paths
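As a sketch, the collector-side configuration backing that HEALTHCHECK is the standard `health_check` extension (the endpoint below is illustrative; the actual values in this repo may differ):
```
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
```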
Follow-up to #1697 and #1698.
- Introduce a new flag `HYPERDX_OTEL_EXPORTER_CREATE_LEGACY_SCHEMA` (defaults to false) for the otel collector
- Custom ClickStack schemas should be enforced by default
- ClickHouse table migration logs should be stored in `clickstack_db_version_xxx` tables
- The collector will run the migration at startup and retry if it fails to connect to the database (using exponential backoff).
- Fully backward compatible
Ref: HDX-3301
Today, users have to set up an OpAMP server to run our ClickStack OTel collector. Instead, we should allow users to disable OpAMP when they're using ClickHouse Cloud with the ClickStack integration.
This is determined by `OPAMP_SERVER_URL` not being defined by the user.
The end result is that a user can do
```
docker run \
-e CLICKHOUSE_ENDPOINT=${CLICKHOUSE_ENDPOINT} \
-e CLICKHOUSE_USER=default \
-e CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD} \
-p 8080:8080 -p 4317:4317 -p 4318:4318 \
clickhouse/clickstack-otel-collector:latest
```
Ref: HDX-3300
We removed the rotator script when we switched to the named-pipe approach for otel collector logging. Some leftover references caused the Docker build to fail.
There were two issues with the log rotation script:
1. Logs could be lost, since new log lines can arrive while the file is being copied and before it is truncated.
2. The otel collector keeps the file handle and write offset cached. After truncation it keeps writing at the last offset, leaving unallocated garbage at the beginning of the file; this garbage uses space.
This commit moves the file instead of copying it. That allows the collector to continue writing to the rolled file until a SIGHUP is sent, which triggers a config refresh and opens a new log file. Afterwards, the rolled file and the new log file both have correct sizes.
--
**ADDITIONAL NOTES**:
Claude's code review is not accurate here.
* The Alpine image is based on BusyBox, and `fuser` is a command implemented by BusyBox. This can be verified by running the collector and watching the log-rotation behavior.
* The `mv` command renames the file in the filesystem but doesn't change the inode number. A process only uses the file path the first time the file is opened, to resolve it to an inode; since moving the file changes the name but not the inode, the process continues writing to the same file.
Plus the fix to reduce bloat in the OpAMP agent logs.
Users should be able to mount a custom otel collector config file and add/override receivers, processors, and exporters.
For example:
```
receivers:
hostmetrics:
collection_interval: 5s
scrapers:
cpu:
load:
memory:
disk:
filesystem:
network:
# override the default processors
processors:
batch:
send_batch_size: 10000
timeout: 10s
memory_limiter:
limit_mib: 2000
service:
pipelines:
metrics/hostmetrics:
receivers: [hostmetrics]
# attach existing processors
processors: [memory_limiter, batch]
# attach existing exporters
exporters: [clickhouse]
```
This will add a new `hostmetrics` receiver + `metrics/hostmetrics` pipeline and update the existing `batch` + `memory_limiter` processors.
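As a sketch, mounting such a file in a compose setup could look like the following (the service name and container-side path are hypothetical; check the image documentation for the location the entrypoint actually reads):
```
# docker-compose override (illustrative; service name and paths are hypothetical)
services:
  otel-collector:
    volumes:
      - ./custom.otel-config.yaml:/etc/otelcol-contrib/custom.config.yaml
```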
WARNING: This feature is still in beta, and future updates may change how it works, potentially affecting compatibility
Ref: HDX-1865
- Upgrades ClickHouse to 25.6, fixes breaking config change, needed for latest JSON type
- Upgrades OTel Collector to 0.129.1, fixes breaking config change, needed for latest JSON support in exporter
- Upgrades OTel OpAMP Supervisor to 0.128.0
- Fixes features to support JSON type columns in OTel in HyperDX (filtering, searching, graphing, opening rows, etc.)
Requires users to set `BETA_CH_OTEL_JSON_SCHEMA_ENABLED=true` in `ch-server` and `OTEL_AGENT_FEATURE_GATE_ARG='--feature-gates=clickhouse.json'` in `otel-collector` to enable the JSON schema. Users must start a new ClickHouse DB or migrate their tables manually, as the new schema is not compatible with the old one and migration is not automatic.
Closes HDX-1849, HDX-1969, HDX-1966, HDX-1964
Co-authored-by: Tom Alexander <3245235+teeohhem@users.noreply.github.com>
Adds an environment variable to allow `passthrough_logs` to be enabled in the supervisor config.
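For reference, when the flag is enabled, the rendered supervisor config would contain something like the following (`passthrough_logs` is the upstream opampsupervisor field; the exact template wiring here is an assumption):
```
agent:
  # rendered from OTEL_SUPERVISOR_PASSTHROUGH_LOGS
  passthrough_logs: true
```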
Test locally:
1. In `docker-compose.dev.yaml`, add `OTEL_SUPERVISOR_PASSTHROUGH_LOGS: 'true'` under the otel service's environment variables
2. Run:
```
make dev-up
```
Ref: HDX-1859
- Support `CLICKHOUSE_ENDPOINT` to switch the all-in-one (aio) ClickHouse endpoint (Ref: HDX-1758)
- Support `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE` (Ref: HDX-1786)
Support commands like
```
docker run -e CLICKHOUSE_ENDPOINT=<CH-CLOUD-ENDPOINT> -e CLICKHOUSE_USER=default -e CLICKHOUSE_PASSWORD='BLABLA' -e HYPERDX_OTEL_EXPORTER_CLICKHOUSE_DATABASE=hyperdx -p 8080:8080 -p 4317:4317 -p 4318:4318 hyperdx/hyperdx-local:2-nightly
```
so users can export data to other services like ClickHouse Cloud.
For users connecting to ClickHouse Cloud or another TLS endpoint, add the `secure=true` query parameter or use the HTTPS protocol. Providing the full URL via `CLICKHOUSE_SERVER_ENDPOINT` as the exporter's endpoint field should resolve this issue.
Ref: HDX-1743
Updates the OTel pipeline to handle structured logs better. If the body content is an OTel map, it merges the body map into the log attributes map. If the body is a JSON object, it parses the JSON string into an OTel map, then merges the fields into the log attributes map.
Replacing the Body field doesn't work, since the ClickHouse exporter schema defines Body as a string, so any parsed-out object would just be serialized back into a string. Kept as log attributes, the fields are a lighter-weight means of grouping and filtering in the UI.
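To illustrate the behavior (this is not the actual implementation, which lives in the collector pipeline code), an equivalent transform-processor configuration using OTTL could look roughly like this:
```
processors:
  transform/merge-body:
    error_mode: ignore  # skip statements that fail, e.g. body is not valid JSON
    log_statements:
      - context: log
        statements:
          # body is already an OTel map: merge it into the log attributes
          - merge_maps(attributes, body, "upsert") where IsMap(body)
          # body is a JSON string: parse it and merge the parsed fields
          - merge_maps(attributes, ParseJSON(body), "upsert") where IsString(body)
```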
Ref: HDX-1453