hyperdx

mirror of https://github.com/hyperdxio/hyperdx synced 2026-04-21 13:37:15 +00:00

Author	SHA1	Message	Date
Warren Lee	0a4fb15df2	[HDX-4029] Add commonly-used core and contrib components to OTel Collector builder-config (#2121 ) ## Summary Update `packages/otel-collector/builder-config.yaml` to include commonly-used components from the upstream [opentelemetry-collector](https://github.com/open-telemetry/opentelemetry-collector) core and [opentelemetry-collector-contrib](https://github.com/open-telemetry/opentelemetry-collector-contrib) distributions. This gives users more flexibility in their custom OTel configs without pulling in the entire contrib distribution (which causes very long compile times). Also adds Go module and build cache mounts to the OCB Docker build stage for faster rebuilds, and bumps CI timeouts for integration and smoke test jobs to account for the larger binary. ### Core extensions added (2) - `memorylimiterextension` — memory-based limiting at the extension level - `zpagesextension` — zPages debugging endpoints ### Contrib receivers added (4) - `dockerstatsreceiver` — container metrics from Docker - `filelogreceiver` — tail log files - `k8sclusterreceiver` — Kubernetes cluster-level metrics - `kubeletstatsreceiver` — node/pod/container metrics from kubelet ### Contrib processors added (12) - `attributesprocessor` — insert/update/delete/hash attributes - `cumulativetodeltaprocessor` — convert cumulative metrics to delta - `filterprocessor` — drop unwanted telemetry - `groupbyattrsprocessor` — reassign resource attributes - `k8sattributesprocessor` — enrich telemetry with k8s metadata - `logdedupprocessor` — deduplicate repeated log entries - `metricstransformprocessor` — rename/aggregate/transform metrics - `probabilisticsamplerprocessor` — percentage-based sampling - `redactionprocessor` — mask/remove sensitive data - `resourceprocessor` — modify resource attributes - `spanprocessor` — rename spans, extract attributes - `tailsamplingprocessor` — sample traces based on policies ### Contrib extensions added (1) - `filestorage` — persistent file-based 
storage (used by clickhouse exporter sending queue in EE OpAMP controller) ### Other changes - Docker cache mounts: Added `--mount=type=cache` for Go module and build caches in the OCB builder stage of both `docker/otel-collector/Dockerfile` and `docker/hyperdx/Dockerfile` - CI timeouts: Bumped `integration` and `otel-smoke-test` jobs from 8 to 16 minutes in `.github/workflows/main.yml` All existing HyperDX-specific components are preserved unchanged. ### How to test locally or on Vercel 1. Build the OTel Collector Docker image — verify OCB resolves all listed modules 2. Provide a custom OTel config that uses one of the newly-added components and verify it loads 3. Verify existing HyperDX OTel pipeline still functions ### References - Linear Issue: https://linear.app/clickhouse/issue/HDX-4029 - Upstream core builder-config: https://github.com/open-telemetry/opentelemetry-collector/blob/main/cmd/otelcorecol/builder-config.yaml - Upstream contrib builder-config: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/otelcontribcol/builder-config.yaml	2026-04-15 15:57:44 +00:00
Warren Lee	cb841457f2	[HDX-3994] Deprecate clickhouse.json feature gate in favor of per-exporter json config (#2119 ) ## Summary Deprecate the upstream-deprecated `--feature-gates=clickhouse.json` CLI flag in favor of the per-exporter `json: true` config option, as recommended by the OpenTelemetry ClickHouse exporter v0.149.0. This introduces a new env var `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE` that controls JSON mode at the exporter config level. The old `OTEL_AGENT_FEATURE_GATE_ARG` env var remains backward-compatible — when it contains `clickhouse.json`, the entrypoint strips that gate, maps it to the new env var, and prints a deprecation warning. Other feature gates are preserved and passed through to the collector. Key changes: - `docker/otel-collector/entrypoint.sh` — Detects `clickhouse.json` in `OTEL_AGENT_FEATURE_GATE_ARG`, strips it, sets `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE=true`, and prints a deprecation warning. Remaining feature gates are still passed through to the collector in both standalone and supervisor modes. - `docker/otel-collector/config.standalone.yaml` — Added `json: ${env:HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE:-false}` to both ClickHouse exporter configs - `packages/api/src/opamp/controllers/opampController.ts` — Added `json` field to the `CollectorConfig` type and both ClickHouse exporter configs for OpAMP-managed collectors - `docker/otel-collector/supervisor_docker.yaml.tmpl` — Feature gate pass-through preserved for non-`clickhouse.json` gates (entrypoint strips the deprecated gate before supervisor template renders) - `smoke-tests/otel-collector/` — Added a JSON-enabled otel-collector service and smoke tests verifying: - `ResourceAttributes` and `LogAttributes` columns in `otel_logs` are `JSON` type (not `Map`) - Log data with various attribute types (string, int, boolean) is inserted and queryable via JSON path access ### How to test locally or on Vercel 1. Run `yarn dev` to start the dev stack 2. 
Verify the `otel-collector-json` container starts without errors (the `clickhouse.json` feature gate is stripped, not passed to the collector) 3. Check container logs for the deprecation warning when `OTEL_AGENT_FEATURE_GATE_ARG` contains `clickhouse.json` 4. Verify the non-JSON `otel-collector` service continues to work normally (json defaults to false) 5. Run smoke tests: `cd smoke-tests/otel-collector && bats json-exporter.bats` ### References - Linear Issue: https://linear.app/hyperdx/issue/HDX-3994 - Upstream deprecation: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter#experimental-json-support	2026-04-15 15:49:45 +00:00
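The gate-stripping behavior described in the commit above can be sketched as follows. The real logic lives in the shell script `docker/otel-collector/entrypoint.sh`, which is not shown here, so this Python version is only an illustration. The function name `split_feature_gates` and the gate name `foo.bar` are hypothetical, and the comma-separated gate list is an assumption about the argument format.

```python
def split_feature_gates(arg: str) -> tuple[str, bool]:
    """Strip the deprecated clickhouse.json gate from a --feature-gates argument.

    Returns (remaining_arg, json_enabled). Hypothetical helper, not the
    actual entrypoint.sh implementation.
    """
    prefix = "--feature-gates="
    if not arg.startswith(prefix):
        # Nothing we recognize: pass the argument through untouched.
        return arg, False
    gates = [g for g in arg[len(prefix):].split(",") if g]
    json_enabled = "clickhouse.json" in gates
    remaining = [g for g in gates if g != "clickhouse.json"]
    # Drop the flag entirely when no other gates are left.
    remaining_arg = (prefix + ",".join(remaining)) if remaining else ""
    return remaining_arg, json_enabled


if __name__ == "__main__":
    rest, enabled = split_feature_gates("--feature-gates=foo.bar,clickhouse.json")
    print(rest, enabled)
```

In this sketch, the boolean would drive `HYPERDX_OTEL_EXPORTER_CLICKHOUSE_JSON_ENABLE`, and the remaining argument would be passed through to the collector.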
Warren Lee	28f374ef12	[HDX-3929] Migrate OTel Collector build to use OCB (OpenTelemetry Collector Builder) (#2109 )	2026-04-14 12:04:20 -07:00
Vineet Ahirkar	941d045077	feat: support sample-weighted aggregations for sampled trace data (#1963 ) ## Problem High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a `SampleRate` attribute recording N. Once data is sampled, naive aggregations are wrong: count() returns N times fewer events than actually occurred, sum()/avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable. ### Why Materialized Views Cannot Solve This Alone A materialized view that pre-aggregates sampled spans is a useful performance optimization for known dashboard queries, but it cannot replace a sampling-aware query engine. Fixed dimensions. A materialized view pre-aggregates by a fixed set of GROUP BY keys (e.g. `ServiceName`, `SpanName`, `StatusCode`, `TimestampBucket`). Trace exploration requires slicing by arbitrary span attributes -- `http.target`, `k8s.pod.name`, custom business tags -- in combinations that cannot be predicted at view creation time. Grouping by a different dimension requires either going back to the raw table or maintaining a separate materialized view for every possible dimension combination. If you try to work around the fixed-dimensions problem by adding high-cardinality span attributes to the GROUP BY, the materialized table approaches a 1:1 row ratio with the raw table. You end up doubling storage without meaningful compression. Fixed aggregation fields. A typical MV only aggregates a single numeric column like `Duration`. Users want weighted aggregations over any numeric attribute: request body sizes, queue depths, retry counts, custom metrics attached to spans. Each new field requires adding more `AggregateFunction` columns and recreating the view. Industry precedent. 
Platforms that rely solely on pre-aggregation (Datadog, Splunk, New Relic, Elastic) get accurate RED dashboards but cannot correct ad-hoc queries over sampled span data. Only query-engine weighting (Honeycomb) produces correct results for arbitrary ad-hoc queries, including weighted percentiles and heatmaps. A better solution is making the query engine itself sampling-aware, so that all queries from dashboards, alerts, and ad-hoc searches automatically weight by `SampleRate` regardless of which dimensions or fields the user picks. Materialized views remain a useful complement for accelerating known, fixed-dimension dashboard panels, but they are not a substitute for correct query-time weighting. ## Summary TraceSourceSchema gets a new optional field `sampleRateExpression` - the ClickHouse expression that evaluates to the per-span sample rate (e.g. `SpanAttributes['SampleRate']`). When not configured, all queries are unchanged. When set, the query builder rewrites SQL aggregations to weight each span by its sample rate: aggFn \| Before \| After (sample-corrected) \| Overhead -------------- \| ---------------------- \| --------------------------------------------------- \| -------- count \| count() \| sum(weight) \| ~1x count + cond \| countIf(cond) \| sumIf(weight, cond) \| ~1x avg \| avg(col) \| sum(col * weight) / sum(weight) \| ~2x sum \| sum(col) \| sum(col * weight) \| ~1x quantile(p) \| quantile(p)(col) \| quantileTDigestWeighted(p)(col, toUInt32(weight)) \| ~1.5x min/max \| unchanged \| unchanged \| 1x count_distinct \| unchanged \| unchanged (cannot correct) \| 1x Types: - Add sampleRateExpression to TraceSourceSchema + Mongoose model - Add sampleWeightExpression to ChartConfig schema Query builder: - sampleWeightExpression is wrapped as greatest(toUInt64OrZero(toString(expr)), 1) so spans without a SampleRate attribute default to weight 1 (unsampled data produces identical results to the original queries). 
- Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression is set, with safe default-to-1 wrapping Integration (propagate sampleWeightExpression to all chart configs): - ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions - DBDashboardPage (raw SQL + builder branches) - AlertPreviewChart - SessionSubpanel - ServiceDashboardEndpointPerformanceChart - ServiceDashboardSlowestEventsTile (p95 query + events table) - ServiceDashboardEndpointSidePanel (error rate + throughput) - ServiceDashboardDbQuerySidePanel (total query time + throughput) - External API v2 charts, AI controller, alerts (index + template) UI: - Add Sample Rate Expression field to trace source admin form	2026-03-30 19:52:18 +00:00
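The sample-corrected rewrites for count, sum, and avg in the commit above can be illustrated with a small in-memory sketch. It is not the query builder itself, which emits ClickHouse SQL. The helper names and the dictionary-shaped spans are assumptions made for illustration, and `sample_weight` only approximates `greatest(toUInt64OrZero(toString(expr)), 1)` (it does not model UInt64 overflow).

```python
def sample_weight(raw):
    """Approximate greatest(toUInt64OrZero(toString(expr)), 1).

    Missing or unparsable values fall back to weight 1, so unsampled
    spans count exactly once.
    """
    text = "" if raw is None else str(raw)
    value = int(text) if text.isascii() and text.isdigit() else 0
    return max(value, 1)


def weighted_count(spans):
    # count() becomes sum(weight)
    return sum(sample_weight(s.get("SampleRate")) for s in spans)


def weighted_sum(spans, col):
    # sum(col) becomes sum(col * weight)
    return sum(s[col] * sample_weight(s.get("SampleRate")) for s in spans)


def weighted_avg(spans, col):
    # avg(col) becomes sum(col * weight) / sum(weight)
    return weighted_sum(spans, col) / weighted_count(spans)


if __name__ == "__main__":
    spans = [
        {"Duration": 100, "SampleRate": "10"},  # stands in for 10 original spans
        {"Duration": 300},                      # unsampled, weight 1
    ]
    print(weighted_count(spans), weighted_sum(spans, "Duration"))
```

With one span kept at 1-in-10 and one unsampled span, the corrected count is 11 rather than the naive 2.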
Warren Lee	470b2c2992	ci: Replace QEMU with native ARM64 runners for release builds (#1952 ) ## Summary - Replace QEMU-emulated multi-platform builds with native ARM64 runners for both `release.yml` and `release-nightly.yml`, significantly speeding up CI build times - Each architecture (amd64/arm64) now builds in parallel on native hardware, then a manifest-merge job combines them into a multi-arch Docker tag using `docker buildx imagetools create` - Migrate from raw Makefile `docker buildx build` commands to `docker/build-push-action@v6` for better GHA integration ## Changes ### `.github/workflows/release.yml` - Removed QEMU setup entirely - Replaced single `release` matrix job with per-image build+publish job pairs: - `build-otel-collector` / `publish-otel-collector` (runners: `ubuntu-latest` / `ubuntu-latest-arm64`) - `build-app` / `publish-app` (runners: `Large-Runner-x64-32` / `Large-Runner-ARM64-32`) - `build-local` / `publish-local` (runners: `Large-Runner-x64-32` / `Large-Runner-ARM64-32`) - `build-all-in-one` / `publish-all-in-one` (runners: `Large-Runner-x64-32` / `Large-Runner-ARM64-32`) - Added `check_version` job to centralize skip-if-exists logic (replaces per-image `docker manifest inspect` in Makefile) - Removed `check_release_app_pushed` artifact upload/download — `publish-app` now outputs `app_was_pushed` directly - Scoped GHA build cache per image+arch (e.g. 
`scope=app-amd64`) to avoid collisions - All 4 images build in parallel (8 build jobs total), then 4 manifest-merge jobs, then downstream notifications ### `.github/workflows/release-nightly.yml` - Same native runner pattern (no skip logic since nightly always rebuilds) - 8 build + 4 publish jobs running in parallel - Slack failure notification and OTel trace export now depend on publish jobs ### `Makefile` - Removed `release-` and `release--nightly` targets (lines 203-361) — build logic moved into workflow YAML - Local `build-` targets preserved for developer use ## Architecture Follows the same pattern as `release-ee.yml` in the EE repo: ``` check_changesets → check_version │ ┌───────────────────┼───────────────────┬───────────────────┐ v v v v build-app(x2) build-otel(x2) build-local(x2) build-aio(x2) │ │ │ │ publish-app publish-otel publish-local publish-aio │ │ │ │ └─────────┬─────────┴───────────────────┴───────────────────┘ v notify_helm_charts / notify_clickhouse_clickstack │ otel-cicd-action ``` ## Notes - `--squash` flag dropped — it's an experimental Docker feature incompatible with `build-push-action` in multi-platform mode. `sbom` and `provenance` are preserved via action params. - Per-arch intermediate tags (e.g. `hyperdx/hyperdx:2.21.0-amd64`) remain visible on DockerHub — this is standard practice. - Dual DockerHub namespace tagging (`hyperdx/` + `clickhouse/clickstack-*`) preserved. ## Sample Run https://github.com/hyperdxio/hyperdx/actions/runs/23362835749	2026-03-20 23:04:49 +00:00
Dan Hable	a0b3361a85	[HDX-2712] Unified hyperdx entrypoint script for API and tasks (#1951 ) ## Summary The node commands to start the API server and alert task are duplicated across 4+ files, each hardcoding the build output path and node require flags. When the build process changed (esbuild introduction/revert per HDX-2690), the downstream operator and helm chart broke because their entrypoint commands were stale. This PR introduces `packages/api/bin/hyperdx`, a single shell script that is the sole source of truth for how to launch API and task processes. It resolves the build directory relative to its own location, applies the correct node flags (`-r @hyperdx/node-opentelemetry/build/src/tracing`), and exposes two subcommands: - `hyperdx api` -- starts the API server - `hyperdx task <name>` -- runs a named task (e.g., `check-alerts`) All Dockerfiles and entry scripts now delegate to this script instead of inlining the node command. Future build changes only need updating in one place. ### How to test locally or on Vercel 1. Build the standalone API image and confirm the entrypoint works: ```bash docker build . -f packages/api/Dockerfile -t hyperdx-api-test:latest --target prod docker run -d --name hdx-api-test -p 18000:8000 hyperdx-api-test:latest sleep 5 docker logs hdx-api-test 2>&1 \| head -30 # Should show OpenTelemetry init + MongoStore error (expected without Mongo) # No "file not found" or "permission denied" errors docker stop hdx-api-test && docker rm hdx-api-test ``` 2. 
Build and run the all-in-one image for a full integration test: ```bash make build-local docker run -d --name hdx-aio-test -p 18080:8080 -p 18000:8000 hyperdx/hyperdx-local:2.21.0 # Wait up to 90s for startup, then: curl -sf http://localhost:18080/api/health # should return {"data":"OK",...} curl -sf http://localhost:18000/health # should return {"data":"OK",...} docker exec hdx-aio-test sh -c "ps aux" # Confirm API, APP, and ALERT-TASK processes are running via the hyperdx script docker stop hdx-aio-test && docker rm hdx-aio-test ``` 3. Build the prod image to confirm the entry script changes are valid: ```bash make build-app ``` Testing performed: All three Docker image targets were built and verified locally. The standalone API image started node via `hyperdx api` correctly (crashed on missing MongoDB as expected). The all-in-one image passed health checks on both `localhost:18080/api/health` and `localhost:18000/health`, with all three processes (API, APP, ALERT-TASK) confirmed running inside the container using the new entry point script. ### References - Linear Issue: [HDX-2712](https://linear.app/clickhouse/issue/HDX-2712/use-a-single-entry-point-script-for-both-hyperdx-api-and-alert-job) - Related PRs: HDX-2690 (root cause), HDX-2815 (downstream helm chart follow-up) - Follow-up needed: Update helm chart cron job template and operator template in `ClickHouse/ClickStack-helm-charts` to use `./packages/api/bin/hyperdx task check-alerts` Made with [Cursor](https://cursor.com)	2026-03-20 18:27:40 +00:00
Warren Lee	25a3291f57	feat: Attach service version to all internal telemetry (#1891 ) ## Summary - Bump `@hyperdx/browser` to 0.22.0 and pass `service.version` as an OTel resource attribute to the browser SDK, so all frontend telemetry includes the app version - Inject `OTEL_RESOURCE_ATTRIBUTES=service.version=$CODE_VERSION` in API and all-in-one Dockerfiles (prod) and `.env.development` (dev) so the Node OTel SDK attaches service version to all backend traces, metrics, and logs - Remove the standalone Next.js health endpoint — `/api/health` now proxies through to the API server. The original `/api/health` page was redundant since `/api/config` already serves the same purpose	2026-03-12 16:50:46 +00:00
Warren Lee	53a4b67262	chore: update otel collector base image to 0.147.0 (#1845 ) ## Summary - Bump OpenTelemetry Collector base images (`opentelemetry-collector-contrib` and `opentelemetry-collector-opampsupervisor`) from 0.145.0 to 0.147.0 - Updated in both `docker/otel-collector/Dockerfile` and `docker/hyperdx/Dockerfile`	2026-03-04 20:18:16 +00:00
Rahul	ef66cba8cd	build(deps): add security resolutions for vulnerable npm packages (#1740 ) ## Summary Addresses npm security vulnerabilities in transitive dependencies. Prefer direct dependency upgrades over broad resolutions where possible. ## Changes Direct upgrade: - `@slack/webhook`: `^6.1.0` → `^7.0.0` — v7 natively uses axios v1, eliminating the axios@0.21.4 SSRF/redirect vulnerabilities. Only breaking change in v7 is dropping Node <18 (we're on Node 22). Resolutions for transitive deps with no direct upgrade path: - `fast-xml-parser`: `^4.4.0` — fixes prototype pollution (High) - `systeminformation`: `^5.24.0` — fixes command injection (High) ## Removed/Not Done - `axios` resolution removed — covered by the `@slack/webhook` upgrade instead - `tar` resolution removed — was a v6→v7 major jump on build-only tools (`cacache`, `node-gyp`); not present in the production image - `glob` resolution removed — was breaking test coverage tooling (`test-exclude@6` depends on glob@^7) ## Related Follow-up to #1731 which addressed base image vulnerabilities (Node, Go, ClickHouse).	2026-02-26 02:14:24 +00:00
Aaron Knudtson	8772f5e294	chore: update clickhouse version for compose files to 26.1 (#1791 )	2026-02-24 15:24:43 -05:00
Warren Lee	36da6ff4d8	chore: resolve collector CVE-2025-15467 (#1761 )	2026-02-19 11:45:50 -08:00
Warren Lee	4c42fdc3a4	fix(otel-collector): improve log level extraction with word boundaries in regex (#1747 ) For a log line like ``` x-amz-id-2: WxwS/N175wqLyRlzCXLpGZGszCEbQA0f63uFgdQN1qfcPr2IAmwE/P7HF2b1NdZLg18pNLF3ecTw5CrItXJid/uLe+fxh3jMBiJ7UlUxidw= ``` The level will be inferred as fatal because it contains `CrIt`, which is incorrect. To fix this, we need to add a word boundary at the start Ref: HDX-3439 CLAUDE: made a mistake. ``` ❌ Test expects "ALERTING" to match "alert" keyword → "ALERTING" won't match with word boundary because "alert" is a substring, not at a word boundary. Expected should be "info",9,"ALERTING system engaged" not "fatal",21. ``` -> incorrect statement	2026-02-18 22:16:07 +00:00
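The effect of the leading word boundary can be checked with a short sketch. The keyword list here is a reduced, hypothetical one and the pattern is not the collector's actual configuration. It only shows why `CrIt` inside a token stops matching, while `ALERTING` still matches because the boundary is applied at the start of the keyword only.

```python
import re

# Hypothetical, reduced keyword list; the real collector pattern is not shown here.
KEYWORDS = r"(alert|crit|emerg|fatal)"

naive = re.compile(KEYWORDS, re.IGNORECASE)
bounded = re.compile(r"\b" + KEYWORDS, re.IGNORECASE)  # word boundary at the start only

header = (
    "x-amz-id-2: WxwS/N175wqLyRlzCXLpGZGszCEbQA0f63uFgdQN1qfcPr2IAmwE/"
    "P7HF2b1NdZLg18pNLF3ecTw5CrItXJid/uLe+fxh3jMBiJ7UlUxidw="
)

if __name__ == "__main__":
    print(bool(naive.search(header)))                        # substring hit on "CrIt"
    print(bool(bounded.search(header)))                      # no hit once \b is required
    print(bool(bounded.search("ALERTING system engaged")))   # "alert" starts a word
```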
Warren Lee	18c2b37599	fix: Fallback to legacy schema when CH JSON feature gate is on (#1748 ) Currently, users need to set an extra flag, `HYPERDX_OTEL_EXPORTER_CREATE_LEGACY_SCHEMA=true`, to enable it. Ideally, the JSON schema should be created whenever the feature gate is enabled (`OTEL_AGENT_FEATURE_GATE_ARG='--feature-gates=clickhouse.json'`). Ref: HDX-3428	2026-02-18 16:42:44 +00:00
Rahul	b991e7bd37	fix: improve Docker Scout scores for clickstack images (#1731 ) Updates base images and patches vulnerable dependencies: - Node.js 22.16.0 -> 22.22-alpine - Go 1.25 -> 1.26-alpine - Express 4.19.2 -> 4.22.1 - Cookie, send, serve-static, and other npm packages - Fix ENV format warnings in Dockerfile Reduces vulnerabilities from 178 to 168 (9C, 52H, 98M, 9L). Tested: all services start correctly, health checks pass.	2026-02-13 18:21:19 +00:00
Tom Alexander	75ff28dd68	chore: Use local clickhouse instance for playwright tests (#1711 ) TLDR: This PR changes playwright full-stack tests to run against a local clickhouse instance (with seeded data) instead of relying on the clickhouse demo server, which can be unpredictable at times. This workflow allows us to fully control the data to make tests more predictable. This PR: * Adds local CH instance to the e2e dockerfile * Adds a schema creation script * Adds a data seeding script * Updates playwright config * Updates various tests to change hardcoded fields, metrics, or areas relying on play demo data * Updates github workflow to use the dockerfile instead of separate services * Runs against a local clickhouse instead of the demo server Fixes: HDX-3193	2026-02-13 15:43:12 +00:00
Rahul	ebbfa2410e	fix: improve Docker Scout score for otel-collector image (#1727 ) - Upgrade OTel collector-contrib and opampsupervisor from 0.136.0 to 0.145.0 to resolve Go stdlib CVEs from outdated binaries - Pin Alpine base to 3.21 with fresh digest replacing stale alpine:latest pin - Add HEALTHCHECK to both dev and prod stages using the health_check extension on port 13133 - Fix Makefile otel-collector build targets to use repo-root context with -f flag, matching the repo-root relative COPY paths Followup from #1697 #1698	2026-02-11 20:50:22 +00:00
Drew Davis	c3bc43add1	fix: Avoid using bodyExpression for trace sources (#1722 ) Closes HDX-3361 # Summary This PR prevents various query errors caused by references to `bodyExpression` on trace sources. The `bodyExpression` should not exist on trace sources, and cannot be edited in the source form. Despite that, the `bodyExpression` would be set on trace sources during source inference. - The `getEventBody` helper function will now correctly use the `spanNameExpression` field instead for trace sources. A few direct references to `bodyExpression` have been updated to `getEventBody` calls. - Source configuration inference will no longer populate the `bodyExpression` for trace sources, and the default trace source will not be created with a `bodyExpression`.	2026-02-11 13:01:12 +00:00
Warren Lee	629fb52edc	feat: introduce `HYPERDX_OTEL_EXPORTER_TABLES_TTL` (ClickStack OTel collector) (#1720 ) - Users can configure table TTLs via `HYPERDX_OTEL_EXPORTER_TABLES_TTL`, which defaults to 720h. - Add TTL to metric tables Ref: HDX-3365	2026-02-10 16:00:38 +00:00
Adrian Philipp	5c895ff34a	fix: allow overriding default connections (#1710 ) Co-authored-by: Aaron Knudtson <87577305+knudtty@users.noreply.github.com> Co-authored-by: Warren Lee <5959690+wrn14897@users.noreply.github.com>	2026-02-10 07:56:28 +01:00
Warren Lee	baf18da4c0	feat: add TLS support for OTel collector migration script (#1714 ) Moved the inline goose CLI script to its own go script. For the seed DDLs, we don’t create the version tables, and they should all be idempotent.	2026-02-10 02:40:28 +00:00
Warren Lee	3dae0e012f	fix: copy otel-collector schema directory to AIO image (#1700 )	2026-02-04 12:27:23 -08:00
Hannes Leutloff	8f1026089d	fix: Set correct github URL as image source in Dockerfiles (#1698 ) I went ahead and looked for more occurrences of the issue I raised in #1697 and fixed them. I hope that's alright with you.	2026-02-04 16:11:31 +00:00
Warren Lee	683ec1a80e	fix: add TLS parameters for https ClickHouse endpoints in goose DB string (#1689 ) Need to add `secure=true&skip_verify=false` TLS params for https (CHC) Ref: https://github.com/pressly/goose/pull/796/changes	2026-02-02 23:13:56 +00:00
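A minimal sketch of the rule in the commit above: TLS query parameters are added only when the ClickHouse endpoint uses https. The function name `goose_tls_params` and the example hostnames are hypothetical. How the parameters are attached to the rest of the goose DB string is not shown in the commit and is left out here.

```python
from urllib.parse import urlparse


def goose_tls_params(endpoint: str) -> str:
    """Return the TLS query parameters for an https endpoint, else an empty string.

    Hypothetical helper; the parameter values come from the commit message.
    """
    if urlparse(endpoint).scheme == "https":
        return "secure=true&skip_verify=false"
    return ""


if __name__ == "__main__":
    print(goose_tls_params("https://example.invalid:8443"))
    print(repr(goose_tls_params("http://localhost:8123")))
```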
Warren Lee	c2a6193393	feat: add OTLP auth token support for standalone mode (#1684 ) Ref: HDX-3317	2026-02-02 17:25:38 +00:00
Warren Lee	6f4c8efba0	feat: Enforce ClickStack schemas by default (#1682 ) - Introduce a new flag `HYPERDX_OTEL_EXPORTER_CREATE_LEGACY_SCHEMA` (default to false) to otel collector - Custom ClickStack schemas should be enforced by default - ClickHouse tables migration logs should be stored in `clickstack_db_version_xxx` tables - The collector will run the migration at startup and retry if it fails to connect to the database (using exponential backoff). - Fully backward compatible Ref: HDX-3301	2026-02-02 16:39:20 +00:00
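The retry behavior mentioned in the commit above (run the migration at startup, retry with exponential backoff if the database is unreachable) might look like the following sketch. The function name, the attempt limit, the base delay, and the use of `ConnectionError` are all assumptions. The collector's real values and error handling are not given in the commit.

```python
import time


def run_with_backoff(migrate, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call migrate(), retrying on ConnectionError with doubling delays.

    All parameters are illustrative defaults, not the collector's settings.
    """
    for attempt in range(max_attempts):
        try:
            return migrate()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_migration():
        # Fails twice, then succeeds, to exercise the retry path.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("database not ready")
        return "migrated"

    print(run_with_backoff(flaky_migration, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.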
Rahul	0d321ea15f	update Docker images for Docker Scout Score improvements (#1680 ) Co-authored-by: Himanshu Kapoor <himanshu.kapoor@clickhouse.com>	2026-01-29 16:37:22 -06:00
Warren Lee	43de467864	feat: allow otel-collector to run without OpAMP server (#1672 ) Today, users have to set up an OpAMP server to run with our clickstack OTel collector. Instead, we should allow users to disable OpAMP when they're using ClickHouse Cloud with the clickstack integration. This can be determined by `OPAMP_SERVER_URL` not being defined by the user. The end result is that a user can do ``` docker run \ -e CLICKHOUSE_ENDPOINT=${CLICKHOUSE_ENDPOINT} \ -e CLICKHOUSE_USER=default \ -e CLICKHOUSE_PASSWORD=${CLICKHOUSE_PASSWORD} \ -p 8080:8080 -p 4317:4317 -p 4318:4318 \ clickhouse/clickstack-otel-collector:latest ``` Ref: HDX-3300	2026-01-29 17:50:24 +00:00
Dan Hable	d07e30d5fb	feat: associate logged in user to clickhouse query (#1636 ) Allows setting a custom setting prefix on a connection. When set in HyperDX and the ClickHouse settings, the HyperDX app will set a custom setting for each query. These are recorded in the query log and can be used to identify which user issued the query. ## Testing The commit also updates the local dev ClickHouse instance to support a custom setting prefix of `hyperdx`. After running `make dev-up`, you should be able to edit the connection and set the prefix to `hyperdx`. <img width="955" height="197" alt="Screenshot 2026-01-21 at 1 23 14 PM" src="https://github.com/user-attachments/assets/607fc945-d93f-4976-9862-3118b420c077" /> After saving, just allow the app to live tail a source like logs. If you connect to the ClickHouse database, you should then be able to run ``` SELECT query, Settings FROM system.query_log WHERE has(mapKeys(Settings), 'hyperdx_user') FORMAT Vertical ``` and then see a bunch of queries with the user set to your logged in user. ``` Row 46: ─────── query: SELECT Timestamp, ServiceName, SeverityText, Body, TimestampTime FROM default.otel_logs WHERE (TimestampTime >= fromUnixTimestamp64Milli(_CAST(1769022372269, 'Int64'))) AND (TimestampTime <= fromUnixTimestamp64Milli(_CAST(1769023272269, 'Int64'))) ORDER BY (TimestampTime, Timestamp) DESC LIMIT _CAST(0, 'Int32'), _CAST(200, 'Int32') FORMAT JSONCompactEachRowWithNamesAndTypes Settings: {'use_uncompressed_cache':'0','load_balancing':'in_order','log_queries':'1','max_memory_usage':'10000000000','cancel_http_readonly_queries_on_client_close':'1','parallel_replicas_for_cluster_engines':'0','date_time_output_format':'iso','hyperdx_user':'\'dan@hyperdx.io\''} ```	2026-01-28 14:58:05 +00:00
Dale McDiarmid	66f56cb1d0	chore: Move schema configs to file (#1635 ) Co-authored-by: Tom Alexander <tom.alexander@clickhouse.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-28 14:52:54 +01:00
Drew Davis	1cf8cebb4b	feat: Support JSON Sessions (#1628 )	2026-01-21 19:25:39 -05:00
Dan Hable	6537884825	build: remove lingering references to log-rotator.sh (#1520 ) We removed the rotator script when we used the named pipe approach to the otel collector logging. There were some references left over that caused the docker build to fail.	2025-12-23 16:22:46 +00:00
Dan Hable	0bb7407747	fix: support otel collector logging with minimal storage (#1509 ) This commit sets up a FIFO named pipe with the same name/path that the otel collector and supervisor are expecting. By starting to tail that pipe before starting the collector, we can send log files to stdio without the memory required by the `passthrough_logs` feature or any storage on the volume. --- Running locally in orbstack, we're still seeing logs on stdout <img width="2316" height="1452" alt="image" src="https://github.com/user-attachments/assets/f86961cf-6ea4-4faa-82f8-54f9596b5f16" /> but the file size for `agent.log` remains at 0 <img width="1606" height="838" alt="image" src="https://github.com/user-attachments/assets/6feb9470-e220-4d8a-b5b3-10a221926158" /> disk usage stats also remain stable <img width="1728" height="860" alt="image" src="https://github.com/user-attachments/assets/ff5fe593-4936-446e-9396-606fb495d60d" />	2025-12-19 19:18:33 +00:00
Daniel Lockyer	99ea6395cf	chore: drop npx prefix from concurrently commands (#1505 ) Co-authored-by: Drew Davis <drew.davis@clickhouse.com>	2025-12-19 11:55:02 -05:00
Dan Hable	7a53880356	Revert "fix(otel-collector): fix log rotation script (#1479 )" (#1495 ) This reverts commit `0b19f915e8`.	2025-12-17 17:27:51 +00:00
Dan Hable	0b19f915e8	fix(otel-collector): fix log rotation script (#1479 ) There were two issues with the log rotation script: 1. Logs could be lost since copying and then truncating the file might not finish before logs arrive. 2. The otel collector application keeps the file handle and offset cached. After truncating, it will write starting at the last offset, leaving unallocated garbage at the beginning of the file. This garbage uses space. This commit moves the file instead of copying. That allows the collector to continue writing to the rolled file until a SIGHUP is sent. This causes a config refresh, which also opens a new log file. After, the rolled file and the new log file have correct sizes. -- ADDITIONAL NOTES: Claude's code review is not accurate here. * The alpine image is based on busybox and fuser is a command implemented by busybox. This can be verified by just running the collector and watching the log rotate behavior. * The mv command updates the name of the file in the file system but doesn't change the inode number. A process only uses the file path the first time the file is opened to resolve it into an inode number. Moving the file changes the name but doesn't change the inode number, so the process will continue to write to that file.	2025-12-15 16:08:11 +00:00
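The inode argument in the commit above can be demonstrated on a POSIX filesystem with a short sketch. It renames an open log file within the same directory and shows that the writer's handle keeps appending to the renamed file. The file name `agent.log.1` and the function name are hypothetical, and the SIGHUP step that makes the collector reopen its log is not modeled.

```python
import os
import tempfile


def rotate_by_rename(directory):
    """Rename an open log file and keep writing through the old handle.

    Returns (same_inode, rolled_contents, original_path_exists).
    Assumes a POSIX filesystem and a rename within one directory.
    """
    log_path = os.path.join(directory, "agent.log")
    rolled_path = os.path.join(directory, "agent.log.1")  # hypothetical rolled name

    handle = open(log_path, "a")
    handle.write("before rotation\n")
    handle.flush()
    inode_before = os.stat(log_path).st_ino

    os.rename(log_path, rolled_path)   # same effect as mv on one filesystem
    handle.write("after rotation\n")   # the handle still points at the same inode
    handle.flush()
    handle.close()

    inode_after = os.stat(rolled_path).st_ino
    with open(rolled_path) as rolled:
        contents = rolled.read()
    return inode_before == inode_after, contents, os.path.exists(log_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(rotate_by_rename(tmp))
```

Nothing is lost between the rename and the reopen, which is the property the copy-then-truncate approach could not guarantee.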
Tom Alexander	52d2798582	chore: Update to next 16, react 19, add react compiler (#1434 ) fixes: HDX-2956 Co-authored-by: Brandon Pereira <7552738+brandon-pereira@users.noreply.github.com>	2025-12-04 23:40:59 +00:00
Jarrad	7cf4ba4d70	allow configuring the app's listen address with HYPERDX_APP_LISTEN_HOSTNAME (#1344 )	2025-11-25 16:22:07 +01:00
Aaron Knudtson	19c5085cde	chore: split json otel collector to enable both during dev (#1247 ) Gets us closer to a staging instance of json <img width="216" height="174" alt="image" src="https://github.com/user-attachments/assets/b5cc3cf8-aef0-4ba4-9e9a-8c1d4fad5451" /> Co-authored-by: Warren <5959690+wrn14897@users.noreply.github.com>	2025-11-04 21:16:41 +00:00
Ruud Kamphuis	c6ad250f3d	Enable auto-provisioning for no-auth mode (#1297 ) Co-authored-by: Aaron Knudtson <87577305+knudtty@users.noreply.github.com>	2025-10-29 09:42:39 -04:00
Warren	131a1c1edb	revert: api esbuild (#1280 ) This PR reverts https://github.com/hyperdxio/hyperdx/pull/937 Ref: HDX-2620	2025-10-21 09:27:47 +00:00
Brandon Pereira	e032af5509	attempt to ensure otel collector logs go to stdout (#1228 )	2025-10-01 11:51:24 -06:00
Drew Davis	45e8e1b62d	fix: Update tsconfigs to resolve IDE type errors (#1150 )	2025-09-11 08:55:14 -04:00
Aaron Knudtson	8568580127	feat: add custom ingestion key via INGESTION_API_KEY (#1112 ) Closes HDX-2283 This adds an environment variable `INGESTION_API_KEY` that can be set by the user. This apiKey will be valid and accepted by the Otel collector. It is in addition to the autogenerated apiKey and will not show in the team settings apiKey section.	2025-08-29 20:10:46 +00:00
Warren	3636fc570d	style: update otelcol config file volume mount from dev stage (#1091 )	2025-08-21 14:03:45 +00:00
Warren	56fd856d7a	fix: otelcol process in aio build (#1085 )	2025-08-20 19:17:39 +00:00
Warren	d29e2bcb67	fix: handle the case when `CUSTOM_OTELCOL_CONFIG_FILE` is not specified (#1080 ) plus fixing startup issue when the team isn't created yet	2025-08-19 17:08:49 +00:00
Warren	ab50b12a6b	feat: support custom otel collector config (BETA) (#1074 ) plus the fix to reduce bloat in opamp agent logs Users should be able to mount the custom otel collector config file and add/override receivers, processors, and exporters For example: ``` receivers: hostmetrics: collection_interval: 5s scrapers: cpu: load: memory: disk: filesystem: network: # override the default processors processors: batch: send_batch_size: 10000 timeout: 10s memory_limiter: limit_mib: 2000 service: pipelines: metrics/hostmetrics: receivers: [hostmetrics] # attach existing processors processors: [memory_limiter, batch] # attach existing exporters exporters: [clickhouse] ``` This will add a new `hostmetrics` receiver + `metrics/hostmetrics` pipeline and update existing `batch` + `memory_limiter` processors WARNING: This feature is still in beta, and future updates may change how it works, potentially affecting compatibility Ref: HDX-1865	2025-08-18 21:22:43 +00:00
Warren	6c134035c4	fix: use '--kill-others-on-fail' to prevent processes from terminating when RUN_SCHEDULED_TASKS_EXTERNALLY is enabled (#1015 ) Ref: HDX-2044 Co-authored-by: Dan Hable <418679+dhable@users.noreply.github.com>	2025-07-24 21:56:38 +00:00
Mike Shi	ecb0f2c889	feat: Add JSON support to all in one build (#972 )	2025-07-03 22:54:59 +00:00
Mike Shi	52ca1823a4	feat: Add ClickHouse JSON Type Support (#969 ) - Upgrades ClickHouse to 25.6, fixes breaking config change, needed for latest JSON type - Upgrades OTel Collector to 0.129.1, fixes breaking config change, needed for latest JSON support in exporter - Upgrades OTel OpAMP Supervisor to 0.128.0 - Fixes features to support JSON type columns in OTel in HyperDX (filtering, searching, graphing, opening rows, etc.) Requires users to set `BETA_CH_OTEL_JSON_SCHEMA_ENABLED=true` in `ch-server` and `OTEL_AGENT_FEATURE_GATE_ARG='--feature-gates=clickhouse.json'` in `otel-collector` to enable JSON schema. Users must start a new ClickHouse DB or migrate their own table manually to enable as it is not schema compatible and migration is not automatic. Closes HDX-1849, HDX-1969, HDX-1849, HDX-1966, HDX-1964 Co-authored-by: Tom Alexander <3245235+teeohhem@users.noreply.github.com>	2025-07-03 17:11:03 +00:00


127 commits