Commit graph

13 commits

Author SHA1 Message Date
Vineet Ahirkar
941d045077
feat: support sample-weighted aggregations for sampled trace data (#1963)
## Problem

High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a `SampleRate` attribute recording N.

Once data is sampled, naive aggregations are wrong: count() undercounts the actual event volume by roughly a factor of N, sum() and avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable.
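The bias is easy to see on toy data. A minimal sketch (the span shape and values here are hypothetical, not the real schema):

```typescript
// Hypothetical sampled spans: each kept span records the SampleRate N
// under which it survived 1-in-N sampling, so it stands for ~N real spans.
type Span = { durationMs: number; sampleRate: number };

const spans: Span[] = [
  { durationMs: 10, sampleRate: 10 }, // represents ~10 real spans
  { durationMs: 20, sampleRate: 10 }, // represents ~10 real spans
  { durationMs: 500, sampleRate: 1 }, // kept unsampled (e.g. an error span)
];

// Naive count sees only the stored rows; the weighted count re-inflates
// each span by the population it represents: 10 + 10 + 1 = 21.
const naiveCount = spans.length;
const weightedCount = spans.reduce((acc, s) => acc + s.sampleRate, 0);

// The naive average is dominated by the rare unsampled outlier; the
// weighted average restores each span's true share of the traffic.
const naiveAvg = spans.reduce((a, s) => a + s.durationMs, 0) / naiveCount;
const weightedAvg =
  spans.reduce((a, s) => a + s.durationMs * s.sampleRate, 0) / weightedCount;
```

Here the naive count reports 3 events when roughly 21 occurred, and the naive average (~177ms) is pulled far above the weighted average (~38ms) by the single unsampled outlier.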

### Why Materialized Views Cannot Solve This Alone

A materialized view that pre-aggregates sampled spans is a useful performance optimization for known dashboard queries, but it cannot replace a sampling-aware query engine.

**Fixed dimensions.** A materialized view pre-aggregates by a fixed set of GROUP BY keys (e.g. `ServiceName`, `SpanName`, `StatusCode`, `TimestampBucket`). Trace exploration requires slicing by arbitrary span attributes -- `http.target`, `k8s.pod.name`, custom business tags -- in combinations that cannot be predicted at view creation time. Grouping by a different dimension requires either going back to the raw table or maintaining a separate materialized view for every possible dimension combination. Working around this by adding high-cardinality span attributes to the GROUP BY pushes the materialized table toward a 1:1 row ratio with the raw table: you end up doubling storage without meaningful compression.

**Fixed aggregation fields.** A typical MV only aggregates a single numeric column like `Duration`. Users want weighted aggregations over any numeric attribute: request body sizes, queue depths, retry counts, custom metrics attached to spans. Each new field requires adding more `AggregateFunction` columns and recreating the view.

**Industry precedent.** Platforms that rely solely on pre-aggregation (Datadog, Splunk, New Relic, Elastic) get accurate RED dashboards but cannot correct ad-hoc queries over sampled span data. Only query-engine weighting (Honeycomb) produces correct results for arbitrary ad-hoc queries, including weighted percentiles and heatmaps.

A better solution is to make the query engine itself sampling-aware, so that every query, whether issued from dashboards, alerts, or ad-hoc searches, automatically weights by `SampleRate` regardless of which dimensions or fields the user picks. Materialized views remain a useful complement for accelerating known, fixed-dimension dashboard panels, but they are not a substitute for correct query-time weighting.

## Summary

TraceSourceSchema gets a new optional field `sampleRateExpression` - the ClickHouse expression that evaluates to the per-span sample rate (e.g. `SpanAttributes['SampleRate']`). When not configured, all queries are unchanged.
When set, the query builder rewrites SQL aggregations to weight each span by its sample rate:

| aggFn          | Before                 | After (sample-corrected)                            | Overhead |
| :------------- | :--------------------- | :-------------------------------------------------- | :------- |
| count          | `count()`              | `sum(weight)`                                       | ~1x      |
| count + cond   | `countIf(cond)`        | `sumIf(weight, cond)`                               | ~1x      |
| avg            | `avg(col)`             | `sum(col * weight) / sum(weight)`                   | ~2x      |
| sum            | `sum(col)`             | `sum(col * weight)`                                 | ~1x      |
| quantile(p)    | `quantile(p)(col)`     | `quantileTDigestWeighted(p)(col, toUInt32(weight))` | ~1.5x    |
| min/max        | unchanged              | unchanged                                           | 1x       |
| count_distinct | unchanged              | unchanged (cannot correct)                          | 1x       |

**Types**:
- Add sampleRateExpression to TraceSourceSchema + Mongoose model
- Add sampleWeightExpression to ChartConfig schema

**Query builder:**
- sampleWeightExpression is wrapped as `greatest(toUInt64OrZero(toString(expr)), 1)`, so spans without a SampleRate attribute default to weight 1 (unsampled data produces results identical to the original queries)
- Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression is set, with safe default-to-1 wrapping
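The two pieces above can be sketched as follows (simplified; the real implementation in renderChartConfig.ts covers more aggregation functions and edge cases, and the function names here are illustrative):

```typescript
// Wrap the configured sampleRateExpression so that missing or malformed
// SampleRate values fall back to weight 1 (i.e. behave as unsampled data).
const wrapWeight = (expr: string): string =>
  `greatest(toUInt64OrZero(toString(${expr})), 1)`;

// Rewrite a subset of aggregations to their sample-corrected forms,
// mirroring the table above.
function rewriteAggFn(
  aggFn: 'count' | 'sum' | 'avg',
  col: string,
  weightExpr: string,
): string {
  const w = wrapWeight(weightExpr);
  switch (aggFn) {
    case 'count':
      // count() becomes the sum of per-span weights.
      return `sum(${w})`;
    case 'sum':
      return `sum(${col} * ${w})`;
    case 'avg':
      // Weighted mean: weighted sum divided by total weight.
      return `sum(${col} * ${w}) / sum(${w})`;
  }
}
```

For example, a `count` over a source configured with `SpanAttributes['SampleRate']` would be rewritten to `sum(greatest(toUInt64OrZero(toString(SpanAttributes['SampleRate'])), 1))`.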

**Integration** (propagate sampleWeightExpression to all chart configs):
- ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions
- DBDashboardPage (raw SQL + builder branches)
- AlertPreviewChart
- SessionSubpanel
- ServiceDashboardEndpointPerformanceChart
- ServiceDashboardSlowestEventsTile (p95 query + events table)
- ServiceDashboardEndpointSidePanel (error rate + throughput)
- ServiceDashboardDbQuerySidePanel (total query time + throughput)
- External API v2 charts, AI controller, alerts (index + template)

**UI**:
- Add Sample Rate Expression field to trace source admin form

2026-03-30 19:52:18 +00:00
Tom Alexander
75ff28dd68
chore: Use local clickhouse instance for playwright tests (#1711)
TLDR: This PR changes playwright full-stack tests to run against a local clickhouse instance (with seeded data) instead of relying on the clickhouse demo server, which can be unpredictable at times. This workflow allows us to fully control the data to make tests more predictable.

This PR: 
* Adds local CH instance to the e2e dockerfile
* Adds a schema creation script
* Adds a data seeding script
* Updates playwright config 
* Updates various tests to change hardcoded fields, metrics, or areas relying on play demo data
* Updates github workflow to use the dockerfile instead of separate services
* Runs against a local clickhouse instead of the demo server

Fixes: HDX-3193
2026-02-13 15:43:12 +00:00
Warren Lee
6f4c8efba0
feat: Enforce ClickStack schemas by default (#1682)
- Introduce a new flag `HYPERDX_OTEL_EXPORTER_CREATE_LEGACY_SCHEMA` (default to false) to otel collector
- Custom ClickStack schemas should be enforced by default
- ClickHouse tables migration logs should be stored in `clickstack_db_version_xxx` tables
- The collector will run the migration at startup and retry if it fails to connect to the database (using exponential backoff).
- Fully backward compatible
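The startup retry described above can be sketched like this (a hedged illustration only; the function name, attempt count, and delays are assumptions, not the collector's actual code):

```typescript
// Retry an async operation (e.g. the migration's database connection)
// with exponential backoff: delays double on each failed attempt.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Backoff schedule: baseDelayMs, 2x, 4x, 8x, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  // All attempts exhausted: surface the last failure.
  throw lastErr;
}
```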

Ref: HDX-3301
2026-02-02 16:39:20 +00:00
Dan Hable
d07e30d5fb
feat: associate logged in user to clickhouse query (#1636)
Allows setting a custom setting prefix on a connection. When the prefix is configured in both HyperDX and the ClickHouse server settings, the HyperDX app will attach a custom setting to each query. These are recorded in the query log and can be used to identify which user issued each query.

## Testing

The commit also updates the local dev ClickHouse instance to support a custom setting prefix of `hyperdx`. After running `make dev-up`, you should be able to edit the connection and set the prefix to `hyperdx`.

<img width="955" height="197" alt="Screenshot 2026-01-21 at 1 23 14 PM" src="https://github.com/user-attachments/assets/607fc945-d93f-4976-9862-3118b420c077" />

After saving, just allow the app to live tail a source like logs. If you connect to the ClickHouse database, you should then be able to run

```sql
SELECT query, Settings
FROM system.query_log
WHERE has(mapKeys(Settings), 'hyperdx_user')
FORMAT Vertical
```

and then see a bunch of queries with the user set to your logged in user.

```
Row 46:
───────
query:    SELECT Timestamp, ServiceName, SeverityText, Body, TimestampTime FROM default.otel_logs WHERE (TimestampTime >= fromUnixTimestamp64Milli(_CAST(1769022372269, 'Int64'))) AND (TimestampTime <= fromUnixTimestamp64Milli(_CAST(1769023272269, 'Int64'))) ORDER BY (TimestampTime, Timestamp) DESC LIMIT _CAST(0, 'Int32'), _CAST(200, 'Int32') FORMAT JSONCompactEachRowWithNamesAndTypes
Settings: {'use_uncompressed_cache':'0','load_balancing':'in_order','log_queries':'1','max_memory_usage':'10000000000','cancel_http_readonly_queries_on_client_close':'1','parallel_replicas_for_cluster_engines':'0','date_time_output_format':'iso','hyperdx_user':'\'dan@hyperdx.io\''}
```
2026-01-28 14:58:05 +00:00
Dale McDiarmid
66f56cb1d0
chore: Move schema configs to file (#1635)
Co-authored-by: Tom Alexander <tom.alexander@clickhouse.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-28 14:52:54 +01:00
Mike Shi
52ca1823a4
feat: Add ClickHouse JSON Type Support (#969)
- Upgrades ClickHouse to 25.6, fixes breaking config change, needed for latest JSON type
- Upgrades OTel Collector to 0.129.1, fixes breaking config change, needed for latest JSON support in exporter
- Upgrades OTel OpAMP Supervisor to 0.128.0
- Fixes features to support JSON type columns in OTel in HyperDX (filtering, searching, graphing, opening rows, etc.)

Requires users to set `BETA_CH_OTEL_JSON_SCHEMA_ENABLED=true` in `ch-server` and `OTEL_AGENT_FEATURE_GATE_ARG='--feature-gates=clickhouse.json'` in `otel-collector` to enable JSON schema. Users must start a new ClickHouse DB or migrate their own table manually to enable as it is not schema compatible and migration is not automatic.

Closes HDX-1849, HDX-1969, HDX-1966, HDX-1964

Co-authored-by: Tom Alexander <3245235+teeohhem@users.noreply.github.com>
2025-07-03 17:11:03 +00:00
Warren
31e22dcff4
feat: introduce clickhouse db init script (#843)
Ref: HDX-1777

This shouldn't have any impact on users
2025-06-09 16:45:23 +00:00
Warren
5b2cba019e
feat: scrape local otelcol + clickhouse metrics (#633)
<img width="1329" alt="Screenshot 2025-02-25 at 5 26 06 PM" src="https://github.com/user-attachments/assets/ae54c3de-3e4c-4452-84ef-dda05d23c39e" />


<img width="1321" alt="Screenshot 2025-02-25 at 5 28 06 PM" src="https://github.com/user-attachments/assets/b3eab865-d6da-44da-a2fe-79a3797790f9" />
2025-02-26 03:13:05 +00:00
Warren
7a766f7977
style: remove aggregator related codes (#521) 2024-12-09 09:59:36 -08:00
Warren
aa165fcc46 feat: move more codes 2024-11-21 21:44:33 -08:00
Warren
b16456fc39 feat: move v2 codes 2024-11-12 05:53:15 -07:00
Warren
593c4ca758
refactor: set output datetime format on the client side (#42) 2023-09-24 21:19:45 -07:00
Warren
0826d4dd89 first commit 2023-09-12 20:08:05 -07:00