Commit graph

13 commits

Author SHA1 Message Date
Vineet Ahirkar
941d045077
feat: support sample-weighted aggregations for sampled trace data (#1963)
## Problem

High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a `SampleRate` attribute recording N.

Once data is sampled, naive aggregations are wrong: count() undercounts the actual event volume by roughly a factor of N, sum() and avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable.
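The bias is easy to see on toy data. A minimal sketch (the span shape and values here are hypothetical, not the real schema):

```typescript
// Hypothetical sampled spans: each kept span records the SampleRate N
// under which it survived 1-in-N sampling, so it stands for ~N real spans.
type Span = { durationMs: number; sampleRate: number };

const spans: Span[] = [
  { durationMs: 10, sampleRate: 10 }, // represents ~10 real spans
  { durationMs: 20, sampleRate: 10 }, // represents ~10 real spans
  { durationMs: 500, sampleRate: 1 }, // kept unsampled (e.g. an error span)
];

// Naive count sees only the stored rows; the weighted count re-inflates
// each span by the population it represents: 10 + 10 + 1 = 21.
const naiveCount = spans.length;
const weightedCount = spans.reduce((acc, s) => acc + s.sampleRate, 0);

// The naive average is dominated by the rare unsampled outlier; the
// weighted average restores each span's true share of the traffic.
const naiveAvg = spans.reduce((a, s) => a + s.durationMs, 0) / naiveCount;
const weightedAvg =
  spans.reduce((a, s) => a + s.durationMs * s.sampleRate, 0) / weightedCount;
```

Here the naive count reports 3 events when roughly 21 occurred, and the naive average (~177ms) is pulled far above the weighted average (~38ms) by the single unsampled outlier.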

### Why Materialized Views Cannot Solve This Alone

A materialized view that pre-aggregates sampled spans is a useful performance optimization for known dashboard queries, but it cannot replace a sampling-aware query engine.

**Fixed dimensions.** A materialized view pre-aggregates by a fixed set of GROUP BY keys (e.g. `ServiceName`, `SpanName`, `StatusCode`, `TimestampBucket`). Trace exploration requires slicing by arbitrary span attributes -- `http.target`, `k8s.pod.name`, custom business tags -- in combinations that cannot be predicted at view creation time. Grouping by a different dimension requires either going back to the raw table or maintaining a separate materialized view for every possible dimension combination. Working around this by adding high-cardinality span attributes to the GROUP BY pushes the materialized table toward a 1:1 row ratio with the raw table: you end up doubling storage without meaningful compression.

**Fixed aggregation fields.** A typical MV only aggregates a single numeric column like `Duration`. Users want weighted aggregations over any numeric attribute: request body sizes, queue depths, retry counts, custom metrics attached to spans. Each new field requires adding more `AggregateFunction` columns and recreating the view.

**Industry precedent.** Platforms that rely solely on pre-aggregation (Datadog, Splunk, New Relic, Elastic) get accurate RED dashboards but cannot correct ad-hoc queries over sampled span data. Only query-engine weighting (Honeycomb) produces correct results for arbitrary ad-hoc queries, including weighted percentiles and heatmaps.

A better solution is to make the query engine itself sampling-aware, so that every query, whether issued from dashboards, alerts, or ad-hoc searches, automatically weights by `SampleRate` regardless of which dimensions or fields the user picks. Materialized views remain a useful complement for accelerating known, fixed-dimension dashboard panels, but they are not a substitute for correct query-time weighting.

## Summary

TraceSourceSchema gets a new optional field `sampleRateExpression` - the ClickHouse expression that evaluates to the per-span sample rate (e.g. `SpanAttributes['SampleRate']`). When not configured, all queries are unchanged.
When set, the query builder rewrites SQL aggregations to weight each span by its sample rate:

| aggFn          | Before                 | After (sample-corrected)                            | Overhead |
| :------------- | :--------------------- | :-------------------------------------------------- | :------- |
| count          | `count()`              | `sum(weight)`                                       | ~1x      |
| count + cond   | `countIf(cond)`        | `sumIf(weight, cond)`                               | ~1x      |
| avg            | `avg(col)`             | `sum(col * weight) / sum(weight)`                   | ~2x      |
| sum            | `sum(col)`             | `sum(col * weight)`                                 | ~1x      |
| quantile(p)    | `quantile(p)(col)`     | `quantileTDigestWeighted(p)(col, toUInt32(weight))` | ~1.5x    |
| min/max        | unchanged              | unchanged                                           | 1x       |
| count_distinct | unchanged              | unchanged (cannot correct)                          | 1x       |

**Types**:
- Add sampleRateExpression to TraceSourceSchema + Mongoose model
- Add sampleWeightExpression to ChartConfig schema

**Query builder:**
- sampleWeightExpression is wrapped as `greatest(toUInt64OrZero(toString(expr)), 1)`, so spans without a SampleRate attribute default to weight 1 (unsampled data produces results identical to the original queries)
- Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression is set, with safe default-to-1 wrapping
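The two pieces above can be sketched as follows (simplified; the real implementation in renderChartConfig.ts covers more aggregation functions and edge cases, and the function names here are illustrative):

```typescript
// Wrap the configured sampleRateExpression so that missing or malformed
// SampleRate values fall back to weight 1 (i.e. behave as unsampled data).
const wrapWeight = (expr: string): string =>
  `greatest(toUInt64OrZero(toString(${expr})), 1)`;

// Rewrite a subset of aggregations to their sample-corrected forms,
// mirroring the table above.
function rewriteAggFn(
  aggFn: 'count' | 'sum' | 'avg',
  col: string,
  weightExpr: string,
): string {
  const w = wrapWeight(weightExpr);
  switch (aggFn) {
    case 'count':
      // count() becomes the sum of per-span weights.
      return `sum(${w})`;
    case 'sum':
      return `sum(${col} * ${w})`;
    case 'avg':
      // Weighted mean: weighted sum divided by total weight.
      return `sum(${col} * ${w}) / sum(${w})`;
  }
}
```

For example, a `count` over a source configured with `SpanAttributes['SampleRate']` would be rewritten to `sum(greatest(toUInt64OrZero(toString(SpanAttributes['SampleRate'])), 1))`.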

**Integration** (propagate sampleWeightExpression to all chart configs):
- ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions
- DBDashboardPage (raw SQL + builder branches)
- AlertPreviewChart
- SessionSubpanel
- ServiceDashboardEndpointPerformanceChart
- ServiceDashboardSlowestEventsTile (p95 query + events table)
- ServiceDashboardEndpointSidePanel (error rate + throughput)
- ServiceDashboardDbQuerySidePanel (total query time + throughput)
- External API v2 charts, AI controller, alerts (index + template)

**UI**:
- Add Sample Rate Expression field to trace source admin form

2026-03-30 19:52:18 +00:00
Tom Alexander
75ff28dd68
chore: Use local clickhouse instance for playwright tests (#1711)
TLDR: This PR changes playwright full-stack tests to run against a local clickhouse instance (with seeded data) instead of relying on the clickhouse demo server, which can be unpredictable at times. This workflow allows us to fully control the data to make tests more predictable.

This PR: 
* Adds local CH instance to the e2e dockerfile
* Adds a schema creation script
* Adds a data seeding script
* Updates playwright config 
* Updates various tests to change hardcoded fields, metrics, or areas relying on play demo data
* Updates github workflow to use the dockerfile instead of separate services
* Runs against a local clickhouse instead of the demo server

Fixes: HDX-3193
2026-02-13 15:43:12 +00:00
Warren Lee
6f4c8efba0
feat: Enforce ClickStack schemas by default (#1682)
- Introduce a new flag `HYPERDX_OTEL_EXPORTER_CREATE_LEGACY_SCHEMA` (default to false) to otel collector
- Custom ClickStack schemas should be enforced by default
- ClickHouse tables migration logs should be stored in `clickstack_db_version_xxx` tables
- The collector will run the migration at startup and retry if it fails to connect to the database (using exponential backoff).
- Fully backward compatible
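The startup retry described above can be sketched like this (a hedged illustration only; the function name, attempt count, and delays are assumptions, not the collector's actual code):

```typescript
// Retry an async operation (e.g. the migration's database connection)
// with exponential backoff: delays double on each failed attempt.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Backoff schedule: baseDelayMs, 2x, 4x, 8x, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  // All attempts exhausted: surface the last failure.
  throw lastErr;
}
```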

Ref: HDX-3301
2026-02-02 16:39:20 +00:00
Dan Hable
d07e30d5fb
feat: associate logged in user to clickhouse query (#1636)
Allows setting a custom setting prefix on a connection. When the prefix is configured in both HyperDX and the ClickHouse server settings, the HyperDX app will attach a custom setting to each query. These are recorded in the query log and can be used to identify which user issued each query.

## Testing

The commit also updates the local dev ClickHouse instance to support a custom setting prefix of `hyperdx`. After running `make dev-up`, you should be able to edit the connection and set the prefix to `hyperdx`.

<img width="955" height="197" alt="Screenshot 2026-01-21 at 1 23 14 PM" src="https://github.com/user-attachments/assets/607fc945-d93f-4976-9862-3118b420c077" />

After saving, just allow the app to live tail a source like logs. If you connect to the ClickHouse database, you should then be able to run

```sql
SELECT query, Settings
FROM system.query_log
WHERE has(mapKeys(Settings), 'hyperdx_user')
FORMAT Vertical
```

and then see a bunch of queries with the user set to your logged in user.

```
Row 46:
───────
query:    SELECT Timestamp, ServiceName, SeverityText, Body, TimestampTime FROM default.otel_logs WHERE (TimestampTime >= fromUnixTimestamp64Milli(_CAST(1769022372269, 'Int64'))) AND (TimestampTime <= fromUnixTimestamp64Milli(_CAST(1769023272269, 'Int64'))) ORDER BY (TimestampTime, Timestamp) DESC LIMIT _CAST(0, 'Int32'), _CAST(200, 'Int32') FORMAT JSONCompactEachRowWithNamesAndTypes
Settings: {'use_uncompressed_cache':'0','load_balancing':'in_order','log_queries':'1','max_memory_usage':'10000000000','cancel_http_readonly_queries_on_client_close':'1','parallel_replicas_for_cluster_engines':'0','date_time_output_format':'iso','hyperdx_user':'\'dan@hyperdx.io\''}
```
2026-01-28 14:58:05 +00:00
Dale McDiarmid
66f56cb1d0
chore: Move schema configs to file (#1635)
Co-authored-by: Tom Alexander <tom.alexander@clickhouse.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-01-28 14:52:54 +01:00
Mike Shi
52ca1823a4
feat: Add ClickHouse JSON Type Support (#969)
- Upgrades ClickHouse to 25.6, fixes breaking config change, needed for latest JSON type
- Upgrades OTel Collector to 0.129.1, fixes breaking config change, needed for latest JSON support in exporter
- Upgrades OTel OpAMP Supervisor to 0.128.0
- Fixes features to support JSON type columns in OTel in HyperDX (filtering, searching, graphing, opening rows, etc.)

Requires users to set `BETA_CH_OTEL_JSON_SCHEMA_ENABLED=true` in `ch-server` and `OTEL_AGENT_FEATURE_GATE_ARG='--feature-gates=clickhouse.json'` in `otel-collector` to enable JSON schema. Users must start a new ClickHouse DB or migrate their own table manually to enable as it is not schema compatible and migration is not automatic.

Closes HDX-1849, HDX-1969, HDX-1966, HDX-1964

Co-authored-by: Tom Alexander <3245235+teeohhem@users.noreply.github.com>
2025-07-03 17:11:03 +00:00
Warren
31e22dcff4
feat: introduce clickhouse db init script (#843)
Ref: HDX-1777

This shouldn't have any impact on users
2025-06-09 16:45:23 +00:00
Warren
5b2cba019e
feat: scrape local otelcol + clickhouse metrics (#633)
<img width="1329" alt="Screenshot 2025-02-25 at 5 26 06 PM" src="https://github.com/user-attachments/assets/ae54c3de-3e4c-4452-84ef-dda05d23c39e" />


<img width="1321" alt="Screenshot 2025-02-25 at 5 28 06 PM" src="https://github.com/user-attachments/assets/b3eab865-d6da-44da-a2fe-79a3797790f9" />
2025-02-26 03:13:05 +00:00
Warren
7a766f7977
style: remove aggregator related codes (#521) 2024-12-09 09:59:36 -08:00
Warren
aa165fcc46 feat: move more codes 2024-11-21 21:44:33 -08:00
Warren
b16456fc39 feat: move v2 codes 2024-11-12 05:53:15 -07:00
Warren
593c4ca758
refactor: set output datetime format on the client side (#42) 2023-09-24 21:19:45 -07:00
Warren
0826d4dd89 first commit 2023-09-12 20:08:05 -07:00