hyperdx

mirror of https://github.com/hyperdxio/hyperdx synced 2026-04-21 13:37:15 +00:00

History

Vineet Ahirkar 941d045077 feat: support sample-weighted aggregations for sampled trace data (#1963 ) ## Problem High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a `SampleRate` attribute recording N. Once data is sampled, naive aggregations are wrong: count() returns N-x fewer events than actually occurred, sum()/avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable. ### Why Materialized Views Cannot Solve This Alone A materialized view that pre-aggregates sampled spans is a useful performance optimization for known dashboard queries, but it cannot replace a sampling-aware query engine. Fixed dimensions. A materialized view pre-aggregates by a fixed set of GROUP BY keys (e.g. `ServiceName`, `SpanName`, `StatusCode`, `TimestampBucket`). Trace exploration requires slicing by arbitrary span attributes -- `http.target`, `k8s.pod.name`, custom business tags -- in combinations that cannot be predicted at view creation time. Grouping by a different dimension either requires going back to raw table or a separate materialized views for every possible dimension combination. If you try to work around the fixed-dimensions problem by adding high-cardinality span attributes to the GROUP BY, the materialized table approaches a 1:1 row ratio with the raw table. You end up doubling storage without meaningful compression. Fixed aggregation fields. A typical MV only aggregates a single numeric column like `Duration`. Users want weighted aggregations over any numeric attribute: request body sizes, queue depths, retry counts, custom metrics attached to spans. Each new field requires adding more `AggregateFunction` columns and recreating the view. Industry precedent. Platforms that rely solely on pre-aggregation (Datadog, Splunk, New Relic, Elastic) get accurate RED dashboards but cannot correct ad-hoc queries over sampled span data. Only query-engine weighting (Honeycomb) produces correct results for arbitrary ad-hoc queries, including weighted percentiles and heatmaps. A better solution is making the query engine itself sampling-aware, so that all queries from dashboards, alerts, an ad-hoc searches, automatically weights by `SampleRate` regardless of which dimensions or fields the user picks. Materialized views remain a useful complement for accelerating known, fixed-dimension dashboard panels, but they are not a substitute for correct query-time weighting. ## Summary TraceSourceSchema gets a new optional field `sampleRateExpression` - the ClickHouse expression that evaluates to the per-span sample rate (e.g. `SpanAttributes['SampleRate']`). When not configured, all queries are unchanged. When set, the query builder rewrites SQL aggregations to weight each span by its sample rate: aggFn \| Before \| After (sample-corrected) \| Overhead -------------- \| ---------------------- \| --------------------------------------------------- \| -------- count \| count() \| sum(weight) \| ~1x count + cond \| countIf(cond) \| sumIf(weight, cond) \| ~1x avg \| avg(col) \| sum(col * weight) / sum(weight) \| ~2x sum \| sum(col) \| sum(col * weight) \| ~1x quantile(p) \| quantile(p)(col) \| quantileTDigestWeighted(p)(col, toUInt32(weight)) \| ~1.5x min/max \| unchanged \| unchanged \| 1x count_distinct \| unchanged \| unchanged (cannot correct) \| 1x Types: - Add sampleRateExpression to TraceSourceSchema + Mongoose model - Add sampleWeightExpression to ChartConfig schema Query builder: - sampleWeightExpression is wrapped as greatest(toUInt64OrZero(toString(expr)), 1) so spans without a SampleRate attribute default to weight 1 (unsampled data produces identical results to the original queries). - Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression is set, with safe default-to-1 wrapping Integration (propagate sampleWeightExpression to all chart configs): - ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions - DBDashboardPage (raw SQL + builder branches) - AlertPreviewChart - SessionSubpanel - ServiceDashboardEndpointPerformanceChart - ServiceDashboardSlowestEventsTile (p95 query + events table) - ServiceDashboardEndpointSidePanel (error rate + throughput) - ServiceDashboardDbQuerySidePanel (total query time + throughput) - External API v2 charts, AI controller, alerts (index + template) UI: - Add Sample Rate Expression field to trace source admin form ### Screenshots or video \| Before \| After \| \| :----- \| :---- \| \| \| \| ### How to test locally or on Vercel 1. 2. 3. ### References - Linear Issue: - Related PRs:	2026-03-30 19:52:18 +00:00
..
ch	feat: support sample-weighted aggregations for sampled trace data (#1963 )	2026-03-30 19:52:18 +00:00
mongo	chore: Update to next 16, react 19, add react compiler (#1434 )	2025-12-04 23:40:59 +00:00

feat: support sample-weighted aggregations for sampled trace data (#1963 )

## Problem

High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a `SampleRate` attribute recording N.

Once data is sampled, naive aggregations are wrong: count() returns N-x fewer events than actually occurred, sum()/avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable.

### Why Materialized Views Cannot Solve This Alone

A materialized view that pre-aggregates sampled spans is a useful performance optimization for known dashboard queries, but it  cannot replace a sampling-aware query engine.

**Fixed dimensions.** A materialized view pre-aggregates by a fixed set of GROUP BY keys (e.g. `ServiceName`, `SpanName`, `StatusCode`, `TimestampBucket`). Trace exploration requires slicing by arbitrary span attributes -- `http.target`, `k8s.pod.name`, custom business tags -- in combinations that cannot be predicted at view creation time. Grouping by a different dimension either requires going back to raw table or a separate materialized views for every possible dimension combination. If you try to work around the fixed-dimensions problem by adding high-cardinality span attributes to the GROUP BY, the materialized table approaches a 1:1 row ratio with the raw table. You end up doubling storage without meaningful compression.

**Fixed aggregation fields.** A typical MV only aggregates a single numeric column like `Duration`. Users want weighted aggregations over any numeric attribute: request body sizes, queue depths, retry counts, custom metrics attached to spans. Each new field requires adding more `AggregateFunction` columns and recreating the view.

**Industry precedent.** Platforms that rely solely on pre-aggregation (Datadog, Splunk, New Relic, Elastic) get accurate RED dashboards but cannot correct ad-hoc queries over sampled span data. Only query-engine weighting (Honeycomb) produces correct results for arbitrary ad-hoc queries, including weighted percentiles and heatmaps.

A better solution is making the query engine itself sampling-aware, so that all queries from dashboards, alerts, an ad-hoc searches, automatically weights by `SampleRate` regardless of which dimensions or fields the user picks. Materialized views remain a useful complement for accelerating known, fixed-dimension dashboard panels, but they are not a substitute for correct query-time weighting.

## Summary

TraceSourceSchema gets a new optional field `sampleRateExpression` - the ClickHouse expression that evaluates to the per-span sample rate (e.g. `SpanAttributes['SampleRate']`). When not configured, all queries are unchanged.
When set, the query builder rewrites SQL aggregations to weight each span by its sample rate:

  aggFn          | Before                 | After (sample-corrected)                            | Overhead
  -------------- | ---------------------- | --------------------------------------------------- | --------
  count          | count()                | sum(weight)                                         | ~1x
  count + cond   | countIf(cond)          | sumIf(weight, cond)                                 | ~1x
  avg            | avg(col)               | sum(col * weight) / sum(weight)                     | ~2x
  sum            | sum(col)               | sum(col * weight)                                   | ~1x
  quantile(p)    | quantile(p)(col)       | quantileTDigestWeighted(p)(col, toUInt32(weight))   | ~1.5x
  min/max        | unchanged              | unchanged                                           | 1x
  count_distinct | unchanged              | unchanged (cannot correct)                          | 1x

**Types**:
- Add sampleRateExpression to TraceSourceSchema + Mongoose model
- Add sampleWeightExpression to ChartConfig schema

**Query builder:**
- sampleWeightExpression is wrapped as greatest(toUInt64OrZero(toString(expr)), 1) so
spans without a SampleRate attribute default to weight 1 (unsampled
data produces identical results to the original queries).
- Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression
  is set, with safe default-to-1 wrapping

**Integration** (propagate sampleWeightExpression to all chart configs):
- ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions
- DBDashboardPage (raw SQL + builder branches)
- AlertPreviewChart
- SessionSubpanel
- ServiceDashboardEndpointPerformanceChart
- ServiceDashboardSlowestEventsTile (p95 query + events table)
- ServiceDashboardEndpointSidePanel (error rate + throughput)
- ServiceDashboardDbQuerySidePanel (total query time + throughput)
- External API v2 charts, AI controller, alerts (index + template)

**UI**:
- Add Sample Rate Expression field to trace source admin form

### Screenshots or video



| Before | After |
| :----- | :---- |
|        |       |

### How to test locally or on Vercel



1.
2.
3.

### References



- Linear Issue:
- Related PRs:

2026-03-30 19:52:18 +00:00

feat: support sample-weighted aggregations for sampled trace data (#1963 )

2026-03-30 19:52:18 +00:00

mongo

chore: Update to next 16, react 19, add react compiler (#1434 )

2025-12-04 23:40:59 +00:00