mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

Reindex Work - Perf , Metrics , Benchmarking and More (#26231 )

* Update Perf

* Add multi asset scale count

* Update perf and Usage

* Fix recommendation

* Add Benchmarking script and doc

* Fix Perf

* Add --no break to benchmark

* add more metrics and validation for indexes miss

* Update generated TypeScript types

* Bound Doc Virtual Threads

* Remove Additional Properties from the UI

* Update doc

* Fix Job Getting Marked Stopped

* Server killed logs fixes

* Add Server stat to Quartz Progress

* Fix CPU spiking

* Make Auto Tune Consider JVm configs

* Fix Partition Calculator and Recovery Job Stats

* Update Auto Tune to show up in logs and stored in config

* Fix Auto Tune Config not store in app run record

* Fix OnDemand Job type

* Indexing Failures not flushed fixed

* Fix Stat counting at job level with process job failures

* Add Reindex Job Identifier

* Add Thread Identifiers

* Wait for sink

* Wait for sink

* Fix Stopping to let partitions finish the job

* CPU Budgeting

* More Conservative settings

* Address Review Comment

* fix Open Search Index Manager

* Reapply OpenSearch BulkSink

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

2026-03-10 08:10:46 +05:30

12 KiB

Raw Blame History

Server-Side Diagnostics & Load Test Correlation

The diagnostics endpoint (GET /api/v1/system/diagnostics) provides a single-call performance snapshot of the OpenMetadata server. Combined with the load test script, it enables pinpointing where time is spent during high-load scenarios and produces actionable tuning recommendations.

The Diagnostics Endpoint

Basic Usage

curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:8585/api/v1/system/diagnostics | python3 -m json.tool

Response Structure

{
  "timestamp": "2026-03-02T19:00:00Z",
  "jvm": { ... },
  "jetty": { ... },
  "database": { ... },
  "database_queries": { ... },
  "bulk_executor": { ... },
  "request_latency": { ... }
}

Each section is explained below.

Understanding Each Section

JVM

Field	What It Tells You
`heap_used_bytes` / `heap_max_bytes`	Current heap consumption vs maximum.
`heap_usage_pct`	If >85% after load, GC pressure is likely adding tail latency.
`gc_pause_total_ms`	Cumulative GC pause time since JVM start. Compare before/after load to see how much GC occurred during the test.
`gc_count`	Total GC collections. A large delta during load means frequent stop-the-world pauses.
`thread_count` / `thread_peak`	Active JVM threads. Correlate with Jetty thread pool.
`cpu_process_pct`	Process CPU utilization (0-100). If pinned at 100%, the server is CPU-bound.
`uptime_seconds`	Useful to confirm the server wasn't restarted mid-test.

Jetty (Thread Pool)

Field	What It Tells You
`threads_busy` / `threads_max`	How many request-handling threads are in use.
`utilization_pct`	Key metric. If >90% with `queue_size > 0`, the thread pool is saturated and requests are queuing.
`queue_size`	Requests waiting for a free thread. Non-zero means latency is being added by queuing.
`queue_time_avg_ms`	Average time a request waits in the queue before getting a thread.
`virtual_threads_enabled`	Whether Java 21 virtual threads are active (eliminates thread pool as a bottleneck).

Database (HikariCP Pool)

Field	What It Tells You
`pool_active` / `pool_max`	Active DB connections vs maximum pool size.
`pool_usage_pct`	Key metric. If >80%, connection contention is likely. Requests wait for a free connection.
`pool_pending`	Threads waiting for a DB connection. If >0 during load, the pool is undersized.
`pool_idle`	Spare connections. If 0 during load, the pool is fully utilized.
`connection_acquire_avg_ms`	Average time to acquire a connection from the pool. High values (>50ms) indicate pool contention.
`connection_acquire_max_ms`	Maximum connection acquire time observed.
`connection_acquire_count`	Total number of connection acquire operations.

Database Queries (Per-Type Profiling)

Breaks down DB query timing by operation type (select, insert, update, delete):

Field	What It Tells You
`total_operations`	Sum of all DB operations across all types.
`{type}.count`	Number of queries of this type.
`{type}.mean_ms`	Average query duration.
`{type}.max_ms`	Maximum query duration. Spikes indicate lock contention or full table scans.
`{type}.p95_ms`	95th percentile query duration. If >100ms, investigate slow queries.
`{type}.total_ms`	Total cumulative time spent in queries of this type.

Reading the profile: If select.p95_ms is 200ms while insert.p95_ms is only 20ms, read queries are the bottleneck. This often indicates missing indexes or N+1 query patterns.

Bulk Executor

Field	What It Tells You
`queue_depth` / `queue_capacity`	Items in the async processing queue.
`queue_usage_pct`	If >70%, approaching the rejection threshold (HTTP 503 errors).
`active_threads` / `max_threads`	Worker threads actively processing bulk operations.
`has_capacity`	`false` means the next bulk submission will be rejected with 503.

Request Latency (Per-Endpoint Breakdown)

This is the most actionable section. For each METHOD /endpoint combination:

Field	What It Tells You
`count`	Total requests processed for this endpoint.
`avg_total_ms`	Average end-to-end latency.
`avg_db_ms` / `db_pct`	Time spent in database queries and its percentage of total.
`avg_search_ms` / `search_pct`	Time spent in search/Elasticsearch operations.
`avg_internal_ms` / `internal_pct`	Time in Java code (serialization, validation, business logic).
`avg_db_ops` / `avg_search_ops`	Average number of DB/search round-trips per request.

Reading the breakdown: If PUT /v1/tables shows db_pct: 56%, then 56% of the request time is spent waiting for database queries. Combined with database.pool_usage_pct: 85%, this tells you the DB connection pool is the bottleneck.

Load Test Integration

The load test script automatically queries the diagnostics endpoint at three points:

Before load — baseline snapshot
During load — sampled every 10 seconds by the health monitor
After load — final snapshot for comparison

Running a Load Test with Diagnostics

# Basic: diagnostics are collected automatically
./perf-test.sh --scale small --server http://localhost:8585 --admin-port 8586

# With explicit token
./perf-test.sh --scale medium --server http://localhost:8585 \
  --admin-port 8586 --token "$MY_TOKEN" --output /tmp/bench.json

The --admin-port flag enables both Prometheus scraping and diagnostics collection. Diagnostics work without it too (they use the main API port).

Console Output

After the benchmark table, you'll see a SERVER-SIDE BREAKDOWN section:

SERVER-SIDE BREAKDOWN (from /api/v1/system/diagnostics):
  JVM:  heap 1.2GB/2GB (60%), GC pauses +450ms during load
  Jetty: 142/150 threads busy (95%), queue depth: 23
  DB Pool: 85/100 active (85%), 12 pending connections
  Bulk Executor: queue 450/1000 (45%)

  Latency Breakdown (PUT endpoints):
    Endpoint                          Total    DB%   Search%   Internal%
    /v1/tables                        320ms   56.2%     14.1%      29.7%
    /v1/topics                        180ms   48.0%     22.0%      30.0%
    /v1/dashboards                    250ms   52.0%     18.0%      30.0%

  BOTTLENECK: DB bottleneck on PUT /v1/tables: 56.2% of request time in DB, pool at 85.0% utilization

JSON Report

The report includes top-level diagnostics_before and diagnostics_after objects, plus cluster_sizing.server_side_analysis:

cat /tmp/bench.json | python3 -c "
import json, sys
r = json.load(sys.stdin)

# Check if diagnostics were available
diag = r.get('diagnostics_after', {})
if diag:
    jvm = diag['jvm']
    print(f'Heap: {jvm[\"heap_usage_pct\"]}%')
    print(f'GC pauses: {jvm[\"gc_pause_total_ms\"]}ms')

    jetty = diag['jetty']
    print(f'Jetty: {jetty[\"threads_busy\"]}/{jetty[\"threads_max\"]} ({jetty[\"utilization_pct\"]}%)')

    db = diag['database']
    print(f'DB pool: {db[\"pool_active\"]}/{db[\"pool_max\"]} ({db[\"pool_usage_pct\"]}%)')

    for ep, data in diag.get('request_latency', {}).items():
        print(f'{ep}: total={data[\"avg_total_ms\"]}ms '
              f'DB={data[\"db_pct\"]}% Search={data[\"search_pct\"]}% '
              f'Internal={data[\"internal_pct\"]}%')
else:
    print('Diagnostics not available (server may be older version)')
"

Bottleneck Detection Rules

The load test applies these rules automatically and surfaces them in findings:

Condition	Diagnosis	Recommended Fix
`db_pct > 60%` AND `pool_usage_pct > 80%`	DB is the bottleneck	`export DB_CONNECTION_POOL_MAX_SIZE=150`
`jetty.utilization_pct > 90%` AND `queue_size > 0`	Thread pool saturated	`export SERVER_MAX_THREADS=300` or enable virtual threads
`search_pct > 30%` for any endpoint	Search indexing consuming latency	`export ELASTICSEARCH_MAX_CONN_TOTAL=50`
`bulk_executor.queue_usage_pct > 70%`	Near bulk rejection threshold	`export BULK_OPERATION_QUEUE_SIZE=2000`
`jvm.heap_usage_pct > 85%` after load	Memory pressure / GC tail latency	Increase JVM heap (`-Xmx`)
`database_queries.{type}.p95_ms > 100ms`	Slow DB queries of that type	Add indexes, optimize queries
`connection_acquire_avg_ms > 50ms`	Connection pool contention	`export DB_CONNECTION_POOL_MAX_SIZE=150`

Common Scenarios

Scenario 1: High Latency, DB is the Bottleneck

Symptoms: p95 latency >2s, db_pct >60%, pool_usage_pct >80%.

  Latency Breakdown:
    /v1/tables   320ms   DB=62%   Search=12%   Internal=26%
  DB Pool: 95/100 active (95%), 8 pending

What's happening: Every PUT requires multiple DB round-trips. At 95% pool utilization with 8 pending connections, requests are waiting for a free connection.

Fix:

export DB_CONNECTION_POOL_MAX_SIZE=150
export DB_CONNECTION_TIMEOUT=10000   # Fail fast instead of waiting 30s

Scenario 2: Thread Pool Exhaustion

Symptoms: Connection refused errors, utilization_pct >95%, queue_size growing.

  Jetty: 148/150 threads busy (99%), queue depth: 45

What's happening: All Jetty threads are busy. New requests queue up, adding latency. If the queue fills, connections get refused.

Fix:

export SERVER_MAX_THREADS=300
# OR enable virtual threads (preferred for I/O-bound workloads):
export SERVER_ENABLE_VIRTUAL_THREAD=true

Scenario 3: GC Pressure

Symptoms: Periodic latency spikes, heap_usage_pct >85%, large GC pause delta.

  JVM: heap 1.8GB/2GB (90%), GC pauses +2300ms during load

What's happening: The JVM is spending significant time in garbage collection. This manifests as periodic latency spikes and throughput drops.

Fix:

# Increase heap
export OPENMETADATA_HEAP_OPTS="-Xmx4g -Xms4g"

Scenario 4: Bulk Executor Queue Filling

Symptoms: HTTP 503 errors on PUT endpoints.

  Bulk Executor: queue 980/1000 (98%)
  has_capacity: false

What's happening: The async processing queue is full. New requests that need bulk processing are rejected with 503.

Fix:

export BULK_OPERATION_QUEUE_SIZE=2000
export BULK_OPERATION_MAX_THREADS=20

Scenario 5: Slow DB Queries

Symptoms: High p95 latency, database_queries.select.p95_ms >100ms, db_pct >60%.

  DB Query Profile (total operations: 125,000):
    Type         Count       Mean        Max
    select      80,000     12.3ms    450.0ms
    insert      40,000     25.1ms    890.0ms
  Connection acquire avg: 85.2ms

What's happening: Individual DB queries are slow (p95 >100ms) and connection acquire time is elevated, indicating both query performance issues and pool contention.

Fix:

# Increase pool size to reduce acquire contention
export DB_CONNECTION_POOL_MAX_SIZE=150
# Reduce connection timeout to fail fast
export DB_CONNECTION_TIMEOUT=10000
# Consider adding database indexes for slow SELECT queries

Comparing Before/After Snapshots

The most valuable analysis comes from comparing diagnostics before and after load:

Metric	Before	After	Interpretation
`heap_usage_pct`	25%	85%	Significant memory allocation during load
`gc_pause_total_ms`	200	2500	2.3s of GC pauses during the test
`pool_active`	2	95	Pool went from idle to near-max
`pool_pending`	0	8	Connection contention appeared
`queue_depth`	0	450	Bulk queue built up under load

If the diagnostics_during samples are available in the health monitor data, you can plot these metrics over time to see exactly when bottlenecks emerged.

Graceful Fallback

If the server doesn't have the diagnostics endpoint (older version), the load test:

Prints a notice: Diagnostics endpoint returned status=404 (may not be available)
Falls back to Prometheus scraping (if --admin-port is set)
Skips the SERVER-SIDE BREAKDOWN section in the console output
Omits diagnostics_before/diagnostics_after from the JSON report

No hard dependency — the load test works with or without it.

12 KiB Raw Blame History

Server-Side Diagnostics & Load Test Correlation

The Diagnostics Endpoint

Basic Usage

Response Structure

Understanding Each Section

JVM

Jetty (Thread Pool)

Database (HikariCP Pool)

Database Queries (Per-Type Profiling)

Bulk Executor

Request Latency (Per-Endpoint Breakdown)

Load Test Integration

Running a Load Test with Diagnostics

Console Output

JSON Report

Bottleneck Detection Rules

Common Scenarios

Scenario 1: High Latency, DB is the Bottleneck

Scenario 2: Thread Pool Exhaustion

Scenario 3: GC Pressure

Scenario 4: Bulk Executor Queue Filling

Scenario 5: Slow DB Queries

Comparing Before/After Snapshots

Graceful Fallback

12 KiB

Raw Blame History