OpenMetadata Cluster Sizing Runbook
This runbook walks through benchmarking an OpenMetadata cluster from 10K to 5M entities, identifying where performance breaks, and applying the right configuration for your target scale.
Prerequisites
- OpenMetadata server running and accessible via HTTP
- Admin port exposed (optional, recommended for diagnostics — typically 8586)
- Python 3 installed on the benchmark host (for JSON parsing)
- curl and bash available
- Sufficient disk space for test data (each run generates JSON reports ~1-5MB)
- Network connectivity from benchmark host to server with low latency
Quick Start
Run a full progressive benchmark with a single command:
cd bin/distributed-test
./scripts/benchmark-sizing.sh \
--server http://your-server:8585 \
--workers 30 \
--ramp \
--mixed \
--output-dir ./sizing-results
This will:
- Run a ramp test to find optimal concurrency
- Loop through scales: 10k → 50k → 100k → 200k → 500k → ~2M → ~5M
- At each scale, run both sequential and realistic modes
- Stop automatically when a break-point is detected
- Generate SIZING-SUMMARY.md with results and recommendations
Step-by-Step Guide
Step 1: Environment Setup
Option A: Point to existing server
# Verify connectivity
curl -s http://your-server:8585/api/v1/system/version | python3 -m json.tool
Option B: Start local distributed cluster
cd bin/distributed-test
./scripts/start.sh
# Servers will be available at:
# Server 1: http://localhost:8585 (admin: 8586)
# Server 2: http://localhost:8587 (admin: 8588)
# Server 3: http://localhost:8589 (admin: 8590)
Note current configuration
Before benchmarking, record your current settings so you can compare before/after:
# Check server version
curl -s http://localhost:8585/api/v1/system/version
# If admin port is exposed, check diagnostics
curl -s http://localhost:8586/metrics | grep -E "jvm_memory|jetty_threads|hikari"
Key settings to note:
- JVM heap size (-Xmx)
- SERVER_MAX_THREADS
- DB_CONNECTION_POOL_MAX_SIZE
- SERVER_ENABLE_VIRTUAL_THREAD
- BULK_OPERATION_QUEUE_SIZE
Step 2: Find Optimal Concurrency (Optional)
The ramp test progressively increases worker count to find the sweet spot between throughput and latency:
./scripts/perf-test.sh --ramp --scale 10k --workers 64 --admin-port 8586
What to look for in ramp results:
| Workers | RPS | p95ms | Interpretation |
|---|---|---|---|
| 1-4 | Low | Low | Underutilizing server capacity |
| 8-16 | Peak | Low | Sweet spot — use this worker count |
| 32+ | Flat | Rising | Server saturated — more workers hurt |
The ramp test output will suggest an optimal_workers value. Use it for subsequent runs.
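To make the table above concrete, here is a minimal sketch of one way to pick a worker count from ramp data: take the smallest worker count whose RPS is within a small tolerance of the peak. The `pick_optimal_workers` helper, the record layout, and the 5% tolerance are assumptions for illustration; this is not necessarily how the script derives optimal_workers, and the numbers are made up.

```python
def pick_optimal_workers(ramp, tolerance=0.05):
    """Return the smallest worker count whose RPS is within `tolerance` of the peak."""
    peak_rps = max(step["rps"] for step in ramp)
    near_peak = [step for step in ramp if step["rps"] >= peak_rps * (1 - tolerance)]
    return min(near_peak, key=lambda step: step["workers"])["workers"]


# Illustrative numbers only, not measured results.
ramp = [
    {"workers": 1, "rps": 60.0, "p95_ms": 40},
    {"workers": 4, "rps": 210.0, "p95_ms": 55},
    {"workers": 8, "rps": 390.0, "p95_ms": 80},
    {"workers": 16, "rps": 400.0, "p95_ms": 120},
    {"workers": 32, "rps": 398.0, "p95_ms": 450},
]
print(pick_optimal_workers(ramp))
```

Preferring the smallest near-peak worker count keeps latency low: past that point the extra workers add queueing without adding throughput.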
Step 3: Run Progressive Benchmark
Full run (recommended)
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--admin-port 8586 \
--workers 30 \
--ramp \
--mixed \
--mixed-duration 60 \
--output-dir ./sizing-results
Quick 2-tier smoke test
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--start-scale 10k \
--end-scale 50k \
--output-dir /tmp/sizing-test
Sequential mode only (simpler, less contention)
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--modes seq \
--end-scale 500k
Resume after failure
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--skip-existing \
--output-dir ./sizing-results
What to expect at each scale tier
| Scale | Entities | Typical Duration | What It Tests |
|---|---|---|---|
| 10k | ~10,000 | 2-5 min | Baseline — should pass easily |
| 50k | ~50,000 | 5-15 min | Small production workload |
| 100k | ~100,000 | 10-30 min | Medium production workload |
| 200k | ~200,000 | 20-60 min | Large production workload |
| 500k | ~500,000 | 1-3 hours | Enterprise scale — DB pool pressure |
| large | ~2,000,000 | 3-8 hours | Large enterprise — thread/memory limits |
| xlarge | ~5,000,000 | 8-24 hours | Extreme scale — full cluster stress |
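For planning, the durations in the table can be summed to estimate how long a progressive run up to a given tier might take. The sketch below assumes each listed duration applies to one run of one mode and that tiers run back to back; both are assumptions, and `cumulative_minutes` is a hypothetical helper, not part of the benchmark scripts.

```python
# (scale name, typical minimum minutes, typical maximum minutes) from the table above.
TIER_MINUTES = [
    ("10k", 2, 5),
    ("50k", 5, 15),
    ("100k", 10, 30),
    ("200k", 20, 60),
    ("500k", 60, 180),
    ("large", 180, 480),
    ("xlarge", 480, 1440),
]


def cumulative_minutes(end_scale, modes=1):
    """Sum the typical duration range for every tier up to and including end_scale."""
    low = high = 0
    for name, tier_low, tier_high in TIER_MINUTES:
        low += tier_low * modes
        high += tier_high * modes
        if name == end_scale:
            return low, high
    raise ValueError(f"unknown scale: {end_scale}")


low, high = cumulative_minutes("500k", modes=2)
print(f"Through 500k, both modes: {low / 60:.1f} to {high / 60:.1f} hours")
```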
Monitoring during benchmark
While the benchmark runs, monitor the server:
# Watch server logs for errors
docker logs -f openmetadata-server-1 2>&1 | grep -E "ERROR|WARN|OOM"
# Watch DB connection pool (if admin port exposed)
watch -n 5 'curl -s http://localhost:8586/metrics | grep hikari'
# Watch JVM heap
watch -n 5 'curl -s http://localhost:8586/metrics | grep "jvm_memory_bytes_used.*heap"'
Step 4: Analyze Results
After the benchmark completes, open the generated summary:
cat ./sizing-results/SIZING-SUMMARY.md
Reading the Progressive Results table
| Scale | Mode | Entities | RPS | p95 (ms) | Errors % | Assessment |
|-------|------------|----------|-------|----------|----------|------------|
| 10k | Sequential | 10234 | 456.2 | 312 | 0.05 | adequate |
| 10k | Realistic | 10234 | 389.1 | 445 | 0.12 | adequate |
| 50k | Sequential | 50120 | 412.3 | 456 | 0.23 | adequate |
| 50k | Realistic | 50120 | 312.4 | 890 | 1.45 | marginal |
| 100k | Realistic | 98450 | 189.2 | 2345 | 12.3 | undersized |
Key columns:
- RPS: Higher is better. Watch for drops between tiers.
- p95: 95th percentile latency. Under 500ms is good, over 2000ms is concerning.
- Errors %: Under 1% is acceptable. Over 5% signals resource exhaustion.
- Assessment: adequate (good), marginal (tuning needed), undersized (upgrade needed).
Interpreting sequential vs realistic differences
- Sequential runs entity types one at a time — measures raw throughput per entity type
- Realistic runs all types concurrently — exposes cross-entity contention (DB locks, thread pool pressure)
- A large gap (>30% RPS drop or >2x p95 increase) in realistic mode indicates contention issues
- If sequential is fine but realistic breaks, focus on: DB pool size, thread count, bulk executor tuning
Understanding the break-point
The benchmark stops when it detects:
- Assessment = undersized
- Error rate > 10%
- p95 latency > 10 seconds
- Throughput per entity degraded >50% from previous tier
The tier just before the break-point is your current cluster's capacity ceiling.
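The four stop conditions can be sketched as a small check over two consecutive result rows. This is an illustration only, not the script's actual logic: the `break_reasons` helper and the field names are hypothetical, and overall RPS is used as a stand-in for per-entity throughput. The sample rows reuse the realistic-mode values from the example table above.

```python
def break_reasons(current, previous=None):
    """Return the list of break-point conditions that the current tier trips."""
    reasons = []
    if current["assessment"] == "undersized":
        reasons.append("assessment is undersized")
    if current["error_pct"] > 10:
        reasons.append("error rate above 10%")
    if current["p95_ms"] > 10_000:
        reasons.append("p95 latency above 10 seconds")
    # Assumption: compare overall RPS as a proxy for per-entity throughput.
    if previous is not None and previous["rps"] > 0:
        if current["rps"] < 0.5 * previous["rps"]:
            reasons.append("throughput dropped more than 50% from previous tier")
    return reasons


tier_50k = {"rps": 312.4, "p95_ms": 890, "error_pct": 1.45, "assessment": "marginal"}
tier_100k = {"rps": 189.2, "p95_ms": 2345, "error_pct": 12.3, "assessment": "undersized"}

print(break_reasons(tier_50k))
print(break_reasons(tier_100k, previous=tier_50k))
```

With these rows the 50k tier trips nothing and the 100k tier trips two conditions, so 50k would be the capacity ceiling.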
Reading cluster_sizing recommendations
Each JSON report contains a cluster_sizing section with specific recommendations:
python3 -c "
import json
with open('./sizing-results/sizing-50k-realistic.json') as f:
    data = json.load(f)
for k, v in data['cluster_sizing']['recommendations'].items():
    if isinstance(v, dict) and 'recommended_env' in v:
        print(f\"{k}: {v['recommended_env']} (was: {v.get('current_env', 'unknown')})\")
"
Step 5: Apply Recommendations
Configuration Matrix
| Parameter | Small (<50K) | Medium (50-200K) | Large (200K-2M) | XLarge (2-5M) |
|---|---|---|---|---|
| SERVER_MAX_THREADS | 150 | 300 | 500 | 750 |
| SERVER_ENABLE_VIRTUAL_THREAD | false | true | true | true |
| DB_CONNECTION_POOL_MAX_SIZE | 50 | 100 | 200 | 300 |
| DB_CONNECTION_TIMEOUT | 30000 | 30000 | 60000 | 60000 |
| ELASTICSEARCH_MAX_CONN_TOTAL | 50 | 100 | 200 | 300 |
| BULK_OPERATION_QUEUE_SIZE | 1000 | 2000 | 5000 | 10000 |
| BULK_OPERATION_MAX_THREADS | 10 | 20 | 30 | 50 |
| SERVER_ACCEPT_QUEUE_SIZE | 50 | 100 | 200 | 500 |
| JVM Heap (-Xmx) | 2G | 4G | 8G | 16G |
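As a sketch, the matrix can be turned into a lookup keyed by target entity count. The `settings_for` helper is hypothetical, and one assumption is made about the overlapping column boundaries: a count that sits exactly on a boundary (50K, 200K, 2M) is placed in the larger tier.

```python
TIER_NAMES = ("Small", "Medium", "Large", "XLarge")
# Exclusive upper bounds for Small/Medium/Large; XLarge is capped at 5M inclusive.
TIER_BOUNDS = (50_000, 200_000, 2_000_000, 5_000_000)

# One tuple per parameter, ordered Small, Medium, Large, XLarge (values from the matrix).
MATRIX = {
    "SERVER_MAX_THREADS": ("150", "300", "500", "750"),
    "SERVER_ENABLE_VIRTUAL_THREAD": ("false", "true", "true", "true"),
    "DB_CONNECTION_POOL_MAX_SIZE": ("50", "100", "200", "300"),
    "DB_CONNECTION_TIMEOUT": ("30000", "30000", "60000", "60000"),
    "ELASTICSEARCH_MAX_CONN_TOTAL": ("50", "100", "200", "300"),
    "BULK_OPERATION_QUEUE_SIZE": ("1000", "2000", "5000", "10000"),
    "BULK_OPERATION_MAX_THREADS": ("10", "20", "30", "50"),
    "SERVER_ACCEPT_QUEUE_SIZE": ("50", "100", "200", "500"),
    "JVM Heap (-Xmx)": ("2G", "4G", "8G", "16G"),
}


def settings_for(entity_count):
    """Return (tier name, {parameter: value}) for a target entity count."""
    if entity_count > TIER_BOUNDS[-1]:
        raise ValueError("beyond the largest tier covered by the matrix")
    index = len(TIER_NAMES) - 1
    for i, upper in enumerate(TIER_BOUNDS[:-1]):
        if entity_count < upper:
            index = i
            break
    return TIER_NAMES[index], {name: values[index] for name, values in MATRIX.items()}


tier, settings = settings_for(120_000)
print(tier)
for name, value in settings.items():
    print(f"{name}: {value}")
```

Treat the result as a starting point; the cluster_sizing recommendations from an actual benchmark run reflect the measured behavior of your own cluster.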
Applying changes
Docker Compose:
Add environment variables to your server service:
environment:
  SERVER_MAX_THREADS: "300"
  SERVER_ENABLE_VIRTUAL_THREAD: "true"
  DB_CONNECTION_POOL_MAX_SIZE: "100"
  ELASTICSEARCH_MAX_CONN_TOTAL: "100"
  BULK_OPERATION_QUEUE_SIZE: "2000"
  BULK_OPERATION_MAX_THREADS: "20"
  JAVA_OPTS: "-Xmx4g -Xms4g"
Kubernetes / Helm:
# values.yaml
openmetadata:
  config:
    serverMaxThreads: 300
    enableVirtualThread: true
    database:
      connectionPoolMaxSize: 100
      connectionTimeout: 30000
    elasticsearch:
      maxConnectionsTotal: 100
    bulkOperation:
      queueSize: 2000
      maxThreads: 20
  jvmOpts: "-Xmx4g -Xms4g"
Bare metal / openmetadata.yaml:
server:
  applicationConnectors:
    - type: http
      port: 8585
  maxThreads: 300
  enableVirtualThread: true
database:
  hikariConfig:
    maximumPoolSize: 100
    connectionTimeout: 30000
elasticsearch:
  maxConnectionsTotal: 100
Verifying changes took effect
After restarting the server with new settings:
# Check via diagnostics endpoint
curl -s http://localhost:8585/api/v1/system/diagnostics | python3 -c "
import json, sys
d = json.load(sys.stdin)
j = d.get('jetty', {})
db = d.get('database', {})
print(f\"Jetty max threads: {j.get('threads_max', 'N/A')}\")
print(f\"DB pool max: {db.get('pool_max', 'N/A')}\")
print(f\"Virtual threads: {j.get('virtual_threads_enabled', 'N/A')}\")
"
# Check via Prometheus metrics (admin port)
curl -s http://localhost:8586/metrics | grep -E "jetty_threads_max|hikari.*max"
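If you want the verification to fail loudly rather than rely on reading the output, a comparison along these lines is one option. It is a sketch: the key names (jetty.threads_max, database.pool_max, jetty.virtual_threads_enabled) are the ones used in the diagnostics snippet above, while the `find_mismatches` helper, the labels, and the sample payload are hypothetical.

```python
def find_mismatches(diag, expected):
    """Compare a diagnostics payload to expected values; return {label: (expected, actual)}."""
    jetty = diag.get("jetty", {})
    database = diag.get("database", {})
    actual = {
        "jetty_max_threads": jetty.get("threads_max"),
        "db_pool_max": database.get("pool_max"),
        "virtual_threads": jetty.get("virtual_threads_enabled"),
    }
    return {
        label: (want, actual.get(label))
        for label, want in expected.items()
        if actual.get(label) != want
    }


# Hypothetical payload shaped like the fields read by the snippet above.
diag = {
    "jetty": {"threads_max": 300, "virtual_threads_enabled": True},
    "database": {"pool_max": 50},
}
expected = {"jetty_max_threads": 300, "db_pool_max": 100, "virtual_threads": True}

for label, (want, got) in find_mismatches(diag, expected).items():
    print(f"{label}: expected {want}, server reports {got}")
```

An empty result means every checked setting matches; a non-empty one usually means the server was not restarted or the variable was set on the wrong service.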
Step 6: Re-validate
After applying configuration changes, re-run the benchmark at your target scale:
# Re-run just the tier that previously broke
./scripts/perf-test.sh \
--scale 100k \
--realistic \
--mixed \
--workers 30 \
--admin-port 8586 \
--output ./sizing-results/sizing-100k-realistic-tuned.json
# Or re-run the full progressive suite
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--admin-port 8586 \
--output-dir ./sizing-results-tuned
Compare before/after:
# Side-by-side comparison
python3 -c "
import json
with open('./sizing-results/sizing-100k-realistic.json') as f:
    before = json.load(f)
with open('./sizing-results/sizing-100k-realistic-tuned.json') as f:
    after = json.load(f)
b = before['overall']
a = after['overall']
print(f\"{'Metric':<20} {'Before':>12} {'After':>12} {'Change':>12}\")
print('-' * 56)
for key in ['overall_throughput_rps', 'overall_error_rate_pct']:
    bv = b.get(key, 0)
    av = a.get(key, 0)
    diff = ((av - bv) / bv * 100) if bv > 0 else 0
    print(f\"{key:<20} {bv:>12.1f} {av:>12.1f} {diff:>+11.1f}%\")
print(f\"\nBefore: {before['cluster_sizing']['assessment']}\")
print(f\"After: {after['cluster_sizing']['assessment']}\")
"
Configuration Reference
| Parameter | Default | Description | When to Increase |
|---|---|---|---|
| SERVER_MAX_THREADS | 150 | Jetty HTTP thread pool | Jetty utilization >80% |
| SERVER_ENABLE_VIRTUAL_THREAD | false | Use JDK 21 virtual threads | Always enable at >50K entities |
| DB_CONNECTION_POOL_MAX_SIZE | 50 | HikariCP max connections | DB pool utilization >70% or pending >0 |
| DB_CONNECTION_TIMEOUT | 30000 | Max wait for DB connection (ms) | Connection acquire timeouts |
| ELASTICSEARCH_MAX_CONN_TOTAL | 50 | Max HTTP connections to ES/OS | Search latency spikes |
| BULK_OPERATION_QUEUE_SIZE | 1000 | Async bulk operation queue depth | 503 "queue full" errors |
| BULK_OPERATION_MAX_THREADS | 10 | Worker threads for bulk ops | Queue filling up, low throughput |
| SERVER_ACCEPT_QUEUE_SIZE | 50 | TCP accept backlog | Connection refused under load |
Interpreting Results
Assessment Meanings
| Assessment | Meaning | Action |
|---|---|---|
| adequate | Server handles this scale with acceptable latency and error rates | No changes needed |
| marginal | Performance is degrading: latency rising, some errors appearing | Tune configuration (Step 5) |
| undersized | Server cannot handle this scale: high errors, extreme latency | Upgrade resources or reduce scale target |
Common Patterns and Fixes
| Pattern | Symptom | Fix |
|---|---|---|
| DB pool exhaustion | p95 spikes, bimodal latency, pending connections >0 | Increase DB_CONNECTION_POOL_MAX_SIZE |
| Thread pool saturation | Jetty utilization >90%, queue depth growing | Increase SERVER_MAX_THREADS, enable virtual threads |
| Bulk executor overflow | 503 errors with "queue full" | Increase BULK_OPERATION_QUEUE_SIZE and BULK_OPERATION_MAX_THREADS |
| GC pressure | p99/p95 ratio >3x, periodic latency spikes | Increase heap, tune GC settings |
| Search bottleneck | High search latency in breakdown, low ES connection usage | Increase ELASTICSEARCH_MAX_CONN_TOTAL |
| Sequential OK, realistic broken | Performance fine per-type but breaks under mixed load | All of the above — concurrent load compounds contention |
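The numeric symptoms in the table can be checked mechanically. The sketch below maps a handful of metrics to the fixes listed above; the `suggest_fixes` helper and the metric key names are hypothetical, so you would need to map them onto whatever your reports or Prometheus metrics actually expose. The search-bottleneck row is left out because the table gives no numeric threshold for it.

```python
def suggest_fixes(metrics):
    """Map observed symptoms to the fixes from the patterns table."""
    fixes = []
    if metrics.get("db_pending_connections", 0) > 0:
        fixes.append("Increase DB_CONNECTION_POOL_MAX_SIZE")
    if metrics.get("jetty_utilization_pct", 0) > 90:
        fixes.append("Increase SERVER_MAX_THREADS, enable virtual threads")
    if metrics.get("queue_full_503s", 0) > 0:
        fixes.append("Increase BULK_OPERATION_QUEUE_SIZE and BULK_OPERATION_MAX_THREADS")
    p95 = metrics.get("p95_ms", 0)
    # A p99/p95 ratio above 3x is the GC pressure signal from the table.
    if p95 > 0 and metrics.get("p99_ms", 0) / p95 > 3:
        fixes.append("Increase heap, tune GC settings")
    return fixes


# Illustrative values only.
observed = {
    "db_pending_connections": 4,
    "jetty_utilization_pct": 72,
    "p95_ms": 400,
    "p99_ms": 1500,
}
for fix in suggest_fixes(observed):
    print(fix)
```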
Sequential vs Realistic Differences
- <10% RPS drop: Excellent — server handles contention well
- 10-30% RPS drop: Normal — some contention expected
- 30-50% RPS drop: Concerning — review DB pool and thread settings
- >50% RPS drop: Critical contention — likely needs resource upgrades
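These bands can be applied to any pair of rows from the results table. A minimal sketch follows, with a hypothetical `contention_band` helper; a drop that lands exactly on 30% or 50% is assigned to the milder band, which is an assumption since the listed ranges overlap at their edges. The sample values are the 10k rows from the example table in Step 4.

```python
def contention_band(sequential_rps, realistic_rps):
    """Return (drop in percent, band label) for a sequential vs realistic RPS pair."""
    if sequential_rps <= 0:
        raise ValueError("sequential RPS must be positive")
    drop_pct = (sequential_rps - realistic_rps) / sequential_rps * 100
    if drop_pct < 10:
        band = "excellent"
    elif drop_pct <= 30:
        band = "normal"
    elif drop_pct <= 50:
        band = "concerning"
    else:
        band = "critical"
    return drop_pct, band


drop, band = contention_band(456.2, 389.1)
print(f"{drop:.1f}% RPS drop: {band}")
```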
Troubleshooting
Server OOM during benchmark
Symptoms: Server process killed, connection refused mid-run.
# Check if server was OOM-killed
docker logs openmetadata-server-1 2>&1 | tail -50
dmesg | grep -i "oom\|killed"
# Fix: Increase heap
# In docker-compose.yaml:
# JAVA_OPTS: "-Xmx8g -Xms8g"
Connection refused errors
Symptoms: Benchmark shows many connection_error entries.
# Check server is running
curl -s http://localhost:8585/api/v1/system/version
# Check accept queue
curl -s http://localhost:8586/metrics | grep "jetty_queued"
# Fix: Increase SERVER_ACCEPT_QUEUE_SIZE, check firewall rules
Benchmark script hangs
Symptoms: No progress for >5 minutes.
# Check server health
curl -s http://localhost:8585/api/v1/system/version
# Check server thread dumps for deadlocks
curl -s http://localhost:8586/threads
# The script has a 60s per-request timeout; if the server is extremely slow
# but responsive, individual requests may be timing out and retrying
How to resume after failure
The --skip-existing flag lets you resume:
# First, check what's already completed
ls -la ./sizing-results/sizing-*.json
# Resume from where it left off
./scripts/benchmark-sizing.sh \
--skip-existing \
--output-dir ./sizing-results
Individual tiers can also be re-run manually:
./scripts/perf-test.sh \
--scale 100k \
--realistic \
--workers 30 \
--output ./sizing-results/sizing-100k-realistic.json
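When checking what has already completed, the report filenames can be parsed into (scale, mode) pairs. This sketch assumes the naming pattern sizing-<scale>-<mode>.json seen in the examples in this runbook; the `completed_runs` helper is hypothetical, the exact mode strings the script writes may differ, and this is not how --skip-existing is implemented.

```python
import re

# Assumed pattern, inferred from names such as sizing-50k-realistic.json.
REPORT_NAME = re.compile(r"^sizing-(?P<scale>[^-]+)-(?P<mode>[^-.]+)\.json$")


def completed_runs(filenames):
    """Return the set of (scale, mode) pairs that already have a report file."""
    done = set()
    for name in filenames:
        match = REPORT_NAME.match(name)
        if match:
            done.add((match["scale"], match["mode"]))
    return done


# Hypothetical directory listing; names with extra suffixes (e.g. -tuned) are ignored.
names = [
    "sizing-10k-realistic.json",
    "sizing-50k-realistic.json",
    "sizing-100k-realistic-tuned.json",
    "SIZING-SUMMARY.md",
]
for scale, mode in sorted(completed_runs(names)):
    print(scale, mode)
```

To run it against a real output directory, pass `os.listdir("./sizing-results")` as the filename list.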