OpenMetadata/bin/distributed-test/CLUSTER-SIZING-RUNBOOK.md
Mohit Yadav 7b6360a9ed
Reindex Work - Perf , Metrics , Benchmarking and More (#26231)
* Update Perf

* Add multi asset scale count

* Update perf and Usage

* Fix recommendation

* Add Benchmarking script and doc

* Fix Perf

* Add --no break to benchmark

* add more metrics and validation for indexes miss

* Update generated TypeScript types

* Bound Doc Virtual Threads

* Remove Additional Properties from the UI

* Update doc

* Fix Job Getting Marked Stopped

* Server killed logs fixes

* Add Server stat to Quartz Progress

* Fix CPU spiking

* Make Auto Tune Consider JVm configs

* Fix Partition Calculator and Recovery Job Stats

* Update Auto Tune to show up in logs and stored in config

* Fix Auto Tune Config not store in app run record

* Fix OnDemand Job type

* Indexing Failures not flushed fixed

* Fix Stat counting at job level with process job failures

* Add Reindex Job Identifier

* Add Thread Identifiers

* Wait for sink

* Wait for sink

* Fix Stopping to let partitions finish the job

* CPU Budgeting

* More Conservative settings

* Address Review Comment

* fix Open Search Index Manager

* Reapply OpenSearch BulkSink

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-03-10 08:10:46 +05:30

469 lines
14 KiB
Markdown

# OpenMetadata Cluster Sizing Runbook
This runbook walks through benchmarking an OpenMetadata cluster from 10K to 5M entities, identifying where performance breaks, and applying the right configuration for your target scale.
## Prerequisites
- OpenMetadata server running and accessible via HTTP
- Admin port exposed (optional, recommended for diagnostics — typically 8586)
- Python 3 installed on the benchmark host (for JSON parsing)
- `curl` and `bash` available
- Sufficient disk space for test data (each run generates JSON reports ~1-5MB)
- Network connectivity from benchmark host to server with low latency
## Quick Start
Run a full progressive benchmark with a single command:
```bash
cd bin/distributed-test
./scripts/benchmark-sizing.sh \
--server http://your-server:8585 \
--workers 30 \
--ramp \
--mixed \
--output-dir ./sizing-results
```
This will:
1. Run a ramp test to find optimal concurrency
2. Loop through scales: 10k → 50k → 100k → 200k → 500k → ~2M → ~5M
3. At each scale, run both sequential and realistic modes
4. Stop automatically when a break-point is detected
5. Generate `SIZING-SUMMARY.md` with results and recommendations
## Step-by-Step Guide
### Step 1: Environment Setup
#### Option A: Point to existing server
```bash
# Verify connectivity
curl -s http://your-server:8585/api/v1/system/version | python3 -m json.tool
```
#### Option B: Start local distributed cluster
```bash
cd bin/distributed-test
./scripts/start.sh
# Servers will be available at:
# Server 1: http://localhost:8585 (admin: 8586)
# Server 2: http://localhost:8587 (admin: 8588)
# Server 3: http://localhost:8589 (admin: 8590)
```
#### Note current configuration
Before benchmarking, record your current settings so you can compare before/after:
```bash
# Check server version
curl -s http://localhost:8585/api/v1/system/version
# If admin port is exposed, check diagnostics
curl -s http://localhost:8586/metrics | grep -E "jvm_memory|jetty_threads|hikari"
```
Key settings to note:
- JVM heap size (`-Xmx`)
- `SERVER_MAX_THREADS`
- `DB_CONNECTION_POOL_MAX_SIZE`
- `SERVER_ENABLE_VIRTUAL_THREAD`
- `BULK_OPERATION_QUEUE_SIZE`
### Step 2: Find Optimal Concurrency (Optional)
The ramp test progressively increases worker count to find the sweet spot between throughput and latency:
```bash
./scripts/perf-test.sh --ramp --scale 10k --workers 64 --admin-port 8586
```
**What to look for in ramp results:**
| Workers | RPS | p95ms | Interpretation |
|---------|------|-------|----------------|
| 1-4 | Low | Low | Underutilizing server capacity |
| 8-16 | Peak | Low | Sweet spot — use this worker count |
| 32+ | Flat | Rising | Server saturated — more workers hurt |
The ramp test output will suggest an `optimal_workers` value. Use it for subsequent runs.
### Step 3: Run Progressive Benchmark
#### Full run (recommended)
```bash
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--admin-port 8586 \
--workers 30 \
--ramp \
--mixed \
--mixed-duration 60 \
--output-dir ./sizing-results
```
#### Quick 2-tier smoke test
```bash
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--start-scale 10k \
--end-scale 50k \
--output-dir /tmp/sizing-test
```
#### Sequential mode only (simpler, less contention)
```bash
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--modes seq \
--end-scale 500k
```
#### Resume after failure
```bash
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--skip-existing \
--output-dir ./sizing-results
```
#### What to expect at each scale tier
| Scale | Entities | Typical Duration | What It Tests |
|-------|----------|-----------------|---------------|
| 10k | ~10,000 | 2-5 min | Baseline — should pass easily |
| 50k | ~50,000 | 5-15 min | Small production workload |
| 100k | ~100,000 | 10-30 min | Medium production workload |
| 200k | ~200,000 | 20-60 min | Large production workload |
| 500k | ~500,000 | 1-3 hours | Enterprise scale — DB pool pressure |
| large | ~2,000,000 | 3-8 hours | Large enterprise — thread/memory limits |
| xlarge | ~5,000,000 | 8-24 hours | Extreme scale — full cluster stress |
#### Monitoring during benchmark
While the benchmark runs, monitor the server:
```bash
# Watch server logs for errors
docker logs -f openmetadata-server-1 2>&1 | grep -E "ERROR|WARN|OOM"
# Watch DB connection pool (if admin port exposed)
watch -n 5 'curl -s http://localhost:8586/metrics | grep hikari'
# Watch JVM heap
watch -n 5 'curl -s http://localhost:8586/metrics | grep "jvm_memory_bytes_used.*heap"'
```
### Step 4: Analyze Results
After the benchmark completes, open the generated summary:
```bash
cat ./sizing-results/SIZING-SUMMARY.md
```
#### Reading the Progressive Results table
```
| Scale | Mode | Entities | RPS | p95 (ms) | Errors % | Assessment |
|-------|------------|----------|-------|----------|----------|------------|
| 10k | Sequential | 10234 | 456.2 | 312 | 0.05 | adequate |
| 10k | Realistic | 10234 | 389.1 | 445 | 0.12 | adequate |
| 50k | Sequential | 50120 | 412.3 | 456 | 0.23 | adequate |
| 50k | Realistic | 50120 | 312.4 | 890 | 1.45 | marginal |
| 100k | Realistic | 98450 | 189.2 | 2345 | 12.3 | undersized |
```
Key columns:
- **RPS**: Higher is better. Watch for drops between tiers.
- **p95**: 95th percentile latency. Under 500ms is good, over 2000ms is concerning.
- **Errors %**: Under 1% is acceptable. Over 5% signals resource exhaustion.
- **Assessment**: `adequate` (good), `marginal` (tuning needed), `undersized` (upgrade needed).
#### Interpreting sequential vs realistic differences
- **Sequential** runs entity types one at a time — measures raw throughput per entity type
- **Realistic** runs all types concurrently — exposes cross-entity contention (DB locks, thread pool pressure)
- A large gap (>30% RPS drop or >2x p95 increase) in realistic mode indicates contention issues
- If sequential is fine but realistic breaks, focus on: DB pool size, thread count, bulk executor tuning
#### Understanding the break-point
The benchmark stops when it detects:
- Assessment = `undersized`
- Error rate > 10%
- p95 latency > 10 seconds
- Throughput per entity degraded >50% from previous tier
The tier just before the break-point is your current cluster's capacity ceiling.
#### Reading cluster_sizing recommendations
Each JSON report contains a `cluster_sizing` section with specific recommendations:
```bash
python3 -c "
import json
with open('./sizing-results/sizing-50k-realistic.json') as f:
data = json.load(f)
for k, v in data['cluster_sizing']['recommendations'].items():
if isinstance(v, dict) and 'recommended_env' in v:
print(f\"{k}: {v['recommended_env']} (was: {v.get('current_env', 'unknown')})\")
"
```
### Step 5: Apply Recommendations
#### Configuration Matrix
| Parameter | Small (<50K) | Medium (50-200K) | Large (200K-2M) | XLarge (2-5M) |
|-----------|-------------|-------------------|-----------------|---------------|
| `SERVER_MAX_THREADS` | 150 | 300 | 500 | 750 |
| `SERVER_ENABLE_VIRTUAL_THREAD` | false | true | true | true |
| `DB_CONNECTION_POOL_MAX_SIZE` | 50 | 100 | 200 | 300 |
| `DB_CONNECTION_TIMEOUT` | 30000 | 30000 | 60000 | 60000 |
| `ELASTICSEARCH_MAX_CONN_TOTAL` | 50 | 100 | 200 | 300 |
| `BULK_OPERATION_QUEUE_SIZE` | 1000 | 2000 | 5000 | 10000 |
| `BULK_OPERATION_MAX_THREADS` | 10 | 20 | 30 | 50 |
| `SERVER_ACCEPT_QUEUE_SIZE` | 50 | 100 | 200 | 500 |
| JVM Heap (`-Xmx`) | 2G | 4G | 8G | 16G |
#### Applying changes
**Docker Compose:**
Add environment variables to your server service:
```yaml
environment:
SERVER_MAX_THREADS: "300"
SERVER_ENABLE_VIRTUAL_THREAD: "true"
DB_CONNECTION_POOL_MAX_SIZE: "100"
ELASTICSEARCH_MAX_CONN_TOTAL: "100"
BULK_OPERATION_QUEUE_SIZE: "2000"
BULK_OPERATION_MAX_THREADS: "20"
JAVA_OPTS: "-Xmx4g -Xms4g"
```
**Kubernetes / Helm:**
```yaml
# values.yaml
openmetadata:
config:
serverMaxThreads: 300
enableVirtualThread: true
database:
connectionPoolMaxSize: 100
connectionTimeout: 30000
elasticsearch:
maxConnectionsTotal: 100
bulkOperation:
queueSize: 2000
maxThreads: 20
jvmOpts: "-Xmx4g -Xms4g"
```
**Bare metal / `openmetadata.yaml`:**
```yaml
server:
applicationConnectors:
- type: http
port: 8585
maxThreads: 300
enableVirtualThread: true
database:
hikariConfig:
maximumPoolSize: 100
connectionTimeout: 30000
elasticsearch:
maxConnectionsTotal: 100
```
#### Verifying changes took effect
After restarting the server with new settings:
```bash
# Check via diagnostics endpoint
curl -s http://localhost:8585/api/v1/system/diagnostics | python3 -c "
import json, sys
d = json.load(sys.stdin)
j = d.get('jetty', {})
db = d.get('database', {})
print(f\"Jetty max threads: {j.get('threads_max', 'N/A')}\")
print(f\"DB pool max: {db.get('pool_max', 'N/A')}\")
print(f\"Virtual threads: {j.get('virtual_threads_enabled', 'N/A')}\")
"
# Check via Prometheus metrics (admin port)
curl -s http://localhost:8586/metrics | grep -E "jetty_threads_max|hikari.*max"
```
### Step 6: Re-validate
After applying configuration changes, re-run the benchmark at your target scale:
```bash
# Re-run just the tier that previously broke
./scripts/perf-test.sh \
--scale 100k \
--realistic \
--mixed \
--workers 30 \
--admin-port 8586 \
--output ./sizing-results/sizing-100k-realistic-tuned.json
# Or re-run the full progressive suite
./scripts/benchmark-sizing.sh \
--server http://localhost:8585 \
--admin-port 8586 \
--output-dir ./sizing-results-tuned
```
Compare before/after:
```bash
# Side-by-side comparison
python3 -c "
import json
with open('./sizing-results/sizing-100k-realistic.json') as f:
before = json.load(f)
with open('./sizing-results/sizing-100k-realistic-tuned.json') as f:
after = json.load(f)
b = before['overall']
a = after['overall']
print(f\"{'Metric':<20} {'Before':>12} {'After':>12} {'Change':>12}\")
print('-' * 56)
for key in ['overall_throughput_rps', 'overall_error_rate_pct']:
bv = b.get(key, 0)
av = a.get(key, 0)
diff = ((av - bv) / bv * 100) if bv > 0 else 0
print(f\"{key:<20} {bv:>12.1f} {av:>12.1f} {diff:>+11.1f}%\")
print(f\"\nBefore: {before['cluster_sizing']['assessment']}\")
print(f\"After: {after['cluster_sizing']['assessment']}\")
"
```
## Configuration Reference
| Parameter | Default | Description | When to Increase |
|-----------|---------|-------------|-----------------|
| `SERVER_MAX_THREADS` | 150 | Jetty HTTP thread pool | Jetty utilization >80% |
| `SERVER_ENABLE_VIRTUAL_THREAD` | false | Use JDK21 virtual threads | Always enable at >50K entities |
| `DB_CONNECTION_POOL_MAX_SIZE` | 50 | HikariCP max connections | DB pool utilization >70% or pending >0 |
| `DB_CONNECTION_TIMEOUT` | 30000 | Max wait for DB connection (ms) | Connection acquire timeouts |
| `ELASTICSEARCH_MAX_CONN_TOTAL` | 50 | Max HTTP connections to ES/OS | Search latency spikes |
| `BULK_OPERATION_QUEUE_SIZE` | 1000 | Async bulk operation queue depth | 503 "queue full" errors |
| `BULK_OPERATION_MAX_THREADS` | 10 | Worker threads for bulk ops | Queue filling up, low throughput |
| `SERVER_ACCEPT_QUEUE_SIZE` | 50 | TCP accept backlog | Connection refused under load |
## Interpreting Results
### Assessment Meanings
| Assessment | Meaning | Action |
|------------|---------|--------|
| `adequate` | Server handles this scale with acceptable latency and error rates | No changes needed |
| `marginal` | Performance is degrading — latency rising, some errors appearing | Tune configuration (Step 5) |
| `undersized` | Server cannot handle this scale — high errors, extreme latency | Upgrade resources or reduce scale target |
### Common Patterns and Fixes
| Pattern | Symptom | Fix |
|---------|---------|-----|
| DB pool exhaustion | p95 spikes, bimodal latency, pending connections >0 | Increase `DB_CONNECTION_POOL_MAX_SIZE` |
| Thread pool saturation | Jetty utilization >90%, queue depth growing | Increase `SERVER_MAX_THREADS`, enable virtual threads |
| Bulk executor overflow | 503 errors with "queue full" | Increase `BULK_OPERATION_QUEUE_SIZE` and `BULK_OPERATION_MAX_THREADS` |
| GC pressure | p99/p95 ratio >3x, periodic latency spikes | Increase heap, tune GC settings |
| Search bottleneck | High search latency in breakdown, low ES connection usage | Increase `ELASTICSEARCH_MAX_CONN_TOTAL` |
| Sequential OK, realistic broken | Performance fine per-type but breaks under mixed load | All of the above — concurrent load compounds contention |
### Sequential vs Realistic Differences
- **<10% RPS drop**: Excellent server handles contention well
- **10-30% RPS drop**: Normal some contention expected
- **30-50% RPS drop**: Concerning review DB pool and thread settings
- **>50% RPS drop**: Critical contention — likely needs resource upgrades
## Troubleshooting
### Server OOM during benchmark
Symptoms: Server process killed, connection refused mid-run.
```bash
# Check if server was OOM-killed
docker logs openmetadata-server-1 2>&1 | tail -50
dmesg | grep -i "oom\|killed"
# Fix: Increase heap
# In docker-compose.yaml:
# JAVA_OPTS: "-Xmx8g -Xms8g"
```
### Connection refused errors
Symptoms: Benchmark shows many `connection_error` entries.
```bash
# Check server is running
curl -s http://localhost:8585/api/v1/system/version
# Check accept queue
curl -s http://localhost:8586/metrics | grep "jetty_queued"
# Fix: Increase SERVER_ACCEPT_QUEUE_SIZE, check firewall rules
```
### Benchmark script hangs
Symptoms: No progress for >5 minutes.
```bash
# Check server health
curl -s http://localhost:8585/api/v1/system/version
# Check server thread dumps for deadlocks
curl -s http://localhost:8586/threads
# The script has a 60s per-request timeout; if the server is extremely slow
# but responsive, individual requests may be timing out and retrying
```
### How to resume after failure
The `--skip-existing` flag lets you resume:
```bash
# First, check what's already completed
ls -la ./sizing-results/sizing-*.json
# Resume from where it left off
./scripts/benchmark-sizing.sh \
--skip-existing \
--output-dir ./sizing-results
```
Individual tiers can also be re-run manually:
```bash
./scripts/perf-test.sh \
--scale 100k \
--realistic \
--workers 30 \
--output ./sizing-results/sizing-100k-realistic.json
```