OpenMetadata/bin/distributed-test/USAGE.md
Mohit Yadav 7b6360a9ed
Reindex Work - Perf , Metrics , Benchmarking and More (#26231)
* Update Perf

* Add multi asset scale count

* Update perf and Usage

* Fix recommendation

* Add Benchmarking script and doc

* Fix Perf

* Add --no break to benchmark

* add more metrics and validation for indexes miss

* Update generated TypeScript types

* Bound Doc Virtual Threads

* Remove Additional Properties from the UI

* Update doc

* Fix Job Getting Marked Stopped

* Server killed logs fixes

* Add Server stat to Quartz Progress

* Fix CPU spiking

* Make Auto Tune Consider JVm configs

* Fix Partition Calculator and Recovery Job Stats

* Update Auto Tune to show up in logs and stored in config

* Fix Auto Tune Config not store in app run record

* Fix OnDemand Job type

* Indexing Failures not flushed fixed

* Fix Stat counting at job level with process job failures

* Add Reindex Job Identifier

* Add Thread Identifiers

* Wait for sink

* Wait for sink

* Fix Stopping to let partitions finish the job

* CPU Budgeting

* More Conservative settings

* Address Review Comment

* fix Open Search Index Manager

* Reapply OpenSearch BulkSink

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-03-10 08:10:46 +05:30

246 lines
8.9 KiB
Markdown

# Distributed Indexing Load Test Scripts
Scripts for generating test data and triggering reindexing to load-test the OpenMetadata search indexing pipeline.
## Quick Start
```bash
# 1. Start the environment
./scripts/start.sh
# 2. Load test data (~50K entities)
./scripts/perf-test.sh --scale small --server http://localhost:8585
# 3. Trigger reindex
./scripts/trigger-reindex.sh
# 4. Monitor logs
./scripts/logs.sh
# 5. Stop the environment
./scripts/stop.sh
```
## perf-test.sh
Generates entities across 30+ entity types, including time-series data, lineage edges, and data quality entities. Uses concurrent workers for high throughput.
### Scale Presets
Use `--scale` to pick a preset:
| Preset | Approximate Total | Use Case |
|--------|-------------------|----------|
| `small` | ~50K | Quick smoke tests, CI |
| `medium` | ~500K | Integration testing |
| `large` | ~2M | Performance validation |
| `xlarge` | ~5M | Full-scale load testing |
```bash
# Small smoke test
./perf-test.sh --scale small --server http://localhost:8585
# Full 5M load test
./perf-test.sh --scale xlarge --server http://localhost:8585
# Quick mode (~10K, fastest)
./perf-test.sh --quick --server http://localhost:8585
```
Default (no `--scale` or `--quick`) produces ~46K entities for backward compatibility.
### Overriding Individual Counts
Any `--entity-type NUM` flag overrides the preset for that entity type:
```bash
# Small preset but with 100K tables
./perf-test.sh --scale small --tables 100000
# Only create tables and dashboards (everything else stays at preset counts)
./perf-test.sh --scale small --tables 50000 --dashboards 10000
```
### All Flags
#### Entity counts
| Flag | Default | Description |
|------|---------|-------------|
| `--tables NUM` | 20000 | Database tables |
| `--topics NUM` | 3000 | Kafka/messaging topics |
| `--dashboards NUM` | 5000 | Looker dashboards |
| `--charts NUM` | 10000 | Dashboard charts |
| `--pipelines NUM` | 3000 | Airflow pipelines |
| `--stored-procedures NUM` | 0 | Stored procedures |
| `--containers NUM` | 2000 | S3 containers |
| `--search-indexes NUM` | 1000 | Elasticsearch indexes |
| `--mlmodels NUM` | 2000 | ML models |
| `--queries NUM` | 0 | SQL queries |
| `--data-models NUM` | 0 | Dashboard data models |
| `--test-suites NUM` | 0 | Test suites |
| `--test-cases NUM` | 0 | Test cases (linked to tables) |
| `--glossaries NUM` | 50 | Glossaries |
| `--glossary-terms NUM` | 5000 | Glossary terms |
| `--classifications NUM` | 20 | Tag classifications |
| `--tags NUM` | 1000 | Tags |
| `--users NUM` | 0 | Users |
| `--teams NUM` | 0 | Teams |
| `--domains NUM` | 0 | Domains |
| `--data-products NUM` | 0 | Data products (need domains) |
| `--api-collections NUM` | 0 | API collections |
| `--api-endpoints NUM` | 0 | API endpoints (need collections) |
| `--lineage-edges NUM` | 0 | Lineage edges between entities |
#### Time-series entity counts
| Flag | Default | Description |
|------|---------|-------------|
| `--test-case-results NUM` | 0 | Test case results (need test cases) |
| `--entity-report-data NUM` | 0 | Entity report data insights |
| `--web-analytic-views NUM` | 0 | Web analytic entity view reports |
| `--web-analytic-activity NUM` | 0 | Web analytic user activity reports |
| `--raw-cost-analysis NUM` | 0 | Raw cost analysis reports |
| `--aggregated-cost-analysis NUM` | 0 | Aggregated cost analysis reports |
#### Other options
| Flag | Default | Description |
|------|---------|-------------|
| `--server URL` | `http://localhost:8585` | Target OpenMetadata server |
| `--workers NUM` | 20 | Concurrent HTTP workers |
| `--quick` | - | Quick mode preset (~10K entities) |
| `--scale PRESET` | - | Scale preset (small/medium/large/xlarge) |
| `--skip-reads` | - | Skip read benchmarking phase (Phase 8) |
| `--only-reads` | - | Skip write phases; discover existing entities and run reads only |
| `--mixed` | - | Run mixed read/write workload (Phase 9) |
| `--mixed-duration SECS` | 60 | Duration of mixed workload in seconds |
| `--read-ratio PCT` | 80 | Read percentage in mixed workload (0-100) |
| `--realistic` | - | Run Phase 4 entity creation concurrently across entity types using a shared worker pool |
### Entity Creation Order
The script creates entities in dependency order across up to 9 phases:
```
Phase 1 Metadata domains, classifications, tags, glossaries, terms, users, teams
Phase 2 Services database, dashboard, pipeline, messaging, ML, storage, search, API
Phase 3 Infrastructure databases, schemas, API collections
Phase 4 Core entities tables, dashboards, charts, topics, pipelines, storedProcedures,
containers, searchIndexes, mlmodels, queries, dataModels,
apiEndpoints, dataProducts
Phase 5 Data Quality testSuites, testCases
Phase 6 Lineage table->table (60%), table->dashboard (25%), pipeline->table (15%)
Phase 7 Time-Series testCaseResults, entityReportData, webAnalyticViews,
webAnalyticActivity, rawCostAnalysis, aggCostAnalysis
Phase 8 Read Benchmarks entity fetch, paginated list, search queries, lineage traversal
Phase 9 Mixed Workload concurrent reads + writes for configurable duration (--mixed)
```
Phases 8 and 9 are optional:
- Phase 8 runs automatically unless `--skip-reads` is passed
- Phase 9 only runs when `--mixed` is passed
- `--only-reads` skips phases 1-7, discovers existing entities, and runs Phase 8
### Entity Linking
- **Tables, dashboards, pipelines**: IDs collected during Phase 4 for use in lineage (Phase 6)
- **Test cases**: FQNs collected for testCaseResult creation (Phase 7)
- **Lineage edges**: Use collected UUIDs via `PUT /api/v1/lineage`
- Collections are capped at `max(lineage_edges * 2, test_case_results)` to bound memory
### Auto-Scaling Infrastructure
Databases and schemas scale automatically with table count:
- `NUM_DATABASES = max(1, tables / 50000)`
- `SCHEMAS_PER_DB = min(20, tables / (databases * 5000))`
- This keeps ~5000 tables per schema at any scale
### Retry Logic
HTTP requests retry up to 3 times with exponential backoff (1s, 2s, 4s) on:
- 5xx server errors
- Connection errors / timeouts
### Realistic Concurrent Workload (`--realistic`)
By default, `--workers N` creates N concurrent workers **per entity type**, but entity types run
sequentially (all tables first, then all dashboards, etc.). With `--realistic`, all Phase 4 entity
types are created concurrently through a **single shared worker pool**, simulating real-world traffic
where tables, dashboards, topics, and pipelines all hit the server at the same time.
This exposes contention patterns not visible in sequential mode:
- Cross-entity DB lock contention
- Shared thread pool pressure
- Connection pool exhaustion under mixed workloads
```bash
# Realistic mode: all entity types hit the server concurrently
./perf-test.sh --scale 10k --realistic --server http://localhost:8585
# Compare with sequential mode (default)
./perf-test.sh --scale 10k --server http://localhost:8585
```
The report includes a `realistic_combined` entry showing combined RPS and latency distribution
across all entity types, in addition to individual per-entity-type metrics.
### Performance Tips
- Use `--workers 30` or higher if the server can handle it
- Time-series and lineage phases use `min(10, workers)` to avoid overwhelming the server
- At `xlarge` scale, expect the script to run for several hours depending on server capacity
- Monitor server logs for 429/503 errors and reduce workers if needed
### Multi-Scale Benchmarking
Run benchmarks across multiple asset counts to compare performance at different scales:
```bash
# Benchmark at 10k, 50k, 100k, and 200k entities
for scale in 10k 50k 100k 200k; do
./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
--output "/tmp/bench-${scale}.json" --workers 20 2>&1 | tee "/tmp/bench-${scale}.log"
done
```
With read and mixed workload benchmarks included:
```bash
for scale in 10k 50k 100k; do
./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
--mixed --mixed-duration 30 \
--output "/tmp/bench-${scale}.json" 2>&1 | tee "/tmp/bench-${scale}.log"
done
```
Compare results across scales:
```bash
for f in /tmp/bench-*.json; do
echo "=== $(basename $f) ==="
python3 -c "
import json
r = json.load(open('$f'))
o = r['overall']
print(f\" Entities: {o['total_entities_created']:,} RPS: {o['overall_throughput_rps']:.1f}\"
f\" Errors: {o['overall_error_rate_pct']:.1f}% Time: {o['total_wall_clock_s']:.0f}s\")
"
done
```
## Verification After Loading
```bash
# 1. Trigger reindex
./scripts/trigger-reindex.sh
# 2. Check partition table for all entity types
mysql -e "SELECT DISTINCT entityType FROM search_index_partition ORDER BY entityType;"
# 3. Verify counts in UI
# - Data Assets: tables, topics, dashboards, pipelines, etc.
# - Data Quality: test suites and test cases
# - Lineage: visible edges between tables/dashboards/pipelines
# - Data Insights: time-series charts for entity reports, web analytics, cost analysis
```