OpenMetadata/bin/distributed-test/USAGE.md
Mohit Yadav 7b6360a9ed
2026-03-10 08:10:46 +05:30

Distributed Indexing Load Test Scripts

Scripts for generating test data and triggering reindexing to load-test the OpenMetadata search indexing pipeline.

Quick Start

# 1. Start the environment
./scripts/start.sh

# 2. Load test data (~50K entities)
./scripts/perf-test.sh --scale small --server http://localhost:8585

# 3. Trigger reindex
./scripts/trigger-reindex.sh

# 4. Monitor logs
./scripts/logs.sh

# 5. Stop the environment
./scripts/stop.sh

perf-test.sh

Generates entities across 30+ entity types, including time-series data, lineage edges, and data quality entities. Uses concurrent workers for high throughput.

Scale Presets

Use --scale to pick a preset:

| Preset | Approximate Total | Use Case |
|--------|-------------------|----------|
| small | ~50K | Quick smoke tests, CI |
| medium | ~500K | Integration testing |
| large | ~2M | Performance validation |
| xlarge | ~5M | Full-scale load testing |

# Small smoke test
./perf-test.sh --scale small --server http://localhost:8585

# Full 5M load test
./perf-test.sh --scale xlarge --server http://localhost:8585

# Quick mode (~10K, fastest)
./perf-test.sh --quick --server http://localhost:8585

With neither --scale nor --quick, the script uses its default per-flag counts (~46K entities), which are kept for backward compatibility.

Overriding Individual Counts

Any per-entity count flag (for example --tables NUM) overrides the preset for that entity type:

# Small preset but with 100K tables
./perf-test.sh --scale small --tables 100000

# Override tables and dashboards (everything else stays at preset counts)
./perf-test.sh --scale small --tables 50000 --dashboards 10000

All Flags

Entity counts

| Flag | Default | Description |
|------|---------|-------------|
| --tables NUM | 20000 | Database tables |
| --topics NUM | 3000 | Kafka/messaging topics |
| --dashboards NUM | 5000 | Looker dashboards |
| --charts NUM | 10000 | Dashboard charts |
| --pipelines NUM | 3000 | Airflow pipelines |
| --stored-procedures NUM | 0 | Stored procedures |
| --containers NUM | 2000 | S3 containers |
| --search-indexes NUM | 1000 | Elasticsearch indexes |
| --mlmodels NUM | 2000 | ML models |
| --queries NUM | 0 | SQL queries |
| --data-models NUM | 0 | Dashboard data models |
| --test-suites NUM | 0 | Test suites |
| --test-cases NUM | 0 | Test cases (linked to tables) |
| --glossaries NUM | 50 | Glossaries |
| --glossary-terms NUM | 5000 | Glossary terms |
| --classifications NUM | 20 | Tag classifications |
| --tags NUM | 1000 | Tags |
| --users NUM | 0 | Users |
| --teams NUM | 0 | Teams |
| --domains NUM | 0 | Domains |
| --data-products NUM | 0 | Data products (need domains) |
| --api-collections NUM | 0 | API collections |
| --api-endpoints NUM | 0 | API endpoints (need collections) |
| --lineage-edges NUM | 0 | Lineage edges between entities |
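
As a cross-check on the defaults in the table above, the sketch below sums the non-zero counts and shows how a single override changes the total. The grouping into "data asset" and "governance" types, and the helper name total_entities, are assumptions made for illustration and are not taken from perf-test.sh. One possible reading is that the ~46K default figure covers only the eight data-asset types.

```python
# Default counts copied from the entity-count flag table (zero-default flags omitted).
DATA_ASSET_DEFAULTS = {
    "tables": 20000, "topics": 3000, "dashboards": 5000, "charts": 10000,
    "pipelines": 3000, "containers": 2000, "search_indexes": 1000, "mlmodels": 2000,
}
GOVERNANCE_DEFAULTS = {
    "glossaries": 50, "glossary_terms": 5000, "classifications": 20, "tags": 1000,
}


def total_entities(overrides=None):
    """Sum the default counts after applying per-type overrides (hypothetical helper)."""
    counts = {**DATA_ASSET_DEFAULTS, **GOVERNANCE_DEFAULTS, **(overrides or {})}
    return sum(counts.values())


data_assets = sum(DATA_ASSET_DEFAULTS.values())  # 46000
governance = sum(GOVERNANCE_DEFAULTS.values())   # 6070
print(data_assets, governance, total_entities())
# Equivalent of passing --tables 100000 on top of the defaults.
print(total_entities({"tables": 100000}))
```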

Time-series entity counts

| Flag | Default | Description |
|------|---------|-------------|
| --test-case-results NUM | 0 | Test case results (need test cases) |
| --entity-report-data NUM | 0 | Entity report data insights |
| --web-analytic-views NUM | 0 | Web analytic entity view reports |
| --web-analytic-activity NUM | 0 | Web analytic user activity reports |
| --raw-cost-analysis NUM | 0 | Raw cost analysis reports |
| --aggregated-cost-analysis NUM | 0 | Aggregated cost analysis reports |

Other options

| Flag | Default | Description |
|------|---------|-------------|
| --server URL | http://localhost:8585 | Target OpenMetadata server |
| --workers NUM | 20 | Concurrent HTTP workers |
| --quick | - | Quick mode preset (~10K entities) |
| --scale PRESET | - | Scale preset (small/medium/large/xlarge) |
| --skip-reads | - | Skip read benchmarking phase (Phase 8) |
| --only-reads | - | Skip write phases; discover existing entities and run reads only |
| --mixed | - | Run mixed read/write workload (Phase 9) |
| --mixed-duration SECS | 60 | Duration of mixed workload in seconds |
| --read-ratio PCT | 80 | Read percentage in mixed workload (0-100) |
| --realistic | - | Run Phase 4 entity creation concurrently across entity types using a shared worker pool |

Entity Creation Order

The script creates entities in dependency order across up to 9 phases:

Phase 1  Metadata         domains, classifications, tags, glossaries, terms, users, teams
Phase 2  Services         database, dashboard, pipeline, messaging, ML, storage, search, API
Phase 3  Infrastructure   databases, schemas, API collections
Phase 4  Core entities    tables, dashboards, charts, topics, pipelines, storedProcedures,
                          containers, searchIndexes, mlmodels, queries, dataModels,
                          apiEndpoints, dataProducts
Phase 5  Data Quality     testSuites, testCases
Phase 6  Lineage          table->table (60%), table->dashboard (25%), pipeline->table (15%)
Phase 7  Time-Series      testCaseResults, entityReportData, webAnalyticViews,
                          webAnalyticActivity, rawCostAnalysis, aggCostAnalysis
Phase 8  Read Benchmarks  entity fetch, paginated list, search queries, lineage traversal
Phase 9  Mixed Workload   concurrent reads + writes for configurable duration (--mixed)

Phases 8 and 9 are optional:

  • Phase 8 runs automatically unless --skip-reads is passed
  • Phase 9 only runs when --mixed is passed
  • --only-reads skips phases 1-7, discovers existing entities, and runs Phase 8
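
The flag rules above can be summarized as a small decision function. This is a minimal sketch, not the script's actual logic: the function name phases_to_run is hypothetical, and the way --mixed combines with --only-reads is an assumption, since the flag descriptions do not cover that case.

```python
def phases_to_run(skip_reads=False, only_reads=False, mixed=False):
    """Return the phase numbers selected by the documented flags (illustrative only)."""
    if only_reads:
        # --only-reads: skip the write phases (1-7) and run the read benchmarks.
        phases = [8]
    else:
        phases = list(range(1, 8))
        if not skip_reads:
            # Phase 8 runs automatically unless --skip-reads is passed.
            phases.append(8)
    if mixed:
        # Assumption: --mixed adds Phase 9 regardless of the other flags.
        phases.append(9)
    return phases


print(phases_to_run())                  # default run
print(phases_to_run(skip_reads=True))   # writes only
print(phases_to_run(only_reads=True))   # reads only
```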

Entity Linking

  • Tables, dashboards, pipelines: IDs collected during Phase 4 for use in lineage (Phase 6)
  • Test cases: FQNs collected for testCaseResult creation (Phase 7)
  • Lineage edges: Use collected UUIDs via PUT /api/v1/lineage
  • Collections are capped at max(lineage_edges * 2, test_case_results) to bound memory
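
To make the lineage mix from Phase 6 and the collection cap above concrete, here is a rough sketch of the arithmetic. The function names are hypothetical, and the handling of rounding remainders is an assumption; the script may distribute them differently.

```python
def lineage_split(lineage_edges):
    """Split an edge budget 60/25/15 across the three Phase 6 edge kinds."""
    table_to_table = lineage_edges * 60 // 100
    table_to_dashboard = lineage_edges * 25 // 100
    # Assumption: the rounding remainder goes to the last bucket so the parts sum to the total.
    pipeline_to_table = lineage_edges - table_to_table - table_to_dashboard
    return table_to_table, table_to_dashboard, pipeline_to_table


def collection_cap(lineage_edges, test_case_results):
    """Upper bound on collected IDs/FQNs: max(lineage_edges * 2, test_case_results)."""
    return max(lineage_edges * 2, test_case_results)


print(lineage_split(10000))          # edge counts per kind for --lineage-edges 10000
print(collection_cap(10000, 5000))   # cap when lineage dominates
print(collection_cap(0, 5000))       # cap when only test case results are requested
```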

Auto-Scaling Infrastructure

Databases and schemas scale automatically with table count:

  • NUM_DATABASES = max(1, tables / 50000)
  • SCHEMAS_PER_DB = min(20, tables / (databases * 5000))
  • This keeps ~5000 tables per schema at any scale
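
The two formulas above can be worked through for a few table counts. This sketch assumes integer division and a floor of one schema per database for small table counts; neither detail is stated in this document, and the function name infrastructure_for is hypothetical.

```python
def infrastructure_for(tables):
    """Return (databases, schemas_per_db) for a given table count."""
    databases = max(1, tables // 50000)
    # Assumption: never fewer than one schema per database, even below 5000 tables.
    schemas_per_db = max(1, min(20, tables // (databases * 5000)))
    return databases, schemas_per_db


for tables in (20_000, 500_000, 2_000_000):
    databases, schemas_per_db = infrastructure_for(tables)
    tables_per_schema = tables // (databases * schemas_per_db)
    print(f"{tables:>9,} tables -> {databases} databases x {schemas_per_db} schemas, "
          f"{tables_per_schema} tables per schema")
```

For these three inputs the sketch yields 5000 tables per schema. Table counts that are not multiples of 50000 can land somewhat above that, which is presumably where the cap of 20 schemas matters.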

Retry Logic

HTTP requests retry up to 3 times with exponential backoff (1s, 2s, 4s) on:

  • 5xx server errors
  • Connection errors / timeouts
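
The retry policy above could look roughly like the following. This is a sketch under stated assumptions, not the script's implementation: it reads "retry up to 3 times" as three retries after the initial attempt, uses the standard library's urllib error types, and takes the request as a callable (send) so the policy can be exercised without a server. The names request_with_retry and is_retryable are hypothetical.

```python
import time
import urllib.error

BACKOFF_SECONDS = (1, 2, 4)  # delay before retry 1, 2 and 3


def is_retryable(exc):
    """5xx responses and connection errors/timeouts are retried; other errors are not."""
    if isinstance(exc, urllib.error.HTTPError):
        return 500 <= exc.code < 600
    return isinstance(exc, (urllib.error.URLError, TimeoutError, ConnectionError))


def request_with_retry(send, sleep=time.sleep):
    """Call send(); on a retryable failure wait 1s, 2s, then 4s before trying again."""
    for delay in BACKOFF_SECONDS:
        try:
            return send()
        except Exception as exc:
            if not is_retryable(exc):
                raise  # e.g. a 4xx response: retrying would not help
            sleep(delay)
    return send()  # final attempt; any error propagates to the caller
```

Passing sleep as a parameter keeps the function easy to test: a fake sleep can record the delays instead of waiting.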

Realistic Concurrent Workload (--realistic)

By default, --workers N creates N concurrent workers per entity type, but entity types run sequentially (all tables first, then all dashboards, etc.). With --realistic, all Phase 4 entity types are created concurrently through a single shared worker pool, simulating real-world traffic where tables, dashboards, topics, and pipelines all hit the server at the same time.

This exposes contention patterns not visible in sequential mode:

  • Cross-entity DB lock contention
  • Shared thread pool pressure
  • Connection pool exhaustion under mixed workloads
# Realistic mode: all entity types hit the server concurrently
./perf-test.sh --scale 10k --realistic --server http://localhost:8585

# Compare with sequential mode (default)
./perf-test.sh --scale 10k --server http://localhost:8585

The report includes a realistic_combined entry showing combined RPS and latency distribution across all entity types, in addition to individual per-entity-type metrics.
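
The difference between the default sequential mode and --realistic can be illustrated with a thread pool. This is a simplified sketch, not the script's code: create is a stand-in for one HTTP create request, and the round-robin interleaving of entity types is an assumption about how work reaches the shared pool.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain, zip_longest


def create(entity_type, index):
    """Stand-in for one create request; a real worker would call the server here."""
    return f"{entity_type}-{index}"


def run_sequential(counts, workers):
    """Default mode: one entity type at a time, each served by `workers` threads."""
    created = []
    for entity_type, count in counts.items():
        with ThreadPoolExecutor(max_workers=workers) as pool:
            created += pool.map(lambda i, t=entity_type: create(t, i), range(count))
    return created


def run_realistic(counts, workers):
    """--realistic mode: every entity type is fed into one shared pool."""
    per_type = [[(t, i) for i in range(n)] for t, n in counts.items()]
    # Interleave the jobs so different entity types are in flight together
    # (an assumption; the script may order or weight requests differently).
    jobs = [job for job in chain.from_iterable(zip_longest(*per_type)) if job is not None]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda job: create(*job), jobs))


counts = {"tables": 3, "dashboards": 2}
print(run_sequential(counts, workers=4))
print(run_realistic(counts, workers=4))
```

Both functions create the same set of entities. Only the order in which requests reach the server differs, and that ordering is what surfaces the cross-entity contention described above.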

Performance Tips

  • Use --workers 30 or higher if the server can handle it
  • Time-series and lineage phases use min(10, workers) to avoid overwhelming the server
  • At xlarge scale, expect the script to run for several hours depending on server capacity
  • Monitor server logs for 429/503 errors and reduce workers if needed

Multi-Scale Benchmarking

Run benchmarks across multiple asset counts to compare performance at different scales:

# Benchmark at 10k, 50k, 100k, and 200k entities
for scale in 10k 50k 100k 200k; do
  ./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
    --output "/tmp/bench-${scale}.json" --workers 20 2>&1 | tee "/tmp/bench-${scale}.log"
done

With read and mixed workload benchmarks included:

for scale in 10k 50k 100k; do
  ./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
    --mixed --mixed-duration 30 \
    --output "/tmp/bench-${scale}.json" 2>&1 | tee "/tmp/bench-${scale}.log"
done

Compare results across scales:

for f in /tmp/bench-*.json; do
  echo "=== $(basename "$f") ==="
  # Pass the path as an argument instead of splicing it into the Python source
  python3 -c "
import json, sys
with open(sys.argv[1]) as fh:
    r = json.load(fh)
o = r['overall']
print(f\"  Entities: {o['total_entities_created']:,}  RPS: {o['overall_throughput_rps']:.1f}\"
      f\"  Errors: {o['overall_error_rate_pct']:.1f}%  Time: {o['total_wall_clock_s']:.0f}s\")
" "$f"
done

Verification After Loading

# 1. Trigger reindex
./scripts/trigger-reindex.sh

# 2. Check partition table for all entity types
mysql -e "SELECT DISTINCT entityType FROM search_index_partition ORDER BY entityType;"

# 3. Verify counts in UI
#    - Data Assets: tables, topics, dashboards, pipelines, etc.
#    - Data Quality: test suites and test cases
#    - Lineage: visible edges between tables/dashboards/pipelines
#    - Data Insights: time-series charts for entity reports, web analytics, cost analysis