mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

Reindex Work - Perf , Metrics , Benchmarking and More (#26231 )

* Update Perf

* Add multi asset scale count

* Update perf and Usage

* Fix recommendation

* Add Benchmarking script and doc

* Fix Perf

* Add --no break to benchmark

* add more metrics and validation for indexes miss

* Update generated TypeScript types

* Bound Doc Virtual Threads

* Remove Additional Properties from the UI

* Update doc

* Fix Job Getting Marked Stopped

* Server killed logs fixes

* Add Server stat to Quartz Progress

* Fix CPU spiking

* Make Auto Tune Consider JVm configs

* Fix Partition Calculator and Recovery Job Stats

* Update Auto Tune to show up in logs and stored in config

* Fix Auto Tune Config not store in app run record

* Fix OnDemand Job type

* Indexing Failures not flushed fixed

* Fix Stat counting at job level with process job failures

* Add Reindex Job Identifier

* Add Thread Identifiers

* Wait for sink

* Wait for sink

* Fix Stopping to let partitions finish the job

* CPU Budgeting

* More Conservative settings

* Address Review Comment

* fix Open Search Index Manager

* Reapply OpenSearch BulkSink

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

2026-03-10 08:10:46 +05:30

8.9 KiB

Raw Blame History

Distributed Indexing Load Test Scripts

Scripts for generating test data and triggering reindexing to load-test the OpenMetadata search indexing pipeline.

Quick Start

# 1. Start the environment
./scripts/start.sh

# 2. Load test data (~50K entities)
./scripts/perf-test.sh --scale small --server http://localhost:8585

# 3. Trigger reindex
./scripts/trigger-reindex.sh

# 4. Monitor logs
./scripts/logs.sh

# 5. Stop the environment
./scripts/stop.sh

perf-test.sh

Generates entities across 30+ entity types, including time-series data, lineage edges, and data quality entities. Uses concurrent workers for high throughput.

Scale Presets

Use --scale to pick a preset:

Preset	Approximate Total	Use Case
`small`	~50K	Quick smoke tests, CI
`medium`	~500K	Integration testing
`large`	~2M	Performance validation
`xlarge`	~5M	Full-scale load testing

# Small smoke test
./perf-test.sh --scale small --server http://localhost:8585

# Full 5M load test
./perf-test.sh --scale xlarge --server http://localhost:8585

# Quick mode (~10K, fastest)
./perf-test.sh --quick --server http://localhost:8585

Default (no --scale or --quick) produces ~46K entities for backward compatibility.

Overriding Individual Counts

Any --entity-type NUM flag overrides the preset for that entity type:

# Small preset but with 100K tables
./perf-test.sh --scale small --tables 100000

# Only create tables and dashboards (everything else stays at preset counts)
./perf-test.sh --scale small --tables 50000 --dashboards 10000

All Flags

Entity counts

Flag	Default	Description
`--tables NUM`	20000	Database tables
`--topics NUM`	3000	Kafka/messaging topics
`--dashboards NUM`	5000	Looker dashboards
`--charts NUM`	10000	Dashboard charts
`--pipelines NUM`	3000	Airflow pipelines
`--stored-procedures NUM`	0	Stored procedures
`--containers NUM`	2000	S3 containers
`--search-indexes NUM`	1000	Elasticsearch indexes
`--mlmodels NUM`	2000	ML models
`--queries NUM`	0	SQL queries
`--data-models NUM`	0	Dashboard data models
`--test-suites NUM`	0	Test suites
`--test-cases NUM`	0	Test cases (linked to tables)
`--glossaries NUM`	50	Glossaries
`--glossary-terms NUM`	5000	Glossary terms
`--classifications NUM`	20	Tag classifications
`--tags NUM`	1000	Tags
`--users NUM`	0	Users
`--teams NUM`	0	Teams
`--domains NUM`	0	Domains
`--data-products NUM`	0	Data products (need domains)
`--api-collections NUM`	0	API collections
`--api-endpoints NUM`	0	API endpoints (need collections)
`--lineage-edges NUM`	0	Lineage edges between entities

Time-series entity counts

Flag	Default	Description
`--test-case-results NUM`	0	Test case results (need test cases)
`--entity-report-data NUM`	0	Entity report data insights
`--web-analytic-views NUM`	0	Web analytic entity view reports
`--web-analytic-activity NUM`	0	Web analytic user activity reports
`--raw-cost-analysis NUM`	0	Raw cost analysis reports
`--aggregated-cost-analysis NUM`	0	Aggregated cost analysis reports

Other options

Flag	Default	Description
`--server URL`	`http://localhost:8585`	Target OpenMetadata server
`--workers NUM`	20	Concurrent HTTP workers
`--quick`	-	Quick mode preset (~10K entities)
`--scale PRESET`	-	Scale preset (small/medium/large/xlarge)
`--skip-reads`	-	Skip read benchmarking phase (Phase 8)
`--only-reads`	-	Skip write phases; discover existing entities and run reads only
`--mixed`	-	Run mixed read/write workload (Phase 9)
`--mixed-duration SECS`	60	Duration of mixed workload in seconds
`--read-ratio PCT`	80	Read percentage in mixed workload (0-100)
`--realistic`	-	Run Phase 4 entity creation concurrently across entity types using a shared worker pool

Entity Creation Order

The script creates entities in dependency order across up to 9 phases:

Phase 1  Metadata         domains, classifications, tags, glossaries, terms, users, teams
Phase 2  Services         database, dashboard, pipeline, messaging, ML, storage, search, API
Phase 3  Infrastructure   databases, schemas, API collections
Phase 4  Core entities    tables, dashboards, charts, topics, pipelines, storedProcedures,
                          containers, searchIndexes, mlmodels, queries, dataModels,
                          apiEndpoints, dataProducts
Phase 5  Data Quality     testSuites, testCases
Phase 6  Lineage          table->table (60%), table->dashboard (25%), pipeline->table (15%)
Phase 7  Time-Series      testCaseResults, entityReportData, webAnalyticViews,
                          webAnalyticActivity, rawCostAnalysis, aggCostAnalysis
Phase 8  Read Benchmarks  entity fetch, paginated list, search queries, lineage traversal
Phase 9  Mixed Workload   concurrent reads + writes for configurable duration (--mixed)

Phases 8 and 9 are optional:

Phase 8 runs automatically unless --skip-reads is passed
Phase 9 only runs when --mixed is passed
--only-reads skips phases 1-7, discovers existing entities, and runs Phase 8

Entity Linking

Tables, dashboards, pipelines: IDs collected during Phase 4 for use in lineage (Phase 6)
Test cases: FQNs collected for testCaseResult creation (Phase 7)
Lineage edges: Use collected UUIDs via PUT /api/v1/lineage
Collections are capped at max(lineage_edges * 2, test_case_results) to bound memory

Auto-Scaling Infrastructure

Databases and schemas scale automatically with table count:

NUM_DATABASES = max(1, tables / 50000)
SCHEMAS_PER_DB = min(20, tables / (databases * 5000))
This keeps ~5000 tables per schema at any scale

Retry Logic

HTTP requests retry up to 3 times with exponential backoff (1s, 2s, 4s) on:

5xx server errors
Connection errors / timeouts

Realistic Concurrent Workload (`--realistic`)

By default, --workers N creates N concurrent workers per entity type, but entity types run sequentially (all tables first, then all dashboards, etc.). With --realistic, all Phase 4 entity types are created concurrently through a single shared worker pool, simulating real-world traffic where tables, dashboards, topics, and pipelines all hit the server at the same time.

This exposes contention patterns not visible in sequential mode:

Cross-entity DB lock contention
Shared thread pool pressure
Connection pool exhaustion under mixed workloads

# Realistic mode: all entity types hit the server concurrently
./perf-test.sh --scale 10k --realistic --server http://localhost:8585

# Compare with sequential mode (default)
./perf-test.sh --scale 10k --server http://localhost:8585

The report includes a realistic_combined entry showing combined RPS and latency distribution across all entity types, in addition to individual per-entity-type metrics.

Performance Tips

Use --workers 30 or higher if the server can handle it
Time-series and lineage phases use min(10, workers) to avoid overwhelming the server
At xlarge scale, expect the script to run for several hours depending on server capacity
Monitor server logs for 429/503 errors and reduce workers if needed

Multi-Scale Benchmarking

Run benchmarks across multiple asset counts to compare performance at different scales:

# Benchmark at 10k, 50k, 100k, and 200k entities
for scale in 10k 50k 100k 200k; do
  ./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
    --output "/tmp/bench-${scale}.json" --workers 20 2>&1 | tee "/tmp/bench-${scale}.log"
done

With read and mixed workload benchmarks included:

for scale in 10k 50k 100k; do
  ./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
    --mixed --mixed-duration 30 \
    --output "/tmp/bench-${scale}.json" 2>&1 | tee "/tmp/bench-${scale}.log"
done

Compare results across scales:

for f in /tmp/bench-*.json; do
  echo "=== $(basename $f) ==="
  python3 -c "
import json
r = json.load(open('$f'))
o = r['overall']
print(f\"  Entities: {o['total_entities_created']:,}  RPS: {o['overall_throughput_rps']:.1f}\"
      f\"  Errors: {o['overall_error_rate_pct']:.1f}%  Time: {o['total_wall_clock_s']:.0f}s\")
"
done

Verification After Loading

# 1. Trigger reindex
./scripts/trigger-reindex.sh

# 2. Check partition table for all entity types
mysql -e "SELECT DISTINCT entityType FROM search_index_partition ORDER BY entityType;"

# 3. Verify counts in UI
#    - Data Assets: tables, topics, dashboards, pipelines, etc.
#    - Data Quality: test suites and test cases
#    - Lineage: visible edges between tables/dashboards/pipelines
#    - Data Insights: time-series charts for entity reports, web analytics, cost analysis

8.9 KiB Raw Blame History