# Distributed Indexing Load Test Scripts

Scripts for generating test data and triggering reindexing to load-test the OpenMetadata search indexing pipeline.

## Quick Start

```bash
# 1. Start the environment
./scripts/start.sh

# 2. Load test data (~50K entities)
./scripts/perf-test.sh --scale small --server http://localhost:8585

# 3. Trigger reindex
./scripts/trigger-reindex.sh

# 4. Monitor logs
./scripts/logs.sh

# 5. Stop the environment
./scripts/stop.sh
```

## perf-test.sh

Generates entities across 30+ entity types, including time-series data, lineage edges, and data quality entities. Uses concurrent workers for high throughput.

### Scale Presets

Use `--scale` to pick a preset:

| Preset | Approximate Total | Use Case |
|--------|-------------------|----------|
| `small` | ~50K | Quick smoke tests, CI |
| `medium` | ~500K | Integration testing |
| `large` | ~2M | Performance validation |
| `xlarge` | ~5M | Full-scale load testing |

```bash
# Small smoke test
./perf-test.sh --scale small --server http://localhost:8585

# Full 5M load test
./perf-test.sh --scale xlarge --server http://localhost:8585

# Quick mode (~10K, fastest)
./perf-test.sh --quick --server http://localhost:8585
```

Default (no `--scale` or `--quick`) produces ~46K entities for backward compatibility.

### Overriding Individual Counts

Any entity count flag (such as `--tables NUM`) overrides the preset for that entity type:

```bash
# Small preset but with 100K tables
./perf-test.sh --scale small --tables 100000

# Override tables and dashboards (everything else stays at preset counts)
./perf-test.sh --scale small --tables 50000 --dashboards 10000
```

### All Flags

#### Entity counts

| Flag | Default | Description |
|------|---------|-------------|
| `--tables NUM` | 20000 | Database tables |
| `--topics NUM` | 3000 | Kafka/messaging topics |
| `--dashboards NUM` | 5000 | Looker dashboards |
| `--charts NUM` | 10000 | Dashboard charts |
| `--pipelines NUM` | 3000 | Airflow pipelines |
| `--stored-procedures NUM` | 0 | Stored procedures |
| `--containers NUM` | 2000 | S3 containers |
| `--search-indexes NUM` | 1000 | Elasticsearch indexes |
| `--mlmodels NUM` | 2000 | ML models |
| `--queries NUM` | 0 | SQL queries |
| `--data-models NUM` | 0 | Dashboard data models |
| `--test-suites NUM` | 0 | Test suites |
| `--test-cases NUM` | 0 | Test cases (linked to tables) |
| `--glossaries NUM` | 50 | Glossaries |
| `--glossary-terms NUM` | 5000 | Glossary terms |
| `--classifications NUM` | 20 | Tag classifications |
| `--tags NUM` | 1000 | Tags |
| `--users NUM` | 0 | Users |
| `--teams NUM` | 0 | Teams |
| `--domains NUM` | 0 | Domains |
| `--data-products NUM` | 0 | Data products (need domains) |
| `--api-collections NUM` | 0 | API collections |
| `--api-endpoints NUM` | 0 | API endpoints (need collections) |
| `--lineage-edges NUM` | 0 | Lineage edges between entities |

#### Time-series entity counts

| Flag | Default | Description |
|------|---------|-------------|
| `--test-case-results NUM` | 0 | Test case results (need test cases) |
| `--entity-report-data NUM` | 0 | Entity report data insights |
| `--web-analytic-views NUM` | 0 | Web analytic entity view reports |
| `--web-analytic-activity NUM` | 0 | Web analytic user activity reports |
| `--raw-cost-analysis NUM` | 0 | Raw cost analysis reports |
| `--aggregated-cost-analysis NUM` | 0 | Aggregated cost analysis reports |

#### Other options

| Flag | Default | Description |
|------|---------|-------------|
| `--server URL` | `http://localhost:8585` | Target OpenMetadata server |
| `--workers NUM` | 20 | Concurrent HTTP workers |
| `--quick` | - | Quick mode preset (~10K entities) |
| `--scale PRESET` | - | Scale preset (small/medium/large/xlarge) |
| `--skip-reads` | - | Skip read benchmarking phase (Phase 8) |
| `--only-reads` | - | Skip write phases; discover existing entities and run reads only |
| `--mixed` | - | Run mixed read/write workload (Phase 9) |
| `--mixed-duration SECS` | 60 | Duration of mixed workload in seconds |
| `--read-ratio PCT` | 80 | Read percentage in mixed workload (0-100) |
| `--realistic` | - | Run Phase 4 entity creation concurrently across entity types using a shared worker pool |

### Entity Creation Order

The script creates entities in dependency order across up to 9 phases:

```
Phase 1  Metadata         domains, classifications, tags, glossaries, terms, users, teams
Phase 2  Services         database, dashboard, pipeline, messaging, ML, storage, search, API
Phase 3  Infrastructure   databases, schemas, API collections
Phase 4  Core entities    tables, dashboards, charts, topics, pipelines, storedProcedures,
                          containers, searchIndexes, mlmodels, queries, dataModels,
                          apiEndpoints, dataProducts
Phase 5  Data Quality     testSuites, testCases
Phase 6  Lineage          table->table (60%), table->dashboard (25%), pipeline->table (15%)
Phase 7  Time-Series      testCaseResults, entityReportData, webAnalyticViews,
                          webAnalyticActivity, rawCostAnalysis, aggCostAnalysis
Phase 8  Read Benchmarks  entity fetch, paginated list, search queries, lineage traversal
Phase 9  Mixed Workload   concurrent reads + writes for configurable duration (--mixed)
```

Phases 8 and 9 are optional:

- Phase 8 runs automatically unless `--skip-reads` is passed
- Phase 9 only runs when `--mixed` is passed
- `--only-reads` skips phases 1-7, discovers existing entities, and runs Phase 8

### Entity Linking

- **Tables, dashboards, pipelines**: IDs collected during Phase 4 for use in lineage (Phase 6)
- **Test cases**: FQNs collected for testCaseResult creation (Phase 7)
- **Lineage edges**: Use collected UUIDs via `PUT /api/v1/lineage`
- Collections are capped at `max(lineage_edges * 2, test_case_results)` to bound memory

### Auto-Scaling Infrastructure

Databases and schemas scale automatically with table count:

- `NUM_DATABASES = max(1, tables / 50000)`
- `SCHEMAS_PER_DB = min(20, tables / (databases * 5000))`
- This keeps ~5000 tables per schema at any scale

### Retry Logic

HTTP requests retry up to 3 times with exponential backoff (1s, 2s, 4s) on:

- 5xx server errors
- Connection errors / timeouts

### Realistic Concurrent Workload (`--realistic`)

By default, `--workers N` creates N concurrent workers **per entity type**, but entity types run sequentially (all tables first, then all dashboards, etc.).

With `--realistic`, all Phase 4 entity types are created concurrently through a **single shared worker pool**, simulating real-world traffic where tables, dashboards, topics, and pipelines all hit the server at the same time. This exposes contention patterns not visible in sequential mode:

- Cross-entity DB lock contention
- Shared thread pool pressure
- Connection pool exhaustion under mixed workloads

```bash
# Realistic mode: all entity types hit the server concurrently
./perf-test.sh --scale 10k --realistic --server http://localhost:8585

# Compare with sequential mode (default)
./perf-test.sh --scale 10k --server http://localhost:8585
```

The report includes a `realistic_combined` entry showing combined RPS and latency distribution across all entity types, in addition to individual per-entity-type metrics.
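
The sizing rules listed under Auto-Scaling Infrastructure can be checked with a little shell arithmetic. The sketch below is not taken from `perf-test.sh`; the function names are hypothetical, and both integer division and the floor of one schema per database are assumptions made so that small table counts still produce a usable layout.

```bash
#!/usr/bin/env bash
# Sketch of the documented sizing formulas.
# Assumptions: integer division, and at least one schema per database.

num_databases() {
  local tables=$1
  local n=$(( $tables / 50000 ))
  [ "$n" -lt 1 ] && n=1            # max(1, tables / 50000)
  echo "$n"
}

schemas_per_db() {
  local tables=$1
  local databases=$2
  local n=$(( $tables / ($databases * 5000) ))
  [ "$n" -gt 20 ] && n=20          # min(20, ...)
  [ "$n" -lt 1 ] && n=1            # assumed floor, not stated in the formulas
  echo "$n"
}

# Print the resulting layout for a few sample table counts.
for tables in 20000 100000 5000000; do
  dbs=$(num_databases "$tables")
  schemas=$(schemas_per_db "$tables" "$dbs")
  echo "tables=$tables databases=$dbs schemas_per_db=$schemas tables_per_schema=$(( $tables / ($dbs * $schemas) ))"
done
```

Each of the three sample counts works out to 5000 tables per schema. Below 5000 tables the division yields zero, which is why the sketch clamps the schema count to one.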

### Performance Tips

- Use `--workers 30` or higher if the server can handle it
- Time-series and lineage phases use `min(10, workers)` to avoid overwhelming the server
- At `xlarge` scale, expect the script to run for several hours depending on server capacity
- Monitor server logs for 429/503 errors and reduce `--workers` if they appear

### Multi-Scale Benchmarking

Run benchmarks across multiple asset counts to compare performance at different scales:

```bash
# Benchmark at 10k, 50k, 100k, and 200k entities
for scale in 10k 50k 100k 200k; do
  ./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
    --output "/tmp/bench-${scale}.json" --workers 20 2>&1 | tee "/tmp/bench-${scale}.log"
done
```

With read and mixed workload benchmarks included:

```bash
for scale in 10k 50k 100k; do
  ./scripts/perf-test.sh --scale "$scale" --server http://localhost:8585 \
    --mixed --mixed-duration 30 \
    --output "/tmp/bench-${scale}.json" 2>&1 | tee "/tmp/bench-${scale}.log"
done
```

Compare results across scales:

```bash
for f in /tmp/bench-*.json; do
  echo "=== $(basename "$f") ==="
  # Print the headline numbers from each report's "overall" section
  python3 -c "
import json
r = json.load(open('$f'))
o = r['overall']
print(f\"  Entities: {o['total_entities_created']:,}  RPS: {o['overall_throughput_rps']:.1f}\"
      f\"  Errors: {o['overall_error_rate_pct']:.1f}%  Time: {o['total_wall_clock_s']:.0f}s\")
"
done
```

## Verification After Loading

```bash
# 1. Trigger reindex
./scripts/trigger-reindex.sh

# 2. Check partition table for all entity types
mysql -e "SELECT DISTINCT entityType FROM search_index_partition ORDER BY entityType;"

# 3. Verify counts in UI
#    - Data Assets: tables, topics, dashboards, pipelines, etc.
#    - Data Quality: test suites and test cases
#    - Lineage: visible edges between tables/dashboards/pipelines
#    - Data Insights: time-series charts for entity reports, web analytics, cost analysis
```
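
The partition-table check in step 2 is easier to read when its output is compared against the entity types that were actually loaded. The following is a minimal sketch of that comparison. The `missing_entity_types` helper is hypothetical, the sample data is for illustration only, and the type names (`table`, `topic`, and so on) are assumed to match the `entityType` values in your deployment.

```bash
#!/usr/bin/env bash
# Print every expected entity type that does not appear in the actual list.
# Both arguments are newline-separated lists of type names.
missing_entity_types() {
  local expected=$1
  local actual=$2
  local t
  while IFS= read -r t; do
    [ -n "$t" ] || continue
    # -x matches the whole line, -F treats the name as a literal string
    printf '%s\n' "$actual" | grep -qxF -- "$t" || echo "$t"
  done <<EOF
$expected
EOF
}

# Entity types you expect to see (assumed names; adjust to what you loaded).
expected=$(printf '%s\n' table topic dashboard pipeline)

# In a real run, `actual` would come from the step 2 query, for example:
#   actual=$(mysql -N -e "SELECT DISTINCT entityType FROM search_index_partition;")
actual=$(printf '%s\n' dashboard table)   # sample data for illustration only

missing=$(missing_entity_types "$expected" "$actual")
if [ -n "$missing" ]; then
  echo "Missing entity types:"
  echo "$missing"
fi
```

With the sample data, the script reports `topic` and `pipeline` as missing. An empty result means every expected type has at least one partition row.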