mirror of
https://github.com/open-metadata/OpenMetadata
synced 2026-05-24 09:39:11 +00:00
Improve indexing (#26154)
* Add Prometheus metrics for reindexing pipeline via Micrometer

Bridge the existing reindexing atomic counters to Prometheus so operators can alert on failures, latency spikes, and backpressure without relying solely on database-flushed stats.
- Add ReindexingMetrics singleton (initialize/getInstance pattern matching
CacheMetrics) with job lifecycle counters, stage success/failed/warnings
counters, bulk request timers with SLA buckets, payload size distribution,
backpressure and promotion counters, and active/pending gauges
- Register in MicrometerBundle after StreamableLogsMetrics
- Instrument ReindexingOrchestrator.run() with job started/completed/failed/stopped
- Bridge StageStatsTracker.flush() deltas to Prometheus per stage and entity type
- Add bulk request latency timer and payload size recording in OpenSearchBulkSink
- Record backpressure events in SearchIndexExecutor.handleBackpressure()
- Record promotion success/failure in DefaultRecreateHandler
- Add ReindexingMetricsTest with 24 tests covering all metric types
* Add Improvements
* Auto Gene
* Use Auto Config in distributed
* Fix Partition Claim Spread
* Make partition use config
* Correct total count
* Fix Wait time to 5 mins
* Revert om yaml
* Fix Sink sync
* Add Failure Handling at different stages
* Update script to create entities
* Move to scripts
* Add usage and fix script
* Fix Script
* Update generated TypeScript types
* Fix Staging miss
* Fix Stats reconciliation issue
* Revert workflow handler
* Fix Partition worker early sync
* Update Logs
* Update logs EntityRepository
* Error failure test
* Review Comments fix
* Fix Non Distributed live feed
* Fix Non Distributed stats feed
* Fix Review comments
* Fix Time Series cutoff
* Update generated TypeScript types
* Md
* Benchmark addition
* Fix date time warning
* Update load test to do benchmark analysis
* Diagnostic and update perf test
* Move load test to bin
* Fix Review Comments
* Add numeric values
* Move to localhost by default
* Fix Perf test issues
* Review Comments
* Add Preflight Fixes
* Add Preflight fixes for stale entry
* Remove stale entry on ApplicationHandler
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
(cherry picked from commit b59aa7fc44)
parent bed6f80e55
commit 1fe8a02fde
81 changed files with 10223 additions and 2406 deletions
274
bin/distributed-test/DIAGNOSTICS.md
Normal file
|
|
@ -0,0 +1,274 @@
# Server-Side Diagnostics & Load Test Correlation

The diagnostics endpoint (`GET /api/v1/system/diagnostics`) provides a single-call performance snapshot of the OpenMetadata server. Combined with the load test script, it enables pinpointing **where** time is spent during high-load scenarios and produces actionable tuning recommendations.

## The Diagnostics Endpoint

### Basic Usage

```bash
curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:8585/api/v1/system/diagnostics | python3 -m json.tool
```

### Response Structure

```json
{
  "timestamp": "2026-03-02T19:00:00Z",
  "jvm": { ... },
  "jetty": { ... },
  "database": { ... },
  "bulk_executor": { ... },
  "request_latency": { ... }
}
```

Each section is explained below.

---
## Understanding Each Section

### JVM

| Field | What It Tells You |
|-------|-------------------|
| `heap_used_bytes` / `heap_max_bytes` | Current heap consumption vs maximum. |
| `heap_usage_pct` | If >85% after load, GC pressure is likely adding tail latency. |
| `gc_pause_total_ms` | Cumulative GC pause time since JVM start. Compare before/after load to see how much GC occurred during the test. |
| `gc_count` | Total GC collections. A large delta during load means frequent stop-the-world pauses. |
| `thread_count` / `thread_peak` | Active JVM threads. Correlate with Jetty thread pool. |
| `cpu_process_pct` | Process CPU utilization (0-100). If pinned at 100%, the server is CPU-bound. |
| `uptime_seconds` | Useful to confirm the server wasn't restarted mid-test. |

### Jetty (Thread Pool)

| Field | What It Tells You |
|-------|-------------------|
| `threads_busy` / `threads_max` | How many request-handling threads are in use. |
| `utilization_pct` | **Key metric.** If >90% with `queue_size > 0`, the thread pool is saturated and requests are queuing. |
| `queue_size` | Requests waiting for a free thread. Non-zero means latency is being added by queuing. |
| `queue_time_avg_ms` | Average time a request waits in the queue before getting a thread. |
| `virtual_threads_enabled` | Whether Java 21 virtual threads are active (eliminates thread pool as a bottleneck). |

### Database (HikariCP Pool)

| Field | What It Tells You |
|-------|-------------------|
| `pool_active` / `pool_max` | Active DB connections vs maximum pool size. |
| `pool_usage_pct` | **Key metric.** If >80%, connection contention is likely. Requests wait for a free connection. |
| `pool_pending` | Threads waiting for a DB connection. If >0 during load, the pool is undersized. |
| `pool_idle` | Spare connections. If 0 during load, the pool is fully utilized. |

### Bulk Executor

| Field | What It Tells You |
|-------|-------------------|
| `queue_depth` / `queue_capacity` | Items in the async processing queue. |
| `queue_usage_pct` | If >70%, approaching the rejection threshold (HTTP 503 errors). |
| `active_threads` / `max_threads` | Worker threads actively processing bulk operations. |
| `has_capacity` | `false` means the next bulk submission will be rejected with 503. |
### Request Latency (Per-Endpoint Breakdown)

This is the most actionable section. For each `METHOD /endpoint` combination:

| Field | What It Tells You |
|-------|-------------------|
| `count` | Total requests processed for this endpoint. |
| `avg_total_ms` | Average end-to-end latency. |
| `avg_db_ms` / `db_pct` | Time spent in database queries and its percentage of total. |
| `avg_search_ms` / `search_pct` | Time spent in search/Elasticsearch operations. |
| `avg_internal_ms` / `internal_pct` | Time in Java code (serialization, validation, business logic). |
| `avg_db_ops` / `avg_search_ops` | Average number of DB/search round-trips per request. |

**Reading the breakdown:** If `PUT /v1/tables` shows `db_pct: 56%`, then 56% of the request time is spent waiting for database queries. Combined with `database.pool_usage_pct: 85%`, this points to the DB connection pool as the likely bottleneck.
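
As a rough illustration of how the percentage fields relate to the millisecond fields, the sketch below derives `db_pct`, `search_pct`, and `internal_pct` from average timings. It assumes each percentage is that component's share of `avg_total_ms` and that internal time is the remainder. The helper name `latency_breakdown` is hypothetical, and the server may compute these values differently.

```python
def latency_breakdown(avg_total_ms, avg_db_ms, avg_search_ms):
    """Split an average request time into DB, search, and internal shares.

    Assumption: internal time is whatever remains after DB and search time.
    """
    if avg_total_ms <= 0:
        return {"db_pct": 0.0, "search_pct": 0.0, "internal_pct": 0.0}
    avg_internal_ms = max(avg_total_ms - avg_db_ms - avg_search_ms, 0.0)
    return {
        "db_pct": round(100.0 * avg_db_ms / avg_total_ms, 1),
        "search_pct": round(100.0 * avg_search_ms / avg_total_ms, 1),
        "internal_pct": round(100.0 * avg_internal_ms / avg_total_ms, 1),
    }


# Illustrative numbers only, not measured values.
print(latency_breakdown(avg_total_ms=200.0, avg_db_ms=112.0, avg_search_ms=28.0))
```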
|
||||
|
||||
---
## Load Test Integration

The load test script automatically queries the diagnostics endpoint at three points:

1. **Before load**: baseline snapshot
2. **During load**: sampled every 10 seconds by the health monitor
3. **After load**: final snapshot for comparison
### Running a Load Test with Diagnostics

```bash
# Basic: diagnostics are collected automatically
./perf-test.sh --scale small --server http://localhost:8585 --admin-port 8586

# With explicit token
./perf-test.sh --scale medium --server http://localhost:8585 \
  --admin-port 8586 --token "$MY_TOKEN" --output /tmp/bench.json
```

The `--admin-port` flag enables Prometheus scraping alongside diagnostics collection. Diagnostics also work without it, because they use the main API port.
### Console Output

After the benchmark table, you'll see a `SERVER-SIDE BREAKDOWN` section:

```
SERVER-SIDE BREAKDOWN (from /api/v1/system/diagnostics):
  JVM: heap 1.2GB/2GB (60%), GC pauses +450ms during load
  Jetty: 142/150 threads busy (95%), queue depth: 23
  DB Pool: 85/100 active (85%), 12 pending connections
  Bulk Executor: queue 450/1000 (45%)

  Latency Breakdown (PUT endpoints):
    Endpoint          Total    DB%     Search%  Internal%
    /v1/tables        320ms    56.2%   14.1%    29.7%
    /v1/topics        180ms    48.0%   22.0%    30.0%
    /v1/dashboards    250ms    52.0%   18.0%    30.0%

  BOTTLENECK: DB bottleneck on PUT /v1/tables: 56.2% of request time in DB, pool at 85.0% utilization
```
### JSON Report

The report includes top-level `diagnostics_before` and `diagnostics_after` objects, plus `cluster_sizing.server_side_analysis`:

```bash
cat /tmp/bench.json | python3 -c "
import json, sys
r = json.load(sys.stdin)

# Check if diagnostics were available
diag = r.get('diagnostics_after', {})
if diag:
    jvm = diag['jvm']
    print(f'Heap: {jvm[\"heap_usage_pct\"]}%')
    print(f'GC pauses: {jvm[\"gc_pause_total_ms\"]}ms')

    jetty = diag['jetty']
    print(f'Jetty: {jetty[\"threads_busy\"]}/{jetty[\"threads_max\"]} ({jetty[\"utilization_pct\"]}%)')

    db = diag['database']
    print(f'DB pool: {db[\"pool_active\"]}/{db[\"pool_max\"]} ({db[\"pool_usage_pct\"]}%)')

    for ep, data in diag.get('request_latency', {}).items():
        print(f'{ep}: total={data[\"avg_total_ms\"]}ms '
              f'DB={data[\"db_pct\"]}% Search={data[\"search_pct\"]}% '
              f'Internal={data[\"internal_pct\"]}%')
else:
    print('Diagnostics not available (server may be older version)')
"
```

---
## Bottleneck Detection Rules

The load test applies these rules automatically and surfaces them in findings:

| Condition | Diagnosis | Recommended Fix |
|-----------|-----------|-----------------|
| `db_pct > 60%` AND `pool_usage_pct > 80%` | DB is the bottleneck | `export DB_CONNECTION_POOL_MAX_SIZE=150` |
| `jetty.utilization_pct > 90%` AND `queue_size > 0` | Thread pool saturated | `export SERVER_MAX_THREADS=300` or enable virtual threads |
| `search_pct > 30%` for any endpoint | Search indexing consuming latency | `export ELASTICSEARCH_MAX_CONN_TOTAL=50` |
| `bulk_executor.queue_usage_pct > 70%` | Near bulk rejection threshold | `export BULK_OPERATION_QUEUE_SIZE=2000` |
| `jvm.heap_usage_pct > 85%` after load | Memory pressure / GC tail latency | Increase JVM heap (`-Xmx`) |
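
The rules in the table can be applied to a saved snapshot by hand. The following is a minimal sketch, not the load test's actual implementation: the function name `detect_bottlenecks` is hypothetical, and it assumes the percentage fields arrive as plain numbers inside the `jvm`, `jetty`, `database`, `bulk_executor`, and `request_latency` sections described earlier.

```python
def detect_bottlenecks(diag):
    """Return diagnosis strings for one diagnostics snapshot (a dict)."""
    findings = []
    jvm = diag.get("jvm", {})
    jetty = diag.get("jetty", {})
    db = diag.get("database", {})
    bulk = diag.get("bulk_executor", {})

    # Per-endpoint rules: DB-bound and search-bound requests.
    for endpoint, stats in diag.get("request_latency", {}).items():
        if stats.get("db_pct", 0) > 60 and db.get("pool_usage_pct", 0) > 80:
            findings.append(f"DB is the bottleneck ({endpoint})")
        if stats.get("search_pct", 0) > 30:
            findings.append(f"Search indexing consuming latency ({endpoint})")

    # Server-wide rules.
    if jetty.get("utilization_pct", 0) > 90 and jetty.get("queue_size", 0) > 0:
        findings.append("Thread pool saturated")
    if bulk.get("queue_usage_pct", 0) > 70:
        findings.append("Near bulk rejection threshold")
    if jvm.get("heap_usage_pct", 0) > 85:
        findings.append("Memory pressure / GC tail latency")
    return findings


# Illustrative snapshot; the values are made up for the example.
sample = {
    "jvm": {"heap_usage_pct": 90.0},
    "jetty": {"utilization_pct": 95.0, "queue_size": 23},
    "database": {"pool_usage_pct": 85.0},
    "bulk_executor": {"queue_usage_pct": 45.0},
    "request_latency": {"PUT /v1/tables": {"db_pct": 62.0, "search_pct": 12.0}},
}
for finding in detect_bottlenecks(sample):
    print(finding)
```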
|
||||
|
||||
---
## Common Scenarios

### Scenario 1: High Latency, DB is the Bottleneck

**Symptoms:** p95 latency >2s, `db_pct` >60%, `pool_usage_pct` >80%.

```
Latency Breakdown:
/v1/tables 320ms DB=62% Search=12% Internal=26%
DB Pool: 95/100 active (95%), 8 pending
```

**What's happening:** Every PUT requires multiple DB round-trips. At 95% pool utilization with 8 pending connections, requests are waiting for a free connection.

**Fix:**

```bash
export DB_CONNECTION_POOL_MAX_SIZE=150
export DB_CONNECTION_TIMEOUT=10000 # Fail fast instead of waiting 30s
```

### Scenario 2: Thread Pool Exhaustion

**Symptoms:** Connection refused errors, `utilization_pct` >95%, `queue_size` growing.

```
Jetty: 148/150 threads busy (99%), queue depth: 45
```

**What's happening:** All Jetty threads are busy. New requests queue up, adding latency. If the queue fills, connections get refused.

**Fix:**

```bash
export SERVER_MAX_THREADS=300
# OR enable virtual threads (preferred for I/O-bound workloads):
export SERVER_ENABLE_VIRTUAL_THREAD=true
```
### Scenario 3: GC Pressure

**Symptoms:** Periodic latency spikes, `heap_usage_pct` >85%, large GC pause delta.

```
JVM: heap 1.8GB/2GB (90%), GC pauses +2300ms during load
```

**What's happening:** The JVM is spending significant time in garbage collection. This manifests as periodic latency spikes and throughput drops.

**Fix:**

```bash
# Increase heap
export OPENMETADATA_HEAP_OPTS="-Xmx4g -Xms4g"
```

### Scenario 4: Bulk Executor Queue Filling

**Symptoms:** HTTP 503 errors on PUT endpoints.

```
Bulk Executor: queue 980/1000 (98%)
has_capacity: false
```

**What's happening:** The async processing queue is full. New requests that need bulk processing are rejected with 503.

**Fix:**

```bash
export BULK_OPERATION_QUEUE_SIZE=2000
export BULK_OPERATION_MAX_THREADS=20
```

---
## Comparing Before/After Snapshots

The most valuable analysis comes from comparing diagnostics before and after load:

| Metric | Before | After | Interpretation |
|--------|--------|-------|----------------|
| `heap_usage_pct` | 25% | 85% | Significant memory allocation during load |
| `gc_pause_total_ms` | 200 | 2500 | 2.3s of GC pauses during the test |
| `pool_active` | 2 | 95 | Pool went from idle to near-max |
| `pool_pending` | 0 | 8 | Connection contention appeared |
| `queue_depth` | 0 | 450 | Bulk queue built up under load |

If the `diagnostics_during` samples are available in the health monitor data, you can plot these metrics over time to see exactly when bottlenecks emerged.
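
The comparison above can be computed from the two snapshots in the JSON report. This is a sketch under stated assumptions: the helper `snapshot_delta` and the `TRACKED` mapping are hypothetical, and the mapping of each metric to its payload section follows the field tables earlier in this document.

```python
# Metric name -> section of the diagnostics payload it is assumed to live in.
TRACKED = {
    "heap_usage_pct": "jvm",
    "gc_pause_total_ms": "jvm",
    "pool_active": "database",
    "pool_pending": "database",
    "queue_depth": "bulk_executor",
}


def snapshot_delta(before, after):
    """Return after-minus-before for every tracked metric present in both."""
    deltas = {}
    for metric, section in TRACKED.items():
        old = before.get(section, {}).get(metric)
        new = after.get(section, {}).get(metric)
        if old is not None and new is not None:
            deltas[metric] = new - old
    return deltas


# Values taken from the example table above.
before = {
    "jvm": {"heap_usage_pct": 25, "gc_pause_total_ms": 200},
    "database": {"pool_active": 2, "pool_pending": 0},
    "bulk_executor": {"queue_depth": 0},
}
after = {
    "jvm": {"heap_usage_pct": 85, "gc_pause_total_ms": 2500},
    "database": {"pool_active": 95, "pool_pending": 8},
    "bulk_executor": {"queue_depth": 450},
}
for metric, change in snapshot_delta(before, after).items():
    print(f"{metric}: {change:+}")
```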
|
||||
|
||||
---
## Graceful Fallback

If the server doesn't have the diagnostics endpoint (older version), the load test:

- Prints a notice: `Diagnostics endpoint returned status=404 (may not be available)`
- Falls back to Prometheus scraping (if `--admin-port` is set)
- Skips the `SERVER-SIDE BREAKDOWN` section in the console output
- Omits `diagnostics_before`/`diagnostics_after` from the JSON report

There is no hard dependency: the load test works with or without the endpoint.
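
One way such a fallback could look is sketched below. It is not the script's actual code: `fetch_diagnostics` and `build_report` are hypothetical names, and the `opener` parameter exists only so the function can be exercised without a running server. The only details taken from this document are the endpoint path, the notice text, and the two report keys.

```python
import json
import urllib.error
import urllib.request


def fetch_diagnostics(server_url, token=None, opener=urllib.request.urlopen):
    """Return the diagnostics payload as a dict, or None if unavailable."""
    request = urllib.request.Request(f"{server_url}/api/v1/system/diagnostics")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with opener(request, timeout=10) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Older servers answer 404; report it and carry on without diagnostics.
        print(f"Diagnostics endpoint returned status={err.code} (may not be available)")
    except (urllib.error.URLError, OSError, ValueError) as err:
        print(f"Diagnostics endpoint unreachable: {err}")
    return None


def build_report(benchmark, diag_before, diag_after):
    """Attach the diagnostics keys only when both snapshots were collected."""
    report = dict(benchmark)
    if diag_before is not None and diag_after is not None:
        report["diagnostics_before"] = diag_before
        report["diagnostics_after"] = diag_after
    return report
```

A caller would invoke `fetch_diagnostics("http://localhost:8585", token)` before and after the run and pass both results to `build_report`, so a missing endpoint leaves the rest of the report unchanged.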
|
||||
173
bin/distributed-test/USAGE.md
Normal file
|
|
@ -0,0 +1,173 @@
# Distributed Indexing Load Test Scripts

Scripts for generating test data and triggering reindexing to load-test the OpenMetadata search indexing pipeline.

## Quick Start

```bash
# 1. Start the environment
./scripts/start.sh

# 2. Load test data (~50K entities)
./scripts/perf-test.sh --scale small --server http://localhost:8585

# 3. Trigger reindex
./scripts/trigger-reindex.sh

# 4. Monitor logs
./scripts/logs.sh

# 5. Stop the environment
./scripts/stop.sh
```
## perf-test.sh

Generates entities across 30+ entity types, including time-series data, lineage edges, and data quality entities. Uses concurrent workers for high throughput.

### Scale Presets

Use `--scale` to pick a preset:

| Preset | Approximate Total | Use Case |
|--------|-------------------|----------|
| `small` | ~50K | Quick smoke tests, CI |
| `medium` | ~500K | Integration testing |
| `large` | ~2M | Performance validation |
| `xlarge` | ~5M | Full-scale load testing |

```bash
# Small smoke test
./perf-test.sh --scale small --server http://localhost:8585

# Full 5M load test
./perf-test.sh --scale xlarge --server http://localhost:8585

# Quick mode (~10K, fastest)
./perf-test.sh --quick --server http://localhost:8585
```

Default (no `--scale` or `--quick`) produces ~46K entities for backward compatibility.
### Overriding Individual Counts

Any `--entity-type NUM` flag overrides the preset for that entity type:

```bash
# Small preset but with 100K tables
./perf-test.sh --scale small --tables 100000

# Override only tables and dashboards (everything else stays at preset counts)
./perf-test.sh --scale small --tables 50000 --dashboards 10000
```
### All Flags

#### Entity counts

| Flag | Default | Description |
|------|---------|-------------|
| `--tables NUM` | 20000 | Database tables |
| `--topics NUM` | 3000 | Kafka/messaging topics |
| `--dashboards NUM` | 5000 | Looker dashboards |
| `--charts NUM` | 10000 | Dashboard charts |
| `--pipelines NUM` | 3000 | Airflow pipelines |
| `--stored-procedures NUM` | 0 | Stored procedures |
| `--containers NUM` | 2000 | S3 containers |
| `--search-indexes NUM` | 1000 | Elasticsearch indexes |
| `--mlmodels NUM` | 2000 | ML models |
| `--queries NUM` | 0 | SQL queries |
| `--data-models NUM` | 0 | Dashboard data models |
| `--test-suites NUM` | 0 | Test suites |
| `--test-cases NUM` | 0 | Test cases (linked to tables) |
| `--glossaries NUM` | 50 | Glossaries |
| `--glossary-terms NUM` | 5000 | Glossary terms |
| `--classifications NUM` | 20 | Tag classifications |
| `--tags NUM` | 1000 | Tags |
| `--users NUM` | 0 | Users |
| `--teams NUM` | 0 | Teams |
| `--domains NUM` | 0 | Domains |
| `--data-products NUM` | 0 | Data products (need domains) |
| `--api-collections NUM` | 0 | API collections |
| `--api-endpoints NUM` | 0 | API endpoints (need collections) |
| `--lineage-edges NUM` | 0 | Lineage edges between entities |
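
For orientation, the non-zero defaults in the table above can be summed directly. The grouping into "data assets" and "glossary and tag entities" is an assumption made for this example, not something the script defines. Under that grouping, the data-asset defaults alone add up to 46,000, which appears to be the "~46K" default figure quoted earlier.

```python
# Non-zero defaults copied from the flag table above.
DATA_ASSET_DEFAULTS = {
    "tables": 20000, "topics": 3000, "dashboards": 5000, "charts": 10000,
    "pipelines": 3000, "containers": 2000, "search_indexes": 1000, "mlmodels": 2000,
}
# Hypothetical grouping: glossary and classification entities.
GOVERNANCE_DEFAULTS = {
    "glossaries": 50, "glossary_terms": 5000, "classifications": 20, "tags": 1000,
}

data_assets = sum(DATA_ASSET_DEFAULTS.values())
governance = sum(GOVERNANCE_DEFAULTS.values())
print(f"data assets: {data_assets}")
print(f"glossary and tag entities: {governance}")
print(f"all defaults: {data_assets + governance}")
```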
#### Time-series entity counts

| Flag | Default | Description |
|------|---------|-------------|
| `--test-case-results NUM` | 0 | Test case results (need test cases) |
| `--entity-report-data NUM` | 0 | Entity report data insights |
| `--web-analytic-views NUM` | 0 | Web analytic entity view reports |
| `--web-analytic-activity NUM` | 0 | Web analytic user activity reports |
| `--raw-cost-analysis NUM` | 0 | Raw cost analysis reports |
| `--aggregated-cost-analysis NUM` | 0 | Aggregated cost analysis reports |

#### Other options

| Flag | Default | Description |
|------|---------|-------------|
| `--server URL` | `http://localhost:8585` | Target OpenMetadata server |
| `--workers NUM` | 20 | Concurrent HTTP workers |
| `--quick` | - | Quick mode preset (~10K entities) |
| `--scale PRESET` | - | Scale preset (small/medium/large/xlarge) |
### Entity Creation Order

The script creates entities in dependency order across 7 phases:

```
Phase 1  Metadata        domains, classifications, tags, glossaries, terms, users, teams
Phase 2  Services        database, dashboard, pipeline, messaging, ML, storage, search, API
Phase 3  Infrastructure  databases, schemas, API collections
Phase 4  Core entities   tables, dashboards, charts, topics, pipelines, storedProcedures,
                         containers, searchIndexes, mlmodels, queries, dataModels,
                         apiEndpoints, dataProducts
Phase 5  Data Quality    testSuites, testCases
Phase 6  Lineage         table->table (60%), table->dashboard (25%), pipeline->table (15%)
Phase 7  Time-Series     testCaseResults, entityReportData, webAnalyticViews,
                         webAnalyticActivity, rawCostAnalysis, aggCostAnalysis
```
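
To make the Phase 6 percentages concrete, the sketch below splits a requested edge count into the three lineage kinds. The function name is hypothetical, and giving the rounding remainder to the last bucket is an assumption. The script may distribute edges differently, for example by random sampling.

```python
def split_lineage_edges(total_edges):
    """Split an edge budget 60/25/15 across the three lineage kinds."""
    table_to_table = total_edges * 60 // 100
    table_to_dashboard = total_edges * 25 // 100
    # Assumption: whatever integer division leaves over goes to the last kind.
    pipeline_to_table = total_edges - table_to_table - table_to_dashboard
    return {
        "table->table": table_to_table,
        "table->dashboard": table_to_dashboard,
        "pipeline->table": pipeline_to_table,
    }


print(split_lineage_edges(1000))
```

With `--lineage-edges 1000` this would yield 600, 250, and 150 edges respectively.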
### Entity Linking

- **Tables, dashboards, pipelines**: IDs collected during Phase 4 for use in lineage (Phase 6)
- **Test cases**: FQNs collected for testCaseResult creation (Phase 7)
- **Lineage edges**: Use collected UUIDs via `PUT /api/v1/lineage`
- Collections are capped at `max(lineage_edges * 2, test_case_results)` to bound memory
### Auto-Scaling Infrastructure

Databases and schemas scale automatically with table count:

- `NUM_DATABASES = max(1, tables / 50000)`
- `SCHEMAS_PER_DB = min(20, tables / (databases * 5000))`
- This keeps ~5000 tables per schema at any scale
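
The two formulas can be tried out with a few table counts. This sketch assumes integer division and adds a floor of one schema per database so that small table counts still get a schema. Both are assumptions, and `infrastructure_for` is a hypothetical name.

```python
def infrastructure_for(tables):
    """Return (num_databases, schemas_per_db) for a given table count."""
    num_databases = max(1, tables // 50_000)
    schemas_per_db = min(20, tables // (num_databases * 5_000))
    # Assumption: never go below one schema, even for very small runs.
    schemas_per_db = max(1, schemas_per_db)
    return num_databases, schemas_per_db


for tables in (1_000, 20_000, 5_000_000):
    databases, schemas = infrastructure_for(tables)
    per_schema = tables // (databases * schemas)
    print(f"{tables} tables -> {databases} databases x {schemas} schemas, "
          f"{per_schema} tables per schema")
```

For 20,000 tables this gives one database with four schemas, and for 5,000,000 tables it gives 100 databases with ten schemas each, both at 5,000 tables per schema.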
### Retry Logic

HTTP requests retry up to 3 times with exponential backoff (1s, 2s, 4s) on:

- 5xx server errors
- Connection errors / timeouts
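
A retry loop matching that description might look like the following. It is a sketch, not the script's code: `with_retries` and its parameters are hypothetical, and it reads "up to 3 times" as one initial attempt plus three retries. The `sleep` parameter is injectable so the backoff can be tested without waiting.

```python
import time
import urllib.error


def with_retries(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on 5xx statuses or connection problems.

    send() must return an HTTP status code, or raise URLError, TimeoutError,
    or ConnectionError when the server cannot be reached.
    """
    attempt = 0
    while True:
        try:
            status = send()
            if status < 500:
                return status  # success, or a 4xx that retrying will not fix
            failure = f"HTTP {status}"
        except (urllib.error.URLError, TimeoutError, ConnectionError) as err:
            failure = str(err)
        if attempt == max_retries:
            raise RuntimeError(f"giving up after {max_retries} retries: {failure}")
        sleep(base_delay * (2 ** attempt))  # waits 1s, then 2s, then 4s
        attempt += 1
```

Client errors such as 400 or 409 are returned immediately, since repeating the same request would not change the outcome.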
### Performance Tips

- Use `--workers 30` or higher if the server can handle it
- Time-series and lineage phases use `min(10, workers)` to avoid overwhelming the server
- At `xlarge` scale, expect the script to run for several hours depending on server capacity
- Monitor server logs for 429/503 errors and reduce workers if needed

## Verification After Loading

```bash
# 1. Trigger reindex
./scripts/trigger-reindex.sh

# 2. Check partition table for all entity types
mysql -e "SELECT DISTINCT entityType FROM search_index_partition ORDER BY entityType;"

# 3. Verify counts in UI
# - Data Assets: tables, topics, dashboards, pipelines, etc.
# - Data Quality: test suites and test cases
# - Lineage: visible edges between tables/dashboards/pipelines
# - Data Insights: time-series charts for entity reports, web analytics, cost analysis
```
2700
bin/distributed-test/scripts/perf-test.sh
Executable file
File diff suppressed because it is too large
|
|
@ -115,7 +115,7 @@ echo "All services are up and running!"
|
|||
echo "======================================"
|
||||
echo ""
|
||||
echo "Next steps:"
|
||||
echo " 1. Load test data: ./scripts/load-test-data.sh --tables 10000"
|
||||
echo " 1. Load test data: ./scripts/perf-test.sh --tables 10000"
|
||||
echo " 2. Trigger reindexing: ./scripts/trigger-reindex.sh"
|
||||
echo " 3. Watch logs: ./scripts/logs.sh -f"
|
||||
echo ""
|
||||
|
|
@ -49,10 +49,10 @@ cd docker/development/distributed-test
|
|||
|
||||
```bash
|
||||
# Load 10,000 tables (default)
|
||||
./scripts/load-test-data.sh
|
||||
./scripts/perf-test.sh
|
||||
|
||||
# Or specify the number
|
||||
./scripts/load-test-data.sh --tables 50000 --databases 50
|
||||
./scripts/perf-test.sh --tables 50000 --databases 50
|
||||
```
|
||||
|
||||
### 3. Trigger Reindexing
|
||||
|
|
@ -163,7 +163,7 @@ OPENMETADATA_HEAP_OPTS=-Xmx1G -Xms1G
|
|||
### Verify Partition Distribution
|
||||
|
||||
1. Start all 3 servers
|
||||
2. Load test data: `./scripts/load-test-data.sh --tables 10000`
|
||||
2. Load test data: `./scripts/perf-test.sh --tables 10000`
|
||||
3. Trigger reindex: `./scripts/trigger-reindex.sh --recreate`
|
||||
4. Watch logs: `./scripts/logs.sh -f --grep "partition"`
|
||||
|
||||
|
|
@ -242,7 +242,7 @@ distributed-test/
|
|||
│ ├── stop.sh # Stop environment
|
||||
│ ├── logs.sh # View aggregated logs
|
||||
│ ├── trigger-reindex.sh # Trigger reindexing
|
||||
│ └── load-test-data.sh # Load test data
|
||||
│ └── perf-test.sh # Load test data
|
||||
├── local/
|
||||
│ ├── docker-compose-deps.yml # Dependencies only (for IDE debugging)
|
||||
│ ├── server1.yaml # Server 1 config (port 8585)
|
||||
|
|
|
|||
|
|
@ -1,960 +0,0 @@
|
|||
#!/bin/bash
|
||||
# Load test data for distributed indexing testing
|
||||
|
||||
set -e
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
# Default values
|
||||
SERVER_URL="http://localhost:8585"
|
||||
NUM_TABLES=20000
|
||||
NUM_DASHBOARDS=5000
|
||||
NUM_CHARTS=10000
|
||||
NUM_PIPELINES=3000
|
||||
NUM_TOPICS=3000
|
||||
NUM_MLMODELS=2000
|
||||
NUM_CONTAINERS=2000
|
||||
NUM_SEARCH_INDEXES=1000
|
||||
NUM_GLOSSARIES=50
|
||||
NUM_TERMS_PER_GLOSSARY=100
|
||||
NUM_CLASSIFICATIONS=20
|
||||
NUM_TAGS_PER_CLASSIFICATION=50
|
||||
NUM_DATABASES=10
|
||||
|
||||
# Parse arguments
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case $1 in
|
||||
--tables)
|
||||
NUM_TABLES="$2"
|
||||
shift 2
|
||||
;;
|
||||
--dashboards)
|
||||
NUM_DASHBOARDS="$2"
|
||||
shift 2
|
||||
;;
|
||||
--charts)
|
||||
NUM_CHARTS="$2"
|
||||
shift 2
|
||||
;;
|
||||
--pipelines)
|
||||
NUM_PIPELINES="$2"
|
||||
shift 2
|
||||
;;
|
||||
--topics)
|
||||
NUM_TOPICS="$2"
|
||||
shift 2
|
||||
;;
|
||||
--mlmodels)
|
||||
NUM_MLMODELS="$2"
|
||||
shift 2
|
||||
;;
|
||||
--containers)
|
||||
NUM_CONTAINERS="$2"
|
||||
shift 2
|
||||
;;
|
||||
--search-indexes)
|
||||
NUM_SEARCH_INDEXES="$2"
|
||||
shift 2
|
||||
;;
|
||||
--glossaries)
|
||||
NUM_GLOSSARIES="$2"
|
||||
shift 2
|
||||
;;
|
||||
--terms-per-glossary)
|
||||
NUM_TERMS_PER_GLOSSARY="$2"
|
||||
shift 2
|
||||
;;
|
||||
--classifications)
|
||||
NUM_CLASSIFICATIONS="$2"
|
||||
shift 2
|
||||
;;
|
||||
--tags-per-classification)
|
||||
NUM_TAGS_PER_CLASSIFICATION="$2"
|
||||
shift 2
|
||||
;;
|
||||
--databases)
|
||||
NUM_DATABASES="$2"
|
||||
shift 2
|
||||
;;
|
||||
--server)
|
||||
SERVER_URL="$2"
|
||||
shift 2
|
||||
;;
|
||||
--quick)
|
||||
# Quick mode for rapid testing - smaller dataset
|
||||
NUM_TABLES=3000
|
||||
NUM_DASHBOARDS=1000
|
||||
NUM_CHARTS=2000
|
||||
NUM_PIPELINES=500
|
||||
NUM_TOPICS=500
|
||||
NUM_MLMODELS=300
|
||||
NUM_CONTAINERS=300
|
||||
NUM_SEARCH_INDEXES=200
|
||||
NUM_GLOSSARIES=10
|
||||
NUM_TERMS_PER_GLOSSARY=50
|
||||
NUM_CLASSIFICATIONS=5
|
||||
NUM_TAGS_PER_CLASSIFICATION=20
|
||||
shift
|
||||
;;
|
||||
-h|--help)
|
||||
echo "Usage: $0 [OPTIONS]"
|
||||
echo ""
|
||||
echo "Options:"
|
||||
echo " --tables NUM Number of tables to create (default: 20000)"
|
||||
echo " --dashboards NUM Number of dashboards to create (default: 5000)"
|
||||
echo " --charts NUM Number of charts to create (default: 10000)"
|
||||
echo " --pipelines NUM Number of pipelines to create (default: 3000)"
|
||||
echo " --topics NUM Number of topics to create (default: 3000)"
|
||||
echo " --mlmodels NUM Number of ML models to create (default: 2000)"
|
||||
echo " --containers NUM Number of containers to create (default: 2000)"
|
||||
echo " --search-indexes NUM Number of search indexes to create (default: 1000)"
|
||||
echo " --glossaries NUM Number of glossaries to create (default: 50)"
|
||||
echo " --terms-per-glossary NUM Number of terms per glossary (default: 100)"
|
||||
echo " --classifications NUM Number of classifications to create (default: 20)"
|
||||
echo " --tags-per-classification Number of tags per classification (default: 50)"
|
||||
echo " --databases NUM Number of databases (default: 10)"
|
||||
echo " --server URL Target server URL (default: http://localhost:8585)"
|
||||
echo " --quick Quick mode with smaller dataset (~10k entities)"
|
||||
echo " -h, --help Show this help message"
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
echo "Unknown option: $1"
|
||||
exit 1
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
NUM_GLOSSARY_TERMS=$((NUM_GLOSSARIES * NUM_TERMS_PER_GLOSSARY))
|
||||
NUM_TAGS=$((NUM_CLASSIFICATIONS * NUM_TAGS_PER_CLASSIFICATION))
|
||||
TOTAL=$((NUM_TABLES + NUM_DASHBOARDS + NUM_CHARTS + NUM_PIPELINES + NUM_TOPICS + NUM_MLMODELS + NUM_CONTAINERS + NUM_SEARCH_INDEXES + NUM_GLOSSARIES + NUM_GLOSSARY_TERMS + NUM_CLASSIFICATIONS + NUM_TAGS))
|
||||
|
||||
echo "======================================"
|
||||
echo "Loading Test Data for Distributed Indexing"
|
||||
echo "======================================"
|
||||
echo "Server: $SERVER_URL"
|
||||
echo ""
|
||||
echo "Entity counts:"
|
||||
echo " - Tables: $NUM_TABLES"
|
||||
echo " - Dashboards: $NUM_DASHBOARDS"
|
||||
echo " - Charts: $NUM_CHARTS"
|
||||
echo " - Pipelines: $NUM_PIPELINES"
|
||||
echo " - Topics: $NUM_TOPICS"
|
||||
echo " - ML Models: $NUM_MLMODELS"
|
||||
echo " - Containers: $NUM_CONTAINERS"
|
||||
echo " - Search Indexes: $NUM_SEARCH_INDEXES"
|
||||
echo " - Glossaries: $NUM_GLOSSARIES"
|
||||
echo " - Glossary Terms: $NUM_GLOSSARY_TERMS"
|
||||
echo " - Classifications: $NUM_CLASSIFICATIONS"
|
||||
echo " - Tags: $NUM_TAGS"
|
||||
echo " --------------------------"
|
||||
echo " - Total: $TOTAL"
|
||||
echo ""
|
||||
|
||||
# Use Python with urllib (built-in, no extra packages needed)
|
||||
python3 << EOF
|
||||
import urllib.request
|
||||
import urllib.error
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
import random
|
||||
from concurrent.futures import ThreadPoolExecutor, as_completed
|
||||
|
||||
SERVER_URL = "${SERVER_URL}"
|
||||
NUM_TABLES = ${NUM_TABLES}
|
||||
NUM_DASHBOARDS = ${NUM_DASHBOARDS}
|
||||
NUM_CHARTS = ${NUM_CHARTS}
|
||||
NUM_PIPELINES = ${NUM_PIPELINES}
|
||||
NUM_TOPICS = ${NUM_TOPICS}
|
||||
NUM_MLMODELS = ${NUM_MLMODELS}
|
||||
NUM_CONTAINERS = ${NUM_CONTAINERS}
|
||||
NUM_SEARCH_INDEXES = ${NUM_SEARCH_INDEXES}
|
||||
NUM_GLOSSARIES = ${NUM_GLOSSARIES}
|
||||
NUM_TERMS_PER_GLOSSARY = ${NUM_TERMS_PER_GLOSSARY}
|
||||
NUM_CLASSIFICATIONS = ${NUM_CLASSIFICATIONS}
|
||||
NUM_TAGS_PER_CLASSIFICATION = ${NUM_TAGS_PER_CLASSIFICATION}
|
||||
NUM_DATABASES = ${NUM_DATABASES}
|
||||
|
||||
print(f"Connecting to {SERVER_URL}...")
|
||||
sys.stdout.flush()
|
||||
|
||||
def make_request(url, data=None, method="GET", headers=None):
|
||||
if headers is None:
|
||||
headers = {}
|
||||
headers["Content-Type"] = "application/json"
|
||||
if data:
|
||||
data = json.dumps(data).encode('utf-8')
|
||||
req = urllib.request.Request(url, data=data, headers=headers, method=method)
|
||||
try:
|
||||
with urllib.request.urlopen(req, timeout=60) as response:
|
||||
return response.status, json.loads(response.read().decode('utf-8'))
|
||||
except urllib.error.HTTPError as e:
|
||||
try:
|
||||
body = e.read().decode('utf-8')
|
||||
except:
|
||||
body = str(e)
|
||||
return e.code, body
|
||||
except Exception as e:
|
||||
return 0, str(e)
|
||||
|
||||
# Use admin JWT token directly
|
||||
token = "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
|
||||
print("Using admin JWT token for authentication.")
|
||||
|
||||
headers = {"Content-Type": "application/json"}
|
||||
if token:
|
||||
headers["Authorization"] = f"Bearer {token}"
|
||||
|
||||
overall_start = time.time()
|
||||
stats = {
|
||||
"tables": 0, "dashboards": 0, "charts": 0, "pipelines": 0, "topics": 0,
|
||||
"mlmodels": 0, "containers": 0, "searchIndexes": 0,
|
||||
"glossaries": 0, "glossaryTerms": 0, "classifications": 0, "tags": 0
|
||||
}
|
||||
|
||||
# Store created entity FQNs for linking
|
||||
dashboard_fqns = []
|
||||
|
||||
# ========== CLASSIFICATIONS & TAGS ==========
if NUM_CLASSIFICATIONS > 0:
    print("")
    print("=" * 50)
    print("Creating Classifications and Tags")
    print("=" * 50)

    created_classifications = 0
    created_tags = 0
    start_time = time.time()

    for i in range(NUM_CLASSIFICATIONS):
        classification_data = {
            "name": f"TestClassification_{i:04d}",
            "description": f"Test classification {i} for distributed indexing testing"
        }
        status, resp = make_request(
            f"{SERVER_URL}/api/v1/classifications",
            data=classification_data,
            method="PUT",
            headers=headers
        )
        if status in [200, 201] and isinstance(resp, dict):
            created_classifications += 1
            classification_fqn = resp["fullyQualifiedName"]

            # Create tags for this classification
            def create_tag(args):
                class_fqn, tag_idx = args
                tag_data = {
                    "name": f"Tag_{tag_idx:04d}",
                    "classification": class_fqn,
                    "description": f"Test tag {tag_idx} in classification"
                }
                status, _ = make_request(f"{SERVER_URL}/api/v1/tags", data=tag_data, method="PUT", headers=headers)
                return status in [200, 201]

            with ThreadPoolExecutor(max_workers=10) as executor:
                futures = [executor.submit(create_tag, (classification_fqn, j)) for j in range(NUM_TAGS_PER_CLASSIFICATION)]
                for future in as_completed(futures):
                    if future.result():
                        created_tags += 1

        if (i + 1) % 5 == 0 or i == NUM_CLASSIFICATIONS - 1:
            elapsed = time.time() - start_time
            print(f" Classifications: {created_classifications}/{NUM_CLASSIFICATIONS}, Tags: {created_tags}/{NUM_CLASSIFICATIONS * NUM_TAGS_PER_CLASSIFICATION}")
            sys.stdout.flush()

    stats["classifications"] = created_classifications
    stats["tags"] = created_tags
    print(f"Classifications completed: {created_classifications} created")
    print(f"Tags completed: {created_tags} created")

# ========== GLOSSARIES & TERMS ==========
if NUM_GLOSSARIES > 0:
    print("")
    print("=" * 50)
    print("Creating Glossaries and Terms")
    print("=" * 50)

    created_glossaries = 0
    created_terms = 0
    start_time = time.time()

    for i in range(NUM_GLOSSARIES):
        glossary_data = {
            "name": f"TestGlossary_{i:04d}",
            "displayName": f"Test Glossary {i}",
            "description": f"Test glossary {i} for distributed indexing testing"
        }
        status, resp = make_request(
            f"{SERVER_URL}/api/v1/glossaries",
            data=glossary_data,
            method="PUT",
            headers=headers
        )
        if status in [200, 201] and isinstance(resp, dict):
            created_glossaries += 1
            glossary_fqn = resp["fullyQualifiedName"]

            # Create terms for this glossary
            def create_term(args):
                gloss_fqn, term_idx = args
                term_data = {
                    "name": f"Term_{term_idx:04d}",
                    "glossary": gloss_fqn,
                    "displayName": f"Term {term_idx}",
                    "description": f"Test glossary term {term_idx}"
                }
                status, _ = make_request(f"{SERVER_URL}/api/v1/glossaryTerms", data=term_data, method="PUT", headers=headers)
                return status in [200, 201]

            with ThreadPoolExecutor(max_workers=10) as executor:
                futures = [executor.submit(create_term, (glossary_fqn, j)) for j in range(NUM_TERMS_PER_GLOSSARY)]
                for future in as_completed(futures):
                    if future.result():
                        created_terms += 1

        if (i + 1) % 10 == 0 or i == NUM_GLOSSARIES - 1:
            elapsed = time.time() - start_time
            print(f" Glossaries: {created_glossaries}/{NUM_GLOSSARIES}, Terms: {created_terms}/{NUM_GLOSSARIES * NUM_TERMS_PER_GLOSSARY}")
            sys.stdout.flush()

    stats["glossaries"] = created_glossaries
    stats["glossaryTerms"] = created_terms
    print(f"Glossaries completed: {created_glossaries} created")
    print(f"Glossary Terms completed: {created_terms} created")

# ========== TABLES ==========
if NUM_TABLES > 0:
    print("")
    print("=" * 50)
    print("Creating Tables")
    print("=" * 50)

    # Create database service
    print("Creating database service...")
    sys.stdout.flush()
    service_data = {
        "name": "test-service-distributed",
        "serviceType": "Mysql",
        "connection": {
            "config": {
                "type": "Mysql",
                "username": "test",
                "authType": {"password": "test"},
                "hostPort": "localhost:3306"
            }
        }
    }

    status, resp = make_request(
        f"{SERVER_URL}/api/v1/services/databaseServices",
        data=service_data,
        method="PUT",
        headers=headers
    )

    if status in [200, 201] and isinstance(resp, dict):
        db_service_fqn = resp["fullyQualifiedName"]
        print(f"Database service created: {db_service_fqn}")
    else:
        print(f"Failed to create database service: {status} - {resp}")
        sys.exit(1)

    # Create databases
    print(f"Creating {NUM_DATABASES} databases...")
    sys.stdout.flush()
    database_fqns = []
    for i in range(NUM_DATABASES):
        db_data = {"name": f"test_db_{i:04d}", "service": db_service_fqn}
        status, resp = make_request(f"{SERVER_URL}/api/v1/databases", data=db_data, method="PUT", headers=headers)
        if status in [200, 201] and isinstance(resp, dict):
            database_fqns.append(resp["fullyQualifiedName"])

    print(f"Created {len(database_fqns)} databases")

    # Create schemas
    print("Creating schemas...")
    schema_fqns = []
    for db_fqn in database_fqns:
        schema_data = {"name": "public", "database": db_fqn}
        status, resp = make_request(f"{SERVER_URL}/api/v1/databaseSchemas", data=schema_data, method="PUT", headers=headers)
        if status in [200, 201] and isinstance(resp, dict):
            schema_fqns.append(resp["fullyQualifiedName"])

    print(f"Created {len(schema_fqns)} schemas")

    if not schema_fqns:
        print("ERROR: No schemas created. Cannot continue with tables.")
    else:
        # Create tables
        print(f"Creating {NUM_TABLES} tables...")
        sys.stdout.flush()
        # Ceiling division: floor division would silently drop the remainder
        # (and yield zero tables when NUM_TABLES < number of schemas).
        # The NUM_TABLES cap in the loop below prevents overshooting.
        tables_per_schema = -(-NUM_TABLES // len(schema_fqns))
        created = 0
        failed = 0
        start_time = time.time()

        def create_table(args):
            schema_fqn, table_idx = args
            table_data = {
                "name": f"table_{table_idx:06d}",
                "databaseSchema": schema_fqn,
                "columns": [
                    {"name": "id", "dataType": "BIGINT", "description": "Primary key"},
                    {"name": "name", "dataType": "VARCHAR", "dataLength": 255, "description": "Name field"},
                    {"name": "created_at", "dataType": "TIMESTAMP", "description": "Creation timestamp"},
                    {"name": "data", "dataType": "JSON", "description": "JSON data field"}
                ]
            }
            status, _ = make_request(f"{SERVER_URL}/api/v1/tables", data=table_data, method="PUT", headers=headers)
            return status in [200, 201]

        # Prepare tasks
        tasks = []
        table_idx = 0
        for schema_fqn in schema_fqns:
            for _ in range(tables_per_schema):
                tasks.append((schema_fqn, table_idx))
                table_idx += 1
                if table_idx >= NUM_TABLES:
                    break
            if table_idx >= NUM_TABLES:
                break

        # Execute with thread pool
        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = {executor.submit(create_table, task): task for task in tasks}
            for future in as_completed(futures):
                if future.result():
                    created += 1
                else:
                    failed += 1
                total = created + failed
                if total % 1000 == 0 or total == len(tasks):
                    elapsed = time.time() - start_time
                    rate = total / elapsed if elapsed > 0 else 0
                    print(f" Tables: {total}/{NUM_TABLES} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                    sys.stdout.flush()

        stats["tables"] = created
        print(f"Tables completed: {created} created, {failed} failed")

# ========== DASHBOARDS ==========
if NUM_DASHBOARDS > 0:
    print("")
    print("=" * 50)
    print("Creating Dashboards")
    print("=" * 50)

    # Create dashboard service
    print("Creating dashboard service...")
    sys.stdout.flush()
    dashboard_service_data = {
        "name": "test-dashboard-service",
        "serviceType": "Looker",
        "connection": {
            "config": {
                "type": "Looker",
                "clientId": "test-client-id",
                "clientSecret": "test-client-secret",
                "hostPort": "https://looker.example.com"
            }
        }
    }

    status, resp = make_request(
        f"{SERVER_URL}/api/v1/services/dashboardServices",
        data=dashboard_service_data,
        method="PUT",
        headers=headers
    )

    if status in [200, 201] and isinstance(resp, dict):
        dashboard_service_fqn = resp["fullyQualifiedName"]
        print(f"Dashboard service created: {dashboard_service_fqn}")
    else:
        print(f"Failed to create dashboard service: {status} - {resp}")
        dashboard_service_fqn = None

    if dashboard_service_fqn:
        # Create dashboards
        print(f"Creating {NUM_DASHBOARDS} dashboards...")
        sys.stdout.flush()
        created = 0
        failed = 0
        start_time = time.time()

        def create_dashboard(idx):
            dashboard_data = {
                "name": f"dashboard_{idx:06d}",
                "service": dashboard_service_fqn,
                "displayName": f"Test Dashboard {idx}",
                "description": f"Auto-generated test dashboard {idx} for distributed indexing testing"
            }
            status, resp = make_request(f"{SERVER_URL}/api/v1/dashboards", data=dashboard_data, method="PUT", headers=headers)
            if status in [200, 201] and isinstance(resp, dict):
                return resp.get("fullyQualifiedName")
            return None

        # Execute with thread pool
        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = {executor.submit(create_dashboard, i): i for i in range(NUM_DASHBOARDS)}
            for future in as_completed(futures):
                result = future.result()
                if result:
                    created += 1
                    dashboard_fqns.append(result)
                else:
                    failed += 1
                total = created + failed
                if total % 1000 == 0 or total == NUM_DASHBOARDS:
                    elapsed = time.time() - start_time
                    rate = total / elapsed if elapsed > 0 else 0
                    print(f" Dashboards: {total}/{NUM_DASHBOARDS} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                    sys.stdout.flush()

        stats["dashboards"] = created
        print(f"Dashboards completed: {created} created, {failed} failed")

# ========== CHARTS ==========
if NUM_CHARTS > 0 and dashboard_fqns:
    print("")
    print("=" * 50)
    print("Creating Charts")
    print("=" * 50)

    print(f"Creating {NUM_CHARTS} charts linked to dashboards...")
    sys.stdout.flush()
    created = 0
    failed = 0
    start_time = time.time()

    def create_chart(idx):
        # Pick a random dashboard; the value is not included in the payload below
        dashboard_fqn = random.choice(dashboard_fqns) if dashboard_fqns else None
        chart_data = {
            "name": f"chart_{idx:06d}",
            "service": dashboard_service_fqn,
            "displayName": f"Test Chart {idx}",
            "chartType": random.choice(["Line", "Bar", "Pie", "Area", "Scatter", "Table"]),
            "description": f"Auto-generated test chart {idx}"
        }
        status, _ = make_request(f"{SERVER_URL}/api/v1/charts", data=chart_data, method="PUT", headers=headers)
        return status in [200, 201]

    with ThreadPoolExecutor(max_workers=20) as executor:
        futures = {executor.submit(create_chart, i): i for i in range(NUM_CHARTS)}
        for future in as_completed(futures):
            if future.result():
                created += 1
            else:
                failed += 1
            total = created + failed
            if total % 1000 == 0 or total == NUM_CHARTS:
                elapsed = time.time() - start_time
                rate = total / elapsed if elapsed > 0 else 0
                print(f" Charts: {total}/{NUM_CHARTS} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                sys.stdout.flush()

    stats["charts"] = created
    print(f"Charts completed: {created} created, {failed} failed")

# ========== PIPELINES ==========
if NUM_PIPELINES > 0:
    print("")
    print("=" * 50)
    print("Creating Pipelines")
    print("=" * 50)

    # Create pipeline service
    print("Creating pipeline service...")
    sys.stdout.flush()
    pipeline_service_data = {
        "name": "test-pipeline-service",
        "serviceType": "Airflow",
        "connection": {
            "config": {
                "type": "Airflow",
                "hostPort": "http://airflow.example.com:8080",
                "connection": {
                    "type": "BackendConnection"
                }
            }
        }
    }

    status, resp = make_request(
        f"{SERVER_URL}/api/v1/services/pipelineServices",
        data=pipeline_service_data,
        method="PUT",
        headers=headers
    )

    if status in [200, 201] and isinstance(resp, dict):
        pipeline_service_fqn = resp["fullyQualifiedName"]
        print(f"Pipeline service created: {pipeline_service_fqn}")
    else:
        print(f"Failed to create pipeline service: {status} - {resp}")
        pipeline_service_fqn = None

    if pipeline_service_fqn:
        # Create pipelines
        print(f"Creating {NUM_PIPELINES} pipelines...")
        sys.stdout.flush()
        created = 0
        failed = 0
        start_time = time.time()

        def create_pipeline(idx):
            pipeline_data = {
                "name": f"pipeline_{idx:06d}",
                "service": pipeline_service_fqn,
                "displayName": f"Test Pipeline {idx}",
                "description": f"Auto-generated test pipeline {idx} for distributed indexing testing"
            }
            status, _ = make_request(f"{SERVER_URL}/api/v1/pipelines", data=pipeline_data, method="PUT", headers=headers)
            return status in [200, 201]

        # Execute with thread pool
        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = {executor.submit(create_pipeline, i): i for i in range(NUM_PIPELINES)}
            for future in as_completed(futures):
                if future.result():
                    created += 1
                else:
                    failed += 1
                total = created + failed
                if total % 1000 == 0 or total == NUM_PIPELINES:
                    elapsed = time.time() - start_time
                    rate = total / elapsed if elapsed > 0 else 0
                    print(f" Pipelines: {total}/{NUM_PIPELINES} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                    sys.stdout.flush()

        stats["pipelines"] = created
        print(f"Pipelines completed: {created} created, {failed} failed")

# ========== TOPICS ==========
if NUM_TOPICS > 0:
    print("")
    print("=" * 50)
    print("Creating Topics")
    print("=" * 50)

    # Create messaging service
    print("Creating messaging service...")
    sys.stdout.flush()
    messaging_service_data = {
        "name": "test-messaging-service",
        "serviceType": "Kafka",
        "connection": {
            "config": {
                "type": "Kafka",
                "bootstrapServers": "localhost:9092"
            }
        }
    }

    status, resp = make_request(
        f"{SERVER_URL}/api/v1/services/messagingServices",
        data=messaging_service_data,
        method="PUT",
        headers=headers
    )

    if status in [200, 201] and isinstance(resp, dict):
        messaging_service_fqn = resp["fullyQualifiedName"]
        print(f"Messaging service created: {messaging_service_fqn}")
    else:
        print(f"Failed to create messaging service: {status} - {resp}")
        messaging_service_fqn = None

    if messaging_service_fqn:
        # Create topics
        print(f"Creating {NUM_TOPICS} topics...")
        sys.stdout.flush()
        created = 0
        failed = 0
        start_time = time.time()

        def create_topic(idx):
            topic_data = {
                "name": f"topic_{idx:06d}",
                "service": messaging_service_fqn,
                "partitions": 3,
                "replicationFactor": 1,
                "description": f"Auto-generated test topic {idx} for distributed indexing testing"
            }
            status, _ = make_request(f"{SERVER_URL}/api/v1/topics", data=topic_data, method="PUT", headers=headers)
            return status in [200, 201]

        # Execute with thread pool
        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = {executor.submit(create_topic, i): i for i in range(NUM_TOPICS)}
            for future in as_completed(futures):
                if future.result():
                    created += 1
                else:
                    failed += 1
                total = created + failed
                if total % 1000 == 0 or total == NUM_TOPICS:
                    elapsed = time.time() - start_time
                    rate = total / elapsed if elapsed > 0 else 0
                    print(f" Topics: {total}/{NUM_TOPICS} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                    sys.stdout.flush()

        stats["topics"] = created
        print(f"Topics completed: {created} created, {failed} failed")

# ========== ML MODELS ==========
if NUM_MLMODELS > 0:
    print("")
    print("=" * 50)
    print("Creating ML Models")
    print("=" * 50)

    # Create ML model service
    print("Creating ML model service...")
    sys.stdout.flush()
    mlmodel_service_data = {
        "name": "test-mlmodel-service",
        "serviceType": "Mlflow",
        "connection": {
            "config": {
                "type": "Mlflow",
                "trackingUri": "http://mlflow.example.com:5000",
                "registryUri": "http://mlflow.example.com:5000"
            }
        }
    }

    status, resp = make_request(
        f"{SERVER_URL}/api/v1/services/mlmodelServices",
        data=mlmodel_service_data,
        method="PUT",
        headers=headers
    )

    if status in [200, 201] and isinstance(resp, dict):
        mlmodel_service_fqn = resp["fullyQualifiedName"]
        print(f"ML model service created: {mlmodel_service_fqn}")
    else:
        print(f"Failed to create ML model service: {status} - {resp}")
        mlmodel_service_fqn = None

    if mlmodel_service_fqn:
        print(f"Creating {NUM_MLMODELS} ML models...")
        sys.stdout.flush()
        created = 0
        failed = 0
        start_time = time.time()

        algorithms = ["LinearRegression", "RandomForest", "XGBoost", "NeuralNetwork", "SVM", "KMeans", "DecisionTree"]

        def create_mlmodel(idx):
            mlmodel_data = {
                "name": f"mlmodel_{idx:06d}",
                "service": mlmodel_service_fqn,
                "algorithm": random.choice(algorithms),
                "displayName": f"Test ML Model {idx}",
                "description": f"Auto-generated test ML model {idx}"
            }
            status, _ = make_request(f"{SERVER_URL}/api/v1/mlmodels", data=mlmodel_data, method="PUT", headers=headers)
            return status in [200, 201]

        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = {executor.submit(create_mlmodel, i): i for i in range(NUM_MLMODELS)}
            for future in as_completed(futures):
                if future.result():
                    created += 1
                else:
                    failed += 1
                total = created + failed
                if total % 500 == 0 or total == NUM_MLMODELS:
                    elapsed = time.time() - start_time
                    rate = total / elapsed if elapsed > 0 else 0
                    print(f" ML Models: {total}/{NUM_MLMODELS} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                    sys.stdout.flush()

        stats["mlmodels"] = created
        print(f"ML Models completed: {created} created, {failed} failed")

# ========== CONTAINERS ==========
if NUM_CONTAINERS > 0:
    print("")
    print("=" * 50)
    print("Creating Containers")
    print("=" * 50)

    # Create storage service
    print("Creating storage service...")
    sys.stdout.flush()
    storage_service_data = {
        "name": "test-storage-service",
        "serviceType": "S3",
        "connection": {
            "config": {
                "type": "S3",
                "awsConfig": {
                    "awsAccessKeyId": "test-key",
                    "awsSecretAccessKey": "test-secret",
                    "awsRegion": "us-east-1"
                }
            }
        }
    }

    status, resp = make_request(
        f"{SERVER_URL}/api/v1/services/storageServices",
        data=storage_service_data,
        method="PUT",
        headers=headers
    )

    if status in [200, 201] and isinstance(resp, dict):
        storage_service_fqn = resp["fullyQualifiedName"]
        print(f"Storage service created: {storage_service_fqn}")
    else:
        print(f"Failed to create storage service: {status} - {resp}")
        storage_service_fqn = None

    if storage_service_fqn:
        print(f"Creating {NUM_CONTAINERS} containers...")
        sys.stdout.flush()
        created = 0
        failed = 0
        start_time = time.time()

        def create_container(idx):
            container_data = {
                "name": f"container_{idx:06d}",
                "service": storage_service_fqn,
                "displayName": f"Test Container {idx}",
                "description": f"Auto-generated test container {idx}"
            }
            status, _ = make_request(f"{SERVER_URL}/api/v1/containers", data=container_data, method="PUT", headers=headers)
            return status in [200, 201]

        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = {executor.submit(create_container, i): i for i in range(NUM_CONTAINERS)}
            for future in as_completed(futures):
                if future.result():
                    created += 1
                else:
                    failed += 1
                total = created + failed
                if total % 500 == 0 or total == NUM_CONTAINERS:
                    elapsed = time.time() - start_time
                    rate = total / elapsed if elapsed > 0 else 0
                    print(f" Containers: {total}/{NUM_CONTAINERS} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                    sys.stdout.flush()

        stats["containers"] = created
        print(f"Containers completed: {created} created, {failed} failed")

# ========== SEARCH INDEXES ==========
if NUM_SEARCH_INDEXES > 0:
    print("")
    print("=" * 50)
    print("Creating Search Indexes")
    print("=" * 50)

    # Create search service
    print("Creating search service...")
    sys.stdout.flush()
    search_service_data = {
        "name": "test-search-service",
        "serviceType": "ElasticSearch",
        "connection": {
            "config": {
                "type": "ElasticSearch",
                "hostPort": "http://elasticsearch.example.com:9200"
            }
        }
    }

    status, resp = make_request(
        f"{SERVER_URL}/api/v1/services/searchServices",
        data=search_service_data,
        method="PUT",
        headers=headers
    )

    if status in [200, 201] and isinstance(resp, dict):
        search_service_fqn = resp["fullyQualifiedName"]
        print(f"Search service created: {search_service_fqn}")
    else:
        print(f"Failed to create search service: {status} - {resp}")
        search_service_fqn = None

    if search_service_fqn:
        print(f"Creating {NUM_SEARCH_INDEXES} search indexes...")
        sys.stdout.flush()
        created = 0
        failed = 0
        start_time = time.time()

        def create_search_index(idx):
            search_index_data = {
                "name": f"search_index_{idx:06d}",
                "service": search_service_fqn,
                "displayName": f"Test Search Index {idx}",
                "description": f"Auto-generated test search index {idx}",
                "fields": [
                    {"name": "id", "dataType": "KEYWORD"},
                    {"name": "content", "dataType": "TEXT"},
                    {"name": "timestamp", "dataType": "DATE"}
                ]
            }
            status, _ = make_request(f"{SERVER_URL}/api/v1/searchIndexes", data=search_index_data, method="PUT", headers=headers)
            return status in [200, 201]

        with ThreadPoolExecutor(max_workers=20) as executor:
            futures = {executor.submit(create_search_index, i): i for i in range(NUM_SEARCH_INDEXES)}
            for future in as_completed(futures):
                if future.result():
                    created += 1
                else:
                    failed += 1
                total = created + failed
                if total % 200 == 0 or total == NUM_SEARCH_INDEXES:
                    elapsed = time.time() - start_time
                    rate = total / elapsed if elapsed > 0 else 0
                    print(f" Search Indexes: {total}/{NUM_SEARCH_INDEXES} ({rate:.1f}/sec) - Created: {created}, Failed: {failed}")
                    sys.stdout.flush()

        stats["searchIndexes"] = created
        print(f"Search Indexes completed: {created} created, {failed} failed")

# ========== SUMMARY ==========
overall_elapsed = time.time() - overall_start
total_created = sum(stats.values())

print("")
print("=" * 50)
print("Test Data Loading Complete!")
print("=" * 50)
print("")
print("Summary:")
print(f" Tables: {stats['tables']:>6}")
print(f" Dashboards: {stats['dashboards']:>6}")
print(f" Charts: {stats['charts']:>6}")
print(f" Pipelines: {stats['pipelines']:>6}")
print(f" Topics: {stats['topics']:>6}")
print(f" ML Models: {stats['mlmodels']:>6}")
print(f" Containers: {stats['containers']:>6}")
print(f" Search Indexes: {stats['searchIndexes']:>6}")
print(f" Glossaries: {stats['glossaries']:>6}")
print(f" Glossary Terms: {stats['glossaryTerms']:>6}")
print(f" Classifications: {stats['classifications']:>6}")
print(f" Tags: {stats['tags']:>6}")
print(f" --------------------------")
print(f" Total: {total_created:>6}")
print("")
print(f"Time: {overall_elapsed:.1f} seconds")
if overall_elapsed > 0:
    print(f"Rate: {total_created/overall_elapsed:.1f} entities/second")
EOF

echo ""
echo "Test data loaded. You can now trigger reindexing with: ./scripts/trigger-reindex.sh"
|
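Every per-entity section of the script above repeats the same submit, collect, and report loop around `ThreadPoolExecutor`. As a rough sketch of how that pattern might be factored out, the block below defines one helper. The name `run_parallel` and its parameters are hypothetical and not part of the script, and the lambda is only a stand-in for a real `create_*` function.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(create_fn, items, label, max_workers=20, report_every=1000):
    """Run create_fn over items in a thread pool and return (created, failed).

    Hypothetical helper: create_fn must return a truthy value on success.
    """
    items = list(items)
    created = failed = 0
    start = time.time()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(create_fn, item) for item in items]
        for future in as_completed(futures):
            if future.result():
                created += 1
            else:
                failed += 1
            total = created + failed
            # Report periodically and once more when everything has finished.
            if total % report_every == 0 or total == len(items):
                elapsed = time.time() - start
                rate = total / elapsed if elapsed > 0 else 0
                print(f"  {label}: {total}/{len(items)} ({rate:.1f}/sec) - "
                      f"Created: {created}, Failed: {failed}")
    return created, failed


# Stand-in for a create_* function: even indexes "succeed", odd ones "fail".
created, failed = run_parallel(lambda i: i % 2 == 0, range(10), "Demo", max_workers=4)
```

With such a helper, a section would reduce to building its payload function and calling `run_parallel(create_topic, range(NUM_TOPICS), "Topics")`. Whether that indirection is worth it in a throwaway load script is a judgment call.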
|
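The number of tables actually created depends on how the tables-per-schema value is rounded. The sketch below mirrors the task-preparation loop of the tables section in a hypothetical function, `plan_tables`, and compares floor and ceiling division. The schema names and counts are illustrative assumptions.

```python
def plan_tables(num_tables, schemas, per_schema):
    """Assign consecutive table indexes to schemas, capped at num_tables."""
    tasks = []
    table_idx = 0
    for schema in schemas:
        for _ in range(per_schema):
            tasks.append((schema, table_idx))
            table_idx += 1
            if table_idx >= num_tables:
                break
        if table_idx >= num_tables:
            break
    return tasks


schemas = ["db0.public", "db1.public", "db2.public"]  # illustrative names
num_tables = 10

# Floor division: 10 // 3 == 3 per schema, so only 9 tasks are planned.
floor_plan = plan_tables(num_tables, schemas, num_tables // len(schemas))

# Ceiling division: 4 per schema; the num_tables cap stops the loop at 10.
ceil_plan = plan_tables(num_tables, schemas, -(-num_tables // len(schemas)))

print(len(floor_plan), len(ceil_plan))
```

Floor division also plans zero tasks whenever the table count is smaller than the number of schemas, which is easy to hit with small test configurations.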
@@ -130,6 +130,7 @@ import org.openmetadata.service.resources.databases.DatasourceConfig;
 import org.openmetadata.service.resources.filters.ETagRequestFilter;
 import org.openmetadata.service.resources.filters.ETagResponseFilter;
 import org.openmetadata.service.resources.settings.SettingsCache;
+import org.openmetadata.service.resources.system.DiagnosticsResource;
 import org.openmetadata.service.search.SearchRepository;
 import org.openmetadata.service.search.SearchRepositoryFactory;
 import org.openmetadata.service.secrets.SecretsManagerFactory;
@@ -993,6 +994,7 @@ public class OpenMetadataApplication extends Application<OpenMetadataApplication
             SecurityConfigurationManager.getInstance().getAuthenticatorHandler(),
             limits);
     environment.jersey().register(new AuditLogResource(authorizer, auditLogRepository));
+    environment.jersey().register(new DiagnosticsResource(authorizer));
     environment.jersey().register(new JsonPatchProvider());
     environment.jersey().register(new JsonPatchMessageBodyReader());
 
@@ -103,7 +103,10 @@ public class ApplicationHandler {
   public void cleanupStaleJobs() {
     try {
       LOG.info("Cleaning up stale application jobs from previous server runs");
-      Entity.getCollectionDAO().appExtensionTimeSeriesDao().markAllStaleEntriesFailed();
+      CollectionDAO.AppExtensionTimeSeries dao =
+          Entity.getCollectionDAO().appExtensionTimeSeriesDao();
+      dao.markStaleEntriesStoppedByName("SearchIndexingApplication");
+      dao.markAllStaleEntriesFailed();
       LOG.info("Stale application jobs cleanup completed successfully");
     } catch (Exception e) {
       LOG.error("Failed to cleanup stale application jobs", e);
@@ -0,0 +1,35 @@
+package org.openmetadata.service.apps.bundles.searchIndex;
+
+/**
+ * Replaces fixed-delay sleep in backpressure loops with exponential backoff. Starts at an initial
+ * delay and doubles on each call up to a configurable maximum. Call {@link #reset()} when
+ * backpressure clears so the next occurrence starts fresh.
+ */
+public class AdaptiveBackoff {
+
+  private final long initialMs;
+  private final long maxMs;
+  private long currentMs;
+
+  public AdaptiveBackoff(long initialMs, long maxMs) {
+    if (initialMs <= 0) {
+      throw new IllegalArgumentException("initialMs must be > 0");
+    }
+    if (maxMs < initialMs) {
+      throw new IllegalArgumentException("maxMs must be >= initialMs");
+    }
+    this.initialMs = initialMs;
+    this.maxMs = maxMs;
+    this.currentMs = initialMs;
+  }
+
+  public long nextDelay() {
+    long delay = currentMs;
+    currentMs = Math.min(currentMs * 2, maxMs);
+    return delay;
+  }
+
+  public void reset() {
+    currentMs = initialMs;
+  }
+}
|
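To make the delay sequence of `AdaptiveBackoff` concrete, here is a small Python port of the Java class above, for illustration only. The 100 ms initial delay and 1000 ms cap are assumed values, not settings taken from the commit.

```python
class AdaptiveBackoff:
    """Illustrative Python port of the Java AdaptiveBackoff class."""

    def __init__(self, initial_ms, max_ms):
        if initial_ms <= 0:
            raise ValueError("initial_ms must be > 0")
        if max_ms < initial_ms:
            raise ValueError("max_ms must be >= initial_ms")
        self.initial_ms = initial_ms
        self.max_ms = max_ms
        self.current_ms = initial_ms

    def next_delay(self):
        # Return the current delay, then double it for the next call (capped).
        delay = self.current_ms
        self.current_ms = min(self.current_ms * 2, self.max_ms)
        return delay

    def reset(self):
        self.current_ms = self.initial_ms


backoff = AdaptiveBackoff(100, 1000)  # assumed values
delays = [backoff.next_delay() for _ in range(6)]
print(delays)  # [100, 200, 400, 800, 1000, 1000]

backoff.reset()
after_reset = backoff.next_delay()  # back to the initial delay
```

Note that the first call returns the initial delay unchanged; doubling only affects subsequent calls, and the cap is applied before the value is stored.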
|
@@ -0,0 +1,127 @@
+package org.openmetadata.service.apps.bundles.searchIndex;
+
+import java.util.Iterator;
+import java.util.concurrent.ConcurrentLinkedDeque;
+import java.util.concurrent.atomic.AtomicReference;
+import lombok.extern.slf4j.Slf4j;
+
+/**
+ * Sliding-window circuit breaker for bulk search-index requests.
+ *
+ * <p>State transitions: CLOSED → OPEN (after N failures in window) → HALF_OPEN (probe after
+ * interval) → CLOSED (on probe success) or back to OPEN (on probe failure).
+ */
+@Slf4j
+public class BulkCircuitBreaker {
+
+  public enum State {
+    CLOSED,
+    OPEN,
+    HALF_OPEN
+  }
+
+  private final int failureThreshold;
+  private final long windowMs;
+  private final long halfOpenProbeMs;
+
+  private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
+  private final ConcurrentLinkedDeque<Long> failureTimestamps = new ConcurrentLinkedDeque<>();
+  private volatile long openedAt;
+
+  public BulkCircuitBreaker(int failureThreshold, long windowMs, long halfOpenProbeMs) {
+    if (failureThreshold <= 0) {
+      throw new IllegalArgumentException("failureThreshold must be > 0");
+    }
+    if (windowMs <= 0) {
+      throw new IllegalArgumentException("windowMs must be > 0");
+    }
+    if (halfOpenProbeMs <= 0) {
+      throw new IllegalArgumentException("halfOpenProbeMs must be > 0");
+    }
+    this.failureThreshold = failureThreshold;
+    this.windowMs = windowMs;
+    this.halfOpenProbeMs = halfOpenProbeMs;
+  }
+
+  public boolean allowRequest() {
+    State current = state.get();
+    if (current == State.CLOSED) {
+      return true;
+    }
+    if (current == State.HALF_OPEN) {
+      return true;
+    }
+    // OPEN: check if probe interval has elapsed
+    if (System.currentTimeMillis() - openedAt >= halfOpenProbeMs) {
+      if (state.compareAndSet(State.OPEN, State.HALF_OPEN)) {
+        LOG.warn("Circuit breaker transitioning OPEN → HALF_OPEN (probe request allowed)");
+        recordTransition("open_to_half_open");
+      }
+      return true;
+    }
+    return false;
+  }
+
+  public void recordSuccess() {
+    if (state.compareAndSet(State.HALF_OPEN, State.CLOSED)) {
+      failureTimestamps.clear();
+      LOG.warn("Circuit breaker transitioning HALF_OPEN → CLOSED (probe succeeded)");
+      recordTransition("half_open_to_closed");
+    }
+  }
+
+  public void recordFailure() {
+    long now = System.currentTimeMillis();
+
+    State current = state.get();
+    if (current == State.HALF_OPEN) {
+      if (state.compareAndSet(State.HALF_OPEN, State.OPEN)) {
+        openedAt = now;
+        LOG.warn("Circuit breaker transitioning HALF_OPEN → OPEN (probe failed)");
+        recordTransition("half_open_to_open");
+      }
+      return;
+    }
+
+    failureTimestamps.addLast(now);
+    pruneOldFailures(now);
+
+    if (failureTimestamps.size() >= failureThreshold
+        && state.compareAndSet(State.CLOSED, State.OPEN)) {
+      openedAt = now;
+      LOG.warn(
+          "Circuit breaker transitioning CLOSED → OPEN ({} failures in {}ms window)",
+          failureThreshold,
+          windowMs);
+      recordTransition("closed_to_open");
+    }
+  }
+
+  public State getState() {
+    return state.get();
+  }
+
+  public void reset() {
+    state.set(State.CLOSED);
+    failureTimestamps.clear();
+  }
+
+  private void pruneOldFailures(long now) {
+    long cutoff = now - windowMs;
+    Iterator<Long> it = failureTimestamps.iterator();
+    while (it.hasNext()) {
+      if (it.next() < cutoff) {
+        it.remove();
+      } else {
+        break;
+      }
+    }
+  }
+
+  private void recordTransition(String transition) {
+    ReindexingMetrics metrics = ReindexingMetrics.getInstance();
+    if (metrics != null) {
+      metrics.recordCircuitBreakerTrip(transition);
+    }
+  }
+}
|
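The breaker above opens once `failureThreshold` failures fall inside `windowMs`, lets a probe request through after `halfOpenProbeMs`, and closes again when that probe succeeds. The following is a minimal sketch of that state machine, not the committed class. It assumes a hypothetical `CircuitBreakerSketch` with an injected `LongSupplier` clock (in place of `System.currentTimeMillis()`) so the transitions can be stepped deterministically, and it assumes a `ConcurrentLinkedDeque` for the timestamp window because the hunk does not show that field's type. Logging and the `ReindexingMetrics` hook are left out.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.LongSupplier;

// Hypothetical, simplified mirror of the breaker in the diff (not the committed class).
class CircuitBreakerSketch {
  enum State { CLOSED, OPEN, HALF_OPEN }

  private final int failureThreshold;
  private final long windowMs;
  private final long halfOpenProbeMs;
  private final LongSupplier clock; // assumption: injected clock for deterministic runs
  private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
  // Assumption: the field type is not visible in the diff; a concurrent deque is used here.
  private final ConcurrentLinkedDeque<Long> failureTimestamps = new ConcurrentLinkedDeque<>();
  private volatile long openedAt;

  CircuitBreakerSketch(
      int failureThreshold, long windowMs, long halfOpenProbeMs, LongSupplier clock) {
    if (failureThreshold <= 0 || windowMs <= 0 || halfOpenProbeMs <= 0) {
      throw new IllegalArgumentException("all thresholds must be > 0");
    }
    this.failureThreshold = failureThreshold;
    this.windowMs = windowMs;
    this.halfOpenProbeMs = halfOpenProbeMs;
    this.clock = clock;
  }

  boolean allowRequest() {
    if (state.get() != State.OPEN) {
      return true; // CLOSED and HALF_OPEN both let requests through
    }
    if (clock.getAsLong() - openedAt >= halfOpenProbeMs) {
      state.compareAndSet(State.OPEN, State.HALF_OPEN); // probe interval elapsed
      return true;
    }
    return false;
  }

  void recordSuccess() {
    if (state.compareAndSet(State.HALF_OPEN, State.CLOSED)) {
      failureTimestamps.clear(); // probe succeeded, start with an empty window
    }
  }

  void recordFailure() {
    long now = clock.getAsLong();
    if (state.get() == State.HALF_OPEN) {
      if (state.compareAndSet(State.HALF_OPEN, State.OPEN)) {
        openedAt = now; // probe failed, restart the probe timer
      }
      return;
    }
    failureTimestamps.addLast(now);
    long cutoff = now - windowMs;
    Iterator<Long> it = failureTimestamps.iterator();
    while (it.hasNext() && it.next() < cutoff) {
      it.remove(); // drop failures that fell out of the sliding window
    }
    if (failureTimestamps.size() >= failureThreshold
        && state.compareAndSet(State.CLOSED, State.OPEN)) {
      openedAt = now;
    }
  }

  State getState() {
    return state.get();
  }

  public static void main(String[] args) {
    long[] now = {0L};
    CircuitBreakerSketch cb = new CircuitBreakerSketch(2, 1_000L, 500L, () -> now[0]);
    cb.recordFailure();
    now[0] = 100L;
    cb.recordFailure();
    System.out.println(cb.getState()); // OPEN
    System.out.println(cb.allowRequest()); // false
    now[0] = 600L;
    System.out.println(cb.allowRequest()); // true
    cb.recordSuccess();
    System.out.println(cb.getState()); // CLOSED
  }
}
```

As in the diff, every caller is admitted while the breaker is HALF_OPEN, so more than one probe can be in flight; limiting it to a single probe would need an extra guard.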
|
@@ -30,14 +30,20 @@ public interface BulkSink {
|
|||
@FunctionalInterface
|
||||
interface FailureCallback {
|
||||
/**
|
||||
* Called when a document fails to index in ES/OpenSearch.
|
||||
* Called when a document fails to index.
|
||||
*
|
||||
* @param entityType The type of entity that failed
|
||||
* @param entityId The ID of the entity (from document ID), may be null for build failures
|
||||
* @param entityFqn The FQN of the entity, may be null if not available
|
||||
* @param errorMessage The error message from ES/OpenSearch
|
||||
* @param errorMessage The error message describing the failure
|
||||
* @param stage The pipeline stage where the failure occurred (PROCESS or SINK)
|
||||
*/
|
||||
void onFailure(String entityType, String entityId, String entityFqn, String errorMessage);
|
||||
void onFailure(
|
||||
String entityType,
|
||||
String entityId,
|
||||
String entityFqn,
|
||||
String errorMessage,
|
||||
IndexingFailureRecorder.FailureStage stage);
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
@@ -60,6 +66,16 @@ public interface BulkSink {
|
|||
return null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns the process stage statistics. This tracks document building/transformation
|
||||
* separately from the actual sink (bulk indexing) stats.
|
||||
*
|
||||
* @return StepStats with process success/failed counts, or null if not supported
|
||||
*/
|
||||
default StepStats getProcessStats() {
|
||||
return null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Wait for all pending vector embedding tasks to complete. This is important for ensuring
|
||||
* no vector tasks are lost when the job completes. The sink's close() method should also
|
||||
|
|
@@ -82,6 +98,30 @@ public interface BulkSink {
|
|||
return 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns the number of currently active (in-flight) bulk requests.
|
||||
*
|
||||
* @return Number of active bulk requests
|
||||
*/
|
||||
default int getActiveBulkRequestCount() {
|
||||
return 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Wait for vector embedding tasks to complete and return detailed result including timing.
|
||||
*
|
||||
* @param timeoutSeconds Maximum time to wait
|
||||
* @return VectorCompletionResult with completion status, pending count, and wait time
|
||||
*/
|
||||
default VectorCompletionResult awaitVectorCompletionWithDetails(int timeoutSeconds) {
|
||||
long start = System.currentTimeMillis();
|
||||
boolean ok = awaitVectorCompletion(timeoutSeconds);
|
||||
long waited = System.currentTimeMillis() - start;
|
||||
return ok
|
||||
? VectorCompletionResult.success(waited)
|
||||
: VectorCompletionResult.timeout(getPendingVectorTaskCount(), waited);
|
||||
}
|
||||
|
||||
/** Key for passing StageStatsTracker through context data to the sink. */
|
||||
String STATS_TRACKER_CONTEXT_KEY = "stageStatsTracker";
|
||||
}
|
||||
|
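The `FailureCallback` change in the `BulkSink` hunks above adds a fifth `stage` argument so a receiver can tell document-building failures from bulk-indexing failures. A small sketch of how a receiver might use it follows. `FailureCallbackSketch`, `StageCounter`, and the two-value `FailureStage` enum are hypothetical stand-ins; the real `IndexingFailureRecorder.FailureStage` is not shown in this diff and may carry more values than the PROCESS and SINK named in the Javadoc.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical illustration of a stage-aware failure callback.
class FailureCallbackSketch {
  // Stand-in for IndexingFailureRecorder.FailureStage (assumed values).
  enum FailureStage { PROCESS, SINK }

  // Same five-argument shape as the updated callback in the diff.
  @FunctionalInterface
  interface FailureCallback {
    void onFailure(
        String entityType,
        String entityId,
        String entityFqn,
        String errorMessage,
        FailureStage stage);
  }

  // Counts failures per stage; a real recorder would presumably persist them.
  static final class StageCounter implements FailureCallback {
    private final Map<FailureStage, Integer> counts = new EnumMap<>(FailureStage.class);

    @Override
    public synchronized void onFailure(
        String entityType,
        String entityId,
        String entityFqn,
        String errorMessage,
        FailureStage stage) {
      counts.merge(stage, 1, Integer::sum);
    }

    synchronized int count(FailureStage stage) {
      return counts.getOrDefault(stage, 0);
    }
  }

  public static void main(String[] args) {
    StageCounter counter = new StageCounter();
    FailureCallback callback = counter;
    // entityId may be null for build failures, as the Javadoc in the diff notes.
    callback.onFailure("table", null, "svc.db.t1", "document build failed", FailureStage.PROCESS);
    callback.onFailure("table", "id-1", "svc.db.t1", "bulk item rejected", FailureStage.SINK);
    callback.onFailure("topic", "id-2", null, "bulk item rejected", FailureStage.SINK);
    System.out.println(counter.count(FailureStage.PROCESS)); // 1
    System.out.println(counter.count(FailureStage.SINK)); // 2
  }
}
```

Because the interface keeps a single abstract method, existing lambda-based callers only need the extra parameter added to their argument list.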
@@ -122,6 +122,51 @@ public class CompositeProgressListener implements ReindexingProgressListener {
|
|||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onReaderFailure(String entityType, String entityId, String error, FailureType type) {
|
||||
for (ReindexingProgressListener listener : listeners) {
|
||||
try {
|
||||
listener.onReaderFailure(entityType, entityId, error, type);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Listener {} failed on onReaderFailure", listener.getClass().getSimpleName(), e);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onProcessFailure(String entityType, String entityId, String error) {
|
||||
for (ReindexingProgressListener listener : listeners) {
|
||||
try {
|
||||
listener.onProcessFailure(entityType, entityId, error);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Listener {} failed on onProcessFailure", listener.getClass().getSimpleName(), e);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onSinkFailure(String entityType, String entityId, String error) {
|
||||
for (ReindexingProgressListener listener : listeners) {
|
||||
try {
|
||||
listener.onSinkFailure(entityType, entityId, error);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Listener {} failed on onSinkFailure", listener.getClass().getSimpleName(), e);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onSubIndexingCompleted(String entityType, String subIndex, StepStats subIndexStats) {
|
||||
for (ReindexingProgressListener listener : listeners) {
|
||||
try {
|
||||
listener.onSubIndexingCompleted(entityType, subIndex, subIndexStats);
|
||||
} catch (Exception e) {
|
||||
LOG.error(
|
||||
"Listener {} failed on onSubIndexingCompleted", listener.getClass().getSimpleName(), e);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onJobCompleted(Stats finalStats, long elapsedMillis) {
|
||||
for (ReindexingProgressListener listener : listeners) {
|
||||
|
@@ -0,0 +1,649 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import static org.openmetadata.common.utils.CommonUtil.listOrEmpty;
|
||||
import static org.openmetadata.service.Entity.QUERY_COST_RECORD;
|
||||
import static org.openmetadata.service.Entity.TEST_CASE_RESOLUTION_STATUS;
|
||||
import static org.openmetadata.service.Entity.TEST_CASE_RESULT;
|
||||
|
||||
import java.util.Collections;
|
||||
import java.util.HashMap;
|
||||
import java.util.HashSet;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.Optional;
|
||||
import java.util.Set;
|
||||
import java.util.UUID;
|
||||
import java.util.concurrent.CountDownLatch;
|
||||
import java.util.concurrent.Executors;
|
||||
import java.util.concurrent.ScheduledExecutorService;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
import java.util.concurrent.atomic.AtomicBoolean;
|
||||
import java.util.concurrent.atomic.AtomicReference;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.openmetadata.schema.analytics.ReportData;
|
||||
import org.openmetadata.schema.system.EventPublisherJob;
|
||||
import org.openmetadata.schema.system.Stats;
|
||||
import org.openmetadata.schema.system.StepStats;
|
||||
import org.openmetadata.schema.type.Include;
|
||||
import org.openmetadata.service.Entity;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.DistributedSearchIndexExecutor;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.IndexJobStatus;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.SearchIndexJob;
|
||||
import org.openmetadata.service.jdbi3.CollectionDAO;
|
||||
import org.openmetadata.service.jdbi3.EntityTimeSeriesRepository;
|
||||
import org.openmetadata.service.jdbi3.ListFilter;
|
||||
import org.openmetadata.service.search.RecreateIndexHandler;
|
||||
import org.openmetadata.service.search.ReindexContext;
|
||||
import org.openmetadata.service.search.SearchRepository;
|
||||
import org.openmetadata.service.search.vector.VectorIndexService;
|
||||
import org.openmetadata.service.util.FullyQualifiedName;
|
||||
|
||||
@Slf4j
|
||||
public class DistributedIndexingStrategy implements IndexingStrategy {
|
||||
|
||||
private static final Set<String> TIME_SERIES_ENTITIES =
|
||||
Set.of(
|
||||
ReportData.ReportDataType.ENTITY_REPORT_DATA.value(),
|
||||
ReportData.ReportDataType.RAW_COST_ANALYSIS_REPORT_DATA.value(),
|
||||
ReportData.ReportDataType.WEB_ANALYTIC_USER_ACTIVITY_REPORT_DATA.value(),
|
||||
ReportData.ReportDataType.WEB_ANALYTIC_ENTITY_VIEW_REPORT_DATA.value(),
|
||||
ReportData.ReportDataType.AGGREGATED_COST_ANALYSIS_REPORT_DATA.value(),
|
||||
TEST_CASE_RESOLUTION_STATUS,
|
||||
TEST_CASE_RESULT,
|
||||
QUERY_COST_RECORD);
|
||||
|
||||
private static final long MONITOR_POLL_INTERVAL_MS = 2000;
|
||||
|
||||
private final CollectionDAO collectionDAO;
|
||||
private final SearchRepository searchRepository;
|
||||
private final EventPublisherJob jobData;
|
||||
private final UUID appId;
|
||||
private final Long appStartTime;
|
||||
private final String createdBy;
|
||||
|
||||
private final CompositeProgressListener listeners = new CompositeProgressListener();
|
||||
private final AtomicBoolean stopped = new AtomicBoolean(false);
|
||||
private final AtomicReference<Stats> currentStats = new AtomicReference<>();
|
||||
|
||||
private volatile DistributedSearchIndexExecutor distributedExecutor;
|
||||
private volatile BulkSink searchIndexSink;
|
||||
private volatile ReindexingConfiguration config;
|
||||
|
||||
public DistributedIndexingStrategy(
|
||||
CollectionDAO collectionDAO,
|
||||
SearchRepository searchRepository,
|
||||
EventPublisherJob jobData,
|
||||
UUID appId,
|
||||
Long appStartTime,
|
||||
String createdBy) {
|
||||
this.collectionDAO = collectionDAO;
|
||||
this.searchRepository = searchRepository;
|
||||
this.jobData = jobData;
|
||||
this.appId = appId;
|
||||
this.appStartTime = appStartTime;
|
||||
this.createdBy = createdBy;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void addListener(ReindexingProgressListener listener) {
|
||||
listeners.addListener(listener);
|
||||
}
|
||||
|
||||
@Override
|
||||
public ExecutionResult execute(ReindexingConfiguration config, ReindexingJobContext context) {
|
||||
long startTime = System.currentTimeMillis();
|
||||
try {
|
||||
return doExecute(config, context, startTime);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Distributed reindexing failed", e);
|
||||
Stats stats = currentStats.get();
|
||||
return ExecutionResult.fromStats(stats, ExecutionResult.Status.FAILED, startTime);
|
||||
}
|
||||
}
|
||||
|
||||
private ExecutionResult doExecute(
|
||||
ReindexingConfiguration config, ReindexingJobContext context, long startTime)
|
||||
throws Exception {
|
||||
|
||||
this.config = config;
|
||||
LOG.info("Starting distributed reindexing for entities: {}", config.entities());
|
||||
|
||||
Stats stats = initializeTotalRecords(config.entities());
|
||||
currentStats.set(stats);
|
||||
|
||||
int partitionSize = jobData.getPartitionSize() != null ? jobData.getPartitionSize() : 10000;
|
||||
distributedExecutor = new DistributedSearchIndexExecutor(collectionDAO, partitionSize);
|
||||
distributedExecutor.performStartupRecovery();
|
||||
|
||||
distributedExecutor.addListener(listeners);
|
||||
|
||||
SearchIndexJob distributedJob =
|
||||
distributedExecutor.createJob(config.entities(), jobData, createdBy, config);
|
||||
|
||||
LOG.info(
|
||||
"Created distributed job {} with {} total records",
|
||||
distributedJob.getId(),
|
||||
distributedJob.getTotalRecords());
|
||||
|
||||
searchIndexSink =
|
||||
searchRepository.createBulkSink(
|
||||
config.batchSize(), config.maxConcurrentRequests(), config.payloadSize());
|
||||
|
||||
RecreateIndexHandler recreateIndexHandler = searchRepository.createReindexHandler();
|
||||
ReindexContext recreateContext = null;
|
||||
|
||||
if (config.recreateIndex()) {
|
||||
recreateContext = recreateIndexHandler.reCreateIndexes(config.entities());
|
||||
if (recreateContext != null && !recreateContext.isEmpty()) {
|
||||
distributedExecutor.updateStagedIndexMapping(recreateContext.getStagedIndexMapping());
|
||||
}
|
||||
}
|
||||
|
||||
distributedExecutor.setAppContext(appId, appStartTime);
|
||||
distributedExecutor.execute(
|
||||
searchIndexSink, recreateContext, Boolean.TRUE.equals(config.recreateIndex()), config);
|
||||
|
||||
monitorDistributedJob(distributedJob.getId());
|
||||
|
||||
flushAndAwaitSink();
|
||||
|
||||
SearchIndexJob finalJob = distributedExecutor.getJobWithFreshStats();
|
||||
Map<String, Object> metadata = new HashMap<>();
|
||||
|
||||
if (finalJob != null) {
|
||||
StepStats sinkStats = searchIndexSink != null ? searchIndexSink.getStats() : null;
|
||||
updateStatsFromDistributedJob(stats, finalJob, sinkStats);
|
||||
|
||||
if (searchIndexSink != null) {
|
||||
StepStats sinkVectorStats = searchIndexSink.getVectorStats();
|
||||
if (sinkVectorStats != null && sinkVectorStats.getTotalRecords() > 0) {
|
||||
stats.setVectorStats(sinkVectorStats);
|
||||
}
|
||||
}
|
||||
|
||||
if (finalJob.getServerStats() != null && !finalJob.getServerStats().isEmpty()) {
|
||||
metadata.put("serverStats", finalJob.getServerStats());
|
||||
metadata.put("serverCount", finalJob.getServerStats().size());
|
||||
metadata.put("distributedJobId", finalJob.getId().toString());
|
||||
}
|
||||
}
|
||||
|
||||
currentStats.set(stats);
|
||||
|
||||
boolean success =
|
||||
finalizeAllEntityReindex(
|
||||
recreateIndexHandler,
|
||||
recreateContext,
|
||||
!stopped.get() && !hasIncompleteProcessing(stats));
|
||||
|
||||
ExecutionResult.Status resultStatus = determineStatus(stats);
|
||||
|
||||
StatsReconciler.reconcile(stats);
|
||||
|
||||
return ExecutionResult.builder()
|
||||
.status(resultStatus)
|
||||
.totalRecords(stats.getJobStats().getTotalRecords())
|
||||
.successRecords(stats.getJobStats().getSuccessRecords())
|
||||
.failedRecords(stats.getJobStats().getFailedRecords())
|
||||
.startTime(startTime)
|
||||
.endTime(System.currentTimeMillis())
|
||||
.finalStats(stats)
|
||||
.metadata(metadata)
|
||||
.build();
|
||||
}
|
||||
|
||||
private void flushAndAwaitSink() {
|
||||
if (searchIndexSink == null) {
|
||||
return;
|
||||
}
|
||||
|
||||
int pendingVectorTasks = searchIndexSink.getPendingVectorTaskCount();
|
||||
if (pendingVectorTasks > 0) {
|
||||
LOG.info("Waiting for {} pending vector embedding tasks to complete", pendingVectorTasks);
|
||||
boolean vectorComplete = searchIndexSink.awaitVectorCompletion(120);
|
||||
if (!vectorComplete) {
|
||||
LOG.warn("Vector embedding wait timed out - some tasks may not be reflected in stats");
|
||||
}
|
||||
}
|
||||
|
||||
LOG.info("Flushing sink and waiting for pending bulk requests");
|
||||
boolean flushComplete = searchIndexSink.flushAndAwait(60);
|
||||
if (!flushComplete) {
|
||||
LOG.warn("Sink flush timed out - some requests may not be reflected in stats");
|
||||
}
|
||||
|
||||
try {
|
||||
searchIndexSink.close();
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error closing search index sink", e);
|
||||
}
|
||||
}
|
||||
|
||||
private void monitorDistributedJob(UUID jobId) {
|
||||
CountDownLatch completionLatch = new CountDownLatch(1);
|
||||
ScheduledExecutorService monitor =
|
||||
Executors.newSingleThreadScheduledExecutor(
|
||||
Thread.ofPlatform().name("distributed-monitor").factory());
|
||||
|
||||
try {
|
||||
monitor.scheduleAtFixedRate(
|
||||
() -> {
|
||||
try {
|
||||
if (stopped.get()) {
|
||||
LOG.info("Stop signal received, stopping distributed job");
|
||||
distributedExecutor.stop();
|
||||
completionLatch.countDown();
|
||||
return;
|
||||
}
|
||||
|
||||
SearchIndexJob job = distributedExecutor.getJobWithFreshStats();
|
||||
if (job == null) {
|
||||
completionLatch.countDown();
|
||||
return;
|
||||
}
|
||||
|
||||
IndexJobStatus status = job.getStatus();
|
||||
if (status == IndexJobStatus.COMPLETED
|
||||
|| status == IndexJobStatus.COMPLETED_WITH_ERRORS
|
||||
|| status == IndexJobStatus.FAILED
|
||||
|| status == IndexJobStatus.STOPPED) {
|
||||
LOG.info("Distributed job {} completed with status: {}", jobId, status);
|
||||
completionLatch.countDown();
|
||||
return;
|
||||
}
|
||||
|
||||
updateStatsFromDistributedJob(currentStats.get(), job, null);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error in distributed job monitor task for job {}", jobId, e);
|
||||
}
|
||||
},
|
||||
0,
|
||||
MONITOR_POLL_INTERVAL_MS,
|
||||
TimeUnit.MILLISECONDS);
|
||||
|
||||
completionLatch.await();
|
||||
} catch (InterruptedException e) {
|
||||
Thread.currentThread().interrupt();
|
||||
LOG.warn("Distributed job monitoring interrupted");
|
||||
} finally {
|
||||
monitor.shutdownNow();
|
||||
}
|
||||
}
|
||||
|
||||
private void updateStatsFromDistributedJob(
|
||||
Stats stats, SearchIndexJob distributedJob, StepStats actualSinkStats) {
|
||||
if (stats == null) {
|
||||
return;
|
||||
}
|
||||
|
||||
CollectionDAO.SearchIndexServerStatsDAO.AggregatedServerStats serverStatsAggr = null;
|
||||
try {
|
||||
serverStatsAggr =
|
||||
Entity.getCollectionDAO()
|
||||
.searchIndexServerStatsDAO()
|
||||
.getAggregatedStats(distributedJob.getId().toString());
|
||||
if (serverStatsAggr != null) {
|
||||
LOG.info(
|
||||
"Fetched aggregated server stats for job {}: readerSuccess={}, readerFailed={}, "
|
||||
+ "sinkSuccess={}, sinkFailed={}",
|
||||
distributedJob.getId(),
|
||||
serverStatsAggr.readerSuccess(),
|
||||
serverStatsAggr.readerFailed(),
|
||||
serverStatsAggr.sinkSuccess(),
|
||||
serverStatsAggr.sinkFailed());
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.debug("Could not fetch aggregated server stats for job {}", distributedJob.getId(), e);
|
||||
}
|
||||
|
||||
long successRecords;
|
||||
long failedRecords;
|
||||
String statsSource;
|
||||
|
||||
if (serverStatsAggr != null && serverStatsAggr.sinkSuccess() > 0) {
|
||||
successRecords = serverStatsAggr.sinkSuccess();
|
||||
failedRecords =
|
||||
serverStatsAggr.readerFailed()
|
||||
+ serverStatsAggr.sinkFailed()
|
||||
+ serverStatsAggr.processFailed();
|
||||
statsSource = "serverStatsTable";
|
||||
} else if (actualSinkStats != null) {
|
||||
successRecords = actualSinkStats.getSuccessRecords();
|
||||
failedRecords = actualSinkStats.getFailedRecords();
|
||||
statsSource = "localSink";
|
||||
} else {
|
||||
successRecords = distributedJob.getSuccessRecords();
|
||||
failedRecords = distributedJob.getFailedRecords();
|
||||
statsSource = "partition-based";
|
||||
}
|
||||
|
||||
LOG.debug(
|
||||
"Stats source: {}, success={}, failed={}", statsSource, successRecords, failedRecords);
|
||||
|
||||
StepStats jobStats = stats.getJobStats();
|
||||
if (jobStats != null) {
|
||||
jobStats.setSuccessRecords(saturatedToInt(successRecords));
|
||||
jobStats.setFailedRecords(saturatedToInt(failedRecords));
|
||||
}
|
||||
|
||||
StepStats readerStats = stats.getReaderStats();
|
||||
if (readerStats != null) {
|
||||
readerStats.setTotalRecords(saturatedToInt(distributedJob.getTotalRecords()));
|
||||
long readerFailed = serverStatsAggr != null ? serverStatsAggr.readerFailed() : 0;
|
||||
long readerWarnings = serverStatsAggr != null ? serverStatsAggr.readerWarnings() : 0;
|
||||
long readerSuccess =
|
||||
serverStatsAggr != null
|
||||
? serverStatsAggr.readerSuccess()
|
||||
: distributedJob.getTotalRecords() - readerFailed - readerWarnings;
|
||||
readerStats.setSuccessRecords(saturatedToInt(readerSuccess));
|
||||
readerStats.setFailedRecords(saturatedToInt(readerFailed));
|
||||
readerStats.setWarningRecords(saturatedToInt(readerWarnings));
|
||||
}
|
||||
|
||||
StepStats processStats = stats.getProcessStats();
|
||||
if (processStats != null && serverStatsAggr != null) {
|
||||
long processSuccess = serverStatsAggr.processSuccess();
|
||||
long processFailed = serverStatsAggr.processFailed();
|
||||
processStats.setTotalRecords(saturatedToInt(processSuccess + processFailed));
|
||||
processStats.setSuccessRecords(saturatedToInt(processSuccess));
|
||||
processStats.setFailedRecords(saturatedToInt(processFailed));
|
||||
}
|
||||
|
||||
StepStats sinkStats = stats.getSinkStats();
|
||||
if (sinkStats != null) {
|
||||
if (serverStatsAggr != null) {
|
||||
long sinkSuccess = serverStatsAggr.sinkSuccess();
|
||||
long sinkFailed = serverStatsAggr.sinkFailed();
|
||||
long actualSinkTotal = sinkSuccess + sinkFailed;
|
||||
sinkStats.setTotalRecords(saturatedToInt(actualSinkTotal));
|
||||
sinkStats.setSuccessRecords(saturatedToInt(sinkSuccess));
|
||||
sinkStats.setFailedRecords(saturatedToInt(sinkFailed));
|
||||
} else {
|
||||
long sinkTotal = distributedJob.getTotalRecords();
|
||||
sinkStats.setTotalRecords(saturatedToInt(sinkTotal));
|
||||
sinkStats.setSuccessRecords(saturatedToInt(successRecords));
|
||||
sinkStats.setFailedRecords(saturatedToInt(failedRecords));
|
||||
}
|
||||
}
|
||||
|
||||
StepStats vectorStats = stats.getVectorStats();
|
||||
if (vectorStats != null && serverStatsAggr != null) {
|
||||
long vectorSuccess = serverStatsAggr.vectorSuccess();
|
||||
long vectorFailed = serverStatsAggr.vectorFailed();
|
||||
vectorStats.setTotalRecords(saturatedToInt(vectorSuccess + vectorFailed));
|
||||
vectorStats.setSuccessRecords(saturatedToInt(vectorSuccess));
|
||||
vectorStats.setFailedRecords(saturatedToInt(vectorFailed));
|
||||
}
|
||||
|
||||
if (distributedJob.getEntityStats() != null && stats.getEntityStats() != null) {
|
||||
for (Map.Entry<String, SearchIndexJob.EntityTypeStats> entry :
|
||||
distributedJob.getEntityStats().entrySet()) {
|
||||
StepStats entityStats =
|
||||
stats.getEntityStats().getAdditionalProperties().get(entry.getKey());
|
||||
if (entityStats != null) {
|
||||
entityStats.setSuccessRecords(saturatedToInt(entry.getValue().getSuccessRecords()));
|
||||
entityStats.setFailedRecords(saturatedToInt(entry.getValue().getFailedRecords()));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
StatsReconciler.reconcile(stats);
|
||||
}
|
||||
|
||||
private static int saturatedToInt(long value) {
|
||||
return (int) Math.min(value, Integer.MAX_VALUE);
|
||||
}
|
||||
|
||||
private ExecutionResult.Status determineStatus(Stats stats) {
|
||||
if (stopped.get()) {
|
||||
return ExecutionResult.Status.STOPPED;
|
||||
}
|
||||
if (hasIncompleteProcessing(stats)) {
|
||||
return ExecutionResult.Status.COMPLETED_WITH_ERRORS;
|
||||
}
|
||||
return ExecutionResult.Status.COMPLETED;
|
||||
}
|
||||
|
||||
private boolean hasIncompleteProcessing(Stats stats) {
|
||||
if (stats == null || stats.getJobStats() == null) {
|
||||
return false;
|
||||
}
|
||||
StepStats jobStats = stats.getJobStats();
|
||||
long failed = jobStats.getFailedRecords() != null ? jobStats.getFailedRecords() : 0;
|
||||
long processed = jobStats.getSuccessRecords() != null ? jobStats.getSuccessRecords() : 0;
|
||||
long total = jobStats.getTotalRecords() != null ? jobStats.getTotalRecords() : 0;
|
||||
return failed > 0 || (total > 0 && processed < total);
|
||||
}
|
||||
|
||||
private boolean finalizeAllEntityReindex(
|
||||
RecreateIndexHandler recreateIndexHandler,
|
||||
ReindexContext recreateContext,
|
||||
boolean finalSuccess) {
|
||||
if (recreateIndexHandler == null || recreateContext == null) {
|
||||
return finalSuccess;
|
||||
}
|
||||
|
||||
Set<String> promotedEntities = Collections.emptySet();
|
||||
if (distributedExecutor != null && distributedExecutor.getEntityTracker() != null) {
|
||||
promotedEntities = distributedExecutor.getEntityTracker().getPromotedEntities();
|
||||
}
|
||||
|
||||
// Get per-entity stats for determining per-entity success
|
||||
Map<String, SearchIndexJob.EntityTypeStats> entityStatsMap = Collections.emptyMap();
|
||||
if (distributedExecutor != null) {
|
||||
SearchIndexJob finalJob = distributedExecutor.getJobWithFreshStats();
|
||||
if (finalJob != null && finalJob.getEntityStats() != null) {
|
||||
entityStatsMap = finalJob.getEntityStats();
|
||||
}
|
||||
}
|
||||
|
||||
LOG.debug(
|
||||
"Finalization: finalSuccess={}, promotedEntities={}, allEntities={}",
|
||||
finalSuccess,
|
||||
promotedEntities,
|
||||
recreateContext.getEntities());
|
||||
|
||||
Set<String> entitiesToFinalize = new HashSet<>(recreateContext.getEntities());
|
||||
entitiesToFinalize.removeAll(promotedEntities);
|
||||
|
||||
boolean hasVectorIndex = entitiesToFinalize.remove(VectorIndexService.VECTOR_INDEX_KEY);
|
||||
|
||||
LOG.debug("Entities to finalize={}, already promoted={}", entitiesToFinalize, promotedEntities);
|
||||
|
||||
try {
|
||||
if (!entitiesToFinalize.isEmpty()) {
|
||||
LOG.info(
|
||||
"Finalizing {} remaining entities (already promoted: {})",
|
||||
entitiesToFinalize.size(),
|
||||
promotedEntities.size());
|
||||
|
||||
for (String entityType : entitiesToFinalize) {
|
||||
try {
|
||||
boolean entitySuccess = computeEntitySuccess(entityType, entityStatsMap);
|
||||
LOG.debug(
|
||||
"Finalizing entity '{}' with perEntitySuccess={} (globalSuccess={})",
|
||||
entityType,
|
||||
entitySuccess,
|
||||
finalSuccess);
|
||||
finalizeEntityReindex(recreateIndexHandler, recreateContext, entityType, entitySuccess);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to finalize reindex for entity: {}", entityType, ex);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (hasVectorIndex) {
|
||||
boolean vectorSuccess =
|
||||
finalSuccess
|
||||
|| (currentStats.get() != null && !hasIncompleteProcessing(currentStats.get()));
|
||||
try {
|
||||
finalizeEntityReindex(
|
||||
recreateIndexHandler,
|
||||
recreateContext,
|
||||
VectorIndexService.VECTOR_INDEX_KEY,
|
||||
vectorSuccess);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to finalize vector index", ex);
|
||||
}
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error during entity finalization", e);
|
||||
}
|
||||
|
||||
return finalSuccess;
|
||||
}
|
||||
|
||||
private boolean computeEntitySuccess(
|
||||
String entityType, Map<String, SearchIndexJob.EntityTypeStats> entityStatsMap) {
|
||||
if (entityStatsMap == null || entityStatsMap.isEmpty()) {
|
||||
return false;
|
||||
}
|
||||
SearchIndexJob.EntityTypeStats stats = entityStatsMap.get(entityType);
|
||||
if (stats == null) {
|
||||
// Entity not in stats means 0 records — nothing to index = success
|
||||
return true;
|
||||
}
|
||||
return stats.getFailedRecords() == 0
|
||||
&& stats.getSuccessRecords() + stats.getFailedRecords() >= stats.getTotalRecords();
|
||||
}
|
||||
|
||||
private void finalizeEntityReindex(
|
||||
RecreateIndexHandler recreateIndexHandler,
|
||||
ReindexContext recreateContext,
|
||||
String entityType,
|
||||
boolean success) {
|
||||
try {
|
||||
var entityReindexContext =
|
||||
org.openmetadata.service.search.EntityReindexContext.builder()
|
||||
.entityType(entityType)
|
||||
.originalIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.canonicalIndex(recreateContext.getCanonicalIndex(entityType).orElse(null))
|
||||
.activeIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.stagedIndex(recreateContext.getStagedIndex(entityType).orElse(null))
|
||||
.canonicalAliases(recreateContext.getCanonicalAlias(entityType).orElse(null))
|
||||
.existingAliases(recreateContext.getExistingAliases(entityType))
|
||||
.parentAliases(
|
||||
new HashSet<>(listOrEmpty(recreateContext.getParentAliases(entityType))))
|
||||
.build();
|
||||
|
||||
recreateIndexHandler.finalizeReindex(entityReindexContext, success);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to finalize index recreation flow for {}", entityType, ex);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public Optional<Stats> getStats() {
|
||||
return Optional.ofNullable(currentStats.get());
|
||||
}
|
||||
|
||||
@Override
|
||||
public void stop() {
|
||||
if (stopped.compareAndSet(false, true)) {
|
||||
LOG.info("Stopping distributed indexing strategy");
|
||||
|
||||
if (distributedExecutor != null) {
|
||||
try {
|
||||
distributedExecutor.stop();
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error stopping distributed executor", e);
|
||||
}
|
||||
}
|
||||
// Do NOT close the sink here — workers may still be writing to it.
|
||||
// The sink is properly flushed and closed by flushAndAwaitSink() in doExecute()
|
||||
// after the monitor exits and the executor's finally block completes.
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public boolean isStopped() {
|
||||
return stopped.get();
|
||||
}
|
||||
|
||||
Stats initializeTotalRecords(Set<String> entities) {
|
||||
Stats stats = new Stats();
|
||||
stats.setEntityStats(new org.openmetadata.schema.system.EntityStats());
|
||||
stats.setJobStats(new StepStats());
|
||||
stats.setReaderStats(new StepStats());
|
||||
stats.setProcessStats(new StepStats());
|
||||
stats.setSinkStats(new StepStats());
|
||||
stats.setVectorStats(new StepStats());
|
||||
|
||||
List<String> ordered = EntityPriority.sortByPriority(entities);
|
||||
int total = 0;
|
||||
for (String entityType : ordered) {
|
||||
int entityTotal = getEntityTotal(entityType);
|
||||
total += entityTotal;
|
||||
|
||||
StepStats entityStats = new StepStats();
|
||||
entityStats.setTotalRecords(entityTotal);
|
||||
entityStats.setSuccessRecords(0);
|
||||
entityStats.setFailedRecords(0);
|
||||
stats.getEntityStats().getAdditionalProperties().put(entityType, entityStats);
|
||||
}
|
||||
|
||||
stats.getJobStats().setTotalRecords(total);
|
||||
stats.getJobStats().setSuccessRecords(0);
|
||||
stats.getJobStats().setFailedRecords(0);
|
||||
|
||||
stats.getReaderStats().setTotalRecords(total);
|
||||
stats.getReaderStats().setSuccessRecords(0);
|
||||
stats.getReaderStats().setFailedRecords(0);
|
||||
|
||||
stats.getProcessStats().setTotalRecords(0);
|
||||
stats.getProcessStats().setSuccessRecords(0);
|
||||
stats.getProcessStats().setFailedRecords(0);
|
||||
|
||||
stats.getSinkStats().setTotalRecords(0);
|
||||
stats.getSinkStats().setSuccessRecords(0);
|
||||
stats.getSinkStats().setFailedRecords(0);
|
||||
|
||||
stats.getVectorStats().setTotalRecords(0);
|
||||
stats.getVectorStats().setSuccessRecords(0);
|
||||
stats.getVectorStats().setFailedRecords(0);
|
||||
|
||||
return stats;
|
||||
}
|
||||
|
||||
private int getEntityTotal(String entityType) {
|
||||
try {
|
||||
String correctedType = "queryCostResult".equals(entityType) ? QUERY_COST_RECORD : entityType;
|
||||
|
||||
if (!TIME_SERIES_ENTITIES.contains(correctedType)) {
|
||||
return Entity.getEntityRepository(correctedType)
|
||||
.getDao()
|
||||
.listCount(new ListFilter(Include.ALL));
|
||||
} else {
|
||||
ListFilter listFilter = new ListFilter(null);
|
||||
EntityTimeSeriesRepository<?> repository;
|
||||
|
||||
if (isDataInsightIndex(correctedType)) {
|
||||
listFilter.addQueryParam("entityFQNHash", FullyQualifiedName.buildHash(correctedType));
|
||||
repository = Entity.getEntityTimeSeriesRepository(Entity.ENTITY_REPORT_DATA);
|
||||
} else {
|
||||
repository = Entity.getEntityTimeSeriesRepository(correctedType);
|
||||
}
|
||||
|
||||
if (config != null) {
|
||||
long startTs = config.getTimeSeriesStartTs(correctedType);
|
||||
if (startTs > 0) {
|
||||
long endTs = System.currentTimeMillis();
|
||||
return repository.getTimeSeriesDao().listCount(listFilter, startTs, endTs, false);
|
||||
}
|
||||
}
|
||||
return repository.getTimeSeriesDao().listCount(listFilter);
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.debug("Error getting total for '{}'", entityType, e);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
private boolean isDataInsightIndex(String entityType) {
|
||||
return entityType.endsWith("ReportData");
|
||||
}
|
||||
|
||||
DistributedSearchIndexExecutor getDistributedExecutor() {
|
||||
return distributedExecutor;
|
||||
}
|
||||
}
|
||||
|
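Three small predicates in `DistributedIndexingStrategy` above decide how a run is reported: `saturatedToInt` clamps `long` counters into the `int` fields of `StepStats`, `hasIncompleteProcessing` flags a job with failures or missing records, and `computeEntitySuccess` decides per entity whether a staged index may be promoted. The sketch below restates them over plain `long` arguments so the edge cases are easy to check. `ReindexStatsSketch` and its method signatures are hypothetical; the committed methods read these values from `Stats` and `SearchIndexJob.EntityTypeStats` objects.

```java
// Hypothetical restatement of the stats predicates from the diff, over plain longs.
class ReindexStatsSketch {

  // Clamp a long counter to int range at the upper bound (counts are assumed non-negative).
  static int saturatedToInt(long value) {
    return (int) Math.min(value, Integer.MAX_VALUE);
  }

  // Job-level check: any failure, or fewer successes than the expected total.
  static boolean hasIncompleteProcessing(long total, long success, long failed) {
    return failed > 0 || (total > 0 && success < total);
  }

  // Entity-level check: no failures and every expected record accounted for.
  static boolean entitySucceeded(long total, long success, long failed) {
    return failed == 0 && success + failed >= total;
  }

  public static void main(String[] args) {
    System.out.println(saturatedToInt(5_000_000_000L) == Integer.MAX_VALUE); // true
    System.out.println(saturatedToInt(42L)); // 42
    System.out.println(hasIncompleteProcessing(10, 9, 0)); // true
    System.out.println(hasIncompleteProcessing(0, 0, 0)); // false
    System.out.println(entitySucceeded(10, 10, 0)); // true
    System.out.println(entitySucceeded(10, 9, 1)); // false
  }
}
```

Note that an empty entity (total of zero) counts as a success at the entity level, which matches the diff's treatment of entities that are absent from the stats map.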
|
@@ -66,6 +66,10 @@ public class ElasticSearchBulkSink implements BulkSink {
|
|||
private final AtomicLong totalSuccess = new AtomicLong(0);
|
||||
private final AtomicLong totalFailed = new AtomicLong(0);
|
||||
|
||||
// Process stage metrics (document building/transformation)
|
||||
private final AtomicLong processSuccess = new AtomicLong(0);
|
||||
private final AtomicLong processFailed = new AtomicLong(0);
|
||||
|
||||
// Configuration
|
||||
private volatile int batchSize;
|
||||
private volatile int maxConcurrentRequests;
|
||||
|
|
@ -99,6 +103,7 @@ public class ElasticSearchBulkSink implements BulkSink {
|
|||
concurrentRequests,
|
||||
maxPayloadSizeBytes / (1024 * 1024));
|
||||
|
||||
BulkCircuitBreaker circuitBreaker = new BulkCircuitBreaker(5, 30_000, 10_000);
|
||||
return new CustomBulkProcessor(
|
||||
searchClient,
|
||||
bulkActions,
|
||||
|
|
@ -110,7 +115,8 @@ public class ElasticSearchBulkSink implements BulkSink {
|
|||
totalSubmitted,
|
||||
totalSuccess,
|
||||
totalFailed,
|
||||
this::updateStats);
|
||||
this::updateStats,
|
||||
circuitBreaker);
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
@@ -258,12 +264,14 @@ public class ElasticSearchBulkSink implements BulkSink {
         tracker.incrementPendingSink();
       }
       bulkProcessor.add(operation, docId, entityType, tracker, estimatedSize);
+      processSuccess.incrementAndGet();
       if (tracker != null) {
         tracker.recordProcess(StatsResult.SUCCESS);
       }
     } catch (EntityNotFoundException e) {
       LOG.error("Entity Not Found Due to : {}", e.getMessage(), e);
       totalFailed.incrementAndGet();
+      processFailed.incrementAndGet();
       updateStats();
       if (tracker != null) {
         tracker.recordProcess(StatsResult.FAILED);
@@ -274,12 +282,14 @@ public class ElasticSearchBulkSink implements BulkSink {
             entityTypeName,
             entity.getId() != null ? entity.getId().toString() : null,
             entity.getFullyQualifiedName(),
-            e.getMessage());
+            e.getMessage(),
+            IndexingFailureRecorder.FailureStage.PROCESS);
       }
     } catch (Exception e) {
       LOG.error(
           "Encountered Issue while building SearchDoc from Entity Due to : {}", e.getMessage(), e);
       totalFailed.incrementAndGet();
+      processFailed.incrementAndGet();
       updateStats();
       if (tracker != null) {
         tracker.recordProcess(StatsResult.FAILED);
@@ -290,7 +300,8 @@ public class ElasticSearchBulkSink implements BulkSink {
             entityTypeName,
             entity.getId() != null ? entity.getId().toString() : null,
             entity.getFullyQualifiedName(),
-            e.getMessage());
+            e.getMessage(),
+            IndexingFailureRecorder.FailureStage.PROCESS);
       }
     }
   }
@@ -317,12 +328,14 @@ public class ElasticSearchBulkSink implements BulkSink {
         tracker.incrementPendingSink();
       }
       bulkProcessor.add(operation, docId, entityType, tracker, estimatedSize);
+      processSuccess.incrementAndGet();
       if (tracker != null) {
         tracker.recordProcess(StatsResult.SUCCESS);
       }
     } catch (EntityNotFoundException e) {
       LOG.error("Entity Not Found Due to : {}", e.getMessage(), e);
       totalFailed.incrementAndGet();
+      processFailed.incrementAndGet();
       updateStats();
       if (tracker != null) {
         tracker.recordProcess(StatsResult.FAILED);
@@ -332,12 +345,14 @@ public class ElasticSearchBulkSink implements BulkSink {
             entityType,
             entity.getId() != null ? entity.getId().toString() : null,
             null,
-            e.getMessage());
+            e.getMessage(),
+            IndexingFailureRecorder.FailureStage.PROCESS);
       }
     } catch (Exception e) {
       LOG.error(
           "Encountered Issue while building SearchDoc from Entity Due to : {}", e.getMessage(), e);
       totalFailed.incrementAndGet();
+      processFailed.incrementAndGet();
       updateStats();
       if (tracker != null) {
         tracker.recordProcess(StatsResult.FAILED);
@@ -347,7 +362,8 @@ public class ElasticSearchBulkSink implements BulkSink {
             entityType,
             entity.getId() != null ? entity.getId().toString() : null,
             null,
-            e.getMessage());
+            e.getMessage(),
+            IndexingFailureRecorder.FailureStage.PROCESS);
       }
     }
   }
@@ -377,6 +393,16 @@ public class ElasticSearchBulkSink implements BulkSink {
         .withFailedRecords((int) failed);
   }

+  @Override
+  public StepStats getProcessStats() {
+    long success = processSuccess.get();
+    long failed = processFailed.get();
+    return new StepStats()
+        .withTotalRecords((int) (success + failed))
+        .withSuccessRecords((int) success)
+        .withFailedRecords((int) failed);
+  }
+
   @Override
   public void close() {
     try {
@@ -404,6 +430,11 @@ public class ElasticSearchBulkSink implements BulkSink {
     }
   }

+  @Override
+  public int getActiveBulkRequestCount() {
+    return bulkProcessor.activeBulkRequests.get();
+  }
+
   @Override
   public boolean flushAndAwait(int timeoutSeconds) {
     try {
@@ -501,6 +532,7 @@ public class ElasticSearchBulkSink implements BulkSink {
     private final int maxRetries;
     private volatile boolean closed = false;
     private volatile FailureCallback failureCallback;
+    private final BulkCircuitBreaker circuitBreaker;

     CustomBulkProcessor(
         ElasticSearchClient client,
@@ -513,7 +545,8 @@ public class ElasticSearchBulkSink implements BulkSink {
         AtomicLong totalSubmitted,
         AtomicLong totalSuccess,
         AtomicLong totalFailed,
-        Runnable statsUpdater) {
+        Runnable statsUpdater,
+        BulkCircuitBreaker circuitBreaker) {
       this.asyncClient = new ElasticsearchAsyncClient(client.getNewClient()._transport());
       this.bulkActions = bulkActions;
       this.maxPayloadSizeBytes = maxPayloadSizeBytes;
@@ -524,6 +557,7 @@ public class ElasticSearchBulkSink implements BulkSink {
       this.totalSuccess = totalSuccess;
       this.totalFailed = totalFailed;
       this.statsUpdater = statsUpdater;
+      this.circuitBreaker = circuitBreaker;
       this.scheduler = Executors.newScheduledThreadPool(1);

       scheduler.scheduleAtFixedRate(
@@ -668,17 +702,35 @@ public class ElasticSearchBulkSink implements BulkSink {

     private void executeBulkWithRetry(
         List<BulkOperation> operations, long executionId, int numberOfActions, int attemptNumber) {
+      if (!circuitBreaker.allowRequest()) {
+        LOG.warn(
+            "Circuit breaker OPEN - fail-fast for bulk request {} with {} actions",
+            executionId,
+            numberOfActions);
+        totalFailed.addAndGet(numberOfActions);
+        statsUpdater.run();
+        activeBulkRequests.decrementAndGet();
+        concurrentRequestSemaphore.release();
+        return;
+      }
+
       CompletableFuture<BulkResponse> future =
           asyncClient.bulk(b -> b.operations(operations).refresh(Refresh.False));

       future.whenComplete(
           (response, error) -> {
+            boolean retryScheduled = false;
             try {
               if (error != null) {
-                handleBulkFailure(operations, executionId, numberOfActions, attemptNumber, error);
+                circuitBreaker.recordFailure();
+                retryScheduled =
+                    handleBulkFailure(
+                        operations, executionId, numberOfActions, attemptNumber, error);
               } else if (response.errors()) {
+                circuitBreaker.recordSuccess();
                 handlePartialFailure(response, executionId, numberOfActions);
               } else {
+                circuitBreaker.recordSuccess();
                 totalSuccess.addAndGet(numberOfActions);
                 LOG.debug(
                     "Bulk request {} completed successfully with {} actions",
@@ -707,9 +759,7 @@ public class ElasticSearchBulkSink implements BulkSink {
                 }
               }
             } finally {
-              if (error != null && shouldRetry(attemptNumber, error)) {
-                // Don't release resources yet, we're retrying
-              } else {
+              if (!retryScheduled) {
                 activeBulkRequests.decrementAndGet();
                 concurrentRequestSemaphore.release();
               }
@@ -717,13 +767,14 @@ public class ElasticSearchBulkSink implements BulkSink {
           });
     }

-    private void handleBulkFailure(
+    private boolean handleBulkFailure(
         List<BulkOperation> operations,
         long executionId,
         int numberOfActions,
         int attemptNumber,
         Throwable error) {
-      if (shouldRetry(attemptNumber, error)) {
+      if (shouldRetry(attemptNumber, error)
+          && circuitBreaker.getState() != BulkCircuitBreaker.State.OPEN) {
         long backoffTime = calculateBackoff(attemptNumber);
         LOG.warn(
             "Bulk request {} failed (attempt {}), retrying in {}ms: {}",
@@ -736,6 +787,7 @@ public class ElasticSearchBulkSink implements BulkSink {
             () -> executeBulkWithRetry(operations, executionId, numberOfActions, attemptNumber + 1),
             backoffTime,
             TimeUnit.MILLISECONDS);
+        return true;
       } else {
         totalFailed.addAndGet(numberOfActions);
         LOG.error(
@@ -758,11 +810,17 @@ public class ElasticSearchBulkSink implements BulkSink {
               tracker.recordSink(StatsResult.FAILED);
             }
             if (failureCallback != null) {
-              failureCallback.onFailure(entityType, docId, null, error.getMessage());
+              failureCallback.onFailure(
+                  entityType,
+                  docId,
+                  null,
+                  error.getMessage(),
+                  IndexingFailureRecorder.FailureStage.SINK);
             }
           }
         }
         statsUpdater.run();
+        return false;
       }
     }

@@ -791,7 +849,8 @@ public class ElasticSearchBulkSink implements BulkSink {
               tracker.recordSink(StatsResult.FAILED);
             }
             if (failureCallback != null) {
-              failureCallback.onFailure(entityType, docId, null, failureMessage);
+              failureCallback.onFailure(
+                  entityType, docId, null, failureMessage, IndexingFailureRecorder.FailureStage.SINK);
             }
           } else {
             // Clean up on success
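The ElasticSearchBulkSink hunks above call `allowRequest()`, `recordSuccess()`, `recordFailure()` and `getState()` on a `BulkCircuitBreaker`, but that class is not part of this section. The following is only a minimal sketch of the general circuit-breaker pattern those call sites suggest. `SketchCircuitBreaker`, its two thresholds, the `HALF_OPEN` state and the injected clock are assumptions made for illustration; the meaning of the real constructor arguments `(5, 30_000, 10_000)` is not shown in the diff.

```java
import java.util.function.LongSupplier;

/** Hypothetical circuit breaker for illustration; not the PR's BulkCircuitBreaker. */
final class SketchCircuitBreaker {
  enum State {
    CLOSED,
    OPEN,
    HALF_OPEN
  }

  private final int failureThreshold; // assumed: consecutive failures before opening
  private final long openDurationMs; // assumed: how long to fail fast before probing
  private final LongSupplier clock; // injected so the behaviour can be tested without sleeping
  private State state = State.CLOSED;
  private int consecutiveFailures = 0;
  private long openedAt = 0L;

  SketchCircuitBreaker(int failureThreshold, long openDurationMs, LongSupplier clock) {
    this.failureThreshold = failureThreshold;
    this.openDurationMs = openDurationMs;
    this.clock = clock;
  }

  /** Returns false while OPEN; after the open window elapses, lets a probe through. */
  synchronized boolean allowRequest() {
    if (state == State.OPEN && clock.getAsLong() - openedAt >= openDurationMs) {
      state = State.HALF_OPEN;
    }
    return state != State.OPEN;
  }

  /** Any success closes the breaker and resets the failure streak. */
  synchronized void recordSuccess() {
    consecutiveFailures = 0;
    state = State.CLOSED;
  }

  /** Opens when the streak reaches the threshold, or immediately if a probe fails. */
  synchronized void recordFailure() {
    consecutiveFailures++;
    if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
      state = State.OPEN;
      openedAt = clock.getAsLong();
    }
  }

  synchronized State getState() {
    return state;
  }
}
```

In the diff, the fail-fast branch of `executeBulkWithRetry` releases the semaphore and decrements `activeBulkRequests` itself, because no completion callback will run for a request that was never sent.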
@@ -0,0 +1,38 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import java.util.Set;

/**
 * Per-entity-type batch sizing based on typical document size. Large entity types (tables,
 * dashboards, etc.) produce bigger search documents, so we use smaller batches. Small entity types
 * (users, tags, etc.) produce tiny documents, so we can use larger batches.
 */
public final class EntityBatchSizeEstimator {

  private static final Set<String> LARGE_ENTITIES =
      Set.of("table", "topic", "dashboard", "mlmodel", "container", "storedProcedure");

  private static final Set<String> SMALL_ENTITIES =
      Set.of("user", "team", "bot", "role", "policy", "tag", "classification");

  private static final int MIN_BATCH_SIZE = 25;
  private static final int MAX_BATCH_SIZE = 1000;

  private EntityBatchSizeEstimator() {}

  public static int estimateBatchSize(String entityType, int baseBatchSize) {
    if (baseBatchSize <= 0) {
      return baseBatchSize;
    }

    if (LARGE_ENTITIES.contains(entityType)) {
      return Math.max(baseBatchSize / 2, MIN_BATCH_SIZE);
    }

    if (SMALL_ENTITIES.contains(entityType)) {
      return Math.min(baseBatchSize * 2, MAX_BATCH_SIZE);
    }

    return baseBatchSize;
  }
}
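To make the sizing rule in `EntityBatchSizeEstimator` above concrete, here is a small stand-alone sketch that mirrors its logic and prints a few worked values. `BatchSizeSketch` and `estimate` are hypothetical names used only for this example; the sets and limits are copied from the diff.

```java
import java.util.Set;

/** Hypothetical stand-alone copy of the sizing rule, for illustration only. */
final class BatchSizeSketch {
  private static final Set<String> LARGE =
      Set.of("table", "topic", "dashboard", "mlmodel", "container", "storedProcedure");
  private static final Set<String> SMALL =
      Set.of("user", "team", "bot", "role", "policy", "tag", "classification");
  private static final int MIN = 25;
  private static final int MAX = 1000;

  static int estimate(String entityType, int base) {
    if (base <= 0) {
      return base; // non-positive input is passed through unchanged
    }
    if (LARGE.contains(entityType)) {
      return Math.max(base / 2, MIN); // halve, but never below the floor
    }
    if (SMALL.contains(entityType)) {
      return Math.min(base * 2, MAX); // double, but never above the cap
    }
    return base;
  }

  public static void main(String[] args) {
    System.out.println(estimate("table", 100)); // 50: halved
    System.out.println(estimate("user", 100)); // 200: doubled
    System.out.println(estimate("user", 800)); // 1000: doubled value 1600 hits the cap
    System.out.println(estimate("glossary", 100)); // 100: unlisted types keep the base
    System.out.println(estimate("table", 20)); // 25: the floor is larger than the base
  }
}
```

The last case is worth keeping in mind: for a base batch size below 25, a "large" entity type ends up with a batch larger than the configured base, because the floor is applied after halving.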
@@ -0,0 +1,125 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Defines priority tiers for entity types during reindexing. Higher-priority entities are processed
 * first so that foundational data (services, users) is available in the search index before
 * dependent data (tables, dashboards) is indexed.
 */
public final class EntityPriority {

  enum Tier {
    CRITICAL(0, 100),
    HIGH(1, 80),
    MEDIUM(2, 60),
    LOW(3, 40),
    LOWEST(4, 20);

    private final int order;
    private final int numericPriority;

    Tier(int order, int numericPriority) {
      this.order = order;
      this.numericPriority = numericPriority;
    }

    int order() {
      return order;
    }

    int numericPriority() {
      return numericPriority;
    }
  }

  private static final Map<String, Tier> ENTITY_TIERS = new LinkedHashMap<>();

  static {
    // P0 CRITICAL: Service entities — hierarchy parents, must exist first
    for (String s :
        List.of(
            "databaseService",
            "messagingService",
            "dashboardService",
            "pipelineService",
            "mlmodelService",
            "storageService",
            "searchService",
            "apiService",
            "metadataService")) {
      ENTITY_TIERS.put(s, Tier.CRITICAL);
    }

    // P1 HIGH: Identity/org entities — referenced by everything
    for (String s : List.of("user", "team", "role", "bot", "persona")) {
      ENTITY_TIERS.put(s, Tier.HIGH);
    }

    // P2 MEDIUM: Core data assets
    for (String s :
        List.of(
            "table",
            "database",
            "databaseSchema",
            "dashboard",
            "chart",
            "pipeline",
            "topic",
            "mlmodel",
            "container",
            "storedProcedure",
            "query",
            "dashboardDataModel",
            "api",
            "apiEndpoint",
            "apiCollection")) {
      ENTITY_TIERS.put(s, Tier.MEDIUM);
    }

    // P4 LOWEST: Time series entities
    for (String s :
        List.of(
            "entityReportData",
            "rawCostAnalysisReportData",
            "webAnalyticUserActivityReportData",
            "webAnalyticEntityViewReportData",
            "aggregatedCostAnalysisReportData",
            "testCaseResolutionStatus",
            "testCaseResult",
            "queryCostRecord")) {
      ENTITY_TIERS.put(s, Tier.LOWEST);
    }

    // P3 LOW: Everything else not explicitly listed defaults to LOW via getTier()
  }

  private EntityPriority() {}

  static Tier getTier(String entityType) {
    return ENTITY_TIERS.getOrDefault(entityType, Tier.LOW);
  }

  /**
   * Returns a numeric priority score for the given entity type. Higher values mean higher priority.
   * Used by the distributed indexing path to store priority on partitions for SQL-level ordering.
   */
  public static int getNumericPriority(String entityType) {
    return getTier(entityType).numericPriority();
  }

  /**
   * Returns entity types sorted by priority tier. Entities within the same tier preserve their
   * original iteration order from the input set.
   */
  public static List<String> sortByPriority(Set<String> entities) {
    List<String> list = new ArrayList<>(entities);
    list.sort(Comparator.comparingInt(e -> getTier(e).order()));
    return list;
  }
}
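The `sortByPriority` method above relies on `List.sort` being a stable sort, which is what preserves the input order within a tier. The sketch below illustrates that behaviour with a reduced tier table. `PrioritySortSketch` and its `ORDER` map are hypothetical stand-ins; the order values follow the `Tier` enum in the diff, and unlisted types default to 3 (LOW).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical reduced version of the tier ordering, for illustration only. */
final class PrioritySortSketch {
  // Subset of the tier table: 0 = CRITICAL, 1 = HIGH, 2 = MEDIUM, 4 = LOWEST.
  private static final Map<String, Integer> ORDER =
      Map.of("databaseService", 0, "user", 1, "team", 1, "table", 2, "testCaseResult", 4);

  static List<String> sortByPriority(Set<String> entities) {
    List<String> list = new ArrayList<>(entities);
    // List.sort is stable, so equal-tier entries keep their relative input order.
    list.sort(Comparator.comparingInt(e -> ORDER.getOrDefault(e, 3)));
    return list;
  }

  public static void main(String[] args) {
    Set<String> input =
        new LinkedHashSet<>(
            List.of("table", "user", "testCaseResult", "glossary", "databaseService", "team"));
    // Prints [databaseService, user, team, table, glossary, testCaseResult]
    System.out.println(sortByPriority(input));
  }
}
```

The within-tier guarantee only means something when the input set has a defined iteration order (for example a `LinkedHashSet`); with a `HashSet` or `Set.of(...)` the order inside a tier is whatever the set happens to yield.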
@@ -0,0 +1,336 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.openmetadata.service.Entity.QUERY_COST_RECORD;
import static org.openmetadata.service.Entity.TEST_CASE_RESOLUTION_STATUS;
import static org.openmetadata.service.Entity.TEST_CASE_RESULT;

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicBoolean;
import lombok.extern.slf4j.Slf4j;
import org.openmetadata.schema.analytics.ReportData;
import org.openmetadata.schema.utils.ResultList;
import org.openmetadata.service.exception.SearchIndexException;
import org.openmetadata.service.util.RestUtil;
import org.openmetadata.service.workflows.searchIndex.PaginatedEntitiesSource;
import org.openmetadata.service.workflows.searchIndex.PaginatedEntityTimeSeriesSource;

/**
 * Standalone reader that encapsulates all entity reading logic. Decoupled from queues and sinks —
 * delivers batches via a callback interface.
 */
@Slf4j
public class EntityReader implements AutoCloseable {

  static final Set<String> TIME_SERIES_ENTITIES =
      Set.of(
          ReportData.ReportDataType.ENTITY_REPORT_DATA.value(),
          ReportData.ReportDataType.RAW_COST_ANALYSIS_REPORT_DATA.value(),
          ReportData.ReportDataType.WEB_ANALYTIC_USER_ACTIVITY_REPORT_DATA.value(),
          ReportData.ReportDataType.WEB_ANALYTIC_ENTITY_VIEW_REPORT_DATA.value(),
          ReportData.ReportDataType.AGGREGATED_COST_ANALYSIS_REPORT_DATA.value(),
          TEST_CASE_RESOLUTION_STATUS,
          TEST_CASE_RESULT,
          QUERY_COST_RECORD);

  private static final int MAX_READERS_PER_ENTITY = 5;

  @FunctionalInterface
  public interface BatchCallback {
    void onBatchRead(String entityType, ResultList<?> batch, int offset)
        throws InterruptedException;
  }

  @FunctionalInterface
  interface KeysetBatchReader {
    ResultList<?> readNextKeyset(String cursor) throws SearchIndexException;
  }

  @FunctionalInterface
  interface BoundaryFinder {
    List<String> findBoundaries(int numReaders, int totalRecords);
  }

  private static final int DEFAULT_MAX_RETRY_ATTEMPTS = 3;
  private static final long DEFAULT_RETRY_BACKOFF_MS = 500;

  private final ExecutorService producerExecutor;
  private final AtomicBoolean stopped;
  private final int maxRetryAttempts;
  private final long retryBackoffMs;

  public EntityReader(ExecutorService producerExecutor, AtomicBoolean stopped) {
    this(producerExecutor, stopped, DEFAULT_MAX_RETRY_ATTEMPTS, DEFAULT_RETRY_BACKOFF_MS);
  }

  public EntityReader(
      ExecutorService producerExecutor,
      AtomicBoolean stopped,
      int maxRetryAttempts,
      long retryBackoffMs) {
    this.producerExecutor = producerExecutor;
    this.stopped = stopped;
    this.maxRetryAttempts = maxRetryAttempts;
    this.retryBackoffMs = retryBackoffMs;
  }

  /**
   * Read all entities of a given type, invoking callback for each batch.
   *
   * @param entityType The entity type to read
   * @param totalRecords Total records expected for this entity
   * @param batchSize Batch size for reading
   * @param phaser Phaser for completion tracking (readers will register/deregister)
   * @param callback Callback invoked with each batch
   * @return Number of readers submitted
   */
  public int readEntity(
      String entityType, int totalRecords, int batchSize, Phaser phaser, BatchCallback callback) {
    return readEntity(entityType, totalRecords, batchSize, phaser, callback, null, null);
  }

  public int readEntity(
      String entityType,
      int totalRecords,
      int batchSize,
      Phaser phaser,
      BatchCallback callback,
      Long timeSeriesStartTs,
      Long timeSeriesEndTs) {
    if (totalRecords <= 0) {
      return 0;
    }

    int numReaders =
        Math.min(calculateNumberOfReaders(totalRecords, batchSize), MAX_READERS_PER_ENTITY);
    phaser.bulkRegister(numReaders);

    try {
      if (TIME_SERIES_ENTITIES.contains(entityType)) {
        submitReaders(
            entityType,
            totalRecords,
            batchSize,
            numReaders,
            phaser,
            callback,
            () -> {
              PaginatedEntityTimeSeriesSource source =
                  (timeSeriesStartTs != null)
                      ? new PaginatedEntityTimeSeriesSource(
                          entityType,
                          batchSize,
                          getSearchIndexFields(entityType),
                          totalRecords,
                          timeSeriesStartTs,
                          timeSeriesEndTs)
                      : new PaginatedEntityTimeSeriesSource(
                          entityType, batchSize, getSearchIndexFields(entityType), totalRecords);
              return source::readWithCursor;
            },
            (readers, total) -> {
              List<String> cursors = new ArrayList<>();
              int perReader = total / readers;
              for (int i = 1; i < readers; i++) {
                cursors.add(RestUtil.encodeCursor(String.valueOf(i * perReader)));
              }
              return cursors;
            });
      } else {
        PaginatedEntitiesSource entSource =
            new PaginatedEntitiesSource(
                entityType, batchSize, getSearchIndexFields(entityType), totalRecords);
        submitReaders(
            entityType,
            totalRecords,
            batchSize,
            numReaders,
            phaser,
            callback,
            () -> {
              PaginatedEntitiesSource source =
                  new PaginatedEntitiesSource(
                      entityType, batchSize, getSearchIndexFields(entityType), totalRecords);
              return source::readNextKeyset;
            },
            entSource::findBoundaryCursors);
      }
    } catch (Exception e) {
      LOG.error(
          "Failed to submit readers for {}, deregistering {} phaser parties",
          entityType,
          numReaders,
          e);
      for (int i = 0; i < numReaders; i++) {
        phaser.arriveAndDeregister();
      }
      throw e;
    }

    return numReaders;
  }

  public void stop() {
    stopped.set(true);
  }

  @Override
  public void close() {
    stop();
  }

  private void submitReaders(
      String entityType,
      int totalRecords,
      int batchSize,
      int numReaders,
      Phaser phaser,
      BatchCallback callback,
      java.util.function.Supplier<KeysetBatchReader> readerFactory,
      BoundaryFinder boundaryFinder) {
    if (numReaders == 1) {
      KeysetBatchReader reader = readerFactory.get();
      producerExecutor.submit(
          () ->
              readKeysetBatches(
                  entityType, Integer.MAX_VALUE, batchSize, null, reader, phaser, callback));
      return;
    }

    List<String> boundaries = boundaryFinder.findBoundaries(numReaders, totalRecords);
    int actualReaders = boundaries.size() + 1;
    int recordsPerReader = (totalRecords + actualReaders - 1) / actualReaders;

    if (actualReaders < numReaders) {
      LOG.warn(
          "Boundary discovery for {} returned {} cursors (expected {}), using {} readers",
          entityType,
          boundaries.size(),
          numReaders - 1,
          actualReaders);
      for (int j = 0; j < numReaders - actualReaders; j++) {
        phaser.arriveAndDeregister();
      }
    }

    for (int i = 0; i < actualReaders; i++) {
      String startCursor = (i == 0) ? null : boundaries.get(i - 1);
      int limit = (i == actualReaders - 1) ? Integer.MAX_VALUE : recordsPerReader;
      KeysetBatchReader readerSource = readerFactory.get();
      final int readerLimit = limit;
      producerExecutor.submit(
          () ->
              readKeysetBatches(
                  entityType, readerLimit, batchSize, startCursor, readerSource, phaser, callback));
    }
  }

  private void readKeysetBatches(
      String entityType,
      int recordLimit,
      int batchSize,
      String startCursor,
      KeysetBatchReader batchReader,
      Phaser phaser,
      BatchCallback callback) {
    try {
      String keysetCursor = startCursor;
      int processed = 0;

      while (processed < recordLimit && !stopped.get()) {
        ResultList<?> result = readWithRetry(batchReader, keysetCursor, entityType);
        if (stopped.get()) {
          break;
        }

        if (result == null || result.getData().isEmpty()) {
          LOG.debug(
              "Reader for {} exhausted at processed={} of limit={} (empty result)",
              entityType,
              processed,
              recordLimit);
          break;
        }

        callback.onBatchRead(entityType, result, processed);

        int readCount = result.getData().size();
        int errorCount = result.getErrors() != null ? result.getErrors().size() : 0;
        int warningsCount = result.getWarningsCount() != null ? result.getWarningsCount() : 0;
        processed += readCount + errorCount + warningsCount;

        keysetCursor = result.getPaging() != null ? result.getPaging().getAfter() : null;
        if (keysetCursor == null) {
          LOG.debug(
              "Reader for {} exhausted at processed={} of limit={} (null cursor)",
              entityType,
              processed,
              recordLimit);
          break;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      LOG.warn("Interrupted during reading of {}", entityType);
    } catch (SearchIndexException e) {
      LOG.error("Error reading keyset batch for {}", entityType, e);
    } catch (Exception e) {
      if (!stopped.get()) {
        LOG.error("Error in keyset reading for {}", entityType, e);
      }
    } finally {
      phaser.arriveAndDeregister();
    }
  }

  private ResultList<?> readWithRetry(
      KeysetBatchReader batchReader, String keysetCursor, String entityType)
      throws SearchIndexException, InterruptedException {
    for (int attempt = 0; attempt <= maxRetryAttempts; attempt++) {
      try {
        return batchReader.readNextKeyset(keysetCursor);
      } catch (SearchIndexException e) {
        if (attempt >= maxRetryAttempts || !isTransientError(e)) {
          throw e;
        }
        long backoff = retryBackoffMs * (1L << attempt);
        LOG.warn(
            "Transient read failure for {} (attempt {}/{}), retrying in {}ms",
            entityType,
            attempt + 1,
            maxRetryAttempts,
            backoff);
        Thread.sleep(Math.min(backoff, 10_000));
      }
    }
    return null;
  }

  static boolean isTransientError(SearchIndexException e) {
    String msg = e.getMessage();
    if (msg == null) {
      return false;
    }
    String lower = msg.toLowerCase();
    return lower.contains("timeout")
        || lower.contains("connection")
        || lower.contains("pool exhausted")
        || lower.contains("connectexception")
        || lower.contains("sockettimeoutexception");
  }

  static List<String> getSearchIndexFields(String entityType) {
    if (TIME_SERIES_ENTITIES.contains(entityType)) {
      return List.of();
    }
    return List.of("*");
  }

  static int calculateNumberOfReaders(int totalEntityRecords, int batchSize) {
    if (batchSize <= 0) return 1;
    return (totalEntityRecords + batchSize - 1) / batchSize;
  }
}
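The arithmetic in `EntityReader` above (reader count, time-series split points, retry backoff) is easier to follow with numbers. The sketch below reproduces just those three calculations. `ReaderPlanSketch` and its method names are hypothetical; the constants (cap of 5 readers, 10,000 ms sleep ceiling) are taken from the diff, and the split points are shown as plain offsets rather than encoded cursors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper mirroring EntityReader's arithmetic, for illustration only. */
final class ReaderPlanSketch {
  static final int MAX_READERS_PER_ENTITY = 5;

  /** ceil(totalRecords / batchSize), capped at MAX_READERS_PER_ENTITY. */
  static int plannedReaders(int totalRecords, int batchSize) {
    int byBatches = batchSize <= 0 ? 1 : (totalRecords + batchSize - 1) / batchSize;
    return Math.min(byBatches, MAX_READERS_PER_ENTITY);
  }

  /** Start offsets for readers 2..n in the time-series path (reader 1 starts at the beginning). */
  static List<Integer> timeSeriesOffsets(int readers, int total) {
    List<Integer> offsets = new ArrayList<>();
    int perReader = total / readers;
    for (int i = 1; i < readers; i++) {
      offsets.add(i * perReader);
    }
    return offsets;
  }

  /** Exponential backoff: base * 2^attempt, with the sleep capped at 10 seconds. */
  static long backoffMs(long baseMs, int attempt) {
    return Math.min(baseMs * (1L << attempt), 10_000);
  }

  public static void main(String[] args) {
    System.out.println(plannedReaders(1200, 100)); // 5: 12 batches, capped
    System.out.println(plannedReaders(250, 100)); // 3
    System.out.println(timeSeriesOffsets(4, 1000)); // [250, 500, 750]
    System.out.println(backoffMs(500, 0)); // 500
    System.out.println(backoffMs(500, 2)); // 2000
    System.out.println(backoffMs(500, 5)); // 10000: 16000 capped
  }
}
```

With the defaults in the diff (3 retry attempts, 500 ms base), a persistently failing transient read is tried four times in total, sleeping 500, 1000 and 2000 ms between tries, before the exception propagates.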
@@ -1,5 +1,8 @@
 package org.openmetadata.service.apps.bundles.searchIndex;

+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
 import org.openmetadata.schema.system.Stats;

 /**
@@ -13,52 +16,61 @@ public record ExecutionResult(
    long failedRecords,
    long startTime,
    long endTime,
    Stats finalStats) {
    Stats finalStats,
    Map<String, Object> metadata) {

  public ExecutionResult(
      Status status,
      long totalRecords,
      long successRecords,
      long failedRecords,
      long startTime,
      long endTime,
      Stats finalStats) {
    this(
        status,
        totalRecords,
        successRecords,
        failedRecords,
        startTime,
        endTime,
        finalStats,
        Collections.emptyMap());
  }

  /** Execution status values */
  public enum Status {
    /** Job completed successfully with all records processed */
    COMPLETED,
    /** Job completed but some records failed */
    COMPLETED_WITH_ERRORS,
    /** Job failed due to an exception */
    FAILED,
    /** Job was stopped by user request */
    STOPPED
  }

  /** Get the duration of the execution in milliseconds */
  public long getDurationMillis() {
    return endTime - startTime;
  }

  /** Get the duration of the execution in seconds */
  public long getDurationSeconds() {
    return getDurationMillis() / 1000;
  }

  /** Get the success rate as a percentage (0-100) */
  public double getSuccessRate() {
    return totalRecords > 0 ? (successRecords * 100.0) / totalRecords : 0;
  }

  /** Get the processing rate in records per second */
  public double getRecordsPerSecond() {
    long durationSeconds = getDurationSeconds();
    return durationSeconds > 0 ? (double) successRecords / durationSeconds : 0;
  }

  /** Check if the execution was successful (no failures) */
  public boolean isSuccessful() {
    return status == Status.COMPLETED;
  }

  /** Check if the execution completed (regardless of errors) */
  public boolean isCompleted() {
    return status == Status.COMPLETED || status == Status.COMPLETED_WITH_ERRORS;
  }

  /** Builder for creating ExecutionResult instances */
  public static Builder builder() {
    return new Builder();
  }
@@ -71,6 +83,7 @@ public record ExecutionResult(
     private long startTime;
     private long endTime;
     private Stats finalStats;
+    private Map<String, Object> metadata = new HashMap<>();

     public Builder status(Status status) {
       this.status = status;
@@ -107,9 +120,26 @@ public record ExecutionResult(
       return this;
     }

+    public Builder metadata(Map<String, Object> metadata) {
+      this.metadata = metadata != null ? metadata : new HashMap<>();
+      return this;
+    }
+
+    public Builder addMetadata(String key, Object value) {
+      this.metadata.put(key, value);
+      return this;
+    }
+
     public ExecutionResult build() {
       return new ExecutionResult(
-          status, totalRecords, successRecords, failedRecords, startTime, endTime, finalStats);
+          status,
+          totalRecords,
+          successRecords,
+          failedRecords,
+          startTime,
+          endTime,
+          finalStats,
+          Collections.unmodifiableMap(metadata));
     }
   }

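Two details of `ExecutionResult` above are easy to miss when reading the diff: the throughput figure is computed from whole seconds, and `build()` wraps the builder's map with `Collections.unmodifiableMap`, which is a read-only view rather than a copy. The sketch below illustrates both using only JDK calls. `ExecutionResultSketch` and its static helpers are hypothetical names that mirror the formulas in the diff; whether the view semantics matter in practice depends on whether a builder is ever reused after `build()`, which this section does not show.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper mirroring ExecutionResult's formulas, for illustration only. */
final class ExecutionResultSketch {
  /** Same formula as getRecordsPerSecond(): the duration is truncated to whole seconds. */
  static double recordsPerSecond(long successRecords, long startMs, long endMs) {
    long durationSeconds = (endMs - startMs) / 1000;
    return durationSeconds > 0 ? (double) successRecords / durationSeconds : 0;
  }

  /** Same formula as getSuccessRate(): percentage of total, 0 when there is no total. */
  static double successRate(long successRecords, long totalRecords) {
    return totalRecords > 0 ? (successRecords * 100.0) / totalRecords : 0;
  }

  public static void main(String[] args) {
    System.out.println(recordsPerSecond(500, 0, 999)); // 0.0: sub-second runs report no rate
    System.out.println(recordsPerSecond(500, 0, 2500)); // 250.0: 2.5 s is treated as 2 s
    System.out.println(successRate(90, 120)); // 75.0

    Map<String, Object> backing = new HashMap<>();
    Map<String, Object> view = Collections.unmodifiableMap(backing); // read-only view
    Map<String, Object> copy = Map.copyOf(backing); // independent snapshot
    backing.put("phase", "done");
    System.out.println(view.size()); // 1: later writes to the backing map show through
    System.out.println(copy.size()); // 0: the snapshot is unaffected
  }
}
```

If a snapshot is what callers expect from a built result, `Map.copyOf` is one JDK option, with the caveat that it rejects null keys and values.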
@@ -55,6 +55,11 @@ public class IndexingFailureRecorder implements AutoCloseable {
         stackTrace);
   }

+  public void recordReaderEntityFailure(
+      String entityType, String entityId, String entityFqn, String errorMessage) {
+    recordFailure(entityType, entityId, entityFqn, FailureStage.READER, errorMessage, null);
+  }
+
   public void recordSinkFailure(
       String entityType, String entityId, String entityFqn, String errorMessage) {
     recordSinkFailure(entityType, entityId, entityFqn, errorMessage, null);
@@ -0,0 +1,534 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.openmetadata.common.utils.CommonUtil.listOrEmpty;

import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.openmetadata.schema.EntityInterface;
import org.openmetadata.schema.EntityTimeSeriesInterface;
import org.openmetadata.schema.system.IndexingError;
import org.openmetadata.schema.system.Stats;
import org.openmetadata.schema.system.StepStats;
import org.openmetadata.schema.utils.ResultList;
import org.openmetadata.service.search.EntityReindexContext;
import org.openmetadata.service.search.RecreateIndexHandler;
import org.openmetadata.service.search.ReindexContext;
import org.openmetadata.service.search.SearchRepository;
import org.openmetadata.service.workflows.searchIndex.ReindexingUtil;
import org.slf4j.MDC;

/**
 * Quartz-decoupled indexing pipeline that orchestrates: entity discovery -> reader -> queue -> sink.
 * This class can be used by SearchIndexExecutor, CLI tools, REST APIs, or unit tests.
 */
@Slf4j
public class IndexingPipeline implements AutoCloseable {

  private static final String POISON_PILL = "__POISON_PILL__";
  private static final int DEFAULT_QUEUE_SIZE = 20000;
  private static final int MAX_CONSUMER_THREADS = 20;
  private static final int MAX_JOB_THREADS = 30;
  private static final String ENTITY_TYPE_KEY = "entityType";
  private static final String RECREATE_INDEX = "recreateIndex";

  private final SearchRepository searchRepository;
  private final CompositeProgressListener listeners;
  private final AtomicBoolean stopped = new AtomicBoolean(false);
  @Getter private final AtomicReference<Stats> stats = new AtomicReference<>();

  private BulkSink searchIndexSink;
  private RecreateIndexHandler recreateIndexHandler;
  private ReindexContext recreateContext;
  private EntityReader entityReader;
  private ExecutorService consumerExecutor;
  private ExecutorService producerExecutor;
  private ExecutorService jobExecutor;
  private BlockingQueue<IndexingTask<?>> taskQueue;
  private final Set<String> promotedEntities = java.util.concurrent.ConcurrentHashMap.newKeySet();

  record IndexingTask<T>(String entityType, ResultList<T> entities, int offset) {}
|
||||
|
||||
public IndexingPipeline(SearchRepository searchRepository) {
|
||||
this.searchRepository = searchRepository;
|
||||
this.listeners = new CompositeProgressListener();
|
||||
}
|
||||
|
||||
public IndexingPipeline addListener(ReindexingProgressListener listener) {
|
||||
listeners.addListener(listener);
|
||||
return this;
|
||||
}
|
||||
|
||||
public ExecutionResult execute(
|
||||
ReindexingConfiguration config,
|
||||
ReindexingJobContext context,
|
||||
Set<String> entities,
|
||||
BulkSink sink,
|
||||
RecreateIndexHandler handler,
|
||||
ReindexContext recreateCtx) {
|
||||
this.searchIndexSink = sink;
|
||||
this.recreateIndexHandler = handler;
|
||||
this.recreateContext = recreateCtx;
|
||||
long startTime = System.currentTimeMillis();
|
||||
|
||||
stats.set(initializeStats(entities));
|
||||
listeners.onJobStarted(context);
|
||||
|
||||
try {
|
||||
runPipeline(config, entities);
|
||||
closeSink();
|
||||
finalizeReindex();
|
||||
return buildResult(startTime);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Pipeline execution failed", e);
|
||||
listeners.onJobFailed(stats.get(), e);
|
||||
return ExecutionResult.fromStats(stats.get(), ExecutionResult.Status.FAILED, startTime);
|
||||
}
|
||||
}
|
||||
|
||||
private void runPipeline(ReindexingConfiguration config, Set<String> entities)
|
||||
throws InterruptedException {
|
||||
int numConsumers =
|
||||
config.consumerThreads() > 0 ? Math.min(config.consumerThreads(), MAX_CONSUMER_THREADS) : 2;
|
||||
int queueSize = config.queueSize() > 0 ? config.queueSize() : DEFAULT_QUEUE_SIZE;
|
||||
int batchSize = config.batchSize();
|
||||
|
||||
taskQueue = new LinkedBlockingQueue<>(queueSize);
|
||||
consumerExecutor =
|
||||
Executors.newFixedThreadPool(
|
||||
numConsumers, Thread.ofPlatform().name("pipeline-consumer-", 0).factory());
|
||||
producerExecutor =
|
||||
Executors.newFixedThreadPool(
|
||||
config.producerThreads() > 0 ? config.producerThreads() : 2,
|
||||
Thread.ofPlatform().name("pipeline-producer-", 0).factory());
|
||||
jobExecutor =
|
||||
Executors.newFixedThreadPool(
|
||||
Math.min(entities.size(), MAX_JOB_THREADS),
|
||||
Thread.ofPlatform().name("pipeline-job-", 0).factory());
|
||||
|
||||
entityReader = new EntityReader(producerExecutor, stopped);
|
||||
|
||||
CountDownLatch consumerLatch = new CountDownLatch(numConsumers);
|
||||
Map<String, String> mdc = MDC.getCopyOfContextMap();
|
||||
for (int i = 0; i < numConsumers; i++) {
|
||||
final int id = i;
|
||||
consumerExecutor.submit(
|
||||
() -> {
|
||||
if (mdc != null) MDC.setContextMap(mdc);
|
||||
try {
|
||||
runConsumer(id, consumerLatch);
|
||||
} finally {
|
||||
MDC.clear();
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
try {
|
||||
readAllEntities(config, entities, batchSize);
|
||||
signalConsumersToStop(numConsumers);
|
||||
consumerLatch.await();
|
||||
} catch (InterruptedException e) {
|
||||
stopped.set(true);
|
||||
Thread.currentThread().interrupt();
|
||||
throw e;
|
||||
} finally {
|
||||
shutdownExecutors();
|
||||
}
|
||||
}
|
||||
|
||||
private void readAllEntities(ReindexingConfiguration config, Set<String> entities, int batchSize)
|
||||
throws InterruptedException {
|
||||
List<String> ordered = EntityPriority.sortByPriority(entities);
|
||||
Phaser producerPhaser = new Phaser(entities.size());
|
||||
Map<String, String> mdc = MDC.getCopyOfContextMap();
|
||||
|
||||
for (String entityType : ordered) {
|
||||
jobExecutor.submit(
|
||||
() -> {
|
||||
if (mdc != null) MDC.setContextMap(mdc);
|
||||
try {
|
||||
int totalRecords = getTotalEntityRecords(entityType);
|
||||
listeners.onEntityTypeStarted(entityType, totalRecords);
|
||||
|
||||
int effectiveBatchSize =
|
||||
EntityBatchSizeEstimator.estimateBatchSize(entityType, batchSize);
|
||||
Long filterStartTs = null;
|
||||
Long filterEndTs = null;
|
||||
long startTs = config.getTimeSeriesStartTs(entityType);
|
||||
if (startTs > 0) {
|
||||
filterStartTs = startTs;
|
||||
filterEndTs = System.currentTimeMillis();
|
||||
}
|
||||
entityReader.readEntity(
|
||||
entityType,
|
||||
totalRecords,
|
||||
effectiveBatchSize,
|
||||
producerPhaser,
|
||||
(type, batch, offset) -> {
|
||||
if (!stopped.get()) {
|
||||
taskQueue.put(new IndexingTask<>(type, batch, offset));
|
||||
}
|
||||
},
|
||||
filterStartTs,
|
||||
filterEndTs);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error reading entity type {}", entityType, e);
|
||||
} finally {
|
||||
producerPhaser.arriveAndDeregister();
|
||||
MDC.clear();
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
int phase = 0;
|
||||
while (!producerPhaser.isTerminated()) {
|
||||
if (stopped.get() || Thread.currentThread().isInterrupted()) {
|
||||
break;
|
||||
}
|
||||
try {
|
||||
producerPhaser.awaitAdvanceInterruptibly(phase, 1, TimeUnit.SECONDS);
|
||||
break;
|
||||
} catch (TimeoutException e) {
|
||||
// Timed out before the producers finished; loop again to re-check the stop flag
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@SuppressWarnings("unchecked")
|
||||
private void runConsumer(int consumerId, CountDownLatch consumerLatch) {
|
||||
try {
|
||||
while (!stopped.get()) {
|
||||
IndexingTask<?> task = taskQueue.poll(200, TimeUnit.MILLISECONDS);
|
||||
if (task == null) continue;
|
||||
if (POISON_PILL.equals(task.entityType())) break;
|
||||
|
||||
String entityType = task.entityType();
|
||||
ResultList<?> entities = task.entities();
|
||||
Map<String, Object> contextData = createContextData(entityType);
|
||||
|
||||
int readerSuccess = listOrEmpty(entities.getData()).size();
|
||||
int readerFailed = listOrEmpty(entities.getErrors()).size();
|
||||
int readerWarnings = entities.getWarningsCount() != null ? entities.getWarningsCount() : 0;
|
||||
updateReaderStats(readerSuccess, readerFailed, readerWarnings);
|
||||
|
||||
try {
|
||||
if (!EntityReader.TIME_SERIES_ENTITIES.contains(entityType)) {
|
||||
searchIndexSink.write(
|
||||
(List<EntityInterface>) entities.getData(), contextData);
|
||||
} else {
|
||||
searchIndexSink.write(
|
||||
(List<EntityTimeSeriesInterface>) entities.getData(), contextData);
|
||||
}
|
||||
|
||||
StepStats entityStats = new StepStats();
|
||||
entityStats.setSuccessRecords(readerSuccess);
|
||||
entityStats.setFailedRecords(readerFailed);
|
||||
updateEntityAndJobStats(entityType, entityStats);
|
||||
listeners.onProgressUpdate(stats.get(), null);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Sink error for {}", entityType, e);
|
||||
IndexingError error =
|
||||
new IndexingError()
|
||||
.withErrorSource(IndexingError.ErrorSource.SINK)
|
||||
.withMessage(e.getMessage());
|
||||
listeners.onError(entityType, error, stats.get());
|
||||
}
|
||||
}
|
||||
} catch (InterruptedException e) {
|
||||
Thread.currentThread().interrupt();
|
||||
} finally {
|
||||
consumerLatch.countDown();
|
||||
}
|
||||
}
|
||||
|
||||
private Map<String, Object> createContextData(String entityType) {
|
||||
Map<String, Object> contextData = new HashMap<>();
|
||||
contextData.put(ENTITY_TYPE_KEY, entityType);
|
||||
contextData.put(RECREATE_INDEX, recreateContext != null);
|
||||
if (recreateContext != null) {
|
||||
contextData.put(ReindexingUtil.RECREATE_CONTEXT, recreateContext);
|
||||
recreateContext
|
||||
.getStagedIndex(entityType)
|
||||
.ifPresent(index -> contextData.put(ReindexingUtil.TARGET_INDEX_KEY, index));
|
||||
}
|
||||
return contextData;
|
||||
}
|
||||
|
||||
private void signalConsumersToStop(int numConsumers) throws InterruptedException {
|
||||
for (int i = 0; i < numConsumers; i++) {
|
||||
taskQueue.put(new IndexingTask<>(POISON_PILL, null, -1));
|
||||
}
|
||||
}
|
||||
|
||||
private void closeSink() throws IOException {
|
||||
if (searchIndexSink != null) {
|
||||
int pendingVectorTasks = searchIndexSink.getPendingVectorTaskCount();
|
||||
if (pendingVectorTasks > 0) {
|
||||
LOG.info("Waiting for {} pending vector embedding tasks", pendingVectorTasks);
|
||||
VectorCompletionResult vcResult = searchIndexSink.awaitVectorCompletionWithDetails(300);
|
||||
LOG.info(
|
||||
"Vector completion: completed={}, pending={}, waited={}ms",
|
||||
vcResult.completed(),
|
||||
vcResult.pendingTaskCount(),
|
||||
vcResult.waitedMillis());
|
||||
}
|
||||
searchIndexSink.close();
|
||||
syncSinkStats();
|
||||
}
|
||||
}
|
||||
|
||||
private void finalizeReindex() {
|
||||
if (recreateIndexHandler == null || recreateContext == null) return;
|
||||
|
||||
try {
|
||||
recreateContext
|
||||
.getEntities()
|
||||
.forEach(
|
||||
entityType -> {
|
||||
if (promotedEntities.contains(entityType)) return;
|
||||
try {
|
||||
EntityReindexContext ctx = buildEntityReindexContext(entityType);
|
||||
recreateIndexHandler.finalizeReindex(ctx, !stopped.get());
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to finalize reindex for {}", entityType, ex);
|
||||
}
|
||||
});
|
||||
} finally {
|
||||
recreateContext = null;
|
||||
promotedEntities.clear();
|
||||
}
|
||||
}
|
||||
|
||||
private EntityReindexContext buildEntityReindexContext(String entityType) {
|
||||
return EntityReindexContext.builder()
|
||||
.entityType(entityType)
|
||||
.originalIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.canonicalIndex(recreateContext.getCanonicalIndex(entityType).orElse(null))
|
||||
.activeIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.stagedIndex(recreateContext.getStagedIndex(entityType).orElse(null))
|
||||
.canonicalAliases(recreateContext.getCanonicalAlias(entityType).orElse(null))
|
||||
.existingAliases(recreateContext.getExistingAliases(entityType))
|
||||
.parentAliases(
|
||||
new HashSet<>(
|
||||
listOrEmpty(
|
||||
recreateContext.getParentAliases(entityType))))
|
||||
.build();
|
||||
}
|
||||
|
||||
private ExecutionResult buildResult(long startTime) {
|
||||
syncSinkStats();
|
||||
Stats currentStats = stats.get();
|
||||
if (currentStats != null) {
|
||||
StatsReconciler.reconcile(currentStats);
|
||||
}
|
||||
|
||||
ExecutionResult.Status status;
|
||||
if (stopped.get()) {
|
||||
status = ExecutionResult.Status.STOPPED;
|
||||
listeners.onJobStopped(currentStats);
|
||||
} else if (hasFailures()) {
|
||||
status = ExecutionResult.Status.COMPLETED_WITH_ERRORS;
|
||||
listeners.onJobCompletedWithErrors(currentStats, System.currentTimeMillis() - startTime);
|
||||
} else {
|
||||
status = ExecutionResult.Status.COMPLETED;
|
||||
listeners.onJobCompleted(currentStats, System.currentTimeMillis() - startTime);
|
||||
}
|
||||
|
||||
return ExecutionResult.fromStats(currentStats, status, startTime);
|
||||
}
|
||||
|
||||
private boolean hasFailures() {
|
||||
Stats s = stats.get();
|
||||
if (s == null || s.getJobStats() == null) return false;
|
||||
StepStats js = s.getJobStats();
|
||||
long failed = js.getFailedRecords() != null ? js.getFailedRecords() : 0;
|
||||
long success = js.getSuccessRecords() != null ? js.getSuccessRecords() : 0;
|
||||
long total = js.getTotalRecords() != null ? js.getTotalRecords() : 0;
|
||||
return failed > 0 || (total > 0 && success < total);
|
||||
}
|
||||
|
||||
private Stats initializeStats(Set<String> entities) {
|
||||
Stats s = new Stats();
|
||||
s.setEntityStats(new org.openmetadata.schema.system.EntityStats());
|
||||
s.setJobStats(new StepStats());
|
||||
s.setReaderStats(new StepStats());
|
||||
s.setSinkStats(new StepStats());
|
||||
|
||||
int total = 0;
|
||||
for (String entityType : entities) {
|
||||
int entityTotal = getTotalEntityRecords(entityType);
|
||||
total += entityTotal;
|
||||
StepStats es = new StepStats();
|
||||
es.setTotalRecords(entityTotal);
|
||||
es.setSuccessRecords(0);
|
||||
es.setFailedRecords(0);
|
||||
s.getEntityStats().getAdditionalProperties().put(entityType, es);
|
||||
}
|
||||
s.getJobStats().setTotalRecords(total);
|
||||
s.getJobStats().setSuccessRecords(0);
|
||||
s.getJobStats().setFailedRecords(0);
|
||||
s.getReaderStats().setTotalRecords(total);
|
||||
s.getReaderStats().setSuccessRecords(0);
|
||||
s.getReaderStats().setFailedRecords(0);
|
||||
s.getReaderStats().setWarningRecords(0);
|
||||
s.getSinkStats().setTotalRecords(0);
|
||||
s.getSinkStats().setSuccessRecords(0);
|
||||
s.getSinkStats().setFailedRecords(0);
|
||||
|
||||
s.setProcessStats(new StepStats());
|
||||
s.getProcessStats().setTotalRecords(0);
|
||||
s.getProcessStats().setSuccessRecords(0);
|
||||
s.getProcessStats().setFailedRecords(0);
|
||||
return s;
|
||||
}
|
||||
|
||||
private int getTotalEntityRecords(String entityType) {
|
||||
StepStats es =
|
||||
stats.get() != null
|
||||
&& stats.get().getEntityStats() != null
|
||||
&& stats.get().getEntityStats().getAdditionalProperties() != null
|
||||
? stats.get().getEntityStats().getAdditionalProperties().get(entityType)
|
||||
: null;
|
||||
if (es != null && es.getTotalRecords() != null) {
|
||||
return es.getTotalRecords();
|
||||
}
|
||||
return 0;
|
||||
}
|
||||
|
||||
private synchronized void updateReaderStats(int success, int failed, int warnings) {
|
||||
Stats s = stats.get();
|
||||
if (s == null) return;
|
||||
StepStats rs = s.getReaderStats();
|
||||
if (rs == null) {
|
||||
rs = new StepStats();
|
||||
s.setReaderStats(rs);
|
||||
}
|
||||
rs.setSuccessRecords((rs.getSuccessRecords() != null ? rs.getSuccessRecords() : 0) + success);
|
||||
rs.setFailedRecords((rs.getFailedRecords() != null ? rs.getFailedRecords() : 0) + failed);
|
||||
rs.setWarningRecords((rs.getWarningRecords() != null ? rs.getWarningRecords() : 0) + warnings);
|
||||
}
|
||||
|
||||
private synchronized void updateEntityAndJobStats(String entityType, StepStats entityDelta) {
|
||||
Stats s = stats.get();
|
||||
if (s == null || s.getEntityStats() == null) return;
|
||||
|
||||
StepStats es = s.getEntityStats().getAdditionalProperties().get(entityType);
|
||||
if (es != null) {
|
||||
es.setSuccessRecords(es.getSuccessRecords() + entityDelta.getSuccessRecords());
|
||||
es.setFailedRecords(es.getFailedRecords() + entityDelta.getFailedRecords());
|
||||
}
|
||||
|
||||
StepStats js = s.getJobStats();
|
||||
if (js != null) {
|
||||
int totalSuccess =
|
||||
s.getEntityStats().getAdditionalProperties().values().stream()
|
||||
.mapToInt(StepStats::getSuccessRecords)
|
||||
.sum();
|
||||
int totalFailed =
|
||||
s.getEntityStats().getAdditionalProperties().values().stream()
|
||||
.mapToInt(StepStats::getFailedRecords)
|
||||
.sum();
|
||||
js.setSuccessRecords(totalSuccess);
|
||||
js.setFailedRecords(totalFailed);
|
||||
}
|
||||
}
|
||||
|
||||
private synchronized void syncSinkStats() {
|
||||
if (searchIndexSink == null) return;
|
||||
Stats s = stats.get();
|
||||
if (s == null) return;
|
||||
|
||||
StepStats bulkStats = searchIndexSink.getStats();
|
||||
if (bulkStats == null) return;
|
||||
|
||||
StepStats sinkStats = s.getSinkStats();
|
||||
if (sinkStats == null) {
|
||||
sinkStats = new StepStats();
|
||||
s.setSinkStats(sinkStats);
|
||||
}
|
||||
sinkStats.setTotalRecords(
|
||||
bulkStats.getTotalRecords() != null ? bulkStats.getTotalRecords() : 0);
|
||||
sinkStats.setSuccessRecords(
|
||||
bulkStats.getSuccessRecords() != null ? bulkStats.getSuccessRecords() : 0);
|
||||
sinkStats.setFailedRecords(
|
||||
bulkStats.getFailedRecords() != null ? bulkStats.getFailedRecords() : 0);
|
||||
|
||||
StepStats vectorStats = searchIndexSink.getVectorStats();
|
||||
if (vectorStats != null
|
||||
&& vectorStats.getTotalRecords() != null
|
||||
&& vectorStats.getTotalRecords() > 0) {
|
||||
s.setVectorStats(vectorStats);
|
||||
}
|
||||
|
||||
StepStats processStats = searchIndexSink.getProcessStats();
|
||||
if (processStats != null) {
|
||||
s.setProcessStats(processStats);
|
||||
}
|
||||
}
|
||||
|
||||
private void shutdownExecutors() {
|
||||
shutdownExecutor(producerExecutor, "producer");
|
||||
shutdownExecutor(jobExecutor, "job");
|
||||
shutdownExecutor(consumerExecutor, "consumer");
|
||||
}
|
||||
|
||||
private void shutdownExecutor(ExecutorService executor, String name) {
|
||||
if (executor != null && !executor.isShutdown()) {
|
||||
executor.shutdown();
|
||||
try {
|
||||
if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
|
||||
executor.shutdownNow();
|
||||
LOG.warn("{} executor did not terminate in time", name);
|
||||
}
|
||||
} catch (InterruptedException e) {
|
||||
executor.shutdownNow();
|
||||
Thread.currentThread().interrupt();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public void stop() {
|
||||
stopped.set(true);
|
||||
if (entityReader != null) entityReader.stop();
|
||||
|
||||
if (searchIndexSink != null) {
|
||||
LOG.info(
|
||||
"Stopping pipeline: flushing sink ({} active bulk requests)",
|
||||
searchIndexSink.getActiveBulkRequestCount());
|
||||
searchIndexSink.flushAndAwait(10);
|
||||
}
|
||||
|
||||
int dropped = taskQueue != null ? taskQueue.size() : 0;
|
||||
if (dropped > 0) {
|
||||
LOG.warn("Dropping {} queued tasks during shutdown", dropped);
|
||||
}
|
||||
|
||||
if (taskQueue != null) {
|
||||
taskQueue.clear();
|
||||
for (int i = 0; i < MAX_CONSUMER_THREADS; i++) {
|
||||
taskQueue.offer(new IndexingTask<>(POISON_PILL, null, -1));
|
||||
}
|
||||
}
|
||||
shutdownExecutors();
|
||||
}
|
||||
|
||||
@Override
|
||||
public void close() {
|
||||
stop();
|
||||
}
|
||||
}
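The consumer shutdown in `runPipeline`, `runConsumer`, and `signalConsumersToStop` above follows the poison-pill pattern: producers push work into a bounded queue, then enqueue one sentinel per consumer so that each worker exits its poll loop. The following is a minimal, self-contained sketch of that pattern, not the project's code. The class `PoisonPillSketch`, the method `drain`, and the queue bound of 4 are assumptions made for illustration, and plain strings stand in for `IndexingTask`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class PoisonPillSketch {
  // Hypothetical sentinel; it plays the role of POISON_PILL in IndexingPipeline.
  private static final String POISON_PILL = "__POISON_PILL__";

  /** Sends every item through a bounded queue to numConsumers workers and returns what they handled. */
  static List<String> drain(List<String> items, int numConsumers) {
    BlockingQueue<String> queue = new LinkedBlockingQueue<>(4); // small bound, so put() can block
    List<String> handled = Collections.synchronizedList(new ArrayList<>());
    CountDownLatch done = new CountDownLatch(numConsumers);
    ExecutorService consumers = Executors.newFixedThreadPool(numConsumers);
    try {
      for (int i = 0; i < numConsumers; i++) {
        consumers.execute(() -> consume(queue, handled, done));
      }
      for (String item : items) {
        queue.put(item); // blocks while the queue is full (backpressure on the producer)
      }
      for (int i = 0; i < numConsumers; i++) {
        queue.put(POISON_PILL); // one pill per consumer
      }
      done.await(); // every consumer has counted down once it saw its pill
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while draining", e);
    } finally {
      consumers.shutdownNow();
    }
    return new ArrayList<>(handled);
  }

  private static void consume(
      BlockingQueue<String> queue, List<String> handled, CountDownLatch done) {
    try {
      while (true) {
        String task = queue.poll(200, TimeUnit.MILLISECONDS);
        if (task == null) continue; // nothing yet, poll again
        if (POISON_PILL.equals(task)) break; // each pill stops exactly one consumer
        handled.add(task);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      done.countDown();
    }
  }
}
```

Because the pills are enqueued after the last real item and the queue is FIFO, every item is handled before the consumers exit; only the order across consumers is unspecified.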
|
||||
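`readAllEntities` above waits for its per-entity jobs with a `Phaser`: one party per job, `arriveAndDeregister()` in each job's `finally` block, and a timed `awaitAdvanceInterruptibly` loop so the waiting thread can notice a stop request. A reduced sketch of that waiting loop is shown here under stated assumptions: `PhaserWaitSketch`, `runJobs`, the pool size cap of 4, and the 100 ms timeout are all hypothetical, and the job body is a simple counter increment.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class PhaserWaitSketch {
  /** Runs the given number of jobs and returns how many had finished when the wait ended. */
  static int runJobs(int jobs, AtomicBoolean stopped) {
    if (jobs < 1) throw new IllegalArgumentException("jobs must be >= 1");
    Phaser phaser = new Phaser(jobs); // one registered party per job
    AtomicInteger finished = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(Math.min(jobs, 4));
    try {
      for (int i = 0; i < jobs; i++) {
        pool.execute(
            () -> {
              try {
                finished.incrementAndGet(); // stand-in for reading one entity type
              } finally {
                phaser.arriveAndDeregister(); // always arrive, even if the job failed
              }
            });
      }
      while (!phaser.isTerminated()) {
        if (stopped.get() || Thread.currentThread().isInterrupted()) break;
        try {
          // Returns once phase 0 has advanced, or immediately if the phaser terminated.
          phaser.awaitAdvanceInterruptibly(0, 100, TimeUnit.MILLISECONDS);
          break;
        } catch (TimeoutException e) {
          // Jobs still running; loop again so the stop flag is re-checked.
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          break;
        }
      }
    } finally {
      pool.shutdown();
    }
    return finished.get();
  }
}
```

The sketch rejects a job count of zero because a phaser with no registered parties never advances, so the timed loop would never exit on its own.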
|
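The final status chosen in `buildResult` and `hasFailures` above reduces to a small decision rule: a stop request wins, otherwise the job counts as having errors if any record failed or if fewer records succeeded than were expected. The sketch below restates that rule as a pure function. `JobStatusSketch` and `resolve` are hypothetical names, and boxed `Integer` arguments mimic the nullable counters on `StepStats`.

```java
final class JobStatusSketch {
  enum Status { COMPLETED, COMPLETED_WITH_ERRORS, STOPPED }

  /** Mirrors the stopped / hasFailures / completed ordering used in buildResult. */
  static Status resolve(boolean stopped, Integer total, Integer success, Integer failed) {
    if (stopped) {
      return Status.STOPPED; // a stop request takes precedence over any counts
    }
    long t = total != null ? total : 0; // null counters are treated as zero
    long s = success != null ? success : 0;
    long f = failed != null ? failed : 0;
    boolean hasFailures = f > 0 || (t > 0 && s < t);
    return hasFailures ? Status.COMPLETED_WITH_ERRORS : Status.COMPLETED;
  }
}
```

Note that the second clause flags a shortfall even when no failure was recorded, for example when records were dropped before reaching the sink.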
|
@@ -0,0 +1,21 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import java.util.Optional;
|
||||
import org.openmetadata.schema.system.Stats;
|
||||
|
||||
/**
|
||||
* Strategy interface for reindexing execution. Encapsulates the differences between single-server
|
||||
* and distributed indexing so that SearchIndexApp uses a single code path regardless of mode.
|
||||
*/
|
||||
public interface IndexingStrategy {
|
||||
|
||||
void addListener(ReindexingProgressListener listener);
|
||||
|
||||
ExecutionResult execute(ReindexingConfiguration config, ReindexingJobContext context);
|
||||
|
||||
Optional<Stats> getStats();
|
||||
|
||||
void stop();
|
||||
|
||||
boolean isStopped();
|
||||
}
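The Javadoc on `IndexingStrategy` says the caller should use one code path regardless of mode. A rough illustration of that idea follows, with a trimmed-down interface and two toy implementations. Everything here is an assumption for illustration: `StrategySketch`, `Strategy`, `SingleServer`, `Distributed`, the string results, and the partition count are not taken from the commit, and the real implementations are not shown in this diff.

```java
import java.util.Optional;

final class StrategySketch {
  /** Hypothetical, trimmed-down counterpart of IndexingStrategy. */
  interface Strategy {
    String execute(int records);

    Optional<Integer> getStats();

    void stop();

    boolean isStopped();
  }

  static final class SingleServer implements Strategy {
    private volatile boolean stopped;
    private volatile Integer processed;

    public String execute(int records) {
      if (stopped) return "single:stopped";
      processed = records;
      return "single:" + records;
    }

    public Optional<Integer> getStats() {
      return Optional.ofNullable(processed); // empty until execute has run
    }

    public void stop() {
      stopped = true;
    }

    public boolean isStopped() {
      return stopped;
    }
  }

  static final class Distributed implements Strategy {
    private final int partitions;
    private volatile boolean stopped;
    private volatile Integer processed;

    Distributed(int partitions) {
      this.partitions = partitions;
    }

    public String execute(int records) {
      if (stopped) return "distributed:stopped";
      processed = records;
      return "distributed:" + records + "/" + partitions;
    }

    public Optional<Integer> getStats() {
      return Optional.ofNullable(processed);
    }

    public void stop() {
      stopped = true;
    }

    public boolean isStopped() {
      return stopped;
    }
  }

  /** The caller picks a strategy once, then runs the same code path for either mode. */
  static String run(boolean distributed, int records) {
    Strategy strategy = distributed ? new Distributed(4) : new SingleServer();
    return strategy.execute(records);
  }
}
```

Returning `Optional` from `getStats` lets a caller poll for progress before the first run without a null check, which matches the `Optional<Stats>` signature in the interface above.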
|
||||
|
|
@@ -86,6 +86,10 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
private final AtomicLong totalSuccess = new AtomicLong(0);
|
||||
private final AtomicLong totalFailed = new AtomicLong(0);
|
||||
|
||||
// Process stage metrics (document building/transformation)
|
||||
private final AtomicLong processSuccess = new AtomicLong(0);
|
||||
private final AtomicLong processFailed = new AtomicLong(0);
|
||||
|
||||
// Configuration
|
||||
private volatile int batchSize;
|
||||
private volatile int maxConcurrentRequests;
|
||||
|
|
@@ -134,6 +138,7 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
concurrentRequests,
|
||||
maxPayloadSizeBytes / (1024 * 1024));
|
||||
|
||||
BulkCircuitBreaker circuitBreaker = new BulkCircuitBreaker(5, 30_000, 10_000);
|
||||
return new CustomBulkProcessor(
|
||||
searchClient,
|
||||
bulkActions,
|
||||
|
|
@@ -145,7 +150,8 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
totalSubmitted,
|
||||
totalSuccess,
|
||||
totalFailed,
|
||||
- this::updateStats);
+ this::updateStats,
+ circuitBreaker);
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
@@ -305,12 +311,14 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
tracker.incrementPendingSink();
|
||||
}
|
||||
bulkProcessor.add(operation, docId, entityType, tracker, estimatedSize);
|
||||
processSuccess.incrementAndGet();
|
||||
if (tracker != null) {
|
||||
tracker.recordProcess(StatsResult.SUCCESS);
|
||||
}
|
||||
} catch (EntityNotFoundException e) {
|
||||
LOG.error("Entity Not Found Due to : {}", e.getMessage(), e);
|
||||
totalFailed.incrementAndGet();
|
||||
processFailed.incrementAndGet();
|
||||
updateStats();
|
||||
if (tracker != null) {
|
||||
tracker.recordProcess(StatsResult.FAILED);
|
||||
|
|
@@ -321,12 +329,14 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
entityTypeName,
|
||||
entity.getId() != null ? entity.getId().toString() : null,
|
||||
entity.getFullyQualifiedName(),
|
||||
- e.getMessage());
+ e.getMessage(),
+ IndexingFailureRecorder.FailureStage.PROCESS);
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.error(
|
||||
"Encountered Issue while building SearchDoc from Entity Due to : {}", e.getMessage(), e);
|
||||
totalFailed.incrementAndGet();
|
||||
processFailed.incrementAndGet();
|
||||
updateStats();
|
||||
if (tracker != null) {
|
||||
tracker.recordProcess(StatsResult.FAILED);
|
||||
|
|
@@ -337,7 +347,8 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
entityTypeName,
|
||||
entity.getId() != null ? entity.getId().toString() : null,
|
||||
entity.getFullyQualifiedName(),
|
||||
- e.getMessage());
+ e.getMessage(),
+ IndexingFailureRecorder.FailureStage.PROCESS);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@@ -364,12 +375,14 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
tracker.incrementPendingSink();
|
||||
}
|
||||
bulkProcessor.add(operation, docId, entityType, tracker, estimatedSize);
|
||||
processSuccess.incrementAndGet();
|
||||
if (tracker != null) {
|
||||
tracker.recordProcess(StatsResult.SUCCESS);
|
||||
}
|
||||
} catch (EntityNotFoundException e) {
|
||||
LOG.error("Entity Not Found Due to : {}", e.getMessage(), e);
|
||||
totalFailed.incrementAndGet();
|
||||
processFailed.incrementAndGet();
|
||||
updateStats();
|
||||
if (tracker != null) {
|
||||
tracker.recordProcess(StatsResult.FAILED);
|
||||
|
|
@@ -379,12 +392,14 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
entityType,
|
||||
entity.getId() != null ? entity.getId().toString() : null,
|
||||
null,
|
||||
- e.getMessage());
+ e.getMessage(),
+ IndexingFailureRecorder.FailureStage.PROCESS);
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.error(
|
||||
"Encountered Issue while building SearchDoc from Entity Due to : {}", e.getMessage(), e);
|
||||
totalFailed.incrementAndGet();
|
||||
processFailed.incrementAndGet();
|
||||
updateStats();
|
||||
if (tracker != null) {
|
||||
tracker.recordProcess(StatsResult.FAILED);
|
||||
|
|
@@ -394,7 +409,8 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
entityType,
|
||||
entity.getId() != null ? entity.getId().toString() : null,
|
||||
null,
|
||||
- e.getMessage());
+ e.getMessage(),
+ IndexingFailureRecorder.FailureStage.PROCESS);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@@ -649,6 +665,28 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
});
|
||||
}
|
||||
|
||||
@Override
|
||||
public int getActiveBulkRequestCount() {
|
||||
return bulkProcessor.activeBulkRequests.get();
|
||||
}
|
||||
|
||||
@Override
|
||||
public VectorCompletionResult awaitVectorCompletionWithDetails(int timeoutSeconds) {
|
||||
long start = System.currentTimeMillis();
|
||||
boolean ok = awaitVectorCompletion(timeoutSeconds);
|
||||
long waited = System.currentTimeMillis() - start;
|
||||
if (!ok) {
|
||||
int pending = getPendingVectorTaskCount();
|
||||
LOG.warn("Vector completion timed out with {} pending tasks after {}ms", pending, waited);
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
if (metrics != null) {
|
||||
metrics.recordVectorTimeout(pending);
|
||||
}
|
||||
return VectorCompletionResult.timeout(pending, waited);
|
||||
}
|
||||
return VectorCompletionResult.success(waited);
|
||||
}
|
||||
|
||||
@Override
|
||||
public boolean awaitVectorCompletion(int timeoutSeconds) {
|
||||
try {
|
||||
|
|
@@ -677,6 +715,16 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
.withFailedRecords((int) vectorFailed.get());
|
||||
}
|
||||
|
||||
@Override
|
||||
public StepStats getProcessStats() {
|
||||
long success = processSuccess.get();
|
||||
long failed = processFailed.get();
|
||||
return new StepStats()
|
||||
.withTotalRecords((int) (success + failed))
|
||||
.withSuccessRecords((int) success)
|
||||
.withFailedRecords((int) failed);
|
||||
}
|
||||
|
||||
public static class CustomBulkProcessor {
|
||||
private final OpenSearchAsyncClient asyncClient;
|
||||
private final List<BulkOperation> buffer = new ArrayList<>();
|
||||
|
|
@@ -705,6 +753,7 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
private volatile boolean closed = false;
|
||||
private volatile FailureCallback failureCallback;
|
||||
private volatile SinkStatsCallback statsCallback;
|
||||
private final BulkCircuitBreaker circuitBreaker;
|
||||
|
||||
CustomBulkProcessor(
|
||||
OpenSearchClient client,
|
||||
|
|
@@ -717,7 +766,8 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
AtomicLong totalSubmitted,
|
||||
AtomicLong totalSuccess,
|
||||
AtomicLong totalFailed,
|
||||
- Runnable statsUpdater) {
+ Runnable statsUpdater,
+ BulkCircuitBreaker circuitBreaker) {
|
||||
this.asyncClient = new OpenSearchAsyncClient(client.getNewClient()._transport());
|
||||
this.bulkActions = bulkActions;
|
||||
this.maxPayloadSizeBytes = maxPayloadSizeBytes;
|
||||
|
|
@@ -728,6 +778,7 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
this.totalSuccess = totalSuccess;
|
||||
this.totalFailed = totalFailed;
|
||||
this.statsUpdater = statsUpdater;
|
||||
this.circuitBreaker = circuitBreaker;
|
||||
this.scheduler = Executors.newScheduledThreadPool(1);
|
||||
|
||||
scheduler.scheduleAtFixedRate(
|
||||
|
|
@@ -852,9 +903,16 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
}
|
||||
|
||||
List<BulkOperation> toFlush = new ArrayList<>(buffer);
|
||||
long payloadSize = currentBufferSize;
|
||||
buffer.clear();
|
||||
currentBufferSize = 0;
|
||||
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
if (metrics != null) {
|
||||
metrics.recordPayloadSize(payloadSize);
|
||||
metrics.incrementPendingBulkRequests();
|
||||
}
|
||||
|
||||
long executionId = executionIdCounter.incrementAndGet();
|
||||
int numberOfActions = toFlush.size();
|
||||
totalSubmitted.addAndGet(numberOfActions);
|
||||
|
|
@@ -876,22 +934,69 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
|
||||
private void executeBulkWithRetry(
|
||||
List<BulkOperation> operations, long executionId, int numberOfActions, int attemptNumber) {
|
||||
if (!circuitBreaker.allowRequest()) {
|
||||
LOG.warn(
|
||||
"Circuit breaker OPEN - fail-fast for bulk request {} with {} actions",
|
||||
executionId,
|
||||
numberOfActions);
|
||||
totalFailed.addAndGet(numberOfActions);
|
||||
statsUpdater.run();
|
||||
activeBulkRequests.decrementAndGet();
|
||||
concurrentRequestSemaphore.release();
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
if (metrics != null) {
|
||||
metrics.decrementPendingBulkRequests();
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
io.micrometer.core.instrument.Timer.Sample bulkTimerSample =
|
||||
metrics != null ? metrics.startBulkRequestTimer() : null;
|
||||
|
||||
CompletableFuture<BulkResponse> future;
|
||||
try {
|
||||
future = asyncClient.bulk(b -> b.operations(operations).refresh(Refresh.False));
|
||||
} catch (IOException e) {
|
||||
- handleBulkFailure(operations, executionId, numberOfActions, attemptNumber, e);
|
||||
if (metrics != null && bulkTimerSample != null) {
|
||||
metrics.recordBulkRequestCompleted(bulkTimerSample, false);
|
||||
}
|
||||
circuitBreaker.recordFailure();
|
||||
boolean retryScheduled =
|
||||
handleBulkFailure(operations, executionId, numberOfActions, attemptNumber, e);
|
||||
if (!retryScheduled) {
|
||||
activeBulkRequests.decrementAndGet();
|
||||
concurrentRequestSemaphore.release();
|
||||
if (metrics != null) {
|
||||
metrics.decrementPendingBulkRequests();
|
||||
}
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
future.whenComplete(
|
||||
(response, error) -> {
|
||||
boolean retryScheduled = false;
|
||||
try {
|
||||
if (error != null) {
|
||||
- handleBulkFailure(operations, executionId, numberOfActions, attemptNumber, error);
|
||||
if (metrics != null && bulkTimerSample != null) {
|
||||
metrics.recordBulkRequestCompleted(bulkTimerSample, false);
|
||||
}
|
||||
circuitBreaker.recordFailure();
|
||||
retryScheduled =
|
||||
handleBulkFailure(
|
||||
operations, executionId, numberOfActions, attemptNumber, error);
|
||||
} else if (response.errors()) {
|
||||
if (metrics != null && bulkTimerSample != null) {
|
||||
metrics.recordBulkRequestCompleted(bulkTimerSample, false);
|
||||
}
|
||||
circuitBreaker.recordSuccess();
|
||||
handlePartialFailure(response, executionId, numberOfActions);
|
||||
} else {
|
||||
if (metrics != null && bulkTimerSample != null) {
|
||||
metrics.recordBulkRequestCompleted(bulkTimerSample, true);
|
||||
}
|
||||
circuitBreaker.recordSuccess();
|
||||
totalSuccess.addAndGet(numberOfActions);
|
||||
LOG.debug(
|
||||
"Bulk request {} completed successfully with {} actions",
|
||||
|
|
@@ -901,23 +1006,25 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
statsUpdater.run();
|
||||
}
|
||||
} finally {
|
||||
- if (error != null && shouldRetry(attemptNumber, error)) {
- // Don't release resources yet, we're retrying
- } else {
+ if (!retryScheduled) {
|
||||
activeBulkRequests.decrementAndGet();
|
||||
concurrentRequestSemaphore.release();
|
||||
if (metrics != null) {
|
||||
metrics.decrementPendingBulkRequests();
|
||||
}
|
||||
}
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
- private void handleBulkFailure(
+ private boolean handleBulkFailure(
|
||||
List<BulkOperation> operations,
|
||||
long executionId,
|
||||
int numberOfActions,
|
||||
int attemptNumber,
|
||||
Throwable error) {
|
||||
- if (shouldRetry(attemptNumber, error)) {
+ if (shouldRetry(attemptNumber, error)
+ && circuitBreaker.getState() != BulkCircuitBreaker.State.OPEN) {
|
||||
long backoffTime = calculateBackoff(attemptNumber);
|
||||
LOG.warn(
|
||||
"Bulk request {} failed (attempt {}), retrying in {}ms: {}",
|
||||
|
|
@@ -930,6 +1037,7 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
() -> executeBulkWithRetry(operations, executionId, numberOfActions, attemptNumber + 1),
|
||||
backoffTime,
|
||||
TimeUnit.MILLISECONDS);
|
||||
return true;
|
||||
} else {
|
||||
totalFailed.addAndGet(numberOfActions);
|
||||
LOG.error(
|
||||
|
|
@@ -955,7 +1063,12 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
tracker.recordSink(StatsResult.FAILED);
|
||||
}
|
||||
if (failureCallback != null) {
|
||||
- failureCallback.onFailure(entityType, docId, null, error.getMessage());
+ failureCallback.onFailure(
+ entityType,
+ docId,
+ null,
+ error.getMessage(),
+ IndexingFailureRecorder.FailureStage.SINK);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@@ -965,6 +1078,7 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
}
|
||||
}
|
||||
statsUpdater.run();
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@@ -996,7 +1110,8 @@ public class OpenSearchBulkSink implements BulkSink {
|
|||
tracker.recordSink(StatsResult.FAILED);
|
||||
}
|
||||
if (failureCallback != null) {
|
||||
- failureCallback.onFailure(entityType, docId, null, failureMessage);
+ failureCallback.onFailure(
+ entityType, docId, null, failureMessage, IndexingFailureRecorder.FailureStage.SINK);
|
||||
}
|
||||
} else {
|
||||
String entityType = docId != null ? docIdToEntityType.remove(docId) : null;
|
||||
|
|
|
|||
|
|
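The hunks above change `handleBulkFailure` to return whether a retry was scheduled, and the `finally` block releases the semaphore permit only when no retry is pending. The following is a minimal, synchronous sketch of that hand-off, not the `OpenSearchBulkSink` code: the class and method names (`RetryPermitSketch`, `submit`, `runScheduled`, `setCircuitOpen`) are hypothetical, a plain queue stands in for the scheduler and its backoff delay, and a boolean flag stands in for `BulkCircuitBreaker`.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Semaphore;
import java.util.function.IntPredicate;

/** Hypothetical stand-in for the sink's retry path; all names are illustrative only. */
final class RetryPermitSketch {
  private final Semaphore permits;
  private final int maxAttempts;
  // Stands in for a scheduled executor: retries wait here until runScheduled() is called.
  private final Deque<Runnable> scheduled = new ArrayDeque<>();
  private boolean circuitOpen;
  private int failedRequests;

  RetryPermitSketch(int maxConcurrent, int maxAttempts) {
    this.permits = new Semaphore(maxConcurrent);
    this.maxAttempts = maxAttempts;
  }

  /** Takes a permit and runs the first attempt; the predicate says whether attempt n succeeds. */
  void submit(IntPredicate attemptSucceeds) {
    permits.acquireUninterruptibly();
    attempt(attemptSucceeds, 1);
  }

  private void attempt(IntPredicate attemptSucceeds, int attemptNumber) {
    boolean retryScheduled = false;
    try {
      if (!attemptSucceeds.test(attemptNumber)) {
        retryScheduled = handleFailure(attemptSucceeds, attemptNumber);
      }
    } finally {
      // Release only when this request is finished; a pending retry keeps the permit.
      if (!retryScheduled) {
        permits.release();
      }
    }
  }

  /** Returns true when a retry was queued, false when the request is given up. */
  private boolean handleFailure(IntPredicate attemptSucceeds, int attemptNumber) {
    if (attemptNumber < maxAttempts && !circuitOpen) {
      scheduled.add(() -> attempt(attemptSucceeds, attemptNumber + 1));
      return true;
    }
    failedRequests++;
    return false;
  }

  /** Runs queued retries until none remain. */
  void runScheduled() {
    Runnable next;
    while ((next = scheduled.poll()) != null) {
      next.run();
    }
  }

  void setCircuitOpen(boolean open) {
    this.circuitOpen = open;
  }

  int availablePermits() {
    return permits.availablePermits();
  }

  int failedRequests() {
    return failedRequests;
  }
}
```

Returning the flag from the failure handler keeps the release decision in one place: the caller does not need to re-evaluate the retry condition in `finally`, so the two checks cannot drift apart when a new condition (here, the open circuit) is added.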
@@ -0,0 +1,32 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import java.util.Map;
import java.util.UUID;
import org.openmetadata.schema.entity.app.AppRunRecord;
import org.openmetadata.schema.system.EventPublisherJob;
import org.openmetadata.schema.system.Stats;

public interface OrchestratorContext {

  String getJobName();

  String getAppConfigJson();

  void storeRunStats(Stats stats);

  void storeRunRecord(String json);

  AppRunRecord getJobRecord();

  void pushStatusUpdate(AppRunRecord record, boolean force);

  UUID getAppId();

  Map<String, Object> getAppConfiguration();

  void updateAppConfiguration(Map<String, Object> config);

  ReindexingProgressListener createProgressListener(EventPublisherJob jobData);

  ReindexingJobContext createReindexingContext(boolean distributed);
}
@@ -0,0 +1,98 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.openmetadata.service.apps.scheduler.OmAppJobListener.APP_CONFIG;
import static org.openmetadata.service.apps.scheduler.OmAppJobListener.APP_RUN_STATS;

import java.util.Map;
import java.util.UUID;
import java.util.function.Function;
import org.openmetadata.schema.entity.app.App;
import org.openmetadata.schema.entity.app.AppRunRecord;
import org.openmetadata.schema.system.EventPublisherJob;
import org.openmetadata.schema.system.Stats;
import org.openmetadata.schema.utils.JsonUtils;
import org.openmetadata.service.apps.bundles.searchIndex.listeners.QuartzProgressListener;
import org.quartz.JobExecutionContext;

public class QuartzOrchestratorContext implements OrchestratorContext {

  private static final String APP_SCHEDULE_RUN = "AppScheduleRun";

  private final JobExecutionContext ctx;
  private final App app;
  private final Function<JobExecutionContext, AppRunRecord> jobRecordProvider;
  private final StatusPusher statusPusher;

  @FunctionalInterface
  public interface StatusPusher {
    void push(JobExecutionContext ctx, AppRunRecord record, boolean force);
  }

  public QuartzOrchestratorContext(
      JobExecutionContext ctx,
      App app,
      Function<JobExecutionContext, AppRunRecord> jobRecordProvider,
      StatusPusher statusPusher) {
    this.ctx = ctx;
    this.app = app;
    this.jobRecordProvider = jobRecordProvider;
    this.statusPusher = statusPusher;
  }

  @Override
  public String getJobName() {
    return ctx.getJobDetail().getKey().getName();
  }

  @Override
  public String getAppConfigJson() {
    return (String) ctx.getJobDetail().getJobDataMap().get(APP_CONFIG);
  }

  @Override
  public void storeRunStats(Stats stats) {
    ctx.getJobDetail().getJobDataMap().put(APP_RUN_STATS, stats);
  }

  @Override
  public void storeRunRecord(String json) {
    ctx.getJobDetail().getJobDataMap().put(APP_SCHEDULE_RUN, json);
  }

  @Override
  public AppRunRecord getJobRecord() {
    return jobRecordProvider.apply(ctx);
  }

  @Override
  public void pushStatusUpdate(AppRunRecord record, boolean force) {
    statusPusher.push(ctx, record, force);
  }

  @Override
  public UUID getAppId() {
    return app != null ? app.getId() : null;
  }

  @Override
  public Map<String, Object> getAppConfiguration() {
    return app != null ? JsonUtils.getMap(app.getAppConfiguration()) : null;
  }

  @Override
  public void updateAppConfiguration(Map<String, Object> config) {
    if (app != null) {
      app.setAppConfiguration(config);
    }
  }

  @Override
  public ReindexingProgressListener createProgressListener(EventPublisherJob jobData) {
    return new QuartzProgressListener(ctx, jobData, app, jobRecordProvider, statusPusher);
  }

  @Override
  public ReindexingJobContext createReindexingContext(boolean distributed) {
    return new QuartzJobContext(ctx, app, distributed);
  }
}
@@ -1,8 +1,14 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import java.util.Collections;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
import org.openmetadata.schema.system.EventPublisherJob;
|
||||
import org.openmetadata.schema.type.IndexMappingLanguage;
|
||||
import org.openmetadata.service.search.SearchClusterMetrics;
|
||||
import org.openmetadata.service.search.SearchRepository;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
/**
|
||||
* Immutable configuration for a reindexing job. This record encapsulates all the configuration
|
||||
|
|
@@ -26,7 +32,11 @@ public record ReindexingConfiguration(
|
|||
IndexMappingLanguage searchIndexMappingLanguage,
|
||||
String afterCursor,
|
||||
String slackBotToken,
|
||||
- String slackChannel) {
+ String slackChannel,
+ int timeSeriesMaxDays,
+ Map<String, Integer> timeSeriesEntityDays) {
|
||||
|
||||
private static final Logger LOG = LoggerFactory.getLogger(ReindexingConfiguration.class);
|
||||
|
||||
private static final int DEFAULT_BATCH_SIZE = 100;
|
||||
private static final int DEFAULT_CONSUMER_THREADS = 1;
|
||||
|
|
@@ -37,6 +47,50 @@ public record ReindexingConfiguration(
|
|||
private static final int DEFAULT_MAX_RETRIES = 5;
|
||||
private static final int DEFAULT_INITIAL_BACKOFF = 1000;
|
||||
private static final int DEFAULT_MAX_BACKOFF = 10000;
|
||||
private static final int DEFAULT_TIME_SERIES_MAX_DAYS = 0;
|
||||
|
||||
public static ReindexingConfiguration applyAutoTuning(
|
||||
ReindexingConfiguration config, SearchRepository searchRepository) {
|
||||
if (!config.autoTune()) {
|
||||
return config;
|
||||
}
|
||||
SearchClusterMetrics metrics = fetchClusterMetrics(searchRepository);
|
||||
if (metrics == null) {
|
||||
return config;
|
||||
}
|
||||
return ReindexingConfiguration.builder()
|
||||
.entities(config.entities())
|
||||
.batchSize(metrics.getRecommendedBatchSize())
|
||||
.consumerThreads(metrics.getRecommendedConsumerThreads())
|
||||
.producerThreads(metrics.getRecommendedProducerThreads())
|
||||
.queueSize(metrics.getRecommendedQueueSize())
|
||||
.maxConcurrentRequests(metrics.getRecommendedConcurrentRequests())
|
||||
.payloadSize(metrics.getMaxPayloadSizeBytes())
|
||||
.recreateIndex(config.recreateIndex())
|
||||
.autoTune(true)
|
||||
.useDistributedIndexing(config.useDistributedIndexing())
|
||||
.force(config.force())
|
||||
.maxRetries(config.maxRetries())
|
||||
.initialBackoff(config.initialBackoff())
|
||||
.maxBackoff(config.maxBackoff())
|
||||
.searchIndexMappingLanguage(config.searchIndexMappingLanguage())
|
||||
.afterCursor(config.afterCursor())
|
||||
.slackBotToken(config.slackBotToken())
|
||||
.slackChannel(config.slackChannel())
|
||||
.timeSeriesMaxDays(config.timeSeriesMaxDays())
|
||||
.timeSeriesEntityDays(config.timeSeriesEntityDays())
|
||||
.build();
|
||||
}
|
||||
|
||||
private static SearchClusterMetrics fetchClusterMetrics(SearchRepository searchRepository) {
|
||||
try {
|
||||
return SearchClusterMetrics.fetchClusterMetrics(
|
||||
searchRepository, 0, searchRepository.getMaxDBConnections());
|
||||
} catch (Exception e) {
|
||||
LOG.warn("Failed to fetch cluster metrics for auto-tuning, using configured values", e);
|
||||
return null;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Creates a ReindexingConfiguration from an EventPublisherJob.
|
||||
|
|
@@ -69,7 +123,30 @@ public record ReindexingConfiguration(
|
|||
jobData.getSearchIndexMappingLanguage(),
|
||||
jobData.getAfterCursor(),
|
||||
jobData.getSlackBotToken(),
|
||||
- jobData.getSlackChannel());
+ jobData.getSlackChannel(),
+ jobData.getTimeSeriesMaxDays() != null
+ ? jobData.getTimeSeriesMaxDays()
+ : DEFAULT_TIME_SERIES_MAX_DAYS,
+ jobData.getTimeSeriesEntityDays() != null
+ ? jobData.getTimeSeriesEntityDays()
+ : Collections.emptyMap());
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns the start timestamp for time series date filtering for the given entity type. Uses
|
||||
* per-entity override if configured, otherwise falls back to the default timeSeriesMaxDays.
|
||||
*
|
||||
* @return start timestamp in millis, or -1 if no filtering should be applied (days <= 0)
|
||||
*/
|
||||
public long getTimeSeriesStartTs(String entityType) {
|
||||
int days = timeSeriesMaxDays;
|
||||
if (timeSeriesEntityDays != null && timeSeriesEntityDays.containsKey(entityType)) {
|
||||
days = timeSeriesEntityDays.get(entityType);
|
||||
}
|
||||
if (days <= 0) {
|
||||
return -1;
|
||||
}
|
||||
return System.currentTimeMillis() - (days * 86_400_000L);
|
||||
}
|
||||
|
||||
/** Check if Slack notifications are configured */
|
||||
|
|
@@ -109,6 +186,8 @@ public record ReindexingConfiguration(
|
|||
private String afterCursor;
|
||||
private String slackBotToken;
|
||||
private String slackChannel;
|
||||
private int timeSeriesMaxDays = DEFAULT_TIME_SERIES_MAX_DAYS;
|
||||
private Map<String, Integer> timeSeriesEntityDays = Collections.emptyMap();
|
||||
|
||||
public Builder entities(Set<String> entities) {
|
||||
this.entities = entities;
|
||||
|
|
@@ -200,6 +279,16 @@ public record ReindexingConfiguration(
|
|||
return this;
|
||||
}
|
||||
|
||||
public Builder timeSeriesMaxDays(int timeSeriesMaxDays) {
|
||||
this.timeSeriesMaxDays = timeSeriesMaxDays;
|
||||
return this;
|
||||
}
|
||||
|
||||
public Builder timeSeriesEntityDays(Map<String, Integer> timeSeriesEntityDays) {
|
||||
this.timeSeriesEntityDays = timeSeriesEntityDays;
|
||||
return this;
|
||||
}
|
||||
|
||||
public ReindexingConfiguration build() {
|
||||
return new ReindexingConfiguration(
|
||||
entities,
|
||||
|
|
@@ -219,7 +308,9 @@ public record ReindexingConfiguration(
|
|||
searchIndexMappingLanguage,
|
||||
afterCursor,
|
||||
slackBotToken,
|
||||
- slackChannel);
+ slackChannel,
+ timeSeriesMaxDays,
+ timeSeriesEntityDays);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
|
|||
|
|
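`getTimeSeriesStartTs` in the `ReindexingConfiguration` diff above picks a per-entity day count when one is configured, falls back to `timeSeriesMaxDays`, and returns -1 when the resulting value is zero or negative. A stand-alone sketch of that rule follows, under some assumptions: the class name `TimeSeriesCutoffSketch` and the `nowMillis` parameter are hypothetical (the record reads `System.currentTimeMillis()` directly), and the sketch treats a missing or null override the same way, which is a choice of this example and not something the diff states.

```java
import java.util.Map;

/** Stand-alone sketch of the cutoff rule; not the OpenMetadata record itself. */
final class TimeSeriesCutoffSketch {
  private static final long MILLIS_PER_DAY = 86_400_000L;

  private TimeSeriesCutoffSketch() {}

  /**
   * Returns the earliest timestamp (epoch millis) to index for the entity type, or -1 when no
   * date filter applies. The current time is a parameter so the result is deterministic.
   */
  static long startTs(
      String entityType, int defaultDays, Map<String, Integer> perEntityDays, long nowMillis) {
    int days = defaultDays;
    // A per-entity value, when present, replaces the default entirely (even if it is 0).
    Integer override = perEntityDays == null ? null : perEntityDays.get(entityType);
    if (override != null) {
      days = override;
    }
    if (days <= 0) {
      return -1L;
    }
    // Long arithmetic avoids int overflow for large day counts.
    return nowMillis - days * MILLIS_PER_DAY;
  }
}
```

With a default of 30 days and an override of 7 days for one entity type, that type is cut off 7 days back and every other type 30 days back; an override of 0 turns filtering off for that type only.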
@@ -0,0 +1,312 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import io.micrometer.core.instrument.Counter;
|
||||
import io.micrometer.core.instrument.DistributionSummary;
|
||||
import io.micrometer.core.instrument.Gauge;
|
||||
import io.micrometer.core.instrument.MeterRegistry;
|
||||
import io.micrometer.core.instrument.Timer;
|
||||
import java.time.Duration;
|
||||
import java.util.Map;
|
||||
import java.util.concurrent.ConcurrentHashMap;
|
||||
import java.util.concurrent.atomic.AtomicLong;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
|
||||
@Slf4j
|
||||
public class ReindexingMetrics {
|
||||
|
||||
private static volatile ReindexingMetrics instance;
|
||||
private final MeterRegistry meterRegistry;
|
||||
|
||||
// Job lifecycle
|
||||
private final Counter jobsStarted;
|
||||
private final Counter jobsCompleted;
|
||||
private final Counter jobsFailed;
|
||||
private final Counter jobsStopped;
|
||||
private final Timer jobDurationCompleted;
|
||||
private final Timer jobDurationFailed;
|
||||
private final Timer jobDurationStopped;
|
||||
private final AtomicLong activeJobs = new AtomicLong();
|
||||
|
||||
// Bulk request metrics
|
||||
private final Timer bulkDurationSuccess;
|
||||
private final Timer bulkDurationFailure;
|
||||
private final DistributionSummary bulkPayloadSize;
|
||||
private final AtomicLong pendingBulkRequests = new AtomicLong();
|
||||
|
||||
// Backpressure
|
||||
private final Counter backpressureEvents;
|
||||
|
||||
// Circuit breaker
|
||||
private final Map<String, Counter> circuitBreakerCounters = new ConcurrentHashMap<>();
|
||||
|
||||
// Vector timeouts
|
||||
private final Counter vectorTimeouts;
|
||||
|
||||
// Queue fill ratio gauge
|
||||
private final AtomicLong queueFillRatio = new AtomicLong();
|
||||
|
||||
private ReindexingMetrics(MeterRegistry meterRegistry) {
|
||||
this.meterRegistry = meterRegistry;
|
||||
|
||||
// Job lifecycle counters
|
||||
this.jobsStarted =
|
||||
Counter.builder("reindexing.jobs")
|
||||
.description("Job lifecycle events")
|
||||
.tag("status", "started")
|
||||
.register(meterRegistry);
|
||||
|
||||
this.jobsCompleted =
|
||||
Counter.builder("reindexing.jobs")
|
||||
.description("Job lifecycle events")
|
||||
.tag("status", "completed")
|
||||
.register(meterRegistry);
|
||||
|
||||
this.jobsFailed =
|
||||
Counter.builder("reindexing.jobs")
|
||||
.description("Job lifecycle events")
|
||||
.tag("status", "failed")
|
||||
.register(meterRegistry);
|
||||
|
||||
this.jobsStopped =
|
||||
Counter.builder("reindexing.jobs")
|
||||
.description("Job lifecycle events")
|
||||
.tag("status", "stopped")
|
||||
.register(meterRegistry);
|
||||
|
||||
// Job duration timers
|
||||
this.jobDurationCompleted =
|
||||
Timer.builder("reindexing.job.duration")
|
||||
.description("Job wall-clock duration")
|
||||
.tag("status", "completed")
|
||||
.register(meterRegistry);
|
||||
|
||||
this.jobDurationFailed =
|
||||
Timer.builder("reindexing.job.duration")
|
||||
.description("Job wall-clock duration")
|
||||
.tag("status", "failed")
|
||||
.register(meterRegistry);
|
||||
|
||||
this.jobDurationStopped =
|
||||
Timer.builder("reindexing.job.duration")
|
||||
.description("Job wall-clock duration")
|
||||
.tag("status", "stopped")
|
||||
.register(meterRegistry);
|
||||
|
||||
// Active jobs gauge
|
||||
Gauge.builder("reindexing.jobs.active", activeJobs, AtomicLong::get)
|
||||
.description("Currently running reindexing jobs")
|
||||
.register(meterRegistry);
|
||||
|
||||
// Bulk request timers with SLA buckets
|
||||
this.bulkDurationSuccess =
|
||||
Timer.builder("reindexing.bulk.duration")
|
||||
.description("Bulk request latency")
|
||||
.tag("success", "true")
|
||||
.sla(
|
||||
Duration.ofMillis(50),
|
||||
Duration.ofMillis(100),
|
||||
Duration.ofMillis(500),
|
||||
Duration.ofSeconds(1),
|
||||
Duration.ofSeconds(5),
|
||||
Duration.ofSeconds(10),
|
||||
Duration.ofSeconds(30))
|
||||
.register(meterRegistry);
|
||||
|
||||
this.bulkDurationFailure =
|
||||
Timer.builder("reindexing.bulk.duration")
|
||||
.description("Bulk request latency")
|
||||
.tag("success", "false")
|
||||
.sla(
|
||||
Duration.ofMillis(50),
|
||||
Duration.ofMillis(100),
|
||||
Duration.ofMillis(500),
|
||||
Duration.ofSeconds(1),
|
||||
Duration.ofSeconds(5),
|
||||
Duration.ofSeconds(10),
|
||||
Duration.ofSeconds(30))
|
||||
.register(meterRegistry);
|
||||
|
||||
// Bulk payload size distribution
|
||||
this.bulkPayloadSize =
|
||||
DistributionSummary.builder("reindexing.bulk.payload.size")
|
||||
.description("Payload size in bytes")
|
||||
.baseUnit("bytes")
|
||||
.serviceLevelObjectives(
|
||||
64 * 1024d, 256 * 1024d, 1024 * 1024d, 5 * 1024 * 1024d, 20 * 1024 * 1024d)
|
||||
.register(meterRegistry);
|
||||
|
||||
// Pending bulk requests gauge
|
||||
Gauge.builder("reindexing.sink.pending", pendingBulkRequests, AtomicLong::get)
|
||||
.description("In-flight bulk requests")
|
||||
.register(meterRegistry);
|
||||
|
||||
// Backpressure counter
|
||||
this.backpressureEvents =
|
||||
Counter.builder("reindexing.backpressure.events")
|
||||
.description("Backpressure detections")
|
||||
.register(meterRegistry);
|
||||
|
||||
// Vector timeouts counter
|
||||
this.vectorTimeouts =
|
||||
Counter.builder("reindexing.vector.timeouts")
|
||||
.description("Vector embedding completion timeouts")
|
||||
.register(meterRegistry);
|
||||
|
||||
// Queue fill ratio gauge
|
||||
Gauge.builder("reindexing.queue.fill_ratio", queueFillRatio, AtomicLong::get)
|
||||
.description("Task queue fill ratio (0-100)")
|
||||
.register(meterRegistry);
|
||||
|
||||
LOG.info("Reindexing metrics initialized");
|
||||
}
|
||||
|
||||
public static synchronized void initialize(MeterRegistry meterRegistry) {
|
||||
if (instance == null) {
|
||||
instance = new ReindexingMetrics(meterRegistry);
|
||||
}
|
||||
}
|
||||
|
||||
public static ReindexingMetrics getInstance() {
|
||||
return instance;
|
||||
}
|
||||
|
||||
// --- Job lifecycle ---
|
||||
|
||||
public Timer.Sample startJobTimer() {
|
||||
return Timer.start(meterRegistry);
|
||||
}
|
||||
|
||||
public void recordJobStarted() {
|
||||
jobsStarted.increment();
|
||||
activeJobs.incrementAndGet();
|
||||
}
|
||||
|
||||
public void recordJobCompleted(Timer.Sample sample) {
|
||||
jobsCompleted.increment();
|
||||
activeJobs.decrementAndGet();
|
||||
if (sample != null) {
|
||||
sample.stop(jobDurationCompleted);
|
||||
}
|
||||
}
|
||||
|
||||
public void recordJobFailed(Timer.Sample sample) {
|
||||
jobsFailed.increment();
|
||||
activeJobs.decrementAndGet();
|
||||
if (sample != null) {
|
||||
sample.stop(jobDurationFailed);
|
||||
}
|
||||
}
|
||||
|
||||
public void recordJobStopped(Timer.Sample sample) {
|
||||
jobsStopped.increment();
|
||||
activeJobs.decrementAndGet();
|
||||
if (sample != null) {
|
||||
sample.stop(jobDurationStopped);
|
||||
}
|
||||
}
|
||||
|
||||
// --- Stage counters (dynamic tags) ---
|
||||
|
||||
public void recordStageSuccess(String stage, String entityType, long count) {
|
||||
Counter.builder("reindexing.stage.success")
|
||||
.description("Records successfully processed per stage")
|
||||
.tag("stage", stage)
|
||||
.tag("entity_type", entityType)
|
||||
.register(meterRegistry)
|
||||
.increment(count);
|
||||
}
|
||||
|
||||
public void recordStageFailed(String stage, String entityType, long count) {
|
||||
Counter.builder("reindexing.stage.failed")
|
||||
.description("Records failed per stage")
|
||||
.tag("stage", stage)
|
||||
.tag("entity_type", entityType)
|
||||
.register(meterRegistry)
|
||||
.increment(count);
|
||||
}
|
||||
|
||||
public void recordStageWarnings(String stage, String entityType, long count) {
|
||||
Counter.builder("reindexing.stage.warnings")
|
||||
.description("Reader warnings")
|
||||
.tag("stage", stage)
|
||||
.tag("entity_type", entityType)
|
||||
.register(meterRegistry)
|
||||
.increment(count);
|
||||
}
|
||||
|
||||
// --- Bulk request metrics ---
|
||||
|
||||
public Timer.Sample startBulkRequestTimer() {
|
||||
return Timer.start(meterRegistry);
|
||||
}
|
||||
|
||||
public void recordBulkRequestCompleted(Timer.Sample sample, boolean success) {
|
||||
if (sample != null) {
|
||||
sample.stop(success ? bulkDurationSuccess : bulkDurationFailure);
|
||||
}
|
||||
}
|
||||
|
||||
public void recordPayloadSize(long sizeBytes) {
|
||||
bulkPayloadSize.record(sizeBytes);
|
||||
}
|
||||
|
||||
public void incrementPendingBulkRequests() {
|
||||
pendingBulkRequests.incrementAndGet();
|
||||
}
|
||||
|
||||
public void decrementPendingBulkRequests() {
|
||||
pendingBulkRequests.decrementAndGet();
|
||||
}
|
||||
|
||||
// --- Backpressure ---
|
||||
|
||||
public void recordBackpressureEvent() {
|
||||
backpressureEvents.increment();
|
||||
}
|
||||
|
||||
// --- Promotion metrics (dynamic tags) ---
|
||||
|
||||
public void recordPromotionSuccess(String entityType) {
|
||||
Counter.builder("reindexing.promotion")
|
||||
.description("Index promotion events")
|
||||
.tag("entity_type", entityType)
|
||||
.tag("result", "success")
|
||||
.register(meterRegistry)
|
||||
.increment();
|
||||
}
|
||||
|
||||
public void recordPromotionFailure(String entityType) {
|
||||
Counter.builder("reindexing.promotion")
|
||||
.description("Index promotion events")
|
||||
.tag("entity_type", entityType)
|
||||
.tag("result", "failure")
|
||||
.register(meterRegistry)
|
||||
.increment();
|
||||
}
|
||||
|
||||
// --- Circuit breaker ---
|
||||
|
||||
public void recordCircuitBreakerTrip(String transition) {
|
||||
circuitBreakerCounters
|
||||
.computeIfAbsent(
|
||||
transition,
|
||||
t ->
|
||||
Counter.builder("reindexing.circuitbreaker.trips")
|
||||
.description("Circuit breaker state transitions")
|
||||
.tag("transition", t)
|
||||
.register(meterRegistry))
|
||||
.increment();
|
||||
}
|
||||
|
||||
// --- Vector timeouts ---
|
||||
|
||||
public void recordVectorTimeout(int pendingCount) {
|
||||
vectorTimeouts.increment();
|
||||
}
|
||||
|
||||
// --- Queue fill ratio ---
|
||||
|
||||
public void updateQueueFillRatio(int percent) {
|
||||
queueFillRatio.set(percent);
|
||||
}
|
||||
}
|
||||
|
|
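`ReindexingMetrics.getInstance()` returns `null` until `initialize(...)` has been called, which is why the call sites in this diff guard every use with `metrics != null`. A small stdlib-only sketch of that arrangement is shown here, assuming hypothetical names (`MetricsHolderSketch`, `CallSiteSketch`, `send`) and a bare `AtomicLong` in place of the Micrometer gauge; it illustrates the pattern and is not the class above.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical metrics holder: created once, absent (null) until initialize() is called. */
final class MetricsHolderSketch {
  private static volatile MetricsHolderSketch instance;
  private final AtomicLong pending = new AtomicLong();

  private MetricsHolderSketch() {}

  /** Idempotent: a second call keeps the first instance. */
  static synchronized void initialize() {
    if (instance == null) {
      instance = new MetricsHolderSketch();
    }
  }

  /** May return null when metrics were never initialized, for example in unit tests. */
  static MetricsHolderSketch getInstance() {
    return instance;
  }

  void incrementPending() {
    pending.incrementAndGet();
  }

  void decrementPending() {
    pending.decrementAndGet();
  }

  long pending() {
    return pending.get();
  }
}

/** Hypothetical call site that works with or without metrics. */
final class CallSiteSketch {
  private CallSiteSketch() {}

  /** Runs the request and reports success; the pending count is balanced on every path. */
  static boolean send(Runnable request) {
    // Read the singleton once so the increment and decrement hit the same object.
    MetricsHolderSketch metrics = MetricsHolderSketch.getInstance();
    if (metrics != null) {
      metrics.incrementPending();
    }
    try {
      request.run();
      return true;
    } catch (RuntimeException e) {
      return false;
    } finally {
      if (metrics != null) {
        metrics.decrementPending();
      }
    }
  }
}
```

Reading the instance into a local variable once per operation keeps the increment and the decrement paired even if initialization happens while a request is in flight.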
@@ -0,0 +1,459 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import static org.openmetadata.service.apps.scheduler.AppScheduler.ON_DEMAND_JOB;
|
||||
import static org.openmetadata.service.socket.WebSocketManager.SEARCH_INDEX_JOB_BROADCAST_CHANNEL;
|
||||
|
||||
import com.fasterxml.jackson.core.type.TypeReference;
|
||||
import io.micrometer.core.instrument.Timer;
|
||||
import java.util.Collections;
|
||||
import java.util.HashSet;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
import java.util.UUID;
|
||||
import lombok.Getter;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.openmetadata.schema.api.configuration.OpenMetadataBaseUrlConfiguration;
|
||||
import org.openmetadata.schema.entity.app.AppRunRecord;
|
||||
import org.openmetadata.schema.entity.app.FailureContext;
|
||||
import org.openmetadata.schema.entity.app.SuccessContext;
|
||||
import org.openmetadata.schema.settings.Settings;
|
||||
import org.openmetadata.schema.system.EventPublisherJob;
|
||||
import org.openmetadata.schema.system.IndexingError;
|
||||
import org.openmetadata.schema.system.Stats;
|
||||
import org.openmetadata.schema.utils.JsonUtils;
|
||||
import org.openmetadata.service.Entity;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.listeners.LoggingProgressListener;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.listeners.SlackProgressListener;
|
||||
import org.openmetadata.service.jdbi3.CollectionDAO;
|
||||
import org.openmetadata.service.jdbi3.SystemRepository;
|
||||
import org.openmetadata.service.search.SearchRepository;
|
||||
import org.openmetadata.service.socket.WebSocketManager;
|
||||
import org.slf4j.MDC;
|
||||
|
||||
@Slf4j
|
||||
public class ReindexingOrchestrator {
|
||||
private static final String ALL = "all";
|
||||
private final CollectionDAO collectionDAO;
|
||||
private final SearchRepository searchRepository;
|
||||
private final OrchestratorContext context;
|
||||
|
||||
@Getter private EventPublisherJob jobData;
|
||||
private volatile boolean stopped = false;
|
||||
private volatile IndexingStrategy activeStrategy;
|
||||
private volatile Map<String, Object> resultMetadata = Collections.emptyMap();
|
||||
|
||||
public ReindexingOrchestrator(
|
||||
CollectionDAO collectionDAO, SearchRepository searchRepository, OrchestratorContext context) {
|
||||
this.collectionDAO = collectionDAO;
|
||||
this.searchRepository = searchRepository;
|
||||
this.context = context;
|
||||
}
|
||||
|
||||
public void run(EventPublisherJob initialJobData) {
|
||||
this.jobData = initialJobData;
|
||||
initializeState();
|
||||
initializeJobData();
|
||||
|
||||
String jobId = UUID.randomUUID().toString().substring(0, 8);
|
||||
MDC.put("reindexJobId", jobId);
|
||||
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
Timer.Sample timerSample = null;
|
||||
if (metrics != null) {
|
||||
metrics.recordJobStarted();
|
||||
timerSample = metrics.startJobTimer();
|
||||
}
|
||||
|
||||
preflightFixes();
|
||||
|
||||
try {
|
||||
runReindexing();
|
||||
} catch (Exception ex) {
|
||||
handleExecutionException(ex);
|
||||
} finally {
|
||||
finalizeJobExecution();
|
||||
cleanupOrphanedIndices();
|
||||
|
||||
if (metrics != null && timerSample != null) {
|
||||
EventPublisherJob.Status status = jobData != null ? jobData.getStatus() : null;
|
||||
if (status == EventPublisherJob.Status.COMPLETED
|
||||
|| status == EventPublisherJob.Status.ACTIVE_ERROR) {
|
||||
metrics.recordJobCompleted(timerSample);
|
||||
} else if (status == EventPublisherJob.Status.STOPPED) {
|
||||
metrics.recordJobStopped(timerSample);
|
||||
} else {
|
||||
metrics.recordJobFailed(timerSample);
|
||||
}
|
||||
}
|
||||
|
||||
MDC.remove("reindexJobId");
|
||||
}
|
||||
}
|
||||
|
||||
public void stop() {
|
||||
LOG.info("Reindexing job is being stopped.");
|
||||
stopped = true;
|
||||
|
||||
IndexingStrategy strategy = this.activeStrategy;
|
||||
if (strategy != null) {
|
||||
try {
|
||||
strategy.stop();
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error stopping indexing strategy", e);
|
||||
}
|
||||
}
|
||||
|
||||
if (jobData != null) {
|
||||
jobData.setStatus(EventPublisherJob.Status.STOPPED);
|
||||
}
|
||||
|
||||
AppRunRecord appRecord = context.getJobRecord();
|
||||
appRecord.setStatus(AppRunRecord.Status.STOPPED);
|
||||
appRecord.setEndTime(System.currentTimeMillis());
|
||||
context.storeRunRecord(JsonUtils.pojoToJson(appRecord));
|
||||
context.pushStatusUpdate(appRecord, true);
|
||||
sendUpdates();
|
||||
|
||||
LOG.info("Reindexing job stopped successfully.");
|
||||
}
|
||||
|
||||
private void initializeState() {
|
||||
stopped = false;
|
||||
activeStrategy = null;
|
||||
resultMetadata = Collections.emptyMap();
|
||||
}
|
||||
|
||||
private void initializeJobData() {
|
||||
if (jobData == null) {
|
||||
jobData = loadJobData();
|
||||
}
|
||||
|
||||
String jobName = context.getJobName();
|
||||
if (jobName.equals(ON_DEMAND_JOB)) {
|
||||
Map<String, Object> jsonAppConfig =
|
||||
JsonUtils.convertValue(jobData, new TypeReference<Map<String, Object>>() {});
|
||||
context.updateAppConfiguration(jsonAppConfig);
|
||||
}
|
||||
}
|
||||
|
||||
private EventPublisherJob loadJobData() {
|
||||
String appConfigJson = context.getAppConfigJson();
|
||||
if (appConfigJson != null) {
|
||||
return JsonUtils.readValue(appConfigJson, EventPublisherJob.class);
|
||||
}
|
||||
|
||||
Map<String, Object> appConfig = context.getAppConfiguration();
|
||||
if (appConfig != null) {
|
||||
return JsonUtils.convertValue(appConfig, EventPublisherJob.class);
|
||||
}
|
||||
|
||||
LOG.error("Unable to initialize jobData from JobDataMap or App configuration");
|
||||
throw new SearchIndexApp.ReindexingException("JobData is not initialized");
|
||||
}
|
||||
|
||||
private void preflightFixes() {
|
||||
LOG.info("Running preflight fixes before reindexing");
|
||||
markStaleRunningJobsStopped();
|
||||
cleanupOrphanedIndicesPreFlight();
|
||||
}
|
||||
|
||||
private static final String APP_NAME = "SearchIndexingApplication";
|
||||
|
||||
private void markStaleRunningJobsStopped() {
|
||||
try {
|
||||
AppRunRecord currentRecord = context.getJobRecord();
|
||||
if (currentRecord != null && currentRecord.getStartTime() != null) {
|
||||
collectionDAO
|
||||
.appExtensionTimeSeriesDao()
|
||||
.markStaleEntriesStoppedBefore(APP_NAME, currentRecord.getStartTime());
|
||||
LOG.info("Preflight: marked stale running jobs as stopped for {}", APP_NAME);
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.warn("Preflight: failed to cleanup stale running jobs: {}", e.getMessage());
|
||||
}
|
||||
}
|
||||
|
||||
private void cleanupOrphanedIndicesPreFlight() {
|
||||
try {
|
||||
OrphanedIndexCleaner cleaner = new OrphanedIndexCleaner();
|
||||
OrphanedIndexCleaner.CleanupResult result =
|
||||
cleaner.cleanupOrphanedIndices(searchRepository.getSearchClient());
|
||||
if (result.found() > 0) {
|
||||
LOG.info(
|
||||
"Preflight: cleaned up {} orphaned rebuild indices (found={}, failed={})",
|
||||
result.deleted(),
|
||||
result.found(),
|
||||
result.failed());
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.warn("Preflight: failed to cleanup orphaned indices: {}", e.getMessage());
|
||||
}
|
||||
}
|
||||
|
||||
private void runReindexing() throws Exception {
|
||||
if (jobData.getEntities() == null || jobData.getEntities().isEmpty()) {
|
||||
LOG.info("No entities selected for reindexing, completing immediately");
|
||||
jobData.setStatus(EventPublisherJob.Status.COMPLETED);
|
||||
jobData.setStats(new Stats());
|
||||
return;
|
||||
}
|
||||
|
||||
setupEntities();
|
||||
cleanupOldFailures();
|
||||
|
||||
LOG.info(
|
||||
"Search Index Job Started for Entities: {}, RecreateIndex: {}, DistributedIndexing: {}",
|
||||
jobData.getEntities(),
|
||||
jobData.getRecreateIndex(),
|
||||
jobData.getUseDistributedIndexing());
|
||||
|
||||
activeStrategy = createStrategy();
|
||||
|
||||
activeStrategy.addListener(context.createProgressListener(jobData));
|
||||
activeStrategy.addListener(new LoggingProgressListener());
|
||||
|
||||
if (hasSlackConfig()) {
|
||||
String instanceUrl = getInstanceUrl();
|
||||
activeStrategy.addListener(
|
||||
new SlackProgressListener(
|
||||
jobData.getSlackBotToken(), jobData.getSlackChannel(), instanceUrl));
|
||||
}
|
||||
|
||||
ReindexingJobContext jobContext =
|
||||
context.createReindexingContext(Boolean.TRUE.equals(jobData.getUseDistributedIndexing()));
|
||||
|
||||
ReindexingConfiguration config = ReindexingConfiguration.from(jobData);
|
||||
config = ReindexingConfiguration.applyAutoTuning(config, searchRepository);
|
||||
|
||||
ExecutionResult result = activeStrategy.execute(config, jobContext);
|
||||
updateJobDataFromResult(result);
|
||||
|
||||
if (jobData.getStats() != null) {
|
||||
context.storeRunStats(jobData.getStats());
|
||||
}
|
||||
|
||||
if (!result.metadata().isEmpty()) {
|
||||
saveResultMetadataToJobRecord(result.metadata());
|
||||
}
|
||||
}
|
||||
|
||||
private IndexingStrategy createStrategy() {
|
||||
if (Boolean.TRUE.equals(jobData.getUseDistributedIndexing())) {
|
||||
AppRunRecord appRecord = context.getJobRecord();
|
||||
return new DistributedIndexingStrategy(
|
||||
collectionDAO,
|
||||
searchRepository,
|
||||
jobData,
|
||||
appRecord.getAppId(),
|
||||
appRecord.getStartTime(),
|
||||
context.getJobName());
|
||||
}
|
||||
return new SingleServerIndexingStrategy(collectionDAO, searchRepository);
|
||||
}
|
||||
|
||||
  private void updateJobDataFromResult(ExecutionResult result) {
    if (result.finalStats() != null) {
      Stats stats = result.finalStats();
      StatsReconciler.reconcile(stats);
      jobData.setStats(stats);
    }

    resultMetadata = result.metadata() != null ? result.metadata() : Collections.emptyMap();

    switch (result.status()) {
      case COMPLETED -> jobData.setStatus(EventPublisherJob.Status.COMPLETED);
      case COMPLETED_WITH_ERRORS -> jobData.setStatus(EventPublisherJob.Status.ACTIVE_ERROR);
      case FAILED -> jobData.setStatus(EventPublisherJob.Status.FAILED);
      case STOPPED -> jobData.setStatus(EventPublisherJob.Status.STOPPED);
    }
  }

  private void saveResultMetadataToJobRecord(Map<String, Object> metadata) {
    try {
      AppRunRecord appRecord = context.getJobRecord();
      SuccessContext successContext = appRecord.getSuccessContext();
      if (successContext == null) {
        successContext = new SuccessContext();
      }

      for (Map.Entry<String, Object> entry : metadata.entrySet()) {
        successContext.withAdditionalProperty(entry.getKey(), entry.getValue());
      }

      if (jobData.getStats() != null) {
        successContext.withAdditionalProperty("stats", jobData.getStats());
      }

      appRecord.setSuccessContext(successContext);
      context.storeRunRecord(JsonUtils.pojoToJson(appRecord));
    } catch (Exception e) {
      LOG.error("Failed to save result metadata to job record", e);
    }
  }

  private void handleExecutionException(Exception ex) {
    IndexingStrategy strategy = this.activeStrategy;
    if (strategy != null && jobData != null) {
      try {
        strategy.getStats().ifPresent(jobData::setStats);
      } catch (Exception e) {
        LOG.debug("Could not capture strategy stats during exception handling", e);
      }
    }

    if (stopped) {
      if (jobData != null) {
        jobData.setStatus(EventPublisherJob.Status.STOPPED);
      }
    } else {
      IndexingError error =
          new IndexingError()
              .withErrorSource(IndexingError.ErrorSource.JOB)
              .withMessage("Reindexing Job Exception: " + ex.getMessage());
      LOG.error("Reindexing Job Failed", ex);

      if (jobData != null) {
        jobData.setStatus(EventPublisherJob.Status.FAILED);
        jobData.setFailure(error);
      }
    }
  }

  private void finalizeJobExecution() {
    sendUpdates();

    if (stopped) {
      AppRunRecord appRecord = context.getJobRecord();
      appRecord.setStatus(AppRunRecord.Status.STOPPED);
      context.storeRunRecord(JsonUtils.pojoToJson(appRecord));
    }
  }

  private void sendUpdates() {
    try {
      updateRecordToDbAndNotify();
    } catch (Exception ex) {
      LOG.error("Failed to send updates", ex);
    }
  }

  private void updateRecordToDbAndNotify() {
    AppRunRecord appRecord = context.getJobRecord();
    appRecord.setStatus(AppRunRecord.Status.fromValue(jobData.getStatus().value()));

    if (jobData.getFailure() != null) {
      appRecord.setFailureContext(
          new FailureContext().withAdditionalProperty("failure", jobData.getFailure()));
    }

    if (jobData.getStats() != null) {
      SuccessContext successContext =
          new SuccessContext().withAdditionalProperty("stats", jobData.getStats());

      String distributedJobId = (String) resultMetadata.get("distributedJobId");

      try {
        UUID appId = context.getAppId();
        String jobIdStr =
            distributedJobId != null ? distributedJobId : (appId != null ? appId.toString() : null);
        if (jobIdStr != null) {
          int failureCount = collectionDAO.searchIndexFailureDAO().countByJobId(jobIdStr);
          if (failureCount > 0) {
            successContext.withAdditionalProperty("failureRecordCount", failureCount);
          }
        }
      } catch (Exception e) {
        LOG.debug("Could not get failure count", e);
      }

      Object serverStats = resultMetadata.get("serverStats");
      if (serverStats != null) {
        successContext.withAdditionalProperty("serverStats", serverStats);
        successContext.withAdditionalProperty("serverCount", resultMetadata.get("serverCount"));
        successContext.withAdditionalProperty("distributedJobId", distributedJobId);
      }

      appRecord.setSuccessContext(successContext);
    }

    if (WebSocketManager.getInstance() != null) {
      String messageJson = JsonUtils.pojoToJson(appRecord);
      WebSocketManager.getInstance()
          .broadCastMessageToAll(SEARCH_INDEX_JOB_BROADCAST_CHANNEL, messageJson);
    }
  }

  private void cleanupOldFailures() {
    try {
      int deleted = collectionDAO.searchIndexFailureDAO().deleteAll();
      if (deleted > 0) {
        LOG.info("Cleaned up {} failure records from previous runs", deleted);
      }
    } catch (Exception e) {
      LOG.warn("Failed to cleanup old failure records", e);
    }
  }

  private void cleanupOrphanedIndices() {
    try {
      OrphanedIndexCleaner cleaner = new OrphanedIndexCleaner();
      OrphanedIndexCleaner.CleanupResult result =
          cleaner.cleanupOrphanedIndices(searchRepository.getSearchClient());
      if (result.deleted() > 0) {
        LOG.info(
            "Cleaned up {} orphaned rebuild indices on Job End (found={}, failed={})",
            result.deleted(),
            result.found(),
            result.failed());
      }
    } catch (Exception e) {
      LOG.warn("Failed to cleanup orphaned indices on Job End: {}", e.getMessage());
    }
  }

  private void setupEntities() {
    boolean containsAll = jobData.getEntities().contains(ALL);
    if (containsAll) {
      jobData.setEntities(getAll());
    }
  }

  private Set<String> getAll() {
    Set<String> entities =
        new HashSet<>(
            Entity.getEntityList().stream()
                .filter(t -> searchRepository.getEntityIndexMap().containsKey(t))
                .toList());
    entities.addAll(
        SearchIndexApp.TIME_SERIES_ENTITIES.stream()
            .filter(t -> searchRepository.getEntityIndexMap().containsKey(t))
            .toList());
    return entities;
  }

  private boolean hasSlackConfig() {
    return jobData.getSlackBotToken() != null
        && !jobData.getSlackBotToken().isEmpty()
        && jobData.getSlackChannel() != null
        && !jobData.getSlackChannel().isEmpty();
  }

  private String getInstanceUrl() {
    try {
      SystemRepository systemRepository = Entity.getSystemRepository();
      if (systemRepository != null) {
        Settings settings = systemRepository.getOMBaseUrlConfigInternal();
        if (settings != null && settings.getConfigValue() != null) {
          OpenMetadataBaseUrlConfiguration urlConfig =
              (OpenMetadataBaseUrlConfiguration) settings.getConfigValue();
          if (urlConfig != null && urlConfig.getOpenMetadataUrl() != null) {
            return urlConfig.getOpenMetadataUrl();
          }
        }
      }
    } catch (Exception e) {
      LOG.debug("Could not get instance URL from SystemSettings", e);
    }
    return "http://localhost:8585";
  }
}

@@ -18,6 +18,15 @@ import org.openmetadata.schema.system.StepStats;
 */
public interface ReindexingProgressListener {

  /** Failure type classification for per-stage failure hooks */
  enum FailureType {
    ENTITY_NOT_FOUND,
    JSON_PROCESSING,
    DB_ERROR,
    SINK_ERROR,
    UNKNOWN
  }

  /** Called when reindexing job starts initialization */
  default void onJobStarted(ReindexingJobContext context) {}

@@ -39,6 +48,20 @@ public interface ReindexingProgressListener {
  /** Called when an error occurs during processing */
  default void onError(String entityType, IndexingError error, Stats currentStats) {}

  /** Called when a reader-stage failure occurs for a specific entity */
  default void onReaderFailure(
      String entityType, String entityId, String error, FailureType type) {}

  /** Called when a process-stage failure occurs (entity -> search doc conversion) */
  default void onProcessFailure(String entityType, String entityId, String error) {}

  /** Called when a sink-stage failure occurs (ES/OS bulk indexing) */
  default void onSinkFailure(String entityType, String entityId, String error) {}

  /** Called when sub-indexing (columns, vectors) completes for an entity type */
  default void onSubIndexingCompleted(
      String entityType, String subIndex, StepStats subIndexStats) {}

  /** Called when job completes successfully */
  default void onJobCompleted(Stats finalStats, long elapsedMillis) {}

@@ -1,63 +1,24 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.openmetadata.common.utils.CommonUtil.listOrEmpty;
import static org.openmetadata.service.Entity.QUERY_COST_RECORD;
import static org.openmetadata.service.Entity.TEST_CASE_RESOLUTION_STATUS;
import static org.openmetadata.service.Entity.TEST_CASE_RESULT;
import static org.openmetadata.service.apps.scheduler.AppScheduler.ON_DEMAND_JOB;
import static org.openmetadata.service.apps.scheduler.OmAppJobListener.APP_CONFIG;
import static org.openmetadata.service.apps.scheduler.OmAppJobListener.APP_RUN_STATS;
import static org.openmetadata.service.socket.WebSocketManager.SEARCH_INDEX_JOB_BROADCAST_CHANNEL;

import com.fasterxml.jackson.core.type.TypeReference;
import jakarta.ws.rs.core.Response;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.openmetadata.schema.analytics.ReportData;
import org.openmetadata.schema.api.configuration.OpenMetadataBaseUrlConfiguration;
import org.openmetadata.schema.entity.app.App;
import org.openmetadata.schema.entity.app.AppRunRecord;
import org.openmetadata.schema.entity.app.FailureContext;
import org.openmetadata.schema.entity.app.SuccessContext;
import org.openmetadata.schema.settings.Settings;
import org.openmetadata.schema.system.EventPublisherJob;
import org.openmetadata.schema.system.IndexingError;
import org.openmetadata.schema.system.Stats;
import org.openmetadata.schema.system.StepStats;
import org.openmetadata.schema.utils.JsonUtils;
import org.openmetadata.service.Entity;
import org.openmetadata.service.apps.AbstractNativeApplication;
import org.openmetadata.service.apps.bundles.searchIndex.distributed.DistributedSearchIndexExecutor;
import org.openmetadata.service.apps.bundles.searchIndex.distributed.IndexJobStatus;
import org.openmetadata.service.apps.bundles.searchIndex.distributed.SearchIndexJob;
import org.openmetadata.service.apps.bundles.searchIndex.listeners.LoggingProgressListener;
import org.openmetadata.service.apps.bundles.searchIndex.listeners.QuartzProgressListener;
import org.openmetadata.service.apps.bundles.searchIndex.listeners.SlackProgressListener;
import org.openmetadata.service.exception.AppException;
import org.openmetadata.service.jdbi3.CollectionDAO;
import org.openmetadata.service.jdbi3.EntityTimeSeriesRepository;
import org.openmetadata.service.jdbi3.ListFilter;
import org.openmetadata.service.jdbi3.SystemRepository;
import org.openmetadata.service.search.RecreateIndexHandler;
import org.openmetadata.service.search.ReindexContext;
import org.openmetadata.service.search.SearchRepository;
import org.openmetadata.service.search.vector.VectorIndexService;
import org.openmetadata.service.socket.WebSocketManager;
import org.openmetadata.service.util.FullyQualifiedName;
import org.quartz.JobExecutionContext;

/**
 * Quartz-scheduled application for reindexing search indices. This class handles the Quartz
 * integration and delegates core reindexing logic to SearchIndexExecutor.
 */
@Slf4j
public class SearchIndexApp extends AbstractNativeApplication {

@@ -71,10 +32,6 @@ public class SearchIndexApp extends AbstractNativeApplication {
    }
  }

  private static final String ALL = "all";
  private static final String APP_SCHEDULE_RUN = "AppScheduleRun";
  private static final long WEBSOCKET_UPDATE_INTERVAL_MS = 2000;

  public static final Set<String> TIME_SERIES_ENTITIES =
      Set.of(
          ReportData.ReportDataType.ENTITY_REPORT_DATA.value(),
@@ -87,13 +44,7 @@ public class SearchIndexApp extends AbstractNativeApplication {
          QUERY_COST_RECORD);

  @Getter private EventPublisherJob jobData;
  private JobExecutionContext jobExecutionContext;
  private volatile boolean stopped = false;
  private SearchIndexExecutor executor;
  private DistributedSearchIndexExecutor distributedExecutor;
  private ReindexContext recreateContext;
  private RecreateIndexHandler recreateIndexHandler;
  private volatile BulkSink searchIndexSink;
  private volatile ReindexingOrchestrator orchestrator;

  public SearchIndexApp(CollectionDAO collectionDAO, SearchRepository searchRepository) {
    super(collectionDAO, searchRepository);
@@ -105,884 +56,25 @@ public class SearchIndexApp extends AbstractNativeApplication {
    jobData = JsonUtils.convertValue(app.getAppConfiguration(), EventPublisherJob.class);
  }

  private void cleanupOrphanedIndices() {
    try {
      OrphanedIndexCleaner cleaner = new OrphanedIndexCleaner();
      OrphanedIndexCleaner.CleanupResult result =
          cleaner.cleanupOrphanedIndices(searchRepository.getSearchClient());
      if (result.deleted() > 0) {
        LOG.info(
            "Cleaned up {} orphaned rebuild indices on Job End (found={}, failed={})",
            result.deleted(),
            result.found(),
            result.failed());
      }
    } catch (Exception e) {
      LOG.warn("Failed to cleanup orphaned indices on Job End: {}", e.getMessage());
    }
  }

  @Override
  public void execute(JobExecutionContext jobExecutionContext) {
    this.jobExecutionContext = jobExecutionContext;
    initializeJobState();
    initializeJobData(jobExecutionContext);

    try {
      runReindexing(jobExecutionContext);
    } catch (Exception ex) {
      handleExecutionException(ex);
    } finally {
      finalizeJobExecution(jobExecutionContext);
      cleanupOrphanedIndices();
    }
  }

  private void initializeJobState() {
    stopped = false;
    recreateContext = null;
  }

  private void initializeJobData(JobExecutionContext jobExecutionContext) {
    if (jobData == null) {
      jobData = loadJobData(jobExecutionContext);
    }

    String jobName = jobExecutionContext.getJobDetail().getKey().getName();
    if (jobName.equals(ON_DEMAND_JOB)) {
      Map<String, Object> jsonAppConfig =
          JsonUtils.convertValue(jobData, new TypeReference<Map<String, Object>>() {});
      getApp().setAppConfiguration(jsonAppConfig);
    }
  }

  private EventPublisherJob loadJobData(JobExecutionContext jobExecutionContext) {
    String appConfigJson =
        (String) jobExecutionContext.getJobDetail().getJobDataMap().get(APP_CONFIG);
    if (appConfigJson != null) {
      return JsonUtils.readValue(appConfigJson, EventPublisherJob.class);
    }

    if (getApp() != null && getApp().getAppConfiguration() != null) {
      return JsonUtils.convertValue(getApp().getAppConfiguration(), EventPublisherJob.class);
    }

    LOG.error("Unable to initialize jobData from JobDataMap or App configuration");
    throw new ReindexingException("JobData is not initialized");
  }

  private void cleanupOldFailures() {
    try {
      // Delete all previous failure records - we only keep failures for the current run
      int deleted = collectionDAO.searchIndexFailureDAO().deleteAll();
      if (deleted > 0) {
        LOG.info("Cleaned up {} failure records from previous runs", deleted);
      }
    } catch (Exception e) {
      LOG.warn("Failed to cleanup old failure records", e);
    }
  }

  private void runReindexing(JobExecutionContext jobExecutionContext) throws Exception {
    boolean success = false;
    try {
      if (jobData.getEntities() == null || jobData.getEntities().isEmpty()) {
        LOG.info("No entities selected for reindexing, completing immediately");
        jobData.setStatus(EventPublisherJob.Status.COMPLETED);
        jobData.setStats(new Stats());
        success = true;
        return;
      }

      setupEntities();
      cleanupOldFailures();

      LOG.info(
          "Search Index Job Started for Entities: {}, RecreateIndex: {}, DistributedIndexing: {}",
          jobData.getEntities(),
          jobData.getRecreateIndex(),
          jobData.getUseDistributedIndexing());

      if (Boolean.TRUE.equals(jobData.getUseDistributedIndexing())) {
        runDistributedReindexing(jobExecutionContext);
        success = jobData != null && jobData.getStatus() == EventPublisherJob.Status.COMPLETED;
      } else {
        ExecutionResult result = runSingleServerReindexing(jobExecutionContext);
        success = result.isSuccessful();
        updateJobDataFromResult(result);
      }
    } finally {
      finalizeAllEntityReindex(success);
    }
  }

  private ExecutionResult runSingleServerReindexing(JobExecutionContext jobExecutionContext) {
    executor = new SearchIndexExecutor(collectionDAO, searchRepository);

    QuartzProgressListener quartzListener =
        new QuartzProgressListener(jobExecutionContext, jobData, getApp());
    executor.addListener(quartzListener);
    executor.addListener(new LoggingProgressListener());

    if (hasSlackConfig()) {
      String instanceUrl = getInstanceUrl();
      executor.addListener(
          new SlackProgressListener(
              jobData.getSlackBotToken(), jobData.getSlackChannel(), instanceUrl));
    }

    ReindexingJobContext context =
        new QuartzJobContext(
            jobExecutionContext,
            getApp(),
            Boolean.TRUE.equals(jobData.getUseDistributedIndexing()));

    ReindexingConfiguration config = ReindexingConfiguration.from(jobData);

    return executor.execute(config, context);
  }

  private void updateJobDataFromResult(ExecutionResult result) {
    if (result.finalStats() != null) {
      Stats stats = result.finalStats();
      StatsReconciler.reconcile(stats);
      jobData.setStats(stats);
    }

    switch (result.status()) {
      case COMPLETED -> jobData.setStatus(EventPublisherJob.Status.COMPLETED);
      case COMPLETED_WITH_ERRORS -> jobData.setStatus(EventPublisherJob.Status.ACTIVE_ERROR);
      case FAILED -> jobData.setStatus(EventPublisherJob.Status.FAILED);
      case STOPPED -> jobData.setStatus(EventPublisherJob.Status.STOPPED);
    }
  }

  private boolean hasSlackConfig() {
    return jobData.getSlackBotToken() != null
        && !jobData.getSlackBotToken().isEmpty()
        && jobData.getSlackChannel() != null
        && !jobData.getSlackChannel().isEmpty();
  }

  private String getInstanceUrl() {
    try {
      SystemRepository systemRepository = Entity.getSystemRepository();
      if (systemRepository != null) {
        Settings settings = systemRepository.getOMBaseUrlConfigInternal();
        if (settings != null && settings.getConfigValue() != null) {
          OpenMetadataBaseUrlConfiguration urlConfig =
              (OpenMetadataBaseUrlConfiguration) settings.getConfigValue();
          if (urlConfig != null && urlConfig.getOpenMetadataUrl() != null) {
            return urlConfig.getOpenMetadataUrl();
          }
        }
      }
    } catch (Exception e) {
      LOG.debug("Could not get instance URL from SystemSettings", e);
    }
    return "http://localhost:8585";
  }

  // ========== Distributed Mode ==========

  private void runDistributedReindexing(JobExecutionContext jobExecutionContext) throws Exception {
    LOG.info("Starting distributed reindexing for entities: {}", jobData.getEntities());

    Stats stats = initializeTotalRecords(jobData.getEntities());
    jobData.setStats(stats);

    int partitionSize = jobData.getPartitionSize() != null ? jobData.getPartitionSize() : 10000;
    distributedExecutor = new DistributedSearchIndexExecutor(collectionDAO, partitionSize);
    distributedExecutor.performStartupRecovery();

    // Add listeners for distributed mode (same as single-server mode)
    distributedExecutor.addListener(new LoggingProgressListener());
    if (hasSlackConfig()) {
      String instanceUrl = getInstanceUrl();
      distributedExecutor.addListener(
          new SlackProgressListener(
              jobData.getSlackBotToken(), jobData.getSlackChannel(), instanceUrl));
    }

    String createdBy = jobExecutionContext.getJobDetail().getKey().getName();
    SearchIndexJob distributedJob =
        distributedExecutor.createJob(jobData.getEntities(), jobData, createdBy);

    LOG.info(
        "Created distributed job {} with {} total records",
        distributedJob.getId(),
        distributedJob.getTotalRecords());

    this.searchIndexSink =
        searchRepository.createBulkSink(
            jobData.getBatchSize(), jobData.getMaxConcurrentRequests(), jobData.getPayLoadSize());
    this.recreateIndexHandler = searchRepository.createReindexHandler();

    if (Boolean.TRUE.equals(jobData.getRecreateIndex())) {
      recreateContext = recreateIndexHandler.reCreateIndexes(jobData.getEntities());
      // Share staged index mapping with participant servers
      if (recreateContext != null && !recreateContext.isEmpty()) {
        distributedExecutor.updateStagedIndexMapping(recreateContext.getStagedIndexMapping());
      }
    }

    updateJobStatus(EventPublisherJob.Status.RUNNING);
    sendUpdates(jobExecutionContext, true);

    AppRunRecord appRecord = getJobRecord(jobExecutionContext);
    distributedExecutor.setAppContext(appRecord.getAppId(), appRecord.getStartTime());

    distributedExecutor.execute(
        searchIndexSink, recreateContext, Boolean.TRUE.equals(jobData.getRecreateIndex()));
    monitorDistributedJob(jobExecutionContext, distributedJob.getId());

    if (searchIndexSink != null) {
      // Wait for vector embedding tasks to complete before closing
      int pendingVectorTasks = searchIndexSink.getPendingVectorTaskCount();
      if (pendingVectorTasks > 0) {
        LOG.info("Waiting for {} pending vector embedding tasks to complete", pendingVectorTasks);
        boolean vectorComplete = searchIndexSink.awaitVectorCompletion(120);
        if (!vectorComplete) {
          LOG.warn("Vector embedding wait timed out - some tasks may not be reflected in stats");
        }
      }

      // Flush and wait for pending bulk requests
      LOG.info("Flushing sink and waiting for pending bulk requests");
      boolean flushComplete = searchIndexSink.flushAndAwait(60);
      if (!flushComplete) {
        LOG.warn("Sink flush timed out - some requests may not be reflected in stats");
      }

      searchIndexSink.close();
    }

    SearchIndexJob finalJob = distributedExecutor.getJobWithFreshStats();
    if (finalJob != null) {
      // Use actual sink stats for accurate success/failure counts
      // The partition-based stats may be inaccurate because the bulk sink is asynchronous
      StepStats sinkStats = searchIndexSink != null ? searchIndexSink.getStats() : null;
      updateJobDataFromDistributedJob(finalJob, sinkStats);

      // Set vector stats directly from the bulk sink since the sink tracks vector
      // success/failure internally and these may not be fully reflected in server stats
      if (searchIndexSink != null && jobData.getStats() != null) {
        StepStats sinkVectorStats = searchIndexSink.getVectorStats();
        if (sinkVectorStats != null && sinkVectorStats.getTotalRecords() > 0) {
          jobData.getStats().setVectorStats(sinkVectorStats);
        }
      }

      saveServerStatsToJobDataMap(jobExecutionContext, finalJob);
    }

    // Save stats to APP_RUN_STATS for OmAppJobListener to pick up
    // This is required because distributed mode doesn't use QuartzProgressListener
    if (jobData.getStats() != null) {
      jobExecutionContext.getJobDetail().getJobDataMap().put(APP_RUN_STATS, jobData.getStats());
    }

    updateFinalJobStatus();
  }

  private void monitorDistributedJob(
      JobExecutionContext jobExecutionContext, java.util.UUID jobId) {
    CountDownLatch completionLatch = new CountDownLatch(1);
    ScheduledExecutorService monitor =
        Executors.newSingleThreadScheduledExecutor(
            Thread.ofPlatform().name("distributed-monitor").factory());

    try {
      monitor.scheduleAtFixedRate(
          () -> {
            if (stopped) {
              LOG.info("Stop signal received, stopping distributed job");
              distributedExecutor.stop();
              completionLatch.countDown();
              return;
            }

            SearchIndexJob job = distributedExecutor.getJobWithFreshStats();
            if (job == null) {
              completionLatch.countDown();
              return;
            }

            IndexJobStatus status = job.getStatus();
            if (status == IndexJobStatus.COMPLETED
                || status == IndexJobStatus.COMPLETED_WITH_ERRORS
                || status == IndexJobStatus.FAILED
                || status == IndexJobStatus.STOPPED) {
              LOG.info("Distributed job {} completed with status: {}", jobId, status);
              completionLatch.countDown();
              return;
            }

            updateJobDataFromDistributedJob(job);
          },
          0,
          WEBSOCKET_UPDATE_INTERVAL_MS,
          TimeUnit.MILLISECONDS);

      completionLatch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      LOG.warn("Distributed job monitoring interrupted");
    } finally {
      monitor.shutdownNow();
    }
  }

  private void updateJobDataFromDistributedJob(SearchIndexJob distributedJob) {
    updateJobDataFromDistributedJob(distributedJob, null);
  }

  private void updateJobDataFromDistributedJob(
      SearchIndexJob distributedJob, StepStats actualSinkStats) {
    Stats stats = jobData.getStats();
    if (stats == null) {
      return;
    }

    // Fetch aggregated server stats once for accurate reader/sink breakdown
    CollectionDAO.SearchIndexServerStatsDAO.AggregatedServerStats serverStatsAggr = null;
    try {
      serverStatsAggr =
          Entity.getCollectionDAO()
              .searchIndexServerStatsDAO()
              .getAggregatedStats(distributedJob.getId().toString());
      if (serverStatsAggr != null) {
        LOG.info(
            "Fetched aggregated server stats for job {}: readerSuccess={}, readerFailed={}, "
                + "sinkSuccess={}, sinkFailed={}",
            distributedJob.getId(),
            serverStatsAggr.readerSuccess(),
            serverStatsAggr.readerFailed(),
            serverStatsAggr.sinkSuccess(),
            serverStatsAggr.sinkFailed());
      }
    } catch (Exception e) {
      LOG.debug("Could not fetch aggregated server stats for job {}", distributedJob.getId(), e);
    }

    // Determine success/failed from best available source
    long successRecords;
    long failedRecords;
    String statsSource;

    if (serverStatsAggr != null && serverStatsAggr.sinkSuccess() > 0) {
      // Use server stats table (most accurate)
      // processFailed = records that read successfully but failed during doc building
      successRecords = serverStatsAggr.sinkSuccess();
      failedRecords =
          serverStatsAggr.readerFailed()
              + serverStatsAggr.sinkFailed()
              + serverStatsAggr.processFailed();
      statsSource = "serverStatsTable";
    } else if (actualSinkStats != null) {
      // Use local sink stats (single server scenario)
      successRecords = actualSinkStats.getSuccessRecords();
      failedRecords = actualSinkStats.getFailedRecords();
      statsSource = "localSink";
    } else {
      // Fallback to partition-based stats
      successRecords = distributedJob.getSuccessRecords();
      failedRecords = distributedJob.getFailedRecords();
      statsSource = "partition-based";
    }

    LOG.debug(
        "Stats source: {}, success={}, failed={}", statsSource, successRecords, failedRecords);

    StepStats jobStats = stats.getJobStats();
    if (jobStats != null) {
      jobStats.setSuccessRecords((int) successRecords);
      jobStats.setFailedRecords((int) failedRecords);
    }

    StepStats readerStats = stats.getReaderStats();
    if (readerStats != null) {
      readerStats.setTotalRecords((int) distributedJob.getTotalRecords());
      long readerFailed = serverStatsAggr != null ? serverStatsAggr.readerFailed() : 0;
      long readerWarnings = serverStatsAggr != null ? serverStatsAggr.readerWarnings() : 0;
      long readerSuccess =
          serverStatsAggr != null
              ? serverStatsAggr.readerSuccess()
              : distributedJob.getTotalRecords() - readerFailed - readerWarnings;
      readerStats.setSuccessRecords((int) readerSuccess);
      readerStats.setFailedRecords((int) readerFailed);
      readerStats.setWarningRecords((int) readerWarnings);
    }

    // Process stats - document building stage
    StepStats processStats = stats.getProcessStats();
    if (processStats != null && serverStatsAggr != null) {
      long processSuccess = serverStatsAggr.processSuccess();
      long processFailed = serverStatsAggr.processFailed();
      processStats.setTotalRecords((int) (processSuccess + processFailed));
      processStats.setSuccessRecords((int) processSuccess);
      processStats.setFailedRecords((int) processFailed);
    }

    StepStats sinkStats = stats.getSinkStats();
    if (sinkStats != null) {
      if (serverStatsAggr != null) {
        // Use actual sink stats from the database
        long sinkSuccess = serverStatsAggr.sinkSuccess();
        long sinkFailed = serverStatsAggr.sinkFailed();

        // sinkTotal = docs submitted to ES = sinkSuccess + sinkFailed
        long actualSinkTotal = sinkSuccess + sinkFailed;

        sinkStats.setTotalRecords((int) actualSinkTotal);
        sinkStats.setSuccessRecords((int) sinkSuccess);
        sinkStats.setFailedRecords((int) sinkFailed);
      } else {
        // Fallback: derive from reader stats (less accurate)
        long readerFailed = 0;
        long sinkTotal = distributedJob.getTotalRecords() - readerFailed;
        sinkStats.setTotalRecords((int) sinkTotal);
        sinkStats.setSuccessRecords((int) successRecords);
        sinkStats.setFailedRecords((int) failedRecords);
      }
    }

    // Vector stats - embedding generation stage
    StepStats vectorStats = stats.getVectorStats();
    if (vectorStats != null && serverStatsAggr != null) {
      long vectorSuccess = serverStatsAggr.vectorSuccess();
      long vectorFailed = serverStatsAggr.vectorFailed();
      vectorStats.setTotalRecords((int) (vectorSuccess + vectorFailed));
      vectorStats.setSuccessRecords((int) vectorSuccess);
      vectorStats.setFailedRecords((int) vectorFailed);
    }

    if (distributedJob.getEntityStats() != null && stats.getEntityStats() != null) {
      for (Map.Entry<String, SearchIndexJob.EntityTypeStats> entry :
          distributedJob.getEntityStats().entrySet()) {
        StepStats entityStats =
            stats.getEntityStats().getAdditionalProperties().get(entry.getKey());
        if (entityStats != null) {
          entityStats.setSuccessRecords((int) entry.getValue().getSuccessRecords());
          entityStats.setFailedRecords((int) entry.getValue().getFailedRecords());
        }
      }
    }

    StatsReconciler.reconcile(stats);

    switch (distributedJob.getStatus()) {
      case COMPLETED -> jobData.setStatus(EventPublisherJob.Status.COMPLETED);
      case COMPLETED_WITH_ERRORS -> jobData.setStatus(EventPublisherJob.Status.ACTIVE_ERROR);
      case FAILED -> jobData.setStatus(EventPublisherJob.Status.FAILED);
      case STOPPING, STOPPED -> jobData.setStatus(EventPublisherJob.Status.STOPPED);
      default -> jobData.setStatus(EventPublisherJob.Status.RUNNING);
    }
  }

private void saveServerStatsToJobDataMap(
|
||||
JobExecutionContext jobExecutionContext, SearchIndexJob distributedJob) {
|
||||
try {
|
||||
AppRunRecord appRecord = getJobRecord(jobExecutionContext);
|
||||
SuccessContext successContext = appRecord.getSuccessContext();
|
||||
if (successContext == null) {
|
||||
successContext = new SuccessContext();
|
||||
}
|
||||
|
||||
if (distributedJob.getServerStats() != null && !distributedJob.getServerStats().isEmpty()) {
|
||||
LOG.info(
|
||||
"Saving serverStats to job data map: {} servers with data: {}",
|
||||
distributedJob.getServerStats().size(),
|
||||
distributedJob.getServerStats());
|
||||
successContext.withAdditionalProperty("serverStats", distributedJob.getServerStats());
|
||||
successContext.withAdditionalProperty(
|
||||
"serverCount", distributedJob.getServerStats().size());
|
||||
successContext.withAdditionalProperty(
|
||||
"distributedJobId", distributedJob.getId().toString());
|
||||
} else {
|
||||
LOG.warn(
|
||||
"No server stats available for distributed job {} - serverStats is {} ",
|
||||
distributedJob.getId(),
|
||||
distributedJob.getServerStats() == null ? "null" : "empty");
|
||||
}
|
||||
|
||||
if (jobData.getStats() != null) {
|
||||
successContext.withAdditionalProperty("stats", jobData.getStats());
|
||||
}
|
||||
|
||||
appRecord.setSuccessContext(successContext);
|
||||
jobExecutionContext
|
||||
.getJobDetail()
|
||||
.getJobDataMap()
|
||||
.put("AppScheduleRun", JsonUtils.pojoToJson(appRecord));
|
||||
|
||||
} catch (Exception e) {
|
||||
LOG.error("Failed to save serverStats to job data map", e);
|
||||
}
|
||||
}
|
||||
|
||||
// ========== Helper Methods ==========
|
||||
|
||||
private void setupEntities() {
|
||||
boolean containsAll = jobData.getEntities().contains(ALL);
|
||||
if (containsAll) {
|
||||
jobData.setEntities(getAll());
|
||||
}
|
||||
}
|
||||
|
||||
private Set<String> getAll() {
|
||||
Set<String> entities =
|
||||
new HashSet<>(
|
||||
Entity.getEntityList().stream()
|
||||
.filter(t -> searchRepository.getEntityIndexMap().containsKey(t))
|
||||
.toList());
|
||||
entities.addAll(
|
||||
TIME_SERIES_ENTITIES.stream()
|
||||
.filter(t -> searchRepository.getEntityIndexMap().containsKey(t))
|
||||
.toList());
|
||||
return entities;
|
||||
}
|
||||
|
||||
public Stats initializeTotalRecords(Set<String> entities) {
|
||||
Stats stats = new Stats();
|
||||
stats.setEntityStats(new org.openmetadata.schema.system.EntityStats());
|
||||
stats.setJobStats(new StepStats());
|
||||
stats.setReaderStats(new StepStats());
|
||||
stats.setProcessStats(new StepStats());
|
||||
stats.setSinkStats(new StepStats());
|
||||
stats.setVectorStats(new StepStats());
|
||||
|
||||
int total = 0;
|
||||
for (String entityType : entities) {
|
||||
int entityTotal = getEntityTotal(entityType);
|
||||
total += entityTotal;
|
||||
|
||||
StepStats entityStats = new StepStats();
|
||||
entityStats.setTotalRecords(entityTotal);
|
||||
entityStats.setSuccessRecords(0);
|
||||
entityStats.setFailedRecords(0);
|
||||
stats.getEntityStats().getAdditionalProperties().put(entityType, entityStats);
|
||||
}
|
||||
|
||||
stats.getJobStats().setTotalRecords(total);
|
||||
stats.getJobStats().setSuccessRecords(0);
|
||||
stats.getJobStats().setFailedRecords(0);
|
||||
|
||||
stats.getReaderStats().setTotalRecords(total);
|
||||
stats.getReaderStats().setSuccessRecords(0);
|
||||
stats.getReaderStats().setFailedRecords(0);
|
||||
|
||||
stats.getProcessStats().setTotalRecords(0);
|
||||
stats.getProcessStats().setSuccessRecords(0);
|
||||
stats.getProcessStats().setFailedRecords(0);
|
||||
|
||||
stats.getSinkStats().setTotalRecords(0);
|
||||
stats.getSinkStats().setSuccessRecords(0);
|
||||
stats.getSinkStats().setFailedRecords(0);
|
||||
|
||||
stats.getVectorStats().setTotalRecords(0);
|
||||
stats.getVectorStats().setSuccessRecords(0);
|
||||
stats.getVectorStats().setFailedRecords(0);
|
||||
|
||||
return stats;
|
||||
}
|
||||
|
||||
private int getEntityTotal(String entityType) {
|
||||
try {
|
||||
String correctedType = "queryCostResult".equals(entityType) ? QUERY_COST_RECORD : entityType;
|
||||
|
||||
if (!TIME_SERIES_ENTITIES.contains(correctedType)) {
|
||||
return Entity.getEntityRepository(correctedType).getDao().listTotalCount();
|
||||
} else {
|
||||
ListFilter listFilter = new ListFilter(null);
|
||||
EntityTimeSeriesRepository<?> repository;
|
||||
|
||||
if (isDataInsightIndex(correctedType)) {
|
||||
listFilter.addQueryParam("entityFQNHash", FullyQualifiedName.buildHash(correctedType));
|
||||
repository = Entity.getEntityTimeSeriesRepository(Entity.ENTITY_REPORT_DATA);
|
||||
} else {
|
||||
repository = Entity.getEntityTimeSeriesRepository(correctedType);
|
||||
}
|
||||
|
||||
return repository.getTimeSeriesDao().listCount(listFilter);
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.debug("Error getting total for '{}'", entityType, e);
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
private boolean isDataInsightIndex(String entityType) {
|
||||
return entityType.endsWith("ReportData");
|
||||
}
|
||||
|
||||
private void updateJobStatus(EventPublisherJob.Status newStatus) {
|
||||
if (stopped
|
||||
&& newStatus != EventPublisherJob.Status.STOP_IN_PROGRESS
|
||||
&& newStatus != EventPublisherJob.Status.STOPPED) {
|
||||
return;
|
||||
}
|
||||
jobData.setStatus(newStatus);
|
||||
}
|
||||
|
||||
private void updateFinalJobStatus() {
|
||||
if (stopped) {
|
||||
updateJobStatus(EventPublisherJob.Status.STOPPED);
|
||||
} else if (hasIncompleteProcessing()) {
|
||||
updateJobStatus(EventPublisherJob.Status.ACTIVE_ERROR);
|
||||
} else {
|
||||
updateJobStatus(EventPublisherJob.Status.COMPLETED);
|
||||
}
|
||||
}
|
||||
|
||||
private boolean hasIncompleteProcessing() {
|
||||
if (jobData == null || jobData.getStats() == null || jobData.getStats().getJobStats() == null) {
|
||||
return false;
|
||||
}
|
||||
|
||||
StepStats jobStats = jobData.getStats().getJobStats();
|
||||
long failed = jobStats.getFailedRecords() != null ? jobStats.getFailedRecords() : 0;
|
||||
long processed = jobStats.getSuccessRecords() != null ? jobStats.getSuccessRecords() : 0;
|
||||
long total = jobStats.getTotalRecords() != null ? jobStats.getTotalRecords() : 0;
|
||||
|
||||
return failed > 0 || (total > 0 && processed < total);
|
||||
}
|
||||
|
||||
private void finalizeAllEntityReindex(boolean finalSuccess) {
|
||||
if (recreateIndexHandler == null || recreateContext == null) {
|
||||
return;
|
||||
}
|
||||
|
||||
// Get already-promoted entities from distributed executor (if running in distributed mode)
|
||||
Set<String> promotedEntities = Collections.emptySet();
|
||||
if (distributedExecutor != null && distributedExecutor.getEntityTracker() != null) {
|
||||
promotedEntities = distributedExecutor.getEntityTracker().getPromotedEntities();
|
||||
}
|
||||
|
||||
// Calculate entities that still need finalization
|
||||
Set<String> entitiesToFinalize = new HashSet<>(recreateContext.getEntities());
|
||||
entitiesToFinalize.removeAll(promotedEntities);
|
||||
|
||||
// Vector index is a pseudo-entity with no partitions or batch tracking — handle separately
|
||||
boolean hasVectorIndex = entitiesToFinalize.remove(VectorIndexService.VECTOR_INDEX_KEY);
|
||||
|
||||
try {
|
||||
if (!entitiesToFinalize.isEmpty()) {
|
||||
LOG.info(
|
||||
"Finalizing {} remaining entities (already promoted: {})",
|
||||
entitiesToFinalize.size(),
|
||||
promotedEntities.size());
|
||||
|
||||
for (String entityType : entitiesToFinalize) {
|
||||
try {
|
||||
finalizeEntityReindex(entityType, finalSuccess);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to finalize reindex for entity: {}", entityType, ex);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (hasVectorIndex) {
|
||||
finalizeVectorIndex(finalSuccess);
|
||||
}
|
||||
} finally {
|
||||
recreateContext = null;
|
||||
}
|
||||
}
|
||||
|
||||
private void finalizeVectorIndex(boolean finalSuccess) {
|
||||
// Vector index data is written as a side-effect of processing real entities.
|
||||
// Promote when the job ran to completion (even with some errors) since partial
|
||||
// vector data is better than an orphaned rebuild index. Only discard on
|
||||
// FAILED (job crashed) or STOPPED (user cancelled).
|
||||
boolean vectorSuccess =
|
||||
finalSuccess
|
||||
|| (jobData != null && jobData.getStatus() == EventPublisherJob.Status.ACTIVE_ERROR);
|
||||
|
||||
try {
|
||||
finalizeEntityReindex(VectorIndexService.VECTOR_INDEX_KEY, vectorSuccess);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to finalize vector index", ex);
|
||||
}
|
||||
}
|
||||
|
||||
private void finalizeEntityReindex(String entityType, boolean success) {
|
||||
if (recreateIndexHandler == null || recreateContext == null) {
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
var entityReindexContext =
|
||||
org.openmetadata.service.search.EntityReindexContext.builder()
|
||||
.entityType(entityType)
|
||||
.originalIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.canonicalIndex(recreateContext.getCanonicalIndex(entityType).orElse(null))
|
||||
.activeIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.stagedIndex(recreateContext.getStagedIndex(entityType).orElse(null))
|
||||
.canonicalAliases(recreateContext.getCanonicalAlias(entityType).orElse(null))
|
||||
.existingAliases(recreateContext.getExistingAliases(entityType))
|
||||
.parentAliases(
|
||||
new HashSet<>(listOrEmpty(recreateContext.getParentAliases(entityType))))
|
||||
.build();
|
||||
|
||||
recreateIndexHandler.finalizeReindex(entityReindexContext, success);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to finalize index recreation flow", ex);
|
||||
}
|
||||
}
|
||||
|
||||
private void handleExecutionException(Exception ex) {
|
||||
BulkSink sink = searchIndexSink;
|
||||
if (sink != null) {
|
||||
searchIndexSink = null;
|
||||
try {
|
||||
sink.close();
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error closing search index sink", e);
|
||||
}
|
||||
}
|
||||
|
||||
if (executor != null && jobData != null) {
|
||||
try {
|
||||
Stats executorStats = executor.getStats().get();
|
||||
if (executorStats != null) {
|
||||
jobData.setStats(executorStats);
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.debug("Could not capture executor stats during exception handling", e);
|
||||
}
|
||||
}
|
||||
|
||||
if (stopped) {
|
||||
if (jobData != null) {
|
||||
jobData.setStatus(EventPublisherJob.Status.STOPPED);
|
||||
}
|
||||
} else {
|
||||
IndexingError error =
|
||||
new IndexingError()
|
||||
.withErrorSource(IndexingError.ErrorSource.JOB)
|
||||
.withMessage("Reindexing Job Exception: " + ex.getMessage());
|
||||
LOG.error("Reindexing Job Failed", ex);
|
||||
|
||||
if (jobData != null) {
|
||||
jobData.setStatus(EventPublisherJob.Status.FAILED);
|
||||
jobData.setFailure(error);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private void finalizeJobExecution(JobExecutionContext jobExecutionContext) {
|
||||
sendUpdates(jobExecutionContext, true);
|
||||
|
||||
if (stopped && jobExecutionContext != null) {
|
||||
AppRunRecord appRecord = getJobRecord(jobExecutionContext);
|
||||
appRecord.setStatus(AppRunRecord.Status.STOPPED);
|
||||
jobExecutionContext
|
||||
.getJobDetail()
|
||||
.getJobDataMap()
|
||||
.put(APP_SCHEDULE_RUN, JsonUtils.pojoToJson(appRecord));
|
||||
}
|
||||
}
|
||||
|
||||
private void sendUpdates(JobExecutionContext jobExecutionContext, boolean force) {
|
||||
try {
|
||||
updateRecordToDbAndNotify(jobExecutionContext);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to send updates", ex);
|
||||
}
|
||||
}
|
||||
|
||||
public void updateRecordToDbAndNotify(JobExecutionContext jobExecutionContext) {
|
||||
AppRunRecord appRecord = getJobRecord(jobExecutionContext);
|
||||
appRecord.setStatus(AppRunRecord.Status.fromValue(jobData.getStatus().value()));
|
||||
|
||||
if (jobData.getFailure() != null) {
|
||||
appRecord.setFailureContext(
|
||||
new FailureContext().withAdditionalProperty("failure", jobData.getFailure()));
|
||||
}
|
||||
|
||||
if (jobData.getStats() != null) {
|
||||
SuccessContext successContext =
|
||||
new SuccessContext().withAdditionalProperty("stats", jobData.getStats());
|
||||
|
||||
SearchIndexJob distributedJob =
|
||||
distributedExecutor != null ? distributedExecutor.getJobWithFreshStats() : null;
|
||||
|
||||
try {
|
||||
String jobIdStr =
|
||||
distributedJob != null
|
||||
? distributedJob.getId().toString()
|
||||
: getApp().getId().toString();
|
||||
int failureCount = collectionDAO.searchIndexFailureDAO().countByJobId(jobIdStr);
|
||||
if (failureCount > 0) {
|
||||
successContext.withAdditionalProperty("failureRecordCount", failureCount);
|
||||
}
|
||||
} catch (Exception e) {
|
||||
LOG.debug("Could not get failure count", e);
|
||||
}
|
||||
|
||||
if (distributedJob != null && distributedJob.getServerStats() != null) {
|
||||
successContext.withAdditionalProperty("serverStats", distributedJob.getServerStats());
|
||||
successContext.withAdditionalProperty(
|
||||
"serverCount", distributedJob.getServerStats().size());
|
||||
successContext.withAdditionalProperty(
|
||||
"distributedJobId", distributedJob.getId().toString());
|
||||
}
|
||||
|
||||
appRecord.setSuccessContext(successContext);
|
||||
}
|
||||
|
||||
if (WebSocketManager.getInstance() != null) {
|
||||
String messageJson = JsonUtils.pojoToJson(appRecord);
|
||||
WebSocketManager.getInstance()
|
||||
.broadCastMessageToAll(SEARCH_INDEX_JOB_BROADCAST_CHANNEL, messageJson);
|
||||
}
|
||||
public void execute(JobExecutionContext ctx) {
|
||||
OrchestratorContext orchCtx =
|
||||
new QuartzOrchestratorContext(
|
||||
ctx, getApp(), this::getJobRecord, this::pushAppStatusUpdates);
|
||||
ReindexingOrchestrator orch =
|
||||
new ReindexingOrchestrator(collectionDAO, searchRepository, orchCtx);
|
||||
this.orchestrator = orch;
|
||||
orch.run(jobData);
|
||||
this.jobData = orch.getJobData();
|
||||
}
|
||||
|
||||
@Override
|
||||
public void stop() {
|
||||
LOG.info("Reindexing job is being stopped.");
|
||||
stopped = true;
|
||||
|
||||
if (executor != null) {
|
||||
executor.stop();
|
||||
ReindexingOrchestrator orch = this.orchestrator;
|
||||
if (orch != null) {
|
||||
orch.stop();
|
||||
this.jobData = orch.getJobData();
|
||||
}
|
||||
|
||||
if (distributedExecutor != null) {
|
||||
try {
|
||||
distributedExecutor.stop();
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error stopping distributed executor", e);
|
||||
}
|
||||
}
|
||||
|
||||
if (jobData != null) {
|
||||
jobData.setStatus(EventPublisherJob.Status.STOPPED);
|
||||
}
|
||||
|
||||
if (jobExecutionContext != null) {
|
||||
AppRunRecord appRecord = getJobRecord(jobExecutionContext);
|
||||
appRecord.setStatus(AppRunRecord.Status.STOPPED);
|
||||
appRecord.setEndTime(System.currentTimeMillis());
|
||||
jobExecutionContext
|
||||
.getJobDetail()
|
||||
.getJobDataMap()
|
||||
.put(APP_SCHEDULE_RUN, JsonUtils.pojoToJson(appRecord));
|
||||
pushAppStatusUpdates(jobExecutionContext, appRecord, true);
|
||||
sendUpdates(jobExecutionContext, true);
|
||||
}
|
||||
|
||||
BulkSink sink = searchIndexSink;
|
||||
if (sink != null) {
|
||||
searchIndexSink = null;
|
||||
try {
|
||||
sink.close();
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error closing search index sink", e);
|
||||
}
|
||||
}
|
||||
|
||||
LOG.info("Reindexing job stopped successfully.");
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
|
|||
File diff suppressed because it is too large
|
|
@@ -0,0 +1,41 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import java.util.Optional;
|
||||
import org.openmetadata.schema.system.Stats;
|
||||
import org.openmetadata.service.jdbi3.CollectionDAO;
|
||||
import org.openmetadata.service.search.SearchRepository;
|
||||
|
||||
public class SingleServerIndexingStrategy implements IndexingStrategy {
|
||||
|
||||
private final SearchIndexExecutor executor;
|
||||
|
||||
public SingleServerIndexingStrategy(
|
||||
CollectionDAO collectionDAO, SearchRepository searchRepository) {
|
||||
this.executor = new SearchIndexExecutor(collectionDAO, searchRepository);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void addListener(ReindexingProgressListener listener) {
|
||||
executor.addListener(listener);
|
||||
}
|
||||
|
||||
@Override
|
||||
public ExecutionResult execute(ReindexingConfiguration config, ReindexingJobContext context) {
|
||||
return executor.execute(config, context);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Optional<Stats> getStats() {
|
||||
return Optional.ofNullable(executor.getStats().get());
|
||||
}
|
||||
|
||||
@Override
|
||||
public void stop() {
|
||||
executor.stop();
|
||||
}
|
||||
|
||||
@Override
|
||||
public boolean isStopped() {
|
||||
return executor.isStopped();
|
||||
}
|
||||
}
|
||||
|
|
@@ -30,11 +30,26 @@ public class StatsReconciler {
|
|||
int sinkFailed = safeGet(sinkStats.getFailedRecords());
|
||||
int sinkWarnings = safeGet(sinkStats.getWarningRecords());
|
||||
|
||||
// Reconcile entity-level totals
|
||||
if (stats.getEntityStats() != null
|
||||
&& stats.getEntityStats().getAdditionalProperties() != null) {
|
||||
int reconciledTotal = 0;
|
||||
for (StepStats es : stats.getEntityStats().getAdditionalProperties().values()) {
|
||||
int actual = safeGet(es.getSuccessRecords()) + safeGet(es.getFailedRecords());
|
||||
if (actual > safeGet(es.getTotalRecords())) {
|
||||
es.setTotalRecords(actual);
|
||||
}
|
||||
reconciledTotal += safeGet(es.getTotalRecords());
|
||||
}
|
||||
if (reconciledTotal > readerTotal) {
|
||||
readerStats.setTotalRecords(reconciledTotal);
|
||||
readerTotal = reconciledTotal;
|
||||
}
|
||||
}
|
||||
|
||||
int jobSuccess = sinkSuccess;
|
||||
int jobFailed = readerFailed + sinkFailed;
|
||||
int jobTotal = readerTotal;
|
||||
// Warnings are informational - use reader warnings as the primary source
|
||||
// (entities with stale references that were still indexed)
|
||||
int jobWarnings = readerWarnings;
|
||||
|
||||
jobStats.setTotalRecords(jobTotal);
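The entity-level reconciliation added in this hunk can be sketched in isolation. The idea shown is that a total estimated up front may end up smaller than success + failed, so each entity total is raised to that sum and the reader total is raised to the sum of entity totals. `Counts` and `EntityTotalReconciler` are hypothetical stand-ins for `StepStats` and `StatsReconciler`, not code from the commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for StepStats: only the three counters used here.
final class Counts {
  int total;
  int success;
  int failed;

  Counts(int total, int success, int failed) {
    this.total = total;
    this.success = success;
    this.failed = failed;
  }
}

// Hypothetical helper mirroring the reconciliation rule in the hunk above.
final class EntityTotalReconciler {
  /**
   * Raises each entity total to at least success + failed, then returns the larger of
   * readerTotal and the sum of the (possibly raised) entity totals.
   */
  static int reconcile(Map<String, Counts> perEntity, int readerTotal) {
    int reconciledTotal = 0;
    for (Counts c : perEntity.values()) {
      int actual = c.success + c.failed;
      if (actual > c.total) {
        c.total = actual;
      }
      reconciledTotal += c.total;
    }
    return Math.max(readerTotal, reconciledTotal);
  }

  public static void main(String[] args) {
    Map<String, Counts> perEntity = new LinkedHashMap<>();
    perEntity.put("table", new Counts(100, 98, 5)); // 103 processed, more than the 100 estimated
    perEntity.put("topic", new Counts(50, 40, 0)); // 40 processed so far, total is kept at 50
    System.out.println(reconcile(perEntity, 150)); // prints 153
  }
}
```

Totals only ever grow under this rule, so a job that is still running never has its estimate lowered.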
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,12 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
public record VectorCompletionResult(boolean completed, int pendingTaskCount, long waitedMillis) {
|
||||
|
||||
public static VectorCompletionResult success(long waitedMillis) {
|
||||
return new VectorCompletionResult(true, 0, waitedMillis);
|
||||
}
|
||||
|
||||
public static VectorCompletionResult timeout(int pendingCount, long waitedMillis) {
|
||||
return new VectorCompletionResult(false, pendingCount, waitedMillis);
|
||||
}
|
||||
}
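A minimal usage sketch for the `VectorCompletionResult` record above, assuming a caller that waits for pending embedding tasks and then reports the outcome. The record is copied from the diff with the package declaration and `public` modifiers dropped so the sketch is self-contained. `VectorCompletionDemo` and `describe` are hypothetical names, not part of the commit.

```java
// Copied from the diff above (package declaration and public modifiers omitted).
record VectorCompletionResult(boolean completed, int pendingTaskCount, long waitedMillis) {
  static VectorCompletionResult success(long waitedMillis) {
    return new VectorCompletionResult(true, 0, waitedMillis);
  }

  static VectorCompletionResult timeout(int pendingCount, long waitedMillis) {
    return new VectorCompletionResult(false, pendingCount, waitedMillis);
  }
}

// Hypothetical caller, for illustration only.
class VectorCompletionDemo {
  static String describe(VectorCompletionResult result) {
    if (result.completed()) {
      return "vector tasks finished after " + result.waitedMillis() + " ms";
    }
    return result.pendingTaskCount()
        + " vector tasks still pending after "
        + result.waitedMillis()
        + " ms";
  }

  public static void main(String[] args) {
    System.out.println(describe(VectorCompletionResult.success(120)));
    System.out.println(describe(VectorCompletionResult.timeout(3, 300_000)));
  }
}
```

The two factory methods keep the invariant that a completed result always carries a pending count of zero.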
|
||||
|
|
@@ -269,9 +269,13 @@ public class DistributedJobParticipant implements Managed {
 // Set up failure callback on bulk sink to record sink failures
 final IndexingFailureRecorder recorder = failureRecorder;
 bulkSink.setFailureCallback(
-(entityType, entityId, entityFqn, errorMessage) -> {
+(entityType, entityId, entityFqn, errorMessage, stage) -> {
 if (recorder != null) {
-recorder.recordSinkFailure(entityType, entityId, entityFqn, errorMessage);
+if (stage == IndexingFailureRecorder.FailureStage.PROCESS) {
+recorder.recordProcessFailure(entityType, entityId, entityFqn, errorMessage);
+} else {
+recorder.recordSinkFailure(entityType, entityId, entityFqn, errorMessage);
+}
 }
 });
|
||||
|
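The callback change in the `DistributedJobParticipant` hunk adds a fifth `stage` argument and routes process-stage failures to a separate recorder method. A minimal sketch of that routing with simplified types follows. `FailureStage.SINK`, `FailureCallback`, and `RecordingFailureRecorder` are hypothetical; only `FailureStage.PROCESS` and the two recorder method names appear in the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stage enum; only PROCESS is confirmed by the diff.
enum FailureStage {
  PROCESS,
  SINK
}

// Hypothetical five-argument callback shape matching the new lambda in the diff.
@FunctionalInterface
interface FailureCallback {
  void onFailure(
      String entityType, String entityId, String entityFqn, String errorMessage, FailureStage stage);
}

// Hypothetical in-memory recorder used only to make the routing observable.
final class RecordingFailureRecorder {
  final List<String> processFailures = new ArrayList<>();
  final List<String> sinkFailures = new ArrayList<>();

  void recordProcessFailure(String entityType, String entityId, String entityFqn, String error) {
    processFailures.add(entityType + ":" + entityId);
  }

  void recordSinkFailure(String entityType, String entityId, String entityFqn, String error) {
    sinkFailures.add(entityType + ":" + entityId);
  }

  // Same branching as the diff: PROCESS goes to the process recorder, anything else to the sink.
  FailureCallback asCallback() {
    return (entityType, entityId, entityFqn, errorMessage, stage) -> {
      if (stage == FailureStage.PROCESS) {
        recordProcessFailure(entityType, entityId, entityFqn, errorMessage);
      } else {
        recordSinkFailure(entityType, entityId, entityFqn, errorMessage);
      }
    };
  }
}
```

Treating every non-PROCESS stage as a sink failure keeps the previous behaviour as the default branch.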
|
|
|||
|
|
@@ -20,10 +20,12 @@ import java.util.Optional;
|
|||
import java.util.Set;
|
||||
import java.util.UUID;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
import java.util.concurrent.atomic.AtomicLong;
|
||||
import java.util.stream.Collectors;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.openmetadata.schema.system.EventPublisherJob;
|
||||
import org.openmetadata.schema.utils.JsonUtils;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.ReindexingConfiguration;
|
||||
import org.openmetadata.service.jdbi3.CollectionDAO;
|
||||
import org.openmetadata.service.jdbi3.CollectionDAO.SearchIndexJobDAO;
|
||||
import org.openmetadata.service.jdbi3.CollectionDAO.SearchIndexJobDAO.SearchIndexJobRecord;
|
||||
|
|
@@ -85,6 +87,9 @@ public class DistributedSearchIndexCoordinator {
|
|||
private final String serverId;
|
||||
private EntityCompletionTracker entityTracker;
|
||||
|
||||
/** Monotonic counter to guarantee unique claimedAt values across concurrent worker threads. */
|
||||
private final AtomicLong claimCounter = new AtomicLong(0);
|
||||
|
||||
public DistributedSearchIndexCoordinator(CollectionDAO collectionDAO) {
|
||||
this.collectionDAO = collectionDAO;
|
||||
this.partitionCalculator = new PartitionCalculator();
|
||||
|
|
@@ -121,12 +126,20 @@ public class DistributedSearchIndexCoordinator {
 */
 public SearchIndexJob createJob(
 Set<String> entities, EventPublisherJob jobConfiguration, String createdBy) {
+return createJob(entities, jobConfiguration, createdBy, null);
+}
+
+public SearchIndexJob createJob(
+Set<String> entities,
+EventPublisherJob jobConfiguration,
+String createdBy,
+ReindexingConfiguration reindexConfig) {

 UUID jobId = UUID.randomUUID();
 long now = System.currentTimeMillis();

-// Calculate entity statistics
-Map<String, Long> entityCounts = partitionCalculator.getEntityCounts(entities);
+// Calculate entity statistics (with time-series date filtering if config is provided)
+Map<String, Long> entityCounts = partitionCalculator.getEntityCounts(entities, reindexConfig);
 long totalRecords = entityCounts.values().stream().mapToLong(Long::longValue).sum();

 // Build entity stats map
|
|
@@ -183,6 +196,10 @@ public class DistributedSearchIndexCoordinator {
|
|||
* @return Updated job with partition information
|
||||
*/
|
||||
public SearchIndexJob initializePartitions(UUID jobId) {
|
||||
return initializePartitions(jobId, null);
|
||||
}
|
||||
|
||||
public SearchIndexJob initializePartitions(UUID jobId, ReindexingConfiguration reindexConfig) {
|
||||
SearchIndexJobDAO jobDAO = collectionDAO.searchIndexJobDAO();
|
||||
SearchIndexPartitionDAO partitionDAO = collectionDAO.searchIndexPartitionDAO();
|
||||
|
||||
|
|
@@ -196,9 +213,9 @@ public class DistributedSearchIndexCoordinator {
 // Get entity types from job configuration
 Set<String> entityTypes = Set.copyOf(job.getJobConfiguration().getEntities());

-// Calculate partitions
+// Calculate partitions (with date filtering for time series if config provided)
 List<SearchIndexPartition> partitions =
-partitionCalculator.calculatePartitions(jobId, entityTypes);
+partitionCalculator.calculatePartitions(jobId, entityTypes, reindexConfig);

 if (partitions.isEmpty()) {
 LOG.warn(
|
|
@@ -270,10 +287,22 @@ public class DistributedSearchIndexCoordinator {
|
|||
}
|
||||
}
|
||||
|
||||
// Reconcile totalRecords from actual partitions (accounts for time-series filtering)
|
||||
long actualTotalRecords =
|
||||
partitions.stream().mapToLong(SearchIndexPartition::getEstimatedCount).sum();
|
||||
if (actualTotalRecords != job.getTotalRecords()) {
|
||||
LOG.info(
|
||||
"Reconciled totalRecords for job {}: {} → {} (after partition calculation)",
|
||||
jobId,
|
||||
job.getTotalRecords(),
|
||||
actualTotalRecords);
|
||||
}
|
||||
|
||||
// Update job status
|
||||
SearchIndexJob updatedJob =
|
||||
job.toBuilder()
|
||||
.status(IndexJobStatus.READY)
|
||||
.totalRecords(actualTotalRecords)
|
||||
.entityStats(updatedStats)
|
||||
.updatedAt(System.currentTimeMillis())
|
||||
.build();
|
||||
|
|
@@ -313,7 +342,9 @@ public class DistributedSearchIndexCoordinator {
 return Optional.empty();
 }

-long claimTime = System.currentTimeMillis();
+// Ensure unique claimTime per call so concurrent claims on the same server are distinguishable.
+// The counter suffix keeps values within normal epoch-millis range while preventing collisions.
+long claimTime = uniqueClaimTime();

 // Atomically claim a partition - FOR UPDATE SKIP LOCKED ensures no race condition
 int claimed = partitionDAO.claimNextPartitionAtomic(jobId.toString(), serverId, claimTime);
|
|
@@ -322,9 +353,9 @@ public class DistributedSearchIndexCoordinator {
 return Optional.empty();
 }

-// Fetch the partition we just claimed
+// Fetch the partition we just claimed using the unique claimTime
 SearchIndexPartitionRecord record =
-partitionDAO.findLatestClaimedPartition(jobId.toString(), serverId);
+partitionDAO.findLatestClaimedPartition(jobId.toString(), serverId, claimTime);
 if (record == null) {
 LOG.warn("Claimed partition but couldn't find it - this shouldn't happen");
 return Optional.empty();
|
|
@@ -343,6 +374,18 @@ public class DistributedSearchIndexCoordinator {
|
|||
return Optional.of(partition);
|
||||
}
|
||||
|
||||
/**
|
||||
* Generates a unique claimedAt timestamp that stays close to real wall-clock time but never
|
||||
* repeats, even when called concurrently from multiple worker threads. The counter suffix is
|
||||
* added in the sub-millisecond range so stale-detection logic (which compares against
|
||||
* System.currentTimeMillis()) continues to work correctly.
|
||||
*/
|
||||
private long uniqueClaimTime() {
|
||||
long millis = System.currentTimeMillis();
|
||||
long seq = claimCounter.incrementAndGet() % 1000;
|
||||
return millis + seq;
|
||||
}
|
||||
|
||||
/**
|
||||
* Update partition progress.
|
||||
*
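A note on `uniqueClaimTime()` in the hunk above: it adds a counter value in the range 0..999 to the millisecond clock, so the result can run up to 999 ms ahead of wall-clock time, and two calls can in principle still produce the same sum (for example 1000 + 2 and 1001 + 1). A minimal sketch of a strictly increasing alternative follows, assuming uniqueness only needs to hold within one JVM. `MonotonicClaimClock` and `next` are hypothetical names; this is not what the commit does.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper: hands out strictly increasing epoch-millis-like values.
final class MonotonicClaimClock {
  private final AtomicLong last = new AtomicLong(0);
  private final LongSupplier clock;

  // The clock is injectable so the behaviour can be checked with a fixed time source.
  MonotonicClaimClock(LongSupplier clock) {
    this.clock = clock;
  }

  MonotonicClaimClock() {
    this(System::currentTimeMillis);
  }

  /** Returns the current clock value, or the previous value + 1 if the clock has not advanced. */
  long next() {
    return last.updateAndGet(prev -> Math.max(prev + 1, clock.getAsLong()));
  }
}
```

With a clock frozen at 1000, three calls to `next()` return 1000, 1001, and 1002. Under a burst of claims the values run ahead of the wall clock by one millisecond per extra call in the same millisecond, so any stale-detection threshold would need to tolerate that drift.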
|
||||
|
|
|
|||
|
|
@@ -139,7 +139,7 @@ public class DistributedSearchIndexExecutor {

 public DistributedSearchIndexExecutor(CollectionDAO collectionDAO, int partitionSize) {
 this.collectionDAO = collectionDAO;
-PartitionCalculator calculator = new PartitionCalculator(partitionSize);
+PartitionCalculator calculator = new PartitionCalculator(partitionSize, MAX_WORKER_THREADS);
 this.coordinator = new DistributedSearchIndexCoordinator(collectionDAO, calculator);
 this.recoveryManager = new JobRecoveryManager(collectionDAO, partitionSize);
 this.serverId = ServerIdentityResolver.getInstance().getServerId();
|
|
@@ -207,7 +207,10 @@ public class DistributedSearchIndexExecutor {
 * @return The created job
 */
 public SearchIndexJob createJob(
-Set<String> entities, EventPublisherJob jobConfiguration, String createdBy) {
+Set<String> entities,
+EventPublisherJob jobConfiguration,
+String createdBy,
+ReindexingConfiguration reindexConfig) {

 LOG.info("Creating distributed indexing job for {} entity types", entities.size());
|
||||
|
|
@@ -240,11 +243,12 @@ public class DistributedSearchIndexExecutor {
 }

 try {
-// Create the job
-SearchIndexJob job = coordinator.createJob(entities, jobConfiguration, createdBy);
+// Create the job (pass reindexConfig so time-series date filtering is applied to totals)
+SearchIndexJob job =
+coordinator.createJob(entities, jobConfiguration, createdBy, reindexConfig);

-// Initialize partitions
-currentJob = coordinator.initializePartitions(job.getId());
+// Initialize partitions (with date filtering for time series entities)
+currentJob = coordinator.initializePartitions(job.getId(), reindexConfig);

 // Atomically transfer lock to real job ID
 boolean transferred = coordinator.transferReindexLock(tempJobId, currentJob.getId());
|
|
@@ -306,7 +310,10 @@ public class DistributedSearchIndexExecutor {
 * @return Execution result with statistics
 */
 public ExecutionResult execute(
-BulkSink bulkSink, ReindexContext recreateContext, boolean recreateIndex) {
+BulkSink bulkSink,
+ReindexContext recreateContext,
+boolean recreateIndex,
+ReindexingConfiguration reindexConfig) {

 if (currentJob == null) {
 throw new IllegalStateException("No job to execute - call createJob() or joinJob() first");
|
|
@@ -342,11 +349,9 @@ public class DistributedSearchIndexExecutor {
 // Notify listeners that job has started
 listeners.onJobStarted(jobContext);

-// Notify listeners with configuration
-if (currentJob.getJobConfiguration() != null) {
-ReindexingConfiguration config =
-ReindexingConfiguration.from(currentJob.getJobConfiguration());
-listeners.onJobConfigured(jobContext, config);
+// Notify listeners with auto-tuned configuration
+if (reindexConfig != null) {
+listeners.onJobConfigured(jobContext, reindexConfig);
 }

 // Create stats aggregator with app context for proper WebSocket matching
|
|
@@ -373,9 +378,13 @@ public class DistributedSearchIndexExecutor {

 // Set up failure callback on the sink to record sink failures
 bulkSink.setFailureCallback(
-(entityType, entityId, entityFqn, errorMessage) -> {
+(entityType, entityId, entityFqn, errorMessage, stage) -> {
 if (failureRecorder != null) {
-failureRecorder.recordSinkFailure(entityType, entityId, entityFqn, errorMessage);
+if (stage == IndexingFailureRecorder.FailureStage.PROCESS) {
+failureRecorder.recordProcessFailure(entityType, entityId, entityFqn, errorMessage);
+} else {
+failureRecorder.recordSinkFailure(entityType, entityId, entityFqn, errorMessage);
+}
 }
 });
|
||||
|
|
@@ -402,13 +411,13 @@ public class DistributedSearchIndexExecutor {
 .name("partition-heartbeat-" + jobId.toString().substring(0, 8))
 .start(() -> runPartitionHeartbeatLoop());

-// Calculate worker threads based on configuration
-int numWorkers =
-Math.min(
-currentJob.getJobConfiguration().getConsumerThreads() != null
-? currentJob.getJobConfiguration().getConsumerThreads()
-: 4,
-MAX_WORKER_THREADS);
+// Calculate worker threads from auto-tuned configuration
+int numWorkers = Math.min(Math.max(1, reindexConfig.consumerThreads()), MAX_WORKER_THREADS);
+LOG.info(
+"Distributed executor using {} workers, batch size {} (autoTune={})",
+numWorkers,
+reindexConfig.batchSize(),
+reindexConfig.autoTune());

 workerExecutor =
 Executors.newFixedThreadPool(
|
|
@@ -419,10 +428,7 @@ public class DistributedSearchIndexExecutor {
 CountDownLatch workerLatch = new CountDownLatch(numWorkers);

 // Start worker threads that continuously claim and process partitions
-int batchSize =
-currentJob.getJobConfiguration().getBatchSize() != null
-? currentJob.getJobConfiguration().getBatchSize()
-: 500;
+int batchSize = reindexConfig.batchSize();

 for (int i = 0; i < numWorkers; i++) {
 final int workerId = i;
|
|
@@ -436,7 +442,8 @@ public class DistributedSearchIndexExecutor {
                       recreateContext,
                       recreateIndex,
                       totalSuccess,
-                      totalFailed);
+                      totalFailed,
+                      reindexConfig);
                 } finally {
                   workerLatch.countDown();
                 }
@ -458,6 +465,18 @@ public class DistributedSearchIndexExecutor {
|
|||
// entity types have 0 records), so no partition completion ever triggers the check.
|
||||
coordinator.checkAndUpdateJobCompletion(jobId);
|
||||
|
||||
// Final reconciliation pass: catch ALL participant-server completions before
|
||||
// the stale-reclaimer is killed. Participant workers may have finished partitions
|
||||
// that were never reconciled by the stale-reclaimer's periodic loop.
|
||||
if (entityTracker != null && recreateContext != null) {
|
||||
LOG.info("Running final DB reconciliation for job {}", jobId);
|
||||
List<SearchIndexPartition> allPartitions = coordinator.getPartitions(jobId, null);
|
||||
entityTracker.reconcileFromDatabase(allPartitions);
|
||||
LOG.info(
|
||||
"Final reconciliation complete - promoted entities: {}",
|
||||
entityTracker.getPromotedEntities());
|
||||
}
|
||||
|
||||
} catch (InterruptedException e) {
|
||||
Thread.currentThread().interrupt();
|
||||
LOG.warn("Execution interrupted for job {}", jobId);
|
||||
|
|
@@ -538,13 +557,20 @@ public class DistributedSearchIndexExecutor {
       ReindexContext recreateContext,
       boolean recreateIndex,
       AtomicLong totalSuccess,
-      AtomicLong totalFailed) {
+      AtomicLong totalFailed,
+      ReindexingConfiguration reindexConfig) {

     LOG.info("Worker {} starting for job {}", workerId, currentJob.getId());

     PartitionWorker worker =
         new PartitionWorker(
-            coordinator, bulkSink, batchSize, recreateContext, recreateIndex, failureRecorder);
+            coordinator,
+            bulkSink,
+            batchSize,
+            recreateContext,
+            recreateIndex,
+            failureRecorder,
+            reindexConfig);

     synchronized (activeWorkers) {
       activeWorkers.add(worker);
@ -991,12 +1017,22 @@ public class DistributedSearchIndexExecutor {
|
|||
}
|
||||
|
||||
try {
|
||||
String canonicalIndex = recreateContext.getCanonicalIndex(entityType).orElse(null);
|
||||
String originalIndex = recreateContext.getOriginalIndex(entityType).orElse(null);
|
||||
|
||||
LOG.debug(
|
||||
"Promoting entity '{}': success={}, canonicalIndex={}, stagedIndex={}",
|
||||
entityType,
|
||||
success,
|
||||
canonicalIndex,
|
||||
stagedIndexOpt.get());
|
||||
|
||||
EntityReindexContext entityContext =
|
||||
EntityReindexContext.builder()
|
||||
.entityType(entityType)
|
||||
.originalIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.canonicalIndex(recreateContext.getCanonicalIndex(entityType).orElse(null))
|
||||
.activeIndex(recreateContext.getOriginalIndex(entityType).orElse(null))
|
||||
.originalIndex(originalIndex)
|
||||
.canonicalIndex(canonicalIndex)
|
||||
.activeIndex(originalIndex)
|
||||
.stagedIndex(stagedIndexOpt.get())
|
||||
.canonicalAliases(recreateContext.getCanonicalAlias(entityType).orElse(null))
|
||||
.existingAliases(recreateContext.getExistingAliases(entityType))
|
||||
|
|
|
|||
|
|
@ -155,18 +155,33 @@ public class EntityCompletionTracker {
|
|||
continue;
|
||||
}
|
||||
|
||||
boolean allDone =
|
||||
long completedCount =
|
||||
entityPartitions.stream()
|
||||
.allMatch(
|
||||
.filter(
|
||||
p ->
|
||||
p.getStatus() == PartitionStatus.COMPLETED
|
||||
|| p.getStatus() == PartitionStatus.FAILED);
|
||||
|| p.getStatus() == PartitionStatus.FAILED)
|
||||
.count();
|
||||
boolean allDone = completedCount == entityPartitions.size();
|
||||
|
||||
if (!allDone) {
|
||||
Map<PartitionStatus, Long> statusCounts =
|
||||
entityPartitions.stream()
|
||||
.collect(
|
||||
Collectors.groupingBy(SearchIndexPartition::getStatus, Collectors.counting()));
|
||||
LOG.debug(
|
||||
"Reconcile: entity '{}' not all done: {}/{} complete, statuses={}",
|
||||
entityType,
|
||||
completedCount,
|
||||
entityPartitions.size(),
|
||||
statusCounts);
|
||||
}
|
||||
|
||||
if (allDone && !entityPartitions.isEmpty()) {
|
||||
boolean hasFailed =
|
||||
entityPartitions.stream().anyMatch(p -> p.getStatus() == PartitionStatus.FAILED);
|
||||
|
||||
LOG.info(
|
||||
LOG.debug(
|
||||
"DB reconciliation: entity '{}' all {} partitions done (hasFailed={}, job {})",
|
||||
entityType,
|
||||
entityPartitions.size(),
|
||||
|
|
@ -182,16 +197,27 @@ public class EntityCompletionTracker {
|
|||
if (promotedEntities.add(entityType)) {
|
||||
boolean success = !hasFailed;
|
||||
|
||||
LOG.info(
|
||||
"Entity '{}' all partitions complete (success={}, job {})", entityType, success, jobId);
|
||||
LOG.debug(
|
||||
"Entity '{}' all partitions complete (success={}, hasFailed={}, job {})",
|
||||
entityType,
|
||||
success,
|
||||
hasFailed,
|
||||
jobId);
|
||||
|
||||
if (onEntityComplete != null) {
|
||||
try {
|
||||
onEntityComplete.accept(entityType, success);
|
||||
} catch (Exception e) {
|
||||
LOG.error("Error in entity completion callback for '{}' (job {})", entityType, jobId, e);
|
||||
LOG.error(
|
||||
"Error in entity completion callback for '{}' (job {}). "
|
||||
+ "Entity is STILL in promotedEntities - will be SKIPPED by finalization!",
|
||||
entityType,
|
||||
jobId,
|
||||
e);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
LOG.debug("Entity '{}' already in promotedEntities, skipping (job {})", entityType, jobId);
|
||||
}
|
||||
}
|
||||
|
||||
|
|
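Review note: the two `EntityCompletionTracker` hunks above replace the `allMatch` check with an explicit count of terminal partitions, log a per-status breakdown when an entity is not yet done, and rely on `promotedEntities.add(...)` returning `false` to skip a second promotion. A minimal sketch of those three pieces follows, assuming a stand-in `PartitionStatus` enum with four values (the real enum may have more) and a hypothetical class name.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: a stand-in for the tracker logic, not the OpenMetadata class.
final class CompletionTallySketch {
  enum PartitionStatus { PENDING, PROCESSING, COMPLETED, FAILED }

  /** Counts partitions in a terminal state (COMPLETED or FAILED). */
  static long terminalCount(List<PartitionStatus> statuses) {
    return statuses.stream()
        .filter(s -> s == PartitionStatus.COMPLETED || s == PartitionStatus.FAILED)
        .count();
  }

  /** Per-status breakdown, useful for a debug log when an entity is not done yet. */
  static Map<PartitionStatus, Long> statusCounts(List<PartitionStatus> statuses) {
    return statuses.stream()
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
  }

  /** An entity is done only if it has partitions and all of them are terminal. */
  static boolean allDone(List<PartitionStatus> statuses) {
    return !statuses.isEmpty() && terminalCount(statuses) == statuses.size();
  }

  /** Returns true only for the first caller; Set.add reports whether the element was new. */
  static boolean promoteOnce(Set<String> promoted, String entityType) {
    return promoted.add(entityType);
  }

  public static void main(String[] args) {
    List<PartitionStatus> statuses =
        List.of(PartitionStatus.COMPLETED, PartitionStatus.FAILED, PartitionStatus.PROCESSING);
    System.out.println(terminalCount(statuses) + "/" + statuses.size() + " " + allDone(statuses));

    Set<String> promoted = ConcurrentHashMap.newKeySet();
    System.out.println(promoteOnce(promoted, "table") + " " + promoteOnce(promoted, "table"));
  }
}
```

Because the entity is added to the set before the callback runs, a callback that throws leaves the entity marked as promoted, which is the situation the expanded error message in the hunk warns about.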
|
|||
|
|
@@ -22,6 +22,8 @@ import java.util.UUID;
 import lombok.extern.slf4j.Slf4j;
 import org.openmetadata.schema.type.Include;
 import org.openmetadata.service.Entity;
+import org.openmetadata.service.apps.bundles.searchIndex.EntityPriority;
+import org.openmetadata.service.apps.bundles.searchIndex.ReindexingConfiguration;
 import org.openmetadata.service.jdbi3.EntityRepository;
 import org.openmetadata.service.jdbi3.EntityTimeSeriesRepository;
 import org.openmetadata.service.jdbi3.ListFilter;
@ -79,38 +81,6 @@ public class PartitionCalculator {
|
|||
Map.entry("queryCostRecord", 0.3) // Time series, simple structure
|
||||
);
|
||||
|
||||
/**
|
||||
* Priority ordering for entity types during indexing. Higher priority entities should be indexed
|
||||
* first as they may be referenced by others. This ensures that when indexing tables, their parent
|
||||
* databases and schemas already exist in the search index.
|
||||
*/
|
||||
private static final Map<String, Integer> ENTITY_PRIORITY =
|
||||
Map.ofEntries(
|
||||
Map.entry("databaseService", 100),
|
||||
Map.entry("messagingService", 100),
|
||||
Map.entry("dashboardService", 100),
|
||||
Map.entry("pipelineService", 100),
|
||||
Map.entry("mlmodelService", 100),
|
||||
Map.entry("storageService", 100),
|
||||
Map.entry("database", 90),
|
||||
Map.entry("databaseSchema", 80),
|
||||
Map.entry("glossary", 70),
|
||||
Map.entry("classification", 70),
|
||||
Map.entry("team", 65),
|
||||
Map.entry("user", 60),
|
||||
Map.entry("table", 50),
|
||||
Map.entry("dashboard", 50),
|
||||
Map.entry("pipeline", 50),
|
||||
Map.entry("mlmodel", 50),
|
||||
Map.entry("topic", 50),
|
||||
Map.entry("container", 50),
|
||||
Map.entry("glossaryTerm", 45),
|
||||
Map.entry("tag", 40),
|
||||
Map.entry("testCase", 30),
|
||||
Map.entry("testCaseResult", 20),
|
||||
Map.entry("testCaseResolutionStatus", 20),
|
||||
Map.entry("queryCostRecord", 10));
|
||||
|
||||
/** Time series entity types */
|
||||
private static final Set<String> TIME_SERIES_ENTITIES =
|
||||
Set.of(
|
||||
|
|
@ -124,13 +94,19 @@ public class PartitionCalculator {
|
|||
"aggregatedCostAnalysisReportData");
|
||||
|
||||
private final int partitionSize;
|
||||
private final int minPartitionsPerEntity;
|
||||
|
||||
public PartitionCalculator() {
|
||||
this(DEFAULT_PARTITION_SIZE);
|
||||
this(DEFAULT_PARTITION_SIZE, 1);
|
||||
}
|
||||
|
||||
public PartitionCalculator(int partitionSize) {
|
||||
this(partitionSize, 1);
|
||||
}
|
||||
|
||||
public PartitionCalculator(int partitionSize, int minPartitionsPerEntity) {
|
||||
this.partitionSize = Math.clamp(partitionSize, MIN_PARTITION_SIZE, MAX_PARTITION_SIZE);
|
||||
this.minPartitionsPerEntity = Math.max(1, minPartitionsPerEntity);
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
@ -144,10 +120,16 @@ public class PartitionCalculator {
|
|||
* @throws IllegalStateException if partition count would exceed safe limits
|
||||
*/
|
||||
public List<SearchIndexPartition> calculatePartitions(UUID jobId, Set<String> entityTypes) {
|
||||
return calculatePartitions(jobId, entityTypes, null);
|
||||
}
|
||||
|
||||
public List<SearchIndexPartition> calculatePartitions(
|
||||
UUID jobId, Set<String> entityTypes, ReindexingConfiguration reindexConfig) {
|
||||
List<SearchIndexPartition> partitions = new ArrayList<>();
|
||||
|
||||
for (String entityType : entityTypes) {
|
||||
List<SearchIndexPartition> entityPartitions = calculatePartitionsForEntity(jobId, entityType);
|
||||
List<SearchIndexPartition> entityPartitions =
|
||||
calculatePartitionsForEntity(jobId, entityType, reindexConfig);
|
||||
partitions.addAll(entityPartitions);
|
||||
|
||||
if (partitions.size() > MAX_TOTAL_PARTITIONS) {
|
||||
|
|
@ -174,14 +156,19 @@ public class PartitionCalculator {
|
|||
* @return List of partitions for this entity type
|
||||
*/
|
||||
public List<SearchIndexPartition> calculatePartitionsForEntity(UUID jobId, String entityType) {
|
||||
long totalCount = getEntityCount(entityType);
|
||||
return calculatePartitionsForEntity(jobId, entityType, null);
|
||||
}
|
||||
|
||||
public List<SearchIndexPartition> calculatePartitionsForEntity(
|
||||
UUID jobId, String entityType, ReindexingConfiguration reindexConfig) {
|
||||
long totalCount = getEntityCount(entityType, reindexConfig);
|
||||
if (totalCount == 0) {
|
||||
LOG.debug("No entities found for type: {}", entityType);
|
||||
return List.of();
|
||||
}
|
||||
|
||||
double complexityFactor = ENTITY_COMPLEXITY_FACTORS.getOrDefault(entityType, 1.0);
|
||||
int priority = ENTITY_PRIORITY.getOrDefault(entityType, 50);
|
||||
int priority = EntityPriority.getNumericPriority(entityType);
|
||||
|
||||
// Adjust partition size based on complexity - more complex entities get smaller partitions
|
||||
long adjustedPartitionSizeLong = (long) (partitionSize / complexityFactor);
|
||||
|
|
@@ -191,6 +178,13 @@ public class PartitionCalculator {
     long numPartitionsLong =
         (totalCount + adjustedPartitionSizeLong - 1) / adjustedPartitionSizeLong;

+    // Ensure minimum partitions so all workers stay busy (e.g. testCaseResult with
+    // only 4 partitions leaves 6 of 10 workers idle for minutes)
+    if (numPartitionsLong < minPartitionsPerEntity && totalCount >= minPartitionsPerEntity) {
+      numPartitionsLong = minPartitionsPerEntity;
+      adjustedPartitionSizeLong = (totalCount + numPartitionsLong - 1) / numPartitionsLong;
+    }
+
     // Enforce per-entity-type limit and adjust partition size if needed
     if (numPartitionsLong > MAX_PARTITIONS_PER_ENTITY_TYPE) {
       LOG.warn(
|
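Review note: the minimum-partition adjustment above is two ceiling divisions. The first derives the partition count from the size; if that count is below the minimum (and there are at least as many records as the minimum), the count is raised and the size is re-derived from it. A worked sketch follows, using made-up inputs (40,000 records, a partition size of 10,000, a minimum of 10) and a hypothetical class name.

```java
// Illustrative only: mirrors the arithmetic in the hunk with made-up inputs.
final class MinPartitionsSketch {

  /** Returns {numPartitions, partitionSize} after applying the minimum-partition rule. */
  static long[] plan(long totalCount, long partitionSize, int minPartitions) {
    // Ceiling division: how many partitions of this size cover totalCount.
    long numPartitions = (totalCount + partitionSize - 1) / partitionSize;
    long size = partitionSize;

    // Raise the count only when there are enough records to give each partition at least one.
    if (numPartitions < minPartitions && totalCount >= minPartitions) {
      numPartitions = minPartitions;
      size = (totalCount + numPartitions - 1) / numPartitions; // ceiling division again
    }
    return new long[] {numPartitions, size};
  }

  public static void main(String[] args) {
    long[] spread = plan(40_000, 10_000, 10);
    System.out.println(spread[0] + " partitions of " + spread[1]); // 10 partitions of 4000

    long[] tiny = plan(5, 10_000, 10);
    System.out.println(tiny[0] + " partitions of " + tiny[1]); // 1 partitions of 10000
  }
}
```

With these inputs the unadjusted plan would be 4 partitions of 10,000, which matches the "4 partitions, 6 of 10 workers idle" situation described in the code comment.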
@ -253,10 +247,14 @@ public class PartitionCalculator {
|
|||
* @return Total count of entities
|
||||
*/
|
||||
public long getEntityCount(String entityType) {
|
||||
return getEntityCount(entityType, null);
|
||||
}
|
||||
|
||||
public long getEntityCount(String entityType, ReindexingConfiguration reindexConfig) {
|
||||
try {
|
||||
long count;
|
||||
if (TIME_SERIES_ENTITIES.contains(entityType)) {
|
||||
count = getTimeSeriesEntityCount(entityType);
|
||||
count = getTimeSeriesEntityCount(entityType, reindexConfig);
|
||||
} else {
|
||||
count = getRegularEntityCount(entityType);
|
||||
}
|
||||
|
|
@@ -270,10 +268,10 @@ public class PartitionCalculator {

   private long getRegularEntityCount(String entityType) {
     EntityRepository<?> repository = Entity.getEntityRepository(entityType);
-    return repository.getDao().listTotalCount();
+    return repository.getDao().listCount(new ListFilter(Include.ALL));
   }

-  private long getTimeSeriesEntityCount(String entityType) {
+  private long getTimeSeriesEntityCount(String entityType, ReindexingConfiguration reindexConfig) {
     ListFilter listFilter = new ListFilter(Include.ALL);
     EntityTimeSeriesRepository<?> repository;

@ -284,6 +282,21 @@ public class PartitionCalculator {
|
|||
repository = Entity.getEntityTimeSeriesRepository(entityType);
|
||||
}
|
||||
|
||||
if (reindexConfig != null) {
|
||||
long startTs = reindexConfig.getTimeSeriesStartTs(entityType);
|
||||
if (startTs > 0) {
|
||||
long endTs = System.currentTimeMillis();
|
||||
long count = repository.getTimeSeriesDao().listCount(listFilter, startTs, endTs, false);
|
||||
LOG.info(
|
||||
"Time series date filter for {}: last {} days → {} records (was {} total)",
|
||||
entityType,
|
||||
reindexConfig.timeSeriesMaxDays(),
|
||||
count,
|
||||
repository.getTimeSeriesDao().listCount(listFilter));
|
||||
return count;
|
||||
}
|
||||
}
|
||||
|
||||
return repository.getTimeSeriesDao().listCount(listFilter);
|
||||
}
|
||||
|
||||
|
|
@ -298,9 +311,14 @@ public class PartitionCalculator {
|
|||
* @return Map of entity type to count
|
||||
*/
|
||||
public Map<String, Long> getEntityCounts(Set<String> entityTypes) {
|
||||
return getEntityCounts(entityTypes, null);
|
||||
}
|
||||
|
||||
public Map<String, Long> getEntityCounts(
|
||||
Set<String> entityTypes, ReindexingConfiguration reindexConfig) {
|
||||
Map<String, Long> counts = new HashMap<>();
|
||||
for (String entityType : entityTypes) {
|
||||
counts.put(entityType, getEntityCount(entityType));
|
||||
counts.put(entityType, getEntityCount(entityType, reindexConfig));
|
||||
}
|
||||
return counts;
|
||||
}
|
||||
|
|
@@ -312,7 +330,7 @@ public class PartitionCalculator {
    * @return Priority value (higher = processed first)
    */
   public int getEntityPriority(String entityType) {
-    return ENTITY_PRIORITY.getOrDefault(entityType, 50);
+    return EntityPriority.getNumericPriority(entityType);
   }

   /**
|
|||
|
|
@ -28,11 +28,13 @@ import org.apache.commons.lang3.exception.ExceptionUtils;
|
|||
import org.openmetadata.schema.EntityInterface;
|
||||
import org.openmetadata.schema.EntityTimeSeriesInterface;
|
||||
import org.openmetadata.schema.analytics.ReportData;
|
||||
import org.openmetadata.schema.system.EntityError;
|
||||
import org.openmetadata.schema.type.Include;
|
||||
import org.openmetadata.schema.utils.ResultList;
|
||||
import org.openmetadata.service.Entity;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.BulkSink;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.IndexingFailureRecorder;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.ReindexingConfiguration;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.stats.StageStatsTracker;
|
||||
import org.openmetadata.service.exception.SearchIndexException;
|
||||
import org.openmetadata.service.jdbi3.ListFilter;
|
||||
|
|
@ -77,6 +79,12 @@ public class PartitionWorker {
|
|||
/** Progress update interval (every N entities) */
|
||||
private static final int PROGRESS_UPDATE_INTERVAL = 100;
|
||||
|
||||
/** Overall deadline for waiting on sink operations to complete */
|
||||
private static final long SINK_WAIT_DEADLINE_MS = 300_000;
|
||||
|
||||
/** Timeout per flush cycle when retrying sink completion */
|
||||
private static final int FLUSH_CYCLE_SECONDS = 30;
|
||||
|
||||
private final DistributedSearchIndexCoordinator coordinator;
|
||||
private final BulkSink searchIndexSink;
|
||||
private final int batchSize;
|
||||
|
|
@ -84,6 +92,7 @@ public class PartitionWorker {
|
|||
private final boolean recreateIndex;
|
||||
private final AtomicBoolean stopped = new AtomicBoolean(false);
|
||||
private final IndexingFailureRecorder failureRecorder;
|
||||
private final ReindexingConfiguration reindexConfig;
|
||||
|
||||
public PartitionWorker(
|
||||
DistributedSearchIndexCoordinator coordinator,
|
||||
|
|
@ -91,7 +100,7 @@ public class PartitionWorker {
|
|||
int batchSize,
|
||||
ReindexContext recreateContext,
|
||||
boolean recreateIndex) {
|
||||
this(coordinator, searchIndexSink, batchSize, recreateContext, recreateIndex, null);
|
||||
this(coordinator, searchIndexSink, batchSize, recreateContext, recreateIndex, null, null);
|
||||
}
|
||||
|
||||
public PartitionWorker(
|
||||
|
|
@ -101,12 +110,31 @@ public class PartitionWorker {
|
|||
ReindexContext recreateContext,
|
||||
boolean recreateIndex,
|
||||
IndexingFailureRecorder failureRecorder) {
|
||||
this(
|
||||
coordinator,
|
||||
searchIndexSink,
|
||||
batchSize,
|
||||
recreateContext,
|
||||
recreateIndex,
|
||||
failureRecorder,
|
||||
null);
|
||||
}
|
||||
|
||||
public PartitionWorker(
|
||||
DistributedSearchIndexCoordinator coordinator,
|
||||
BulkSink searchIndexSink,
|
||||
int batchSize,
|
||||
ReindexContext recreateContext,
|
||||
boolean recreateIndex,
|
||||
IndexingFailureRecorder failureRecorder,
|
||||
ReindexingConfiguration reindexConfig) {
|
||||
this.coordinator = coordinator;
|
||||
this.searchIndexSink = searchIndexSink;
|
||||
this.batchSize = batchSize;
|
||||
this.recreateContext = recreateContext;
|
||||
this.recreateIndex = recreateIndex;
|
||||
this.failureRecorder = failureRecorder;
|
||||
this.reindexConfig = reindexConfig;
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
@ -154,19 +182,25 @@ public class PartitionWorker {
|
|||
// Initialize keyset cursor for efficient pagination (avoids OFFSET degradation)
|
||||
long cursorInitStart = System.currentTimeMillis();
|
||||
String keysetCursor = initializeKeysetCursor(entityType, rangeStart);
|
||||
LOG.info(
|
||||
"[PERF] initializeKeysetCursor for {} offset={} took {}ms",
|
||||
LOG.debug(
|
||||
"initializeKeysetCursor for {} offset={} took {}ms",
|
||||
entityType,
|
||||
rangeStart,
|
||||
System.currentTimeMillis() - cursorInitStart);
|
||||
|
||||
// Process in batches
|
||||
while (currentOffset < rangeEnd && !stopped.get()) {
|
||||
while (currentOffset < rangeEnd
|
||||
&& !stopped.get()
|
||||
&& !Thread.currentThread().isInterrupted()) {
|
||||
int currentBatchSize = (int) Math.min(batchSize, rangeEnd - currentOffset);
|
||||
|
||||
try {
|
||||
BatchResult batchResult =
|
||||
processBatch(entityType, keysetCursor, currentBatchSize, statsTracker);
|
||||
// Check for stop/interrupt after DB read completes
|
||||
if (stopped.get() || Thread.currentThread().isInterrupted()) {
|
||||
break;
|
||||
}
|
||||
successCount.addAndGet(batchResult.successCount());
|
||||
failedCount.addAndGet(batchResult.failedCount());
|
||||
warningsCount.addAndGet(batchResult.warningsCount());
|
||||
|
|
@ -191,9 +225,14 @@ public class PartitionWorker {
|
|||
keysetCursor = initializeKeysetCursor(entityType, currentOffset);
|
||||
if (keysetCursor == null) {
|
||||
LOG.debug(
|
||||
"No more data at offset {} (rangeEnd: {}), finishing partition early",
|
||||
"{} partition {} data exhausted at offset {} (rangeEnd: {}), "
|
||||
+ "missing {} records. processedCount={}",
|
||||
entityType,
|
||||
partition.getId(),
|
||||
currentOffset,
|
||||
rangeEnd);
|
||||
rangeEnd,
|
||||
rangeEnd - currentOffset,
|
||||
processedCount.get());
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
|
@ -272,11 +311,13 @@ public class PartitionWorker {
|
|||
// the coordinator may aggregate stats before they're written to the database
|
||||
long waitStart = System.currentTimeMillis();
|
||||
waitForSinkOperations(statsTracker);
|
||||
LOG.info("[PERF] waitForSinkOperations took {}ms", System.currentTimeMillis() - waitStart);
|
||||
LOG.debug("waitForSinkOperations took {}ms", System.currentTimeMillis() - waitStart);
|
||||
|
||||
// Mark partition as completed (stats are now in the database)
|
||||
coordinator.completePartition(partition.getId(), successCount.get(), failedCount.get());
|
||||
|
||||
long expectedRecords = rangeEnd - rangeStart;
|
||||
long actualProcessed = successCount.get() + failedCount.get();
|
||||
LOG.info(
|
||||
"Completed partition {} for entity type {} (success: {}, failed: {}, readerFailed: {}, warnings: {})",
|
||||
partition.getId(),
|
||||
|
|
@ -285,6 +326,18 @@ public class PartitionWorker {
|
|||
failedCount.get(),
|
||||
readerFailedCount.get(),
|
||||
warningsCount.get());
|
||||
if (actualProcessed < expectedRecords) {
|
||||
LOG.debug(
|
||||
"{} partition {} processed fewer records than expected: "
|
||||
+ "actual={}, expected={}, gap={}, range=[{},{})",
|
||||
entityType,
|
||||
partition.getId(),
|
||||
actualProcessed,
|
||||
expectedRecords,
|
||||
expectedRecords - actualProcessed,
|
||||
rangeStart,
|
||||
rangeEnd);
|
||||
}
|
||||
|
||||
return new PartitionResult(
|
||||
successCount.get(),
|
||||
|
|
@ -322,7 +375,7 @@ public class PartitionWorker {
|
|||
private void waitForSinkOperations(StageStatsTracker statsTracker) {
|
||||
// Flush the bulk processor to send any pending documents immediately
|
||||
// Without this, documents wait for the periodic flush interval (5 seconds)
|
||||
searchIndexSink.flushAndAwait(30);
|
||||
searchIndexSink.flushAndAwait(FLUSH_CYCLE_SECONDS);
|
||||
|
||||
// Check if there are pending vector tasks - if so, we need a longer timeout
|
||||
int pendingVectorTasks = searchIndexSink.getPendingVectorTaskCount();
|
||||
|
|
@ -334,7 +387,6 @@ public class PartitionWorker {
|
|||
pendingVectorTasks,
|
||||
statsTracker.getEntityType());
|
||||
|
||||
// Wait for vector operations to complete first (up to 120 seconds for vectors)
|
||||
boolean vectorComplete = searchIndexSink.awaitVectorCompletion(120);
|
||||
if (!vectorComplete) {
|
||||
LOG.warn(
|
||||
|
|
@@ -344,15 +396,59 @@ public class PartitionWorker {
       }
     }

-    // Now wait for the stats tracker to have all callbacks accounted for
-    // Use a longer timeout if we had vector tasks since callbacks may be delayed
-    long statsTimeout = hasVectorTasks ? 60000 : 30000;
-    boolean statsComplete = statsTracker.awaitSinkCompletion(statsTimeout);
-    if (!statsComplete) {
+    // Wait for all sink callbacks with retries. The bulk processor is shared across
+    // partition workers, so slow batches from other entity types (e.g. testCaseResult
+    // writes taking 70+ seconds) can delay our callbacks. Instead of a single fixed
+    // timeout, retry flush cycles until all pending operations complete.
+    long deadline = System.currentTimeMillis() + SINK_WAIT_DEADLINE_MS;
+    int retryCount = 0;
+    long previousPending = statsTracker.getPendingSinkOps();
+    int staleRetries = 0;
+
+    while (statsTracker.getPendingSinkOps() > 0 && System.currentTimeMillis() < deadline) {
+      long remainingMs = deadline - System.currentTimeMillis();
+      long waitMs = Math.min(30_000, remainingMs);
+
+      if (statsTracker.awaitSinkCompletion(waitMs)) {
+        break;
+      }
+
+      if (statsTracker.getPendingSinkOps() > 0 && System.currentTimeMillis() < deadline) {
+        retryCount++;
+        long currentPending = statsTracker.getPendingSinkOps();
+        LOG.info(
+            "Retry {} - {} sink operations still pending for entity {}, re-flushing bulk processor",
+            retryCount,
+            currentPending,
+            statsTracker.getEntityType());
+        searchIndexSink.flushAndAwait(FLUSH_CYCLE_SECONDS);
+
+        if (currentPending == previousPending) {
+          staleRetries++;
+          if (staleRetries >= 3) {
+            LOG.warn(
+                "Pending sink ops stuck at {} for entity {} after {} retries with no progress. "
+                    + "Reconciling early (callbacks likely lost).",
+                currentPending,
+                statsTracker.getEntityType(),
+                staleRetries);
+            break;
+          }
+        } else {
+          staleRetries = 0;
+        }
+        previousPending = currentPending;
+      }
+    }
+
+    if (statsTracker.getPendingSinkOps() > 0) {
       LOG.warn(
-          "Timed out waiting for sink stats completion, {} operations still pending for entity {}",
+          "Reconciling {} pending sink operations after {} retries for entity {} "
+              + "(bulk processor was flushed, treating as successful)",
           statsTracker.getPendingSinkOps(),
+          retryCount,
           statsTracker.getEntityType());
+      statsTracker.reconcilePendingSinkOps();
     }

     statsTracker.flush();
|
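Review note: the hunk above replaces a single fixed timeout with a loop bounded by an overall deadline, re-flushing on each cycle and giving up early once the pending count stops changing for three consecutive retries. A minimal sketch of that control flow follows. The names are hypothetical (`SinkWaitSketch`, `waitForDrain`, `Outcome`), and `Thread.sleep` stands in for `awaitSinkCompletion`, whose behaviour is not shown in this diff.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Illustrative only: a stand-in for the retry loop, not the OpenMetadata implementation.
final class SinkWaitSketch {
  enum Outcome { DRAINED, STALLED, DEADLINE, INTERRUPTED }

  static Outcome waitForDrain(
      LongSupplier pending, Runnable flush, long deadlineMs, long cycleMs, int maxStaleRetries) {
    long deadline = System.currentTimeMillis() + deadlineMs;
    long previous = pending.getAsLong();
    int staleRetries = 0;

    while (pending.getAsLong() > 0 && System.currentTimeMillis() < deadline) {
      // Never wait past the overall deadline.
      long waitMs = Math.max(0, Math.min(cycleMs, deadline - System.currentTimeMillis()));
      try {
        Thread.sleep(waitMs); // assumption: stands in for a blocking await on callbacks
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return Outcome.INTERRUPTED;
      }

      long current = pending.getAsLong();
      if (current == 0) {
        return Outcome.DRAINED;
      }
      flush.run(); // push out anything still buffered before the next cycle

      if (current == previous) {
        // No progress since the last check; stop after too many such cycles in a row.
        if (++staleRetries >= maxStaleRetries) {
          return Outcome.STALLED;
        }
      } else {
        staleRetries = 0;
      }
      previous = current;
    }
    return pending.getAsLong() == 0 ? Outcome.DRAINED : Outcome.DEADLINE;
  }

  public static void main(String[] args) {
    AtomicLong pending = new AtomicLong(3);
    // Each flush completes one operation, so the counter drains.
    System.out.println(waitForDrain(pending::get, pending::decrementAndGet, 5_000, 1, 3));
    // The counter never moves, so the loop stops after three stale cycles.
    System.out.println(waitForDrain(() -> 5, () -> {}, 5_000, 1, 3));
  }
}
```

As in the hunk, the baseline is sampled before the loop, so the first cycle without progress already counts as stale. What happens after a stall or deadline is left to the caller; the hunk treats the remaining operations as successful via `reconcilePendingSinkOps()`.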
@ -376,7 +472,7 @@ public class PartitionWorker {
|
|||
long t1 = System.currentTimeMillis();
|
||||
|
||||
if (resultList == null || resultList.getData() == null || resultList.getData().isEmpty()) {
|
||||
LOG.info("[PERF] {} read={}ms returned empty", entityType, t1 - t0);
|
||||
LOG.debug("{} read={}ms returned empty", entityType, t1 - t0);
|
||||
return new BatchResult(0, 0, 0, null);
|
||||
}
|
||||
|
||||
|
|
@ -389,13 +485,22 @@ public class PartitionWorker {
|
|||
statsTracker.recordReaderBatch(readSuccessCount, readErrorCount, warningsCount);
|
||||
}
|
||||
|
||||
if (failureRecorder != null && readErrorCount > 0) {
|
||||
for (EntityError entityError : listOrEmpty(resultList.getErrors())) {
|
||||
String entityId =
|
||||
entityError.getEntity() != null ? entityError.getEntity().toString() : null;
|
||||
failureRecorder.recordReaderEntityFailure(
|
||||
entityType, entityId, null, entityError.getMessage());
|
||||
}
|
||||
}
|
||||
|
||||
Map<String, Object> contextData = createContextData(entityType, statsTracker);
|
||||
|
||||
try {
|
||||
writeToSink(entityType, resultList, contextData);
|
||||
long t2 = System.currentTimeMillis();
|
||||
LOG.info(
|
||||
"[PERF] {} read={}ms write={}ms total={}ms records={}",
|
||||
LOG.debug(
|
||||
"{} read={}ms write={}ms total={}ms records={}",
|
||||
entityType,
|
||||
t1 - t0,
|
||||
t2 - t1,
|
||||
|
|
@ -429,8 +534,20 @@ public class PartitionWorker {
|
|||
PaginatedEntitiesSource source = new PaginatedEntitiesSource(entityType, limit, fields, 0);
|
||||
return source.readNextKeyset(keysetCursor);
|
||||
} else {
|
||||
Long filterStartTs = null;
|
||||
Long filterEndTs = null;
|
||||
if (reindexConfig != null) {
|
||||
long startTs = reindexConfig.getTimeSeriesStartTs(entityType);
|
||||
if (startTs > 0) {
|
||||
filterStartTs = startTs;
|
||||
filterEndTs = System.currentTimeMillis();
|
||||
}
|
||||
}
|
||||
PaginatedEntityTimeSeriesSource source =
|
||||
new PaginatedEntityTimeSeriesSource(entityType, limit, fields, 0);
|
||||
(filterStartTs != null)
|
||||
? new PaginatedEntityTimeSeriesSource(
|
||||
entityType, limit, fields, filterStartTs, filterEndTs)
|
||||
: new PaginatedEntityTimeSeriesSource(entityType, limit, fields, 0);
|
||||
return source.readWithCursor(keysetCursor);
|
||||
}
|
||||
}
|
||||
|
|
@ -442,7 +559,16 @@ public class PartitionWorker {
|
|||
if (!TIME_SERIES_ENTITIES.contains(entityType)) {
|
||||
int cursorOffset = (int) offset - 1;
|
||||
ListFilter filter = new ListFilter(Include.ALL);
|
||||
return Entity.getEntityRepository(entityType).getCursorAtOffset(filter, cursorOffset);
|
||||
String cursor =
|
||||
Entity.getEntityRepository(entityType).getCursorAtOffset(filter, cursorOffset);
|
||||
if (cursor == null) {
|
||||
LOG.debug(
|
||||
"getCursorAtOffset returned null for {} at offset {} (cursorOffset={})",
|
||||
entityType,
|
||||
offset,
|
||||
cursorOffset);
|
||||
}
|
||||
return cursor;
|
||||
} else {
|
||||
return RestUtil.encodeCursor(String.valueOf(offset));
|
||||
}
|
||||
|
|
|
|||
|
|
@ -137,6 +137,38 @@ public class LoggingProgressListener implements ReindexingProgressListener {
|
|||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onReaderFailure(String entityType, String entityId, String error, FailureType type) {
|
||||
if (type == FailureType.ENTITY_NOT_FOUND) {
|
||||
LOG.warn("Reader warning for {} [{}]: {}", entityType, entityId, error);
|
||||
} else {
|
||||
LOG.error("Reader failure for {} [{}] ({}): {}", entityType, entityId, type, error);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onProcessFailure(String entityType, String entityId, String error) {
|
||||
LOG.error("Process failure for {} [{}]: {}", entityType, entityId, error);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onSinkFailure(String entityType, String entityId, String error) {
|
||||
LOG.error("Sink failure for {} [{}]: {}", entityType, entityId, error);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onSubIndexingCompleted(String entityType, String subIndex, StepStats subIndexStats) {
|
||||
long success =
|
||||
subIndexStats.getSuccessRecords() != null ? subIndexStats.getSuccessRecords() : 0;
|
||||
long failed = subIndexStats.getFailedRecords() != null ? subIndexStats.getFailedRecords() : 0;
|
||||
LOG.info(
|
||||
"Sub-indexing completed for {} [{}] - Success: {}, Failed: {}",
|
||||
entityType,
|
||||
subIndex,
|
||||
success,
|
||||
failed);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void onJobStopped(Stats currentStats) {
|
||||
LOG.info("Reindexing job stopped by user request");
|
||||
|
|
|
|||
|
|
@ -6,6 +6,7 @@ import static org.openmetadata.service.socket.WebSocketManager.SEARCH_INDEX_JOB_
|
|||
|
||||
import java.util.Set;
|
||||
import java.util.concurrent.atomic.AtomicInteger;
|
||||
import java.util.function.Function;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.openmetadata.schema.entity.app.App;
|
||||
import org.openmetadata.schema.entity.app.AppRunRecord;
|
||||
|
|
@ -16,6 +17,7 @@ import org.openmetadata.schema.system.IndexingError;
|
|||
import org.openmetadata.schema.system.Stats;
|
||||
import org.openmetadata.schema.system.StepStats;
|
||||
import org.openmetadata.schema.utils.JsonUtils;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.QuartzOrchestratorContext;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.ReindexingConfiguration;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.ReindexingJobContext;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.ReindexingProgressListener;
|
||||
|
|
@ -30,19 +32,29 @@ import org.quartz.JobExecutionContext;
|
|||
public class QuartzProgressListener implements ReindexingProgressListener {
|
||||
|
||||
private static final long WEBSOCKET_UPDATE_INTERVAL_MS = 2000;
|
||||
private static final long DB_UPDATE_INTERVAL_MS = 5000;
|
||||
private static final int ERROR_THRESHOLD = 3;
|
||||
|
||||
private final JobExecutionContext jobExecutionContext;
|
||||
private final EventPublisherJob jobData;
|
||||
private final App app;
|
||||
private final Function<JobExecutionContext, AppRunRecord> jobRecordProvider;
|
||||
private final QuartzOrchestratorContext.StatusPusher statusPusher;
|
||||
private volatile long lastWebSocketUpdate = 0;
|
||||
private volatile long lastDbUpdate = 0;
|
||||
private final AtomicInteger pendingErrors = new AtomicInteger(0);
|
||||
|
||||
public QuartzProgressListener(
|
||||
JobExecutionContext jobExecutionContext, EventPublisherJob jobData, App app) {
|
||||
JobExecutionContext jobExecutionContext,
|
||||
EventPublisherJob jobData,
|
||||
App app,
|
||||
Function<JobExecutionContext, AppRunRecord> jobRecordProvider,
|
||||
QuartzOrchestratorContext.StatusPusher statusPusher) {
|
||||
this.jobExecutionContext = jobExecutionContext;
|
||||
this.jobData = jobData;
|
||||
this.app = app;
|
||||
this.jobRecordProvider = jobRecordProvider;
|
||||
this.statusPusher = statusPusher;
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
@ -166,42 +178,81 @@ public class QuartzProgressListener implements ReindexingProgressListener {
|
|||
.getJobDataMap()
|
||||
.put(WEBSOCKET_STATUS_CHANNEL, SEARCH_INDEX_JOB_BROADCAST_CHANNEL);
|
||||
|
||||
updateRecordToDbAndNotify();
|
||||
updateRecordAndNotify(force);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to send updated stats with WebSocket", ex);
|
||||
LOG.error("Failed to send updated stats", ex);
|
||||
}
|
||||
}
|
||||
|
||||
private void updateRecordToDbAndNotify() {
|
||||
AppRunRecord appRecord = createAppRunRecord();
|
||||
private void updateRecordAndNotify(boolean forceDbUpdate) {
|
||||
AppRunRecord appRecord = getUpdatedAppRunRecord();
|
||||
|
||||
persistToDb(appRecord, forceDbUpdate);
|
||||
broadcastViaWebSocket(appRecord);
|
||||
}
|
||||
|
||||
private void persistToDb(AppRunRecord appRecord, boolean force) {
|
||||
if (statusPusher == null) {
|
||||
return;
|
||||
}
|
||||
long currentTime = System.currentTimeMillis();
|
||||
if (!force && currentTime - lastDbUpdate < DB_UPDATE_INTERVAL_MS) {
|
||||
return;
|
||||
}
|
||||
lastDbUpdate = currentTime;
|
||||
try {
|
||||
statusPusher.push(jobExecutionContext, appRecord, true);
|
||||
} catch (Exception ex) {
|
||||
LOG.error("Failed to persist app run record to database", ex);
|
||||
}
|
||||
}
|
||||
|
||||
private void broadcastViaWebSocket(AppRunRecord appRecord) {
|
||||
if (WebSocketManager.getInstance() != null) {
|
||||
String messageJson = JsonUtils.pojoToJson(appRecord);
|
||||
WebSocketManager.getInstance()
|
||||
.broadCastMessageToAll(SEARCH_INDEX_JOB_BROADCAST_CHANNEL, messageJson);
|
||||
LOG.debug("Broad-casted job updates via WebSocket. Status: {}", appRecord.getStatus());
|
||||
}
|
||||
}
|
||||
|
||||
private AppRunRecord createAppRunRecord() {
|
||||
AppRunRecord appRecord = new AppRunRecord();
|
||||
appRecord.setAppId(app != null ? app.getId() : null);
|
||||
appRecord.setStartTime(jobData.getTimestamp());
|
||||
private AppRunRecord getUpdatedAppRunRecord() {
|
||||
AppRunRecord appRecord = readExistingRecord();
|
||||
appRecord.setStatus(AppRunRecord.Status.fromValue(jobData.getStatus().value()));
|
||||
|
||||
if (jobData.getStats() != null) {
|
||||
SuccessContext ctx = appRecord.getSuccessContext();
|
||||
if (ctx == null) {
|
||||
ctx = new SuccessContext();
|
||||
}
|
||||
ctx.withAdditionalProperty("stats", jobData.getStats());
|
||||
appRecord.setSuccessContext(ctx);
|
||||
}
|
||||
|
||||
if (jobData.getFailure() != null) {
|
||||
appRecord.setFailureContext(
|
||||
new FailureContext().withAdditionalProperty("failure", jobData.getFailure()));
|
||||
}
|
||||
|
||||
if (jobData.getStats() != null) {
|
||||
appRecord.setSuccessContext(
|
||||
new SuccessContext().withAdditionalProperty("stats", jobData.getStats()));
|
||||
}
|
||||
|
||||
return appRecord;
|
||||
}
|
||||
|
||||
private AppRunRecord readExistingRecord() {
|
||||
if (jobRecordProvider != null) {
|
||||
try {
|
||||
AppRunRecord existing = jobRecordProvider.apply(jobExecutionContext);
|
||||
if (existing != null) {
|
||||
return existing;
|
||||
}
|
||||
} catch (Exception ex) {
|
||||
LOG.debug("Could not read existing job record from context", ex);
|
||||
}
|
||||
}
|
||||
AppRunRecord fallback = new AppRunRecord();
|
||||
fallback.setAppId(app != null ? app.getId() : null);
|
||||
fallback.setStartTime(jobData.getTimestamp());
|
||||
return fallback;
|
||||
}
|
||||
|
||||
/** Get the current job data for external access */
|
||||
public EventPublisherJob getJobData() {
|
||||
return jobData;
|
||||
|
|
|
|||
|
|
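The persistToDb change above skips a database write when the previous one happened less than DB_UPDATE_INTERVAL_MS ago, unless the caller forces it. A minimal sketch of that throttling pattern follows. The class name ThrottledWriter and its methods are hypothetical and not part of the commit, and the clock is passed in as a parameter so the behavior is deterministic. Like the original check-then-set on a volatile field, it is not atomic under concurrent callers.

```java
/** Hypothetical helper (not in the commit): interval-based throttling with a force override. */
final class ThrottledWriter {
  private final long intervalMs;
  private volatile long lastWrite = 0;
  private int writes = 0;

  ThrottledWriter(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  /**
   * Records a write unless one already happened within the interval.
   * A forced call always writes. Returns true when a write was recorded.
   */
  boolean write(long nowMs, boolean force) {
    if (!force && nowMs - lastWrite < intervalMs) {
      return false; // too soon since the last write, skip this update
    }
    lastWrite = nowMs;
    writes++;
    return true;
  }

  int writes() {
    return writes;
  }

  public static void main(String[] args) {
    ThrottledWriter writer = new ThrottledWriter(5000);
    System.out.println(writer.write(10_000, false)); // first write passes
    System.out.println(writer.write(12_000, false)); // inside the interval, skipped
    System.out.println(writer.write(12_500, true)); // forced, always written
    System.out.println(writer.writes());
  }
}
```

Forcing the final update is what keeps a terminal status from being dropped when it arrives shortly after a periodic update.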
@@ -4,6 +4,7 @@ import java.util.UUID;
 import java.util.concurrent.atomic.AtomicLong;
 import lombok.Getter;
 import lombok.extern.slf4j.Slf4j;
+import org.openmetadata.service.apps.bundles.searchIndex.ReindexingMetrics;
 import org.openmetadata.service.jdbi3.CollectionDAO;
 
 /**
|
|
@@ -113,8 +114,8 @@ public class StageStatsTracker {
     long deadline = System.currentTimeMillis() + timeoutMs;
     while (pendingSinkOps.get() > 0) {
       if (System.currentTimeMillis() >= deadline) {
-        LOG.warn(
-            "Timed out waiting for {} pending sink operations for job {} entity {}",
+        LOG.debug(
+            "Await cycle expired with {} pending sink operations for job {} entity {}",
             pendingSinkOps.get(),
             jobId,
             entityType);
|
|
@ -135,6 +136,24 @@ public class StageStatsTracker {
|
|||
return pendingSinkOps.get();
|
||||
}
|
||||
|
||||
/**
|
||||
* Reconcile any remaining pending sink operations by recording them as successful. This should
|
||||
* only be called after the bulk processor has been flushed — at that point, submitted records are
|
||||
* either written or would have been reported as failures through the error handler. Pending ops
|
||||
* that remain are callbacks that didn't fire in time, not actual write failures.
|
||||
*/
|
||||
public void reconcilePendingSinkOps() {
|
||||
long remaining = pendingSinkOps.getAndSet(0);
|
||||
if (remaining > 0) {
|
||||
sink.add((int) remaining, 0, 0);
|
||||
LOG.info(
|
||||
"Reconciled {} pending sink operations as successful for job {} entity {}",
|
||||
remaining,
|
||||
jobId,
|
||||
entityType);
|
||||
}
|
||||
}
|
||||
|
||||
public void recordVector(StatsResult result) {
|
||||
vector.record(result);
|
||||
checkFlush();
|
||||
|
|
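The reconcilePendingSinkOps method above drains the pending counter with a single getAndSet(0) and credits whatever was left as successful. A minimal sketch of that pattern is given here. The PendingOpsTracker class and its method names are hypothetical stand-ins for the tracker's real fields, and the success counter stands in for the sink.add(...) call.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in (not in the commit) for the pending/success bookkeeping. */
final class PendingOpsTracker {
  private final AtomicLong pending = new AtomicLong();
  private final AtomicLong success = new AtomicLong();

  /** Records n operations handed to the sink whose callbacks have not fired yet. */
  void submitted(int n) {
    pending.addAndGet(n);
  }

  /** Records n callbacks that confirmed a successful write. */
  void acknowledged(int n) {
    pending.addAndGet(-n);
    success.addAndGet(n);
  }

  /**
   * Drains the pending counter in one atomic step and counts the remainder as successful.
   * Returns how many operations were reconciled.
   */
  long reconcile() {
    long remaining = pending.getAndSet(0);
    if (remaining > 0) {
      success.addAndGet(remaining);
    }
    return remaining;
  }

  long pending() {
    return pending.get();
  }

  long success() {
    return success.get();
  }

  public static void main(String[] args) {
    PendingOpsTracker tracker = new PendingOpsTracker();
    tracker.submitted(10);
    tracker.acknowledged(7);
    System.out.println(tracker.reconcile());
    System.out.println(tracker.success());
  }
}
```

Because getAndSet reads and clears in one step, a second call reconciles nothing, so the remainder cannot be credited twice.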
@@ -184,6 +203,19 @@ public class StageStatsTracker {
       return;
     }
 
+    ReindexingMetrics metrics = ReindexingMetrics.getInstance();
+    if (metrics != null) {
+      if (rSuccess > 0) metrics.recordStageSuccess("reader", entityType, rSuccess);
+      if (rFailed > 0) metrics.recordStageFailed("reader", entityType, rFailed);
+      if (rWarnings > 0) metrics.recordStageWarnings("reader", entityType, rWarnings);
+      if (pSuccess > 0) metrics.recordStageSuccess("process", entityType, pSuccess);
+      if (pFailed > 0) metrics.recordStageFailed("process", entityType, pFailed);
+      if (sSuccess > 0) metrics.recordStageSuccess("sink", entityType, sSuccess);
+      if (sFailed > 0) metrics.recordStageFailed("sink", entityType, sFailed);
+      if (vSuccess > 0) metrics.recordStageSuccess("vector", entityType, vSuccess);
+      if (vFailed > 0) metrics.recordStageFailed("vector", entityType, vFailed);
+    }
+
     try {
       statsDAO.incrementStats(
           recordId.toString(),
|
|
|
|||
|
|
@@ -7507,6 +7507,27 @@ public interface CollectionDAO {
         connectionType = POSTGRES)
     void markStaleEntriesStopped(@Bind("appId") String appId);
 
+    @ConnectionAwareSqlUpdate(
+        value =
+            "UPDATE apps_extension_time_series SET json = JSON_SET(json, '$.status', 'stopped') WHERE appName=:appName AND JSON_UNQUOTE(JSON_EXTRACT(json, '$.status')) = 'running' AND extension = 'status'",
+        connectionType = MYSQL)
+    @ConnectionAwareSqlUpdate(
+        value =
+            "UPDATE apps_extension_time_series SET json = jsonb_set(json, '{status}', '\"stopped\"') WHERE appName = :appName AND json->>'status' = 'running' AND extension = 'status'",
+        connectionType = POSTGRES)
+    void markStaleEntriesStoppedByName(@Bind("appName") String appName);
+
+    @ConnectionAwareSqlUpdate(
+        value =
+            "UPDATE apps_extension_time_series SET json = JSON_SET(json, '$.status', 'stopped') WHERE appName=:appName AND JSON_UNQUOTE(JSON_EXTRACT(json, '$.status')) = 'running' AND extension = 'status' AND timestamp < :beforeTimestamp",
+        connectionType = MYSQL)
+    @ConnectionAwareSqlUpdate(
+        value =
+            "UPDATE apps_extension_time_series SET json = jsonb_set(json, '{status}', '\"stopped\"') WHERE appName = :appName AND json->>'status' = 'running' AND extension = 'status' AND timestamp < :beforeTimestamp",
+        connectionType = POSTGRES)
+    void markStaleEntriesStoppedBefore(
+        @Bind("appName") String appName, @Bind("beforeTimestamp") long beforeTimestamp);
+
     @ConnectionAwareSqlUpdate(
         value =
             "UPDATE apps_extension_time_series SET json = JSON_SET(json, '$.status', 'failed') WHERE JSON_UNQUOTE(JSON_EXTRACT(json, '$.status')) = 'running' AND extension = 'status'",
|
|
@@ -9644,10 +9665,13 @@ public interface CollectionDAO {
 
     @SqlQuery(
         "SELECT * FROM search_index_partition WHERE jobId = :jobId AND status = 'PROCESSING' "
-            + "AND assignedServer = :serverId ORDER BY claimedAt DESC LIMIT 1")
+            + "AND assignedServer = :serverId AND claimedAt = :claimedAt "
+            + "ORDER BY priority DESC, entityType, partitionIndex LIMIT 1")
     @RegisterRowMapper(SearchIndexPartitionMapper.class)
     SearchIndexPartitionRecord findLatestClaimedPartition(
-        @Bind("jobId") String jobId, @Bind("serverId") String serverId);
+        @Bind("jobId") String jobId,
+        @Bind("serverId") String serverId,
+        @Bind("claimedAt") long claimedAt);
 
     @SqlQuery(
         "SELECT * FROM search_index_partition WHERE jobId = :jobId AND status = :status "
|
|
|
|||
|
|
@@ -1224,6 +1224,11 @@ public abstract class EntityRepository<T extends EntityInterface> {
   public String getCursorAtOffset(ListFilter filter, int offset) {
     List<String> jsons = dao.listAfter(filter, 1, offset);
     if (jsons.isEmpty()) {
+      LOG.debug(
+          "getCursorAtOffset for {} at offset {} returned empty (filter condition={})",
+          entityType,
+          offset,
+          filter.getCondition(dao.getTableName()));
       return null;
     }
     T entity = JsonUtils.readValue(jsons.get(0), entityClass);
|
|
|
|||
|
|
@@ -45,7 +45,7 @@ public class HikariCPDataSourceFactory extends DataSourceFactory {
   private int minimumIdle = 10;
 
   @JsonProperty
-  @Max(100)
+  @Max(500)
   private int maximumPoolSize = 100;
 
   @JsonProperty private Long connectionTimeout;
|
|
|
|||
|
|
@@ -22,6 +22,7 @@ import java.io.IOException;
 import lombok.extern.slf4j.Slf4j;
 import org.glassfish.jersey.internal.inject.AbstractBinder;
 import org.openmetadata.service.OpenMetadataApplicationConfig;
+import org.openmetadata.service.apps.bundles.searchIndex.ReindexingMetrics;
 
 /**
  * Dropwizard bundle for configuring Micrometer metrics with Prometheus backend.
|
|
@@ -61,6 +62,8 @@ public class MicrometerBundle implements ConfiguredBundle<OpenMetadataApplicatio
     // Create StreamableLogsMetrics instance
     streamableLogsMetrics = new StreamableLogsMetrics(prometheusMeterRegistry);
 
+    ReindexingMetrics.initialize(prometheusMeterRegistry);
+
     // Register Prometheus endpoint on admin connector
     registerPrometheusEndpoint(environment);
 
|
|
|
|||
|
|
@ -0,0 +1,282 @@
|
|||
/*
|
||||
* Copyright 2021 Collate
|
||||
* Licensed under the Apache License, Version 2.0 (the "License");
|
||||
* you may not use this file except in compliance with the License.
|
||||
* You may obtain a copy of the License at
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
* Unless required by applicable law or agreed to in writing, software
|
||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
* See the License for the specific language governing permissions and
|
||||
* limitations under the License.
|
||||
*/
|
||||
|
||||
package org.openmetadata.service.resources.system;
|
||||
|
||||
import io.micrometer.core.instrument.Gauge;
|
||||
import io.micrometer.core.instrument.Meter;
|
||||
import io.micrometer.core.instrument.MeterRegistry;
|
||||
import io.micrometer.core.instrument.Metrics;
|
||||
import io.micrometer.core.instrument.Timer;
|
||||
import io.micrometer.core.instrument.search.Search;
|
||||
import io.swagger.v3.oas.annotations.Operation;
|
||||
import io.swagger.v3.oas.annotations.responses.ApiResponse;
|
||||
import io.swagger.v3.oas.annotations.tags.Tag;
|
||||
import jakarta.ws.rs.GET;
|
||||
import jakarta.ws.rs.Path;
|
||||
import jakarta.ws.rs.Produces;
|
||||
import jakarta.ws.rs.core.Context;
|
||||
import jakarta.ws.rs.core.MediaType;
|
||||
import jakarta.ws.rs.core.Response;
|
||||
import jakarta.ws.rs.core.SecurityContext;
|
||||
import java.lang.management.GarbageCollectorMXBean;
|
||||
import java.lang.management.ManagementFactory;
|
||||
import java.lang.management.MemoryMXBean;
|
||||
import java.lang.management.MemoryUsage;
|
||||
import java.lang.management.OperatingSystemMXBean;
|
||||
import java.lang.management.RuntimeMXBean;
|
||||
import java.lang.management.ThreadMXBean;
|
||||
import java.time.Instant;
|
||||
import java.util.LinkedHashMap;
|
||||
import java.util.Map;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.openmetadata.service.jdbi3.BulkExecutor;
|
||||
import org.openmetadata.service.security.Authorizer;
|
||||
|
||||
@Path("/v1/system/diagnostics")
|
||||
@Tag(
|
||||
name = "System",
|
||||
description = "System diagnostics providing a performance snapshot for load test correlation")
|
||||
@Produces(MediaType.APPLICATION_JSON)
|
||||
@Slf4j
|
||||
public class DiagnosticsResource {
|
||||
private final Authorizer authorizer;
|
||||
|
||||
public DiagnosticsResource(Authorizer authorizer) {
|
||||
this.authorizer = authorizer;
|
||||
}
|
||||
|
||||
@GET
|
||||
@Operation(
|
||||
operationId = "getSystemDiagnostics",
|
||||
summary = "Get system diagnostics",
|
||||
description =
|
||||
"Returns a structured performance snapshot including JVM, Jetty thread pool, "
|
||||
+ "database connection pool, bulk executor, and per-endpoint latency breakdown.",
|
||||
responses = {
|
||||
@ApiResponse(responseCode = "200", description = "Diagnostics snapshot"),
|
||||
})
|
||||
public Response getDiagnostics(@Context SecurityContext securityContext) {
|
||||
authorizer.authorizeAdmin(securityContext);
|
||||
Map<String, Object> diagnostics = new LinkedHashMap<>();
|
||||
diagnostics.put("timestamp", Instant.now().toString());
|
||||
diagnostics.put("jvm", collectJvmMetrics());
|
||||
diagnostics.put("jetty", collectJettyMetrics());
|
||||
diagnostics.put("database", collectDatabaseMetrics());
|
||||
diagnostics.put("bulk_executor", collectBulkExecutorMetrics());
|
||||
diagnostics.put("request_latency", collectRequestLatencyMetrics());
|
||||
return Response.ok(diagnostics).build();
|
||||
}
|
||||
|
||||
private Map<String, Object> collectJvmMetrics() {
|
||||
Map<String, Object> jvm = new LinkedHashMap<>();
|
||||
|
||||
MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
|
||||
MemoryUsage heapUsage = memoryBean.getHeapMemoryUsage();
|
||||
MemoryUsage nonHeapUsage = memoryBean.getNonHeapMemoryUsage();
|
||||
|
||||
long heapUsed = heapUsage.getUsed();
|
||||
long heapMax = heapUsage.getMax();
|
||||
jvm.put("heap_used_bytes", heapUsed);
|
||||
jvm.put("heap_max_bytes", heapMax);
|
||||
jvm.put("heap_usage_pct", heapMax > 0 ? Math.round(heapUsed * 1000.0 / heapMax) / 10.0 : 0.0);
|
||||
jvm.put("non_heap_used_bytes", nonHeapUsage.getUsed());
|
||||
|
||||
long gcPauseTotalMs = 0;
|
||||
long gcCount = 0;
|
||||
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
|
||||
gcPauseTotalMs += gcBean.getCollectionTime();
|
||||
gcCount += gcBean.getCollectionCount();
|
||||
}
|
||||
jvm.put("gc_pause_total_ms", gcPauseTotalMs);
|
||||
jvm.put("gc_count", gcCount);
|
||||
|
||||
ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
|
||||
jvm.put("thread_count", threadBean.getThreadCount());
|
||||
jvm.put("thread_peak", threadBean.getPeakThreadCount());
|
||||
|
||||
OperatingSystemMXBean osBean = ManagementFactory.getOperatingSystemMXBean();
|
||||
if (osBean instanceof com.sun.management.OperatingSystemMXBean sunOsBean) {
|
||||
double processCpu = sunOsBean.getProcessCpuLoad();
|
||||
double systemCpu = sunOsBean.getCpuLoad();
|
||||
jvm.put("cpu_process_pct", Math.round(processCpu * 1000.0) / 10.0);
|
||||
jvm.put("cpu_system_pct", Math.round(systemCpu * 1000.0) / 10.0);
|
||||
} else {
|
||||
jvm.put("cpu_process_pct", -1.0);
|
||||
jvm.put("cpu_system_pct", -1.0);
|
||||
}
|
||||
|
||||
RuntimeMXBean runtimeBean = ManagementFactory.getRuntimeMXBean();
|
||||
jvm.put("uptime_seconds", runtimeBean.getUptime() / 1000);
|
||||
|
||||
return jvm;
|
||||
}
|
||||
|
||||
private Map<String, Object> collectJettyMetrics() {
|
||||
Map<String, Object> jetty = new LinkedHashMap<>();
|
||||
MeterRegistry registry = Metrics.globalRegistry;
|
||||
|
||||
jetty.put("threads_current", gaugeValue(registry, "jetty.threads.current"));
|
||||
jetty.put("threads_busy", gaugeValue(registry, "jetty.threads.busy"));
|
||||
jetty.put("threads_idle", gaugeValue(registry, "jetty.threads.idle"));
|
||||
jetty.put("threads_max", gaugeValue(registry, "jetty.threads.max"));
|
||||
|
||||
double current = gaugeValue(registry, "jetty.threads.current");
|
||||
double busy = gaugeValue(registry, "jetty.threads.busy");
|
||||
jetty.put("utilization_pct", current > 0 ? Math.round(busy / current * 1000.0) / 10.0 : 0.0);
|
||||
|
||||
jetty.put("queue_size", gaugeValue(registry, "jetty.queue.size"));
|
||||
jetty.put("queue_time_avg_ms", gaugeValue(registry, "jetty.request.queue.time.ms"));
|
||||
jetty.put("active_requests", gaugeValue(registry, "jetty.requests.active"));
|
||||
|
||||
double virtualEnabled = gaugeValue(registry, "jetty.virtual.threads.enabled");
|
||||
jetty.put("virtual_threads_enabled", virtualEnabled > 0);
|
||||
|
||||
return jetty;
|
||||
}
|
||||
|
||||
private Map<String, Object> collectDatabaseMetrics() {
|
||||
Map<String, Object> db = new LinkedHashMap<>();
|
||||
MeterRegistry registry = Metrics.globalRegistry;
|
||||
|
||||
double active = gaugeValue(registry, "hikaricp.connections.active");
|
||||
double idle = gaugeValue(registry, "hikaricp.connections.idle");
|
||||
double total = active + idle;
|
||||
double max = gaugeValue(registry, "hikaricp.connections.max");
|
||||
double pending = gaugeValue(registry, "hikaricp.connections.pending");
|
||||
|
||||
db.put("pool_active", (int) active);
|
||||
db.put("pool_idle", (int) idle);
|
||||
db.put("pool_total", (int) total);
|
||||
db.put("pool_max", (int) max);
|
||||
db.put("pool_pending", (int) pending);
|
||||
db.put("pool_usage_pct", max > 0 ? Math.round(active / max * 1000.0) / 10.0 : 0.0);
|
||||
|
||||
double timeoutMs = gaugeValue(registry, "hikaricp.connections.timeout");
|
||||
db.put("connection_timeout_ms", timeoutMs > 0 ? (long) timeoutMs : 30000L);
|
||||
|
||||
return db;
|
||||
}
|
||||
|
||||
private Map<String, Object> collectBulkExecutorMetrics() {
|
||||
Map<String, Object> bulk = new LinkedHashMap<>();
|
||||
|
||||
try {
|
||||
BulkExecutor executor = BulkExecutor.getInstance();
|
||||
int maxThreads = executor.getMaxThreads();
|
||||
int activeThreads = executor.getActiveCount();
|
||||
int queueDepth = executor.getQueueDepth();
|
||||
int queueCapacity = executor.getQueueSize();
|
||||
|
||||
bulk.put("max_threads", maxThreads);
|
||||
bulk.put("active_threads", activeThreads);
|
||||
bulk.put("queue_depth", queueDepth);
|
||||
bulk.put("queue_capacity", queueCapacity);
|
||||
bulk.put(
|
||||
"queue_usage_pct",
|
||||
queueCapacity > 0 ? Math.round(queueDepth * 1000.0 / queueCapacity) / 10.0 : 0.0);
|
||||
bulk.put("has_capacity", executor.hasCapacity());
|
||||
} catch (Exception e) {
|
||||
LOG.debug("Could not collect BulkExecutor metrics: {}", e.getMessage());
|
||||
bulk.put("error", "BulkExecutor not available");
|
||||
}
|
||||
|
||||
return bulk;
|
||||
}
|
||||
|
||||
private Map<String, Object> collectRequestLatencyMetrics() {
|
||||
Map<String, Object> latencyMap = new LinkedHashMap<>();
|
||||
MeterRegistry registry = Metrics.globalRegistry;
|
||||
|
||||
Search totalTimerSearch = registry.find("request.latency.total");
|
||||
for (Meter meter : totalTimerSearch.meters()) {
|
||||
if (!(meter instanceof Timer totalTimer)) {
|
||||
continue;
|
||||
}
|
||||
|
||||
String endpoint =
|
||||
totalTimer.getId().getTag("endpoint") != null
|
||||
? totalTimer.getId().getTag("endpoint")
|
||||
: "unknown";
|
||||
String method =
|
||||
totalTimer.getId().getTag("method") != null
|
||||
? totalTimer.getId().getTag("method")
|
||||
: "unknown";
|
||||
String key = method + " " + endpoint;
|
||||
|
||||
long count = totalTimer.count();
|
||||
if (count == 0) {
|
||||
continue;
|
||||
}
|
||||
|
||||
double avgTotalMs = totalTimer.mean(TimeUnit.MILLISECONDS);
|
||||
|
||||
double avgDbMs = timerMean(registry, "request.latency.database", endpoint, method);
|
||||
double avgSearchMs = timerMean(registry, "request.latency.search", endpoint, method);
|
||||
double avgInternalMs = timerMean(registry, "request.latency.internal", endpoint, method);
|
||||
|
||||
double dbPct = avgTotalMs > 0 ? Math.round(avgDbMs / avgTotalMs * 1000.0) / 10.0 : 0.0;
|
||||
double searchPct =
|
||||
avgTotalMs > 0 ? Math.round(avgSearchMs / avgTotalMs * 1000.0) / 10.0 : 0.0;
|
||||
double internalPct =
|
||||
avgTotalMs > 0 ? Math.round(avgInternalMs / avgTotalMs * 1000.0) / 10.0 : 0.0;
|
||||
|
||||
double avgDbOps = summaryMean(registry, "request.operations.database", endpoint, method);
|
||||
double avgSearchOps = summaryMean(registry, "request.operations.search", endpoint, method);
|
||||
|
||||
Map<String, Object> entry = new LinkedHashMap<>();
|
||||
entry.put("count", count);
|
||||
entry.put("avg_total_ms", Math.round(avgTotalMs * 10.0) / 10.0);
|
||||
entry.put("avg_db_ms", Math.round(avgDbMs * 10.0) / 10.0);
|
||||
entry.put("avg_search_ms", Math.round(avgSearchMs * 10.0) / 10.0);
|
||||
entry.put("avg_internal_ms", Math.round(avgInternalMs * 10.0) / 10.0);
|
||||
entry.put("db_pct", dbPct);
|
||||
entry.put("search_pct", searchPct);
|
||||
entry.put("internal_pct", internalPct);
|
||||
entry.put("avg_db_ops", Math.round(avgDbOps * 10.0) / 10.0);
|
||||
entry.put("avg_search_ops", Math.round(avgSearchOps * 10.0) / 10.0);
|
||||
|
||||
latencyMap.put(key, entry);
|
||||
}
|
||||
|
||||
return latencyMap;
|
||||
}
|
||||
|
||||
private static double gaugeValue(MeterRegistry registry, String name) {
|
||||
Gauge gauge = registry.find(name).gauge();
|
||||
if (gauge != null) {
|
||||
return gauge.value();
|
||||
}
|
||||
return 0.0;
|
||||
}
|
||||
|
||||
private static double timerMean(
|
||||
MeterRegistry registry, String name, String endpoint, String method) {
|
||||
Timer timer = registry.find(name).tag("endpoint", endpoint).tag("method", method).timer();
|
||||
if (timer != null && timer.count() > 0) {
|
||||
return timer.mean(TimeUnit.MILLISECONDS);
|
||||
}
|
||||
return 0.0;
|
||||
}
|
||||
|
||||
private static double summaryMean(
|
||||
MeterRegistry registry, String name, String endpoint, String method) {
|
||||
io.micrometer.core.instrument.DistributionSummary summary =
|
||||
registry.find(name).tag("endpoint", endpoint).tag("method", method).summary();
|
||||
if (summary != null && summary.count() > 0) {
|
||||
return summary.mean();
|
||||
}
|
||||
return 0.0;
|
||||
}
|
||||
}
|
||||
|
|
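DiagnosticsResource computes every percentage the same way: scale the ratio by 1000, round to a whole number, then divide by 10 to keep one decimal place, with a guard against a zero denominator. A minimal sketch of that arithmetic as a single helper follows. The Percent class and oneDecimal method are hypothetical names, since the resource inlines the expression at each call site.

```java
/** Hypothetical helper (not in the commit): percentage rounded to one decimal place. */
final class Percent {
  private Percent() {}

  /** Returns part/whole as a percentage with one decimal, or 0.0 when whole is not positive. */
  static double oneDecimal(double part, double whole) {
    // Math.round yields tenths of a percent as a long; dividing by 10.0 restores the decimal.
    return whole > 0 ? Math.round(part * 1000.0 / whole) / 10.0 : 0.0;
  }

  public static void main(String[] args) {
    System.out.println(oneDecimal(1, 3));
    System.out.println(oneDecimal(50, 200));
    System.out.println(oneDecimal(5, 0));
  }
}
```

Centralizing the expression in one helper would keep the zero guard from being forgotten at a new call site, though the commit itself does not do this.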
@ -7,6 +7,7 @@ import java.util.Set;
|
|||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.openmetadata.search.IndexMapping;
|
||||
import org.openmetadata.service.Entity;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.ReindexingMetrics;
|
||||
|
||||
/**
|
||||
* Default implementation of RecreateHandler that provides zero-downtime index recreation.
|
||||
|
|
@ -52,34 +53,60 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
|
|||
|
||||
if (canonicalIndex == null || stagedIndex == null) {
|
||||
LOG.error(
|
||||
"Cannot finalize reindex for entity '{}'. Missing canonical or staged index name.",
|
||||
entityType);
|
||||
"Cannot finalize reindex for entity '{}'. canonicalIndex={}, stagedIndex={}",
|
||||
entityType,
|
||||
canonicalIndex,
|
||||
stagedIndex);
|
||||
return;
|
||||
}
|
||||
|
||||
if (reindexSuccess) {
|
||||
// Always-promote: partial data is better than no data. When reindex failed but the staged
|
||||
// index has documents, promote it. Only delete if truly empty.
|
||||
boolean shouldPromote = reindexSuccess;
|
||||
if (!shouldPromote) {
|
||||
long docCount = searchClient.getDocumentCount(stagedIndex);
|
||||
if (docCount > 0) {
|
||||
LOG.info(
|
||||
"Reindex failed for entity '{}' but staged index '{}' has {} documents. "
|
||||
+ "Promoting partial data (partial data > no data).",
|
||||
entityType,
|
||||
stagedIndex,
|
||||
docCount);
|
||||
shouldPromote = true;
|
||||
} else if (docCount == 0) {
|
||||
LOG.info(
|
||||
"Reindex failed for entity '{}' and staged index '{}' has 0 documents. "
|
||||
+ "Deleting empty staged index.",
|
||||
entityType,
|
||||
stagedIndex);
|
||||
} else {
|
||||
LOG.warn(
|
||||
"Could not determine doc count for staged index '{}' (entity '{}'). "
|
||||
+ "Promoting to avoid data loss.",
|
||||
stagedIndex,
|
||||
entityType);
|
||||
shouldPromote = true;
|
||||
}
|
||||
}
|
||||
|
||||
if (shouldPromote) {
|
||||
try {
|
||||
Set<String> aliasesToAttach = new HashSet<>();
|
||||
|
||||
// Existing Aliases
|
||||
existingAliases.stream()
|
||||
.filter(alias -> alias != null && !alias.isBlank())
|
||||
.forEach(aliasesToAttach::add);
|
||||
|
||||
// Canonical Alias
|
||||
if (!nullOrEmpty(canonicalAlias)) {
|
||||
aliasesToAttach.add(canonicalAlias);
|
||||
}
|
||||
|
||||
// Parent Aliases
|
||||
parentAliases.stream()
|
||||
.filter(alias -> alias != null && !alias.isBlank())
|
||||
.forEach(aliasesToAttach::add);
|
||||
|
||||
// Remove any null or blank aliases
|
||||
aliasesToAttach.removeIf(alias -> alias == null || alias.isBlank());
|
||||
|
||||
// Collect all old indices to delete (except staged)
|
||||
Set<String> allEntityIndices = searchClient.listIndicesByPrefix(canonicalIndex);
|
||||
Set<String> oldIndicesToDelete = new HashSet<>();
|
||||
for (String oldIndex : allEntityIndices) {
|
||||
|
|
@ -88,7 +115,13 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
|
|||
}
|
||||
}
|
||||
|
||||
// Canonical Indexes needs to be removed before attached that as aliases
|
||||
LOG.debug(
|
||||
"finalizeReindex entity '{}': aliases={}, oldIndices={}, stagedIndex={}",
|
||||
entityType,
|
||||
aliasesToAttach,
|
||||
oldIndicesToDelete,
|
||||
stagedIndex);
|
||||
|
||||
if (oldIndicesToDelete.contains(canonicalIndex)) {
|
||||
if (searchClient.indexExists(canonicalIndex)) {
|
||||
searchClient.deleteIndexWithBackoff(canonicalIndex);
|
||||
|
|
@ -97,8 +130,6 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
|
|||
}
|
||||
}
|
||||
|
||||
// Atomically swap aliases from old indices to staged index
|
||||
// This ensures zero-downtime: aliases point to new index before old ones are deleted
|
||||
if (!aliasesToAttach.isEmpty()) {
|
||||
boolean swapSuccess =
|
||||
searchClient.swapAliases(oldIndicesToDelete, stagedIndex, aliasesToAttach);
|
||||
|
|
@ -111,12 +142,17 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
|
|||
}
|
||||
|
||||
LOG.info(
|
||||
"Promoted staged index '{}' to serve entity '{}' (aliases: {}).",
|
||||
"Promoted staged index '{}' to serve entity '{}' (aliases: {}, reindexSuccess: {}).",
|
||||
stagedIndex,
|
||||
entityType,
|
||||
aliasesToAttach);
|
||||
aliasesToAttach,
|
||||
reindexSuccess);
|
||||
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
if (metrics != null) {
|
||||
metrics.recordPromotionSuccess(entityType);
|
||||
}
|
||||
|
||||
// Delete old indices after successful alias swap (with backoff for snapshot scenarios)
|
||||
for (String oldIndex : oldIndicesToDelete) {
|
||||
try {
|
||||
if (searchClient.indexExists(oldIndex)) {
|
||||
|
|
@ -131,6 +167,10 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
|
|||
} catch (Exception ex) {
|
||||
LOG.error(
|
||||
"Failed to promote staged index '{}' for entity '{}'.", stagedIndex, entityType, ex);
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
if (metrics != null) {
|
||||
metrics.recordPromotionFailure(entityType);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
try {
|
||||
|
|
@ -166,12 +206,39 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
|
|||
|
||||
if (canonicalIndex == null || stagedIndex == null) {
|
||||
LOG.error(
|
||||
"Cannot promote index for entity '{}'. Missing canonical or staged index name.",
|
||||
entityType);
|
||||
"Cannot promote index for entity '{}'. canonicalIndex={}, stagedIndex={}",
|
||||
entityType,
|
||||
canonicalIndex,
|
||||
stagedIndex);
|
||||
return;
|
||||
}
|
||||
|
||||
if (!reindexSuccess) {
|
||||
// Always-promote: check doc count when reindex failed
|
||||
boolean shouldPromote = reindexSuccess;
|
||||
if (!shouldPromote) {
|
||||
long docCount = searchClient.getDocumentCount(stagedIndex);
|
||||
if (docCount > 0) {
|
||||
LOG.info(
|
||||
"Per-entity reindex failed for '{}' but staged index '{}' has {} documents. Promoting.",
|
||||
entityType,
|
||||
stagedIndex,
|
||||
docCount);
|
||||
shouldPromote = true;
|
||||
} else if (docCount == 0) {
|
||||
LOG.info(
|
||||
"Per-entity reindex failed for '{}' and staged index '{}' is empty. Deleting.",
|
||||
entityType,
|
||||
stagedIndex);
|
||||
} else {
|
||||
LOG.warn(
|
||||
"Could not determine doc count for staged index '{}' (entity '{}'). Promoting.",
|
||||
stagedIndex,
|
||||
entityType);
|
||||
shouldPromote = true;
|
||||
}
|
||||
}
|
||||
|
||||
if (!shouldPromote) {
|
||||
try {
|
||||
if (searchClient.indexExists(stagedIndex)) {
|
||||
searchClient.deleteIndexWithBackoff(stagedIndex);
|
||||
|
|
@ -191,11 +258,9 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
|
|||
}
|
||||
|
||||
try {
|
||||
// Get aliases from indexMapping.json (not from old index)
|
||||
Set<String> aliasesToAttach =
|
||||
getAliasesFromMapping(indexMapping, searchRepository.getClusterAlias());
|
||||
|
||||
// Find old indices with this prefix (except staged)
|
||||
Set<String> allEntityIndices = searchClient.listIndicesByPrefix(canonicalIndex);
|
||||
Set<String> oldIndicesToDelete = new HashSet<>();
|
||||
for (String oldIndex : allEntityIndices) {
@@ -204,7 +269,13 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
        }
      }

      // Canonical Indexes needs to be removed before attached that as aliases
      LOG.debug(
          "promoteEntityIndex '{}': aliases={}, oldIndices={}, stagedIndex={}",
          entityType,
          aliasesToAttach,
          oldIndicesToDelete,
          stagedIndex);

      if (oldIndicesToDelete.contains(canonicalIndex)) {
        if (searchClient.indexExists(canonicalIndex)) {
          searchClient.deleteIndexWithBackoff(canonicalIndex);
@@ -213,26 +284,35 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
        }
      }

      // Atomically swap aliases from old indices to staged index
      // This ensures zero-downtime: aliases point to new index before old ones are deleted
      if (!aliasesToAttach.isEmpty()) {
        boolean swapSuccess =
            searchClient.swapAliases(oldIndicesToDelete, stagedIndex, aliasesToAttach);
        if (!swapSuccess) {
          LOG.error(
              "Failed to atomically swap aliases for entity '{}'. Old indices will not be deleted.",
              entityType);
              "Failed to atomically swap aliases for entity '{}'. "
                  + "oldIndices={}, stagedIndex={}, aliases={}",
              entityType,
              oldIndicesToDelete,
              stagedIndex,
              aliasesToAttach);
          return;
        }
      } else {
        LOG.warn("Entity '{}': aliasesToAttach is empty, skipping alias swap", entityType);
      }

      LOG.info(
          "Promoted staged index '{}' to serve entity '{}' (aliases: {}).",
          "Promoted staged index '{}' to serve entity '{}' (aliases: {}, reindexSuccess: {}).",
          stagedIndex,
          entityType,
          aliasesToAttach);
          aliasesToAttach,
          reindexSuccess);

      ReindexingMetrics promoteMetrics = ReindexingMetrics.getInstance();
      if (promoteMetrics != null) {
        promoteMetrics.recordPromotionSuccess(entityType);
      }

      // Delete old indices after successful alias swap (with backoff for snapshot scenarios)
      for (String oldIndex : oldIndicesToDelete) {
        try {
          if (searchClient.indexExists(oldIndex)) {
@@ -247,6 +327,10 @@ public class DefaultRecreateHandler implements RecreateIndexHandler {
    } catch (Exception ex) {
      LOG.error(
          "Failed to promote staged index '{}' for entity '{}'.", stagedIndex, entityType, ex);
      ReindexingMetrics promoteMetrics = ReindexingMetrics.getInstance();
      if (promoteMetrics != null) {
        promoteMetrics.recordPromotionFailure(entityType);
      }
    }
  }

@@ -143,4 +143,23 @@ public interface IndexManagementClient {
      Set<String> aliases) {}

  List<IndexStats> getAllIndexStats() throws IOException;

  /**
   * Get the document count for a specific index.
   *
   * @param indexName the name of the index
   * @return the number of documents in the index, or -1 if count cannot be determined
   */
  default long getDocumentCount(String indexName) {
    try {
      for (IndexStats stats : getAllIndexStats()) {
        if (stats.name().equals(indexName)) {
          return stats.documents();
        }
      }
    } catch (Exception e) {
      return -1;
    }
    return 0;
  }
}
@@ -72,6 +72,21 @@ public class PaginatedEntityTimeSeriesSource
    this.endTs = endTs;
  }

  public PaginatedEntityTimeSeriesSource(
      String entityType,
      int batchSize,
      List<String> fields,
      int knownTotal,
      Long startTs,
      Long endTs) {
    this.entityType = entityType;
    this.batchSize = batchSize;
    this.fields = fields;
    this.stats.withTotalRecords(knownTotal).withSuccessRecords(0).withFailedRecords(0);
    this.startTs = startTs;
    this.endTs = endTs;
  }

  @Override
  public ResultList<? extends EntityTimeSeriesInterface> readNext(Map<String, Object> contextData)
      throws SearchIndexException {
@@ -16,7 +16,8 @@
     "initialBackoff": 1000,
     "maxBackoff": 10000,
     "searchIndexMappingLanguage": "EN",
-    "autoTune": false
+    "autoTune": false,
+    "timeSeriesMaxDays": 0
   },
   "appSchedule": {
     "scheduleTimeline": "Custom",
@@ -0,0 +1,72 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

@DisplayName("AdaptiveBackoff Tests")
class AdaptiveBackoffTest {

  @Test
  @DisplayName("returns initial delay on first call")
  void initialDelay() {
    AdaptiveBackoff backoff = new AdaptiveBackoff(100, 2000);
    assertEquals(100, backoff.nextDelay());
  }

  @Test
  @DisplayName("doubles delay on each subsequent call")
  void exponentialDoubling() {
    AdaptiveBackoff backoff = new AdaptiveBackoff(50, 10000);
    assertEquals(50, backoff.nextDelay());
    assertEquals(100, backoff.nextDelay());
    assertEquals(200, backoff.nextDelay());
    assertEquals(400, backoff.nextDelay());
    assertEquals(800, backoff.nextDelay());
  }

  @Test
  @DisplayName("caps at maxMs")
  void capAtMax() {
    AdaptiveBackoff backoff = new AdaptiveBackoff(100, 300);
    assertEquals(100, backoff.nextDelay());
    assertEquals(200, backoff.nextDelay());
    assertEquals(300, backoff.nextDelay());
    assertEquals(300, backoff.nextDelay());
  }

  @Test
  @DisplayName("reset returns to initial delay")
  void resetToInitial() {
    AdaptiveBackoff backoff = new AdaptiveBackoff(50, 1000);
    backoff.nextDelay();
    backoff.nextDelay();
    backoff.nextDelay();

    backoff.reset();
    assertEquals(50, backoff.nextDelay());
  }

  @Test
  @DisplayName("rejects invalid initialMs")
  void rejectsInvalidInitialMs() {
    assertThrows(IllegalArgumentException.class, () -> new AdaptiveBackoff(0, 1000));
    assertThrows(IllegalArgumentException.class, () -> new AdaptiveBackoff(-1, 1000));
  }

  @Test
  @DisplayName("rejects maxMs less than initialMs")
  void rejectsMaxLessThanInitial() {
    assertThrows(IllegalArgumentException.class, () -> new AdaptiveBackoff(200, 100));
  }

  @Test
  @DisplayName("works when initialMs equals maxMs")
  void initialEqualsMax() {
    AdaptiveBackoff backoff = new AdaptiveBackoff(500, 500);
    assertEquals(500, backoff.nextDelay());
    assertEquals(500, backoff.nextDelay());
  }
}
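
The AdaptiveBackoff tests above fix a small contract: the first call returns the initial delay, each later call doubles it up to a cap, `reset()` returns to the initial delay, and invalid bounds are rejected. The class under test is not part of this hunk, so the following is only a minimal sketch of one implementation that would satisfy those assertions. `AdaptiveBackoffSketch` is a hypothetical name, and the `long` return type and the `synchronized` methods are assumptions, not the project's code.

```java
// Hypothetical sketch, not the OpenMetadata AdaptiveBackoff source.
final class AdaptiveBackoffSketch {
  private final long initialMs;
  private final long maxMs;
  private long currentMs;

  AdaptiveBackoffSketch(long initialMs, long maxMs) {
    if (initialMs <= 0) {
      throw new IllegalArgumentException("initialMs must be positive");
    }
    if (maxMs < initialMs) {
      throw new IllegalArgumentException("maxMs must be >= initialMs");
    }
    this.initialMs = initialMs;
    this.maxMs = maxMs;
    this.currentMs = initialMs;
  }

  /** Returns the delay to use now, then doubles the stored delay up to maxMs. */
  synchronized long nextDelay() {
    long delay = currentMs;
    // Comparing against maxMs / 2 avoids overflow when doubling large values.
    currentMs = (currentMs > maxMs / 2) ? maxMs : currentMs * 2;
    return delay;
  }

  /** Returns the backoff to its initial delay, for example after a successful request. */
  synchronized void reset() {
    currentMs = initialMs;
  }
}
```

A caller would typically sleep for `nextDelay()` milliseconds after each failed attempt and call `reset()` once an attempt succeeds.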
@@ -0,0 +1,171 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

@DisplayName("BulkCircuitBreaker Tests")
class BulkCircuitBreakerTest {

  private BulkCircuitBreaker breaker;

  @BeforeEach
  void setUp() {
    breaker = new BulkCircuitBreaker(3, 5000, 1000);
  }

  @Test
  @DisplayName("starts in CLOSED state")
  void startsInClosedState() {
    assertEquals(BulkCircuitBreaker.State.CLOSED, breaker.getState());
    assertTrue(breaker.allowRequest());
  }

  @Test
  @DisplayName("transitions CLOSED → OPEN after threshold failures in window")
  void closedToOpenOnThreshold() {
    breaker.recordFailure();
    breaker.recordFailure();
    assertEquals(BulkCircuitBreaker.State.CLOSED, breaker.getState());

    breaker.recordFailure();
    assertEquals(BulkCircuitBreaker.State.OPEN, breaker.getState());
    assertFalse(breaker.allowRequest());
  }

  @Test
  @DisplayName("transitions OPEN → HALF_OPEN after probe interval")
  void openToHalfOpenAfterInterval() {
    BulkCircuitBreaker fastBreaker = new BulkCircuitBreaker(1, 5000, 50);

    fastBreaker.recordFailure();
    assertEquals(BulkCircuitBreaker.State.OPEN, fastBreaker.getState());
    assertFalse(fastBreaker.allowRequest());

    try {
      Thread.sleep(60);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }

    assertTrue(fastBreaker.allowRequest());
    assertEquals(BulkCircuitBreaker.State.HALF_OPEN, fastBreaker.getState());
  }

  @Test
  @DisplayName("transitions HALF_OPEN → CLOSED on success")
  void halfOpenToClosedOnSuccess() {
    BulkCircuitBreaker fastBreaker = new BulkCircuitBreaker(1, 5000, 50);

    fastBreaker.recordFailure();
    assertEquals(BulkCircuitBreaker.State.OPEN, fastBreaker.getState());

    try {
      Thread.sleep(60);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }

    fastBreaker.allowRequest();
    assertEquals(BulkCircuitBreaker.State.HALF_OPEN, fastBreaker.getState());

    fastBreaker.recordSuccess();
    assertEquals(BulkCircuitBreaker.State.CLOSED, fastBreaker.getState());
    assertTrue(fastBreaker.allowRequest());
  }

  @Test
  @DisplayName("transitions HALF_OPEN → OPEN on failure")
  void halfOpenToOpenOnFailure() {
    BulkCircuitBreaker fastBreaker = new BulkCircuitBreaker(1, 5000, 50);

    fastBreaker.recordFailure();

    try {
      Thread.sleep(60);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }

    fastBreaker.allowRequest();
    assertEquals(BulkCircuitBreaker.State.HALF_OPEN, fastBreaker.getState());

    fastBreaker.recordFailure();
    assertEquals(BulkCircuitBreaker.State.OPEN, fastBreaker.getState());
  }

  @Test
  @DisplayName("failures outside window do not count toward threshold")
  void expiryOutsideWindow() {
    BulkCircuitBreaker shortWindow = new BulkCircuitBreaker(3, 100, 1000);

    shortWindow.recordFailure();
    shortWindow.recordFailure();

    try {
      Thread.sleep(150);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }

    shortWindow.recordFailure();
    assertEquals(BulkCircuitBreaker.State.CLOSED, shortWindow.getState());
  }

  @Test
  @DisplayName("reset forces CLOSED state")
  void resetForcesClosed() {
    breaker.recordFailure();
    breaker.recordFailure();
    breaker.recordFailure();
    assertEquals(BulkCircuitBreaker.State.OPEN, breaker.getState());

    breaker.reset();
    assertEquals(BulkCircuitBreaker.State.CLOSED, breaker.getState());
    assertTrue(breaker.allowRequest());
  }

  @Test
  @DisplayName("thread safety under concurrent access")
  void threadSafety() throws InterruptedException {
    BulkCircuitBreaker concurrentBreaker = new BulkCircuitBreaker(100, 10_000, 1000);
    int threadCount = 10;
    int failuresPerThread = 15;
    CountDownLatch latch = new CountDownLatch(threadCount);
    AtomicInteger allowedCount = new AtomicInteger(0);

    ExecutorService executor = Executors.newFixedThreadPool(threadCount);
    for (int i = 0; i < threadCount; i++) {
      executor.submit(
          () -> {
            try {
              for (int j = 0; j < failuresPerThread; j++) {
                concurrentBreaker.recordFailure();
                if (concurrentBreaker.allowRequest()) {
                  allowedCount.incrementAndGet();
                }
              }
            } finally {
              latch.countDown();
            }
          });
    }

    latch.await(5, TimeUnit.SECONDS);
    executor.shutdown();

    BulkCircuitBreaker.State finalState = concurrentBreaker.getState();
    assertTrue(
        finalState == BulkCircuitBreaker.State.OPEN
            || finalState == BulkCircuitBreaker.State.CLOSED);
  }
}
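
The BulkCircuitBreaker tests above describe a three-state machine: CLOSED trips to OPEN once the failure threshold is reached inside a sliding window, OPEN moves to HALF_OPEN after the probe interval, and HALF_OPEN either closes on success or reopens on failure. Below is a minimal sketch of that state machine, not the project's implementation. `CircuitBreakerSketch` is a hypothetical name, and the injected `LongSupplier` clock is an assumption added so the transitions can be exercised without `Thread.sleep`. This sketch also lets every request through while HALF_OPEN, which the tests neither confirm nor rule out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical sketch, not the OpenMetadata BulkCircuitBreaker source.
final class CircuitBreakerSketch {
  enum State { CLOSED, OPEN, HALF_OPEN }

  private final int failureThreshold;
  private final long windowMs;
  private final long probeIntervalMs;
  private final LongSupplier clock; // assumption: injected time source in milliseconds
  private final Deque<Long> failureTimes = new ArrayDeque<>();
  private State state = State.CLOSED;
  private long openedAt;

  CircuitBreakerSketch(
      int failureThreshold, long windowMs, long probeIntervalMs, LongSupplier clock) {
    this.failureThreshold = failureThreshold;
    this.windowMs = windowMs;
    this.probeIntervalMs = probeIntervalMs;
    this.clock = clock;
  }

  /** OPEN rejects requests until the probe interval elapses, then becomes HALF_OPEN. */
  synchronized boolean allowRequest() {
    if (state == State.OPEN && clock.getAsLong() - openedAt >= probeIntervalMs) {
      state = State.HALF_OPEN;
    }
    return state != State.OPEN;
  }

  synchronized void recordFailure() {
    long now = clock.getAsLong();
    if (state == State.HALF_OPEN) {
      trip(now); // a failed probe reopens the breaker
      return;
    }
    if (state == State.OPEN) {
      return;
    }
    failureTimes.addLast(now);
    // Drop failures that have fallen out of the sliding window.
    while (!failureTimes.isEmpty() && now - failureTimes.peekFirst() > windowMs) {
      failureTimes.removeFirst();
    }
    if (failureTimes.size() >= failureThreshold) {
      trip(now);
    }
  }

  synchronized void recordSuccess() {
    if (state == State.HALF_OPEN) {
      state = State.CLOSED;
      failureTimes.clear();
    }
  }

  synchronized void reset() {
    state = State.CLOSED;
    failureTimes.clear();
  }

  synchronized State getState() {
    return state;
  }

  private void trip(long now) {
    state = State.OPEN;
    openedAt = now;
    failureTimes.clear();
  }
}
```

In production code the clock would usually be `System::currentTimeMillis`. Injecting it keeps timing-dependent tests deterministic, in contrast to the sleeps used in the test class above.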
@@ -0,0 +1,67 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

@DisplayName("EntityBatchSizeEstimator Tests")
class EntityBatchSizeEstimatorTest {

  @Test
  @DisplayName("LARGE entities get smaller batch size")
  void largeEntitiesGetSmallerBatch() {
    int base = 200;
    assertEquals(100, EntityBatchSizeEstimator.estimateBatchSize("table", base));
    assertEquals(100, EntityBatchSizeEstimator.estimateBatchSize("topic", base));
    assertEquals(100, EntityBatchSizeEstimator.estimateBatchSize("dashboard", base));
    assertEquals(100, EntityBatchSizeEstimator.estimateBatchSize("mlmodel", base));
    assertEquals(100, EntityBatchSizeEstimator.estimateBatchSize("container", base));
    assertEquals(100, EntityBatchSizeEstimator.estimateBatchSize("storedProcedure", base));
  }

  @Test
  @DisplayName("LARGE entities respect minimum batch size of 25")
  void largeEntitiesRespectMinimum() {
    assertEquals(25, EntityBatchSizeEstimator.estimateBatchSize("table", 40));
    assertEquals(25, EntityBatchSizeEstimator.estimateBatchSize("table", 10));
  }

  @Test
  @DisplayName("SMALL entities get larger batch size")
  void smallEntitiesGetLargerBatch() {
    int base = 200;
    assertEquals(400, EntityBatchSizeEstimator.estimateBatchSize("user", base));
    assertEquals(400, EntityBatchSizeEstimator.estimateBatchSize("team", base));
    assertEquals(400, EntityBatchSizeEstimator.estimateBatchSize("bot", base));
    assertEquals(400, EntityBatchSizeEstimator.estimateBatchSize("role", base));
    assertEquals(400, EntityBatchSizeEstimator.estimateBatchSize("policy", base));
    assertEquals(400, EntityBatchSizeEstimator.estimateBatchSize("tag", base));
    assertEquals(400, EntityBatchSizeEstimator.estimateBatchSize("classification", base));
  }

  @Test
  @DisplayName("SMALL entities respect maximum batch size of 1000")
  void smallEntitiesRespectMaximum() {
    assertEquals(1000, EntityBatchSizeEstimator.estimateBatchSize("user", 600));
    assertEquals(1000, EntityBatchSizeEstimator.estimateBatchSize("user", 800));
  }

  @Test
  @DisplayName("MEDIUM (unknown) entities get base batch size unchanged")
  void mediumEntitiesUnchanged() {
    int base = 200;
    assertEquals(base, EntityBatchSizeEstimator.estimateBatchSize("pipeline", base));
    assertEquals(base, EntityBatchSizeEstimator.estimateBatchSize("database", base));
    assertEquals(base, EntityBatchSizeEstimator.estimateBatchSize("glossaryTerm", base));
    assertEquals(base, EntityBatchSizeEstimator.estimateBatchSize("unknownEntity", base));
  }

  @Test
  @DisplayName("handles zero and negative base batch size gracefully")
  void handlesZeroAndNegative() {
    assertEquals(0, EntityBatchSizeEstimator.estimateBatchSize("table", 0));
    assertTrue(EntityBatchSizeEstimator.estimateBatchSize("table", -1) < 0);
  }
}
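
Read together, the EntityBatchSizeEstimator assertions above imply a simple rule: halve the base batch size for large entity types with a floor of 25, double it for small types with a ceiling of 1000, and pass everything else through unchanged, including non-positive inputs. The sketch below encodes exactly that rule. It is inferred from the tests only: `BatchSizeEstimatorSketch` is a hypothetical name, and the type lists contain just the entity names the tests mention, so the real class may classify more types.

```java
import java.util.Set;

// Hypothetical sketch inferred from the test assertions, not the project source.
final class BatchSizeEstimatorSketch {
  private static final Set<String> LARGE =
      Set.of("table", "topic", "dashboard", "mlmodel", "container", "storedProcedure");
  private static final Set<String> SMALL =
      Set.of("user", "team", "bot", "role", "policy", "tag", "classification");
  private static final int MIN_LARGE_BATCH = 25;
  private static final int MAX_SMALL_BATCH = 1000;

  static int estimateBatchSize(String entityType, int baseBatchSize) {
    if (baseBatchSize <= 0) {
      return baseBatchSize; // zero and negative inputs pass through, as the tests expect
    }
    if (LARGE.contains(entityType)) {
      return Math.max(MIN_LARGE_BATCH, baseBatchSize / 2);
    }
    if (SMALL.contains(entityType)) {
      // Widen to long before doubling so very large bases cannot overflow.
      return (int) Math.min(MAX_SMALL_BATCH, 2L * baseBatchSize);
    }
    return baseBatchSize;
  }
}
```

The floor and ceiling keep the adjustment bounded: a small configured batch size still produces batches of at least 25 large entities, and doubling never exceeds 1000 small ones.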
@@ -0,0 +1,151 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

@DisplayName("EntityPriority Tests")
class EntityPriorityTest {

  @Test
  @DisplayName("Services sort before users/teams")
  void servicesSortBeforeUsers() {
    Set<String> entities = Set.of("user", "databaseService", "team");
    List<String> sorted = EntityPriority.sortByPriority(entities);
    assertTrue(sorted.indexOf("databaseService") < sorted.indexOf("user"));
    assertTrue(sorted.indexOf("databaseService") < sorted.indexOf("team"));
  }

  @Test
  @DisplayName("Users/teams sort before data assets")
  void usersSortBeforeDataAssets() {
    Set<String> entities = Set.of("table", "user", "dashboard");
    List<String> sorted = EntityPriority.sortByPriority(entities);
    assertTrue(sorted.indexOf("user") < sorted.indexOf("table"));
    assertTrue(sorted.indexOf("user") < sorted.indexOf("dashboard"));
  }

  @Test
  @DisplayName("Data assets sort before governance entities")
  void dataAssetsSortBeforeGovernance() {
    Set<String> entities = Set.of("glossaryTerm", "table", "testCase");
    List<String> sorted = EntityPriority.sortByPriority(entities);
    assertTrue(sorted.indexOf("table") < sorted.indexOf("glossaryTerm"));
    assertTrue(sorted.indexOf("table") < sorted.indexOf("testCase"));
  }

  @Test
  @DisplayName("Time series entities sort last")
  void timeSeriesEntitiesSortLast() {
    Set<String> entities = Set.of("testCaseResult", "user", "table", "databaseService");
    List<String> sorted = EntityPriority.sortByPriority(entities);
    assertEquals("testCaseResult", sorted.get(sorted.size() - 1));
  }

  @Test
  @DisplayName("Unknown entity types default to LOW tier")
  void unknownEntitiesDefaultToLow() {
    assertEquals(EntityPriority.Tier.LOW, EntityPriority.getTier("someUnknownEntity"));
  }

  @Test
  @DisplayName("Empty set returns empty list")
  void emptySetReturnsEmptyList() {
    List<String> sorted = EntityPriority.sortByPriority(Set.of());
    assertTrue(sorted.isEmpty());
  }

  @Test
  @DisplayName("Single entity returns single-element list")
  void singleEntityReturnsSingleElement() {
    List<String> sorted = EntityPriority.sortByPriority(Set.of("table"));
    assertEquals(1, sorted.size());
    assertEquals("table", sorted.get(0));
  }

  @Test
  @DisplayName("Full priority ordering: CRITICAL > HIGH > MEDIUM > LOW > LOWEST")
  void fullPriorityOrdering() {
    Set<String> entities =
        Set.of("testCaseResult", "glossaryTerm", "table", "user", "databaseService");
    List<String> sorted = EntityPriority.sortByPriority(entities);

    int serviceIdx = sorted.indexOf("databaseService");
    int userIdx = sorted.indexOf("user");
    int tableIdx = sorted.indexOf("table");
    int glossaryIdx = sorted.indexOf("glossaryTerm");
    int tsIdx = sorted.indexOf("testCaseResult");

    assertTrue(serviceIdx < userIdx, "CRITICAL < HIGH");
    assertTrue(userIdx < tableIdx, "HIGH < MEDIUM");
    assertTrue(tableIdx < glossaryIdx, "MEDIUM < LOW");
    assertTrue(glossaryIdx < tsIdx, "LOW < LOWEST");
  }

  @Test
  @DisplayName("Entities within same tier preserve relative order from input")
  void sameTierPreservesOrder() {
    LinkedHashSet<String> entities = new LinkedHashSet<>();
    entities.add("dashboard");
    entities.add("table");
    entities.add("pipeline");
    List<String> sorted = EntityPriority.sortByPriority(entities);
    assertEquals(List.of("dashboard", "table", "pipeline"), sorted);
  }

  @Test
  @DisplayName("All service entities are CRITICAL tier")
  void allServicesAreCritical() {
    for (String svc :
        List.of(
            "databaseService",
            "messagingService",
            "dashboardService",
            "pipelineService",
            "mlmodelService",
            "storageService",
            "searchService",
            "apiService",
            "metadataService")) {
      assertEquals(EntityPriority.Tier.CRITICAL, EntityPriority.getTier(svc), svc);
    }
  }

  @Test
  @DisplayName("All time series entities are LOWEST tier")
  void allTimeSeriesAreLowest() {
    for (String ts :
        List.of(
            "entityReportData",
            "rawCostAnalysisReportData",
            "webAnalyticUserActivityReportData",
            "webAnalyticEntityViewReportData",
            "aggregatedCostAnalysisReportData",
            "testCaseResolutionStatus",
            "testCaseResult",
            "queryCostRecord")) {
      assertEquals(EntityPriority.Tier.LOWEST, EntityPriority.getTier(ts), ts);
    }
  }

  @Test
  @DisplayName("Numeric priority maps correctly from tiers")
  void numericPriorityMapsFromTiers() {
    assertEquals(100, EntityPriority.getNumericPriority("databaseService"));
    assertEquals(80, EntityPriority.getNumericPriority("user"));
    assertEquals(60, EntityPriority.getNumericPriority("table"));
    assertEquals(40, EntityPriority.getNumericPriority("glossaryTerm"));
    assertEquals(20, EntityPriority.getNumericPriority("testCaseResult"));
  }

  @Test
  @DisplayName("Unknown entities get LOW numeric priority")
  void unknownEntitiesGetLowNumericPriority() {
    assertEquals(40, EntityPriority.getNumericPriority("someUnknownEntity"));
  }
}
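
The EntityPriority tests above describe five tiers with numeric priorities 100, 80, 60, 40 and 20, a LOW default for unknown types, and an ordering that keeps input order within a tier. One way to get that last property is a stable sort keyed on the tier value, sketched below. `EntityPrioritySketch` is a hypothetical name, and the membership sets hold only a few of the entity names that appear in the tests; how the real class decides tier membership is not shown in this hunk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch inferred from the test assertions, not the project source.
final class EntityPrioritySketch {
  enum Tier {
    CRITICAL(100), HIGH(80), MEDIUM(60), LOW(40), LOWEST(20);

    final int numeric;

    Tier(int numeric) {
      this.numeric = numeric;
    }
  }

  // Partial membership lists, taken from names used in the tests (assumption).
  private static final Set<String> SERVICES = Set.of("databaseService", "dashboardService");
  private static final Set<String> PEOPLE = Set.of("user", "team");
  private static final Set<String> DATA_ASSETS = Set.of("table", "dashboard", "pipeline");
  private static final Set<String> TIME_SERIES = Set.of("testCaseResult", "queryCostRecord");

  static Tier getTier(String entityType) {
    if (SERVICES.contains(entityType)) return Tier.CRITICAL;
    if (PEOPLE.contains(entityType)) return Tier.HIGH;
    if (DATA_ASSETS.contains(entityType)) return Tier.MEDIUM;
    if (TIME_SERIES.contains(entityType)) return Tier.LOWEST;
    return Tier.LOW; // governance types and anything unknown
  }

  static int getNumericPriority(String entityType) {
    return getTier(entityType).numeric;
  }

  static List<String> sortByPriority(Set<String> entities) {
    List<String> sorted = new ArrayList<>(entities);
    // Collections.sort is stable, so entities in the same tier keep their input order.
    Collections.sort(
        sorted, Comparator.comparingInt((String e) -> getTier(e).numeric).reversed());
    return sorted;
  }
}
```

Input order is only meaningful for ordered sets such as `LinkedHashSet`, which is why the same-tier test above builds one explicitly instead of using `Set.of`.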
@@ -0,0 +1,108 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.openmetadata.schema.system.IndexingError;
import org.openmetadata.service.exception.SearchIndexException;

@DisplayName("EntityReader Retry Tests")
class EntityReaderRetryTest {

  @Test
  @DisplayName("isTransientError detects timeout errors")
  void detectsTimeoutErrors() {
    SearchIndexException e =
        new SearchIndexException(
            new IndexingError().withMessage("Connection timeout while reading entities"));
    assertTrue(EntityReader.isTransientError(e));
  }

  @Test
  @DisplayName("isTransientError detects connection errors")
  void detectsConnectionErrors() {
    SearchIndexException e =
        new SearchIndexException(
            new IndexingError().withMessage("java.net.ConnectException: Connection refused"));
    assertTrue(EntityReader.isTransientError(e));
  }

  @Test
  @DisplayName("isTransientError detects pool exhaustion")
  void detectsPoolExhaustion() {
    SearchIndexException e =
        new SearchIndexException(
            new IndexingError().withMessage("Pool exhausted - no connections available"));
    assertTrue(EntityReader.isTransientError(e));
  }

  @Test
  @DisplayName("isTransientError detects socket timeout")
  void detectsSocketTimeout() {
    SearchIndexException e =
        new SearchIndexException(
            new IndexingError().withMessage("java.net.SocketTimeoutException: Read timed out"));
    assertTrue(EntityReader.isTransientError(e));
  }

  @Test
  @DisplayName("isTransientError returns false for non-transient errors")
  void rejectsNonTransientErrors() {
    SearchIndexException e =
        new SearchIndexException(new IndexingError().withMessage("Entity not found: table.xyz"));
    assertFalse(EntityReader.isTransientError(e));
  }

  @Test
  @DisplayName("isTransientError returns false for null message")
  void handleNullMessage() {
    SearchIndexException e = new SearchIndexException(new IndexingError());
    assertFalse(EntityReader.isTransientError(e));
  }

  @Test
  @DisplayName("EntityReader constructor accepts custom retry configuration")
  void customRetryConfiguration() {
    java.util.concurrent.ExecutorService executor =
        java.util.concurrent.Executors.newSingleThreadExecutor();
    java.util.concurrent.atomic.AtomicBoolean stopped =
        new java.util.concurrent.atomic.AtomicBoolean(false);
    EntityReader reader = new EntityReader(executor, stopped, 5, 1000);
    assertNotNull(reader);
    executor.shutdown();
  }

  @Test
  @DisplayName("EntityReader default constructor uses default retry values")
  void defaultRetryConfiguration() {
    java.util.concurrent.ExecutorService executor =
        java.util.concurrent.Executors.newSingleThreadExecutor();
    java.util.concurrent.atomic.AtomicBoolean stopped =
        new java.util.concurrent.atomic.AtomicBoolean(false);
    EntityReader reader = new EntityReader(executor, stopped);
    assertNotNull(reader);
    executor.shutdown();
  }

  @Test
  @DisplayName("VectorCompletionResult.success creates completed result")
  void vectorCompletionSuccess() {
    VectorCompletionResult result = VectorCompletionResult.success(150);
    assertTrue(result.completed());
    assertEquals(0, result.pendingTaskCount());
    assertEquals(150, result.waitedMillis());
  }

  @Test
  @DisplayName("VectorCompletionResult.timeout creates timeout result")
  void vectorCompletionTimeout() {
    VectorCompletionResult result = VectorCompletionResult.timeout(5, 30000);
    assertFalse(result.completed());
    assertEquals(5, result.pendingTaskCount());
    assertEquals(30000, result.waitedMillis());
  }
}
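
The `isTransientError` tests above classify an error by its message text: timeouts, connection failures and pool exhaustion count as transient, while a missing entity or a null message does not. A minimal sketch of message-based classification follows. It takes a plain `String` rather than a `SearchIndexException` so that it stays self-contained. `TransientErrorSketch` is a hypothetical name and the keyword list is a guess chosen to satisfy the six test messages; the keywords `EntityReader` actually checks are not shown in this hunk.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the OpenMetadata EntityReader source.
final class TransientErrorSketch {
  // Assumed keyword list, matched case-insensitively against the error message.
  private static final List<String> TRANSIENT_MARKERS =
      List.of("timeout", "timed out", "connection", "pool exhausted");

  static boolean isTransient(String message) {
    if (message == null) {
      return false; // nothing to classify, so treat the error as permanent
    }
    String lower = message.toLowerCase(Locale.ROOT);
    for (String marker : TRANSIENT_MARKERS) {
      if (lower.contains(marker)) {
        return true;
      }
    }
    return false;
  }
}
```

Substring matching is easy to extend but coarse: any message that happens to contain "connection" is retried, so a stricter classifier might inspect the exception type or cause chain instead.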
|
||||
|
|
@ -0,0 +1,785 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
import static org.junit.jupiter.api.Assertions.assertFalse;
|
||||
import static org.junit.jupiter.api.Assertions.assertTrue;
|
||||
import static org.mockito.ArgumentMatchers.any;
|
||||
import static org.mockito.ArgumentMatchers.anyInt;
|
||||
import static org.mockito.ArgumentMatchers.anyList;
|
||||
import static org.mockito.ArgumentMatchers.anyLong;
|
||||
import static org.mockito.ArgumentMatchers.anyString;
|
||||
import static org.mockito.ArgumentMatchers.eq;
|
||||
import static org.mockito.Mockito.doAnswer;
|
||||
import static org.mockito.Mockito.doThrow;
|
||||
import static org.mockito.Mockito.lenient;
|
||||
import static org.mockito.Mockito.mock;
|
||||
import static org.mockito.Mockito.mockStatic;
|
||||
import static org.mockito.Mockito.never;
|
||||
import static org.mockito.Mockito.verify;
|
||||
import static org.mockito.Mockito.when;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Collections;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
import java.util.UUID;
|
||||
import java.util.concurrent.atomic.AtomicInteger;
|
||||
import lombok.extern.slf4j.Slf4j;
|
||||
import org.junit.jupiter.api.AfterEach;
|
||||
import org.junit.jupiter.api.BeforeEach;
|
||||
import org.junit.jupiter.api.DisplayName;
|
||||
import org.junit.jupiter.api.Nested;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.junit.jupiter.api.extension.ExtendWith;
|
||||
import org.mockito.Mock;
|
||||
import org.mockito.MockedStatic;
|
||||
import org.mockito.junit.jupiter.MockitoExtension;
|
||||
import org.mockito.junit.jupiter.MockitoSettings;
|
||||
import org.mockito.quality.Strictness;
|
||||
import org.openmetadata.schema.EntityInterface;
|
||||
import org.openmetadata.schema.system.EntityError;
|
||||
import org.openmetadata.schema.type.Paging;
|
||||
import org.openmetadata.schema.utils.ResultList;
|
||||
import org.openmetadata.search.IndexMapping;
|
||||
import org.openmetadata.service.Entity;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.DistributedSearchIndexCoordinator;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.EntityCompletionTracker;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.PartitionStatus;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.PartitionWorker;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.PartitionWorker.PartitionResult;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.SearchIndexPartition;
|
||||
import org.openmetadata.service.apps.bundles.searchIndex.distributed.ServerIdentityResolver;
|
||||
import org.openmetadata.service.jdbi3.CollectionDAO;
|
||||
import org.openmetadata.service.jdbi3.EntityRepository;
|
||||
import org.openmetadata.service.jdbi3.ListFilter;
|
||||
import org.openmetadata.service.search.DefaultRecreateHandler;
|
||||
import org.openmetadata.service.search.EntityReindexContext;
|
||||
import org.openmetadata.service.search.SearchClient;
|
||||
import org.openmetadata.service.search.SearchRepository;
|
||||
import org.openmetadata.service.util.EntityUtil.Fields;
|
||||
|
||||
@ExtendWith(MockitoExtension.class)
|
||||
@MockitoSettings(strictness = Strictness.LENIENT)
|
||||
@Slf4j
|
||||
@DisplayName("Reindex Error Scenario Integration Tests")
|
||||
class ReindexErrorScenarioIntegrationTest {
|
||||
|
||||
@Mock private DistributedSearchIndexCoordinator coordinator;
|
||||
@Mock private BulkSink bulkSink;
|
||||
@Mock private CollectionDAO collectionDAO;
|
||||
@Mock private CollectionDAO.SearchIndexServerStatsDAO statsDAO;
|
||||
@Mock private CollectionDAO.SearchIndexFailureDAO failureDAO;
|
||||
@Mock private EntityRepository<?> mockRepository;
|
||||
|
||||
private MockedStatic<Entity> entityMock;
|
||||
private MockedStatic<ServerIdentityResolver> serverIdMock;
|
||||
private List<CollectionDAO.SearchIndexFailureDAO.SearchIndexFailureRecord> capturedFailures;
|
||||
private IndexingFailureRecorder failureRecorder;
|
||||
private PartitionWorker worker;
|
||||
private UUID jobId;
|
||||
|
||||
@BeforeEach
|
||||
void setUp() {
|
||||
jobId = UUID.randomUUID();
|
||||
capturedFailures = new ArrayList<>();
|
||||
|
||||
entityMock = mockStatic(Entity.class);
|
||||
serverIdMock = mockStatic(ServerIdentityResolver.class);
|
||||
|
||||
entityMock.when(() -> Entity.getEntityRepository("table")).thenReturn(mockRepository);
|
||||
entityMock
|
||||
.when(() -> Entity.getFields(eq("table"), anyList()))
|
||||
.thenReturn(new Fields(Collections.emptySet()));
|
||||
|
||||
ServerIdentityResolver mockResolver = mock(ServerIdentityResolver.class);
|
||||
when(mockResolver.getServerId()).thenReturn("test-server");
|
||||
serverIdMock.when(ServerIdentityResolver::getInstance).thenReturn(mockResolver);
|
||||
|
||||
when(coordinator.getCollectionDAO()).thenReturn(collectionDAO);
|
||||
when(collectionDAO.searchIndexServerStatsDAO()).thenReturn(statsDAO);
|
||||
when(collectionDAO.searchIndexFailureDAO()).thenReturn(failureDAO);
|
||||
|
||||
doAnswer(
|
||||
invocation -> {
|
||||
@SuppressWarnings("unchecked")
|
||||
List<CollectionDAO.SearchIndexFailureDAO.SearchIndexFailureRecord> records =
|
||||
invocation.getArgument(0);
|
||||
capturedFailures.addAll(records);
|
||||
return null;
|
||||
})
|
||||
.when(failureDAO)
|
||||
.insertBatch(anyList());
|
||||
|
||||
lenient()
|
||||
.when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt()))
|
||||
.thenReturn("encoded-cursor");
|
||||
|
||||
lenient().when(bulkSink.flushAndAwait(anyInt())).thenReturn(true);
|
||||
lenient().when(bulkSink.getPendingVectorTaskCount()).thenReturn(0);
|
||||
|
||||
failureRecorder =
|
||||
new IndexingFailureRecorder(collectionDAO, jobId.toString(), "test-server", 100);
|
||||
|
||||
worker = new PartitionWorker(coordinator, bulkSink, 100, null, false, failureRecorder, null);
|
||||
}
|
||||
|
||||
@AfterEach
|
||||
void tearDown() {
|
||||
if (failureRecorder != null) {
|
||||
failureRecorder.close();
|
||||
}
|
||||
if (entityMock != null) {
|
||||
entityMock.close();
|
||||
}
|
||||
if (serverIdMock != null) {
|
||||
serverIdMock.close();
|
||||
}
|
||||
}
|
||||
|
||||
private SearchIndexPartition createPartition(long rangeStart, long rangeEnd) {
|
||||
return SearchIndexPartition.builder()
|
||||
.id(UUID.randomUUID())
|
||||
.jobId(jobId)
|
||||
.entityType("table")
|
||||
.partitionIndex(0)
|
||||
.rangeStart(rangeStart)
|
||||
.rangeEnd(rangeEnd)
|
||||
.estimatedCount(rangeEnd - rangeStart)
|
||||
.workUnits(rangeEnd - rangeStart)
|
||||
.priority(0)
|
||||
.status(PartitionStatus.PENDING)
|
||||
.cursor(0L)
|
||||
.processedCount(0L)
|
||||
.successCount(0L)
|
||||
.failedCount(0L)
|
||||
.claimableAt(System.currentTimeMillis())
|
||||
.build();
|
||||
}
|
||||
|
||||
  private ResultList<EntityInterface> createResultList(int count, String nextCursor) {
    List<EntityInterface> entities = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      entities.add(mock(EntityInterface.class));
    }
    Paging paging = new Paging();
    paging.setAfter(nextCursor);
    paging.setTotal(count);
    ResultList<EntityInterface> result = new ResultList<>(entities);
    result.setPaging(paging);
    return result;
  }

  private ResultList<EntityInterface> createResultListWithErrors(
      int successCount, int errorCount, String nextCursor) {
    List<EntityInterface> entities = new ArrayList<>();
    for (int i = 0; i < successCount; i++) {
      entities.add(mock(EntityInterface.class));
    }
    List<EntityError> errors = new ArrayList<>();
    for (int i = 0; i < errorCount; i++) {
      errors.add(new EntityError().withMessage("Error reading entity " + i).withEntity("eid-" + i));
    }
    Paging paging = new Paging();
    paging.setAfter(nextCursor);
    paging.setTotal(successCount + errorCount);
    ResultList<EntityInterface> result = new ResultList<>(entities);
    result.setPaging(paging);
    result.setErrors(errors);
    return result;
  }

  private void stubListAfterKeysetThrowFirst(Throwable t) {
    doThrow(t)
        .when(mockRepository)
        .listAfterKeyset(
            any(ListFilter.class), anyInt(), any(), anyInt(), eq(true), any(Fields.class));
  }

  private void stubListAfterKeysetViaAnswer(List<Object> responses) {
    AtomicInteger callIndex = new AtomicInteger(0);
    doAnswer(
            invocation -> {
              int idx = callIndex.getAndIncrement();
              if (idx < responses.size()) {
                Object resp = responses.get(idx);
                if (resp instanceof Throwable t) {
                  throw t;
                }
                return resp;
              }
              return createResultList(0, null);
            })
        .when(mockRepository)
        .listAfterKeyset(
            any(ListFilter.class), anyInt(), any(), anyInt(), eq(true), any(Fields.class));
  }

  @Nested
  @DisplayName("1. Reader Failure Tests")
  class ReaderFailureTests {

    @Test
    @DisplayName("Reader throws mid-partition — batches 1,3 OK, batch 2 throws")
    void testReaderThrowsMidPartition() {
      SearchIndexPartition partition = createPartition(0, 300);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch3 = createResultList(100, null);

      stubListAfterKeysetViaAnswer(
          List.of(batch1, new RuntimeException("DB connection lost"), batch3));

      PartitionResult result = worker.processPartition(partition);

      assertEquals(200, result.successCount());
      assertEquals(100, result.failedCount());
      assertEquals(100, result.readerFailed());
      assertFalse(result.wasStopped());
      verify(coordinator).completePartition(eq(partition.getId()), eq(200L), eq(100L));
    }

    @Test
    @DisplayName("Reader returns empty — data exhausted early")
    void testReaderReturnsEmpty() {
      SearchIndexPartition partition = createPartition(0, 200);

      ResultList<EntityInterface> batch1 = createResultList(100, null);
      stubListAfterKeysetViaAnswer(List.of(batch1));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(100, result.successCount());
      assertEquals(0, result.failedCount());
      assertFalse(result.wasStopped());
    }

    @Test
    @DisplayName("Reader throws on first batch")
    void testReaderThrowsOnFirstBatch() {
      SearchIndexPartition partition = createPartition(0, 100);

      stubListAfterKeysetThrowFirst(new RuntimeException("Table not found"));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(0, result.successCount());
      assertEquals(100, result.failedCount());
      assertEquals(100, result.readerFailed());
    }

    @Test
    @DisplayName("Reader returns ResultList with EntityErrors")
    void testReaderReturnsEntityErrors() {
      SearchIndexPartition partition = createPartition(0, 100);

      ResultList<EntityInterface> batchWithErrors = createResultListWithErrors(95, 5, null);
      stubListAfterKeysetViaAnswer(List.of(batchWithErrors));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(95, result.successCount());
      assertEquals(5, result.failedCount());

      failureRecorder.flush();
      long readerFailureCount =
          capturedFailures.stream().filter(r -> "READER".equals(r.getFailureStage())).count();
      assertEquals(5, readerFailureCount);
    }

    @Test
    @DisplayName("Reader throws on last batch only")
    void testReaderThrowsOnLastBatch() {
      SearchIndexPartition partition = createPartition(0, 300);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch2 = createResultList(100, "cursor-2");

      stubListAfterKeysetViaAnswer(List.of(batch1, batch2, new RuntimeException("Timeout")));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(200, result.successCount());
      assertEquals(100, result.failedCount());
    }
  }

  @Nested
  @DisplayName("2. Sink Failure Tests")
  class SinkFailureTests {

    @Test
    @DisplayName("Sink throws on write")
    void testSinkThrowsOnWrite() throws Exception {
      SearchIndexPartition partition = createPartition(0, 100);

      ResultList<EntityInterface> batch1 = createResultList(100, null);
      stubListAfterKeysetViaAnswer(List.of(batch1));

      doThrow(new RuntimeException("Connection reset"))
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(0, result.successCount());
      assertEquals(100, result.failedCount());
      verify(coordinator).completePartition(eq(partition.getId()), eq(0L), eq(100L));
    }

    @Test
    @DisplayName("Sink fails second batch only")
    void testSinkFailsSecondBatchOnly() throws Exception {
      SearchIndexPartition partition = createPartition(0, 300);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch2 = createResultList(100, "cursor-2");
      ResultList<EntityInterface> batch3 = createResultList(100, null);

      stubListAfterKeysetViaAnswer(List.of(batch1, batch2, batch3));

      AtomicInteger writeCallCount = new AtomicInteger(0);
      doAnswer(
              invocation -> {
                int call = writeCallCount.incrementAndGet();
                if (call == 2) {
                  throw new RuntimeException("Connection reset");
                }
                return null;
              })
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      PartitionResult result = worker.processPartition(partition);

      assertEquals(200, result.successCount());
      assertEquals(100, result.failedCount());
    }

    @Test
    @DisplayName("Sink fails all batches")
    void testSinkFailsAllBatches() throws Exception {
      SearchIndexPartition partition = createPartition(0, 300);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch2 = createResultList(100, "cursor-2");
      ResultList<EntityInterface> batch3 = createResultList(100, null);

      stubListAfterKeysetViaAnswer(List.of(batch1, batch2, batch3));

      doThrow(new RuntimeException("Connection reset"))
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      PartitionResult result = worker.processPartition(partition);

      assertEquals(0, result.successCount());
      assertEquals(300, result.failedCount());
    }
  }

  @Nested
  @DisplayName("3. Process Failure Tests")
  class ProcessFailureTests {

    @Test
    @DisplayName("Sink write throws RuntimeException — treated as SINK error")
    void testDocBuildFailureTreatedAsSink() throws Exception {
      SearchIndexPartition partition = createPartition(0, 100);

      ResultList<EntityInterface> batch1 = createResultList(100, null);
      stubListAfterKeysetViaAnswer(List.of(batch1));

      doThrow(new RuntimeException("Failed to serialize"))
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(0, result.successCount());
      assertEquals(100, result.failedCount());
      assertEquals(0, result.readerFailed());

      failureRecorder.flush();
      long sinkFailures =
          capturedFailures.stream().filter(r -> "SINK".equals(r.getFailureStage())).count();
      assertTrue(sinkFailures > 0);
    }

    @Test
    @DisplayName("Fatal exception in updatePartitionProgress — failPartition called")
    void testFatalExceptionCallsFailPartition() {
      SearchIndexPartition partition = createPartition(0, 100);

      doThrow(new NullPointerException("Unexpected null"))
          .when(coordinator)
          .updatePartitionProgress(any(SearchIndexPartition.class));

      worker.processPartition(partition);

      verify(coordinator).failPartition(eq(partition.getId()), anyString());
    }
  }

  @Nested
  @DisplayName("4. Vector Embedding Failure Tests")
  class VectorEmbeddingFailureTests {

    @Test
    @DisplayName("Vector timeout — partition completes normally with warning")
    void testVectorTimeout() {
      SearchIndexPartition partition = createPartition(0, 100);

      ResultList<EntityInterface> batch1 = createResultList(100, null);
      stubListAfterKeysetViaAnswer(List.of(batch1));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      when(bulkSink.getPendingVectorTaskCount()).thenReturn(5);
      when(bulkSink.awaitVectorCompletion(120)).thenReturn(false);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(100, result.successCount());
      assertFalse(result.wasStopped());
      verify(coordinator).completePartition(eq(partition.getId()), eq(100L), eq(0L));
    }

    @Test
    @DisplayName("Vector tasks complete normally")
    void testVectorTasksCompleteNormally() {
      SearchIndexPartition partition = createPartition(0, 100);

      ResultList<EntityInterface> batch1 = createResultList(100, null);
      stubListAfterKeysetViaAnswer(List.of(batch1));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      when(bulkSink.getPendingVectorTaskCount()).thenReturn(3).thenReturn(0);
      when(bulkSink.awaitVectorCompletion(120)).thenReturn(true);

      PartitionResult result = worker.processPartition(partition);

      assertEquals(100, result.successCount());
      verify(coordinator).completePartition(eq(partition.getId()), eq(100L), eq(0L));
    }
  }

  @Nested
  @DisplayName("5. Promotion Failure Tests")
  class PromotionFailureTests {

    private SearchClient setupPromotionMocks(SearchRepository searchRepo) {
      SearchClient searchClient = mock(SearchClient.class);
      when(searchRepo.getSearchClient()).thenReturn(searchClient);
      when(searchRepo.getClusterAlias()).thenReturn("");

      IndexMapping indexMapping =
          IndexMapping.builder()
              .indexName("table_search_index")
              .alias("table")
              .parentAliases(List.of("all"))
              .childAliases(List.of())
              .build();
      when(searchRepo.getIndexMapping("table")).thenReturn(indexMapping);

      entityMock.when(Entity::getSearchRepository).thenReturn(searchRepo);

      return searchClient;
    }

    @Test
    @DisplayName("swapAliases returns false — old indices NOT deleted")
    void testSwapAliasesReturnsFalse() {
      SearchRepository searchRepo = mock(SearchRepository.class);
      SearchClient searchClient = setupPromotionMocks(searchRepo);

      when(searchClient.listIndicesByPrefix("table_search_index"))
          .thenReturn(Set.of("table_search_index_rebuild_old", "table_search_index_rebuild_new"));
      when(searchClient.indexExists(anyString())).thenReturn(false);
      when(searchClient.swapAliases(any(), anyString(), any())).thenReturn(false);

      EntityReindexContext context =
          EntityReindexContext.builder()
              .entityType("table")
              .canonicalIndex("table_search_index")
              .stagedIndex("table_search_index_rebuild_new")
              .build();

      new DefaultRecreateHandler().promoteEntityIndex(context, true);

      verify(searchClient, never()).deleteIndexWithBackoff(eq("table_search_index_rebuild_old"));
    }

    @Test
    @DisplayName("getDocumentCount returns -1 — should promote to avoid data loss")
    void testDocCountUnknownPromotes() {
      SearchRepository searchRepo = mock(SearchRepository.class);
      SearchClient searchClient = setupPromotionMocks(searchRepo);

      when(searchClient.getDocumentCount("table_search_index_rebuild_new")).thenReturn(-1L);
      when(searchClient.listIndicesByPrefix("table_search_index")).thenReturn(Set.of());
      when(searchClient.indexExists(anyString())).thenReturn(false);
      when(searchClient.swapAliases(any(), anyString(), any())).thenReturn(true);

      EntityReindexContext context =
          EntityReindexContext.builder()
              .entityType("table")
              .canonicalIndex("table_search_index")
              .stagedIndex("table_search_index_rebuild_new")
              .build();

      new DefaultRecreateHandler().promoteEntityIndex(context, false);

      verify(searchClient).swapAliases(any(), eq("table_search_index_rebuild_new"), any());
    }

    @Test
    @DisplayName("Promotion callback throws — entity still in promotedEntities")
    void testPromotionCallbackThrowsEntityStillPromoted() {
      EntityCompletionTracker tracker = new EntityCompletionTracker(jobId);
      tracker.initializeEntity("table", 1);
      tracker.setOnEntityComplete(
          (entityType, success) -> {
            throw new RuntimeException("Promotion failed");
          });

      tracker.recordPartitionComplete("table", false);

      assertTrue(tracker.isPromoted("table"));
    }

    @Test
    @DisplayName("Zero-doc entity, reindex failed — should NOT promote")
    void testZeroDocReindexFailedNoPromotion() {
      SearchRepository searchRepo = mock(SearchRepository.class);
      SearchClient searchClient = setupPromotionMocks(searchRepo);

      when(searchClient.getDocumentCount("table_search_index_rebuild_new")).thenReturn(0L);
      when(searchClient.indexExists("table_search_index_rebuild_new")).thenReturn(true);

      EntityReindexContext context =
          EntityReindexContext.builder()
              .entityType("table")
              .canonicalIndex("table_search_index")
              .stagedIndex("table_search_index_rebuild_new")
              .build();

      new DefaultRecreateHandler().promoteEntityIndex(context, false);

      verify(searchClient, never()).swapAliases(any(), anyString(), any());
      verify(searchClient).deleteIndexWithBackoff("table_search_index_rebuild_new");
    }

    @Test
    @DisplayName("Zero-doc entity, reindex succeeded — should promote")
    void testZeroDocReindexSuccessPromotes() {
      SearchRepository searchRepo = mock(SearchRepository.class);
      SearchClient searchClient = setupPromotionMocks(searchRepo);

      when(searchClient.listIndicesByPrefix("table_search_index")).thenReturn(Set.of());
      when(searchClient.indexExists(anyString())).thenReturn(false);
      when(searchClient.swapAliases(any(), anyString(), any())).thenReturn(true);

      EntityReindexContext context =
          EntityReindexContext.builder()
              .entityType("table")
              .canonicalIndex("table_search_index")
              .stagedIndex("table_search_index_rebuild_new")
              .build();

      new DefaultRecreateHandler().promoteEntityIndex(context, true);

      verify(searchClient).swapAliases(any(), eq("table_search_index_rebuild_new"), any());
    }
  }

  @Nested
  @DisplayName("6. Mixed Failure Tests")
  class MixedFailureTests {

    @Test
    @DisplayName("Reader + sink failures in same partition")
    void testReaderAndSinkFailuresSamePartition() throws Exception {
      SearchIndexPartition partition = createPartition(0, 300);

      ResultList<EntityInterface> batch2 = createResultList(100, "cursor-2");
      ResultList<EntityInterface> batch3 = createResultList(100, null);

      stubListAfterKeysetViaAnswer(List.of(new RuntimeException("DB error"), batch2, batch3));

      AtomicInteger writeCallCount = new AtomicInteger(0);
      doAnswer(
              invocation -> {
                int call = writeCallCount.incrementAndGet();
                if (call == 2) {
                  throw new RuntimeException("Sink error");
                }
                return null;
              })
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      PartitionResult result = worker.processPartition(partition);

      assertEquals(100, result.successCount());
      assertEquals(200, result.failedCount());
      assertEquals(100, result.readerFailed());
    }

    @Test
    @DisplayName("Multiple processPartition calls have independent stats")
    void testMultipleCallsIndependentStats() {
      ResultList<EntityInterface> batch = createResultList(100, null);
      stubListAfterKeysetViaAnswer(List.of(batch));

      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      SearchIndexPartition partition1 = createPartition(0, 100);
      PartitionResult result1 = worker.processPartition(partition1);
      assertEquals(100, result1.successCount());
      assertEquals(0, result1.failedCount());

      stubListAfterKeysetThrowFirst(new RuntimeException("Failure"));
      when(mockRepository.getCursorAtOffset(any(ListFilter.class), anyInt())).thenReturn(null);

      SearchIndexPartition partition2 = createPartition(0, 100);
      PartitionResult result2 = worker.processPartition(partition2);
      assertEquals(0, result2.successCount());
      assertEquals(100, result2.failedCount());
    }
  }

  @Nested
  @DisplayName("7. Stats Accuracy Tests")
  class StatsAccuracyTests {

    @Test
    @DisplayName("Stats consistent after reader failure: success + failed == total")
    void testStatsConsistentAfterReaderFailure() {
      SearchIndexPartition partition = createPartition(0, 300);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch3 = createResultList(100, null);

      stubListAfterKeysetViaAnswer(List.of(batch1, new RuntimeException("DB error"), batch3));

      PartitionResult result = worker.processPartition(partition);

      assertEquals(300, result.successCount() + result.failedCount());
    }

    @Test
    @DisplayName("Stats consistent after sink failure: success + failed == total")
    void testStatsConsistentAfterSinkFailure() throws Exception {
      SearchIndexPartition partition = createPartition(0, 200);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch2 = createResultList(100, null);

      stubListAfterKeysetViaAnswer(List.of(batch1, batch2));

      AtomicInteger writeCallCount = new AtomicInteger(0);
      doAnswer(
              invocation -> {
                int call = writeCallCount.incrementAndGet();
                if (call == 1) {
                  throw new RuntimeException("Sink error");
                }
                return null;
              })
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      PartitionResult result = worker.processPartition(partition);

      assertEquals(200, result.successCount() + result.failedCount());
    }

    @Test
    @DisplayName("Stats consistent after mixed failures: success + failed == total")
    void testStatsConsistentAfterMixedFailures() throws Exception {
      SearchIndexPartition partition = createPartition(0, 500);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch3 = createResultList(100, "cursor-3");
      ResultList<EntityInterface> batch4 = createResultList(100, "cursor-4");
      ResultList<EntityInterface> batch5 = createResultList(100, null);

      stubListAfterKeysetViaAnswer(
          List.of(batch1, new RuntimeException("Reader error"), batch3, batch4, batch5));

      AtomicInteger writeCallCount = new AtomicInteger(0);
      doAnswer(
              invocation -> {
                int call = writeCallCount.incrementAndGet();
                if (call == 3) {
                  throw new RuntimeException("Sink error");
                }
                return null;
              })
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      PartitionResult result = worker.processPartition(partition);

      assertEquals(500, result.successCount() + result.failedCount());
    }
  }

  @Nested
  @DisplayName("8. Partition Lifecycle Tests")
  class PartitionLifecycleTests {

    @Test
    @DisplayName("Worker stopped mid-partition — wasStopped=true, completePartition NOT called")
    void testWorkerStoppedMidPartition() throws Exception {
      SearchIndexPartition partition = createPartition(0, 300);

      ResultList<EntityInterface> batch1 = createResultList(100, "cursor-1");
      ResultList<EntityInterface> batch2 = createResultList(100, "cursor-2");

      stubListAfterKeysetViaAnswer(List.of(batch1, batch2));

      doAnswer(
              invocation -> {
                worker.stop();
                return null;
              })
          .when(bulkSink)
          .write(anyList(), any(Map.class));

      PartitionResult result = worker.processPartition(partition);

      assertTrue(result.wasStopped());
      verify(coordinator, never()).completePartition(any(UUID.class), anyLong(), anyLong());
    }

    @Test
    @DisplayName("Coordinator fails partition on fatal error")
    void testCoordinatorFailsPartitionOnFatalError() {
      SearchIndexPartition partition = createPartition(0, 100);

      doThrow(new NullPointerException("Unexpected"))
          .when(coordinator)
          .updatePartitionProgress(any(SearchIndexPartition.class));

      worker.processPartition(partition);

      verify(coordinator).failPartition(eq(partition.getId()), anyString());
      verify(coordinator, never()).completePartition(any(UUID.class), anyLong(), anyLong());
    }
  }
}

@@ -0,0 +1,137 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.Collections;
import java.util.Map;
import java.util.Set;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.openmetadata.schema.system.EventPublisherJob;

@DisplayName("ReindexingConfiguration Time Series Tests")
class ReindexingConfigurationTimeSeriesTest {

  @Test
  @DisplayName("getTimeSeriesStartTs returns correct timestamp for default days")
  void defaultDaysReturnsCorrectTimestamp() {
    ReindexingConfiguration config =
        ReindexingConfiguration.builder()
            .entities(Set.of("testCaseResult"))
            .timeSeriesMaxDays(15)
            .build();

    long startTs = config.getTimeSeriesStartTs("testCaseResult");
    long expectedApprox = System.currentTimeMillis() - (15 * 86_400_000L);
    assertTrue(
        Math.abs(startTs - expectedApprox) < 1000,
        "Start timestamp should be approximately 15 days ago");
  }

  @Test
  @DisplayName("getTimeSeriesStartTs uses entity-specific override when configured")
  void entitySpecificOverride() {
    ReindexingConfiguration config =
        ReindexingConfiguration.builder()
            .entities(Set.of("testCaseResult", "entityReportData"))
            .timeSeriesMaxDays(15)
            .timeSeriesEntityDays(Map.of("testCaseResult", 30))
            .build();

    long testCaseStartTs = config.getTimeSeriesStartTs("testCaseResult");
    long reportDataStartTs = config.getTimeSeriesStartTs("entityReportData");

    long expected30Days = System.currentTimeMillis() - (30 * 86_400_000L);
    long expected15Days = System.currentTimeMillis() - (15 * 86_400_000L);

    assertTrue(
        Math.abs(testCaseStartTs - expected30Days) < 1000,
        "testCaseResult should use 30-day override");
    assertTrue(
        Math.abs(reportDataStartTs - expected15Days) < 1000,
        "entityReportData should use default 15 days");
  }

  @Test
  @DisplayName("getTimeSeriesStartTs returns -1 when days is 0 (no filtering)")
  void zeroDaysReturnsNegativeOne() {
    ReindexingConfiguration config =
        ReindexingConfiguration.builder()
            .entities(Set.of("testCaseResult"))
            .timeSeriesMaxDays(0)
            .build();

    assertEquals(-1, config.getTimeSeriesStartTs("testCaseResult"));
  }

  @Test
  @DisplayName("getTimeSeriesStartTs returns -1 when days is -1 (no filtering)")
  void negativeDaysReturnsNegativeOne() {
    ReindexingConfiguration config =
        ReindexingConfiguration.builder()
            .entities(Set.of("testCaseResult"))
            .timeSeriesMaxDays(-1)
            .build();

    assertEquals(-1, config.getTimeSeriesStartTs("testCaseResult"));
  }

  @Test
  @DisplayName("getTimeSeriesStartTs returns -1 when entity override is 0")
  void entityOverrideZeroReturnsNegativeOne() {
    ReindexingConfiguration config =
        ReindexingConfiguration.builder()
            .entities(Set.of("testCaseResult"))
            .timeSeriesMaxDays(15)
            .timeSeriesEntityDays(Map.of("testCaseResult", 0))
            .build();

    assertEquals(-1, config.getTimeSeriesStartTs("testCaseResult"));
  }

  @Test
  @DisplayName("from(EventPublisherJob) correctly reads new fields")
  void fromJobReadsNewFields() {
    EventPublisherJob job = new EventPublisherJob();
    job.setTimeSeriesMaxDays(30);
    job.setTimeSeriesEntityDays(Map.of("testCaseResult", 60));
    job.setEntities(Set.of("table"));

    ReindexingConfiguration config = ReindexingConfiguration.from(job);

    assertEquals(30, config.timeSeriesMaxDays());
    assertEquals(Map.of("testCaseResult", 60), config.timeSeriesEntityDays());
  }

  @Test
  @DisplayName("from(EventPublisherJob) uses defaults when fields are null")
  void fromJobUsesDefaults() {
    EventPublisherJob job = new EventPublisherJob();
    job.setEntities(Set.of("table"));

    ReindexingConfiguration config = ReindexingConfiguration.from(job);

    assertEquals(0, config.timeSeriesMaxDays());
    assertEquals(Collections.emptyMap(), config.timeSeriesEntityDays());
  }

  @Test
  @DisplayName("Builder propagates new fields")
  void builderPropagatesFields() {
    Map<String, Integer> entityDays = Map.of("queryCostRecord", 7, "testCaseResult", 45);
    ReindexingConfiguration config =
        ReindexingConfiguration.builder()
            .entities(Set.of("table"))
            .timeSeriesMaxDays(20)
            .timeSeriesEntityDays(entityDays)
            .build();

    assertEquals(20, config.timeSeriesMaxDays());
    assertEquals(entityDays, config.timeSeriesEntityDays());

    long queryCostStartTs = config.getTimeSeriesStartTs("queryCostRecord");
    long expected7Days = System.currentTimeMillis() - (7 * 86_400_000L);
    assertTrue(Math.abs(queryCostStartTs - expected7Days) < 1000);
  }
}
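
The tests above pin down the observable behaviour of `getTimeSeriesStartTs`: a per-entity day count overrides the default, a positive day count yields "now minus N days" in epoch milliseconds, and a day count of zero or less yields -1 (no filtering). The following is only a minimal sketch of logic that would satisfy those tests. The class `TimeSeriesCutoffSketch`, its `startTs` method and the explicit `nowMillis` parameter are hypothetical names introduced for illustration; this is not the actual `ReindexingConfiguration` implementation, which is not shown in this diff.

```java
import java.util.Map;

/** Hypothetical illustration only; not the actual ReindexingConfiguration code. */
class TimeSeriesCutoffSketch {
  private static final long MILLIS_PER_DAY = 86_400_000L;

  /**
   * Returns the epoch-millisecond cutoff for an entity type, or -1 when no
   * time filtering applies. An entry in entityDays takes precedence over
   * defaultDays. nowMillis is passed in so the result is deterministic.
   */
  public static long startTs(
      String entityType, int defaultDays, Map<String, Integer> entityDays, long nowMillis) {
    int days = entityDays.getOrDefault(entityType, defaultDays);
    if (days <= 0) {
      return -1L; // zero or negative means "do not filter by time"
    }
    return nowMillis - days * MILLIS_PER_DAY; // int * long widens to long
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    Map<String, Integer> overrides = Map.of("testCaseResult", 30);
    // Override applies: cutoff is 30 days before now.
    System.out.println(startTs("testCaseResult", 15, overrides, now));
    // No override: cutoff falls back to the 15-day default.
    System.out.println(startTs("entityReportData", 15, overrides, now));
    // Default of 0 days disables filtering.
    System.out.println(startTs("testCaseResult", 0, Map.of(), now));
  }
}
```

Passing the clock value in explicitly is a design choice of this sketch; it avoids the one-second tolerance the tests need when the method reads `System.currentTimeMillis()` itself.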

@@ -0,0 +1,445 @@
package org.openmetadata.service.apps.bundles.searchIndex;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;
import static org.junit.jupiter.api.Assertions.assertNull;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.lang.reflect.Field;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Nested;
import org.junit.jupiter.api.Test;

@DisplayName("ReindexingMetrics Tests")
class ReindexingMetricsTest {

  private SimpleMeterRegistry meterRegistry;

  @BeforeEach
  void setUp() throws Exception {
    resetSingleton();
    meterRegistry = new SimpleMeterRegistry();
    ReindexingMetrics.initialize(meterRegistry);
  }

  private void resetSingleton() throws Exception {
    Field instanceField = ReindexingMetrics.class.getDeclaredField("instance");
    instanceField.setAccessible(true);
    instanceField.set(null, null);
  }

  @Nested
  @DisplayName("Initialization")
  class InitializationTests {

    @Test
    @DisplayName("getInstance returns non-null after initialize")
    void testGetInstanceAfterInitialize() {
      assertNotNull(ReindexingMetrics.getInstance());
    }

    @Test
    @DisplayName("getInstance returns null before initialize")
    void testGetInstanceBeforeInitialize() throws Exception {
      resetSingleton();
      assertNull(ReindexingMetrics.getInstance());
    }

    @Test
    @DisplayName("Double initialization is idempotent")
    void testDoubleInitialize() {
      ReindexingMetrics first = ReindexingMetrics.getInstance();
      SimpleMeterRegistry secondRegistry = new SimpleMeterRegistry();
      ReindexingMetrics.initialize(secondRegistry);
      assertEquals(first, ReindexingMetrics.getInstance());
    }

    @Test
    @DisplayName("All metrics are registered")
    void testAllMetricsRegistered() {
      assertNotNull(meterRegistry.find("reindexing.jobs").tag("status", "started").counter());
      assertNotNull(meterRegistry.find("reindexing.jobs").tag("status", "completed").counter());
      assertNotNull(meterRegistry.find("reindexing.jobs").tag("status", "failed").counter());
      assertNotNull(meterRegistry.find("reindexing.jobs").tag("status", "stopped").counter());

      assertNotNull(
          meterRegistry.find("reindexing.job.duration").tag("status", "completed").timer());
      assertNotNull(meterRegistry.find("reindexing.job.duration").tag("status", "failed").timer());
      assertNotNull(meterRegistry.find("reindexing.job.duration").tag("status", "stopped").timer());

      assertNotNull(meterRegistry.find("reindexing.jobs.active").gauge());

      assertNotNull(meterRegistry.find("reindexing.bulk.duration").tag("success", "true").timer());
      assertNotNull(meterRegistry.find("reindexing.bulk.duration").tag("success", "false").timer());
      assertNotNull(meterRegistry.find("reindexing.bulk.payload.size").summary());
      assertNotNull(meterRegistry.find("reindexing.sink.pending").gauge());
      assertNotNull(meterRegistry.find("reindexing.backpressure.events").counter());
    }
  }

@Nested
|
||||
@DisplayName("Job Lifecycle")
|
||||
class JobLifecycleTests {
|
||||
|
||||
@Test
|
||||
@DisplayName("recordJobStarted increments counter and active gauge")
|
||||
void testRecordJobStarted() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordJobStarted();
|
||||
|
||||
assertEquals(1, getCounterValue("reindexing.jobs", "status", "started"));
|
||||
assertEquals(1.0, getGaugeValue("reindexing.jobs.active"));
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordJobCompleted increments counter and decrements active gauge")
|
||||
void testRecordJobCompleted() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordJobStarted();
|
||||
Timer.Sample sample = metrics.startJobTimer();
|
||||
metrics.recordJobCompleted(sample);
|
||||
|
||||
assertEquals(1, getCounterValue("reindexing.jobs", "status", "completed"));
|
||||
assertEquals(0.0, getGaugeValue("reindexing.jobs.active"));
|
||||
|
||||
Timer timer =
|
||||
meterRegistry.find("reindexing.job.duration").tag("status", "completed").timer();
|
||||
assertNotNull(timer);
|
||||
assertEquals(1, timer.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordJobFailed increments counter and decrements active gauge")
|
||||
void testRecordJobFailed() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordJobStarted();
|
||||
Timer.Sample sample = metrics.startJobTimer();
|
||||
metrics.recordJobFailed(sample);
|
||||
|
||||
assertEquals(1, getCounterValue("reindexing.jobs", "status", "failed"));
|
||||
assertEquals(0.0, getGaugeValue("reindexing.jobs.active"));
|
||||
|
||||
Timer timer = meterRegistry.find("reindexing.job.duration").tag("status", "failed").timer();
|
||||
assertNotNull(timer);
|
||||
assertEquals(1, timer.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordJobStopped increments counter and decrements active gauge")
|
||||
void testRecordJobStopped() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordJobStarted();
|
||||
Timer.Sample sample = metrics.startJobTimer();
|
||||
metrics.recordJobStopped(sample);
|
||||
|
||||
assertEquals(1, getCounterValue("reindexing.jobs", "status", "stopped"));
|
||||
assertEquals(0.0, getGaugeValue("reindexing.jobs.active"));
|
||||
|
||||
Timer timer = meterRegistry.find("reindexing.job.duration").tag("status", "stopped").timer();
|
||||
assertNotNull(timer);
|
||||
assertEquals(1, timer.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordJobCompleted handles null sample gracefully")
|
||||
void testRecordJobCompletedNullSample() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordJobStarted();
|
||||
metrics.recordJobCompleted(null);
|
||||
|
||||
assertEquals(1, getCounterValue("reindexing.jobs", "status", "completed"));
|
||||
Timer timer =
|
||||
meterRegistry.find("reindexing.job.duration").tag("status", "completed").timer();
|
||||
assertNotNull(timer);
|
||||
assertEquals(0, timer.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("Multiple jobs tracked correctly")
|
||||
void testMultipleJobs() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordJobStarted();
|
||||
metrics.recordJobStarted();
|
||||
assertEquals(2.0, getGaugeValue("reindexing.jobs.active"));
|
||||
|
||||
metrics.recordJobCompleted(null);
|
||||
assertEquals(1.0, getGaugeValue("reindexing.jobs.active"));
|
||||
|
||||
metrics.recordJobFailed(null);
|
||||
assertEquals(0.0, getGaugeValue("reindexing.jobs.active"));
|
||||
|
||||
assertEquals(2, getCounterValue("reindexing.jobs", "status", "started"));
|
||||
assertEquals(1, getCounterValue("reindexing.jobs", "status", "completed"));
|
||||
assertEquals(1, getCounterValue("reindexing.jobs", "status", "failed"));
|
||||
}
|
||||
}
|
||||
|
||||
@Nested
|
||||
@DisplayName("Stage Counters")
|
||||
class StageCounterTests {
|
||||
|
||||
@Test
|
||||
@DisplayName("recordStageSuccess creates counter with correct tags")
|
||||
void testRecordStageSuccess() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordStageSuccess("reader", "table", 10);
|
||||
|
||||
Counter counter =
|
||||
meterRegistry
|
||||
.find("reindexing.stage.success")
|
||||
.tag("stage", "reader")
|
||||
.tag("entity_type", "table")
|
||||
.counter();
|
||||
assertNotNull(counter);
|
||||
assertEquals(10.0, counter.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordStageFailed creates counter with correct tags")
|
||||
void testRecordStageFailed() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordStageFailed("sink", "dashboard", 3);
|
||||
|
||||
Counter counter =
|
||||
meterRegistry
|
||||
.find("reindexing.stage.failed")
|
||||
.tag("stage", "sink")
|
||||
.tag("entity_type", "dashboard")
|
||||
.counter();
|
||||
assertNotNull(counter);
|
||||
assertEquals(3.0, counter.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordStageWarnings creates counter with correct tags")
|
||||
void testRecordStageWarnings() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordStageWarnings("reader", "pipeline", 5);
|
||||
|
||||
Counter counter =
|
||||
meterRegistry
|
||||
.find("reindexing.stage.warnings")
|
||||
.tag("stage", "reader")
|
||||
.tag("entity_type", "pipeline")
|
||||
.counter();
|
||||
assertNotNull(counter);
|
||||
assertEquals(5.0, counter.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("Stage counters accumulate across calls")
|
||||
void testStageCountersAccumulate() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordStageSuccess("process", "table", 100);
|
||||
metrics.recordStageSuccess("process", "table", 50);
|
||||
|
||||
Counter counter =
|
||||
meterRegistry
|
||||
.find("reindexing.stage.success")
|
||||
.tag("stage", "process")
|
||||
.tag("entity_type", "table")
|
||||
.counter();
|
||||
assertNotNull(counter);
|
||||
assertEquals(150.0, counter.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("Different entity types produce separate counters")
|
||||
void testDifferentEntityTypes() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordStageSuccess("reader", "table", 10);
|
||||
metrics.recordStageSuccess("reader", "topic", 20);
|
||||
|
||||
Counter tableCounter =
|
||||
meterRegistry
|
||||
.find("reindexing.stage.success")
|
||||
.tag("stage", "reader")
|
||||
.tag("entity_type", "table")
|
||||
.counter();
|
||||
Counter topicCounter =
|
||||
meterRegistry
|
||||
.find("reindexing.stage.success")
|
||||
.tag("stage", "reader")
|
||||
.tag("entity_type", "topic")
|
||||
.counter();
|
||||
assertNotNull(tableCounter);
|
||||
assertNotNull(topicCounter);
|
||||
assertEquals(10.0, tableCounter.count());
|
||||
assertEquals(20.0, topicCounter.count());
|
||||
}
|
||||
}
|
||||
|
||||
@Nested
|
||||
@DisplayName("Bulk Request Metrics")
|
||||
class BulkRequestTests {
|
||||
|
||||
@Test
|
||||
@DisplayName("Bulk request timer records successful request")
|
||||
void testBulkRequestSuccess() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
Timer.Sample sample = metrics.startBulkRequestTimer();
|
||||
metrics.recordBulkRequestCompleted(sample, true);
|
||||
|
||||
Timer timer = meterRegistry.find("reindexing.bulk.duration").tag("success", "true").timer();
|
||||
assertNotNull(timer);
|
||||
assertEquals(1, timer.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("Bulk request timer records failed request")
|
||||
void testBulkRequestFailure() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
Timer.Sample sample = metrics.startBulkRequestTimer();
|
||||
metrics.recordBulkRequestCompleted(sample, false);
|
||||
|
||||
Timer timer = meterRegistry.find("reindexing.bulk.duration").tag("success", "false").timer();
|
||||
assertNotNull(timer);
|
||||
assertEquals(1, timer.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordBulkRequestCompleted handles null sample gracefully")
|
||||
void testBulkRequestNullSample() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordBulkRequestCompleted(null, true);
|
||||
|
||||
Timer timer = meterRegistry.find("reindexing.bulk.duration").tag("success", "true").timer();
|
||||
assertNotNull(timer);
|
||||
assertEquals(0, timer.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("Payload size is recorded in distribution summary")
|
||||
void testRecordPayloadSize() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordPayloadSize(1024);
|
||||
metrics.recordPayloadSize(2048);
|
||||
|
||||
DistributionSummary summary = meterRegistry.find("reindexing.bulk.payload.size").summary();
|
||||
assertNotNull(summary);
|
||||
assertEquals(2, summary.count());
|
||||
assertEquals(3072.0, summary.totalAmount());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("Pending bulk requests gauge tracks increment and decrement")
|
||||
void testPendingBulkRequests() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.incrementPendingBulkRequests();
|
||||
metrics.incrementPendingBulkRequests();
|
||||
assertEquals(2.0, getGaugeValue("reindexing.sink.pending"));
|
||||
|
||||
metrics.decrementPendingBulkRequests();
|
||||
assertEquals(1.0, getGaugeValue("reindexing.sink.pending"));
|
||||
}
|
||||
}
|
||||
|
||||
@Nested
|
||||
@DisplayName("Backpressure")
|
||||
class BackpressureTests {
|
||||
|
||||
@Test
|
||||
@DisplayName("recordBackpressureEvent increments counter")
|
||||
void testRecordBackpressureEvent() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordBackpressureEvent();
|
||||
metrics.recordBackpressureEvent();
|
||||
metrics.recordBackpressureEvent();
|
||||
|
||||
assertEquals(3, getCounterValue("reindexing.backpressure.events"));
|
||||
}
|
||||
}
|
||||
|
||||
@Nested
|
||||
@DisplayName("Promotion Metrics")
|
||||
class PromotionTests {
|
||||
|
||||
@Test
|
||||
@DisplayName("recordPromotionSuccess creates counter with correct tags")
|
||||
void testRecordPromotionSuccess() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordPromotionSuccess("table");
|
||||
|
||||
Counter counter =
|
||||
meterRegistry
|
||||
.find("reindexing.promotion")
|
||||
.tag("entity_type", "table")
|
||||
.tag("result", "success")
|
||||
.counter();
|
||||
assertNotNull(counter);
|
||||
assertEquals(1.0, counter.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("recordPromotionFailure creates counter with correct tags")
|
||||
void testRecordPromotionFailure() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordPromotionFailure("dashboard");
|
||||
|
||||
Counter counter =
|
||||
meterRegistry
|
||||
.find("reindexing.promotion")
|
||||
.tag("entity_type", "dashboard")
|
||||
.tag("result", "failure")
|
||||
.counter();
|
||||
assertNotNull(counter);
|
||||
assertEquals(1.0, counter.count());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("Promotion success and failure tracked independently per entity type")
|
||||
void testPromotionPerEntityType() {
|
||||
ReindexingMetrics metrics = ReindexingMetrics.getInstance();
|
||||
metrics.recordPromotionSuccess("table");
|
||||
metrics.recordPromotionSuccess("table");
|
||||
metrics.recordPromotionFailure("table");
|
||||
metrics.recordPromotionSuccess("topic");
|
||||
|
||||
Counter tableSuccess =
|
||||
meterRegistry
|
||||
.find("reindexing.promotion")
|
||||
.tag("entity_type", "table")
|
||||
.tag("result", "success")
|
||||
.counter();
|
||||
Counter tableFailure =
|
||||
meterRegistry
|
||||
.find("reindexing.promotion")
|
||||
.tag("entity_type", "table")
|
||||
.tag("result", "failure")
|
||||
.counter();
|
||||
Counter topicSuccess =
|
||||
meterRegistry
|
||||
.find("reindexing.promotion")
|
||||
.tag("entity_type", "topic")
|
||||
.tag("result", "success")
|
||||
.counter();
|
||||
|
||||
assertNotNull(tableSuccess);
|
||||
assertNotNull(tableFailure);
|
||||
assertNotNull(topicSuccess);
|
||||
assertEquals(2.0, tableSuccess.count());
|
||||
assertEquals(1.0, tableFailure.count());
|
||||
assertEquals(1.0, topicSuccess.count());
|
||||
}
|
||||
}
|
||||
|
||||
private long getCounterValue(String name) {
|
||||
Counter counter = meterRegistry.find(name).counter();
|
||||
return counter != null ? (long) counter.count() : 0;
|
||||
}
|
||||
|
||||
private long getCounterValue(String name, String tagKey, String tagValue) {
|
||||
Counter counter = meterRegistry.find(name).tag(tagKey, tagValue).counter();
|
||||
return counter != null ? (long) counter.count() : 0;
|
||||
}
|
||||
|
||||
private double getGaugeValue(String name) {
|
||||
Gauge gauge = meterRegistry.find(name).gauge();
|
||||
return gauge != null ? gauge.value() : 0.0;
|
||||
}
|
||||
}
|
||||
|
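The initialization tests above rely on three behaviours: getInstance() returns null before initialize(), a second initialize() call leaves the first instance in place, and the static field can be cleared by reflection between tests. The following is a minimal standalone sketch of that pattern. MetricsSingletonSketch, registryName and resetForTest are hypothetical names used for illustration only; this is not the ReindexingMetrics implementation, which is not shown in this diff.

```java
import java.lang.reflect.Field;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a metrics singleton; not the real ReindexingMetrics class. */
final class MetricsSingletonSketch {
  private static volatile MetricsSingletonSketch instance;

  private final String registryName;
  private final AtomicLong activeJobs = new AtomicLong();

  private MetricsSingletonSketch(String registryName) {
    this.registryName = registryName;
  }

  /** First call wins; later calls leave the existing instance untouched. */
  static synchronized void initialize(String registryName) {
    if (instance == null) {
      instance = new MetricsSingletonSketch(registryName);
    }
  }

  /** Returns null until initialize has been called. */
  static MetricsSingletonSketch getInstance() {
    return instance;
  }

  String registryName() {
    return registryName;
  }

  void recordJobStarted() {
    activeJobs.incrementAndGet();
  }

  void recordJobFinished() {
    activeJobs.decrementAndGet();
  }

  long activeJobs() {
    return activeJobs.get();
  }

  /** Test-only reset: clears the static field by reflection, as resetSingleton() does above. */
  static void resetForTest() {
    try {
      Field field = MetricsSingletonSketch.class.getDeclaredField("instance");
      field.setAccessible(true);
      field.set(null, null);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("could not reset singleton", e);
    }
  }
}
```

Resetting through reflection keeps a public reset method out of the production API, at the cost of coupling the test to the field name "instance".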
|
@@ -47,8 +47,6 @@ import org.openmetadata.schema.api.services.CreateMessagingService;
|
|||
import org.openmetadata.schema.api.services.CreateMessagingService.MessagingServiceType;
|
||||
import org.openmetadata.schema.entity.app.App;
|
||||
import org.openmetadata.schema.entity.app.AppRunRecord;
|
||||
import org.openmetadata.schema.entity.app.FailureContext;
|
||||
import org.openmetadata.schema.entity.app.SuccessContext;
|
||||
import org.openmetadata.schema.entity.data.Database;
|
||||
import org.openmetadata.schema.entity.data.DatabaseSchema;
|
||||
import org.openmetadata.schema.entity.data.Table;
|
||||
|
|
@@ -57,9 +55,7 @@ import org.openmetadata.schema.entity.services.DatabaseService;
|
|||
import org.openmetadata.schema.entity.services.MessagingService;
|
||||
import org.openmetadata.schema.service.configuration.elasticsearch.ElasticSearchConfiguration;
|
||||
import org.openmetadata.schema.system.EventPublisherJob;
|
||||
import org.openmetadata.schema.system.IndexingError;
|
||||
import org.openmetadata.schema.system.Stats;
|
||||
import org.openmetadata.schema.system.StepStats;
|
||||
import org.openmetadata.schema.type.AccessDetails;
|
||||
import org.openmetadata.schema.type.Column;
|
||||
import org.openmetadata.schema.type.ColumnDataType;
|
||||
|
|
@@ -794,96 +790,42 @@ class SearchIndexAppTest extends OpenMetadataApplicationTest {
|
|||
}
|
||||
|
||||
@Test
|
||||
void testJobCompletionStatus() throws Exception {
|
||||
void testOrchestratorSendUpdatesPopulatesAppRunRecord() {
|
||||
try (MockedStatic<WebSocketManager> wsMock = mockStatic(WebSocketManager.class)) {
|
||||
wsMock.when(WebSocketManager::getInstance).thenReturn(webSocketManager);
|
||||
|
||||
searchIndexApp.init(
|
||||
new App()
|
||||
.withName("SearchIndexingApplication")
|
||||
.withAppConfiguration(JsonUtils.convertValue(testJobData, Object.class)));
|
||||
UUID appId = UUID.randomUUID();
|
||||
AppRunRecord record = new AppRunRecord();
|
||||
record.setStatus(AppRunRecord.Status.RUNNING);
|
||||
|
||||
EventPublisherJob jobData = searchIndexApp.getJobData();
|
||||
jobData.setStatus(EventPublisherJob.Status.RUNNING);
|
||||
OrchestratorContext orchCtx = mock(OrchestratorContext.class);
|
||||
when(orchCtx.getJobRecord()).thenReturn(record);
|
||||
when(orchCtx.getJobName()).thenReturn("TestJob");
|
||||
when(orchCtx.getAppConfigJson()).thenReturn(JsonUtils.pojoToJson(testJobData));
|
||||
when(orchCtx.getAppId()).thenReturn(appId);
|
||||
when(orchCtx.createProgressListener(any()))
|
||||
.thenReturn(mock(ReindexingProgressListener.class));
|
||||
when(orchCtx.createReindexingContext(false)).thenReturn(mock(ReindexingJobContext.class));
|
||||
|
||||
var method =
|
||||
SearchIndexApp.class.getDeclaredMethod(
|
||||
"sendUpdates", JobExecutionContext.class, boolean.class);
|
||||
method.setAccessible(true);
|
||||
CollectionDAO.SearchIndexFailureDAO failureDAO =
|
||||
mock(CollectionDAO.SearchIndexFailureDAO.class);
|
||||
when(collectionDAO.searchIndexFailureDAO()).thenReturn(failureDAO);
|
||||
when(failureDAO.deleteAll()).thenReturn(0);
|
||||
when(failureDAO.countByJobId(anyString())).thenReturn(0);
|
||||
|
||||
if (jobData.getStatus() == EventPublisherJob.Status.RUNNING) {
|
||||
jobData.setStatus(EventPublisherJob.Status.COMPLETED);
|
||||
method.invoke(searchIndexApp, jobExecutionContext, true);
|
||||
}
|
||||
ReindexingOrchestrator orch =
|
||||
new ReindexingOrchestrator(collectionDAO, searchRepository, orchCtx);
|
||||
|
||||
assertEquals(EventPublisherJob.Status.COMPLETED, jobData.getStatus());
|
||||
}
|
||||
}
|
||||
EventPublisherJob emptyJob =
|
||||
new EventPublisherJob()
|
||||
.withEntities(Set.of())
|
||||
.withBatchSize(100)
|
||||
.withRecreateIndex(false);
|
||||
|
||||
@Test
|
||||
void testWebSocketThrottling() throws Exception {
|
||||
try (MockedStatic<WebSocketManager> wsMock = mockStatic(WebSocketManager.class)) {
|
||||
wsMock.when(WebSocketManager::getInstance).thenReturn(webSocketManager);
|
||||
orch.run(emptyJob);
|
||||
|
||||
searchIndexApp.init(
|
||||
new App()
|
||||
.withName("SearchIndexingApplication")
|
||||
.withAppConfiguration(JsonUtils.convertValue(testJobData, Object.class)));
|
||||
|
||||
var method =
|
||||
SearchIndexApp.class.getDeclaredMethod(
|
||||
"sendUpdates", JobExecutionContext.class, boolean.class);
|
||||
method.setAccessible(true);
|
||||
|
||||
method.invoke(searchIndexApp, jobExecutionContext, false);
|
||||
method.invoke(searchIndexApp, jobExecutionContext, false);
|
||||
method.invoke(searchIndexApp, jobExecutionContext, false);
|
||||
method.invoke(searchIndexApp, jobExecutionContext, true);
|
||||
}
|
||||
}
|
||||
|
||||
@Test
|
||||
void testAppRunRecordCreation() {
|
||||
try (MockedStatic<WebSocketManager> wsMock = mockStatic(WebSocketManager.class)) {
|
||||
wsMock.when(WebSocketManager::getInstance).thenReturn(webSocketManager);
|
||||
|
||||
searchIndexApp.init(
|
||||
new App()
|
||||
.withName("SearchIndexingApplication")
|
||||
.withAppConfiguration(JsonUtils.convertValue(testJobData, Object.class)));
|
||||
|
||||
EventPublisherJob jobData = searchIndexApp.getJobData();
|
||||
|
||||
IndexingError error =
|
||||
new IndexingError()
|
||||
.withErrorSource(IndexingError.ErrorSource.SINK)
|
||||
.withMessage("Test error")
|
||||
.withFailedCount(5);
|
||||
jobData.setFailure(error);
|
||||
jobData.setStatus(EventPublisherJob.Status.ACTIVE_ERROR);
|
||||
|
||||
Stats stats =
|
||||
new Stats().withJobStats(new StepStats().withSuccessRecords(95).withFailedRecords(5));
|
||||
jobData.setStats(stats);
|
||||
|
||||
AppRunRecord mockRecord = mock(AppRunRecord.class);
|
||||
lenient().when(mockRecord.getStatus()).thenReturn(AppRunRecord.Status.FAILED);
|
||||
lenient().when(mockRecord.getFailureContext()).thenReturn(new FailureContext());
|
||||
lenient().when(mockRecord.getSuccessContext()).thenReturn(new SuccessContext());
|
||||
|
||||
try {
|
||||
var method =
|
||||
SearchIndexApp.class.getDeclaredMethod(
|
||||
"updateRecordToDbAndNotify", JobExecutionContext.class);
|
||||
method.setAccessible(true);
|
||||
method.invoke(searchIndexApp, jobExecutionContext);
|
||||
assertEquals(IndexingError.ErrorSource.SINK, jobData.getFailure().getErrorSource());
|
||||
assertEquals("Test error", jobData.getFailure().getMessage());
|
||||
assertEquals(5, jobData.getFailure().getFailedCount());
|
||||
|
||||
} catch (Exception e) {
|
||||
LOG.debug("Expected exception during partial mocking: {}", e.getMessage());
|
||||
}
|
||||
assertEquals(EventPublisherJob.Status.COMPLETED, orch.getJobData().getStatus());
|
||||
assertEquals(AppRunRecord.Status.COMPLETED.value(), record.getStatus().value());
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@@ -1319,14 +1261,9 @@ class SearchIndexAppTest extends OpenMetadataApplicationTest {
|
|||
|
||||
@Test
|
||||
void testInitializeTotalRecords() {
|
||||
App testApp =
|
||||
new App()
|
||||
.withName("SearchIndexingApplication")
|
||||
.withAppConfiguration(JsonUtils.convertValue(testJobData, Object.class));
|
||||
SearchIndexExecutor executor = new SearchIndexExecutor(collectionDAO, searchRepository);
|
||||
|
||||
searchIndexApp.init(testApp);
|
||||
|
||||
Stats stats = searchIndexApp.initializeTotalRecords(Set.of("table", "user"));
|
||||
Stats stats = executor.initializeTotalRecords(Set.of("table", "user"));
|
||||
assertNotNull(stats);
|
||||
assertNotNull(stats.getJobStats());
|
||||
assertNotNull(stats.getReaderStats());
|
||||
|
|
|
|||
|
|
@@ -66,13 +66,33 @@ class SearchIndexFailureIntegrationTest {
|
|||
|
||||
// Simulate what happens when BulkSink fails to index entities
|
||||
BulkSink.FailureCallback callback =
|
||||
(entityType, entityId, entityFqn, errorMessage) ->
|
||||
(entityType, entityId, entityFqn, errorMessage, stage) -> {
|
||||
if (stage == IndexingFailureRecorder.FailureStage.PROCESS) {
|
||||
recorder.recordProcessFailure(entityType, entityId, entityFqn, errorMessage);
|
||||
} else {
|
||||
recorder.recordSinkFailure(entityType, entityId, entityFqn, errorMessage);
|
||||
}
|
||||
};
|
||||
|
||||
// Simulate 3 sink failures
|
||||
callback.onFailure("table", "uuid-1", "db.schema.table1", "Mapping error");
|
||||
callback.onFailure("table", "uuid-2", "db.schema.table2", "Document too large");
|
||||
callback.onFailure("dashboard", "uuid-3", "service.dashboard1", "Index not found");
|
||||
callback.onFailure(
|
||||
"table",
|
||||
"uuid-1",
|
||||
"db.schema.table1",
|
||||
"Mapping error",
|
||||
IndexingFailureRecorder.FailureStage.SINK);
|
||||
callback.onFailure(
|
||||
"table",
|
||||
"uuid-2",
|
||||
"db.schema.table2",
|
||||
"Document too large",
|
||||
IndexingFailureRecorder.FailureStage.SINK);
|
||||
callback.onFailure(
|
||||
"dashboard",
|
||||
"uuid-3",
|
||||
"service.dashboard1",
|
||||
"Index not found",
|
||||
IndexingFailureRecorder.FailureStage.SINK);
|
||||
|
||||
// Flush to capture
|
||||
recorder.flush();
|
||||
|
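The hunk above widens the failure callback from four arguments to five so that one callback can route PROCESS-stage and SINK-stage failures to different recorder methods. The following is a minimal standalone sketch of that routing. FailureRoutingSketch, its nested FailureCallback interface and the in-memory Recorder are hypothetical stand-ins; the real BulkSink.FailureCallback and IndexingFailureRecorder are only partly visible in this diff and may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class FailureRoutingSketch {
  /** Hypothetical stage enum; only PROCESS and SINK appear in the diff above. */
  enum FailureStage {
    PROCESS,
    SINK
  }

  /** Hypothetical five-argument callback, shaped like the lambda in the hunk above. */
  @FunctionalInterface
  interface FailureCallback {
    void onFailure(
        String entityType,
        String entityId,
        String entityFqn,
        String errorMessage,
        FailureStage stage);
  }

  /** Hypothetical in-memory recorder that keeps one list per stage. */
  static final class Recorder {
    final List<String> processFailures = new ArrayList<>();
    final List<String> sinkFailures = new ArrayList<>();

    /** Builds a callback that files each failure under the stage that reported it. */
    FailureCallback asCallback() {
      return (entityType, entityId, entityFqn, errorMessage, stage) -> {
        String entry = entityType + ":" + entityId + ":" + errorMessage;
        if (stage == FailureStage.PROCESS) {
          processFailures.add(entry);
        } else {
          sinkFailures.add(entry);
        }
      };
    }
  }
}
```

Passing the stage as an argument keeps a single callback instance per job, so callers do not need to hold separate process and sink callbacks.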
|
|
|||
|
|
@@ -0,0 +1,196 @@
|
|||
package org.openmetadata.service.apps.bundles.searchIndex;
|
||||
|
||||
import static org.junit.jupiter.api.Assertions.assertEquals;
|
||||
import static org.junit.jupiter.api.Assertions.assertNotNull;
|
||||
import static org.junit.jupiter.api.Assertions.assertTrue;
|
||||
|
||||
import java.util.Set;
|
||||
import java.util.concurrent.CountDownLatch;
|
||||
import java.util.concurrent.ExecutorService;
|
||||
import java.util.concurrent.Executors;
|
||||
import java.util.concurrent.TimeUnit;
|
||||
import java.util.concurrent.atomic.AtomicReference;
|
||||
import org.junit.jupiter.api.DisplayName;
|
||||
import org.junit.jupiter.api.Test;
|
||||
import org.openmetadata.schema.system.EntityStats;
|
||||
import org.openmetadata.schema.system.Stats;
|
||||
import org.openmetadata.schema.system.StepStats;
|
||||
|
||||
@DisplayName("Stats Thread Safety Tests")
|
||||
class StatsThreadSafetyTest {
|
||||
|
||||
private Stats createTestStats(Set<String> entityTypes) {
|
||||
Stats stats = new Stats();
|
||||
stats.setEntityStats(new EntityStats());
|
||||
stats.setJobStats(new StepStats());
|
||||
stats.setReaderStats(new StepStats());
|
||||
stats.setSinkStats(new StepStats());
|
||||
|
||||
int total = 0;
|
||||
for (String entityType : entityTypes) {
|
||||
int entityTotal = 1000;
|
||||
total += entityTotal;
|
||||
StepStats es = new StepStats();
|
||||
es.setTotalRecords(entityTotal);
|
||||
es.setSuccessRecords(0);
|
||||
es.setFailedRecords(0);
|
||||
stats.getEntityStats().getAdditionalProperties().put(entityType, es);
|
||||
}
|
||||
stats.getJobStats().setTotalRecords(total);
|
||||
stats.getJobStats().setSuccessRecords(0);
|
||||
stats.getJobStats().setFailedRecords(0);
|
||||
stats.getReaderStats().setTotalRecords(total);
|
||||
stats.getReaderStats().setSuccessRecords(0);
|
||||
stats.getReaderStats().setFailedRecords(0);
|
||||
stats.getReaderStats().setWarningRecords(0);
|
||||
stats.getSinkStats().setTotalRecords(0);
|
||||
stats.getSinkStats().setSuccessRecords(0);
|
||||
stats.getSinkStats().setFailedRecords(0);
|
||||
return stats;
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("concurrent updateStats calls produce consistent entity and job stats")
|
||||
void concurrentUpdateStats() throws InterruptedException {
|
||||
Set<String> entities = Set.of("table", "user");
|
||||
AtomicReference<Stats> statsRef = new AtomicReference<>(createTestStats(entities));
|
||||
Stats stats = statsRef.get();
|
||||
|
||||
int threadCount = 8;
|
||||
int updatesPerThread = 100;
|
||||
CountDownLatch latch = new CountDownLatch(threadCount);
|
||||
ExecutorService executor = Executors.newFixedThreadPool(threadCount);
|
||||
|
||||
for (int t = 0; t < threadCount; t++) {
|
||||
final String entityType = (t % 2 == 0) ? "table" : "user";
|
||||
executor.submit(
|
||||
() -> {
|
||||
try {
|
||||
for (int i = 0; i < updatesPerThread; i++) {
|
||||
synchronized (statsRef) {
|
||||
StepStats es = stats.getEntityStats().getAdditionalProperties().get(entityType);
|
||||
if (es != null) {
|
||||
es.setSuccessRecords(es.getSuccessRecords() + 1);
|
||||
}
|
||||
int totalSuccess =
|
||||
stats.getEntityStats().getAdditionalProperties().values().stream()
|
||||
.mapToInt(StepStats::getSuccessRecords)
|
||||
.sum();
|
||||
stats.getJobStats().setSuccessRecords(totalSuccess);
|
||||
}
|
||||
}
|
||||
} finally {
|
||||
latch.countDown();
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
assertTrue(latch.await(10, TimeUnit.SECONDS), "Workers did not finish within 10 seconds");
|
||||
executor.shutdown();
|
||||
|
||||
assertNotNull(stats.getJobStats());
|
||||
int tableSuccess =
|
||||
stats.getEntityStats().getAdditionalProperties().get("table").getSuccessRecords();
|
||||
int userSuccess =
|
||||
stats.getEntityStats().getAdditionalProperties().get("user").getSuccessRecords();
|
||||
int jobSuccess = stats.getJobStats().getSuccessRecords();
|
||||
|
||||
assertEquals(threadCount * updatesPerThread, tableSuccess + userSuccess);
|
||||
assertEquals(tableSuccess + userSuccess, jobSuccess);
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("concurrent reader stats updates do not lose updates")
|
||||
void concurrentReaderStats() throws InterruptedException {
|
||||
Stats stats = createTestStats(Set.of("table"));
|
||||
|
||||
int threadCount = 8;
|
||||
int updatesPerThread = 100;
|
||||
CountDownLatch latch = new CountDownLatch(threadCount);
|
||||
ExecutorService executor = Executors.newFixedThreadPool(threadCount);
|
||||
|
||||
for (int t = 0; t < threadCount; t++) {
|
||||
executor.submit(
|
||||
() -> {
|
||||
try {
|
||||
for (int i = 0; i < updatesPerThread; i++) {
|
||||
synchronized (stats) {
|
||||
StepStats rs = stats.getReaderStats();
|
||||
rs.setSuccessRecords(
|
||||
(rs.getSuccessRecords() != null ? rs.getSuccessRecords() : 0) + 1);
|
||||
rs.setFailedRecords(
|
||||
(rs.getFailedRecords() != null ? rs.getFailedRecords() : 0) + 0);
|
||||
rs.setWarningRecords(
|
||||
(rs.getWarningRecords() != null ? rs.getWarningRecords() : 0) + 0);
|
||||
}
|
||||
}
|
||||
} finally {
|
||||
latch.countDown();
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
assertTrue(latch.await(10, TimeUnit.SECONDS), "Workers did not finish within 10 seconds");
|
||||
executor.shutdown();
|
||||
|
||||
int expectedSuccess = threadCount * updatesPerThread;
|
||||
assertEquals(expectedSuccess, stats.getReaderStats().getSuccessRecords());
|
||||
assertEquals(0, stats.getReaderStats().getFailedRecords());
|
||||
}
|
||||
|
||||
@Test
|
||||
@DisplayName("reconciler invariant: job total >= job success + job failed")
|
||||
void reconcilerInvariant() throws InterruptedException {
|
||||
Stats stats = createTestStats(Set.of("table", "user"));
|
||||
|
||||
int threadCount = 4;
|
||||
int updatesPerThread = 200;
|
||||
CountDownLatch latch = new CountDownLatch(threadCount);
|
||||
ExecutorService executor = Executors.newFixedThreadPool(threadCount);
|
||||
|
||||
for (int t = 0; t < threadCount; t++) {
|
||||
final int tid = t;
|
||||
executor.submit(
|
||||
() -> {
|
||||
try {
|
||||
for (int i = 0; i < updatesPerThread; i++) {
|
||||
String entityType = (tid % 2 == 0) ? "table" : "user";
|
||||
synchronized (stats) {
|
||||
StepStats es = stats.getEntityStats().getAdditionalProperties().get(entityType);
|
||||
if (es != null) {
|
||||
if (i % 10 == 0) {
|
||||
es.setFailedRecords(es.getFailedRecords() + 1);
|
||||
} else {
|
||||
es.setSuccessRecords(es.getSuccessRecords() + 1);
|
||||
}
|
||||
}
|
||||
int totalSuccess =
|
||||
stats.getEntityStats().getAdditionalProperties().values().stream()
|
||||
.mapToInt(StepStats::getSuccessRecords)
|
||||
.sum();
|
||||
int totalFailed =
|
||||
stats.getEntityStats().getAdditionalProperties().values().stream()
|
||||
.mapToInt(StepStats::getFailedRecords)
|
||||
.sum();
|
||||
stats.getJobStats().setSuccessRecords(totalSuccess);
|
||||
stats.getJobStats().setFailedRecords(totalFailed);
|
||||
}
|
||||
}
|
||||
} finally {
|
||||
latch.countDown();
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
assertTrue(latch.await(10, TimeUnit.SECONDS), "Workers did not finish within 10 seconds");
|
||||
executor.shutdown();
|
||||
|
||||
StepStats jobStats = stats.getJobStats();
|
||||
int success = jobStats.getSuccessRecords();
|
||||
int failed = jobStats.getFailedRecords();
|
||||
int total = jobStats.getTotalRecords();
|
||||
|
||||
assertTrue(total >= success + failed, "Total must be >= success + failed");
|
||||
assertEquals(threadCount * updatesPerThread, success + failed);
|
||||
}
|
||||
}
|
||||
|
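The thread-safety tests above wrap every read-modify-write on the stats objects in a synchronized block and wait on a CountDownLatch before asserting. The following is a minimal standalone sketch of the same pattern. SynchronizedCounterSketch and Counts are hypothetical names; Counts is a plain holder standing in for StepStats, not the OpenMetadata type. The sketch also checks the boolean returned by CountDownLatch.await, so a timeout is reported instead of surfacing as a wrong count.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SynchronizedCounterSketch {
  /** Hypothetical plain holder with non-atomic fields, standing in for StepStats. */
  static final class Counts {
    int success;
    int failed;
  }

  /** Runs threadCount workers; each applies updatesPerThread increments under one lock. */
  static Counts runWorkers(int threadCount, int updatesPerThread) {
    Counts counts = new Counts();
    CountDownLatch done = new CountDownLatch(threadCount);
    ExecutorService executor = Executors.newFixedThreadPool(threadCount);
    try {
      for (int t = 0; t < threadCount; t++) {
        executor.submit(
            () -> {
              try {
                for (int i = 0; i < updatesPerThread; i++) {
                  // The lock makes each increment atomic with respect to other workers.
                  synchronized (counts) {
                    if (i % 10 == 0) {
                      counts.failed++;
                    } else {
                      counts.success++;
                    }
                  }
                }
              } finally {
                done.countDown();
              }
            });
      }
      // await returns false on timeout; fail loudly rather than read partial counts.
      if (!done.await(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    } finally {
      executor.shutdown();
    }
    return counts;
  }
}
```

With 4 workers and 200 updates each, every worker records 20 failures (the iterations where i is a multiple of 10) and 180 successes, so the totals are 80 failed and 720 succeeded.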
|
@@ -104,7 +104,7 @@ class DistributedSearchIndexCoordinatorTest {
|
|||
Set<String> entities = Set.of("table", "user");
|
||||
EventPublisherJob jobConfig = new EventPublisherJob().withEntities(entities).withBatchSize(500);
|
||||
|
||||
when(partitionCalculator.getEntityCounts(entities))
|
||||
when(partitionCalculator.getEntityCounts(entities, null))
|
||||
.thenReturn(java.util.Map.of("table", 10000L, "user", 5000L));
|
||||
|
||||
SearchIndexJob job = coordinator.createJob(entities, jobConfig, "admin");
|
||||
|
|
@@ -194,7 +194,7 @@ class DistributedSearchIndexCoordinatorTest {
|
|||
.cursor(5000)
|
||||
.build());
|
||||
|
||||
when(partitionCalculator.calculatePartitions(jobId, entities)).thenReturn(mockPartitions);
|
||||
when(partitionCalculator.calculatePartitions(jobId, entities, null)).thenReturn(mockPartitions);
|
||||
|
||||
SearchIndexJob result = coordinator.initializePartitions(jobId);
|
||||
|
||||
|
|
@@ -275,7 +275,8 @@ class DistributedSearchIndexCoordinatorTest {
|
|||
// Atomic claim succeeds
|
||||
when(partitionDAO.claimNextPartitionAtomic(eq(jobId.toString()), eq(TEST_SERVER_ID), anyLong()))
|
||||
.thenReturn(1);
|
||||
when(partitionDAO.findLatestClaimedPartition(jobId.toString(), TEST_SERVER_ID))
|
||||
when(partitionDAO.findLatestClaimedPartition(
|
||||
eq(jobId.toString()), eq(TEST_SERVER_ID), anyLong()))
|
||||
.thenReturn(
|
||||
new SearchIndexPartitionRecord(
|
||||
partitionId.toString(),
|
||||
|
|
@@ -326,7 +327,7 @@ class DistributedSearchIndexCoordinatorTest {
|
|||
Optional<SearchIndexPartition> result = coordinator.claimNextPartition(jobId);
|
||||
|
||||
assertFalse(result.isPresent());
|
||||
verify(partitionDAO, never()).findLatestClaimedPartition(anyString(), anyString());
|
||||
verify(partitionDAO, never()).findLatestClaimedPartition(anyString(), anyString(), anyLong());
|
||||
}
|
||||
|
||||
@Test
|
||||
|
|
@@ -354,7 +355,8 @@ class DistributedSearchIndexCoordinatorTest {

     when(partitionDAO.claimNextPartitionAtomic(eq(jobId.toString()), eq(TEST_SERVER_ID), anyLong()))
         .thenReturn(1);
-    when(partitionDAO.findLatestClaimedPartition(jobId.toString(), TEST_SERVER_ID))
+    when(partitionDAO.findLatestClaimedPartition(
+            eq(jobId.toString()), eq(TEST_SERVER_ID), anyLong()))
         .thenReturn(
             new SearchIndexPartitionRecord(
                 partitionId.toString(),
@@ -405,7 +407,7 @@ class DistributedSearchIndexCoordinatorTest {
     // Should return empty - lost the race
     assertFalse(result.isPresent());
     // Should NOT call findLatestClaimedPartition since claim failed
-    verify(partitionDAO, never()).findLatestClaimedPartition(anyString(), anyString());
+    verify(partitionDAO, never()).findLatestClaimedPartition(anyString(), anyString(), anyLong());
   }

   @Test
@@ -418,7 +420,8 @@ class DistributedSearchIndexCoordinatorTest {

     when(partitionDAO.claimNextPartitionAtomic(eq(jobId.toString()), eq(TEST_SERVER_ID), anyLong()))
         .thenReturn(1);
-    when(partitionDAO.findLatestClaimedPartition(jobId.toString(), TEST_SERVER_ID))
+    when(partitionDAO.findLatestClaimedPartition(
+            eq(jobId.toString()), eq(TEST_SERVER_ID), anyLong()))
         .thenReturn(
             new SearchIndexPartitionRecord(
                 partitionId.toString(),
@@ -249,7 +249,7 @@ class DistributedSearchIndexIntegrationTest extends OpenMetadataApplicationTest
     assertEquals(1, claimed, "Should claim exactly one partition");

     SearchIndexPartitionRecord claimedPartition =
-        partitionDAO.findLatestClaimedPartition(jobId, "server-1");
+        partitionDAO.findLatestClaimedPartition(jobId, "server-1", now);
     assertNotNull(claimedPartition, "Should find the claimed partition");
     assertEquals("PROCESSING", claimedPartition.status());
     assertEquals("server-1", claimedPartition.assignedServer());
@@ -778,7 +778,8 @@ class DistributedSearchIndexIntegrationTest extends OpenMetadataApplicationTest
     int processedPartitions = 0;

     while (partitionDAO.claimNextPartitionAtomic(jobId, serverId, now) > 0) {
-      SearchIndexPartitionRecord claimed = partitionDAO.findLatestClaimedPartition(jobId, serverId);
+      SearchIndexPartitionRecord claimed =
+          partitionDAO.findLatestClaimedPartition(jobId, serverId, now);
       assertNotNull(claimed, "Should have a claimed partition");

       partitionDAO.updateProgress(
@@ -109,6 +109,21 @@
       "default": 10000,
       "minimum": 1000,
       "maximum": 50000
+    },
+    "timeSeriesMaxDays": {
+      "title": "Time Series Max Days",
+      "description": "Maximum age in days for time series data during reindexing. Default 0 (index all data). Set to a positive value like 15 to limit to recent data only.",
+      "type": "integer",
+      "default": 0,
+      "minimum": -1
+    },
+    "timeSeriesEntityDays": {
+      "title": "Time Series Entity Days Override",
+      "description": "Per-entity-type override for time series max days. Keys are entity type names (e.g. testCaseResult, queryCostRecord), values are number of days. Entities not listed here use the default Time Series Max Days value.",
+      "type": "object",
+      "additionalProperties": {
+        "type": "integer"
+      }
     }
   },
   "additionalProperties": false
@@ -232,6 +232,22 @@
       "default": 10000,
       "minimum": 1000,
       "maximum": 50000
+    },
+    "timeSeriesMaxDays": {
+      "title": "Time Series Max Days",
+      "description": "Maximum age in days for time series data during reindexing. Only records from the last N days will be indexed. Default 0 (index all data). Set to a positive value like 15 to limit to recent data.",
+      "type": "integer",
+      "default": 0,
+      "minimum": -1
+    },
+    "timeSeriesEntityDays": {
+      "title": "Time Series Entity Days Override",
+      "description": "Per-entity-type override for time series max days. Keys are entity type names, values are number of days. Entities not in this map use timeSeriesMaxDays as default.",
+      "type": "object",
+      "existingJavaType": "java.util.Map<String, Integer>",
+      "additionalProperties": {
+        "type": "integer"
+      }
     }
   },
   "additionalProperties": false
@@ -101,4 +101,18 @@ $$section

 Number of entities per partition for distributed indexing. Smaller values create more partitions for better distribution across servers. Range: 1000-50000.

 $$
+
+$$section
+### Time Series Max Days $(id="timeSeriesMaxDays")
+
+Maximum age in days for time series data during reindexing. Default 0 (index all data). Set to a positive value like 15 to limit to recent data only.
+
+$$
+
+$$section
+### Time Series Entity Days Override $(id="timeSeriesEntityDays")
+
+Per-entity-type override for time series max days. Keys are entity type names (e.g. testCaseResult, queryCostRecord), values are number of days. Entities not listed here use the default Time Series Max Days value.
+
+$$
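As an illustration of how the two settings described above could combine, here is a minimal TypeScript sketch. It is not taken from the OpenMetadata implementation: `TimeSeriesWindowConfig`, `effectiveMaxDays`, and `cutoffTimestamp` are hypothetical names, and the sketch assumes that any value of 0 or below means "no cutoff" (the schema allows -1, but its exact meaning is not stated in this change).

```typescript
// Hypothetical shape mirroring the two documented options.
interface TimeSeriesWindowConfig {
  timeSeriesMaxDays?: number;
  timeSeriesEntityDays?: { [key: string]: number };
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// The per-entity override wins; otherwise fall back to the global
// timeSeriesMaxDays, and to 0 (index all data) when neither is set.
function effectiveMaxDays(
  config: TimeSeriesWindowConfig,
  entityType: string
): number {
  const override = config.timeSeriesEntityDays?.[entityType];
  return override ?? config.timeSeriesMaxDays ?? 0;
}

// Oldest timestamp (epoch milliseconds) that would still be indexed,
// or undefined when everything is indexed.
// Assumption: values <= 0 disable the cutoff.
function cutoffTimestamp(
  config: TimeSeriesWindowConfig,
  entityType: string,
  nowMs: number
): number | undefined {
  const days = effectiveMaxDays(config, entityType);
  return days > 0 ? nowMs - days * MS_PER_DAY : undefined;
}

// Usage: testCaseResult keeps 30 days, every other type keeps 15.
const exampleConfig: TimeSeriesWindowConfig = {
  timeSeriesMaxDays: 15,
  timeSeriesEntityDays: { testCaseResult: 30 },
};
const exampleNow = Date.now();
console.log(cutoffTimestamp(exampleConfig, "testCaseResult", exampleNow));
console.log(cutoffTimestamp(exampleConfig, "queryCostRecord", exampleNow));
```

Resolving the override before the default keeps the lookup in one place, so a caller only needs the entity type name to obtain its window.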
@@ -1090,6 +1090,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -355,6 +355,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -72,6 +72,17 @@ export interface SearchIndexingAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Application Type
      */
@@ -336,6 +336,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -293,6 +293,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -1627,6 +1627,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -196,6 +196,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -181,6 +181,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -5545,6 +5545,17 @@ export interface CollateAIAppConfig {
      * Recreate Indexes with updated Language
      */
     searchIndexMappingLanguage?: SearchIndexMappingLanguage;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names (e.g.
+     * testCaseResult, queryCostRecord), values are number of days. Entities not listed here use
+     * the default Time Series Max Days value.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Default 0 (index all data).
+     * Set to a positive value like 15 to limit to recent data only.
+     */
+    timeSeriesMaxDays?: number;
     /**
      * Enable distributed indexing to scale reindexing across multiple servers with fault
      * tolerance and parallel processing
@@ -101,8 +101,19 @@ export interface EventPublisherJob {
     /**
      * This schema publisher run job status.
      */
-    status?: Status;
-    timestamp?: number;
+    status?: Status;
+    /**
+     * Per-entity-type override for time series max days. Keys are entity type names, values are
+     * number of days. Entities not in this map use timeSeriesMaxDays as default.
+     */
+    timeSeriesEntityDays?: { [key: string]: number };
+    /**
+     * Maximum age in days for time series data during reindexing. Only records from the last N
+     * days will be indexed. Default 0 (index all data). Set to a positive value like 15 to
+     * limit to recent data.
+     */
+    timeSeriesMaxDays?: number;
+    timestamp?: number;
     /**
      * Enable distributed indexing across multiple servers. When enabled, reindexing work is
      * partitioned and can be processed by multiple servers concurrently with crash recovery
@@ -161,6 +161,21 @@
       "default": 10000,
       "minimum": 1000,
       "maximum": 50000
+    },
+    "timeSeriesMaxDays": {
+      "title": "Time Series Max Days",
+      "description": "Maximum age in days for time series data during reindexing. Default 0 (index all data). Set to a positive value like 15 to limit to recent data only.",
+      "type": "integer",
+      "default": 0,
+      "minimum": -1
+    },
+    "timeSeriesEntityDays": {
+      "title": "Time Series Entity Days Override",
+      "description": "Per-entity-type override for time series max days. Keys are entity type names (e.g. testCaseResult, queryCostRecord), values are number of days. Entities not listed here use the default Time Series Max Days value.",
+      "type": "object",
+      "additionalProperties": {
+        "type": "integer"
+      }
     }
   },
   "additionalProperties": false