OpenMetadata/bootstrap/sql/migrations
Sriharsha Chintalapani a4998bc1c7
Continuous indexing to handle failures (#26111)
* Add Continuous Indexing

* Add continuous search indexing

* Update to 1.12.3

* Make search index retry queue reliable with stale recovery, health checks, and silent failure coverage

  - Add entityType, retryCount, claimedAt columns to search_index_retry_queue table
  - Implement stale IN_PROGRESS recovery (10min threshold, 60s sweep interval)
  - Replace static isClientAvailable flag with cached ping health check (5s TTL)
  - Narrow catch blocks in resolveById/resolveByFqn to EntityNotFoundException
  - Use entityType hint for O(1) entity resolution instead of scanning all types
  - Switch from status-string-based retry to retryCount-based (< 3 retries → PENDING, ≥ 3 → FAILED)
  - Batch cascade reindex at 200 entities instead of accumulating up to 5000
  - Add retry queue enqueue in catch blocks of createTimeSeriesEntity, updateTimeSeriesEntity,
    deleteTimeSeriesEntityById, bulkIndexPipelineExecutions, reindexAcrossIndices, and
    TestSuiteRepository.postCreate
  - Re-throw exceptions from indexTableColumns/deleteTableColumns to parent catch blocks
  - Add Micrometer counters for enqueued, processed (success/failure), and stale recovered
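
The retryCount rule and the stale IN_PROGRESS threshold listed above can be sketched as follows. This is a minimal illustration, not the code from this change: the class name `RetryQueuePolicy` and the methods `statusAfterFailure` and `isStale` are hypothetical, and whether the real comparison at exactly 10 minutes is strict is an assumption. Only the numbers (3 retries, 10 minutes) and the status names come from the commit message.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper; names are illustrative and not taken from the OpenMetadata code base.
class RetryQueuePolicy {
    // From the commit message: fewer than 3 retries -> PENDING, 3 or more -> FAILED.
    static final int MAX_RETRIES = 3;
    // From the commit message: IN_PROGRESS rows claimed more than 10 minutes ago are stale.
    static final Duration STALE_THRESHOLD = Duration.ofMinutes(10);

    // Decide the next status from the number of retries already recorded for the row.
    static String statusAfterFailure(int retryCount) {
        return retryCount < MAX_RETRIES ? "PENDING" : "FAILED";
    }

    // A claim is stale when claimedAt is set and strictly older than the threshold
    // (the strict comparison is an assumption of this sketch).
    static boolean isStale(Instant claimedAt, Instant now) {
        return claimedAt != null
            && Duration.between(claimedAt, now).compareTo(STALE_THRESHOLD) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2026-03-18T10:00:00Z");
        System.out.println(statusAfterFailure(2));                          // PENDING
        System.out.println(statusAfterFailure(3));                          // FAILED
        System.out.println(isStale(now.minus(Duration.ofMinutes(11)), now)); // true
        System.out.println(isStale(now.minus(Duration.ofMinutes(5)), now));  // false
    }
}
```

Keeping the decision on an integer counter rather than on status strings means the threshold lives in one constant, which is presumably the motivation for the switch described in the list.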

* Add missing lineage call site and add test

* Review comments

* Add resilience to search index retry worker: client availability checks, backoff, and error classification

  - Add exponential backoff when search client is unreachable so the
    worker does not burn retries during cluster outages (5s → 10s → … → 60s cap)
  - Classify errors using HTTP status codes from ES/OS exceptions:
    4xx (except 429) are non-retryable and skip straight to FAILED;
    429, 5xx, and IOException are retryable
  - Preserve first bulk failure detail in RuntimeException so error
    classification works for the bulk indexing path
  - Reorganize SearchIndexRetryWorker into clearly separated sections
    (lifecycle, main loop, record processing, entity resolution,
    reindexing, resilience, suspension, utilities)
  - Add isRetryableStatusCode utility to SearchIndexRetryQueue
  - Add integration tests: status code classification, retry exhaustion
    to FAILED, recovery from PENDING_RETRY_1, error detail preservation
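
The backoff schedule and status-code classification described above could look roughly like the sketch below. It is written under stated assumptions: the class `RetryResilience` and the method `backoffMillis` are hypothetical, the doubling between 10s and the 60s cap is assumed (the commit message elides the intermediate steps), and the signature given here for `isRetryableStatusCode` is a guess at a utility the commit only names. Treating codes below 400 as non-retryable is also an assumption.

```java
import java.io.IOException;

// Hypothetical sketch; not the SearchIndexRetryWorker or SearchIndexRetryQueue implementation.
class RetryResilience {
    static final long BASE_DELAY_MS = 5_000L;  // first delay from the commit message: 5s
    static final long MAX_DELAY_MS = 60_000L;  // cap from the commit message: 60s

    // Delay before the next availability check; doubling per consecutive failure is assumed.
    static long backoffMillis(int consecutiveFailures) {
        // Clamp the shift so the left shift cannot overflow for large failure counts.
        int shift = Math.min(Math.max(consecutiveFailures, 0), 10);
        return Math.min(BASE_DELAY_MS << shift, MAX_DELAY_MS);
    }

    // 429 and 5xx are retryable; other 4xx are not. Codes below 400 return false (assumption).
    static boolean isRetryableStatusCode(int status) {
        if (status == 429) {
            return true;
        }
        return status >= 500 && status < 600;
    }

    // Exceptions without an HTTP status: IOException is retryable per the commit message.
    static boolean isRetryable(Throwable error) {
        return error instanceof IOException;
    }

    public static void main(String[] args) {
        for (int failures = 0; failures <= 4; failures++) {
            System.out.println(failures + " -> " + backoffMillis(failures) + " ms");
        }
        System.out.println(isRetryableStatusCode(404)); // false
        System.out.println(isRetryableStatusCode(429)); // true
        System.out.println(isRetryableStatusCode(503)); // true
    }
}
```

Separating "client unreachable" (back off, do not touch retryCount) from "request rejected" (classify by status) is what keeps a cluster outage from exhausting every queued record's retries.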

* Address review comments

* Revert fqn size

* Spotless

* Address volatile review comments

* Fix failing test

* Update review comments

---------

Co-authored-by: mohitdeuex <mohit.y@deuexsolutions.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-03-18 16:23:04 +05:30
flyway MINOR - Remove flyway (#23179) 2025-10-28 09:11:03 +05:30
native Continuous indexing to handle failures (#26111) 2026-03-18 16:23:04 +05:30