OpenMetadata

Elgato_dark/OpenMetadata

mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

Commit graph

Author	SHA1	Message	Date
Sriharsha Chintalapani	b5374f9fec	Reindex robustness: selective fields, cache fail-fast, stop actually stops (#27876)	2026-05-04 13:22:15 -07:00

* Reindex robustness: selective fields, cache fail-fast, stop actually stops

Three independent fixes that all surfaced from the same incident: a 580k-
container reindex that froze for hours, then refused to actually stop when
the user clicked Stop.

Selective fields in the distributed reader path. PartitionWorker was
hardcoding List.of("*"), triggering every fieldFetcher in setFieldsInBulk —
including fetchAndSetOwns on Team/User where every owned entity becomes a
getEntityReferenceById round-trip. PR #27723 fixed this for EntityReader
(single-server) but the distributed pipeline never picked it up. Lifted the
field-resolution into ReindexingUtil so both paths share one source of
truth.
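
The idea of one shared field-resolution helper can be sketched as below. This is
a minimal illustration only: the class name, the entity types, and the field
names are all hypothetical and are not taken from ReindexingUtil. The point is
that both reader paths ask the same helper for an explicit field list instead of
passing "*".

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a shared field resolver; not the real ReindexingUtil. */
final class FieldResolutionSketch {
  // Assumed mapping from entity type to the fields the search index needs.
  // The entries here are illustrative placeholders.
  private static final Map<String, List<String>> INDEXED_FIELDS =
      Map.of(
          "team", List.of("name", "description"),
          "container", List.of("name", "description", "tags"));

  private FieldResolutionSketch() {}

  /**
   * Returns the explicit field list for an entity type. Unknown types fall
   * back to the wildcard, which is an assumption made for this sketch.
   */
  static List<String> fieldsFor(String entityType) {
    return INDEXED_FIELDS.getOrDefault(entityType, List.of("*"));
  }
}
```

Calling fieldsFor("team") from both the single-server and the distributed
reader keeps the two paths from drifting apart again.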

Cache layer no longer flaps on a single Redis hiccup. RedisCacheProvider
used to flip the whole provider unavailable on the first 300 ms timeout and
flip back on the next PING success — which combined with a 1 s health-check
made the indexer pay one timeout per cycle indefinitely. Replaced with a
sliding-window failure detector (5 failures in 30 s to trip, 3 consecutive
successes to recover) on the BulkCircuitBreaker pattern.
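
The trip and recover thresholds described above can be sketched as a small
detector. This is an illustration under stated assumptions, not the code in
RedisCacheProvider or BulkCircuitBreaker: the class and method names are
hypothetical, and timestamps are passed in explicitly so the behaviour is easy
to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window failure detector; names are illustrative. */
final class SlidingWindowFailureDetector {
  private final int failureThreshold;   // e.g. 5 failures
  private final long windowMillis;      // e.g. 30 000 ms
  private final int successesToRecover; // e.g. 3 consecutive successes
  private final Deque<Long> failureTimes = new ArrayDeque<>();
  private int consecutiveSuccesses = 0;
  private boolean available = true;

  SlidingWindowFailureDetector(int failureThreshold, long windowMillis, int successesToRecover) {
    this.failureThreshold = failureThreshold;
    this.windowMillis = windowMillis;
    this.successesToRecover = successesToRecover;
  }

  /** Records a failure; trips once enough failures fall inside the window. */
  synchronized void recordFailure(long nowMillis) {
    consecutiveSuccesses = 0;
    failureTimes.addLast(nowMillis);
    // Drop failures that have aged out of the window.
    while (!failureTimes.isEmpty() && nowMillis - failureTimes.peekFirst() > windowMillis) {
      failureTimes.removeFirst();
    }
    if (failureTimes.size() >= failureThreshold) {
      available = false;
    }
  }

  /** Records a success; only a run of consecutive successes recovers a tripped detector. */
  synchronized void recordSuccess() {
    if (available) {
      return;
    }
    consecutiveSuccesses++;
    if (consecutiveSuccesses >= successesToRecover) {
      available = true;
      consecutiveSuccesses = 0;
      failureTimes.clear();
    }
  }

  synchronized boolean isAvailable() {
    return available;
  }
}
```

With new SlidingWindowFailureDetector(5, 30_000L, 3), a single timeout leaves
the provider available, and a single successful PING after a trip does not
bring it back, which is what prevents the flapping.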

CacheWarmupApp parsed user config as EventPublisherJob (the SearchIndex
schema), which broke the Configuration page once cacheWarmupAppConfig.json
gained a type discriminator. Switched to CacheWarmupAppConfig in all four
parse sites and decoupled runtime status/stats from the parsed config.
Removed the readAppConfigFlags() workaround that read warmBundles /
enableDistributedClaim out of a raw map. The app now bails with ACTIVE_ERROR
(not COMPLETED) when an entity type is only partially warmed, and retries on
transient cache unavailability instead of giving up on the first miss.

Stop actually stops. Three pieces:
- DistributedJobStatsAggregator skips the WebSocket status broadcast while
  the job is STOPPING so it doesn't overwrite the AppRunRecord.STOPPED that
  AppScheduler.updateAndBroadcastStoppedStatus pushed. Self-stops after a
  30 s grace if the executor never gets to call stop() on it.
- DistributedSearchIndexExecutor.stop() now calls workerExecutor.shutdownNow()
  after flagging workers, so threads parked inside the bulk-sink semaphore,
  initializeKeysetCursor, or waitForSinkOperations (5-min deadline) get
  interrupted instead of grinding for minutes.
- OpenSearchBulkSink replaces concurrentRequestSemaphore.acquire() with a
  60-second tryAcquire, recording permanent failure on timeout. A leaked
  bulk future (callback never fires) can no longer permanently freeze every
  subsequent flush at a fixed record count.
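
The last two pieces rely on standard java.util.concurrent behaviour, which the
sketch below illustrates. It is not the OpenSearchBulkSink or
DistributedSearchIndexExecutor code: the class, the method names, and the
string results are hypothetical. It shows that a timed tryAcquire returns
false instead of blocking forever on a leaked permit, and that shutdownNow()
interrupts a worker parked inside that tryAcquire.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a bounded permit wait plus interrupt-on-stop. */
final class BoundedFlushSketch {
  private BoundedFlushSketch() {}

  /** Tries to take a permit; the permit is assumed to be released by the bulk callback. */
  static String flush(Semaphore permits, long timeoutMillis) {
    try {
      if (!permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
        return "FAILED_TIMEOUT"; // record a permanent failure instead of hanging
      }
      return "SUBMITTED";
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return "INTERRUPTED";
    }
  }

  /** Parks one worker on the semaphore, then stops the pool and reports what the worker saw. */
  static String stopWhileParked(Semaphore permits) throws Exception {
    ExecutorService workers = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    Future<String> result =
        workers.submit(
            () -> {
              started.countDown();
              return flush(permits, 60_000L);
            });
    started.await();       // make sure the task is running before stopping
    workers.shutdownNow(); // interrupts the worker thread
    return result.get(5, TimeUnit.SECONDS);
  }
}
```

With a Semaphore of one permit that is taken and never released, a second
flush with a short timeout returns FAILED_TIMEOUT, and stopWhileParked returns
INTERRUPTED well before the 60-second deadline.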

1 commit