* fix(rdf): reclaim Fuseki disk via compaction + upgrade Jena 4.10 → 5.6.0
PR #28117's SPARQL cleanup converged the logical RDF state but never freed
disk: TDB2 deletes only mark blocks free and the journal grows monotonically
until /$/compact runs. RdfIndexApp.clearRdfData() now calls a new
RdfStorageInterface.compactStorage() between clearAll() and reloadOntologies()
so each recreate run reclaims to a fresh dataset directory. JenaFusekiStorage
posts to /$/compact/{dataset}?deleteOld=true and polls /$/tasks/{id} until
finished, with failures logged and swallowed (disk hygiene, not correctness).
Also unifies the Jena classpath: openmetadata-service was on 4.10.0 and
openmetadata-integration-tests on 5.0.0. Both now pin to 5.6.0 via a single
root pom property, dropping the apache-jena-libs BOM in favour of explicit
jena-core/arq/rdfconnection deps (we're a remote-Fuseki client and never
embed TDB; pulling jena-tdb1/2 triggers a Jena 5/6 static-init regression).
Picks up CVE-2025-49656 and CVE-2025-50151 (admin-side fixes shipped in
Jena 5.5.0). Jena 6.x parked: both 6.0.0 and 6.1.0 hit a recursive clinit
bug where TypeMapper.reset reads RDF.dtLangString before RDF.<clinit>
completes.
Fuseki server bumped 4.10/5.0 → 5.6.0 across all in-repo Dockerfiles; the
unmaintained stain/jena-fuseki:* image references in dev compose files
switched to building from docker/rdf-store/Dockerfile, and Testcontainers
moved to secoresearch/fuseki:5.5.0 (maintained, CVE-fixed; the dataset is
created by JenaFusekiStorage.ensureDatasetExists() so the stain-only
FUSEKI_DATASET_1 env var is no longer needed).
* cache: lineage cache, per-type metrics, invalidation registry, search-cache
Add Redis-backed lineage response cache and search response cache, both
gated by the existing CACHE_PROVIDER toggle and falling through to direct
computation when the cache is unavailable. The cache remains optional —
verified end-to-end by toggling CACHE_PROVIDER=none on a live stack and
confirming all paths continue to work (just without the L2 hit).
Coverage:
- CachedLineage wraps LineageRepository.getLineage with hybrid TTL +
direct invalidation (60s default). Direct edits invalidate the affected
root cache entries; transitive changes fall through to TTL.
- CachedSearchLayer wraps /api/v1/search/query with auth-aware caching
(cache key includes principal so users with different ACLs don't share
results). 30s default TTL.
Observability:
- /api/v1/system/cache/stats response now includes a metrics block with
hits/misses/hitRatio/evictions/errors/writes plus read/write latency
Timers, and a byType breakdown so coverage gaps are visible per
entity-type and per cache-layer.
Correctness:
- New Invalidatable interface + CacheBundle registry + invalidateEntity
helper so future cache layers plug in by implementing one method
instead of editing multiple mutation paths.
- Edge mutations in LineageRepository.addLineage/deleteLineage invalidate
both endpoints; entity mutations in EntityRepository.postUpdate /
postDelete / restoreEntity invalidate the lineage rooted at the entity.
- Pub/sub handler in CacheBundle iterates registered Invalidatables so
remote-pod evictions flow to all layers automatically.
Tooling:
- docker-compose.cache-off.yml overlay flips CACHE_PROVIDER=none for
local A/B testing without tearing down DB/ES volumes.
- CachedSearchLayerIT exercises hit-on-second-call, distinct-query
misses, distinct-page-size misses, and byType shape via the metrics
endpoint. Each test gracefully no-ops when the cluster runs cache-off.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* cache: phase 2 ops + correctness — single-flight, slow-read, negative cache, admin endpoints
Builds on the phase 1 commit (c20a29b11b) with operability and correctness
items from .context/cache-improvements-design.md. All four pieces respect
the optional-cache contract: with CACHE_PROVIDER=none they no-op cleanly.
P2.3 — Single-flight on CachedSearchLayer
Striped<Lock> keyed by SHA-cache-key. 100 concurrent users hitting the
same uncached query collapse to one ES call instead of N. SearchResource
now uses loadOrCompute so the lock-and-recheck pattern lives inside the
cache layer; the supplier is the actual ES call kept tight to minimize
lock-hold time. Non-200 upstreams bypass cache and refetch.
P2.6 — Slow cache reads logged
RedisCacheProvider.get/hget timing checked against
cache.slowReadThresholdMs (default 50ms). Exceeding fires a WARN log
and bumps a new cache.reads.slow Micrometer counter exposed in
/cache/stats.metrics.slowReads. Leading indicator of Redis pressure /
network glitch / hot-key contention.
P2.4 — Negative caching for not-found entities
NotFoundCache marks "we looked, no such entity" with a short TTL
(default 30s) so repeated 404 lookups (typo'd FQNs, references to
deleted entities) don't hammer the DB. Wired into
EntityRepository.find(UUID) and findByName for the !fromCache path.
Implements Invalidatable so the postCreate fan-out drops the marker
on entity create — without that, create-then-immediately-read would
404 for up to TTL.
Added CacheBundle.invalidateEntity to EntityRepository.postCreate so
newly-created entities reach every Invalidatable registry layer.
P2.5 — Admin cache ops endpoints
GET /api/v1/system/cache/keys?pattern=... — SCAN keys, returns count
POST /api/v1/system/cache/invalidate?pattern=.. — SCAN+UNLINK, returns deleted
POST /api/v1/system/cache/invalidate/entity?type=&id=&fqn=
— fan to all Invalidatables
All admin-only. Pattern endpoints document the "no broad globs" rule —
we never want a SCAN over om:prod:* on a busy cluster. Per-entity
endpoint goes through the existing Invalidatable registry so future
cache layers are reachable from ops without ever touching this code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* cache: pipelined mget on CacheProvider + CachedReadBundle.getBatch
Adds a foundational batch-read primitive at the provider layer:
CacheProvider.mget(List<String>) -> List<Optional<String>>
Default implementation does sequential per-key gets (correct, no batching
benefit). RedisCacheProvider overrides with a true pipelined version: all
GETs are queued under setAutoFlushCommands(false), then flushed once and
awaited as a single TCP round-trip. Records hits/misses through the
existing CacheMetrics counters and respects the slow-read threshold.
Per-key pipelining over true MGET — Redis Cluster requires same-slot keys
for MGET; pipelined per-key GETs work transparently across slots without
the constraint, at the same network cost.
CachedReadBundle.getBatch(entityType, ids) consumes the new primitive
for prefetch use cases (UI prefetch on hover, list-then-detail
navigation warmup). The list endpoint hot path itself does NOT use this
layer — list responses are SQL-batched via EntityRepository.setFieldsInBulk
which calls fieldFetchers in bulk, not per-row CachedReadBundle.get.
That's why bench3 showed list endpoints at neutral cache_off-to-on
ratio: lists already amortize at the SQL layer.
The mget primitive is what later phases will plug into when wiring
batch-prefetch to specific UI flows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(cache): use unique query in sameQueryHitsCacheOnSecondCall to avoid state pollution
Sequential test run on postgres-os-redis caught a flake: the test issued
3 identical "q=*" calls expecting at least 1 cold-write. By the time it
ran, prior tests in the same JVM session had already cached
(q=*, index=table_search_index, size=10), so call 1 was a hit, call 2
hit, call 3 hit — total writes=0, asserts failed.
Switching to a per-invocation nonce ensures we always start cold,
matching the pattern the other 3 tests in this class already use.
Confirmed via subsequent parallel-pass run on the same suite where the
test passed (different test ordering, fresh cache for that key).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* cache: drop search cache TTL from 30s to 2s for create-then-search freshness
Integration tests on the postgres-os-redis profile caught a real correctness
regression: tests that create an entity and Awaitility-poll for it to appear
in search timed out at 30s because our 30s search TTL pinned the
pre-create empty result for the entire test window. Same issue surfaces
in production: a user creates a domain / table / dashboard and immediately
searches for it would see "no results" for up to 30s.
2s caps the staleness while still catching the dominant UI access pattern:
multiple components in the same render frame fire identical search queries.
Those happen within milliseconds, well inside any reasonable TTL.
The longer-term fix is search-cache invalidation on entity writes (a
generation counter per entity-type, search keys include the generation,
writes bump the generation). That's design-doc-tracked in
.context/cache-improvements-design.md but deferred — the 2s TTL is good
enough for now, and the more complete invalidation strategy can be a
follow-up PR with its own dedicated tests.
Failing tests under 30s TTL that this fixes:
- DomainAssetsColumnExclusionIT (domain create-then-search)
- LineageImpactAnalysisIT (owner removal reflected in search)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: cache-tests profile runs full IT suite + new postgres+es+redis CI workflow
The cache-tests Maven profile previously ran only the four cache/* IT
classes — too narrow to catch cache-correctness regressions in the rest
of the codebase. Expanded it to mirror the mysql-elasticsearch profile
shape: sequential + parallel failsafe executions, full **/*IT.java
inclusion, postgres + elasticsearch + redis backend, with
cacheProvider=redis system property added so every test path exercises
the cache layer.
Locally, the focused-cache-only run is preserved via
mvn verify -P cache-tests -Dit.test='**/cache/*IT'
New CI workflow integration-tests-postgres-elasticsearch-redis.yml
mirrors the structure of integration-tests-postgres-opensearch.yml:
- Same triggers (push to main, PR target, merge_group, workflow_dispatch)
- Same path filters (openmetadata-service/**, integration-tests/**, etc.)
- Same Maven cache + JDK 21 setup
- Runs `mvn verify -pl :openmetadata-integration-tests -Pcache-tests`
- Surefire-report publication with fail_on_test_failures
Result: PRs touching cache code (or any read path) get automatic CI
coverage with redis enabled. Cache-invalidation and stale-data bugs
that previously only surfaced in production now have a CI gate before
merge — same protection that mysql-elasticsearch and postgres-opensearch
provide for the no-cache code paths.
Smoke verified locally: `mvn verify -P cache-tests -Dit.test='**/cache/*IT'`
runs both sequential and parallel passes (6 tests each), all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): address PR review feedback for cache improvements
Nine review-driven fixes spanning the cache PR (#28012):
RedisCacheProvider.mget (bug):
- Restructured the auto-flush window so `setAutoFlushCommands(true)` is
in the OUTER `finally` of the entire op. The previous structure had
the restoration in an inner finally that only fired around the
awaitAll call; an exception in the queueing loop or flushCommands()
would leave the SHARED connection in auto-flush=false mode, making
every subsequent op from any caller silently buffer indefinitely.
SearchResource (bug):
- Removed the double-call on the non-cacheable response path. The
supplier now captures the upstream Response object so the outer code
can return it directly when the body isn't cacheable (non-200 or
non-String entity) — previously the caller re-invoked
searchRepository.search() on every error/non-200, doubling backend
load for failing queries.
EntityRepository negative cache (edge case):
- Hoisted the NotFoundCache fast-path OUTSIDE the `!fromCache` guard in
both `find(UUID,...)` and `findByName(...)`. Default callers go in
via `find(id, include)` which delegates with fromCache=true; the
previous gate made the fast-path unreachable for the most common
caller. Also added negative-cache population from the cached path's
ExecutionException so repeated requests for a non-existent id do
short-circuit after the first miss.
SystemResource cache endpoints (security + style):
- `/cache/keys` and `/cache/invalidate` now validate the glob pattern
via `validateCachePattern` — rejects pure wildcards or patterns with
fewer than 6 literal characters before the first wildcard. Stops a
careless or malicious admin from issuing `*` or `om:*` that would
block the Redis cluster on a large keyspace. ReDoS-safe: linear
char scan, no regex backtracking.
- `/cache/invalidate/entity` now also calls
`EntityRepository.invalidateCacheForEntity(...)` to evict the Guava
L1 caches (`CACHE_WITH_ID`, `CACHE_WITH_NAME`) and propagate via the
existing pub-sub channel — the previous code only invalidated the
`INVALIDATABLES` registry layers, leaving stale L1 entries.
- Replaced fully-qualified class names (`org.openmetadata.service.
cache.CacheMetrics`, `jakarta.ws.rs.QueryParam`, `java.util.UUID`)
with proper imports per the project style guide.
CachedLineage (edge case):
- Single-flight stripe lock now keys on the FULL cache key
`(rootId, upstreamDepth, downstreamDepth, includeDeleted)` instead
of `rootId` alone. Concurrent requests for different depths or
include-deleted flags on the same root no longer block each other.
CachedSearchLayer (doc):
- Javadoc now correctly says default TTL is 2s (was incorrectly 30s)
and explains why — see commit 41489056ff which dropped it from 30s
after IT regressions where users couldn't see their own writes for
half a minute.
CI workflow (bugs + security mitigation note):
- Removed `if: steps.cache-output.outputs.exit-code == 0` from the
`Set up JDK 21` and `Install Ubuntu dependencies` steps.
`actions/cache@v4` exposes `cache-hit`, never `exit-code`; the
expression always evaluated to false and those steps NEVER ran.
Maven was using whatever JDK shipped with the runner.
- Added explicit security note in the workflow header AND on the
`Checkout` step documenting why `pull_request_target` is intentional
and what the `safe to test` label gate accomplishes — CodeQL flags
the pattern, the label gate is the accepted mitigation that mirrors
every other integration-tests-*.yml workflow in this repo.
Verified:
- mvn compile -pl openmetadata-service → BUILD SUCCESS
- mvn test -pl openmetadata-service -Dtest=OpenMetadataAssetServletTest
→ 9/9 pass
- mvn spotless:apply ran clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): only negative-cache on real EntityNotFoundException
The previous code caught every ExecutionException / UncheckedExecutionException
from the Guava cache loader and (a) populated NotFoundCache for 30s, (b)
rethrew as EntityNotFoundException. That conflated three very different
failure modes:
1. Entity truly doesn't exist → loader throws EntityNotFoundException
2. Entity exists but is invalid → loader throws IllegalStateException
3. Transient DB / deser failure → loader throws JdbiException, IOException
Cases 2 and 3 would poison the negative cache, turning a momentary DB
hiccup or a single bad row into a sustained 30s 404 for every caller that
asks for that id/fqn. Worse, the original cause was masked behind a
synthetic EntityNotFoundException, so logs and clients never saw the real
failure.
This change inspects e.getCause() and:
- On EntityNotFoundException: populate NotFoundCache, rethrow the
original (not a synthetic) so the caller's `instanceof` checks and
message text still work.
- On any other RuntimeException: rethrow unchanged — DB blips return
5xx as before, validation errors surface, and the next request can
re-attempt without hitting a poisoned cache.
- On checked Throwable cause (rare for these loaders): wrap in
RuntimeException so the contract is preserved.
Applied symmetrically to find(UUID, …) and findByName(String, …).
Addresses gitar-bot review on PR #28012:
https://github.com/open-metadata/OpenMetadata/pull/28012#discussion_r... (negative cache poisoning)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): copilot review — blank param, javadoc, mget hardening
Four review comments from PR #28012 review 4266159401:
SystemResource.invalidateCacheForEntity (line 1069 → blank query params):
`?type=X&id=&fqn=` slipped past the required-params check because only
`null` was treated as absent. Normalize blank id/fqn to null up front
so the missing-both branch fires correctly and the downstream
CacheBundle / EntityRepository calls receive a clean null instead of
an empty string.
CacheKeys.search/childrenPage (line 116 → orphaned Javadoc):
When the search() helper was added between the children-page Javadoc
and the childrenPage() method, the Javadoc got stranded above the
wrong method. Move it back so javadoc tooling generates accurate docs.
RedisCacheProvider.mget (line 610 → shared-connection auto-flush race):
setAutoFlushCommands(false) toggles state on the shared Lettuce
connection — two concurrent mgets could overlap and one caller's
commands would buffer until the other restored auto-flush, surfacing
as latency spikes / hangs on other paths sharing the connection.
Wrap the pipeline in a new instance-level ReentrantLock so only one
mget runs the auto-flush dance at a time. try/finally still restores
auto-flush unconditionally; lock release sits in an outer finally.
RedisCacheProvider.mget (line 621 → unbounded f.get() on timeout):
Previously LettuceFutures.awaitAll(...) returned a boolean we ignored;
if it timed out, the subsequent f.get() calls were unbounded and would
block the request thread until the Lettuce event loop eventually gave
up. Capture the boolean, cancel non-done futures on timeout (so f.get()
returns CancellationException instead of blocking), and log a warning
with the timeout value and key count for operators.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): mget partial timeout must trip the circuit breaker
The previous mget rewrite cancelled in-flight futures on awaitAll timeout
but still called recordSuccess() at the end of the happy-path. That fed
consecutiveSuccesses on every partial timeout, so a Redis instance that
was consistently slow (answering some keys, dropping others) would
*never* trip the breaker — masking real backend degradation as healthy.
Branch on the captured allCompleted boolean:
- all futures completed → recordSuccess() as before
- partial timeout → recordFailure(TimeoutException) and bump
CacheMetrics.recordError() so the breaker's sliding-window failure
detector picks it up and the metric reflects the degraded state
No other behaviour change — the per-key fallback Optionals still surface
to callers either way.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): mget shorter critical section + cache/stats + cache/keys doc
Three review comments from PR #28012 second copilot pass:
RedisCacheProvider.mget (RedisCacheProvider.java:624 — shared-connection
hold time): previous code held setAutoFlushCommands(false) for the entire
queue+flush+await window. Other paths (single get/set/hget on the same
Lettuce connection) would buffer until our await finished. Shrink the
critical section to just queue+flush: once flushCommands() returns, the
batch is on the wire and we can restore auto-flush and release the
pipelineLock before awaiting. A slow Redis now blocks only the calling
thread, not every concurrent caller using the shared connection.
Cancel-on-timeout and breaker accounting are unchanged.
SystemResource.getCacheStats (line 962 — noisy WARN when cache disabled):
CacheMetrics.getInstance() logs WARN every call when the metrics singleton
isn't initialized, which happens whenever CACHE_PROVIDER=none. An ops
dashboard polling /system/cache/stats on a cache-off deployment would
spam the log. Gate the metrics call on cacheProvider.available() so the
WARN never fires in that configuration. Stats payload still includes
provider-level fields; just no `metrics` key when cache is off.
SystemResource.scanCacheKeys (line 1006 — OpenAPI lies about count param):
Description claimed "bounded by the count parameter" but no count param
exists; scanCount() walks the full cursor. Rewrote the description to
state the actual safety mechanism: the validateCachePattern enforces a
6-character literal prefix before any wildcard, so '*' and 'om:*' are
rejected at validation. Reflects what the endpoint actually does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): copilot review pass 3 — hot-path L1 check + lineage hash + cleanups
Eight comments from the latest copilot review on PR #28012:
1. SystemResource.getCacheStats: gate metrics on cacheConfig.provider != none
instead of cacheProvider.available(). When Redis is configured but the
circuit breaker is tripped, app-level counters are exactly what an
operator needs to diagnose the outage — suppressing them while the
provider is "down but configured" hides the diagnostic signal. Also
downgrade CacheMetrics.getInstance() WARN → DEBUG so a poller loop
doesn't spam logs in the entirely-normal cache-off state.
2. CachedReadBundle.getBatch contract: the method is documented as
returning a list 1:1 with entityIds, but bypass returned
Collections.emptyList() and callers indexing by position would shift
off the rails. Return a same-size list of nulls under bypass so the
positional contract holds regardless of cache state.
3+4. CacheBundle.invalidateEntity / Invalidatable.invalidate javadocs
claimed they were called from EntityRepository.postUpdate / postDelete
/ restoreEntity. They are NOT (only postCreate, the pub-sub handler,
and the admin endpoint reach this path). Updated both javadocs to
reflect actual call sites so future Invalidatables aren't built on a
wrong invalidation contract.
5+6. EntityRepository.find / findByName: check Guava L1 (getIfPresent)
FIRST, NotFoundCache only on L1 miss. The previous shape consulted
NotFoundCache before L1, adding one Redis GET per cached read — a
regression on the hottest read path. L1 hit now serves with zero
Redis traffic; the negative cache short-circuits only when the loader
would otherwise pay for a DB / Redis-L2 round trip.
7. CachedLineage redesign: variants for one root now live as fields of a
single Redis hash (HSET / HGET) instead of separate keys. Invalidate
is one DEL — O(1) — instead of SCAN-and-iterate (O(N) over keyspace).
This matters because invalidate fires on the hot write path (entity
updates and lineage-edge mutations) and the SCAN cost grew linearly
with cache size. CacheKeys.lineageGraphPattern is gone; new helpers
are lineageGraphHash(rootId) and lineageGraphField(up, down, incDel).
8. SystemResource.invalidateCacheForEntity: when only fqn is supplied,
resolve to id server-side via Entity.getEntityRepository(type).
findByName(...) before fanning out. Id-keyed cache layers (lineage,
CACHE_WITH_ID, NotFoundCache id-side) need the UUID; the previous
shape silently skipped them. Lookup failures are logged at DEBUG and
the request still proceeds with fqn-only invalidation — admin
force-invalidate is best-effort by design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): lineage hash TTL claimed only by first writer (EXPIRE NX)
Previous shape called `hset(hashKey, fields, ttl)` which translated to
HSET + EXPIRE under the hood. Every variant write therefore reset the
hash's expiry — variant A cached at T=0 with TTL=60, variant B cached at
T=55, and A's effective lifetime jumped to 115s instead of the intended
60s. Under a constant trickle of variant writes on a hot root, the
"stale" variant could effectively live forever.
Split the operation:
- CacheProvider.hset(key, fields) — new overload, no TTL touch.
Defaults to a 365-day TTL so providers that don't override get
a long-lived key rather than an immortal one.
- CacheProvider.expireIfAbsent(key, ttl) — EXPIRE … NX semantics:
set the TTL only when the key has no prior expiry. Default
returns false (providers that can't express NX get no extension
benefit, but no regression).
- RedisCacheProvider implements both: HSET without expire, then
EXPIRE with ExpireArgs.Builder.nx(). Falls back gracefully on
Redis < 7.0 (logs at DEBUG, returns false).
CachedLineage.safeHset now uses the split shape — the first writer
to seed a hash establishes the 60s window; subsequent variant writes
leave the expiry alone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): mget unavailable-path alignment + lineage deser fallback
Two copilot review comments on PR #28012:
RedisCacheProvider.mget (line 646): when `available == false` we returned
`Collections.emptyList()`, violating the 1:1 positional contract that
callers (CachedReadBundle.getBatch and friends) rely on. Match the
error-fallback branch: return one Optional.empty() per requested key so
caller-side indexing stays aligned regardless of provider health.
Truly-empty input keeps returning empty list (no positions to align).
LineageRepository.getLineage (line 1345): unconditional readValue on the
cached JSON would throw and fail the request if Redis held a
partial/corrupted/old-schema value — turning cache corruption into a
persistent 500 until TTL expiry. Wrap the deserialize in try/catch; on
failure log WARN with the root id and depth, invalidate the affected
root's lineage hash, and fall through to a fresh computeLineage(). User
sees the same answer as cache-off; subsequent requests repopulate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): expireIfAbsent falls back to plain EXPIRE on NX failure
The previous shape returned false silently when EXPIRE … NX wasn't
supported (Redis < 7.0 syntax error, transient failure). That meant the
preceding HSET-without-ttl call could leave the lineage hash key with no
expiry at all, accumulating in Redis memory until the next manual
invalidation.
Catch the NX failure, log at DEBUG, and issue a plain EXPIRE so the key
still gets a bounded lifetime. The trade-off: on older Redis, every
variant write extends the expiry — strictly worse than the NX semantics
on a 7.0+ deployment, but vastly better than the alternative of
permanent unbounded keys.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): copilot review pass 5 — dedicated mget conn + breaker + IT isolation + key collision
Five comments from the latest copilot review on PR #28012:
RedisCacheProvider.expireIfAbsent breaker bookkeeping (line 432, gitar-bot):
the NX-fallback path issued a plain EXPIRE without recordSuccess() /
recordFailure(), so a real network blip there was invisible to the
sliding-window failure detector. Both success and failure now feed the
breaker, consistent with every other Redis-calling method in the class.
RedisCacheProvider.mget shared-connection hazard (line 692): even with
pipelineLock, single-key callers using syncCommands/asyncCommands on the
*same* connection had their commands buffered for the duration of the
auto-flush-off window. Switched to a dedicated `pipelineConnection` /
`pipelineAsyncCommands` created at init time and closed on shutdown. The
shared connection's auto-flush is never toggled now, so unrelated request
paths can't be starved by mget. pipelineLock still serializes mget vs
mget on the dedicated connection.
SystemResource.invalidateCacheForEntity fqn→id resolution (line 1113):
the resolution call used `findByName(fqn, ALL, fromCache=true)`. That
path consults NotFoundCache and the L1/L2 caches, which an admin force-
invalidate is explicitly trying to recover from — a poisoned negative
entry would short-circuit the resolution and silently skip every id-keyed
cache layer. Switched to fromCache=false so the resolution always goes
to the DB; only then can we trust the id we hand to CacheBundle /
EntityRepository invalidation.
CachedSearchLayerIT.java parallel-execution flakiness (line 50): the
test assertions depend on deltas in the *global* /system/cache/stats
counters. Under @Execution(CONCURRENT) other ITs issuing searches in
parallel inflate the counters and the deltas either don't show up (false
negative) or come from someone else's hits (false positive that masks
broken cache keying). Marked @Isolated + ExecutionMode.SAME_THREAD so
the class runs alone within its window.
CachedSearchLayer.buildKey ambiguous encoding (line 220): fields were
joined with a raw `|` delimiter, no escaping. A query string containing
`|idx=foo` would produce the same preimage as a different (principal,
index, query) tuple — cache-key collision → wrong cached response served
to the wrong user. Added length-prefixed field encoding
(`name=<utf8-bytes>:value|`); two distinct logical tuples can no longer
serialize to the same hash input.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
* perf(cache): wire Redis metrics, fix REST GET cache path, cache ReadBundle
Three changes that make the Redis cache actually earn its keep on the
hot read path:
PR1: Observability + safety
- Wire CacheMetrics into RedisCacheProvider so hits/misses/errors/latency
surface on /prometheus (recorders existed but were never called).
- Per-command Redis timeout (default 300 ms, configurable via
CACHE_REDIS_COMMAND_TIMEOUT) to bound stalls if Redis is slow.
- Pipeline the relationship-invalidate loop into a single DEL.
- Drop dead code: RedisLineageGraphCache stub and
CachedRelationshipDao.{list, batchGetRelationships}.
PR1.5: Make REST GET consult the cache at all
- EntityResource.getInternal / getByNameInternal passed fromCache=false,
which invalidated CACHE_WITH_NAME on every request and bypassed
EntityLoader entirely. Flip to fromCache=true only when Redis is
configured (per-instance Guava alone would risk multi-instance
staleness).
- Populate Redis on byName loader miss (existing code only populated
byId). Cross-instance reads now warm.
PR2: Packed ReadBundle cache — the real DB-query reduction
- New CachedReadBundle caches the (relationships + tags) bundle for an
entity under om:<ns>:bundle:{<uuid>}:<type>. Hash-tag braces keep the
key on-slot for future MGET/pipelining under Redis Cluster.
- EntityRepository.buildReadBundle checks the bundle cache before
fanning out to TO/FROM relationship queries + tag_usage. On miss,
does the existing DB work and writes the DTO.
- EntityRepository.invalidateCache deletes the bundle key.
Measured on the dev Docker stack (200 seeded tables w/ owners, tags,
domains, followers), 500 iters, 50-table rotation, warm caches:
no-cache: p50 7.33 ms p95 10.79 ms p99 13.61 ms 128 req/s
warm+redis (PR2) p50 4.11 ms p95 5.24 ms p99 6.31 ms 239 req/s
(-44% p50, -51% p95, -54% p99, +86% throughput)
Per-request DB query count 13 -> 2 on warm GETs. Bundle-cache hit rate
~85% during the run. PATCH invalidates the bundle as expected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): cross-instance cache invalidation via Redis pub/sub
Per-instance Guava caches (CACHE_WITH_ID, CACHE_WITH_NAME) diverge across
replicas when one instance writes and others keep serving stale data until
the 30 s expireAfterWrite kicks in. Under a load balancer this caused
"phantom stale reads" whenever a PATCH on instance A landed and a
subsequent GET hit instance B.
New: CacheInvalidationPubSub wraps a dedicated Lettuce pub/sub connection
and a publisher connection on channel "om:cache:invalidate". Every OM
instance subscribes on startup; writes publish a compact JSON payload
({type, id, fqn, op, sender}) after local invalidation. Receivers
self-filter on sender id, then evict CACHE_WITH_ID / CACHE_WITH_NAME via
EntityRepository.onRemoteCacheInvalidate and drop the bundle key.
Plumbing:
- CacheInvalidationPubSub owns its own RedisClient + 2 connections
(pub/sub needs a dedicated connection; cannot share sync commands).
Modeled after the existing RedisJobNotifier.
- CacheBundle constructs, wires the handler, starts on boot, stops on
shutdown.
- EntityRepository.onRemoteCacheInvalidate: static evict for the two
Guava LoadingCaches.
- EntityRepository.invalidateCache (delete path) and
EntityUpdater.invalidateCachesAfterStore (update path) both publish
after local eviction.
- Guava expireAfterWrite (30 s) stays as a lost-message backstop.
Verified with two OM instances (new docker-compose.multiserver.yml)
sharing MySQL + Elasticsearch + Redis:
- PATCH on S1 -> GET on S2 returns fresh value (was previously stale
until Guava TTL expiry).
- PATCH on S2 -> GET on S1 returns fresh value.
- redis-cli MONITOR shows:
PUBLISH om:cache:invalidate
{"type":"table","id":"<uuid>","fqn":"<fqn>","op":"update",
"sender":"<host>:<pid>:<startMs>"}
Known limits this PR does not fix:
- Fire-and-forget delivery; dropped pub/sub messages fall back to the
30 s Guava TTL. Redis Streams with consumer cursors is the upgrade
path if we see drops.
- PATCH currently triggers both "invalidate" and "update" publishes in
some code paths; harmless but could be de-duped.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): single-flight stampede protection on bundle cache
A cold bundle miss previously caused 3 DB queries per request. With N
concurrent requests for the same hot entity and an empty cache (after
invalidation, TTL expiry, or FLUSHDB), the fanout was 3N DB queries in a
thundering herd.
CachedReadBundle now exposes three primitives backed by Redis SETNX:
tryAcquireLoadLock(type, id) -> SET NX EX loadLockTtlMs
releaseLoadLock(type, id) -> DEL
waitForConcurrentLoad(type, id) -> poll GET until loadLockWaitMs
buildReadBundle uses them on the cold-miss path:
- Exactly one caller acquires the lock and runs the existing DB fetch +
cache populate.
- Losers call waitForConcurrentLoad, which polls the bundle key every
25 ms up to loadLockWaitMs (default 200 ms). On populate they read the
cached value like any cache hit. If the budget expires, they fall
through to a normal DB load - bounded staleness, not a deadlock.
- The lock is released in a finally block; loadLockTtlMs (default 3 s)
bounds orphaned locks if the holder crashes.
Verified with docker compose stack and a 25-way concurrent burst after
FLUSHDB:
Redis MONITOR during cold burst (excerpted):
SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX <-- one wins
SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX <-- others
SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX lose
SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX
...
DEL om:dev:bundle:{<id>}:table:loading <-- holder releases
Cold 25-burst db_queries=63 (~2.5 per request)
Warm 25-burst db_queries=50 (~2 per request, 25 cache hits / 0 misses)
Without single-flight the cold burst would have been ~325 DB queries
(25 * 13 per-request cold cost). Net a 5x reduction on the stampede
scenario.
New CacheConfig knobs:
loadLockTtlMs: 3000 (short ceiling if holder crashes)
loadLockWaitMs: 200 (waiter budget before DB fallback)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): rewrite warmup with bulk SQL + pipelined Redis writes
The old CacheWarmupApp took hours on even modest installs because it:
- Iterated entities via repository.find(Include.ALL) (triggers full
ReadBundle fan-out per row).
- Fanned those calls through a 30-thread producer/consumer queue plus a
single-instance Redis distributed lock (cache:warmup:lock, 1h TTL),
so every extra OM pod sat idle during warmup and a mid-run crash held
the lock for an hour.
- Issued N individual Redis writes per entity with no pipelining.
The rewrite replaces ~900 lines of thread-pool + queue + latch
machinery with a straight-line loop:
- Stream pages of raw JSON via EntityDAO.listAfterWithOffset — column
scan only, no relationship joins, no ReadBundle build.
- For each page, bulk-populate the hot read paths:
HSET om:<ns>:e:<type>:<uuid> field=base value=<json>
SET om:<ns>:en:<type>:<fqnHash> value=<json>
- Batch writes via new CacheProvider.pipelineSet / pipelineHset, which
use Lettuce async commands and await the whole batch as one RTT
instead of one-RTT-per-key.
- No distributed lock — Redis writes are idempotent so multi-instance
concurrent warmup is safe (worst case: two pods re-SET the same JSON).
Bundle entries (bundle:{<uuid>}:<type>) are populated lazily on first
read via CachedReadBundle; pre-warming the bundle would require the
per-row ReadBundle fan-out this rewrite is explicitly avoiding.
Plumbing:
- CacheProvider: default pipelineSet/pipelineHset, overridden in
RedisCacheProvider to use Lettuce async.
- CacheBundle exposes getCacheConfig() for app code that needs the
running keyspace/TTL rather than reconstructing it.
Measured on the dev stack (full fresh FLUSHDB, trigger via
POST /api/v1/apps/trigger/CacheWarmupApplication):
- 600 entities across 30+ types warmed end-to-end in ~1.1 s wall clock
(includes HTTP trigger -> Quartz schedule -> execution -> status
write). The per-entity-type phase is sub-50 ms for small types.
- 1201 Redis keys populated (600 entities x base + byName).
- Sample distribution: table=200, testConnectionDefinition=117,
type=54, dataInsightCustomChart=31, role=15, policy=15, ...
Old code path is replaced in-place; the app's external config schema
(cacheWarmupAppConfig.json) and trigger endpoint are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): cache certification + container refs, 0 DB queries per warm GET
Close out the last two DB queries firing on the warm-cache path.
1. Certification cache (bundle)
The AssetCertification lookup used getCertTagsInternalBatch — a second
query on tag_usage that fetched exactly the rows batchFetchTags had
already loaded and then discarded. Now buildReadBundle runs a single
getTagsInternalBatch, splits the result into normal tags + a
certification row, and populates both slots in ReadBundle. Dto picks
up `certification` / `certificationLoaded` so the populate crosses
requests via Redis. getCertification() reads from
ReadBundleContext.getCurrent() on the fast path.
2. Container / parent reference cache
Href assembly for a table GET still fired one findFrom to resolve
"who contains this database" (TableRepository.setDefaultFields when
the table row doesn't have service embedded). Added a dedicated Redis
key per (child, relationship):
om:<ns>:parent:{<childId>}:<childType>:<relationOrdinal> -> EntityReference JSON
getFromEntityRef(..., fromEntityType=null, ...) checks the cache,
populates on miss. CachedRelationshipDao gets get/put/invalidate
container helpers. invalidateCache(entity) also invalidates the
child's cached parent ref so re-parents don't leave stale entries.
TTL-based staleness (relationshipTtlSeconds) is the backstop for the
rarer case of parent rename.
3. Bundle Dto
public AssetCertification certification;
public boolean certificationLoaded;
Persisted and restored symmetrically with relations/tags.
Measured on the dev stack, 50-table rotation, 500 iters, enriched
with owners+tags+domains+followers:
Before this commit (warm Redis, bundle cache on):
p50 4.11 ms p95 5.24 ms p99 6.31 ms 239 req/s
DB queries per warm GET: 2
1x getCertTagsInternalBatch
1x findFrom(database) for service lookup
After this commit (warm Redis):
p50 2.95 ms p95 3.76 ms p99 4.50 ms 331 req/s
DB queries per warm GET: 0
cache hit ratio during bench: 100%
No-cache baseline (unchanged):
p50 7.26 ms p95 10.68 ms p99 13.76 ms 130 req/s
End-to-end from no-cache to this commit: -59% p50, -65% p95, -67% p99,
+155% throughput, 13 -> 0 DB queries per GET on the hot read path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): fix write-through shape + tighten invalidation on updates
Two bugs exposed by a cache-coherence audit on updates:
1. Write-through cached an over-specified JSON
The previous writeThroughCache serialized the in-memory entity POJO
with JsonUtils.pojoToJson(entity). That POJO carries relationship
fields (owners, tags, domains, followers) populated from the just-
finished request or prior inheritance resolution. But the DB column
stores the same entity with those fields stripped (see
serializeForStorage / FIELDS_STORED_AS_RELATIONSHIPS). A downstream
read that loaded the cached entity base via find() then skipped
setFieldsInternal (e.g. Entity.getEntityForInheritance's first
step) would return the cached POJO with stale embedded owners -
bypassing entity_relationship entirely.
Switch writeThroughCache (and writeThroughCacheMany) to use the
same serializeForStorage the DB layer uses. Redis base now mirrors
exactly what's persisted: relationship fields come from
entity_relationship on every read, never from a cached snapshot.
2. Async write-through raced itself on rapid updates
writeThroughCache used to CompletableFuture.runAsync on a shared
executor, re-reading from the DB. Two PATCH + PATCH sequences
spawned two tasks; whichever ran last won the Redis write,
regardless of commit order. Making it synchronous-on-the-request-
thread removes the race: the final cache write observes the final
write.
3. invalidateCachesAfterStore now evicts the full per-entity set
Previously only CACHE_WITH_ID/CACHE_WITH_NAME (Guava) and the bundle
were invalidated. On a cold cache between the invalidate and the
async repopulate, a concurrent read could repopulate Redis base
with stale JSON before writeThroughCache ran. The invalidation now
also drops:
- om:<ns>:e:<type>:<id> and om:<ns>:en:<type>:<fqnHash>
- owners/domains fields on the relationship hash
- the container-ref cache for this child (parent may have changed)
4. Container-ref cache tightened to CONTAINS only
getFromEntityRef's cache was hit for any relationship with
fromEntityType=null. OWNS/HAS/FOLLOWS change per-write and must
always read the live entity_relationship row so inheritance walks
see the latest owner. Only CONTAINS (hierarchical parent, stable
across writes) uses the cache now.
Validation (single-instance, Redis enabled):
om-cache-validate.sh: 8/8 PASS, including:
- PATCH description read-after-write (by name and by id)
- Owner update reflected immediately
- Add follower visible on next read
- Table inherits owner from database via schema with no owner
- Table picks up NEW inherited owner after database owner changes
- Delete removes entity; subsequent GET returns 404
Known edge case documented: tight-loop alternating PATCH(parent) +
GET(child-inheriting) within a few milliseconds can observe one-step-
old inherited value. Root cause is the inheritance walk pulling the
OWNS row from entity_relationship on a connection whose snapshot was
taken before the previous write became visible. Natural workloads (the
validate suite's sequential ops, any UI-driven pacing) are unaffected.
Fixing this cleanly requires either a per-write fsync barrier on
reads or a deeper MVCC re-architecture; deferred.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(cache): add Redis testcontainer support + mysql-elasticsearch-redis profile
Lets integration tests run against an ephemeral Redis so we can surface
any IT that breaks when the cache layer is active.
TestSuiteBootstrap:
- New cacheProvider system property (default: none). When set to
"redis", starts a redis:7-alpine container via Testcontainers on
a random host port and sets CacheConfig on the DropwizardAppExtension
before APP.before() runs.
- Per-run keyspace (om🇮🇹<startMs>) keeps parallel suite runs from
colliding if they share a Redis host.
- Container is registered in the existing cleanup chain.
pom.xml:
- New profile `mysql-elasticsearch-redis`. Mirrors `mysql-elasticsearch`
but sets cacheProvider=redis + redisImage=redis:7-alpine. Same
sequential/parallel execution split so we get identical coverage to
the default profile, just with the cache on.
Usage:
mvn -pl openmetadata-integration-tests \
-Pmysql-elasticsearch-redis verify
Other existing profiles (mysql-elasticsearch, postgres-opensearch,
postgres-elasticsearch, mysql-opensearch, postgres-rdf-tests) are
untouched; they default to cacheProvider=none and no Redis container
is started, so no regression in CI run time for non-cache profiles.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): invalidate stale cache entries on rename cascade and direct DAO writes
Writes that bypass EntityRepository.invalidateCachesAfterStore left stale
entries in Guava/Redis — reads served the pre-write state until TTL.
Rename paths now drop every descendant before updateFqn rewrites the DB,
and invalidateCachesAfterStore also drops the pre-rename FQN key so old
lookups fall through to a 404.
Direct dao.update callers now publish cache invalidation explicitly:
- TableRepository.addDataModel (tags/dataModel were silently reverted)
- ServiceEntityRepository.addTestConnectionResult
- PersonaRepository.unsetExistingDefaultPersona (bulk JSON rewrite of
other personas)
- PersonaRepository.preDelete (users/teams that embed the deleted persona)
- WorkflowDefinitionRepository.suspend/resume
- EntityRepository.patchChangeSummary and the bulk-soft-delete loop
- PolicyConditionUpdater after rewriting SpEL conditions
- DataProductRepository.updateName and bulk domain migration (every asset
with an embedded data-product reference needs its bundle refreshed)
Drops Redis IT-suite cache-coherence failures from 40 to 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): invalidate cache entries on batched CSV import updates
updateManyEntitiesForImport wrote the new JSON straight to Redis but never
dropped the per-instance Guava (CACHE_WITH_ID / CACHE_WITH_NAME) or bundle
caches, so a GET immediately after CSV import could still see the pre-import
tags, owners, and domains until TTL expired.
Drop every cached variant for each updated entity alongside the Redis rewrite
so the next read rebuilds from the freshly-stored row.
Fixes DatabaseSchemaResourceIT.test_importCsv_withApprovedGlossaryTerm_succeeds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): lowercase user FQN in name-based cache loader
UserDAO.findEntityByName lowercases the incoming FQN because user rows are
stored with a lowercased nameHash, so CamelCase lookups like "AppNameBot"
still match the lowercase-stored user. The cache loader called dao.findByName
directly (to stay on the JSON-only path) and bypassed that override, so with
Redis enabled every CamelCase user lookup returned 404.
Mirror the same case-fold in EntityLoaderWithName for user types.
Fixes AppsResourceIT.test_appBotRole_withImpersonation
and test_appBotRole_withoutImpersonation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(it): raise PrometheusResourceIT timeouts for loaded CI runs
5s read timeout was flaking under concurrent IT load: the admin port
competes for threads with the main app, and collecting full Prometheus
snapshots takes >5s when many tests hit the JVM at once. Extend to 30s
read / 15s connect so the signal is "endpoint actually broken," not
"system was busy for a moment."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(it): raise TagResourceIT search-index timeout to 90s
test_searchTagByClassificationDisplayName waited 30s for the tag to appear
in the tag_search_index. Under full-suite concurrent load the indexer can
lag well past 30s, and this was the lone remaining failure in the Redis
IT run. Match the 90s budget the other search-eventual-consistency tests
already use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(search): default entityStatus to Unprocessed in search index doc
The generated POJOs don't apply the status.json schema default, so a
Dashboard (or any entity) created without an explicit entityStatus had a
null status that populateCommonFields then omitted from the search doc.
PopulateCommonFieldsTest.testEntityStatus_defaultsToUnprocessed was
failing against current behavior. Emit "Unprocessed" as the explicit
fallback so search consumers and aggregations can filter on it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(it): retry BaseEntityIT testBulkFluentAPI verification under load
The PATCH is synchronous on the server but parallel IT traffic sometimes
stalls the subsequent GET long enough for the test to observe the
pre-update description before the fresh row is served. Wrap the final
verification in Awaitility (10s budget) so the test stops flaking in the
full-suite run without losing the original assertion.
Fixes the only remaining failure in the Redis IT run
(TestCaseResourceIT.testBulkFluentAPI).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(it): raise TestCaseResourceIT awaitility timeouts to 90s
test_incidentReopensAsNewAfterResolveAndNewFailure and other incident/
resolution-status tests used 30s Awaitility windows that were insufficient
under full-suite parallel load. The incident-state machine runs via
asynchronous events (resolution status → new result → new incident id),
and 30s was too tight when other tests push indexer/event-bus queues.
Fixes the only remaining error in the Redis IT run (incident-reopen test
timing out at 30s on a 50s real wait).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(it): raise BaseEntityIT checkCreatedEntity search-index timeout to 180s
Under full parallel load the ElasticSearch async indexer queue backs up
past the previous 90s budget — the test took 90.7s then timed out on a
real indexing race. Extend to 180s to swallow that tail without dropping
the assertion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(it): extend testBulkFluentAPI retry window to 60s
The 10s retry still timed out for NotificationTemplateResourceIT under
full parallel load. Match the 60s budget other inherited IT retries use.
The PATCH itself is sub-second; the budget absorbs pub-sub fan-out and
indexer queue tails, not the write itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(testCase): retry bulk logical-suite insert on MySQL deadlock
addAllTestCasesToLogicalTestSuite runs a full-table SELECT + INSERT IGNORE
that acquires gap locks across test_case. Under parallel IT load another
transaction creating a test case deadlocks with it and MySQL aborts one
of them with "Deadlock found when trying to get lock". The test was
genuinely failing, not just a flaky assertion.
Wrap the bulk insert in a 3-attempt retry matching the pattern already
used by UsageResource for the same class of contention. Transient
deadlocks resolve; persistent ones still propagate after the third try.
Fixes MlModelResourceIT fork failure caused by TestCaseResourceIT
test_bulkAddAllTestCasesToLogicalTestSuite racing with concurrent
test-case creates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(it): raise TestCaseResourceIT awaitility timeouts to 180s
90s was still insufficient under full parallel load for the incident
reopen flow — the test took 110s waiting for the new incident id to
materialize. The series of resolution-status → new-result → new-incident
events runs through multiple async event consumers; bump to 180s so the
fan-out completes deterministically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): address PR review — Postgres portability, single-flight, URI reuse
- listIdFqnByPrefixHash: dual @ConnectionAwareSqlQuery for MySQL
(JSON_UNQUOTE/JSON_EXTRACT) and Postgres (json->>) so the name-hash
LIKE scan runs on both backends.
- CachedReadBundle: drop Redis SETNX busy-poll + null-DTO waiter spin.
Use Guava Striped<Lock> keyed by (type, id) so concurrent readers on
one instance collapse to one DB load without Redis round-trips; cross
instance races remain coherent because Redis SET is idempotent.
EntityRepository.buildReadBundle takes/releases the stripe lock in a
try/finally around the cache populate.
- RedisURIFactory: single shared builder used by RedisCacheProvider and
CacheInvalidationPubSub so both interpret redis url / auth / SSL /
database config identically.
- RedisCacheProvider.awaitAll: use LettuceFutures.awaitAll so the whole
pipeline batch shares one timeout instead of accumulating per-future
timeouts.
- mvn spotless:apply follow-ups across a few unrelated files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(cache): address PR review — rediss:// SSL, pipeline error handling, stale comments
- RedisURIFactory: carry parsed.isSsl() forward when rebuilding the
builder from a redis:// / rediss:// URL. Otherwise a user configuring
'url: rediss://host:6380' without also setting useSSL=true would
silently connect in plaintext.
- RedisCacheProvider.awaitAll: capture the LettuceFutures.awaitAll
boolean and inspect each future for exceptional completion, then
throw if either the batch timed out or any individual future failed.
Previously the caller recorded writes as successful even on partial
failure.
- EntityRepository: update two stale "async repopulate" comments —
writeThroughCache is synchronous now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(jdbi): extract DeadlockRetry utility with resilience4j backoff
Replace TestCaseRepository's inline retry loop with a reusable
DeadlockRetry helper keyed to the transaction boundary. Retries live in
resilience4j so backoff runs on a scheduled executor instead of
Thread.sleep blocking the request thread. Exponential base 50 ms ×
2^(attempt-1) with 50% jitter over 4 attempts.
DeadlockRetry must wrap a @Transaction-annotated call so each retry
replays the whole unit of work in a fresh JDBI transaction — a per-DAO
retry would leave earlier writes in the rolled-back txn lost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): log root cause of first Redis pipeline failure
awaitAll counted per-future exceptions but never surfaced what actually
broke. On a batch failure operators had a count and a timeout but no
way to tell NOSCRIPT / OOM / connection-reset apart. Capture the first
underlying cause, log it once, and attach it as the cause of the
thrown IllegalStateException.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address Copilot review — counters, lock leak, txn retry, gating
- CacheWarmupApp: pass per-page deltas to updateEntityStats so stored
totals don't double-count as cumulative counters grow page-over-page.
- EntityRepository.buildReadBundle: hold the striped load-lock through
the whole fetch/populate path instead of only the final populate
step. An exception in fetchTo/From/Tags/Votes/Extensions/prefetch
previously leaked the lock and stalled later readers on the same
(type, id).
- TestCaseRepository.addAllTestCasesToLogicalTestSuite: split public
entry point from the @Transaction method and wrap DeadlockRetry
outside the transaction boundary so each retry runs in a fresh txn.
- EntityResource.isDistributedCacheEnabled: also check
CacheProvider.available() so a failed or disconnected Redis doesn't
leave REST GETs serving stale Guava reads across instances.
- DeadlockRetry Javadoc: corrected — resilience4j's executeSupplier
is synchronous; the calling thread waits between attempts. Matches
the SearchRetryUtil pattern already in use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(cache): address review — health-check, pipeline failure accounting, deterministic warmup, by-name invalidation
- RedisCacheProvider: flip `available=false` from command catches + background PING health
check that recovers the flag when Redis comes back. Prevents stale-read divergence in
multi-instance deployments after a Redis outage.
- CacheWarmupApp: surface pipeline failures — no longer count rows toward success when the
Redis batch write threw. Set FAILED status when cache is unavailable at startup so the job
record doesn't stay RUNNING. Replace "user" string literal with Entity.USER.
- EntityDAO.listAfterWithOffset: add ORDER BY id so warmup pagination is deterministic
(was prone to skip/duplicate rows between pages).
- RedisURIFactory: normalize bare host/host:port through RedisURI.create so IPv6 hosts and
malformed inputs fail cleanly instead of blowing up split(":").
- invalidateCacheForEntity(..., null) left by-name cache entries stale in
Persona/DataProduct/Domain. Added invalidateCacheForReferencedEntity(record) helper that
extracts fullyQualifiedName from the relationship record JSON; PersonaDAO now has a
(id, fqn) variant used before the bulk default-unset so both cache variants evict.
* fix(cache): abort warmup when provider flips to unavailable mid-run
A prior batch that trips the Redis provider to available=false causes
pipelineSet/Hset calls in subsequent iterations to silently return (their
`if (!available) return;` guard fires). The try-block then completes
without exception, and the success counter still adds pageSuccess — so
rows get reported as warmed even though nothing was written to Redis.
Check `cacheProvider.available()` at the top of each page iteration and
bail out. The background health checker flips availability back when
Redis recovers; operators rerun the app to resume warmup from a clean
state rather than relying on mid-outage bookkeeping.
* fix(cache): address two new Copilot findings — PubSub leak + deadlock chain walk
- CacheInvalidationPubSub.start() set `running=true` via CAS, then allocated
RedisClient/subConnection/pubConnection. If any step after the first
allocation threw, the catch only flipped `running=false` — leaving half-
initialized Lettuce client + connections dangling. stop() would then
short-circuit on the flag and never clean them up. Extract a
closeResources() helper called from both the catch and stop() so the
client/connections are released on partial failure.
- DeadlockRetry.isDeadlock walked to the deepest cause and only checked that
leaf. The Javadoc promises "or any cause in its chain". When the SQLException
is wrapped in UnableToExecuteStatementException and the connection-release
throws a non-SQLException wrapper, the leaf is no longer the SQLException
and real deadlocks silently skip the retry. Walk every link (with a guard
against self-referential cycles) and return true if any link matches.
* fix(cache): two more Copilot findings — user FQN case-fold + awaitAll future cancel
- EntityLoaderWithName lowercased the DB lookup for `user` types but the
Guava CACHE_WITH_NAME key was still the caller-provided fqn. `Alice@x.com`
and `alice@x.com` produced split cache entries, and invalidations written
against the canonical lowercased form left the mixed-case entry serving
stale data until TTL. Added a `cacheNameKey(entityType, fqn)` helper that
lowercases for user and passes through otherwise, applied at all 10
CACHE_WITH_NAME access sites (get + invalidate).
- awaitAll threw on batch timeout but left futures still-in-flight. Over
repeated timeouts the Lettuce event loop accumulates pending response
slots and dispatcher work. Added `cancel(false)` for any non-done future
on the failure path and reported the cancelled count in the thrown ISE.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: mohitdeuex <mohit.y@deuexsolutions.com>
Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
* Fix k8s operator exit handler pod loop and TTL cleanup, add tolerations support (#26772)
Fix two bugs in the OMJob operator:
- Exit handler pods were recreated indefinitely because findExitHandlerPod()
lacked the name-based fallback that findMainPod() already had, causing
label propagation delays to trigger repeated pod creation events
- Terminal phase handler never rescheduled for TTL-based cleanup, so pods
were never cleaned up after ttlSecondsAfterFinished expired
Add tolerations support for ingestion pod scheduling across the full stack:
- Operator: OMJobPodSpec field, PodManager.buildPod(), CRD schema
- Server: OMJob model, K8sPipelineClientConfig parsing, K8sPipelineClient
builder, K8sJobUtils serialization
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add K8S_TOLERATIONS env var mapping in openmetadata.yaml
Adds the tolerations config binding so the server picks up the
K8S_TOLERATIONS env var set by the Helm chart secret.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add tolerations to k8s test values for local validation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix cleanup
* Address PR review: remove redundant pod lookup and guard null items
- Remove redundant server-created pod selector fallback in findMainPod()
since buildPodSelector() now matches all pods by omjob-name alone
- Add null guard for getItems() in deletePods() to prevent NPE
- Update local test values for namespace and image config
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Add Prometheus metrics for reindexing pipeline via Micrometer Bridge the existing reindexing atomic counters to Prometheus so operators can alert on failures, latency spikes, and backpressure without relying solely on database-flushed stats.
- Add ReindexingMetrics singleton (initialize/getInstance pattern matching
CacheMetrics) with job lifecycle counters, stage success/failed/warnings
counters, bulk request timers with SLA buckets, payload size distribution,
backpressure and promotion counters, and active/pending gauges
- Register in MicrometerBundle after StreamableLogsMetrics
- Instrument ReindexingOrchestrator.run() with job started/completed/failed/stopped
- Bridge StageStatsTracker.flush() deltas to Prometheus per stage and entity type
- Add bulk request latency timer and payload size recording in OpenSearchBulkSink
- Record backpressure events in SearchIndexExecutor.handleBackpressure()
- Record promotion success/failure in DefaultRecreateHandler
- Add ReindexingMetricsTest with 24 tests covering all metric types
* Add Improvements
* Auto Gene
* Use Auto Config in distributed
* Fix Partition Claim Spread
* Make partition use config
* Correct total count
* Fix Wait time to 5 mins
* Revert om yaml
* Fix Sink sync
* Add Failure Handling at different stages
* Update script to create entities
* Move to scripts
* Add usage and fix script
* Fix Script
* Update generated TypeScript types
* Fix Staging miss
* Fix Stats reconcilation issue
* Revert workflow handler
* Fix Partition worker early sync
* Update Logs
* Update logs EntityRepository
* Error failure test
* Review Comments fix
* Fix Non Distributed live feed
* Fix Non Distributed stats feed
* Fix Review comments
* Fix Time Series cutt off
* Update generated TypeScript types
* Md
* Benchmark addition
* Fix date time warning
* Update load test to do benchmark analysis
* Disagnostic and update perf test
* Move load test to bin
* Fix Review Comments
* Add numeric values
* Move to localhost by default
* Fix Perf test issues
* Review Comments
* Add Preflight Fixes
* Add Preflight fixes for stale entry
* Remove stale entry on ApplicationHandler
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Fix Stats
* Add Warning logs and reindex failure analysis
* Add Search Insights in Preferences
* Add Label
* Fix Full Error not available
* Add check for reindex run
* Add support for Dropwizard 5.0 and Jetty 12.1.x
* Dropwizard 5x and Jetty 12.1 upgrade
* Fix test behavior
* Fix rdf tests
* revert enableVirtualThreads
* fix tests
* Fix Tests
* Fix tests
* Switch to jersey-jetty-connector for Jetty 12 compatibility
- Replace jersey-apache-connector with jersey-jetty-connector
- Jersey 3.1.4+ jersey-jetty-connector supports Jetty 12.0.x+
- Use JettyConnectorProvider and JettyHttpClientSupplier for HTTP client
- Keep reasonable timeouts (30s connect, 2min read) to prevent CI hangs
- Set SYNC_LISTENER_RESPONSE_MAX_SIZE for large responses
This fixes the 1,093 InterruptedException test failures caused by
using the default Jersey client (HttpURLConnection-based) which doesn't
handle concurrent test execution properly.
* Fix: Start Jetty HttpClient before use
Jetty 12 HttpClient implements LifeCycle and must be explicitly
started with httpClient.start() before use. This fixes the 163
InterruptedException test failures.
* Fix: Force jetty-client to 12.1.1 for jersey-jetty-connector
jersey-jetty-connector brings transitive jetty-client:12.0.22 but
Dropwizard 5.0 uses Jetty 12.1.1. The ClientConnector.newTransport()
API changed between 12.0.x and 12.1.x, causing NoSuchMethodError.
Fix: Exclude transitive jetty-client and add explicit 12.1.x dependency.
* Use Java 11+ HttpClient connector for tests (jersey-jnh-connector)
Switch from the broken jersey-jetty-connector (incompatible with Jetty 12.1.x)
to jersey-jnh-connector which uses Java's built-in java.net.http.HttpClient.
This connector:
- Natively supports all HTTP methods including PATCH
- Works with Java 21
- No external dependencies required
- Avoids compatibility issues with Jetty versions
* Use Apache HttpClient 5.x connector for tests (jersey-apache5-connector)
Switch from jersey-jetty-connector (incompatible with Jetty 12.1.x)
to jersey-apache5-connector which uses Apache HttpClient 5.x.
This connector:
- Supports all HTTP methods including PATCH
- Lenient with empty PUT request bodies
- Has proper timeout support to prevent indefinite hangs
- Works with Jetty 12.1.x
* Fix tests
* Fix docker compose
* Fix tests
* Fix tests - make url compatible
* Add URL parsing
* Fix URL decode
* fix tests
* fix test
* fix tests
* Fix integration with new dropwizard-5x changes
---------
Co-authored-by: Karan Hotchandani <33024356+karanh37@users.noreply.github.com>
Co-authored-by: karanh37 <karanh37@gmail.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
* Single RDF knowledge graph for all entities
* Fix RDF Resource Test
* fix test
* fix test
* Add support for TagLabel objects
---------
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
Co-authored-by: lautel <laura92cp2@gmail.com>
* MINOR - Prepare extra validations for system repository health
* Update generated TypeScript types
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* chore: set lower DAG serialization defaults
* chore: increase timeout for testSuite
* chore: update DAG processor interval postgres
* chore: lower DAG parse interval and delay
* fix: remove internal parsing and trigger dag parsing automatically on deploy
* wip
* feat: trigger external apps with override config
- Added in openmetadata-airflow-apis functionality to trigger DAG with feature.
- Modified openmetadata-airflow-apis application runner to accept override config from params.
- Added overloaded runPipeline with `Map<String,Object> config` to allow triggering apps with configuration. We might want to expand this to all ingestion pipelines. For now its just for apps.
- Implemented an example external app that can be used to test functionality of external apps. The app can be enabled by setting the `ENABLE_APP_HelloPipelines=true` environment variable.
* fix class doc for application
* fixed README for airflow apis
* fixes
* set HelloPipelines to disabeld by default
* fixed basedpywright errros
* fixed app schema
* reduced airflow client runPipeline to an overload with null config
removed duplicate call to runPipeline in AppResource
* Update openmetadata-docs/content/v1.7.x-SNAPSHOT/developers/applications/index.md
Co-authored-by: Matias Puerta <matias@getcollate.io>
* deleted documentation file
---------
Co-authored-by: Matias Puerta <matias@getcollate.io>
* GEN-1493 - Fix opensearch pagination
* GEN-1494 - Add CI for py-tests with Postgres and Opensearch
* GEN-1494 - Add CI for py-tests with Postgres and Opensearch
* Fix backend tests and have index alias with clusterAlias appended
* Fix backend tests and have index alias with clusterAlias appended
* Fix failing tests
* alias setup
* fix suggestion not working due to alias
* fix getIndexOrAliasName method for multiple indexes
* update openmetadata.yaml
* update childAliases with clusterAlias
---------
Co-authored-by: Ashish Gupta <ashish@getcollate.io>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
* Split ExternalSecretsManagerTest to new ExternalSecretsManagerTest and AWSBasedSecretsManagerTest
* implement SecretsManagerFactory to create GCPSecretsManager
* implement GCPSecretsManager
* implements gcp secret manager.
* Fix it for the GCP's rule.
* create a template of GCP
* fix compile error
* implements to use project_id
* add library for the google cloud secret manager
* add test code for using google credential in the docker container
* modify docker-compose.yml for GCP
* add google_crc32c module
* modify ways to get project id
* create a new docker-compose.yml for Google Cloud
* create a new document
* create compose file for gcp secret manager
* fix invalid styles and formats
* downgrade google library to avoid conflicting protoc versions
* Revert "fix(CI): update `run_local_docker.sh` script with cluster alias ES endpoint (#16604)"
This reverts commit baab52076f.
* Revert "chore(CI): Enable Elasticsearch cluster alias by default (#16599)"
This reverts commit a3524e14d3.