OpenMetadata/.github
Sriharsha Chintalapani 64f49c1747
Cache improvements: lineage + search layers, observability, CI gate (#28012)
* cache: lineage cache, per-type metrics, invalidation registry, search-cache

Add Redis-backed lineage response cache and search response cache, both
gated by the existing CACHE_PROVIDER toggle and falling through to direct
computation when the cache is unavailable. The cache remains optional —
verified end-to-end by toggling CACHE_PROVIDER=none on a live stack and
confirming all paths continue to work (just without the L2 hit).

Coverage:
- CachedLineage wraps LineageRepository.getLineage with a hybrid of TTL
  (60s default) and direct invalidation. Direct edits invalidate the
  affected root cache entries; transitive changes fall through to TTL.
- CachedSearchLayer wraps /api/v1/search/query with auth-aware caching
  (cache key includes principal so users with different ACLs don't share
  results). 30s default TTL.

Observability:
- /api/v1/system/cache/stats response now includes a metrics block with
  hits/misses/hitRatio/evictions/errors/writes plus read/write latency
  Timers, and a byType breakdown so coverage gaps are visible per
  entity-type and per cache-layer.

Correctness:
- New Invalidatable interface + CacheBundle registry + invalidateEntity
  helper so future cache layers plug in by implementing one method
  instead of editing multiple mutation paths.
- Edge mutations in LineageRepository.addLineage/deleteLineage invalidate
  both endpoints; entity mutations in EntityRepository.postUpdate /
  postDelete / restoreEntity invalidate the lineage rooted at the entity.
- Pub/sub handler in CacheBundle iterates registered Invalidatables so
  remote-pod evictions flow to all layers automatically.

Tooling:
- docker-compose.cache-off.yml overlay flips CACHE_PROVIDER=none for
  local A/B testing without tearing down DB/ES volumes.
- CachedSearchLayerIT exercises hit-on-second-call, distinct-query
  misses, distinct-page-size misses, and byType shape via the metrics
  endpoint. Each test gracefully no-ops when the cluster runs cache-off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cache: phase 2 ops + correctness — single-flight, slow-read, negative cache, admin endpoints

Builds on the phase 1 commit (c20a29b11b) with operability and correctness
items from .context/cache-improvements-design.md. All four pieces respect
the optional-cache contract: with CACHE_PROVIDER=none they no-op cleanly.

P2.3 — Single-flight on CachedSearchLayer
  Striped<Lock> keyed by SHA-cache-key. 100 concurrent users hitting the
  same uncached query collapse to one ES call instead of N. SearchResource
  now uses loadOrCompute so the lock-and-recheck pattern lives inside the
  cache layer; the supplier is the actual ES call kept tight to minimize
  lock-hold time. Non-200 upstreams bypass cache and refetch.
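
  A minimal sketch of the lock-and-recheck pattern described above, using
  only the JDK. The class name, the stripe count of 64, and the in-memory
  map standing in for Redis are illustrative assumptions; the commit itself
  refers to Guava's Striped<Lock>, which this sketch replaces with a plain
  lock array.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical stand-in for a cache layer; the map plays the role of Redis.
final class SingleFlightCache {
  private final Map<String, String> store = new ConcurrentHashMap<>();
  private final ReentrantLock[] stripes = new ReentrantLock[64];

  SingleFlightCache() {
    for (int i = 0; i < stripes.length; i++) {
      stripes[i] = new ReentrantLock();
    }
  }

  String loadOrCompute(String key, Supplier<String> loader) {
    String cached = store.get(key);
    if (cached != null) {
      return cached; // fast path: no lock taken on a hit
    }
    ReentrantLock lock = stripes[Math.floorMod(key.hashCode(), stripes.length)];
    lock.lock();
    try {
      // Recheck: another thread may have filled the entry while we waited.
      cached = store.get(key);
      if (cached != null) {
        return cached;
      }
      String value = loader.get(); // the expensive upstream call
      if (value != null) {
        store.put(key, value);
      }
      return value;
    } finally {
      lock.unlock();
    }
  }
}
```

  Keeping the supplier small matters because every key that maps to the
  same stripe waits behind it, not only callers of the same key.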

P2.6 — Slow cache reads logged
  RedisCacheProvider.get/hget timing checked against
  cache.slowReadThresholdMs (default 50ms). Exceeding fires a WARN log
  and bumps a new cache.reads.slow Micrometer counter exposed in
  /cache/stats.metrics.slowReads. Leading indicator of Redis pressure /
  network glitch / hot-key contention.

P2.4 — Negative caching for not-found entities
  NotFoundCache marks "we looked, no such entity" with a short TTL
  (default 30s) so repeated 404 lookups (typo'd FQNs, references to
  deleted entities) don't hammer the DB. Wired into
  EntityRepository.find(UUID) and findByName for the !fromCache path.
  Implements Invalidatable so the postCreate fan-out drops the marker
  on entity create — without that, create-then-immediately-read would
  404 for up to TTL.

  Added CacheBundle.invalidateEntity to EntityRepository.postCreate so
  newly-created entities reach every Invalidatable registry layer.
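
  The marker-with-TTL idea can be sketched as follows. This is not the
  NotFoundCache implementation: the class name, the in-memory map, and the
  injectable clock are assumptions made so the example runs on its own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical in-memory model of "we looked, no such entity" markers.
final class NotFoundMarkers {
  private final Map<String, Long> expiresAt = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock; // injected so expiry can be exercised without sleeping

  NotFoundMarkers(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  // Record a miss; the marker lives for ttlMillis from now.
  void markMissing(String key) {
    expiresAt.put(key, clock.getAsLong() + ttlMillis);
  }

  // True only while an unexpired marker exists for the key.
  boolean isKnownMissing(String key) {
    Long deadline = expiresAt.get(key);
    if (deadline == null) {
      return false;
    }
    if (clock.getAsLong() >= deadline) {
      expiresAt.remove(key, deadline); // lazily drop the expired marker
      return false;
    }
    return true;
  }

  // Called on entity create so create-then-read does not see a stale 404.
  void invalidate(String key) {
    expiresAt.remove(key);
  }
}
```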

P2.5 — Admin cache ops endpoints
  GET  /api/v1/system/cache/keys?pattern=...      — SCAN keys, returns count
  POST /api/v1/system/cache/invalidate?pattern=.. — SCAN+UNLINK, returns deleted
  POST /api/v1/system/cache/invalidate/entity?type=&id=&fqn=
                                                  — fan to all Invalidatables

  All admin-only. Pattern endpoints document the "no broad globs" rule —
  we never want a SCAN over om:prod:* on a busy cluster. Per-entity
  endpoint goes through the existing Invalidatable registry so future
  cache layers are reachable from ops without ever touching this code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cache: pipelined mget on CacheProvider + CachedReadBundle.getBatch

Adds a foundational batch-read primitive at the provider layer:

  CacheProvider.mget(List<String>) -> List<Optional<String>>

Default implementation does sequential per-key gets (correct, no batching
benefit). RedisCacheProvider overrides with a true pipelined version: all
GETs are queued under setAutoFlushCommands(false), then flushed once and
awaited as a single TCP round-trip. Records hits/misses through the
existing CacheMetrics counters and respects the slow-read threshold.

Per-key pipelining was chosen over a true MGET: Redis Cluster requires
same-slot keys for MGET, whereas pipelined per-key GETs work transparently
across slots without that constraint, at the same network cost.

CachedReadBundle.getBatch(entityType, ids) consumes the new primitive
for prefetch use cases (UI prefetch on hover, list-then-detail
navigation warmup). The list endpoint hot path itself does NOT use this
layer — list responses are SQL-batched via EntityRepository.setFieldsInBulk
which calls fieldFetchers in bulk, not per-row CachedReadBundle.get.
That's why bench3 showed list endpoints at neutral cache_off-to-on
ratio: lists already amortize at the SQL layer.

The mget primitive is what later phases will plug into when wiring
batch-prefetch to specific UI flows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cache): use unique query in sameQueryHitsCacheOnSecondCall to avoid state pollution

Sequential test run on postgres-os-redis caught a flake: the test issued
3 identical "q=*" calls expecting at least 1 cold-write. By the time it
ran, prior tests in the same JVM session had already cached
(q=*, index=table_search_index, size=10), so call 1 was a hit, call 2
hit, call 3 hit — total writes=0, asserts failed.

Switching to a per-invocation nonce ensures we always start cold,
matching the pattern the other 3 tests in this class already use.

Confirmed via subsequent parallel-pass run on the same suite where the
test passed (different test ordering, fresh cache for that key).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cache: drop search cache TTL from 30s to 2s for create-then-search freshness

Integration tests on the postgres-os-redis profile caught a real correctness
regression: tests that create an entity and Awaitility-poll for it to appear
in search timed out at 30s because our 30s search TTL pinned the
pre-create empty result for the entire test window. The same issue surfaces
in production: a user who creates a domain / table / dashboard and
immediately searches for it would see "no results" for up to 30s.

2s caps the staleness while still catching the dominant UI access pattern:
multiple components in the same render frame fire identical search queries.
Those happen within milliseconds, well inside any reasonable TTL.

The longer-term fix is search-cache invalidation on entity writes (a
generation counter per entity-type, search keys include the generation,
writes bump the generation). That's design-doc-tracked in
.context/cache-improvements-design.md but deferred — the 2s TTL is good
enough for now, and the more complete invalidation strategy can be a
follow-up PR with its own dedicated tests.
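
A rough sketch of the deferred generation-counter idea, under the
assumption that the counter is kept per entity type and embedded in the
search key. The class name and key layout are hypothetical; nothing like
this is implemented in this PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical key builder: a write bumps the generation, which orphans old keys.
final class GenerationKeys {
  private final Map<String, AtomicLong> generations = new ConcurrentHashMap<>();

  private AtomicLong counter(String entityType) {
    return generations.computeIfAbsent(entityType, t -> new AtomicLong());
  }

  // Search keys include the current generation of the entity type.
  String searchKey(String entityType, String queryHash) {
    return "search:" + entityType + ":g" + counter(entityType).get() + ":" + queryHash;
  }

  // Any write to the entity type moves later searches to a fresh key space.
  void onWrite(String entityType) {
    counter(entityType).incrementAndGet();
  }
}
```

Entries written under an older generation are never read again and simply
age out through their TTL, so no SCAN or explicit delete is needed.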

Failing tests under 30s TTL that this fixes:
  - DomainAssetsColumnExclusionIT (domain create-then-search)
  - LineageImpactAnalysisIT (owner removal reflected in search)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: cache-tests profile runs full IT suite + new postgres+es+redis CI workflow

The cache-tests Maven profile previously ran only the four cache/* IT
classes — too narrow to catch cache-correctness regressions in the rest
of the codebase. Expanded it to mirror the mysql-elasticsearch profile
shape: sequential + parallel failsafe executions, full **/*IT.java
inclusion, postgres + elasticsearch + redis backend, with
cacheProvider=redis system property added so every test path exercises
the cache layer.

Locally, the focused-cache-only run is preserved via
  mvn verify -P cache-tests -Dit.test='**/cache/*IT'

New CI workflow integration-tests-postgres-elasticsearch-redis.yml
mirrors the structure of integration-tests-postgres-opensearch.yml:

  - Same triggers (push to main, PR target, merge_group, workflow_dispatch)
  - Same path filters (openmetadata-service/**, integration-tests/**, etc.)
  - Same Maven cache + JDK 21 setup
  - Runs `mvn verify -pl :openmetadata-integration-tests -Pcache-tests`
  - Surefire-report publication with fail_on_test_failures

Result: PRs touching cache code (or any read path) get automatic CI
coverage with redis enabled. Cache-invalidation and stale-data bugs
that previously only surfaced in production now have a CI gate before
merge — same protection that mysql-elasticsearch and postgres-opensearch
provide for the no-cache code paths.

Smoke verified locally: `mvn verify -P cache-tests -Dit.test='**/cache/*IT'`
runs both sequential and parallel passes (6 tests each), all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): address PR review feedback for cache improvements

Nine review-driven fixes spanning the cache PR (#28012):

RedisCacheProvider.mget (bug):
  - Restructured the auto-flush window so `setAutoFlushCommands(true)` is
    in the OUTER `finally` of the entire op. The previous structure had
    the restoration in an inner finally that only fired around the
    awaitAll call; an exception in the queueing loop or flushCommands()
    would leave the SHARED connection in auto-flush=false mode, making
    every subsequent op from any caller silently buffer indefinitely.

SearchResource (bug):
  - Removed the double-call on the non-cacheable response path. The
    supplier now captures the upstream Response object so the outer code
    can return it directly when the body isn't cacheable (non-200 or
    non-String entity) — previously the caller re-invoked
    searchRepository.search() on every error/non-200, doubling backend
    load for failing queries.

EntityRepository negative cache (edge case):
  - Hoisted the NotFoundCache fast-path OUTSIDE the `!fromCache` guard in
    both `find(UUID,...)` and `findByName(...)`. Default callers go in
    via `find(id, include)` which delegates with fromCache=true; the
    previous gate made the fast-path unreachable for the most common
    caller. Also added negative-cache population from the cached path's
    ExecutionException so repeated requests for a non-existent id
    short-circuit after the first miss.

SystemResource cache endpoints (security + style):
  - `/cache/keys` and `/cache/invalidate` now validate the glob pattern
    via `validateCachePattern` — rejects pure wildcards or patterns with
    fewer than 6 literal characters before the first wildcard. Stops a
    careless or malicious admin from issuing `*` or `om:*` that would
    block the Redis cluster on a large keyspace. ReDoS-safe: linear
    char scan, no regex backtracking.
  - `/cache/invalidate/entity` now also calls
    `EntityRepository.invalidateCacheForEntity(...)` to evict the Guava
    L1 caches (`CACHE_WITH_ID`, `CACHE_WITH_NAME`) and propagate via the
    existing pub-sub channel — the previous code only invalidated the
    `INVALIDATABLES` registry layers, leaving stale L1 entries.
  - Replaced fully-qualified class names (`org.openmetadata.service.
    cache.CacheMetrics`, `jakarta.ws.rs.QueryParam`, `java.util.UUID`)
    with proper imports per the project style guide.
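
  The prefix rule described for `validateCachePattern` above can be
  sketched as a single linear scan. The class and method names here are
  hypothetical, and treating `*`, `?` and `[` as the wildcard characters
  (with no handling of backslash escapes) is an assumption, not something
  the commit states.

```java
// Hypothetical sketch of a "minimum literal prefix" check for glob patterns.
final class CachePatternCheck {
  static final int MIN_LITERAL_PREFIX = 6;

  static boolean isAllowed(String pattern) {
    if (pattern == null) {
      return false;
    }
    int literals = 0;
    // One pass over the characters, no regex, so cost is linear in pattern length.
    for (int i = 0; i < pattern.length(); i++) {
      char c = pattern.charAt(i);
      if (c == '*' || c == '?' || c == '[') {
        break; // first wildcard reached
      }
      literals++;
    }
    return literals >= MIN_LITERAL_PREFIX;
  }
}
```

  Under these assumptions `*` and `om:*` are rejected (0 and 3 literal
  characters), while a pattern such as `om:prod:lineage:*` passes.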

CachedLineage (edge case):
  - Single-flight stripe lock now keys on the FULL cache key
    `(rootId, upstreamDepth, downstreamDepth, includeDeleted)` instead
    of `rootId` alone. Concurrent requests for different depths or
    include-deleted flags on the same root no longer block each other.

CachedSearchLayer (doc):
  - Javadoc now correctly says default TTL is 2s (was incorrectly 30s)
    and explains why — see commit 41489056ff which dropped it from 30s
    after IT regressions where users couldn't see their own writes for
    half a minute.

CI workflow (bugs + security mitigation note):
  - Removed `if: steps.cache-output.outputs.exit-code == 0` from the
    `Set up JDK 21` and `Install Ubuntu dependencies` steps.
    `actions/cache@v4` exposes `cache-hit`, never `exit-code`; the
    expression always evaluated to false and those steps NEVER ran.
    Maven was using whatever JDK shipped with the runner.
  - Added explicit security note in the workflow header AND on the
    `Checkout` step documenting why `pull_request_target` is intentional
    and what the `safe to test` label gate accomplishes — CodeQL flags
    the pattern, the label gate is the accepted mitigation that mirrors
    every other integration-tests-*.yml workflow in this repo.

Verified:
  - mvn compile -pl openmetadata-service → BUILD SUCCESS
  - mvn test -pl openmetadata-service -Dtest=OpenMetadataAssetServletTest
    → 9/9 pass
  - mvn spotless:apply ran clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): only negative-cache on real EntityNotFoundException

The previous code caught every ExecutionException / UncheckedExecutionException
from the Guava cache loader and (a) populated NotFoundCache for 30s, (b)
rethrew as EntityNotFoundException. That conflated three very different
failure modes:

  1. Entity truly doesn't exist     → loader throws EntityNotFoundException
  2. Entity exists but is invalid   → loader throws IllegalStateException
  3. Transient DB / deser failure   → loader throws JdbiException, IOException

Cases 2 and 3 would poison the negative cache, turning a momentary DB
hiccup or a single bad row into a sustained 30s 404 for every caller that
asks for that id/fqn. Worse, the original cause was masked behind a
synthetic EntityNotFoundException, so logs and clients never saw the real
failure.

This change inspects e.getCause() and:
  - On EntityNotFoundException: populate NotFoundCache, rethrow the
    original (not a synthetic) so the caller's `instanceof` checks and
    message text still work.
  - On any other RuntimeException: rethrow unchanged — DB blips return
    5xx as before, validation errors surface, and the next request can
    re-attempt without hitting a poisoned cache.
  - On checked Throwable cause (rare for these loaders): wrap in
    RuntimeException so the contract is preserved.

Applied symmetrically to find(UUID, …) and findByName(String, …).
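
A self-contained sketch of the cause inspection described above. The
exception class, the `Set` standing in for NotFoundCache, and the helper
that wraps loader failures in ExecutionException are all stand-ins invented
for the example; the real code works against the Guava cache loader and
OpenMetadata's EntityNotFoundException.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;

final class NegativeCacheGuard {
  // Stand-in for the real "entity not found" exception type.
  static final class MissingEntityException extends RuntimeException {
    MissingEntityException(String message) {
      super(message);
    }
  }

  private final Set<String> notFound = new HashSet<>(); // stand-in for NotFoundCache

  String load(String key, Callable<String> loader) {
    try {
      return callWrapped(loader);
    } catch (ExecutionException e) {
      Throwable cause = e.getCause();
      if (cause instanceof MissingEntityException) {
        notFound.add(key); // only a genuine not-found populates the negative cache
        throw (MissingEntityException) cause; // rethrow the original, not a synthetic copy
      }
      if (cause instanceof RuntimeException) {
        throw (RuntimeException) cause; // DB blips and validation errors pass through
      }
      throw new RuntimeException(cause); // checked cause: wrap to keep the unchecked contract
    }
  }

  boolean isMarkedMissing(String key) {
    return notFound.contains(key);
  }

  // Mimics a caching loader that reports failures as ExecutionException.
  private static String callWrapped(Callable<String> loader) throws ExecutionException {
    try {
      return loader.call();
    } catch (Exception ex) {
      throw new ExecutionException(ex);
    }
  }
}
```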

Addresses gitar-bot review on PR #28012:
https://github.com/open-metadata/OpenMetadata/pull/28012#discussion_r... (negative cache poisoning)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): copilot review — blank param, javadoc, mget hardening

Four review comments from PR #28012 review 4266159401:

SystemResource.invalidateCacheForEntity (line 1069 → blank query params):
  `?type=X&id=&fqn=` slipped past the required-params check because only
  `null` was treated as absent. Normalize blank id/fqn to null up front
  so the missing-both branch fires correctly and the downstream
  CacheBundle / EntityRepository calls receive a clean null instead of
  an empty string.

CacheKeys.search/childrenPage (line 116 → orphaned Javadoc):
  When the search() helper was added between the children-page Javadoc
  and the childrenPage() method, the Javadoc got stranded above the
  wrong method. Move it back so javadoc tooling generates accurate docs.

RedisCacheProvider.mget (line 610 → shared-connection auto-flush race):
  setAutoFlushCommands(false) toggles state on the shared Lettuce
  connection — two concurrent mgets could overlap and one caller's
  commands would buffer until the other restored auto-flush, surfacing
  as latency spikes / hangs on other paths sharing the connection.
  Wrap the pipeline in a new instance-level ReentrantLock so only one
  mget runs the auto-flush dance at a time. try/finally still restores
  auto-flush unconditionally; lock release sits in an outer finally.

RedisCacheProvider.mget (line 621 → unbounded f.get() on timeout):
  Previously LettuceFutures.awaitAll(...) returned a boolean we ignored;
  if it timed out, the subsequent f.get() calls were unbounded and would
  block the request thread until the Lettuce event loop eventually gave
  up. Capture the boolean, cancel non-done futures on timeout (so f.get()
  throws CancellationException instead of blocking), and log a warning
  with the timeout value and key count for operators.
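
  The bounded-wait-then-cancel shape can be illustrated with plain
  CompletableFuture. This is a sketch under assumptions: the class and
  method names are hypothetical, and `CompletableFuture.allOf(...).get(timeout)`
  stands in for Lettuce's `LettuceFutures.awaitAll(...)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedAwait {
  static List<Optional<String>> collect(List<CompletableFuture<String>> futures, long timeoutMillis) {
    boolean allCompleted;
    try {
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
          .get(timeoutMillis, TimeUnit.MILLISECONDS);
      allCompleted = true;
    } catch (TimeoutException e) {
      allCompleted = false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      allCompleted = false;
    } catch (ExecutionException e) {
      allCompleted = true; // every future finished, at least one of them with an error
    }

    List<Optional<String>> results = new ArrayList<>(futures.size());
    for (CompletableFuture<String> f : futures) {
      if (!allCompleted && !f.isDone()) {
        f.cancel(true); // after this, reading the future can no longer block
      }
      // Cancelled and failed futures map to an empty slot, keeping positions aligned.
      results.add(f.isCompletedExceptionally() ? Optional.empty() : Optional.ofNullable(f.join()));
    }
    return results;
  }
}
```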

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): mget partial timeout must trip the circuit breaker

The previous mget rewrite cancelled in-flight futures on awaitAll timeout
but still called recordSuccess() at the end of the happy path. That fed
consecutiveSuccesses on every partial timeout, so a Redis instance that
was consistently slow (answering some keys, dropping others) would
*never* trip the breaker — masking real backend degradation as healthy.

Branch on the captured allCompleted boolean:

  - all futures completed → recordSuccess() as before
  - partial timeout → recordFailure(TimeoutException) and bump
    CacheMetrics.recordError() so the breaker's sliding-window failure
    detector picks it up and the metric reflects the degraded state

No other behaviour change — the per-key fallback Optionals still surface
to callers either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): mget shorter critical section + cache/stats + cache/keys doc

Three review comments from PR #28012 second copilot pass:

RedisCacheProvider.mget (RedisCacheProvider.java:624 — shared-connection
hold time): previous code held setAutoFlushCommands(false) for the entire
queue+flush+await window. Other paths (single get/set/hget on the same
Lettuce connection) would buffer until our await finished. Shrink the
critical section to just queue+flush: once flushCommands() returns, the
batch is on the wire and we can restore auto-flush and release the
pipelineLock before awaiting. A slow Redis now blocks only the calling
thread, not every concurrent caller using the shared connection.
Cancel-on-timeout and breaker accounting are unchanged.

SystemResource.getCacheStats (line 962 — noisy WARN when cache disabled):
CacheMetrics.getInstance() logs WARN every call when the metrics singleton
isn't initialized, which happens whenever CACHE_PROVIDER=none. An ops
dashboard polling /system/cache/stats on a cache-off deployment would
spam the log. Gate the metrics call on cacheProvider.available() so the
WARN never fires in that configuration. Stats payload still includes
provider-level fields; just no `metrics` key when cache is off.

SystemResource.scanCacheKeys (line 1006 — OpenAPI lies about count param):
Description claimed "bounded by the count parameter" but no count param
exists; scanCount() walks the full cursor. Rewrote the description to
state the actual safety mechanism: validateCachePattern enforces a
6-character literal prefix before any wildcard, so '*' and 'om:*' are
rejected at validation. Reflects what the endpoint actually does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): copilot review pass 3 — hot-path L1 check + lineage hash + cleanups

Eight comments from the latest copilot review on PR #28012:

1. SystemResource.getCacheStats: gate metrics on cacheConfig.provider != none
   instead of cacheProvider.available(). When Redis is configured but the
   circuit breaker is tripped, app-level counters are exactly what an
   operator needs to diagnose the outage — suppressing them while the
   provider is "down but configured" hides the diagnostic signal. Also
   downgrade CacheMetrics.getInstance() WARN → DEBUG so a poller loop
   doesn't spam logs in the entirely-normal cache-off state.

2. CachedReadBundle.getBatch contract: the method is documented as
   returning a list 1:1 with entityIds, but bypass returned
   Collections.emptyList(), so callers indexing by position would read
   past the end of the list. Return a same-size list of nulls under
   bypass so the positional contract holds regardless of cache state.

3+4. CacheBundle.invalidateEntity / Invalidatable.invalidate javadocs
   claimed they were called from EntityRepository.postUpdate / postDelete
   / restoreEntity. They are NOT (only postCreate, the pub-sub handler,
   and the admin endpoint reach this path). Updated both javadocs to
   reflect actual call sites so future Invalidatables aren't built on a
   wrong invalidation contract.

5+6. EntityRepository.find / findByName: check Guava L1 (getIfPresent)
   FIRST, NotFoundCache only on L1 miss. The previous shape consulted
   NotFoundCache before L1, adding one Redis GET per cached read — a
   regression on the hottest read path. L1 hit now serves with zero
   Redis traffic; the negative cache short-circuits only when the loader
   would otherwise pay for a DB / Redis-L2 round trip.

7. CachedLineage redesign: variants for one root now live as fields of a
   single Redis hash (HSET / HGET) instead of separate keys. Invalidate
   is one DEL, independent of keyspace size, instead of SCAN-and-iterate
   (O(N) over keyspace).
   This matters because invalidate fires on the hot write path (entity
   updates and lineage-edge mutations) and the SCAN cost grew linearly
   with cache size. CacheKeys.lineageGraphPattern is gone; new helpers
   are lineageGraphHash(rootId) and lineageGraphField(up, down, incDel).

8. SystemResource.invalidateCacheForEntity: when only fqn is supplied,
   resolve to id server-side via Entity.getEntityRepository(type).
   findByName(...) before fanning out. Id-keyed cache layers (lineage,
   CACHE_WITH_ID, NotFoundCache id-side) need the UUID; the previous
   shape silently skipped them. Lookup failures are logged at DEBUG and
   the request still proceeds with fqn-only invalidation — admin
   force-invalidate is best-effort by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): lineage hash TTL claimed only by first writer (EXPIRE NX)

Previous shape called `hset(hashKey, fields, ttl)` which translated to
HSET + EXPIRE under the hood. Every variant write therefore reset the
hash's expiry — variant A cached at T=0 with TTL=60, variant B cached at
T=55, and A's effective lifetime jumped to 115s instead of the intended
60s. Under a constant trickle of variant writes on a hot root, the
"stale" variant could effectively live forever.

Split the operation:

  - CacheProvider.hset(key, fields): new overload that leaves the TTL
    untouched. The default implementation applies a 365-day TTL so
    providers that don't override get a long-lived key rather than an
    immortal one.
  - CacheProvider.expireIfAbsent(key, ttl): EXPIRE … NX semantics, i.e.
    set the TTL only when the key has no prior expiry. The default
    implementation returns false (providers that can't express NX get
    no extension benefit, but no regression).
  - RedisCacheProvider implements both: HSET without expire, then
    EXPIRE with ExpireArgs.Builder.nx(). Falls back gracefully on
    Redis < 7.0 (logs at DEBUG, returns false).

CachedLineage.safeHset now uses the split shape — the first writer
to seed a hash establishes the 60s window; subsequent variant writes
leave the expiry alone.
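
The arithmetic above (60s intended, 115s observed) can be reproduced with
a tiny in-memory model of one key's expiry. The class is purely
illustrative and does not talk to Redis; it only contrasts "always reset"
with "set only if no expiry exists".

```java
// Hypothetical model of a single key's expiry timestamp, in seconds.
final class HashExpiryModel {
  private long expiresAtSec = -1; // -1 means the key has no expiry yet

  // Plain EXPIRE: every call resets the deadline.
  void expire(long nowSec, long ttlSec) {
    expiresAtSec = nowSec + ttlSec;
  }

  // EXPIRE with NX: only the first caller establishes the deadline.
  boolean expireIfAbsent(long nowSec, long ttlSec) {
    if (expiresAtSec >= 0) {
      return false;
    }
    expiresAtSec = nowSec + ttlSec;
    return true;
  }

  long expiresAtSec() {
    return expiresAtSec;
  }
}
```

With the plain form, writes at T=0 and T=55 leave the deadline at 115;
with the NX form the second write is a no-op and the deadline stays at 60.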

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): mget unavailable-path alignment + lineage deser fallback

Two copilot review comments on PR #28012:

RedisCacheProvider.mget (line 646): when `available == false` we returned
`Collections.emptyList()`, violating the 1:1 positional contract that
callers (CachedReadBundle.getBatch and friends) rely on. Match the
error-fallback branch: return one Optional.empty() per requested key so
caller-side indexing stays aligned regardless of provider health.
Truly-empty input keeps returning empty list (no positions to align).
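
A minimal sketch of the positional contract, assuming a map in place of
Redis and an `available` flag in place of the provider's health check; the
class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class PositionalBatch {
  // Always returns exactly one slot per requested key, in request order.
  static List<Optional<String>> mget(Map<String, String> store, boolean available, List<String> keys) {
    List<Optional<String>> out = new ArrayList<>(keys.size());
    for (String key : keys) {
      // When the provider is unavailable every slot is empty, but the slot still exists.
      out.add(available ? Optional.ofNullable(store.get(key)) : Optional.empty());
    }
    return out;
  }
}
```

Callers can then pair `keys.get(i)` with `result.get(i)` without checking
provider health first.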

LineageRepository.getLineage (line 1345): unconditional readValue on the
cached JSON would throw and fail the request if Redis held a
partial/corrupted/old-schema value — turning cache corruption into a
persistent 500 until TTL expiry. Wrap the deserialize in try/catch; on
failure log WARN with the root id and depth, invalidate the affected
root's lineage hash, and fall through to a fresh computeLineage(). User
sees the same answer as cache-off; subsequent requests repopulate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): expireIfAbsent falls back to plain EXPIRE on NX failure

The previous shape returned false silently when EXPIRE … NX wasn't
supported (Redis < 7.0 syntax error, transient failure). That meant the
preceding HSET-without-ttl call could leave the lineage hash key with no
expiry at all, accumulating in Redis memory until the next manual
invalidation.

Catch the NX failure, log at DEBUG, and issue a plain EXPIRE so the key
still gets a bounded lifetime. The trade-off: on older Redis, every
variant write extends the expiry — strictly worse than the NX semantics
on a 7.0+ deployment, but vastly better than the alternative of
permanent unbounded keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): copilot review pass 5 — dedicated mget conn + breaker + IT isolation + key collision

Five comments from the latest copilot review on PR #28012:

RedisCacheProvider.expireIfAbsent breaker bookkeeping (line 432, gitar-bot):
the NX-fallback path issued a plain EXPIRE without recordSuccess() /
recordFailure(), so a real network blip there was invisible to the
sliding-window failure detector. Both success and failure now feed the
breaker, consistent with every other Redis-calling method in the class.

RedisCacheProvider.mget shared-connection hazard (line 692): even with
pipelineLock, single-key callers using syncCommands/asyncCommands on the
*same* connection had their commands buffered for the duration of the
auto-flush-off window. Switched to a dedicated `pipelineConnection` /
`pipelineAsyncCommands` created at init time and closed on shutdown. The
shared connection's auto-flush is never toggled now, so unrelated request
paths can't be starved by mget. pipelineLock still serializes mget vs
mget on the dedicated connection.

SystemResource.invalidateCacheForEntity fqn→id resolution (line 1113):
the resolution call used `findByName(fqn, ALL, fromCache=true)`. That
path consults NotFoundCache and the L1/L2 caches, which an admin force-
invalidate is explicitly trying to recover from — a poisoned negative
entry would short-circuit the resolution and silently skip every id-keyed
cache layer. Switched to fromCache=false so the resolution always goes
to the DB; only then can we trust the id we hand to CacheBundle /
EntityRepository invalidation.

CachedSearchLayerIT.java parallel-execution flakiness (line 50): the
test assertions depend on deltas in the *global* /system/cache/stats
counters. Under @Execution(CONCURRENT) other ITs issuing searches in
parallel inflate the counters and the deltas either don't show up (false
negative) or come from someone else's hits (false positive that masks
broken cache keying). Marked @Isolated + ExecutionMode.SAME_THREAD so
the class runs alone within its window.

CachedSearchLayer.buildKey ambiguous encoding (line 220): fields were
joined with a raw `|` delimiter, no escaping. A query string containing
`|idx=foo` would produce the same preimage as a different (principal,
index, query) tuple — cache-key collision → wrong cached response served
to the wrong user. Added length-prefixed field encoding
(`name=<utf8-bytes>:value|`, with `<utf8-bytes>` the value's UTF-8 byte length); two distinct logical tuples can no longer
serialize to the same hash input.
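
The collision and the fix can be shown side by side. The field names
(`p`, `idx`, `q`) and the class are assumptions for illustration; only the
`name=<length>:value|` shape comes from the description above.

```java
import java.nio.charset.StandardCharsets;

final class KeyPreimage {
  // Old shape: raw delimiter, no escaping, so field boundaries are ambiguous.
  static String naive(String principal, String index, String query) {
    return principal + "|" + index + "|" + query;
  }

  // Length-prefixed field: the byte count tells a parser exactly where the value ends.
  static String field(String name, String value) {
    int byteLength = value.getBytes(StandardCharsets.UTF_8).length;
    return name + "=" + byteLength + ":" + value + "|";
  }

  static String prefixed(String principal, String index, String query) {
    return field("p", principal) + field("idx", index) + field("q", query);
  }
}
```

For the tuples `("alice", "table", "x|y")` and `("alice", "table|x", "y")`
the naive form yields the same string, `alice|table|x|y`, while the
prefixed form yields two different strings.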

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
2026-05-13 06:41:09 -07:00
actions Migrate Databricks from sqlalchemy-databricks to databricks-sqlalchemy (#26896) 2026-05-04 18:53:24 +05:30
ISSUE_TEMPLATE chore(github): migrate issue templates to structured forms (#27710) 2026-04-24 14:08:20 +02:00
scripts chore(github): migrate issue templates to structured forms (#27710) 2026-04-24 14:08:20 +02:00
trivy/templates Feat: Github Workflow Action for Scanning vulnerabilities using Trivy. (#19710) 2025-02-16 12:02:14 -08:00
workflows Cache improvements: lineage + search layers, observability, CI gate (#28012) 2026-05-13 06:41:09 -07:00
CODEOWNERS chore: update code owner for openmetadata-ui-core-components (#23616) 2025-09-29 19:57:34 +05:30
copilot-instructions.md chore(ingestion): drop pylint, expand ruff (#27774) 2026-04-28 07:21:59 +02:00
e2eLabeler.yml Show collapse for record type of topic entity (#16063) 2024-04-29 19:16:40 +05:30
labeler.yml Refactor: remove doc changes from OM repo (#22019) 2025-08-20 14:28:48 +05:30
pull_request_template.md docs(github): require issue link, design, tests, UI recording in PR template (#27891) 2026-05-07 08:05:56 +02:00
teams.yml CI - Update teams.yaml (#23943) 2025-10-17 15:59:34 +05:30