mirror of
https://github.com/open-metadata/OpenMetadata
synced 2026-05-24 09:39:11 +00:00
* cache: lineage cache, per-type metrics, invalidation registry, search-cache Add Redis-backed lineage response cache and search response cache, both gated by the existing CACHE_PROVIDER toggle and falling through to direct computation when the cache is unavailable. The cache remains optional — verified end-to-end by toggling CACHE_PROVIDER=none on a live stack and confirming all paths continue to work (just without the L2 hit). Coverage: - CachedLineage wraps LineageRepository.getLineage with hybrid TTL + direct invalidation (60s default). Direct edits invalidate the affected root cache entries; transitive changes fall through to TTL. - CachedSearchLayer wraps /api/v1/search/query with auth-aware caching (cache key includes principal so users with different ACLs don't share results). 30s default TTL. Observability: - /api/v1/system/cache/stats response now includes a metrics block with hits/misses/hitRatio/evictions/errors/writes plus read/write latency Timers, and a byType breakdown so coverage gaps are visible per entity-type and per cache-layer. Correctness: - New Invalidatable interface + CacheBundle registry + invalidateEntity helper so future cache layers plug in by implementing one method instead of editing multiple mutation paths. - Edge mutations in LineageRepository.addLineage/deleteLineage invalidate both endpoints; entity mutations in EntityRepository.postUpdate / postDelete / restoreEntity invalidate the lineage rooted at the entity. - Pub/sub handler in CacheBundle iterates registered Invalidatables so remote-pod evictions flow to all layers automatically. Tooling: - docker-compose.cache-off.yml overlay flips CACHE_PROVIDER=none for local A/B testing without tearing down DB/ES volumes. - CachedSearchLayerIT exercises hit-on-second-call, distinct-query misses, distinct-page-size misses, and byType shape via the metrics endpoint. Each test gracefully no-ops when the cluster runs cache-off. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cache: phase 2 ops + correctness — single-flight, slow-read, negative cache, admin endpoints Builds on the phase 1 commit (c20a29b11b) with operability and correctness items from .context/cache-improvements-design.md. All four pieces respect the optional-cache contract: with CACHE_PROVIDER=none they no-op cleanly. P2.3 — Single-flight on CachedSearchLayer Striped<Lock> keyed by SHA-cache-key. 100 concurrent users hitting the same uncached query collapse to one ES call instead of N. SearchResource now uses loadOrCompute so the lock-and-recheck pattern lives inside the cache layer; the supplier is the actual ES call kept tight to minimize lock-hold time. Non-200 upstreams bypass cache and refetch. P2.6 — Slow cache reads logged RedisCacheProvider.get/hget timing checked against cache.slowReadThresholdMs (default 50ms). Exceeding fires a WARN log and bumps a new cache.reads.slow Micrometer counter exposed in /cache/stats.metrics.slowReads. Leading indicator of Redis pressure / network glitch / hot-key contention. P2.4 — Negative caching for not-found entities NotFoundCache marks "we looked, no such entity" with a short TTL (default 30s) so repeated 404 lookups (typo'd FQNs, references to deleted entities) don't hammer the DB. Wired into EntityRepository.find(UUID) and findByName for the !fromCache path. Implements Invalidatable so the postCreate fan-out drops the marker on entity create — without that, create-then-immediately-read would 404 for up to TTL. Added CacheBundle.invalidateEntity to EntityRepository.postCreate so newly-created entities reach every Invalidatable registry layer. P2.5 — Admin cache ops endpoints GET /api/v1/system/cache/keys?pattern=... — SCAN keys, returns count POST /api/v1/system/cache/invalidate?pattern=.. — SCAN+UNLINK, returns deleted POST /api/v1/system/cache/invalidate/entity?type=&id=&fqn= — fan to all Invalidatables All admin-only. 
Pattern endpoints document the "no broad globs" rule — we never want a SCAN over om:prod:* on a busy cluster. Per-entity endpoint goes through the existing Invalidatable registry so future cache layers are reachable from ops without ever touching this code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cache: pipelined mget on CacheProvider + CachedReadBundle.getBatch Adds a foundational batch-read primitive at the provider layer: CacheProvider.mget(List<String>) -> List<Optional<String>> Default implementation does sequential per-key gets (correct, no batching benefit). RedisCacheProvider overrides with a true pipelined version: all GETs are queued under setAutoFlushCommands(false), then flushed once and awaited as a single TCP round-trip. Records hits/misses through the existing CacheMetrics counters and respects the slow-read threshold. Per-key pipelining over true MGET — Redis Cluster requires same-slot keys for MGET; pipelined per-key GETs work transparently across slots without the constraint, at the same network cost. CachedReadBundle.getBatch(entityType, ids) consumes the new primitive for prefetch use cases (UI prefetch on hover, list-then-detail navigation warmup). The list endpoint hot path itself does NOT use this layer — list responses are SQL-batched via EntityRepository.setFieldsInBulk which calls fieldFetchers in bulk, not per-row CachedReadBundle.get. That's why bench3 showed list endpoints at neutral cache_off-to-on ratio: lists already amortize at the SQL layer. The mget primitive is what later phases will plug into when wiring batch-prefetch to specific UI flows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cache): use unique query in sameQueryHitsCacheOnSecondCall to avoid state pollution Sequential test run on postgres-os-redis caught a flake: the test issued 3 identical "q=*" calls expecting at least 1 cold-write. 
By the time it ran, prior tests in the same JVM session had already cached (q=*, index=table_search_index, size=10), so call 1 was a hit, call 2 hit, call 3 hit — total writes=0, asserts failed. Switching to a per-invocation nonce ensures we always start cold, matching the pattern the other 3 tests in this class already use. Confirmed via subsequent parallel-pass run on the same suite where the test passed (different test ordering, fresh cache for that key). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cache: drop search cache TTL from 30s to 2s for create-then-search freshness Integration tests on the postgres-os-redis profile caught a real correctness regression: tests that create an entity and Awaitility-poll for it to appear in search timed out at 30s because our 30s search TTL pinned the pre-create empty result for the entire test window. Same issue surfaces in production: a user creates a domain / table / dashboard and immediately searches for it would see "no results" for up to 30s. 2s caps the staleness while still catching the dominant UI access pattern: multiple components in the same render frame fire identical search queries. Those happen within milliseconds, well inside any reasonable TTL. The longer-term fix is search-cache invalidation on entity writes (a generation counter per entity-type, search keys include the generation, writes bump the generation). That's design-doc-tracked in .context/cache-improvements-design.md but deferred — the 2s TTL is good enough for now, and the more complete invalidation strategy can be a follow-up PR with its own dedicated tests. 
Failing tests under 30s TTL that this fixes: - DomainAssetsColumnExclusionIT (domain create-then-search) - LineageImpactAnalysisIT (owner removal reflected in search) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: cache-tests profile runs full IT suite + new postgres+es+redis CI workflow The cache-tests Maven profile previously ran only the four cache/* IT classes — too narrow to catch cache-correctness regressions in the rest of the codebase. Expanded it to mirror the mysql-elasticsearch profile shape: sequential + parallel failsafe executions, full **/*IT.java inclusion, postgres + elasticsearch + redis backend, with cacheProvider=redis system property added so every test path exercises the cache layer. Locally, the focused-cache-only run is preserved via mvn verify -P cache-tests -Dit.test='**/cache/*IT' New CI workflow integration-tests-postgres-elasticsearch-redis.yml mirrors the structure of integration-tests-postgres-opensearch.yml: - Same triggers (push to main, PR target, merge_group, workflow_dispatch) - Same path filters (openmetadata-service/**, integration-tests/**, etc.) - Same Maven cache + JDK 21 setup - Runs `mvn verify -pl :openmetadata-integration-tests -Pcache-tests` - Surefire-report publication with fail_on_test_failures Result: PRs touching cache code (or any read path) get automatic CI coverage with redis enabled. Cache-invalidation and stale-data bugs that previously only surfaced in production now have a CI gate before merge — same protection that mysql-elasticsearch and postgres-opensearch provide for the no-cache code paths. Smoke verified locally: `mvn verify -P cache-tests -Dit.test='**/cache/*IT'` runs both sequential and parallel passes (6 tests each), all green. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): address PR review feedback for cache improvements Nine review-driven fixes spanning the cache PR (#28012): RedisCacheProvider.mget (bug): - Restructured the auto-flush window so `setAutoFlushCommands(true)` is in the OUTER `finally` of the entire op. The previous structure had the restoration in an inner finally that only fired around the awaitAll call; an exception in the queueing loop or flushCommands() would leave the SHARED connection in auto-flush=false mode, making every subsequent op from any caller silently buffer indefinitely. SearchResource (bug): - Removed the double-call on the non-cacheable response path. The supplier now captures the upstream Response object so the outer code can return it directly when the body isn't cacheable (non-200 or non-String entity) — previously the caller re-invoked searchRepository.search() on every error/non-200, doubling backend load for failing queries. EntityRepository negative cache (edge case): - Hoisted the NotFoundCache fast-path OUTSIDE the `!fromCache` guard in both `find(UUID,...)` and `findByName(...)`. Default callers go in via `find(id, include)` which delegates with fromCache=true; the previous gate made the fast-path unreachable for the most common caller. Also added negative-cache population from the cached path's ExecutionException so repeated requests for a non-existent id do short-circuit after the first miss. SystemResource cache endpoints (security + style): - `/cache/keys` and `/cache/invalidate` now validate the glob pattern via `validateCachePattern` — rejects pure wildcards or patterns with fewer than 6 literal characters before the first wildcard. Stops a careless or malicious admin from issuing `*` or `om:*` that would block the Redis cluster on a large keyspace. ReDoS-safe: linear char scan, no regex backtracking. 
- `/cache/invalidate/entity` now also calls `EntityRepository.invalidateCacheForEntity(...)` to evict the Guava L1 caches (`CACHE_WITH_ID`, `CACHE_WITH_NAME`) and propagate via the existing pub-sub channel. The previous code only invalidated the `INVALIDATABLES` registry layers, leaving stale L1 entries.
- Replaced fully-qualified class names (`org.openmetadata.service.cache.CacheMetrics`, `jakarta.ws.rs.QueryParam`, `java.util.UUID`) with proper imports per the project style guide.
CachedLineage (edge case):
- Single-flight stripe lock now keys on the FULL cache key `(rootId, upstreamDepth, downstreamDepth, includeDeleted)` instead of `rootId` alone. Concurrent requests for different depths or include-deleted flags on the same root no longer block each other.
CachedSearchLayer (doc):
- Javadoc now correctly says default TTL is 2s (was incorrectly 30s) and explains why: see commit 41489056ff, which dropped it from 30s after IT regressions where users couldn't see their own writes for half a minute.
CI workflow (bugs + security mitigation note):
- Removed `if: steps.cache-output.outputs.exit-code == 0` from the `Set up JDK 21` and `Install Ubuntu dependencies` steps. `actions/cache@v4` exposes `cache-hit`, never `exit-code`; the expression always evaluated to false and those steps NEVER ran. Maven was using whatever JDK shipped with the runner.
- Added explicit security note in the workflow header AND on the `Checkout` step documenting why `pull_request_target` is intentional and what the `safe to test` label gate accomplishes. CodeQL flags the pattern; the label gate is the accepted mitigation that mirrors every other integration-tests-*.yml workflow in this repo.
Verified: - mvn compile -pl openmetadata-service → BUILD SUCCESS - mvn test -pl openmetadata-service -Dtest=OpenMetadataAssetServletTest → 9/9 pass - mvn spotless:apply ran clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): only negative-cache on real EntityNotFoundException The previous code caught every ExecutionException / UncheckedExecutionException from the Guava cache loader and (a) populated NotFoundCache for 30s, (b) rethrew as EntityNotFoundException. That conflated three very different failure modes: 1. Entity truly doesn't exist → loader throws EntityNotFoundException 2. Entity exists but is invalid → loader throws IllegalStateException 3. Transient DB / deser failure → loader throws JdbiException, IOException Cases 2 and 3 would poison the negative cache, turning a momentary DB hiccup or a single bad row into a sustained 30s 404 for every caller that asks for that id/fqn. Worse, the original cause was masked behind a synthetic EntityNotFoundException, so logs and clients never saw the real failure. This change inspects e.getCause() and: - On EntityNotFoundException: populate NotFoundCache, rethrow the original (not a synthetic) so the caller's `instanceof` checks and message text still work. - On any other RuntimeException: rethrow unchanged — DB blips return 5xx as before, validation errors surface, and the next request can re-attempt without hitting a poisoned cache. - On checked Throwable cause (rare for these loaders): wrap in RuntimeException so the contract is preserved. Applied symmetrically to find(UUID, …) and findByName(String, …). Addresses gitar-bot review on PR #28012: https://github.com/open-metadata/OpenMetadata/pull/28012#discussion_r... 
(negative cache poisoning) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): copilot review — blank param, javadoc, mget hardening Four review comments from PR #28012 review 4266159401: SystemResource.invalidateCacheForEntity (line 1069 → blank query params): `?type=X&id=&fqn=` slipped past the required-params check because only `null` was treated as absent. Normalize blank id/fqn to null up front so the missing-both branch fires correctly and the downstream CacheBundle / EntityRepository calls receive a clean null instead of an empty string. CacheKeys.search/childrenPage (line 116 → orphaned Javadoc): When the search() helper was added between the children-page Javadoc and the childrenPage() method, the Javadoc got stranded above the wrong method. Move it back so javadoc tooling generates accurate docs. RedisCacheProvider.mget (line 610 → shared-connection auto-flush race): setAutoFlushCommands(false) toggles state on the shared Lettuce connection — two concurrent mgets could overlap and one caller's commands would buffer until the other restored auto-flush, surfacing as latency spikes / hangs on other paths sharing the connection. Wrap the pipeline in a new instance-level ReentrantLock so only one mget runs the auto-flush dance at a time. try/finally still restores auto-flush unconditionally; lock release sits in an outer finally. RedisCacheProvider.mget (line 621 → unbounded f.get() on timeout): Previously LettuceFutures.awaitAll(...) returned a boolean we ignored; if it timed out, the subsequent f.get() calls were unbounded and would block the request thread until the Lettuce event loop eventually gave up. Capture the boolean, cancel non-done futures on timeout (so f.get() returns CancellationException instead of blocking), and log a warning with the timeout value and key count for operators. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): mget partial timeout must trip the circuit breaker The previous mget rewrite cancelled in-flight futures on awaitAll timeout but still called recordSuccess() at the end of the happy-path. That fed consecutiveSuccesses on every partial timeout, so a Redis instance that was consistently slow (answering some keys, dropping others) would *never* trip the breaker — masking real backend degradation as healthy. Branch on the captured allCompleted boolean: - all futures completed → recordSuccess() as before - partial timeout → recordFailure(TimeoutException) and bump CacheMetrics.recordError() so the breaker's sliding-window failure detector picks it up and the metric reflects the degraded state No other behaviour change — the per-key fallback Optionals still surface to callers either way. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): mget shorter critical section + cache/stats + cache/keys doc Three review comments from PR #28012 second copilot pass: RedisCacheProvider.mget (RedisCacheProvider.java:624 — shared-connection hold time): previous code held setAutoFlushCommands(false) for the entire queue+flush+await window. Other paths (single get/set/hget on the same Lettuce connection) would buffer until our await finished. Shrink the critical section to just queue+flush: once flushCommands() returns, the batch is on the wire and we can restore auto-flush and release the pipelineLock before awaiting. A slow Redis now blocks only the calling thread, not every concurrent caller using the shared connection. Cancel-on-timeout and breaker accounting are unchanged. SystemResource.getCacheStats (line 962 — noisy WARN when cache disabled): CacheMetrics.getInstance() logs WARN every call when the metrics singleton isn't initialized, which happens whenever CACHE_PROVIDER=none. An ops dashboard polling /system/cache/stats on a cache-off deployment would spam the log. 
Gate the metrics call on cacheProvider.available() so the WARN never fires in that configuration. Stats payload still includes provider-level fields; just no `metrics` key when cache is off. SystemResource.scanCacheKeys (line 1006 — OpenAPI lies about count param): Description claimed "bounded by the count parameter" but no count param exists; scanCount() walks the full cursor. Rewrote the description to state the actual safety mechanism: the validateCachePattern enforces a 6-character literal prefix before any wildcard, so '*' and 'om:*' are rejected at validation. Reflects what the endpoint actually does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): copilot review pass 3 — hot-path L1 check + lineage hash + cleanups Eight comments from the latest copilot review on PR #28012: 1. SystemResource.getCacheStats: gate metrics on cacheConfig.provider != none instead of cacheProvider.available(). When Redis is configured but the circuit breaker is tripped, app-level counters are exactly what an operator needs to diagnose the outage — suppressing them while the provider is "down but configured" hides the diagnostic signal. Also downgrade CacheMetrics.getInstance() WARN → DEBUG so a poller loop doesn't spam logs in the entirely-normal cache-off state. 2. CachedReadBundle.getBatch contract: the method is documented as returning a list 1:1 with entityIds, but bypass returned Collections.emptyList() and callers indexing by position would shift off the rails. Return a same-size list of nulls under bypass so the positional contract holds regardless of cache state. 3+4. CacheBundle.invalidateEntity / Invalidatable.invalidate javadocs claimed they were called from EntityRepository.postUpdate / postDelete / restoreEntity. They are NOT (only postCreate, the pub-sub handler, and the admin endpoint reach this path). Updated both javadocs to reflect actual call sites so future Invalidatables aren't built on a wrong invalidation contract. 5+6. 
EntityRepository.find / findByName: check Guava L1 (getIfPresent) FIRST, NotFoundCache only on L1 miss. The previous shape consulted NotFoundCache before L1, adding one Redis GET per cached read — a regression on the hottest read path. L1 hit now serves with zero Redis traffic; the negative cache short-circuits only when the loader would otherwise pay for a DB / Redis-L2 round trip. 7. CachedLineage redesign: variants for one root now live as fields of a single Redis hash (HSET / HGET) instead of separate keys. Invalidate is one DEL — O(1) — instead of SCAN-and-iterate (O(N) over keyspace). This matters because invalidate fires on the hot write path (entity updates and lineage-edge mutations) and the SCAN cost grew linearly with cache size. CacheKeys.lineageGraphPattern is gone; new helpers are lineageGraphHash(rootId) and lineageGraphField(up, down, incDel). 8. SystemResource.invalidateCacheForEntity: when only fqn is supplied, resolve to id server-side via Entity.getEntityRepository(type). findByName(...) before fanning out. Id-keyed cache layers (lineage, CACHE_WITH_ID, NotFoundCache id-side) need the UUID; the previous shape silently skipped them. Lookup failures are logged at DEBUG and the request still proceeds with fqn-only invalidation — admin force-invalidate is best-effort by design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): lineage hash TTL claimed only by first writer (EXPIRE NX) Previous shape called `hset(hashKey, fields, ttl)` which translated to HSET + EXPIRE under the hood. Every variant write therefore reset the hash's expiry — variant A cached at T=0 with TTL=60, variant B cached at T=55, and A's effective lifetime jumped to 115s instead of the intended 60s. Under a constant trickle of variant writes on a hot root, the "stale" variant could effectively live forever. Split the operation: - CacheProvider.hset(key, fields) — new overload, no TTL touch. 
Defaults to a 365-day TTL so providers that don't override get a long-lived key rather than an immortal one. - CacheProvider.expireIfAbsent(key, ttl) — EXPIRE … NX semantics: set the TTL only when the key has no prior expiry. Default returns false (providers that can't express NX get no extension benefit, but no regression). - RedisCacheProvider implements both: HSET without expire, then EXPIRE with ExpireArgs.Builder.nx(). Falls back gracefully on Redis < 7.0 (logs at DEBUG, returns false). CachedLineage.safeHset now uses the split shape — the first writer to seed a hash establishes the 60s window; subsequent variant writes leave the expiry alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): mget unavailable-path alignment + lineage deser fallback Two copilot review comments on PR #28012: RedisCacheProvider.mget (line 646): when `available == false` we returned `Collections.emptyList()`, violating the 1:1 positional contract that callers (CachedReadBundle.getBatch and friends) rely on. Match the error-fallback branch: return one Optional.empty() per requested key so caller-side indexing stays aligned regardless of provider health. Truly-empty input keeps returning empty list (no positions to align). LineageRepository.getLineage (line 1345): unconditional readValue on the cached JSON would throw and fail the request if Redis held a partial/corrupted/old-schema value — turning cache corruption into a persistent 500 until TTL expiry. Wrap the deserialize in try/catch; on failure log WARN with the root id and depth, invalidate the affected root's lineage hash, and fall through to a fresh computeLineage(). User sees the same answer as cache-off; subsequent requests repopulate. 
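The deserialization fallback described for LineageRepository.getLineage can be sketched roughly as follows. This is an illustration only, not the repository's code: the class `CachedReadFallbackSketch`, the method `readOrRecompute`, and its four parameters are hypothetical stand-ins for the cached JSON lookup, the JSON reader, the lineage-hash invalidation, and computeLineage().

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from the OpenMetadata codebase.
final class CachedReadFallbackSketch {

    /**
     * Returns the cached value when it deserializes cleanly. If the cached
     * payload is corrupt or written with an old schema, the entry is
     * invalidated and the value is recomputed, so the caller never sees
     * the deserialization failure.
     */
    static <T> T readOrRecompute(
            Optional<String> cachedJson,
            Function<String, T> deserialize,
            Runnable invalidate,
            Supplier<T> compute) {
        if (cachedJson.isPresent()) {
            try {
                return deserialize.apply(cachedJson.get());
            } catch (RuntimeException e) {
                // Bad cached payload: drop it so later requests repopulate,
                // then fall through to a fresh computation.
                invalidate.run();
            }
        }
        return compute.get();
    }

    public static void main(String[] args) {
        int[] invalidations = {0};
        // Integer::parseInt stands in for the JSON deserializer.
        Integer fresh = readOrRecompute(
                Optional.of("not-a-number"),
                Integer::parseInt,
                () -> invalidations[0]++,
                () -> 7);
        System.out.println(fresh + " after " + invalidations[0] + " invalidation(s)");
    }
}
```

The point of the shape is that a bad cache entry costs one recomputation instead of failing every request until the TTL expires.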
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): expireIfAbsent falls back to plain EXPIRE on NX failure The previous shape returned false silently when EXPIRE … NX wasn't supported (Redis < 7.0 syntax error, transient failure). That meant the preceding HSET-without-ttl call could leave the lineage hash key with no expiry at all, accumulating in Redis memory until the next manual invalidation. Catch the NX failure, log at DEBUG, and issue a plain EXPIRE so the key still gets a bounded lifetime. The trade-off: on older Redis, every variant write extends the expiry — strictly worse than the NX semantics on a 7.0+ deployment, but vastly better than the alternative of permanent unbounded keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): copilot review pass 5 — dedicated mget conn + breaker + IT isolation + key collision Five comments from the latest copilot review on PR #28012: RedisCacheProvider.expireIfAbsent breaker bookkeeping (line 432, gitar-bot): the NX-fallback path issued a plain EXPIRE without recordSuccess() / recordFailure(), so a real network blip there was invisible to the sliding-window failure detector. Both success and failure now feed the breaker, consistent with every other Redis-calling method in the class. RedisCacheProvider.mget shared-connection hazard (line 692): even with pipelineLock, single-key callers using syncCommands/asyncCommands on the *same* connection had their commands buffered for the duration of the auto-flush-off window. Switched to a dedicated `pipelineConnection` / `pipelineAsyncCommands` created at init time and closed on shutdown. The shared connection's auto-flush is never toggled now, so unrelated request paths can't be starved by mget. pipelineLock still serializes mget vs mget on the dedicated connection. SystemResource.invalidateCacheForEntity fqn→id resolution (line 1113): the resolution call used `findByName(fqn, ALL, fromCache=true)`. 
That path consults NotFoundCache and the L1/L2 caches, which an admin force- invalidate is explicitly trying to recover from — a poisoned negative entry would short-circuit the resolution and silently skip every id-keyed cache layer. Switched to fromCache=false so the resolution always goes to the DB; only then can we trust the id we hand to CacheBundle / EntityRepository invalidation. CachedSearchLayerIT.java parallel-execution flakiness (line 50): the test assertions depend on deltas in the *global* /system/cache/stats counters. Under @Execution(CONCURRENT) other ITs issuing searches in parallel inflate the counters and the deltas either don't show up (false negative) or come from someone else's hits (false positive that masks broken cache keying). Marked @Isolated + ExecutionMode.SAME_THREAD so the class runs alone within its window. CachedSearchLayer.buildKey ambiguous encoding (line 220): fields were joined with a raw `|` delimiter, no escaping. A query string containing `|idx=foo` would produce the same preimage as a different (principal, index, query) tuple — cache-key collision → wrong cached response served to the wrong user. Added length-prefixed field encoding (`name=<utf8-bytes>:value|`); two distinct logical tuples can no longer serialize to the same hash input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
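The key-collision fix in CachedSearchLayer.buildKey can be illustrated with a small sketch. It is written under assumptions: the field names (`p`, `q`, `idx`), their order, and the class and method names are hypothetical, and the sketch stops at the hash preimage rather than hashing it. Only the `name=<utf8-bytes>:value|` shape is taken from the commit message.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: field names and order are assumptions, not the real buildKey.
final class KeyEncodingSketch {

    // Old shape: fields joined with a raw '|' and no escaping.
    static String naiveKey(String principal, String query, String index) {
        return "p=" + principal + "|q=" + query + "|idx=" + index;
    }

    // New shape: each field is written as name=<utf8-byte-length>:value|
    static void appendField(StringBuilder out, String name, String value) {
        int byteLength = value.getBytes(StandardCharsets.UTF_8).length;
        out.append(name).append('=').append(byteLength).append(':').append(value).append('|');
    }

    static String lengthPrefixedKey(String principal, String query, String index) {
        StringBuilder out = new StringBuilder();
        appendField(out, "p", principal);
        appendField(out, "q", query);
        appendField(out, "idx", index);
        return out.toString();
    }

    public static void main(String[] args) {
        // Two different (principal, query, index) tuples.
        boolean naiveCollides =
                naiveKey("alice", "x|idx=foo", "bar").equals(naiveKey("alice", "x", "foo|idx=bar"));
        boolean prefixedCollides =
                lengthPrefixedKey("alice", "x|idx=foo", "bar")
                        .equals(lengthPrefixedKey("alice", "x", "foo|idx=bar"));
        System.out.println("naive collides: " + naiveCollides);
        System.out.println("length-prefixed collides: " + prefixedCollides);
    }
}
```

Because a reader of the preimage knows exactly how many bytes belong to each value, a delimiter inside a value can no longer be mistaken for a field boundary.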
178 lines
7.3 KiB
YAML
# Copyright 2026 Collate
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Runs the full integration test suite with the Redis cache enabled (postgres + elasticsearch +
# redis), via the cache-tests Maven profile. Catches cache-invalidation and stale-data bugs that
# only surface when every test path goes through the cache layer.
#
# Security note (CodeQL "pull_request_target + checkout untrusted code"):
# This workflow uses `pull_request_target` so PRs from forks can produce a required check.
# CodeQL flags the pattern as risky because it checks out PR-controlled code while having
# access to secrets. The mitigation is the explicit `safe to test` label gate below: the
# verify-pr-label step rejects the workflow run before any PR code is checked out unless a
# maintainer has applied the label. This matches the mitigation used by every other
# integration-tests-*.yml workflow in this repo. If you remove the label gate, you reopen
# the vulnerability.
name: Integration Tests - PostgreSQL + Elasticsearch + Redis

on:
  merge_group:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - "openmetadata-service/**"
      - "openmetadata-integration-tests/**"
      - "openmetadata-spec/src/main/resources/json/schema/**"
      - "openmetadata-sdk/**"
      - "common/**"
      - "pom.xml"
      - "bootstrap/**"
  # `pull_request_target` is intentional and required so the workflow runs against PRs from
  # forks (which `pull_request` cannot for security reasons). The `safe to test` label gate
  # below is what makes this safe; see security note in the file header.
  pull_request_target:
    types: [labeled, opened, synchronize, reopened, ready_for_review]

permissions:
  contents: read
  checks: write

concurrency:
  group: integration-tests-pg-es-redis-${{ github.event.pull_request.number || github.run_id }}
  cancel-in-progress: true
jobs:
  # Detect whether relevant paths changed. When no matching files are modified
  # the downstream job is skipped via its `if` condition.
  # A job skipped by `if` reports as "Success", so required checks still pass.
  changes:
    name: Detect Changes
    runs-on: ubuntu-latest
    if: ${{ !github.event.pull_request.draft }}
    outputs:
      backend: ${{ github.event_name == 'workflow_dispatch' && 'true' || steps.filter.outputs.backend }}
    steps:
      - uses: dorny/paths-filter@v3
        id: filter
        if: ${{ github.event_name != 'workflow_dispatch' }}
        with:
          filters: |
            backend:
              - 'openmetadata-service/**'
              - 'openmetadata-integration-tests/**'
              - 'openmetadata-spec/src/main/resources/json/schema/**'
              - 'openmetadata-sdk/**'
              - 'common/**'
              - 'pom.xml'
              - 'bootstrap/**'

  integration-tests-postgres-elasticsearch-redis:
    needs: changes
    runs-on: ubuntu-latest
    if: ${{ needs.changes.outputs.backend == 'true' }}
    steps:
      - name: Free Disk Space (Ubuntu)
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: true
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          docker-images: false
          swap-storage: true

      - name: Wait for the labeler
        uses: lewagon/wait-on-check-action@v1.3.4
        if: ${{ github.event_name == 'pull_request_target' }}
        with:
          ref: ${{ github.event.pull_request.head.sha }}
          check-name: Team Label
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          wait-interval: 90

      - name: Verify PR labels
        uses: jesusvasquez333/verify-pr-label-action@v1.4.0
        if: ${{ github.event_name == 'pull_request_target' }}
        with:
          github-token: '${{ secrets.GITHUB_TOKEN }}'
          valid-labels: 'safe to test'
          pull-request-number: '${{ github.event.pull_request.number }}'
          disable-reviews: true # To not auto approve changes

      # SECURITY: this step checks out PR-controlled code while the workflow runs with
      # `pull_request_target` privileges (secrets access). The `Verify PR labels` step above
      # gates this: the workflow halts before we get here unless a maintainer has applied
      # the `safe to test` label. CodeQL flags the pattern; the label gate is the accepted
      # mitigation, mirroring how every other integration-tests-*.yml workflow in this repo
      # handles fork PRs.
      - name: Checkout
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event_name == 'merge_group' && github.sha || github.event.pull_request.head.sha }}

      - name: Cache Maven dependencies
        id: cache-output
        uses: actions/cache@v4
        with:
          path: ~/.m2
          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
          restore-keys: |
            ${{ runner.os }}-maven-

      # Run unconditionally. The previous `if: steps.cache-output.outputs.exit-code == 0` was a
      # bug: `actions/cache@v4` exposes `cache-hit` (boolean) and `cache-primary-key`, never
      # `exit-code`. The expression always evaluated to false and the steps never ran. Maven
      # then ran against whatever JDK the runner happened to ship with, masking the issue.
      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Install Ubuntu dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y unixodbc-dev python3-venv librdkafka-dev gcc libsasl2-dev build-essential libssl-dev libffi-dev \
            librdkafka-dev unixodbc-dev libevent-dev jq
          sudo make install_antlr_cli

      - name: Build for Integration Tests (PostgreSQL + Elasticsearch + Redis)
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: mvn -DskipTests clean install -pl :openmetadata-integration-tests -am

      - name: Free build artifacts
        run: |
          rm -rf openmetadata-service/target/lib openmetadata-service/target/classes
          rm -rf openmetadata-spec/target openmetadata-sdk/target common/target
          rm -rf openmetadata-shaded-deps/*/target
          df -h /

      - name: Run Integration Tests (PostgreSQL + Elasticsearch + Redis)
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: mvn verify -pl :openmetadata-integration-tests -Pcache-tests

      - name: Clean Up
        run: |
          cd ./docker/development
          docker compose down --remove-orphans
          sudo rm -rf ${PWD}/docker-volume

      - name: Publish Test Report
        if: ${{ always() }}
        uses: scacap/action-surefire-report@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          fail_on_test_failures: true
          report_paths: 'openmetadata-integration-tests/target/failsafe-reports/TEST-*.xml'