Commit graph

296 commits

Author SHA1 Message Date
Sriharsha Chintalapani
6128f6a786
Perf/redis cache metrics and indexes (#27499)
* perf(cache): wire Redis metrics, fix REST GET cache path, cache ReadBundle

Three changes that make the Redis cache actually earn its keep on the
hot read path:

PR1: Observability + safety
- Wire CacheMetrics into RedisCacheProvider so hits/misses/errors/latency
  surface on /prometheus (recorders existed but were never called).
- Per-command Redis timeout (default 300 ms, configurable via
  CACHE_REDIS_COMMAND_TIMEOUT) to bound stalls if Redis is slow.
- Pipeline the relationship-invalidate loop into a single DEL.
- Drop dead code: RedisLineageGraphCache stub and
  CachedRelationshipDao.{list, batchGetRelationships}.

PR1.5: Make REST GET consult the cache at all
- EntityResource.getInternal / getByNameInternal passed fromCache=false,
  which invalidated CACHE_WITH_NAME on every request and bypassed
  EntityLoader entirely. Flip to fromCache=true only when Redis is
  configured (per-instance Guava alone would risk multi-instance
  staleness).
- Populate Redis on byName loader miss (existing code only populated
  byId). Cross-instance reads now warm.

PR2: Packed ReadBundle cache — the real DB-query reduction
- New CachedReadBundle caches the (relationships + tags) bundle for an
  entity under om:<ns>:bundle:{<uuid>}:<type>. Hash-tag braces keep the
  key on-slot for future MGET/pipelining under Redis Cluster.
- EntityRepository.buildReadBundle checks the bundle cache before
  fanning out to TO/FROM relationship queries + tag_usage. On miss,
  does the existing DB work and writes the DTO.
- EntityRepository.invalidateCache deletes the bundle key.

Measured on the dev Docker stack (200 seeded tables w/ owners, tags,
domains, followers), 500 iters, 50-table rotation, warm caches:

  no-cache:        p50 7.33 ms  p95 10.79 ms  p99 13.61 ms  128 req/s
  warm+redis (PR2) p50 4.11 ms  p95  5.24 ms  p99  6.31 ms  239 req/s
                   (-44% p50, -51% p95, -54% p99, +86% throughput)

Per-request DB query count 13 -> 2 on warm GETs. Bundle-cache hit rate
~85% during the run. PATCH invalidates the bundle as expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): cross-instance cache invalidation via Redis pub/sub

Per-instance Guava caches (CACHE_WITH_ID, CACHE_WITH_NAME) diverge across
replicas when one instance writes and others keep serving stale data until
the 30 s expireAfterWrite kicks in. Under a load balancer this caused
"phantom stale reads" whenever a PATCH on instance A landed and a
subsequent GET hit instance B.

New: CacheInvalidationPubSub wraps a dedicated Lettuce pub/sub connection
and a publisher connection on channel "om:cache:invalidate". Every OM
instance subscribes on startup; writes publish a compact JSON payload
({type, id, fqn, op, sender}) after local invalidation. Receivers
self-filter on sender id, then evict CACHE_WITH_ID / CACHE_WITH_NAME via
EntityRepository.onRemoteCacheInvalidate and drop the bundle key.

Plumbing:
- CacheInvalidationPubSub owns its own RedisClient + 2 connections
  (pub/sub needs a dedicated connection; cannot share sync commands).
  Modeled after the existing RedisJobNotifier.
- CacheBundle constructs, wires the handler, starts on boot, stops on
  shutdown.
- EntityRepository.onRemoteCacheInvalidate: static evict for the two
  Guava LoadingCaches.
- EntityRepository.invalidateCache (delete path) and
  EntityUpdater.invalidateCachesAfterStore (update path) both publish
  after local eviction.
- Guava expireAfterWrite (30 s) stays as a lost-message backstop.

Verified with two OM instances (new docker-compose.multiserver.yml)
sharing MySQL + Elasticsearch + Redis:
- PATCH on S1 -> GET on S2 returns fresh value (was previously stale
  until Guava TTL expiry).
- PATCH on S2 -> GET on S1 returns fresh value.
- redis-cli MONITOR shows:
    PUBLISH om:cache:invalidate
    {"type":"table","id":"<uuid>","fqn":"<fqn>","op":"update",
     "sender":"<host>:<pid>:<startMs>"}

Known limits this PR does not fix:
- Fire-and-forget delivery; dropped pub/sub messages fall back to the
  30 s Guava TTL. Redis Streams with consumer cursors is the upgrade
  path if we see drops.
- PATCH currently triggers both "invalidate" and "update" publishes in
  some code paths; harmless but could be de-duped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): single-flight stampede protection on bundle cache

A cold bundle miss previously caused 3 DB queries per request. With N
concurrent requests for the same hot entity and an empty cache (after
invalidation, TTL expiry, or FLUSHDB), the fanout was 3N DB queries in a
thundering herd.

CachedReadBundle now exposes three primitives backed by Redis SETNX:

  tryAcquireLoadLock(type, id)     -> SET NX EX loadLockTtlMs
  releaseLoadLock(type, id)        -> DEL
  waitForConcurrentLoad(type, id)  -> poll GET until loadLockWaitMs

buildReadBundle uses them on the cold-miss path:
- Exactly one caller acquires the lock and runs the existing DB fetch +
  cache populate.
- Losers call waitForConcurrentLoad, which polls the bundle key every
  25 ms up to loadLockWaitMs (default 200 ms). On populate they read the
  cached value like any cache hit. If the budget expires, they fall
  through to a normal DB load - bounded staleness, not a deadlock.
- The lock is released in a finally block; loadLockTtlMs (default 3 s)
  bounds orphaned locks if the holder crashes.

Verified with docker compose stack and a 25-way concurrent burst after
FLUSHDB:

  Redis MONITOR during cold burst (excerpted):
    SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX      <-- one wins
    SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX      <-- others
    SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX         lose
    SET om:dev:bundle:{<id>}:table:loading "1" EX 3 NX
    ...
    DEL om:dev:bundle:{<id>}:table:loading                  <-- holder releases

  Cold 25-burst  db_queries=63  (~2.5 per request)
  Warm 25-burst  db_queries=50  (~2 per request, 25 cache hits / 0 misses)

Without single-flight the cold burst would have been ~325 DB queries
(25 * 13 per-request cold cost). Net a 5x reduction on the stampede
scenario.

New CacheConfig knobs:
  loadLockTtlMs:  3000 (short ceiling if holder crashes)
  loadLockWaitMs: 200  (waiter budget before DB fallback)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): rewrite warmup with bulk SQL + pipelined Redis writes

The old CacheWarmupApp took hours on even modest installs because it:
- Iterated entities via repository.find(Include.ALL) (triggers full
  ReadBundle fan-out per row).
- Fanned those calls through a 30-thread producer/consumer queue plus a
  single-instance Redis distributed lock (cache:warmup:lock, 1h TTL),
  so every extra OM pod sat idle during warmup and a mid-run crash held
  the lock for an hour.
- Issued N individual Redis writes per entity with no pipelining.

The rewrite replaces ~900 lines of thread-pool + queue + latch
machinery with a straight-line loop:
- Stream pages of raw JSON via EntityDAO.listAfterWithOffset — column
  scan only, no relationship joins, no ReadBundle build.
- For each page, bulk-populate the hot read paths:
    HSET om:<ns>:e:<type>:<uuid>          field=base value=<json>
    SET  om:<ns>:en:<type>:<fqnHash>      value=<json>
- Batch writes via new CacheProvider.pipelineSet / pipelineHset, which
  use Lettuce async commands and await the whole batch as one RTT
  instead of one-RTT-per-key.
- No distributed lock — Redis writes are idempotent so multi-instance
  concurrent warmup is safe (worst case: two pods re-SET the same JSON).

Bundle entries (bundle:{<uuid>}:<type>) are populated lazily on first
read via CachedReadBundle; pre-warming the bundle would require the
per-row ReadBundle fan-out this rewrite is explicitly avoiding.

Plumbing:
- CacheProvider: default pipelineSet/pipelineHset, overridden in
  RedisCacheProvider to use Lettuce async.
- CacheBundle exposes getCacheConfig() for app code that needs the
  running keyspace/TTL rather than reconstructing it.

Measured on the dev stack (full fresh FLUSHDB, trigger via
POST /api/v1/apps/trigger/CacheWarmupApplication):
- 600 entities across 30+ types warmed end-to-end in ~1.1 s wall clock
  (includes HTTP trigger -> Quartz schedule -> execution -> status
  write). The per-entity-type phase is sub-50 ms for small types.
- 1201 Redis keys populated (600 entities x base + byName).
- Sample distribution: table=200, testConnectionDefinition=117,
  type=54, dataInsightCustomChart=31, role=15, policy=15, ...

Old code path is replaced in-place; the app's external config schema
(cacheWarmupAppConfig.json) and trigger endpoint are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): cache certification + container refs, 0 DB queries per warm GET

Close out the last two DB queries firing on the warm-cache path.

1. Certification cache (bundle)

The AssetCertification lookup used getCertTagsInternalBatch — a second
query on tag_usage that fetched exactly the rows batchFetchTags had
already loaded and then discarded. Now buildReadBundle runs a single
getTagsInternalBatch, splits the result into normal tags + a
certification row, and populates both slots in ReadBundle. Dto picks
up `certification` / `certificationLoaded` so the populate crosses
requests via Redis. getCertification() reads from
ReadBundleContext.getCurrent() on the fast path.

2. Container / parent reference cache

Href assembly for a table GET still fired one findFrom to resolve
"who contains this database" (TableRepository.setDefaultFields when
the table row doesn't have service embedded). Added a dedicated Redis
key per (child, relationship):

  om:<ns>:parent:{<childId>}:<childType>:<relationOrdinal>  -> EntityReference JSON

getFromEntityRef(..., fromEntityType=null, ...) checks the cache,
populates on miss. CachedRelationshipDao gets get/put/invalidate
container helpers. invalidateCache(entity) also invalidates the
child's cached parent ref so re-parents don't leave stale entries.
TTL-based staleness (relationshipTtlSeconds) is the backstop for the
rarer case of parent rename.

3. Bundle Dto

  public AssetCertification certification;
  public boolean certificationLoaded;

Persisted and restored symmetrically with relations/tags.

Measured on the dev stack, 50-table rotation, 500 iters, enriched
with owners+tags+domains+followers:

  Before this commit (warm Redis, bundle cache on):
    p50 4.11 ms  p95 5.24 ms  p99 6.31 ms  239 req/s
    DB queries per warm GET: 2
      1x getCertTagsInternalBatch
      1x findFrom(database) for service lookup

  After this commit (warm Redis):
    p50 2.95 ms  p95 3.76 ms  p99 4.50 ms  331 req/s
    DB queries per warm GET: 0
    cache hit ratio during bench: 100%

  No-cache baseline (unchanged):
    p50 7.26 ms  p95 10.68 ms  p99 13.76 ms  130 req/s

End-to-end from no-cache to this commit: -59% p50, -65% p95, -67% p99,
+155% throughput, 13 -> 0 DB queries per GET on the hot read path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): fix write-through shape + tighten invalidation on updates

Two bugs exposed by a cache-coherence audit on updates:

1. Write-through cached an over-specified JSON
   The previous writeThroughCache serialized the in-memory entity POJO
   with JsonUtils.pojoToJson(entity). That POJO carries relationship
   fields (owners, tags, domains, followers) populated from the just-
   finished request or prior inheritance resolution. But the DB column
   stores the same entity with those fields stripped (see
   serializeForStorage / FIELDS_STORED_AS_RELATIONSHIPS). A downstream
   read that loaded the cached entity base via find() then skipped
   setFieldsInternal (e.g. Entity.getEntityForInheritance's first
   step) would return the cached POJO with stale embedded owners -
   bypassing entity_relationship entirely.

   Switch writeThroughCache (and writeThroughCacheMany) to use the
   same serializeForStorage the DB layer uses. Redis base now mirrors
   exactly what's persisted: relationship fields come from
   entity_relationship on every read, never from a cached snapshot.

2. Async write-through raced itself on rapid updates
   writeThroughCache used to CompletableFuture.runAsync on a shared
   executor, re-reading from the DB. Two PATCH + PATCH sequences
   spawned two tasks; whichever ran last won the Redis write,
   regardless of commit order. Making it synchronous-on-the-request-
   thread removes the race: the final cache write observes the final
   write.

3. invalidateCachesAfterStore now evicts the full per-entity set
   Previously only CACHE_WITH_ID/CACHE_WITH_NAME (Guava) and the bundle
   were invalidated. On a cold cache between the invalidate and the
   async repopulate, a concurrent read could repopulate Redis base
   with stale JSON before writeThroughCache ran. The invalidation now
   also drops:
     - om:<ns>:e:<type>:<id> and om:<ns>:en:<type>:<fqnHash>
     - owners/domains fields on the relationship hash
     - the container-ref cache for this child (parent may have changed)

4. Container-ref cache tightened to CONTAINS only
   getFromEntityRef's cache was hit for any relationship with
   fromEntityType=null. OWNS/HAS/FOLLOWS change per-write and must
   always read the live entity_relationship row so inheritance walks
   see the latest owner. Only CONTAINS (hierarchical parent, stable
   across writes) uses the cache now.

Validation (single-instance, Redis enabled):

  om-cache-validate.sh: 8/8 PASS, including:
    - PATCH description read-after-write (by name and by id)
    - Owner update reflected immediately
    - Add follower visible on next read
    - Table inherits owner from database via schema with no owner
    - Table picks up NEW inherited owner after database owner changes
    - Delete removes entity; subsequent GET returns 404

Known edge case documented: tight-loop alternating PATCH(parent) +
GET(child-inheriting) within a few milliseconds can observe one-step-
old inherited value. Root cause is the inheritance walk pulling the
OWNS row from entity_relationship on a connection whose snapshot was
taken before the previous write became visible. Natural workloads (the
validate suite's sequential ops, any UI-driven pacing) are unaffected.
Fixing this cleanly requires either a per-write fsync barrier on
reads or a deeper MVCC re-architecture; deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cache): add Redis testcontainer support + mysql-elasticsearch-redis profile

Lets integration tests run against an ephemeral Redis so we can surface
any IT that breaks when the cache layer is active.

TestSuiteBootstrap:
- New cacheProvider system property (default: none). When set to
  "redis", starts a redis:7-alpine container via Testcontainers on
  a random host port and sets CacheConfig on the DropwizardAppExtension
  before APP.before() runs.
- Per-run keyspace (om🇮🇹<startMs>) keeps parallel suite runs from
  colliding if they share a Redis host.
- Container is registered in the existing cleanup chain.

pom.xml:
- New profile `mysql-elasticsearch-redis`. Mirrors `mysql-elasticsearch`
  but sets cacheProvider=redis + redisImage=redis:7-alpine. Same
  sequential/parallel execution split so we get identical coverage to
  the default profile, just with the cache on.

Usage:

  mvn -pl openmetadata-integration-tests \
      -Pmysql-elasticsearch-redis verify

Other existing profiles (mysql-elasticsearch, postgres-opensearch,
postgres-elasticsearch, mysql-opensearch, postgres-rdf-tests) are
untouched; they default to cacheProvider=none and no Redis container
is started, so no regression in CI run time for non-cache profiles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): invalidate stale cache entries on rename cascade and direct DAO writes

Writes that bypass EntityRepository.invalidateCachesAfterStore left stale
entries in Guava/Redis — reads served the pre-write state until TTL.

Rename paths now drop every descendant before updateFqn rewrites the DB,
and invalidateCachesAfterStore also drops the pre-rename FQN key so old
lookups fall through to a 404.

Direct dao.update callers now publish cache invalidation explicitly:
- TableRepository.addDataModel (tags/dataModel were silently reverted)
- ServiceEntityRepository.addTestConnectionResult
- PersonaRepository.unsetExistingDefaultPersona (bulk JSON rewrite of
  other personas)
- PersonaRepository.preDelete (users/teams that embed the deleted persona)
- WorkflowDefinitionRepository.suspend/resume
- EntityRepository.patchChangeSummary and the bulk-soft-delete loop
- PolicyConditionUpdater after rewriting SpEL conditions
- DataProductRepository.updateName and bulk domain migration (every asset
  with an embedded data-product reference needs its bundle refreshed)

Drops Redis IT-suite cache-coherence failures from 40 to 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): invalidate cache entries on batched CSV import updates

updateManyEntitiesForImport wrote the new JSON straight to Redis but never
dropped the per-instance Guava (CACHE_WITH_ID / CACHE_WITH_NAME) or bundle
caches, so a GET immediately after CSV import could still see the pre-import
tags, owners, and domains until TTL expired.

Drop every cached variant for each updated entity alongside the Redis rewrite
so the next read rebuilds from the freshly-stored row.

Fixes DatabaseSchemaResourceIT.test_importCsv_withApprovedGlossaryTerm_succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): lowercase user FQN in name-based cache loader

UserDAO.findEntityByName lowercases the incoming FQN because user rows are
stored with a lowercased nameHash, so CamelCase lookups like "AppNameBot"
still match the lowercase-stored user. The cache loader called dao.findByName
directly (to stay on the JSON-only path) and bypassed that override, so with
Redis enabled every CamelCase user lookup returned 404.

Mirror the same case-fold in EntityLoaderWithName for user types.

Fixes AppsResourceIT.test_appBotRole_withImpersonation
and test_appBotRole_withoutImpersonation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(it): raise PrometheusResourceIT timeouts for loaded CI runs

5s read timeout was flaking under concurrent IT load: the admin port
competes for threads with the main app, and collecting full Prometheus
snapshots takes >5s when many tests hit the JVM at once. Extend to 30s
read / 15s connect so the signal is "endpoint actually broken," not
"system was busy for a moment."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(it): raise TagResourceIT search-index timeout to 90s

test_searchTagByClassificationDisplayName waited 30s for the tag to appear
in the tag_search_index. Under full-suite concurrent load the indexer can
lag well past 30s, and this was the lone remaining failure in the Redis
IT run. Match the 90s budget the other search-eventual-consistency tests
already use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(search): default entityStatus to Unprocessed in search index doc

The generated POJOs don't apply the status.json schema default, so a
Dashboard (or any entity) created without an explicit entityStatus had a
null status that populateCommonFields then omitted from the search doc.
PopulateCommonFieldsTest.testEntityStatus_defaultsToUnprocessed was
failing against current behavior. Emit "Unprocessed" as the explicit
fallback so search consumers and aggregations can filter on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(it): retry BaseEntityIT testBulkFluentAPI verification under load

The PATCH is synchronous on the server but parallel IT traffic sometimes
stalls the subsequent GET long enough for the test to observe the
pre-update description before the fresh row is served. Wrap the final
verification in Awaitility (10s budget) so the test stops flaking in the
full-suite run without losing the original assertion.

Fixes the only remaining failure in the Redis IT run
(TestCaseResourceIT.testBulkFluentAPI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(it): raise TestCaseResourceIT awaitility timeouts to 90s

test_incidentReopensAsNewAfterResolveAndNewFailure and other incident/
resolution-status tests used 30s Awaitility windows that were insufficient
under full-suite parallel load. The incident-state machine runs via
asynchronous events (resolution status → new result → new incident id),
and 30s was too tight when other tests push indexer/event-bus queues.

Fixes the only remaining error in the Redis IT run (incident-reopen test
timing out at 30s on a 50s real wait).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(it): raise BaseEntityIT checkCreatedEntity search-index timeout to 180s

Under full parallel load the ElasticSearch async indexer queue backs up
past the previous 90s budget — the test took 90.7s then timed out on a
real indexing race. Extend to 180s to swallow that tail without dropping
the assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(it): extend testBulkFluentAPI retry window to 60s

The 10s retry still timed out for NotificationTemplateResourceIT under
full parallel load. Match the 60s budget other inherited IT retries use.
The PATCH itself is sub-second; the budget absorbs pub-sub fan-out and
indexer queue tails, not the write itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(testCase): retry bulk logical-suite insert on MySQL deadlock

addAllTestCasesToLogicalTestSuite runs a full-table SELECT + INSERT IGNORE
that acquires gap locks across test_case. Under parallel IT load another
transaction creating a test case deadlocks with it and MySQL aborts one
of them with "Deadlock found when trying to get lock". The test was
genuinely failing, not just a flaky assertion.

Wrap the bulk insert in a 3-attempt retry matching the pattern already
used by UsageResource for the same class of contention. Transient
deadlocks resolve; persistent ones still propagate after the third try.

Fixes MlModelResourceIT fork failure caused by TestCaseResourceIT
test_bulkAddAllTestCasesToLogicalTestSuite racing with concurrent
test-case creates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(it): raise TestCaseResourceIT awaitility timeouts to 180s

90s was still insufficient under full parallel load for the incident
reopen flow — the test took 110s waiting for the new incident id to
materialize. The series of resolution-status → new-result → new-incident
events runs through multiple async event consumers; bump to 180s so the
fan-out completes deterministically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): address PR review — Postgres portability, single-flight, URI reuse

- listIdFqnByPrefixHash: dual @ConnectionAwareSqlQuery for MySQL
  (JSON_UNQUOTE/JSON_EXTRACT) and Postgres (json->>) so the name-hash
  LIKE scan runs on both backends.
- CachedReadBundle: drop Redis SETNX busy-poll + null-DTO waiter spin.
  Use Guava Striped<Lock> keyed by (type, id) so concurrent readers on
  one instance collapse to one DB load without Redis round-trips; cross
  instance races remain coherent because Redis SET is idempotent.
  EntityRepository.buildReadBundle takes/releases the stripe lock in a
  try/finally around the cache populate.
- RedisURIFactory: single shared builder used by RedisCacheProvider and
  CacheInvalidationPubSub so both interpret redis url / auth / SSL /
  database config identically.
- RedisCacheProvider.awaitAll: use LettuceFutures.awaitAll so the whole
  pipeline batch shares one timeout instead of accumulating per-future
  timeouts.
- mvn spotless:apply follow-ups across a few unrelated files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(cache): address PR review — rediss:// SSL, pipeline error handling, stale comments

- RedisURIFactory: carry parsed.isSsl() forward when rebuilding the
  builder from a redis:// / rediss:// URL. Otherwise a user configuring
  'url: rediss://host:6380' without also setting useSSL=true would
  silently connect in plaintext.
- RedisCacheProvider.awaitAll: capture the LettuceFutures.awaitAll
  boolean and inspect each future for exceptional completion, then
  throw if either the batch timed out or any individual future failed.
  Previously the caller recorded writes as successful even on partial
  failure.
- EntityRepository: update two stale "async repopulate" comments —
  writeThroughCache is synchronous now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(jdbi): extract DeadlockRetry utility with resilience4j backoff

Replace TestCaseRepository's inline retry loop with a reusable
DeadlockRetry helper keyed to the transaction boundary. Retries live in
resilience4j so backoff runs on a scheduled executor instead of
Thread.sleep blocking the request thread. Exponential base 50 ms ×
2^(attempt-1) with 50% jitter over 4 attempts.

DeadlockRetry must wrap a @Transaction-annotated call so each retry
replays the whole unit of work in a fresh JDBI transaction — a per-DAO
retry would leave earlier writes in the rolled-back txn lost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): log root cause of first Redis pipeline failure

awaitAll counted per-future exceptions but never surfaced what actually
broke. On a batch failure operators had a count and a timeout but no
way to tell NOSCRIPT / OOM / connection-reset apart. Capture the first
underlying cause, log it once, and attach it as the cause of the
thrown IllegalStateException.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot review — counters, lock leak, txn retry, gating

- CacheWarmupApp: pass per-page deltas to updateEntityStats so stored
  totals don't double-count as cumulative counters grow page-over-page.
- EntityRepository.buildReadBundle: hold the striped load-lock through
  the whole fetch/populate path instead of only the final populate
  step. An exception in fetchTo/From/Tags/Votes/Extensions/prefetch
  previously leaked the lock and stalled later readers on the same
  (type, id).
- TestCaseRepository.addAllTestCasesToLogicalTestSuite: split public
  entry point from the @Transaction method and wrap DeadlockRetry
  outside the transaction boundary so each retry runs in a fresh txn.
- EntityResource.isDistributedCacheEnabled: also check
  CacheProvider.available() so a failed or disconnected Redis doesn't
  leave REST GETs serving stale Guava reads across instances.
- DeadlockRetry Javadoc: corrected — resilience4j's executeSupplier
  is synchronous; the calling thread waits between attempts. Matches
  the SearchRetryUtil pattern already in use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): address review — health-check, pipeline failure accounting, deterministic warmup, by-name invalidation

- RedisCacheProvider: flip `available=false` from command catches + background PING health
  check that recovers the flag when Redis comes back. Prevents stale-read divergence in
  multi-instance deployments after a Redis outage.
- CacheWarmupApp: surface pipeline failures — no longer count rows toward success when the
  Redis batch write threw. Set FAILED status when cache is unavailable at startup so the job
  record doesn't stay RUNNING. Replace "user" string literal with Entity.USER.
- EntityDAO.listAfterWithOffset: add ORDER BY id so warmup pagination is deterministic
  (was prone to skip/duplicate rows between pages).
- RedisURIFactory: normalize bare host/host:port through RedisURI.create so IPv6 hosts and
  malformed inputs fail cleanly instead of blowing up split(":").
- invalidateCacheForEntity(..., null) left by-name cache entries stale in
  Persona/DataProduct/Domain. Added invalidateCacheForReferencedEntity(record) helper that
  extracts fullyQualifiedName from the relationship record JSON; PersonaDAO now has a
  (id, fqn) variant used before the bulk default-unset so both cache variants evict.

* fix(cache): abort warmup when provider flips to unavailable mid-run

A prior batch that trips the Redis provider to available=false causes
pipelineSet/Hset calls in subsequent iterations to silently return (their
`if (!available) return;` guard fires). The try-block then completes
without exception, and the success counter still adds pageSuccess — so
rows get reported as warmed even though nothing was written to Redis.

Check `cacheProvider.available()` at the top of each page iteration and
bail out. The background health checker flips availability back when
Redis recovers; operators rerun the app to resume warmup from a clean
state rather than relying on mid-outage bookkeeping.

* fix(cache): address two new Copilot findings — PubSub leak + deadlock chain walk

- CacheInvalidationPubSub.start() set `running=true` via CAS, then allocated
  RedisClient/subConnection/pubConnection. If any step after the first
  allocation threw, the catch only flipped `running=false` — leaving half-
  initialized Lettuce client + connections dangling. stop() would then
  short-circuit on the flag and never clean them up. Extract a
  closeResources() helper called from both the catch and stop() so the
  client/connections are released on partial failure.
- DeadlockRetry.isDeadlock walked to the deepest cause and only checked that
  leaf. The Javadoc promises "or any cause in its chain". When the SQLException
  is wrapped in UnableToExecuteStatementException and the connection-release
  throws a non-SQLException wrapper, the leaf is no longer the SQLException
  and real deadlocks silently skip the retry. Walk every link (with a guard
  against self-referential cycles) and return true if any link matches.

* fix(cache): two more Copilot findings — user FQN case-fold + awaitAll future cancel

- EntityLoaderWithName lowercased the DB lookup for `user` types but the
  Guava CACHE_WITH_NAME key was still the caller-provided fqn. `Alice@x.com`
  and `alice@x.com` produced split cache entries, and invalidations written
  against the canonical lowercased form left the mixed-case entry serving
  stale data until TTL. Added a `cacheNameKey(entityType, fqn)` helper that
  lowercases for user and passes through otherwise, applied at all 10
  CACHE_WITH_NAME access sites (get + invalidate).
- awaitAll threw on batch timeout but left futures still-in-flight. Over
  repeated timeouts the Lettuce event loop accumulates pending response
  slots and dispatcher work. Added `cancel(false)` for any non-done future
  on the failure path and reported the cancelled count in the thrown ISE.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: mohitdeuex <mohit.y@deuexsolutions.com>
Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-04-23 12:18:53 +02:00
Mohit Yadav
25fda478ba
fix: memory hardening to prevent OOMKill under concurrent load (#27397)
* fix: memory hardening to prevent OOMKill under concurrent ingestion load

Convert Guava caches from count-based to weight-based eviction to cap
total heap consumed. Bound unbounded queues and thread pools that could
grow without limit under load. Cap per-request entity cache, strip full
entity data from ChangeEvents, add LIMIT to unbounded SQL queries, and
set a 50MB JSON input size constraint.

Key changes:
- EntityRepository CACHE_WITH_ID/NAME: maximumSize(20K) -> maximumWeight(200MB)
- GuavaLineageGraphCache: maximumSize(100) -> maximumWeight(100MB)
- SubjectCache, SettingsCache, RBAC cache: weight-based eviction
- EntityLifecycleEventDispatcher: bounded queue (5000) + CallerRunsPolicy
- EventPubSub: bounded ThreadPoolExecutor(4-32) replacing unbounded CachedThreadPool
- RequestEntityCache: LRU cap at 50 entries per thread
- ChangeEvent: lightweight entity ref instead of full entity embedding
- CollectionDAO.listUnprocessedEvents: added LIMIT 1000
- JsonUtils: maxStringLength capped at 50MB (was Integer.MAX_VALUE)
- WebSocketManager: cleanup empty user maps on disconnect
- BULK_JOBS: reduced retention from 1h to 5min, capped at 100 concurrent
- Default heap bumped from 1G to 2G with G1GC and HeapDumpOnOOM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* revert: remove createLightweightEntityRef — preserve entity type safety in ChangeEvents

The Map-based lightweight ref broke type safety and downstream code
expecting typed entities. Reverted all .withEntity() calls back to
passing the original entity. The ChangeEvent already carries entityId,
entityType, and entityFullyQualifiedName as separate fields, so the
full entity embedding can be addressed separately with a proper
withEntityRef() approach.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address code review — TOCTOU race, weigher accuracy, serialization cost, event pagination

- BULK_JOBS: synchronized check-then-put to eliminate TOCTOU race
- CacheWeighers.stringWeigher: account for UTF-16 (2 bytes/char + 40B overhead)
- Replace jsonSerializationWeigher with toStringWeigher to avoid full JSON
  serialization on every cache put (was hitting SubjectCache and SettingsCache)
- Revert LIMIT 1000 on listUnprocessedEvents(offset) — the sole caller uses
  it for counting unprocessed events and doesn't paginate, so the LIMIT would
  silently undercount. The paginated overload already exists for bounded fetching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use weight-based 100MB cap for entity caches, delete CacheWeighers, add memory tests

The two entity JSON caches (CACHE_WITH_ID, CACHE_WITH_NAME) are the only
caches storing arbitrarily large values (1KB to 2MB+). A count-based
maximumSize can never be safe — 1000 × 2MB = 2GB, 20K × 2MB = 40GB.

For String values, `length() * 2 + 40` is the exact Java heap cost
(UTF-16 encoding + object header). This is a single field read, zero
allocation, and mathematically precise — not an estimate.

Changes:
- CACHE_WITH_ID/NAME: maximumWeight(100MB) with inline string weigher
- Delete CacheWeighers utility — weigher is now inlined, no indirection
- Other caches: keep maximumSize with conservative counts (values are
  small fixed-size objects where count-based eviction is appropriate)
- Add EntityCacheMemoryTest proving:
  * Count-based cache with 500 × 500KB entities consumes 249MB
  * Weight-based cache correctly evicts to stay within 100MB cap
  * Mixed sizes: 2MB entities correctly evict smaller entries
  * String weigher formula is mathematically exact

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add integration test proving entity cache memory behavior under load

EntityCacheMemoryIT runs against a real server to validate:

1. concurrentLargeTableFetches_heapStaysBounded: Creates 30 tables with
   300 columns each (~100-500KB JSON per entity), then 5 concurrent
   clients hammer GET /api/v1/tables by ID and FQN repeatedly. Asserts
   that >95% of fetches succeed (server stays alive) and heap growth is
   bounded under 500MB (proves cache cap works).

2. largeTableJsonSize_isSignificant: Creates a 300-column table, fetches
   it, serializes to JSON, and measures the size. Asserts JSON > 50KB,
   then projects that 20K entries at this size would consume >500MB —
   proving the old maximumSize(20000) config is dangerous.

Heap measurement uses the /prometheus endpoint (jvm_memory_used_bytes
with area="heap") for real server-side metrics, not client-side Runtime.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: make cache sizes configurable via openmetadata.yaml

Add CacheConfiguration with env-var-overridable settings for all cache
groups. Caches that don't have a specific override fall back to defaults.

Configuration in openmetadata.yaml:
  cache:
    defaultMaxSizeBytes: 50MB        # fallback for unspecified caches
    defaultTTLSeconds: 300
    entityCacheMaxSizeBytes: 100MB   # CACHE_WITH_ID, CACHE_WITH_NAME
    entityCacheTTLSeconds: 30
    lineageCacheMaxEntries: 50       # lineage graph cache
    lineageCacheTTLSeconds: 300
    authCacheMaxEntries: 5000        # SubjectCache (user context + policies)
    authCacheTTLSeconds: 120

Entity caches and auth caches are rebuilt at startup via initCaches()
once the configuration is loaded. Fields are volatile to ensure
visibility across threads during the swap.

Customers with large heap (e.g., Myntra with 12GB) can tune:
  ENTITY_CACHE_MAX_SIZE_BYTES=500000000  # 500MB for better hit rates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: resolve Jackson property name conflict for cache configuration

Rename field/getter from cacheConfiguration/getCacheConfiguration() to
cacheMemoryConfiguration/getCacheMemoryConfiguration() to avoid
conflicting with the existing getCacheConfig() (Redis cache provider).
Jackson infers property name from getter, so both resolved to "cache".

YAML key is now "cacheMemory:" to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore SubjectCache TTLs to prevent UserResourceIT flaky failure

The testUserContextCachePerformance test asserts >30% cache hit
improvement. Our initCaches() was replacing the USER_CONTEXT_CACHE TTL
from 15 minutes to 2 minutes (the policies TTL), making cache entries
expire too fast for the test's sub-millisecond timing to detect a
difference.

Fix: keep original TTLs hardcoded (2 min for policies, 15 min for user
context) since they serve different freshness needs. Only max entries
is configurable via authCacheMaxEntries. Restore USER_CONTEXT_CACHE
default to 10000 (User objects are small, original was fine).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address all PR review comments

Review fixes:
- WebSocketManager: use computeIfPresent for atomic disconnect cleanup
- BULK_JOBS: move capacity check before async scheduling, throw
  WebApplicationException(429) instead of RuntimeException(500)
- Entity cache comments: "exact" → "conservative upper-bound" (Java 21
  compact strings may use fewer bytes)
- EntityCacheMemoryTest: @Tag("benchmark") to exclude from CI, replace
  flaky heap assertions with deterministic payload accounting
- EntityCacheMemoryIT: @Isolated + @Tag("benchmark"), sum all heap pool
  samples from Prometheus, remove Runtime fallback, handle unavailable
  metrics gracefully
- JsonUtils: clarify comment as "~50M chars" not "50 MB"
- Remove dead config fields (defaultMaxSizeBytes, defaultTTLSeconds,
  lineageCacheMaxEntries, lineageCacheTTLSeconds) — not wired to code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore GuavaLineageGraphCache to use config.getMaxCachedGraphs()

The hardcoded maximumSize(50) was silently ignoring the
LineageGraphConfiguration setting while the log still reported the
config value — misleading. Restored to config.getMaxCachedGraphs()
(default 100) which is already safe since put() rejects graphs above
the mediumGraphThreshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address @pmbrull review — named constants, RBAC cache via config

Pere's review comments:
1. EntityRepository:312 "shouldnt this be part of the config too?"
   → Default values now reference CacheConfiguration.DEFAULT_* constants
   instead of inline magic numbers. initCaches() overrides at startup.

2. CacheConfiguration:37 "how did we come up with this default?"
   → Added Javadoc on each constant explaining the rationale (100MB safe
   for 2-8GB heap, 30s TTL matches original, 5000 entries for small objects).

3. OpenSearchSearchManager:113 "why is this not managed via config?"
   → RBAC cache now configurable via cacheMemory.rbacCacheMaxEntries
   env var RBAC_CACHE_MAX_ENTRIES (default 5000). Added initRbacCache()
   called from app startup.

4. RequestEntityCache:28 "what are the magic numbers?"
   → Extracted INITIAL_CAPACITY, LOAD_FACTOR, ACCESS_ORDER as named
   constants. Added Javadoc on MAX_ENTRIES_PER_REQUEST explaining the
   50-entry cap rationale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address Copilot review — Semaphore for bulk jobs, plain Cache for RBAC, @Valid config

1. BULK_JOBS: Replace synchronized+ConcurrentHashMap with Semaphore for
   thread-safe concurrency limiting. tryAcquire() is atomic, release()
   in whenComplete ensures permits are always returned.

2. RBAC cache: Switch from LoadingCache with null-returning CacheLoader
   to plain Cache<String, Query>. The CacheLoader was dead code — all
   callers use get(key, Callable). Null returns from CacheLoader would
   throw InvalidCacheLoadException.

3. CacheConfiguration: Add @Valid to the cacheMemory field in
   OpenMetadataApplicationConfig and initialize inline so @Min
   constraints are enforced by Bean Validation at startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: rewrite EntityCacheMemoryIT as diagnostic with per-phase heap breakdown

The previous 500MB hard assertion was too tight — total heap growth
includes non-cache overhead (change events, search indexing, request
buffers, thread stacks, GC pressure). 744MB growth for 30 large tables
with concurrent fetching is expected server-wide, not just cache.

New test structure:
- Takes heap snapshots at each phase (baseline, schema setup, table
  creation, sequential fetches, concurrent storm, 5s settle)
- Logs a full diagnostic report with per-phase growth breakdown
- Dumps JVM memory pool details from Prometheus (per-pool used/max,
  buffer memory, GC live data, thread count)
- Asserts only on what matters: >95% fetch success rate (server alive)
- Heap growth is logged for analysis, not hard-asserted

This lets us see WHERE the 744MB goes — is it table creation (change
events), sequential fetches (cache fill), or the concurrent storm
(request amplification)?

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* perf: eliminate deepCopy in RequestEntityCache — store JSON strings instead

RequestEntityCache previously called JsonUtils.deepCopy() on both put()
and get(), creating ~990KB of allocation per 247KB entity interaction
(deepCopy on put + deepCopy on get). This was the largest contributor
to the 12.7x memory amplification per entity in the createOrUpdate path.

Fix: store JSON strings (immutable, safe to share) instead of entity
objects. put() serializes once to JSON, get() deserializes back. No
defensive copying needed since strings are immutable.

Measured improvement (30 tables × 300 columns, 5 concurrent fetchers):
  Before (deepCopy):  702MB retained after settle, +407MB total growth
  After (JSON cache): 434MB retained after settle, +325MB total growth
  GC live data:       232MB (vs 200MB cache budget — only 32MB overhead)
  Improvement:        268MB less retained heap (38% reduction)

The table creation phase went from +340MB to -88MB (GC could reclaim
during creation since RequestEntityCache no longer holds deepCopy'd
objects).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add per-entity allocation budget to memory diagnostic report

The diagnostic test now reports exactly where memory goes for each
entity creation and fetch, based on code path tracing:

Per-table create (247KB entity, 300 columns):
  DB storage (serializeForStorage):           ~247KB
  Search indexing (buildSearchIndexDoc):       ~1394KB
    ├─ getMap(entity) full entity→Map:         ~494KB
    ├─ pojoToJson(searchDoc) Map→JSON:         ~247KB
    └─ indexTableColumns (300 cols × 3KB):     ~900KB
  ChangeEvent (entity embedded + serialized):  ~494KB
  Redis write-through (dao.findById):          ~247KB
  RequestEntityCache (pojoToJson):             ~247KB
  Other (relations, inheritance):              ~150KB
  TOTAL PER TABLE:                             ~2.7MB (~11x amplification)

Per-fetch (GET /api/v1/tables):
  Guava cache hit → readValue(JSON):           ~495KB
  setFieldsInternal (10+ DB queries):          ~50KB
  RequestEntityCache put (pojoToJson):         ~247KB
  HTTP response serialization:                 ~247KB
  TOTAL PER FETCH:                             ~1MB

30 creates + 900 fetches = ~81MB creates + ~913MB transient fetch allocs.
GC live data after settle: 247MB (only 47MB above 200MB cache budget).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: RBAC cache null handling and semaphore permit leak on submission failure

1. RBAC cache: Guava Cache forbids null values — Cache.get(key, Callable)
   throws InvalidCacheLoadException if Callable returns null. The RBAC
   evaluator returns null when no RBAC query is needed. Fixed by using
   getIfPresent() + manual put() instead of get(key, Callable), and
   skipping the filter when the query is null.

2. Bulk job semaphore: permit was acquired before supplyAsync() but if
   the executor rejects the task (AbortPolicy + full queue), the permit
   was never released because whenComplete was never registered. Wrapped
   task submission in try/catch to release on failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update docker/docker-compose-openmetadata/env-mysql

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update docker/docker-compose-openmetadata/env-postgres

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-17 14:51:16 +02:00
Sid
0a98f5bf32
test(playwright): add nightly SSO login spec starting (#27164)
* test(playwright): add nightly SSO login spec starting with Okta

Extends Playwright coverage end-to-end for SSO login flows. Today's SSO
coverage (Features/SSOConfiguration.spec.ts) only asserts the config
form UI. This adds a new suite that configures OpenMetadata to an
external identity provider, drives a real login through the provider's
hosted UI, and validates the resulting session against the OM API.

Phase 1 ships Okta only (integrator-9351624.okta.com). Additional
providers (Auth0, Azure, Cognito, SAML, Google) plug into the same
dispatcher by adding a ProviderHelper implementation.

## What's new

- playwright/e2e/Auth/SSOLogin.spec.ts — two-test suite tagged @sso
  1. Asserts the SSO sign-in button renders on /signin with the correct
     brand label and that the basic-auth form is not shown.
  2. Clicks the button, drives the provider's login widget, follows the
     OAuth callback, completes first-run self-signup when needed,
     lands on /my-data, then verifies the JWT by calling
     GET /api/v1/users/loggedInUser and asserting the returned email
     matches SSO_USERNAME.

- playwright/utils/ssoAuth.ts — provider-agnostic orchestration:
  applyProviderConfig (PUT /api/v1/system/security/config),
  restoreBasicAuth, buildAuthContextFromJwt, verifyLoggedInUserMatches.
  Composes existing getApiContext/getAuthContext/getToken helpers — no
  token extraction or HTTP plumbing is reimplemented.

- playwright/utils/sso-providers/{index,okta}.ts — ProviderHelper
  interface plus the Okta Identity Engine widget driver. Defaults the
  dev tenant values from the committed openmetadata.yaml snippet so the
  spec only needs SSO_USERNAME/SSO_PASSWORD to run locally.

- playwright/constant/ssoAuth.ts — env var key constants,
  PROVIDER_BUTTON_TEXT map, and the BASIC_AUTH_CONFIG payload used for
  cleanup.

- playwright.config.ts — new 'sso-auth' project matching
  playwright/e2e/Auth/**/*.spec.ts with its own serial workers, and
  '**/Auth/**' added to the chromium project's testIgnore so these
  tests never run in the default suite.

## How provider switching works

beforeAll logs in as admin via basic auth, captures the admin JWT via
getToken(page) BEFORE the swap, then PUTs the Okta config. The admin
JWT survives the provider swap because OM's internal JWKS stays in
publicKeyUrls and the admin user's isAdmin flag is persisted in the DB.
afterAll rebuilds an API context from that JWT and restores basic auth,
making the spec fully idempotent — the same OM instance can run the
suite repeatedly without any manual cleanup.

## Running locally

    export SSO_PROVIDER_TYPE=okta
    export SSO_USERNAME='<okta-test-user>'
    export SSO_PASSWORD='<okta-test-password>'
    npx playwright test playwright/e2e/Auth/SSOLogin.spec.ts \
      --project=sso-auth --workers=1

Verified end-to-end against integrator-9351624.okta.com — both tests
pass in ~12s on an already-provisioned user, ~14s on first-run
self-signup. Cleanup leaves the server in basic-auth mode.

## Notes for reviewers

- The existing .github/workflows/playwright-sso-tests.yml already wires
  up the CI matrix and secret names; this change intentionally does
  NOT enable the cron schedule. That lands in a follow-up once one
  provider is stable for a few nightly runs.
- OKTA_SSO_CLIENT_ID / OKTA_SSO_DOMAIN / OKTA_SSO_PRINCIPAL_DOMAIN env
  vars can override the baked-in dev tenant defaults if a different
  Okta tenant is used in CI.

* ci: add dedicated SSO Login Nightly workflow

Adds .github/workflows/playwright-sso-login-nightly.yml, a standalone
workflow that runs the new SSOLogin spec nightly at 03:00 UTC instead
of piggy-backing on playwright-sso-tests.yml.

The existing playwright-sso-tests.yml is left untouched — it still
covers the SSO configuration form UI via SSOConfiguration.spec.ts and
its matrix/secrets wiring is unchanged. The new workflow complements
it with a real end-to-end login round-trip:

- Schedule: cron '0 3 * * *'
- Provider matrix: okta only for Phase 1 (extended as helpers ship)
- Invokes playwright/e2e/Auth/SSOLogin.spec.ts under the new
  sso-auth Playwright project with workers=1
- Wires provider credentials via secrets with the existing
  {PROVIDER}_SSO_USERNAME / {PROVIDER}_SSO_PASSWORD convention plus
  optional OKTA_SSO_CLIENT_ID / OKTA_SSO_DOMAIN /
  OKTA_SSO_PRINCIPAL_DOMAIN overrides
- Uses the shared setup-openmetadata-test-environment composite
  action, PostgreSQL, ingestion disabled — matching the existing SSO
  tests workflow
- Uploads the HTML report as an artifact on every run and cleans up
  the docker stack in a final always-run step

* refactor(playwright): simplify ssoAuth helpers

- verifyLoggedInUserMatches now asserts directly on the lowercased
  email field instead of building a candidate array and feeding it a
  long stringified failure message. The assertion failure already
  shows expected vs received, so the wrapper string was just noise.

- Drop buildAuthContextFromJwt — it was a one-line wrapper around
  getAuthContext. The spec calls getAuthContext directly now.

* refactor(playwright): address SSO suite review feedback

- Extract OM_BASE_URL from PLAYWRIGHT_TEST_BASE_URL (with the same
  http://localhost:8585 default as playwright.config.ts) and export
  it from constant/ssoAuth.ts. okta.ts and BASIC_AUTH_CONFIG both
  consume it, so callbackUrl, the OM JWKS entry in publicKeyUrls, and
  the basic-auth restore payload all match the test target — including
  CI runs against non-default hosts.

- Drop PROVIDER_BUTTON_TEXT. It was exported but never imported; the
  ProviderHelper.expectedButtonText field is the only source of truth
  for the SSO sign-in button label and the spec already reads from it.

- Restore the OM convention adminPrincipals: ['admin'] in the Okta
  config (matches conf/openmetadata.yaml's AUTHORIZER_ADMIN_PRINCIPALS
  default). The previous code was granting admin to whichever IdP user
  ran the suite — verifyLoggedInUserMatches only needs an authenticated
  session, not admin, so the elevation was unnecessary. This also drops
  the now-unused requireEnv on SSO_USERNAME inside okta.ts; the spec
  itself still gates on the env var via test.skip.

- Set workers: 1 on the sso-auth Playwright project. fullyParallel:
  false alone wasn't enough — the global workers: 3 on CI could still
  fan out across multiple Auth/**/*.spec.ts files in the future. The
  explicit limit enforces full isolation as more provider specs land.

* ci: avoid CodeQL "Excessive Secrets Exposure" in SSO Login Nightly

Replaces the dynamic secret lookup

    secrets[format('{0}_SSO_USERNAME', upper(matrix.provider))]

with a static reference

    secrets.OKTA_SSO_USERNAME

CodeQL flagged the dynamic indexing because GitHub Actions can only
mask & scope secrets that are referenced statically. With a computed
key, the runner has no way to know which single secret is needed and
conservatively materializes EVERY org and repo secret into the step's
environment — even though the test only reads OKTA_SSO_*. Static
references let GitHub expose only the two credentials this step
actually uses.

Phase 1's matrix is okta-only so the change is two lines. The added
inline comment documents the convention for future providers: add a
sibling step gated by `if: matrix.provider == '<provider>'` with that
provider's static secret references — do not bring back the
secrets[format(...)] pattern.

* refactor(playwright): capture/restore real security config in SSO suite

- Snapshot /system/security/config in beforeAll, restore exact payload in
  afterAll instead of PUTting a hand-rolled basic-auth baseline (preserves
  allowedDomains, forceSecureSessionCookie, adminPrincipals, etc.)
- Strip ldap/saml subtrees from the snapshot: GET returns empty-string
  placeholders the PUT validator rejects
- Require OKTA_SSO_{CLIENT_ID,DOMAIN,PRINCIPAL_DOMAIN} via getRequiredEnv;
  no more hardcoded tenant defaults
- Fail fast in beforeAll if admin JWT capture returns empty string so the
  server is never left stuck in SSO mode
- Shrink Okta provider override to just the fields Okta needs; sibling
  authorizer fields come from the captured snapshot

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(sso-login): extract per-provider composite action

Restructures the nightly workflow so provider credentials stay statically
referenced for CodeQL while making it trivial to add new providers:

- New composite action .github/actions/sso-login-run bundles all shared
  setup + test-run logic; pulls non-secret provider config from the
  caller's vars context dynamically (${PROVIDER_UPPER}_SSO_*)
- playwright-sso-login-nightly.yml becomes a thin dispatcher with one
  real job per provider. Each job declares environment: test so it can
  resolve its password via a static secrets.<PROVIDER>_SSO_PASSWORD
  reference (no secrets[format(...)] dynamic lookup, CodeQL clean)
- Adding a provider = copy the okta job stanza, swap the secret name,
  add the provider to the dispatch input choices, register the helper
  in sso-providers/index.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(playwright): move Okta tenant config to a repo constant

The Okta tenant identifiers (clientId, domain, principalDomain) are
non-secret OAuth public values — visible on the hosted login page
during any sign-in. Keeping them in GitHub environment variables cost
setup friction (5 env vars to configure locally, each a potential typo)
without any security benefit. Move them back to a committed OKTA_TENANT
constant in okta.ts where a reviewer can see exactly which tenant the
suite is exercising.

Net effect:
- Local runs only need SSO_PROVIDER_TYPE, SSO_USERNAME, SSO_PASSWORD.
- The test environment in GH Actions keeps OKTA_SSO_USERNAME (variable)
  and OKTA_SSO_PASSWORD (secret); the three tenant variables are no
  longer consumed.
- Composite action drops the jq-based dynamic var extraction; the
  caller passes sso_username directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(sso-login): move timeout-minutes from composite step to job level

Composite actions don't support timeout-minutes on individual steps —
that's a runner job field only. Move the 30-minute test timeout up to
the dispatcher job and bump to 45 minutes to cover docker + maven setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(sso-login): consolidate dispatcher + composite action into one file

Collapse the dispatcher workflow + composite action split into a single
~115-line workflow using a strategy matrix and dynamic
vars[format(...)] / secrets[format(...)] credential resolution keyed on
the matrix provider name.

Trade-off:
- CodeQL "Excessive Secrets Exposure" (low severity) will re-flag the
  dynamic secret lookup. Accepted in exchange for a single source of
  truth and true zero-workflow-churn multi-provider support.

Onboarding a new provider is now:
  1. Add its name to the matrix array + dispatch options list.
  2. Add <PROVIDER>_SSO_USERNAME (variable) + <PROVIDER>_SSO_PASSWORD
     (secret) in the test environment.
  3. Register the helper in sso-providers/index.ts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(sso-login): drop provider-prefix bash step; use case-insensitive lookup

GitHub secret and variable names are case-insensitive, so
format('{0}_SSO_PASSWORD', matrix.provider) with the lowercase matrix
value resolves correctly against the uppercase conventional names like
OKTA_SSO_PASSWORD. That removes the need for a separate "Compute
provider prefix" step and its cross-step env-context plumbing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(sso-login): drop redundant case-insensitivity comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* ci(sso-login): pin playwright install to 1.57.0 to match package.json

The previous 1.51.1 pin was stale vs. the @playwright/test version in
package.json. The mismatch caused browser cache path divergence — the
install step wrote browsers under 1.51.1's cache and the test run
looked for them under 1.57.0's cache and failed with "browsers not
installed."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(playwright): address SSO suite review comments [skip ci]

- Drive Okta tenant (clientId, domain, principalDomain) from env vars,
  falling back to the existing nightly tenant values as defaults
- Use redirectToHomePage as the final assertion in the SSO login step
- Document why the /signup vs /my-data branch is conditional

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* saml

* test(playwright): add SAML providers to SSO login nightly

Extend the nightly SSO login matrix with Azure AD SAML and a self-contained
Keycloak SAML fixture (Azure-profile + Google-profile realms), so the suite
exercises the full SAML flow end-to-end without relying on a hosted IdP.

- docker/local-sso/keycloak-saml: Keycloak 26.3.3 compose + pre-imported
  realms bound to OM at localhost:8585, port-overridable via
  KEYCLOAK_SAML_PORT.
- playwright sso-providers: azure-saml helper (hosted tenant, non-secret
  federation metadata committed) and keycloak-saml factory that fetches the
  realm's IdP X509 at runtime.
- SSO assertion matches OM's actual SAML sign-in label ("Sign in with
  SAML SSO"), since providerName isn't propagated into the store for the
  SAML provider branch of getAuthConfig.
- Workflow starts/stops the Keycloak stack only for keycloak-* matrix rows
  and injects the fixture credentials inline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(playwright): fetch Azure SAML IdP cert at runtime

Drop the committed Azure Federated SSO X509 certificate and the
AZURE_SAML_IDP_CERTIFICATE env fallback from the azure-saml provider.
The cert now comes from Azure's federation metadata XML endpoint at test
start, mirroring how the Keycloak provider resolves its realm cert, so the
suite stays aligned with Azure's ~3-year cert rotations automatically.

- New saml-metadata.ts exporting fetchIdpX509Certificate(descriptorUrl,
  label), reused by azure-saml and keycloak-saml.
- azure-saml.buildConfigPayload is now async and pulls the cert from
  https://login.microsoftonline.com/<tenantId>/federationmetadata/2007-06/federationmetadata.xml
  before building the SAML payload.
- keycloak-saml drops its inline cert-fetching helpers and delegates to
  the shared util.
- Trim narration comments across the SSO suite to keep only the
  non-obvious rationale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(playwright): drop hosted Azure SAML provider

The nightly Keycloak SAML fixture with Azure-profile attribute claims
exercises the same OM SAML code path as the hosted Azure AD tenant. The
hosted provider added external tenant/cert coupling without unique
coverage, so this removes it.

Drops the azure-saml helper, its env keys (AZURE_SAML_TENANT_ID /
AZURE_SAML_PRINCIPAL_DOMAIN), the dispatcher registration, and the
workflow dispatch option. Keycloak Azure/Google realms remain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(playwright): cover SSO session lifecycle end-to-end

Extends the SSO login spec beyond "can you log in" to the full session
round-trip: reload survives, same-context tabs inherit auth, sidebar
logout (with modal confirm) lands on /signin, and post-logout refresh
stays signed out.

Adds a describe-scoped userContext/userPage created in beforeAll so
tests 2-6 inherit the IdP-backed session; test 1 keeps its fresh
fixture for the unauthenticated assertion. Cleanup closes the user
context before restoring the server security config.

Verified locally against keycloak-azure-saml and keycloak-google-saml
realms: 6 passed each (was 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* remove slow from individual spec

* remove slow from beforeAll

* style(playwright): fix SSOLogin spec prettier issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(playwright): tighten SSO sign-in locator and await logout response

Address Copilot review comments on PR #27164:
- Use button.signin-button to match the pattern in SSOAuthentication.spec.ts.
- Await /api/v1/users/logout POST alongside the /signin navigation in
  the logout test to remove the race against the server response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix

* Update openmetadata-ui/src/main/resources/ui/playwright/e2e/Auth/SSOLogin.spec.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix

* test(playwright): resolve SSO creds via env vars, drop keycloak-google-saml

Route Keycloak credentials through the same `vars[format(...)]` /
`secrets[format(...)]` indirection as Okta via an `env_prefix` matrix
column, removing the hardcoded fixture literals from the workflow.
Password lookup falls back `vars || secrets` so fixture passwords can
live as vars while real provider secrets stay in secrets.

Also drop the keycloak-google-saml variant — same IdP and realm shape
as the Azure variant, so it adds CI cost without meaningful coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test(playwright): post SSO login nightly results to Slack

Adds a per-provider Slack notification step mirroring the pattern used
by the postgresql/mysql nightly workflows — reuses the existing
`slack-cli.config.json` and `playwright-slack-report` CLI against the
`results.json` that the global JSON reporter already emits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(playwright): drop logout response wait in SSO spec

OktaAuthenticator.logout clears tokens locally with no backend call, and
GenericAuthenticator (SAML) hits `GET /auth/logout` — neither triggers
the `POST /api/v1/users/logout` the test was waiting on. The listener
never matched, so `Promise.all` hung past the 180s test timeout even
though the page had already navigated to /signin.

Rely on `waitForURL('**/signin')` + the signin button assertion, which
are the actual cross-provider success signals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Siddhant <siddhant@MacBook-Pro-457.local>
Co-authored-by: Siddhant <siddhant@MacBook-Pro-529.local>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Siddhant <siddhant@MacBook-Pro-621.local>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-04-17 13:09:54 +05:30
Sriharsha Chintalapani
bb0daa180e
RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex (#26902)
* RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex

* Update generated TypeScript types

* Address comments from copilot

* Update generated TypeScript types

* fix test issues

* Fix minor UI bugs

* Add the missing filters

* Fix RDF export API error

* Add export functionality

* Fix ui-checkstyle

* Fix java checkstyle

* Fix unit tests

* Fix and increase the coverage for KnowledgeGraph.spec.ts

* Fix tests

* Remove rdf as default in playwright and local docker

* fix ui-checkstyle

* Address comments

* Potential fix for pull request finding 'CodeQL / Artifact poisoning'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* Address copilot comments

* Address copilot comments

* FIx tests

* FIx docker

* Update openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/distributed/DistributedRdfIndexCoordinator.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Address copilot review comments: license headers, JSON escaping, type safety, border-color, stop semantics

Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/c026e52e-162b-4c9a-9874-43791d4aaac1

Co-authored-by: harshach <38649+harshach@users.noreply.github.com>

* Show error toast for unsupported export format in KnowledgeGraph

Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/c026e52e-162b-4c9a-9874-43791d4aaac1

Co-authored-by: harshach <38649+harshach@users.noreply.github.com>

* Fix docker

* Fix docker for playwright

* Fix docker for playwright

* Fix tests

* Fix tests

* Fix docker

* Fix docker

* Fix glossary and pagination spec flakiness

* update the missing translations

* Fix docker

* Fix docker

* Fix integration test

* Fix fuseki not starting

* Fixed the run local docker script

* worked on comments

* Fix flakiness in knowledge graph tests

* Fix checkstyle

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: harshach <38649+harshach@users.noreply.github.com>
2026-04-14 13:24:41 -07:00
Pere Miquel Brull
cfd71e8bd3
Fix k8s operator exit handler pod loop and TTL cleanup, add tolerations (#26971)
* Fix k8s operator exit handler pod loop and TTL cleanup, add tolerations support (#26772)

Fix two bugs in the OMJob operator:
- Exit handler pods were recreated indefinitely because findExitHandlerPod()
  lacked the name-based fallback that findMainPod() already had, causing
  label propagation delays to trigger repeated pod creation events
- Terminal phase handler never rescheduled for TTL-based cleanup, so pods
  were never cleaned up after ttlSecondsAfterFinished expired

Add tolerations support for ingestion pod scheduling across the full stack:
- Operator: OMJobPodSpec field, PodManager.buildPod(), CRD schema
- Server: OMJob model, K8sPipelineClientConfig parsing, K8sPipelineClient
  builder, K8sJobUtils serialization

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add K8S_TOLERATIONS env var mapping in openmetadata.yaml

Adds the tolerations config binding so the server picks up the
K8S_TOLERATIONS env var set by the Helm chart secret.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add tolerations to k8s test values for local validation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix cleanup

* Address PR review: remove redundant pod lookup and guard null items

- Remove redundant server-created pod selector fallback in findMainPod()
  since buildPodSelector() now matches all pods by omjob-name alone
- Add null guard for getItems() in deletePods() to prevent NPE
- Update local test values for namespace and image config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 09:42:54 +02:00
Sriharsha Chintalapani
b2b49db75e
MSAL Token Renewal Fix — Safari Session Loss (#27214)
* MSAL Token Renewal Fix — Safari Session Loss

* MSAL Token Renewal Fix — Safari Session Loss

* MSAL Token Renewal Fix — Safari Session Loss

* apply lint

* MSAL Token Renewal Fix — OIDC fix

* wait for token update

* fix unit tests

* Add SSO playwright tests

* Add tests

---------

Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>
2026-04-09 17:45:00 -07:00
Mohit Yadav
7bb8e40b65
Fix column filtering on Lineage (#25353)
* Fix Column Filtering and add path preserve

* Preserve only column with matching filter

* Add Test

* update param

* Add UI work

* Lanaguage

* Add proper translations for column-filter locale keys (#25360)

* Initial plan

* Add proper translations for column-filter locale keys across all 18 languages

Co-authored-by: karanh37 <33024356+karanh37@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: karanh37 <33024356+karanh37@users.noreply.github.com>

* fix filtering

* Fix ui : Dropdown filters (Domains, Owners, Tag, Tier, Service, etc.) were not showing in the Impact Analysis view and normal lineage view.

* put back searchbox for column level

* Fix query_filter not working for tag/domain/tier in lineage APIs -> table level filtering

* fix: hasNodeLevelFilters bypassing ES filters causing empty results

* fix: tag filter incorrectly sent to column_filter on table-level page

* Fix Impact Analysis search and filtering with path preservation
  Summary of changes:
  - Backend: Path preservation for search, accurate pagination counts, wildcard query parsing, OR logic for name/displayName
  - Frontend: Column-level search now matches both table names and column names

* Table level: Search → query_filter (matches table names)                    Column level: Search → column_filter only (matches column names)

* Fix column impact analysis: depth-aware filtering, tag aggregation, and nested column support

* address gitar bot feedback : lineage filter — add service to path preservation, fix OR semantics, rename preserve_paths, guard NPE on fromEntity

* fix: use unfiltered depth counts in lineage pagination info, remove 10k doc fetch

* fix: Impact Analysis — fix upstream BFS, always run BFS unfiltered and apply query filter as in-memory post-filter to support multi-depth traversal, fix column
  filter OR-within-type semantics, rename preserve_paths param, and add integration tests

   instead of passing queryFilter into the BFS (which blocked traversal through non-matching intermediate nodes), we now run BFS with no
  filter to discover the full graph topology, then apply the filter after all nodes and edges are collected using the existing
  applyInMemoryFiltersWithPathPreservationForEntityCount.

* fix: lineage Impact Analysis — unfiltered BFS with post-filter for multi-depth traversal, upstream BFS direction fix, remove dead ES query column filter code, fix stale useCallback deps, add SDK methods and integration tests

* fix: remove column_filter from UI calls where backend doesn't support it (exportAsync, platformLineage, dataQualityLineage, paginationInfo), fix stale useCallback deps in LineageProvider

* fix: Impact Analysis — unfiltered BFS for multi-depth filter traversal, upstream direction fix, table/column tag separation, dead code cleanup, stale UI deps, node depth dropdown fix

* fix: remove dead columnFilter plumbing from CustomControls, clear column filters on Table mode switch, fix QueryFilterParser search+filter OR logic, add search combo integration tests, log warn on tag fetch failure

* fix: depth-based pagination sort

* ui: performance optimization — avoids redundant lookups

* handle matchesMultipleFiltersWithMetadata

* fix: upstream/downstream count not updating in table view

* fix UI changes

* fix api issues

* fix: Impact Analysis — move to ES-native filtering with unfiltered BFS, filtered pagination counts, tag name enrichment

* address comments

* fix: Impact Analysis — ES-native filtered traversal, batch tag enrichment, depth filtering with filters, SDK entityType support

* fix tests

* fix failing tests

* fix backend test

* add tests for code coverage

* add tests for code coverage

* fix: add id.keyword sub-field to  ES index mappings to fix lineage filter dropdowns for topics, dashboards, and other non-table entities

* address comments

* fix service type filter case

* address gitar bot feedback

* fix tests

* fix build

* Fix the bugs

* Fix the bugs

* Fix all things related to  Lineag, Impact Analysis

* Update generated TypeScript types

* Fix all things related to  Lineag, Impact Analysis

* Fix Mapping for ids for container and test suite

* test: enhance lineage spec to cover all the missing cases (#26796)

* test: enhance lineage spec to cover all the missing cases

* fix searchIndex mapping

* fix tests

* added filter spec

* fix filter issues

* fix lineageSearchSelect

* update database service filter tests

* iterate over all the entity for service filter

* update impact analysis fixes

* update tests management

* add missing test case

* fix tests

* fix column level lineage tests

* fix apiEndpoint issue

* improved lineage connection assertion

* fix tests

* fix column level linage issues

* fix missing import

* update test import from pages

* fix mlModel spell issue

* fix node pagination and right panel spec

* refactor lineage tests to improve entity creation and visibility checks

* fix license header

* fix build

* fix tests

* fix tests

* UI linter fixes

* address comments

* fix unit tests

* remove redundant method

* improve tests

* fix impact analysis tests

* fix impact analysis

* Fix Export via Async and add tests

* update tests

* fix issues

* Spotless fix

* fix impact analysis

* Fix issue with lineage export

* Fix serviceType filtering

* fix multiple calls issue

* fix lint issues

* fix uni tests

* fix test issues

* fix lineage settings spec

* fix all the tests

* Remove fix me

* fix lint issue

* fix failing specs

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: karanh37 <33024356+karanh37@users.noreply.github.com>
Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>
Co-authored-by: sonika-shah <58761340+sonika-shah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
Co-authored-by: Sriharsha Chintalapani <harshach@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-04-06 09:01:15 -07:00
Sriharsha Chintalapani
410c852f4a
Add Json Logging (#26357)
* Add Json Logging

* Fix comments

* Fix tests

* Centralize junit.platform.version in root pom

* Fix test-config-mcp.yaml - update to JSON logging

* Fix logback.xml to use LOG_LEVEL for backward compatibility

* Reverted to text format for test env  test-config-mcp.yaml

* Add the ability to switch between text/json logging

* Fix comments

* Fix json logging

* Address Comments

* Address Comments

---------

Co-authored-by: sonika-shah <58761340+sonika-shah@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-03-31 16:15:07 -07:00
Sriharsha Chintalapani
b7797fe3ef
Airflow 3.x API based connector (#26624)
* Add Airflow Connector with API integration

* Add Airflow Connector with API integration

* Update generated TypeScript types

* Add Airflow Connector with API integration improvements

* fix: username password flow for airflow 3, example yaml file, & sidebar docs

* fix type in UI

* Fix integration tests, fixed UI rendering and docs, improved OpenLineageResolver

* Fix pytests

* move connector

* Update generated TypeScript types

* fix: response parsing for astronomer airflow

* feat: added service account auth for airflow rest connection when composer managed airflow along with token

* fix: airflow rest api connection class converter and airflow.md

* feat: add mwaa config support for authentication

* s3 & column lineage

* Update generated TypeScript types

* fix: test airflow mwaa client

* fix: removed unused method, and extra code for parsing response

* fix: git pr checks

* fix: removed airflowapi integration tests that requires real host instance and added test with mocking

* fix test

* improve test coverage

* push coverage

* fix: gitar comments

* fix: removed redundant files

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Keshav Mohta <68001229+keshavmohta09@users.noreply.github.com>
Co-authored-by: Keshav Mohta <keshavmohta09@gmail.com>
Co-authored-by: ulixius9 <mayursingal9@gmail.com>
2026-03-26 17:15:41 +01:00
Mohit Yadav
b59aa7fc44
Improve indexing (#26154)
* Add Prometheus metrics for reindexing pipeline via Micrometer                                                       Bridge the existing reindexing atomic counters to Prometheus so operators     can alert on failures, latency spikes, and backpressure without relying      solely on database-flushed stats.

  - Add ReindexingMetrics singleton (initialize/getInstance pattern matching
    CacheMetrics) with job lifecycle counters, stage success/failed/warnings
    counters, bulk request timers with SLA buckets, payload size distribution,
    backpressure and promotion counters, and active/pending gauges
  - Register in MicrometerBundle after StreamableLogsMetrics
  - Instrument ReindexingOrchestrator.run() with job started/completed/failed/stopped
  - Bridge StageStatsTracker.flush() deltas to Prometheus per stage and entity type
  - Add bulk request latency timer and payload size recording in OpenSearchBulkSink
  - Record backpressure events in SearchIndexExecutor.handleBackpressure()
  - Record promotion success/failure in DefaultRecreateHandler
  - Add ReindexingMetricsTest with 24 tests covering all metric types

* Add Improvements

* Auto Gene

* Use Auto Config in distributed

* Fix Partition Claim Spread

* Make partition use config

* Correct total count

* Fix Wait time to 5 mins

* Revert om yaml

* Fix Sink sync

* Add Failure Handling at different stages

* Update script to create entities

* Move to scripts

* Add usage and fix script

* Fix Script

* Update generated TypeScript types

* Fix Staging miss

* Fix Stats reconcilation issue

* Revert workflow handler

* Fix Partition worker early sync

* Update Logs

* Update logs EntityRepository

* Error failure test

* Review Comments fix

* Fix Non Distributed live feed

* Fix Non Distributed stats feed

* Fix Review comments

* Fix Time Series cutt off

* Update generated TypeScript types

* Md

* Benchmark addition

* Fix date time warning

* Update load test to do benchmark analysis

* Disagnostic and update perf test

* Move load test to bin

* Fix Review Comments

* Add numeric values

* Move to localhost by default

* Fix Perf test issues

* Review Comments

* Add Preflight Fixes

* Add Preflight fixes for stale entry

* Remove stale entry on ApplicationHandler

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-03-03 16:39:27 +05:30
Mohit Yadav
fa3b7b9305
[Search] Upgrade Clients (#25719)
* Upgrade Clients

* Update clients in docker files

* Fix Tests

* Fix integration test

* Fix Review Comments

* Fix More review comments :-
  1. ElasticSearchClient.java - Added keep-alive timeout configuration
  2. OpenSearchClient.java - Added keep-alive timeout configuration
  3. OpenMetadataOperations.java - Added logging for caught exception
  4. SigV4Hc5RequestSigningInterceptor.java - Now throws exception instead of silently returning

* Fix More review comments :-
  1. ElasticSearchClient.java - Added keep-alive timeout configuration
  2. OpenSearchClient.java - Added keep-alive timeout configuration
  3. OpenMetadataOperations.java - Added logging for caught exception
  4. SigV4Hc5RequestSigningInterceptor.java - Now throws exception instead of silently returning

Co-authored-by: mohityadav766 <mohityadav766@users.noreply.github.com>

* upgrade to 9.3.0 vs 3.4.0 server since earlier had bug

* fix version in pom

* Fix Review Comments

* FIX IAM OpenSearch FIx

---------

Co-authored-by: Gitar <noreply@gitar.ai>
Co-authored-by: mohityadav766 <mohityadav766@users.noreply.github.com>
2026-02-07 18:54:13 +05:30
Pere Miquel Brull
7a3746c00f
FIX - Server passes secret prefixes to ingestion (#25527)
* FIX Query Runner - Server passes secret prefixes to ingestion

* FIX Query Runner - Server passes secret prefixes to ingestion
2026-01-28 10:35:13 +01:00
Mohit Yadav
0129f274ed
ReApply changes Fix Stats Issue and Add Tests (#25521)
* Fix Issue and Add Tests

* Update generated TypeScript types

* Fix CI jest failure

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-01-26 21:10:23 +05:30
mohitdeuex
c006bdb2b0 Revert "Fix stats and Improve Search with Insights (#25495)"
This reverts commit 19725a7130.
2026-01-24 11:53:51 +05:30
Mohit Yadav
19725a7130
Fix stats and Improve Search with Insights (#25495)
* Fix Stats

* Add Warning logs and reindex failure analysis

* Add Search Insights in Preferences

* Add Label

* Fix Full Error not available

* Add check for reindex run
2026-01-24 10:27:46 +05:30
Sriharsha Chintalapani
89f627da81
Distributed Search Indexing with Push Notifications (#24939)
* Add Distributed Indexing in Multi-Server scenarios

* Add Distributed Indexing in Multi-Server scenarios

* Update generated TypeScript types

* Handle Servers leaving and joining

* Update generated TypeScript types

* spotless fix

* Refactor Code for Single Server and Multiple Server

* Add Metrics and Search Index Orphaned Cleanup

* Add Language

* Add Test settings

* Add Test data

* Add Test data

* Update generated TypeScript types

* Add Load Test for more entities

* Add Stats fix

* Add server information

* Fix Staging INdex unavailable to DistributedJobParticipant

* Fix Stats issue

* Align Tests

* Fix Stats and Error Handling

* participant stat fix

* Fix coordinator stats

* Add E2E failure tests

* Fix Stats for Reader and Sink

* Added flush for sinking stats

* Add language label

* Fix Entity Build Errors

* Missing commit

* Update generated TypeScript types

* Change runId to serverId

* Fix test failures

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
Co-authored-by: mohitdeuex <mohit.y@deuexsolutions.com>
2026-01-23 06:12:05 +05:30
Pere Miquel Brull
fa4373054e
Finish K8sPipelineClient Implementation (#25172)
* config cleanup

* add missing configs

* fix auto pilot

* fix lifecycle

* fix logs and tests

* fix test

* move integration tests

* fix

* fix

* Address code review feedback

- Fix UsageWorkflowConfig to set stageFileLocation instead of queryLogFilePath
- Add error handling for parseInt in IngestionLogHandler to catch NumberFormatException

* fix

* fix lifecycle

* prepare cronOMJob

* remove PR target

* fix

* fix

* fix

* fix

* fix

* fix tests

* fix review

* fix review

* fix review

* fix

---------

Co-authored-by: Gitar <gitar@gitar.ai>
Co-authored-by: Gitar <noreply@gitar.ai>
Co-authored-by: pmbrull <pmbrull@users.noreply.github.com>
2026-01-15 08:17:55 +01:00
Sriharsha Chintalapani
f5cf3190c4
Add OpenSearch IAM auth; Add multi host listing capability in the existing config for search (#25204)
* Add OpenSearch IAM auth; Add multi host listing capability in the existing config for search

* Update generated TypeScript types

* Issue #22768: OpenSearch IAM auth; multi-host config

* Update generated TypeScript types

* Unify AWS config across different services

* Update generated TypeScript types

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-01-14 12:35:53 +05:30
Sriharsha Chintalapani
2c8a45d2a8
Upgrade to Dropwizard 5x and Jetty 12.1 (#24776)
* Add support for Dropwizard 5.0 and Jetty 12.1.x

* Dropwizard 5x and Jetty 12.1 upgrade

* Fix test behavior

* Fix rdf tests

* revert enableVirtualThreads

* fix tests

* Fix Tests

* Fix tests

* Switch to jersey-jetty-connector for Jetty 12 compatibility

- Replace jersey-apache-connector with jersey-jetty-connector
- Jersey 3.1.4+ jersey-jetty-connector supports Jetty 12.0.x+
- Use JettyConnectorProvider and JettyHttpClientSupplier for HTTP client
- Keep reasonable timeouts (30s connect, 2min read) to prevent CI hangs
- Set SYNC_LISTENER_RESPONSE_MAX_SIZE for large responses

This fixes the 1,093 InterruptedException test failures caused by
using the default Jersey client (HttpURLConnection-based) which doesn't
handle concurrent test execution properly.

* Fix: Start Jetty HttpClient before use

Jetty 12 HttpClient implements LifeCycle and must be explicitly
started with httpClient.start() before use. This fixes the 163
InterruptedException test failures.

* Fix: Force jetty-client to 12.1.1 for jersey-jetty-connector

jersey-jetty-connector brings transitive jetty-client:12.0.22 but
Dropwizard 5.0 uses Jetty 12.1.1. The ClientConnector.newTransport()
API changed between 12.0.x and 12.1.x, causing NoSuchMethodError.

Fix: Exclude transitive jetty-client and add explicit 12.1.x dependency.

* Use Java 11+ HttpClient connector for tests (jersey-jnh-connector)

Switch from the broken jersey-jetty-connector (incompatible with Jetty 12.1.x)
to jersey-jnh-connector which uses Java's built-in java.net.http.HttpClient.
This connector:
- Natively supports all HTTP methods including PATCH
- Works with Java 21
- No external dependencies required
- Avoids compatibility issues with Jetty versions

* Use Apache HttpClient 5.x connector for tests (jersey-apache5-connector)

Switch from jersey-jetty-connector (incompatible with Jetty 12.1.x)
to jersey-apache5-connector which uses Apache HttpClient 5.x.
This connector:
- Supports all HTTP methods including PATCH
- Lenient with empty PUT request bodies
- Has proper timeout support to prevent indefinite hangs
- Works with Jetty 12.1.x

* Fix  tests

* Fix  docker compose

* Fix tests

* Fix tests - make url compatible

* Add URL parsing

* Fix URL decode

* fix tests

* fix test

* fix tests

* Fix integration with new dropwizard-5x changes

---------

Co-authored-by: Karan Hotchandani <33024356+karanh37@users.noreply.github.com>
Co-authored-by: karanh37 <karanh37@gmail.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-01-12 12:18:29 -08:00
Suman Maharana
2741d277ad
Fix Trivy scans (#24867)
* Fix Trivy scans

* remove comms

* fixes

* fixed incompatible changes

* revert dependency conflicts

* update airflow to 3.1.5

* fix airflow not showing debug logs

---------

Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>
Co-authored-by: Keshav Mohta <68001229+keshavmohta09@users.noreply.github.com>
2025-12-19 16:27:12 +01:00
Sriharsha Chintalapani
e71715ad6c
Single RDF knowledge graph for all entities (#24839)
* Single RDF knowledge graph for all entities

* Fix RDF Resource Test

* fix test

* fix test

* Add support for TagLabel objects

---------

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
Co-authored-by: lautel <laura92cp2@gmail.com>
2025-12-18 16:33:15 +01:00
Pere Miquel Brull
6fdc3539bb
MINOR - Prepare extra validations for system repository health (#24846)
* MINOR - Prepare extra validations for system repository health

* Update generated TypeScript types

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-12-18 07:37:37 +01:00
Karan Hotchandani
c8501f2f4f
preparing 1.12 branch (#24870) 2025-12-17 18:36:03 +05:30
Mayur Singal
cee0fd2c77
MINOR: Add option to disable ingestion in run_local_docker (#24676) 2025-12-03 12:35:29 +00:00
Teddy
11c2d2f6a9
MINOR - Airflow serialized limit (#24617)
* chore: set lower DAG serialization defaults

* chore: increase timeout for testSuite

* chore: update DAG processor interval postgres

* chore: lower DAG parse interval and delay

* fix: remove internal parsing and trigger dag parsing automatically on deploy
2025-12-01 11:09:19 +01:00
Mayur Singal
acb1be97f4
Fix #23096: Add Airflow 3.x support (#24338)
* Fix #23096: Add Airflow 3.x support

* airflow auth fixes

* fix airflow tests

* fix airflow 3 ingestion

* pyformat

* fix pytest

* pyformat

* bump version

* fix version

* fix mlflow

* custom pydoris

* fix airflow tests

* fix spotless

* final test fixs

* playwrite debug

* fix pytests

* checkstyle fix

* fit get status api and revert playwrite debug

* fix airflow version

---------

Co-authored-by: Ashish Gupta <ashish@getcollate.io>
2025-11-21 12:28:28 +01:00
Eugenio
f528616b2f
Identify when Airflow DAG import errors in CI (#23589) 2025-09-26 15:30:15 +02:00
Himanshu Khairajani
8c4cebea13
fix #21555: Automator - Separating terms and tags in action config (#22970)
* Separating terms and tags in action config

* Update generated TypeScript types

* add: migration files for separate tags and terms

* chore: java formatting

* yaml formatting

* Chore: updated the release number

* updated to v194 as per release cycle

* updated to v195 as per release cycle

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: --global <--global>
2025-08-28 10:50:48 +02:00
Mohit Yadav
c0d7a574d7
chore(release): Prepare Branch for 1.10.0-SNAPSHOT (#23034)
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
2025-08-21 21:43:01 +05:30
Pere Miquel Brull
dbe8d5ed00
CI - YAML formatting issue (#23047)
* CI

* CI
2025-08-21 18:05:15 +02:00
Sriharsha Chintalapani
a6d544a5d8
RDF Ontology, Json LD, DCAT vocabulary support by mapping OM Schemas to RDF (#22852)
* Support for RDF, SPARQL, SQL-TO-SPARQL

* Tests are working

* Add  RDF relations tests

* improve Knowledge Graph UI, tags , glossary term relations

* Lang translations

* Fix level depth querying

* Add semantic search interfaces , integration into search

* cleanup

* Update generated TypeScript types

* Fix styling

* remove duplicated ttl file

* model generator cleanup

* Update OM - DCAT vocab

* Update DataProduct Schema

* Improve JsonLD Translator

* Update generated TypeScript types

* Fix Tests

* Fix java checkstyle

* Add RDF workflows

* fix unit tests

* fix e2e

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>
2025-08-17 18:36:26 -07:00
Mohit Yadav
b92e9d0e06
chore(release): Prepare Branch for 1.9.0-SNAPSHOT (#22742)
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
2025-08-04 20:00:25 +05:30
Mohit Yadav
0b2321e976
Added Session Age for Cookies (#22166)
* - Added Session Age for Cookies

* Make OIDC Session Expiry Configurable

* Update generated TypeScript types

* Updated Docker Files

* Update Session to 7 days

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-07-08 15:07:52 +05:30
Sriharsha Chintalapani
d10178f050
Fix java 21 in docker (#21746) 2025-06-12 18:24:25 -07:00
Sriharsha Chintalapani
8bb055fc9e
Fix #21506: Upgrade to Java 21 (#21507)
* Fix #21506: Upgrade to Java 21

* Fix #1655: Upgrade to Java 21
2025-06-11 22:06:08 -07:00
Akash Jain
799e3ca900
chore(docker-compose): Bump indices.query.bool.max_clause_count=4096 (#21301) 2025-05-27 14:29:05 +02:00
Imri Paran
d91273a30d
Fix 20325: Trigger external apps with config (#20397)
* wip

* feat: trigger external apps with override config

- Added in openmetadata-airflow-apis functionality to trigger DAG with feature.
- Modified openmetadata-airflow-apis application runner to accept override config from params.
- Added overloaded runPipeline with `Map<String,Object> config` to allow triggering apps with configuration. We might want to expand this to all ingestion pipelines. For now its just for apps.
- Implemented an example external app that can be used to test functionality of external apps. The app can be enabled by setting the `ENABLE_APP_HelloPipelines=true` environment variable.

* fix class doc for application

* fixed README for airflow apis

* fixes

* set HelloPipelines to disabeld by default

* fixed basedpywright errros

* fixed app schema

* reduced airflow client runPipeline to an overload with null config
removed duplicate call to runPipeline in AppResource

* Update openmetadata-docs/content/v1.7.x-SNAPSHOT/developers/applications/index.md

Co-authored-by: Matias Puerta <matias@getcollate.io>

* deleted documentation file

---------

Co-authored-by: Matias Puerta <matias@getcollate.io>
2025-05-06 17:41:24 +07:00
Mohit Yadav
20f17a3367
Fixes #16062: Added prompt config to allow config (#20959)
* Fixes #16062
Make prompt=login as optional

* update null or empty
2025-04-25 08:37:25 +05:30
Ashish Gupta
73aaa34b75
update the snapshot to 1.8.0 (#20925) 2025-04-24 10:46:36 +05:30
Akash Jain
0f6d0523d8
feat: Bump Versions to 1.7.0-SNAPSHOT on Main Branch (#20847)
* feat: Bump Versions to 1.7.0-SNAPSHOT on Main Branch

* fix(script): Add a condition for "-SNAPSHOT" is version update script
2025-04-16 15:21:01 +05:30
Mohit Yadav
3a01ad7da5
[Fix-20125] OIDC: Allow max_age to be optional (#20721)
* Make Max Age Optional

* spotless fix
2025-04-09 15:09:57 +05:30
Mohit Yadav
c28f3274d1
Adds new param to docker files (#20338) 2025-03-19 18:13:22 +05:30
Pere Miquel Brull
69c9102da1
MINOR - Bump Ingestion versions (#19836)
* MINOR - Bump Ingestion versions

* MINOR - Bump Ingestion versions

* fix

* fix db_scheme for airflow +2.9.1

* fix
2025-02-18 07:56:46 +01:00
Ethan
48700ae9ea
Fixes #18075: Dockerfile lint warning (#18077)
* fix docker warning

* for running actions

---------

Co-authored-by: Akash Jain <15995028+akash-jain-10@users.noreply.github.com>
2025-02-04 15:28:36 +05:30
Chirag Madlani
a43835df32
Revert "fixes #18820: updated docker compose files (#18821)" (#19297)
This reverts commit 69dd8b99f9.
2025-01-09 15:34:07 +05:30
tarunpandey23
69dd8b99f9
fixes #18820: updated docker compose files (#18821) 2025-01-09 10:50:00 +05:30
Ethan
e708a3242e
feat: update version (#18259) 2024-10-17 16:18:37 -07:00
Imri Paran
be82086e25
MINOR: add column case sensitivity parameter (#18115)
* fix(data-quality): table diff

- added handling for case-insensitive columns
- added handling for different numeric types (int/float/Decimal)
- added handling of boolean test case parameters

* add migrations for table diff

* add migrations for table diff

* removed cross type diff for now. it appears to be flaky

* fixed migrations

* use casefold() instead of lower()

* - implemented utils.get_test_case_param_value
- fixed params for case sensitive column

* handle bool test case parameters

* format

* testing

* format

* list -> List

* list -> List

* - change caseSensitiveColumns default to fase
- added migration to stay backward compatible

* - removed migration files
- updated logging message for table diff migration

* changed bool test case parameters default to always be false

* format

* docs: data diff

- added the caseSensitiveColumns parameter

requires: https://github.com/open-metadata/OpenMetadata/pull/18115

* fixed test_get_bool_test_case_param
2024-10-15 16:29:43 +02:00
Chirag Madlani
b0563ccf98
revert quicktype bump since for CI issue (#17934)
* enable logging for debugging

* remove node-gyp

* add node-gyp globally before installing deps

* reduce quick type to 10

* revert quicktype and node-gyp changes for CI

* fix unit tests
2024-09-20 19:30:06 +05:30
Pere Miquel Brull
6a1cd0ef8b
GEN-1493 - Fix paginate_es in opensearch (#17858)
* GEN-1493 - Fix opensearch pagination

* GEN-1494 - Add CI for py-tests with Postgres and Opensearch

* GEN-1494 - Add CI for py-tests with Postgres and Opensearch
2024-09-17 14:21:10 +02:00