OpenMetadata

mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

Author	SHA1	Message	Date
mohitdeuex	e90a48fadd	Update name	2026-05-21 18:37:04 +05:30
mohitdeuex	4f78c6cf05	test(review): address pmbrull nits on PR #28008 - nightly workflow: reformat the Topology comment block (drop the column-aligned space padding that read as "weird spaces"). - nightly workflow: hoist the stress cohort sizes (simpleReindex tables/topics/dashboards/pipelines, searchAvailable tables) into workflow_dispatch inputs with the current values as defaults, so they're tunable from the Actions UI per run. - remove openmetadata-integration-tests/REINDEX_TEST_PLAN.md — a planning/tracking doc that doesn't belong in the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 18:33:43 +05:30
mohitdeuex	b933eee132	test(review): address PR #28008 bot review comments Real bugs: - UiTestServer: external mode (OM_URL+OM_ADMIN_TOKEN) now honours the operator token instead of minting a local one the external server won't trust; no TokenRefresher for the static external token. - UiSession.uiUrl(): strip the /api REST base before appending UI paths instead of relying on URI.resolve (fragile for relative paths / trailing-slash bases → /api/<route> 404s). - CpuSampler.percentile(): index off (length-1); floor(p*length) returned the max for small n, overstating p95. - OidcEnvBuilder: keep OM's own JWKS in AUTHENTICATION_PUBLIC_KEYS alongside the mock IdP's — SSO mode still validates OM-minted internal/bot tokens. - DataQualityDashboardPage.tryClickDimensionCard: stop swallowing click/navigation failures as "card absent"; only true absence skips. - UiSessionExtension: don't save a trace for TestAbortedException (a skipped assumption is not a failure). Robustness / cleanup: - GoogleSsoBootstrapUIIT: build expected authority from MockOidcServer.NETWORK_ALIAS/PORT instead of a hardcoded :1080. - EntityLoaderSmokeUIIT: log load duration instead of asserting a wall-clock bound (flaky on shared runners). - ReindexHelpers.stopAppAndWait: drop unused stopRequestedAt. - nightly workflow: dedupe apt package list. - Javadoc fixes (UiSessionExtension AuthStrategy ref, IncidentManager seed count 18 -> 20). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 18:08:46 +05:30
Mohit Yadav	23ccb2dcff	Merge branch 'main' into java-playwrights	2026-05-21 16:44:15 +05:30
IceS2	14f880636a	ci(airflow-apis-tests): migrate Sonar step to sonarqube-scan-action@v7 with retry + add workflow_dispatch (#28292) * ci(airflow-apis-tests): retry Sonar PR scan on JRE-provisioning flake Mirror the py-tests pattern: migrate from the deprecated sonarsource/sonarcloud-github-action@master to SonarSource/sonarqube-scan-action@v7, mark the PR scan continue-on-error, and add a sleep+retry step so a transient 'Failed to query JRE metadata' from Sonar's JRE-provisioning endpoint no longer fails the job on first attempt. Hoist the shared sonar args into a workflow-level SONAR_OPTS env. * ci(airflow-apis-tests): allow workflow_dispatch + run Sonar step on it Add workflow_dispatch trigger so the Sonar retry path can be exercised from the Actions UI without opening a PR, and extend the Sonar PR step (plus its wait+retry siblings) to run on the dispatch event. * ci(airflow-apis-tests): scope Sonar steps to pull_request_target only Drop workflow_dispatch from the Sonar PR/retry step conditions so manual runs don't fire the scanner with empty -Dsonar.pullrequest.* flags (would create a branch entry in SonarCloud, per gitar-bot review). Dispatch trigger stays for re-running the build/test surface; Sonar will only fire on a real PR where the pull-request context exists.	2026-05-20 10:33:47 +02:00
mohitdeuex	e1d6734acb	Update webhook	2026-05-20 00:28:14 +05:30
mohitdeuex	3895cd1f70	Speed up nightly Playwright workflow + fix flaky reindex assertions Workflow: - Split into a build-image job that bakes openmetadata-server:jpw-snapshot via docker/build-push-action with a shared GHA layer cache, then exports the loaded image as a workflow artifact. Both matrix entries (elasticsearch + opensearch) download + `docker load` the same image and set OM_TEST_IMAGE so ContainerizedServer skips its own `docker build`. Result: one image build per workflow run (was 2x duplicated dist build + per-launch in-test builds) and stable cross-matrix correctness — the binary both shards run against comes from the exact same source SHA. - workflow_dispatch input `includeSsoBootstrap` toggles whether the @Tag("sso-bootstrap") SSO bootstrap UIITs run; default off because they spin up their own ContainerizedServer (second OM lifecycle) for an env wiring check that doesn't change between most runs. - Slack notification migrated to slackapi/slack-github-action@v2 with the incoming-webhook payload shape it now requires, and guarded behind `env.SLACK_WEBHOOK_URL != ''` so a missing secret no-ops instead of failing the post-step. - Publish Test Report step set fail_on_test_failures=false — mvn verify already gates the job conclusion, and a flake in the report action shouldn't cascade into the Slack step. Test fixes: - SearchAvailableDuringReindexUIIT: baseline probe now asserts `>= seeded.countOf(TABLE)` instead of strict equality. The OM container is shared across the suite so the index can legitimately have residual entities from earlier tests; assertEventualConsistency already checks that none of our baseline entities go missing across the recreate. - SimpleReindexTriggerUIIT: assertExploreCount now polls via Awaitility with a 2-minute budget, re-opening the Explore page on every tick. 
Playwright's `hasText` polled only the DOM, which wedges against a stale aggregation cache; re-issuing the search aggregation on each retry lets ES catch up after the alias swap. Tagging: - @Tag("sso-bootstrap") on GoogleSsoBootstrapUIIT + MockIdpSmokeUIIT, and the `ui-it` profile now reads `ui.it.excludedGroups` (default `sso-bootstrap`) so default `mvn verify -P ui-it` skips them. Pass `-Dui.it.excludedGroups=` to include them.	2026-05-19 22:39:27 +05:30
mohitdeuex	db481fdeac	Address remaining PR review comments Bugs - IndexFieldExplosionIT: SCHEMA_ALIAS was `databaseSchema_search_index`; the canonical indexMapping.json name is `database_schema_search_index`. - ExplorePage: - `tabTestId(GLOSSARY_TERMS)` produced `glossaries-tab`, but the UI builds the testid from the i18n label (`Glossary Terms` → `glossary terms-tab`). - `Tab.DASHBOARD_DATA_MODELS` path was `dashboardDataModels`; the Explore route segment is singular (`dashboardDataModel`). - Javadoc {@link} now points at the correct `openWithSearch` overload. - UiSessionExtension: split video lifecycle so the `Video` handles are snapshotted before `context.close()` (pages() is empty after close) but `video.path()` is resolved AFTER close (Playwright finalises the file on close — calling .path() earlier blocks/fails). - GoogleSsoSignInUIIT: removed the empty alternative from the `(my-data\|explore\|)` regex; it matched almost any post-auth URL and weakened the assertion. - MockOidcServer: still requires a single fixed port (token `iss` claim has to match across container/host/browser), but the port is now overridable via `-Dom.mockOidc.port=NNNN` and a fast pre-flight `ServerSocket` probe fails clearly when the chosen port is busy. GoogleSsoSignInUIIT now reads the port from `MockOidcServer.PORT` instead of hard-coding 1080. Test hygiene - SearchAvailableDuringReindexUIIT: replaced `Thread.sleep` polling with Awaitility (`.atMost(REINDEX_TIMEOUT).pollInterval(PROBE_INTERVAL)`), giving the loop a real deadline and removing the antipattern. - ClipboardHelper: replaced the fixed `waitForTimeout(300)` with bounded paste-retries until the hidden textarea has a non-empty value; textarea cleanup moved to a `finally` block. 
- SimpleReindexTriggerUIIT / SearchAvailableDuringReindexUIIT: defaults are now PR-friendly (200/100/100/100 tables/topics/dashboards/pipelines and 500 tables respectively) overridable via system properties; the nightly workflow sets the historical 5k stress numbers. Quality - DistributedAutoTuneReindexUIIT.distributedAutoTuneConfig now returns `Map.of(...)` instead of a mutable `HashMap`. - SearchQueryHelper.SearchProbe defensively copies `ids` / `uniqueIds` to immutable collections in the canonical constructor. - EntityLoader: every parameter and local that doesn't change is now `final`. - AuthAssumptions: `toLowerCase` calls now pin `Locale.ROOT` to stay stable under Turkish / other surprising locales. Docs - PageObject javadoc: list of rules updated to reflect actual contract (Page Objects may expose `Locator`-returning accessors, `rawPage()` is a documented escape hatch). - UI_TEST_CONVENTIONS.md: layering diagram now lists the real packages (`playwright.scenarios`, `playwright.ui.pages`, `it.auth`, `it.server`). Rule about Locator/Page softened to match the real contract. Headed-debug recipe points at `:openmetadata-integration-tests` (the `:openmetadata-java-playwright` module was removed). Stale references to MIGRATION_TRACKING.md and SearchAfterReindexUIIT replaced with REINDEX_TEST_PLAN.md and SimpleReindexTriggerUIIT. - REINDEX_TEST_PLAN.md: helpers table now flagged as a planning shape with an explicit list of what's shipped today vs. what's still aspirational.	2026-05-19 22:03:29 +05:30
mohitdeuex	c6682464df	Remove playwright-or	2026-05-19 21:24:47 +05:30
mohitdeuex	b015277df3	Address PR review comments: antlr CLI, URL encoding, catch-block split - Install antlr4 CLI + native build deps in java-playwright PR and nightly workflows (yarn install of openmetadata-ui runs the .g4 → JS codegen, which fails with `antlr4: not found` otherwise). - SearchClient: split combined IOException\|InterruptedException catch so only InterruptedException re-sets the interrupt flag; an IOException shouldn't make unrelated higher-level code think the thread was interrupted. - SearchQueryHelper.probeIndex: URL-encode `query` and `indexAlias` before splicing into the query string. - OidcBackend.acquireToken: URL-encode DEFAULT_USER (contains `@`) and DEFAULT_PASSWORD in the password-grant form body. - openmetadata-integration-tests/pom.xml: mark Playwright dependency as test-scoped.	2026-05-19 21:24:21 +05:30
mohitdeuex	0f766ac2dc	Fix Java Playwright CI: build local image before runMigrations Two coupled fixes for the ContainerFetchException seen in https://github.com/open-metadata/OpenMetadata/actions/runs/26087848414: 1. ContainerizedServer.launch() now materialises the openmetadata-server image at the very top via a new ensureServerImageAvailable() helper. Previously runMigrations() ran first and tried to start a container using the jpw-snapshot tag — testcontainers then attempted a registry pull, fell over with ContainerFetchException, and the whole run failed before newServer()/buildLocalImageContainer() had a chance to build anything. The image build is now done once, before any container needs the tag. Honors OM_TEST_IMAGE override (skips local build). 2. Nightly workflow gets an explicit "Build openmetadata-dist tarball" step. The previous `mvn install -pl :openmetadata-integration-tests -am` doesn't transitively build openmetadata-dist (not a test dependency), so openmetadata-*.tar.gz was never produced — meaning ensureServerImageAvailable() would still fail in CI at locateDistTarball(). Added before the test run, after deps build.	2026-05-19 15:02:41 +05:30
Mohit Yadav	fb954a9141	ci: add Java Playwright UIIT workflow (dispatch-only) (#28251) Lands java-playwright-nightly.yml on main so the workflow becomes dispatchable. workflow_dispatch only registers when the workflow file exists on the default branch; once merged, the suite can be run on demand against any branch ref. Tracks EPIC #3731. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:37:37 +05:30
mohitdeuex	ddfc275b6c	ci: disable nightly UIIT cron, keep workflow_dispatch only Run the UI integration suite on demand while it stabilises; re-add the schedule trigger once it is green on main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 14:07:14 +05:30
mohitdeuex	609269db07	Workflow changes	2026-05-19 14:01:41 +05:30
Mohit Yadav	6463edbb5a	Merge branch 'main' into java-playwrights	2026-05-18 20:03:01 +05:30
Harsh Vador	286a26f81f	ci(security-scan): post Snyk summary to Slack + fail on high/critical (#28200) * ci(security-scan): post Snyk summary to Slack + fail on high/critical * fix slack post channel * mention repo name * address gitar	2026-05-17 10:36:11 -07:00
Harsh Vador	d5bc00d1da	ci(security-scan): readable Snyk job summary + consolidated Slack alert (#28170) * generate snyk summary * address gitar * address gitar * generate summary * remove duplicate notification	2026-05-16 07:05:10 -07:00
Sriharsha Chintalapani	5696286b27	Address Transitive vulnerabilities (#28169) * Address transitive vulnerabilities * Address transitive vulnerabilities * fix(deps): resolve pyOpenSSL/cryptography conflict and align constraint pins CI dependency resolution failed because pyOpenSSL~=24.1.0 caps cryptography at <43, conflicting with the cryptography>=44.0.1 bump. Widens pyOpenSSL to >=24.3.0 (first version compatible with cryptography 44.x) and aligns the airflow constraint file pins for cryptography and GitPython with the upstream setup.py bumps so pip install -c can resolve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 00:02:49 -07:00
Harsh Vador	bb5c64658e	ci: consolidate security scan Slack notifications into single combined alert (#28135) * ci: consolidate security scan Slack notifications into single combined alert * address gitar * add env	2026-05-15 21:40:05 -07:00
Sriharsha Chintalapani	64f49c1747	Cache improvements: lineage + search layers, observability, CI gate (#28012 ) * cache: lineage cache, per-type metrics, invalidation registry, search-cache Add Redis-backed lineage response cache and search response cache, both gated by the existing CACHE_PROVIDER toggle and falling through to direct computation when the cache is unavailable. The cache remains optional — verified end-to-end by toggling CACHE_PROVIDER=none on a live stack and confirming all paths continue to work (just without the L2 hit). Coverage: - CachedLineage wraps LineageRepository.getLineage with hybrid TTL + direct invalidation (60s default). Direct edits invalidate the affected root cache entries; transitive changes fall through to TTL. - CachedSearchLayer wraps /api/v1/search/query with auth-aware caching (cache key includes principal so users with different ACLs don't share results). 30s default TTL. Observability: - /api/v1/system/cache/stats response now includes a metrics block with hits/misses/hitRatio/evictions/errors/writes plus read/write latency Timers, and a byType breakdown so coverage gaps are visible per entity-type and per cache-layer. Correctness: - New Invalidatable interface + CacheBundle registry + invalidateEntity helper so future cache layers plug in by implementing one method instead of editing multiple mutation paths. - Edge mutations in LineageRepository.addLineage/deleteLineage invalidate both endpoints; entity mutations in EntityRepository.postUpdate / postDelete / restoreEntity invalidate the lineage rooted at the entity. - Pub/sub handler in CacheBundle iterates registered Invalidatables so remote-pod evictions flow to all layers automatically. Tooling: - docker-compose.cache-off.yml overlay flips CACHE_PROVIDER=none for local A/B testing without tearing down DB/ES volumes. - CachedSearchLayerIT exercises hit-on-second-call, distinct-query misses, distinct-page-size misses, and byType shape via the metrics endpoint. 
Each test gracefully no-ops when the cluster runs cache-off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cache: phase 2 ops + correctness — single-flight, slow-read, negative cache, admin endpoints Builds on the phase 1 commit (`c20a29b11b`) with operability and correctness items from .context/cache-improvements-design.md. All four pieces respect the optional-cache contract: with CACHE_PROVIDER=none they no-op cleanly. P2.3 — Single-flight on CachedSearchLayer Striped<Lock> keyed by SHA-cache-key. 100 concurrent users hitting the same uncached query collapse to one ES call instead of N. SearchResource now uses loadOrCompute so the lock-and-recheck pattern lives inside the cache layer; the supplier is the actual ES call kept tight to minimize lock-hold time. Non-200 upstreams bypass cache and refetch. P2.6 — Slow cache reads logged RedisCacheProvider.get/hget timing checked against cache.slowReadThresholdMs (default 50ms). Exceeding fires a WARN log and bumps a new cache.reads.slow Micrometer counter exposed in /cache/stats.metrics.slowReads. Leading indicator of Redis pressure / network glitch / hot-key contention. P2.4 — Negative caching for not-found entities NotFoundCache marks "we looked, no such entity" with a short TTL (default 30s) so repeated 404 lookups (typo'd FQNs, references to deleted entities) don't hammer the DB. Wired into EntityRepository.find(UUID) and findByName for the !fromCache path. Implements Invalidatable so the postCreate fan-out drops the marker on entity create — without that, create-then-immediately-read would 404 for up to TTL. Added CacheBundle.invalidateEntity to EntityRepository.postCreate so newly-created entities reach every Invalidatable registry layer. P2.5 — Admin cache ops endpoints GET /api/v1/system/cache/keys?pattern=... — SCAN keys, returns count POST /api/v1/system/cache/invalidate?pattern=.. 
— SCAN+UNLINK, returns deleted POST /api/v1/system/cache/invalidate/entity?type=&id=&fqn= — fan to all Invalidatables All admin-only. Pattern endpoints document the "no broad globs" rule — we never want a SCAN over om:prod:* on a busy cluster. Per-entity endpoint goes through the existing Invalidatable registry so future cache layers are reachable from ops without ever touching this code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cache: pipelined mget on CacheProvider + CachedReadBundle.getBatch Adds a foundational batch-read primitive at the provider layer: CacheProvider.mget(List<String>) -> List<Optional<String>> Default implementation does sequential per-key gets (correct, no batching benefit). RedisCacheProvider overrides with a true pipelined version: all GETs are queued under setAutoFlushCommands(false), then flushed once and awaited as a single TCP round-trip. Records hits/misses through the existing CacheMetrics counters and respects the slow-read threshold. Per-key pipelining over true MGET — Redis Cluster requires same-slot keys for MGET; pipelined per-key GETs work transparently across slots without the constraint, at the same network cost. CachedReadBundle.getBatch(entityType, ids) consumes the new primitive for prefetch use cases (UI prefetch on hover, list-then-detail navigation warmup). The list endpoint hot path itself does NOT use this layer — list responses are SQL-batched via EntityRepository.setFieldsInBulk which calls fieldFetchers in bulk, not per-row CachedReadBundle.get. That's why bench3 showed list endpoints at neutral cache_off-to-on ratio: lists already amortize at the SQL layer. The mget primitive is what later phases will plug into when wiring batch-prefetch to specific UI flows. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cache): use unique query in sameQueryHitsCacheOnSecondCall to avoid state pollution Sequential test run on postgres-os-redis caught a flake: the test issued 3 identical "q=" calls expecting at least 1 cold-write. By the time it ran, prior tests in the same JVM session had already cached (q=, index=table_search_index, size=10), so call 1 was a hit, call 2 hit, call 3 hit — total writes=0, asserts failed. Switching to a per-invocation nonce ensures we always start cold, matching the pattern the other 3 tests in this class already use. Confirmed via subsequent parallel-pass run on the same suite where the test passed (different test ordering, fresh cache for that key). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cache: drop search cache TTL from 30s to 2s for create-then-search freshness Integration tests on the postgres-os-redis profile caught a real correctness regression: tests that create an entity and Awaitility-poll for it to appear in search timed out at 30s because our 30s search TTL pinned the pre-create empty result for the entire test window. Same issue surfaces in production: a user creates a domain / table / dashboard and immediately searches for it would see "no results" for up to 30s. 2s caps the staleness while still catching the dominant UI access pattern: multiple components in the same render frame fire identical search queries. Those happen within milliseconds, well inside any reasonable TTL. The longer-term fix is search-cache invalidation on entity writes (a generation counter per entity-type, search keys include the generation, writes bump the generation). That's design-doc-tracked in .context/cache-improvements-design.md but deferred — the 2s TTL is good enough for now, and the more complete invalidation strategy can be a follow-up PR with its own dedicated tests. 
Failing tests under 30s TTL that this fixes: - DomainAssetsColumnExclusionIT (domain create-then-search) - LineageImpactAnalysisIT (owner removal reflected in search) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: cache-tests profile runs full IT suite + new postgres+es+redis CI workflow The cache-tests Maven profile previously ran only the four cache/* IT classes — too narrow to catch cache-correctness regressions in the rest of the codebase. Expanded it to mirror the mysql-elasticsearch profile shape: sequential + parallel failsafe executions, full */IT.java inclusion, postgres + elasticsearch + redis backend, with cacheProvider=redis system property added so every test path exercises the cache layer. Locally, the focused-cache-only run is preserved via mvn verify -P cache-tests -Dit.test='*/cache/IT' New CI workflow integration-tests-postgres-elasticsearch-redis.yml mirrors the structure of integration-tests-postgres-opensearch.yml: - Same triggers (push to main, PR target, merge_group, workflow_dispatch) - Same path filters (openmetadata-service/, integration-tests/, etc.) - Same Maven cache + JDK 21 setup - Runs `mvn verify -pl :openmetadata-integration-tests -Pcache-tests` - Surefire-report publication with fail_on_test_failures Result: PRs touching cache code (or any read path) get automatic CI coverage with redis enabled. Cache-invalidation and stale-data bugs that previously only surfaced in production now have a CI gate before merge — same protection that mysql-elasticsearch and postgres-opensearch provide for the no-cache code paths. Smoke verified locally: `mvn verify -P cache-tests -Dit.test='*/cache/IT'` runs both sequential and parallel passes (6 tests each), all green. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): address PR review feedback for cache improvements Nine review-driven fixes spanning the cache PR (#28012): RedisCacheProvider.mget (bug): - Restructured the auto-flush window so `setAutoFlushCommands(true)` is in the OUTER `finally` of the entire op. The previous structure had the restoration in an inner finally that only fired around the awaitAll call; an exception in the queueing loop or flushCommands() would leave the SHARED connection in auto-flush=false mode, making every subsequent op from any caller silently buffer indefinitely. SearchResource (bug): - Removed the double-call on the non-cacheable response path. The supplier now captures the upstream Response object so the outer code can return it directly when the body isn't cacheable (non-200 or non-String entity) — previously the caller re-invoked searchRepository.search() on every error/non-200, doubling backend load for failing queries. EntityRepository negative cache (edge case): - Hoisted the NotFoundCache fast-path OUTSIDE the `!fromCache` guard in both `find(UUID,...)` and `findByName(...)`. Default callers go in via `find(id, include)` which delegates with fromCache=true; the previous gate made the fast-path unreachable for the most common caller. Also added negative-cache population from the cached path's ExecutionException so repeated requests for a non-existent id do short-circuit after the first miss. SystemResource cache endpoints (security + style): - `/cache/keys` and `/cache/invalidate` now validate the glob pattern via `validateCachePattern` — rejects pure wildcards or patterns with fewer than 6 literal characters before the first wildcard. Stops a careless or malicious admin from issuing `*` or `om:*` that would block the Redis cluster on a large keyspace. ReDoS-safe: linear char scan, no regex backtracking.
- `/cache/invalidate/entity` now also calls `EntityRepository.invalidateCacheForEntity(...)` to evict the Guava L1 caches (`CACHE_WITH_ID`, `CACHE_WITH_NAME`) and propagate via the existing pub-sub channel — the previous code only invalidated the `INVALIDATABLES` registry layers, leaving stale L1 entries. - Replaced fully-qualified class names (`org.openmetadata.service.cache.CacheMetrics`, `jakarta.ws.rs.QueryParam`, `java.util.UUID`) with proper imports per the project style guide. CachedLineage (edge case): - Single-flight stripe lock now keys on the FULL cache key `(rootId, upstreamDepth, downstreamDepth, includeDeleted)` instead of `rootId` alone. Concurrent requests for different depths or include-deleted flags on the same root no longer block each other. CachedSearchLayer (doc): - Javadoc now correctly says default TTL is 2s (was incorrectly 30s) and explains why — see commit `41489056ff` which dropped it from 30s after IT regressions where users couldn't see their own writes for half a minute. CI workflow (bugs + security mitigation note): - Removed `if: steps.cache-output.outputs.exit-code == 0` from the `Set up JDK 21` and `Install Ubuntu dependencies` steps. `actions/cache@v4` exposes `cache-hit`, never `exit-code`; the expression always evaluated to false and those steps NEVER ran. Maven was using whatever JDK shipped with the runner. - Added explicit security note in the workflow header AND on the `Checkout` step documenting why `pull_request_target` is intentional and what the `safe to test` label gate accomplishes — CodeQL flags the pattern, the label gate is the accepted mitigation that mirrors every other integration-tests-*.yml workflow in this repo.
Verified: - mvn compile -pl openmetadata-service → BUILD SUCCESS - mvn test -pl openmetadata-service -Dtest=OpenMetadataAssetServletTest → 9/9 pass - mvn spotless:apply ran clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fix(cache): only negative-cache on real EntityNotFoundException The previous code caught every ExecutionException / UncheckedExecutionException from the Guava cache loader and (a) populated NotFoundCache for 30s, (b) rethrew as EntityNotFoundException. That conflated three very different failure modes: 1. Entity truly doesn't exist → loader throws EntityNotFoundException 2. Entity exists but is invalid → loader throws IllegalStateException 3. Transient DB / deser failure → loader throws JdbiException, IOException Cases 2 and 3 would poison the negative cache, turning a momentary DB hiccup or a single bad row into a sustained 30s 404 for every caller that asks for that id/fqn. Worse, the original cause was masked behind a synthetic EntityNotFoundException, so logs and clients never saw the real failure. This change inspects e.getCause() and: - On EntityNotFoundException: populate NotFoundCache, rethrow the original (not a synthetic) so the caller's `instanceof` checks and message text still work. - On any other RuntimeException: rethrow unchanged — DB blips return 5xx as before, validation errors surface, and the next request can re-attempt without hitting a poisoned cache. - On checked Throwable cause (rare for these loaders): wrap in RuntimeException so the contract is preserved. Applied symmetrically to find(UUID, …) and findByName(String, …). Addresses gitar-bot review on PR #28012: https://github.com/open-metadata/OpenMetadata/pull/28012#discussion_r... 
(negative cache poisoning) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): copilot review — blank param, javadoc, mget hardening Four review comments from PR #28012 review 4266159401: SystemResource.invalidateCacheForEntity (line 1069 → blank query params): `?type=X&id=&fqn=` slipped past the required-params check because only `null` was treated as absent. Normalize blank id/fqn to null up front so the missing-both branch fires correctly and the downstream CacheBundle / EntityRepository calls receive a clean null instead of an empty string. CacheKeys.search/childrenPage (line 116 → orphaned Javadoc): When the search() helper was added between the children-page Javadoc and the childrenPage() method, the Javadoc got stranded above the wrong method. Move it back so javadoc tooling generates accurate docs. RedisCacheProvider.mget (line 610 → shared-connection auto-flush race): setAutoFlushCommands(false) toggles state on the shared Lettuce connection — two concurrent mgets could overlap and one caller's commands would buffer until the other restored auto-flush, surfacing as latency spikes / hangs on other paths sharing the connection. Wrap the pipeline in a new instance-level ReentrantLock so only one mget runs the auto-flush dance at a time. try/finally still restores auto-flush unconditionally; lock release sits in an outer finally. RedisCacheProvider.mget (line 621 → unbounded f.get() on timeout): Previously LettuceFutures.awaitAll(...) returned a boolean we ignored; if it timed out, the subsequent f.get() calls were unbounded and would block the request thread until the Lettuce event loop eventually gave up. Capture the boolean, cancel non-done futures on timeout (so f.get() returns CancellationException instead of blocking), and log a warning with the timeout value and key count for operators. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): mget partial timeout must trip the circuit breaker The previous mget rewrite cancelled in-flight futures on awaitAll timeout but still called recordSuccess() at the end of the happy-path. That fed consecutiveSuccesses on every partial timeout, so a Redis instance that was consistently slow (answering some keys, dropping others) would never trip the breaker — masking real backend degradation as healthy. Branch on the captured allCompleted boolean: - all futures completed → recordSuccess() as before - partial timeout → recordFailure(TimeoutException) and bump CacheMetrics.recordError() so the breaker's sliding-window failure detector picks it up and the metric reflects the degraded state No other behaviour change — the per-key fallback Optionals still surface to callers either way. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): mget shorter critical section + cache/stats + cache/keys doc Three review comments from PR #28012 second copilot pass: RedisCacheProvider.mget (RedisCacheProvider.java:624 — shared-connection hold time): previous code held setAutoFlushCommands(false) for the entire queue+flush+await window. Other paths (single get/set/hget on the same Lettuce connection) would buffer until our await finished. Shrink the critical section to just queue+flush: once flushCommands() returns, the batch is on the wire and we can restore auto-flush and release the pipelineLock before awaiting. A slow Redis now blocks only the calling thread, not every concurrent caller using the shared connection. Cancel-on-timeout and breaker accounting are unchanged. SystemResource.getCacheStats (line 962 — noisy WARN when cache disabled): CacheMetrics.getInstance() logs WARN every call when the metrics singleton isn't initialized, which happens whenever CACHE_PROVIDER=none. An ops dashboard polling /system/cache/stats on a cache-off deployment would spam the log. 
Gate the metrics call on cacheProvider.available() so the WARN never fires in that configuration. Stats payload still includes provider-level fields; just no `metrics` key when cache is off. SystemResource.scanCacheKeys (line 1006 — OpenAPI lies about count param): Description claimed "bounded by the count parameter" but no count param exists; scanCount() walks the full cursor. Rewrote the description to state the actual safety mechanism: the validateCachePattern enforces a 6-character literal prefix before any wildcard, so '' and 'om:' are rejected at validation. Reflects what the endpoint actually does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): copilot review pass 3 — hot-path L1 check + lineage hash + cleanups Eight comments from the latest copilot review on PR #28012: 1. SystemResource.getCacheStats: gate metrics on cacheConfig.provider != none instead of cacheProvider.available(). When Redis is configured but the circuit breaker is tripped, app-level counters are exactly what an operator needs to diagnose the outage — suppressing them while the provider is "down but configured" hides the diagnostic signal. Also downgrade CacheMetrics.getInstance() WARN → DEBUG so a poller loop doesn't spam logs in the entirely-normal cache-off state. 2. CachedReadBundle.getBatch contract: the method is documented as returning a list 1:1 with entityIds, but bypass returned Collections.emptyList() and callers indexing by position would shift off the rails. Return a same-size list of nulls under bypass so the positional contract holds regardless of cache state. 3+4. CacheBundle.invalidateEntity / Invalidatable.invalidate javadocs claimed they were called from EntityRepository.postUpdate / postDelete / restoreEntity. They are NOT (only postCreate, the pub-sub handler, and the admin endpoint reach this path). Updated both javadocs to reflect actual call sites so future Invalidatables aren't built on a wrong invalidation contract. 5+6. 
EntityRepository.find / findByName: check Guava L1 (getIfPresent) FIRST, NotFoundCache only on L1 miss. The previous shape consulted NotFoundCache before L1, adding one Redis GET per cached read — a regression on the hottest read path. L1 hit now serves with zero Redis traffic; the negative cache short-circuits only when the loader would otherwise pay for a DB / Redis-L2 round trip. 7. CachedLineage redesign: variants for one root now live as fields of a single Redis hash (HSET / HGET) instead of separate keys. Invalidate is one DEL — O(1) — instead of SCAN-and-iterate (O(N) over keyspace). This matters because invalidate fires on the hot write path (entity updates and lineage-edge mutations) and the SCAN cost grew linearly with cache size. CacheKeys.lineageGraphPattern is gone; new helpers are lineageGraphHash(rootId) and lineageGraphField(up, down, incDel). 8. SystemResource.invalidateCacheForEntity: when only fqn is supplied, resolve to id server-side via Entity.getEntityRepository(type). findByName(...) before fanning out. Id-keyed cache layers (lineage, CACHE_WITH_ID, NotFoundCache id-side) need the UUID; the previous shape silently skipped them. Lookup failures are logged at DEBUG and the request still proceeds with fqn-only invalidation — admin force-invalidate is best-effort by design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): lineage hash TTL claimed only by first writer (EXPIRE NX) Previous shape called `hset(hashKey, fields, ttl)` which translated to HSET + EXPIRE under the hood. Every variant write therefore reset the hash's expiry — variant A cached at T=0 with TTL=60, variant B cached at T=55, and A's effective lifetime jumped to 115s instead of the intended 60s. Under a constant trickle of variant writes on a hot root, the "stale" variant could effectively live forever. Split the operation: - CacheProvider.hset(key, fields) — new overload, no TTL touch. 
Defaults to a 365-day TTL so providers that don't override get a long-lived key rather than an immortal one. - CacheProvider.expireIfAbsent(key, ttl) — EXPIRE … NX semantics: set the TTL only when the key has no prior expiry. Default returns false (providers that can't express NX get no extension benefit, but no regression). - RedisCacheProvider implements both: HSET without expire, then EXPIRE with ExpireArgs.Builder.nx(). Falls back gracefully on Redis < 7.0 (logs at DEBUG, returns false). CachedLineage.safeHset now uses the split shape — the first writer to seed a hash establishes the 60s window; subsequent variant writes leave the expiry alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): mget unavailable-path alignment + lineage deser fallback Two copilot review comments on PR #28012: RedisCacheProvider.mget (line 646): when `available == false` we returned `Collections.emptyList()`, violating the 1:1 positional contract that callers (CachedReadBundle.getBatch and friends) rely on. Match the error-fallback branch: return one Optional.empty() per requested key so caller-side indexing stays aligned regardless of provider health. Truly-empty input keeps returning empty list (no positions to align). LineageRepository.getLineage (line 1345): unconditional readValue on the cached JSON would throw and fail the request if Redis held a partial/corrupted/old-schema value — turning cache corruption into a persistent 500 until TTL expiry. Wrap the deserialize in try/catch; on failure log WARN with the root id and depth, invalidate the affected root's lineage hash, and fall through to a fresh computeLineage(). User sees the same answer as cache-off; subsequent requests repopulate. 
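The 1:1 positional contract for `mget` can be illustrated with a small stand-in. This is a sketch under assumptions: `PositionalMget` and its in-memory `store` map are hypothetical and replace the real Redis round trip; only the shape of the return value mirrors what the commit describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in (assumption): an in-memory map plays the role of Redis.
final class PositionalMget {

  static List<Optional<String>> mget(
      List<String> keys, boolean available, Map<String, String> store) {
    if (keys.isEmpty()) {
      return Collections.emptyList(); // no positions to align
    }
    if (!available) {
      // One empty slot per requested key, so result.get(i) always pairs with keys.get(i).
      return new ArrayList<>(Collections.nCopies(keys.size(), Optional.<String>empty()));
    }
    List<Optional<String>> values = new ArrayList<>(keys.size());
    for (String key : keys) {
      values.add(Optional.ofNullable(store.get(key)));
    }
    return values;
  }
}
```

A caller that indexes by position sees the same list size whether the provider is healthy or not; only the contents differ.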
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): expireIfAbsent falls back to plain EXPIRE on NX failure The previous shape returned false silently when EXPIRE … NX wasn't supported (Redis < 7.0 syntax error, transient failure). That meant the preceding HSET-without-ttl call could leave the lineage hash key with no expiry at all, accumulating in Redis memory until the next manual invalidation. Catch the NX failure, log at DEBUG, and issue a plain EXPIRE so the key still gets a bounded lifetime. The trade-off: on older Redis, every variant write extends the expiry — strictly worse than the NX semantics on a 7.0+ deployment, but vastly better than the alternative of permanent unbounded keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): copilot review pass 5 — dedicated mget conn + breaker + IT isolation + key collision Five comments from the latest copilot review on PR #28012: RedisCacheProvider.expireIfAbsent breaker bookkeeping (line 432, gitar-bot): the NX-fallback path issued a plain EXPIRE without recordSuccess() / recordFailure(), so a real network blip there was invisible to the sliding-window failure detector. Both success and failure now feed the breaker, consistent with every other Redis-calling method in the class. RedisCacheProvider.mget shared-connection hazard (line 692): even with pipelineLock, single-key callers using syncCommands/asyncCommands on the same connection had their commands buffered for the duration of the auto-flush-off window. Switched to a dedicated `pipelineConnection` / `pipelineAsyncCommands` created at init time and closed on shutdown. The shared connection's auto-flush is never toggled now, so unrelated request paths can't be starved by mget. pipelineLock still serializes mget vs mget on the dedicated connection. SystemResource.invalidateCacheForEntity fqn→id resolution (line 1113): the resolution call used `findByName(fqn, ALL, fromCache=true)`. 
That path consults NotFoundCache and the L1/L2 caches, which an admin force-invalidate is explicitly trying to recover from — a poisoned negative entry would short-circuit the resolution and silently skip every id-keyed cache layer. Switched to fromCache=false so the resolution always goes to the DB; only then can we trust the id we hand to CacheBundle / EntityRepository invalidation. CachedSearchLayerIT.java parallel-execution flakiness (line 50): the test assertions depend on deltas in the global /system/cache/stats counters. Under @Execution(CONCURRENT) other ITs issuing searches in parallel inflate the counters and the deltas either don't show up (false negative) or come from someone else's hits (false positive that masks broken cache keying). Marked @Isolated + ExecutionMode.SAME_THREAD so the class runs alone within its window. CachedSearchLayer.buildKey ambiguous encoding (line 220): fields were joined with a raw `|` delimiter, no escaping. A query string containing `|idx=foo` would produce the same preimage as a different (principal, index, query) tuple — cache-key collision → wrong cached response served to the wrong user. Added length-prefixed field encoding (`name=<utf8-bytes>:value|`); two distinct logical tuples can no longer serialize to the same hash input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>	2026-05-13 06:41:09 -07:00
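The `CachedSearchLayer.buildKey` collision described in the commit above can be reproduced with a minimal sketch. The class and method names here are hypothetical, and the two-field layout (`q`, `idx`) is an assumed simplification of the real (principal, index, query) tuple; the length-prefix format follows the `name=<utf8-bytes>:value|` shape quoted in the message.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch (assumption): field names and layout are illustrative only.
final class CacheKeyPreimage {

  // Raw '|' delimiter with no escaping: a value containing "|idx=" can imitate a field boundary.
  static String naive(String query, String index) {
    return "q=" + query + "|idx=" + index;
  }

  // Length-prefixed field: the UTF-8 byte count fixes where the value ends.
  static String encodeField(String name, String value) {
    int byteLength = value.getBytes(StandardCharsets.UTF_8).length;
    return name + "=" + byteLength + ":" + value + "|";
  }

  static String lengthPrefixed(String query, String index) {
    return encodeField("q", query) + encodeField("idx", index);
  }
}
```

With the naive form, the tuples (`a|idx=foo`, `bar`) and (`a`, `foo|idx=bar`) both serialize to `q=a|idx=foo|idx=bar`. With the length prefix they differ, because the declared byte counts disagree.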
Harshit Shah	77a85bffde	[CI] Add on-demand Playwright search-nightly workflow (#27908) * test(ci): add on-demand playwright search-nightly workflow Create a manual Playwright search-nightly workflow with the same bootstrap, reporting, Slack notification, and cleanup structure as the SSO nightly job. Add a dedicated search-nightly Playwright project and a basic nightly search smoke spec without using issue-closing keywords for #3792. * address comments * revert changes * minor updates	2026-05-13 12:18:31 +05:30
Mohit Yadav	8ac53bfecc	Merge branch 'main' into java-playwrights	2026-05-11 19:48:37 +05:30
mohitdeuex	74bda64340	Merge openmetadata-java-playwright into openmetadata-integration-tests Folds the UI integration test module into the canonical integration-tests module under a `ui-it` Maven profile. One test home, one classpath, no more test-jar reinstall dance or cross-module IntelliJ classpath quirks. Why: most of the value the UI test module shipped was reusable backend infra (factories, search helpers, server harness) that worked fine without a browser. Keeping it in a separate module forced multiple unnecessary boundaries — test-jar publication, IntelliJ test-classes- jar-tests phantom paths, src/test placement for AuthBackend code that should have been in src/main, "where does this test go?" friction. Layout in integration-tests: org/openmetadata/it/auth/ JwtAuthProvider + AuthBackend / OidcBackend / AuthSession / TokenRefresher / ... org/openmetadata/it/server/ ContainerizedServer / ServerHandle / ExternalServer / sso/ profile records org/openmetadata/it/search/ ReindexHelpers / SearchClient / SearchAssertions / SearchQueryHelper org/openmetadata/it/ui/ SessionBrowser / UiSession / UiSessionExtension / TraceRecorder / ClipboardHelper / pages/ org/openmetadata/it/scenarios/ UIIT.java tests org/openmetadata/it/util/ SdkClients + UiTestServer / OssTestServer org/openmetadata/it/factories/ existing + EntityLoader Build: - integration-tests pom gains com.microsoft.playwright:playwright (test scope). Other testcontainers / jwt deps already there. - test-jar publish-test-harness includes pattern expanded to ship server/, search/, ui/ packages alongside auth/, util/, factories/, bootstrap/. Downstream consumers (collate) inherit the full UI test harness, not just backend factories. - New `ui-it` profile runs `/UIIT.java` with skip.embedded.bootstrap =true, PW_VIDEO=true, per-method parallel @ 0.5 factor. Mirrors the failsafe execution from the old playwright module. 
- Existing parallel-tests executions across all profiles gain a `*/UIIT.java` exclude so embedded-mode IT runs don't pick up UI tests they can't run. Module removal: - openmetadata-java-playwright/ deleted. - parent pom <modules> entry removed. - .github/workflows/java-playwright-nightly.yml updated to build and test `openmetadata-integration-tests -P ui-it` instead. Docs: - MIGRATION_TRACKING.md and CONVENTIONS.md from the old module are UI_MIGRATION_TRACKING.md / UI_TEST_CONVENTIONS.md at the integration-tests root. No test code semantics changed — pure reorganization. The 4-5 backend- flavored UIIT.java tests we identified as misplaced (running against SDK with vestigial UI checks) still live under scenarios/ for now; a follow-up will rename them to IT.java and have them target the embedded TestSuiteBootstrap directly to drop their ~3-minute Docker boot overhead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 19:13:29 +05:30
Sriharsha Chintalapani	d3bbbefe37	fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999 ) * fix(rdf): dedupe lineage edges and broaden PROV-O coverage The RDF Knowledge Graph endpoint was emitting two edges per lineage relationship — once as `om:UPSTREAM` (forward) and once as `prov:wasDerivedFrom` (reverse) — because the parser preserved each predicate's native subject/object orientation instead of canonicalizing both into a single `(upstream, downstream)` edge. Also extend PROV-O coverage so external SPARQL clients can use the W3C Provenance vocabulary directly: - `prov:Entity` / `prov:Activity` / `prov:Agent` class typing on datasets / pipelines / users - `prov:wasAttributedTo` mirror of `om:owners` - `prov:generated` (inverse of existing `wasGeneratedBy`) and `prov:used` on lineageDetails so the Entity → Activity → Entity chain is complete - `prov:hadPlan` + `prov:Plan` for SQL transformation recipes - `prov:startedAtTime` / `prov:endedAtTime` on Activity instances - `prov:wasAssociatedWith` Activity → Agent linking - `prov:invalidatedAtTime` on soft-deleted entities Other RDF cleanups in the same area: - LineageDetails URIs are now deterministic (driven by from/to ids instead of a timestamp), so re-indexing collapses duplicate Activity resources via the existing DELETE+INSERT idempotency - Skip emitting the redundant `om:owners` JSON-string literal — the mapped path already produces clean `om:hasOwner <agent>` triples - Skip empty `[]` array literals in the unmapped path - Propagate failures from `RdfRepository.{addRelationship, addLineageWithDetails, bulkAddRelationships, bulkAddGlossaryTermRelations}` instead of silently swallowing them, so downstream callers can surface the failure Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf-index-app): surface Fuseki failures in app run record Per-entity and per-batch failures from the RDF index app used to be logged via SLF4J only — they never made it 
into the AppRunRecord, so the UI/run history showed "completed" even when every entity had silently failed to write to Fuseki. - `RdfBatchProcessor.processEntities` now captures the last error per entity, returns it in `BatchProcessingResult.lastError`, and accumulates relationship-processing failures into the same result. - Relationship and lineage processing methods (`processBatchRelationships`, `processLineageRelationship`, `processGlossaryTermRelations`) return structured results with failure counts and last-error messages instead of `void`, so failures are visible to the partition worker. - `RdfIndexApp` records the failure on `jobData` for both the distributed and non-distributed code paths, so users see a real error message in the run history (e.g. "Failed to write entity X to Fuseki: ConnectException"). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * perf(rdf-index-app): port distributed-mode improvements from SearchIndex The RDF distributed-indexing fork was lagging behind several SearchIndex improvements that addressed concrete reliability and throughput issues. Port them across: Core perf / reliability - Precomputed partition start cursors: coordinator walks each entity once via keyset pagination at job init and caches the boundary cursor per (jobId, entityType, rangeStart). Workers consult the cache before falling back to the OFFSET-based path. Eliminates the previous O(N²) per-partition cursor lookup. - `cancelInFlightPartitions` + `requestStop` + `checkAndUpdateJobCompletion` on the coordinator. Stop now cancels both PENDING and PROCESSING partitions in a single SQL update and immediately drives the job status from STOPPING → STOPPED, so the UI status no longer hangs while workers drain. - Selective field hydration: `RdfPartitionWorker.readEntitiesKeyset` uses `ReindexingUtil.getSearchIndexFields(entityType)` instead of `List.of("")`, avoiding expensive fetchers (e.g. fetchAndSetOwns) per batch. 
- Partition heartbeat thread: virtual thread refreshes `lastUpdateAt` every 30s for partitions actively being processed by this server, so the stale reclaimer no longer interrupts active work. - `MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5` backpressure: claim path rejects when the server already holds 5 PROCESSING partitions, giving fair distribution across pods. Verified the existing claim DAO uses `FOR UPDATE SKIP LOCKED` for both MySQL and Postgres. - Gate WebSocket stat broadcasts during the STOPPING phase so the Quartz-scheduler-driven STOPPED status push isn't overwritten. Multi-server scaffolding (single-pod is unaffected) - `RdfPollingJobNotifier`: DB-polling discovery for other server pods to find an in-flight RDF reindex they can join. - `RdfEntityCompletionTracker`: per-entity-type partition tracking with callback firing once all partitions for an entity complete, foundation for early per-entity index promotion. Tests: precomputed-cursor cache lookup, in-flight backpressure, cancelInFlight delegation, completion tracker callback semantics, notifier start/stop. DAO additions on `rdf_index_partition`: - `cancelInFlightPartitions(jobId, now)` — covers both PENDING and PROCESSING in one statement - `countInFlightPartitionsForServer(jobId, serverId)` — backpressure - `countPartitionsByStatus(jobId, status)` — used by completion check Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> fix(ui-apps): hide misleading data on synthetic 'CurrentConfig' row When an app has no run history, AppRunsHistory fabricated a synthetic placeholder row that looked like a real run — `runType: "CurrentConfig"`, a fake `Run At` timestamp pulled from `appData.updatedAt`, an ever-growing `Duration` (`now − updatedAt`), and an active `Stop` button that targeted nothing. Render `--` for `Run At`, `Run Type`, and `Duration` on synthetic rows, and hide the `Stop` button so users no longer see "Run now → 19-minute Running with Stop button" when the actual job never registered. 
Real app runs are unaffected — they still display `runType` from the backend (OnDemandJob, Hourly, Daily, Custom, etc.). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): address PR review findings Four issues raised in PR #27999 review: - Cursor format consistency in walkAndRecord (bug): The defensive branch produced cursors via a custom `{name, id}` map while the regular path used `repo.getCursorValue()`. For entities with quoted names these encodings diverge — a quoted-name entity could land in the cache with a cursor incompatible with what the worker fetches via keyset pagination. Track the last seen entity reference and run it through `repo.getCursorValue()` in both paths. `encodeBoundaryCursor` is removed. - Adaptive scheduling in RdfPollingJobNotifier (perf): The previous implementation woke the scheduler thread every 1s and short-circuited inside the poll method when idle. Reschedule the task at the appropriate interval (1s active / 30s idle) when `setParticipating` flips, so the thread genuinely sleeps when idle. - Cursor cache cleanup on startup recovery (edge case): `partitionStartCursors` was only evicted by `refreshAggregatedJob` / `checkAndUpdateJobCompletion`. If a coordinator crashed mid-job and never reached either, the cache entry leaked until process restart. Add `evictStaleCursorCacheEntries()` invoked by `performStartupRecovery` that drops entries for jobs that no longer exist in the DB or are already terminal. - Consolidate describeError helpers (quality): `describeError`, `describeBulkError`, and `describeLineageError` in `RdfBatchProcessor` all walked the cause chain and formatted a prefixed message with the same logic. Reduced to a single `describeError(prefix, error)` plus a thin `describeEntityError` adapter for the per-entity call site. 
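The consolidated helper described in the last item could look roughly like the sketch below. This is not the `RdfBatchProcessor` source: the output format (`prefix: ExceptionName (message)`), the depth limit, and the `describeEntityError` signature are assumptions made for illustration.

```java
// Hypothetical sketch (assumption): message format and signatures are illustrative.
final class ErrorDescriptions {

  private static final int MAX_CAUSE_DEPTH = 32; // guards against cyclic cause chains

  /** Walks to the root cause and formats it behind the given prefix. */
  static String describeError(String prefix, Throwable error) {
    Throwable root = error;
    for (int depth = 0; depth < MAX_CAUSE_DEPTH; depth++) {
      Throwable cause = root.getCause();
      if (cause == null || cause == root) {
        break;
      }
      root = cause;
    }
    String detail = root.getClass().getSimpleName();
    String message = root.getMessage();
    if (message != null && !message.isEmpty()) {
      detail += " (" + message + ")";
    }
    return prefix + ": " + detail;
  }

  /** Thin adapter for the per-entity call site; it only builds the prefix. */
  static String describeEntityError(String entityId, Throwable error) {
    return describeError("Failed to write entity " + entityId + " to Fuseki", error);
  }
}
```

Keeping one cause-chain walk means the bulk and lineage call sites differ only in the prefix they pass in.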
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf-index-app): avoid double workerExecutor.shutdownNow() in stop() stop() called workerExecutor.shutdownNow() inline AND through cleanupLocalExecution -> shutdownWorkerExecutor, which broke the DistributedRdfIndexExecutorTest.stopAndCoordinatorCleanupOnlyTearDownLocalExecutionOnce verify(workerExecutor, times(1)).shutdownNow() expectation. Drop the inline call — cleanupLocalExecution is the single owner of the shutdown path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * ci: drop redundant DB matrix from openmetadata-service unit tests The {mysql, postgresql} strategy matrix on openmetadata-service unit tests doubled CI cost without adding signal: both jobs ran the same surefire suite. The `-Pmysql` / `-Ppostgresql` profiles are defined only in `openmetadata-sdk/pom.xml` (lines 190-206), set a single `test.database` property, and that property is consumed exclusively by the failsafe plugin (integration tests `IT.java` / `IntegrationTest.java`), which only runs under `-Pintegration-tests` — not enabled here. `openmetadata-service` itself has zero tests that read `test.database` or use `MySQLContainer`/`PostgreSQLContainer` (verified by grep). The only testcontainer-based DB code in the repo lives in `openmetadata-integration-tests`, a different module that this workflow doesn't build. Run the unit suite once. The `openmetadata-service-unit-tests-status` required-check aggregator is unaffected (it depends on the renamed job which still has the same name). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): address Copilot PR review findings Six correctness issues raised on PR #27999: - Lineage-details DELETE was too broad (RdfRepository): the cleanup step deleted all `<fromUri> om:hasLineageDetails ?d` triples, so reindexing one (fromId, toId) edge wiped lineage-details links for every other downstream of the same source entity. 
Pin the delete to the specific `<fromUri> om:hasLineageDetails <detailsUri>` triple. Same with prov:generated cleanup — anchor it to the specific detailsUri instead of any details resource. - Predicate not flipped during canonicalization (RdfRepository): `parseEntityGraphEdgesFromResults` swapped subject/object for reverse-direction predicates (`prov:wasDerivedFrom`, `prov:wasInfluencedBy`) but kept the original predicate URI on the resulting EdgeInfo. Exported graphs could carry semantically invalid triples like `<upstream> prov:wasDerivedFrom <downstream>`. Add `forwardEquivalentPredicate` to substitute the OM-native forward predicate when the direction flips. - `dct:modified` was an invalid xsd:dateTime (RdfPropertyMapper): `entity.getUpdatedAt().toString()` returns the epoch-millis Long as a string, but the literal was tagged `xsd:dateTime`. Convert via `Instant.ofEpochMilli(...).toString()` so the lexical form matches the type — same fix already in place for prov:invalidatedAtTime. - Unmapped EntityReference arrays were dropped entirely (RdfPropertyMapper): the previous fix to skip noisy JSON-string literals also dropped fields like `domains`, `reviewers`, `voters` for entity contexts that don't have a JSON-LD mapping for them — the unmapped path was the only path emitting them, so nothing landed in RDF. Expand each array element through `addEntityReference` so the data still produces proper `om:<fieldName> <ref>` triples; mapped-path duplicates are collapsed by Jena's Model dedupe. - Partition failure detection missed reader errors (DistributedRdfIndexExecutor): the EntityCompletionTracker was fed `result.errorMessage() != null`, but `RdfPartitionWorker` can increment `failedCount` from `readerErrors` without ever setting `lastError`. Use `result.failedCount() > 0` so partitions whose failures came from `ResultList.getErrors()` are also marked as failed when promoting an entity. 
- `COMPLETED_WITH_ERRORS` was hidden when failedRecords == 0 (RdfIndexApp): the coordinator marks a job COMPLETED_WITH_ERRORS whenever any partition is FAILED or CANCELLED, including for user-initiated stops where no record-level failures accrued. The monitor's `completedWithErrors` gate required `failedRecords > 0`, so those terminal states never hit `jobData.setFailure(...)` and the run record showed success. Drop the failedRecords precondition and tailor the fallback message based on whether there are record-level failures or partition-level only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): separate relationship failures + type lineage as prov:Activity Two more PR review findings on #27999: - Relationship failures inflated failedRecords stat: `processEntities` was folding relationship/lineage edge failures into `failedCount`, which becomes `failedRecords` in the index stats. Records there mean entities, computed from entity counts in `totalRecords`. Counting per-edge relationship failures could push `failedRecords` above `processedRecords`/`totalRecords` and produce nonsensical per-entity stats. Track them separately: add `relationshipFailureCount` to `BatchProcessingResult` and `PartitionResult`. `failedCount` now stays entity-level. The completion tracker is fed the broader `result.hasAnyFailure()` so partitions where relationship triples failed don't get prematurely promoted as success even though their entity writes succeeded. - `detailsResource` wasn't typed as prov:Activity: the resource carries Activity-shaped predicates (prov:startedAtTime, prov:endedAtTime, prov:used, prov:hadPlan, prov:wasGeneratedBy, prov:wasAssociatedWith) but only the OM-specific `om:LineageDetails` rdf:type. Add an explicit `rdf:type prov:Activity` so PROV-O reasoners and federated SPARQL clients recognize it as an Activity without having to learn the OM type. 
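The split between entity-level and relationship-level failures can be sketched as a small value class. `PartitionResultSketch` is a hypothetical name, and `combinedFailureCount()` is an illustrative helper; the fields only mirror the `failedCount`, `relationshipFailureCount`, and `hasAnyFailure()` names mentioned in the commit.

```java
// Hypothetical sketch (assumption): not the real BatchProcessingResult / PartitionResult types.
final class PartitionResultSketch {
  final int processedCount;
  final int failedCount; // entity-level failures only, so it can never exceed processedCount
  final int relationshipFailureCount; // per-edge failures, tracked separately

  PartitionResultSketch(int processedCount, int failedCount, int relationshipFailureCount) {
    this.processedCount = processedCount;
    this.failedCount = failedCount;
    this.relationshipFailureCount = relationshipFailureCount;
  }

  /** Broader signal: true when either entities or relationship triples failed. */
  boolean hasAnyFailure() {
    return failedCount > 0 || relationshipFailureCount > 0;
  }

  /** Illustrative helper for reporting; not a value to store as failedRecords. */
  int combinedFailureCount() {
    return failedCount + relationshipFailureCount;
  }
}
```

For example, a partition with 10 entities written and 3 failed edges keeps `failedCount` at 0, so per-entity stats stay consistent, while `hasAnyFailure()` still prevents it from being promoted as a clean success.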
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): label lineage edges relative to focal node The Knowledge Graph view was labeling every edge with relation type "upstream" as "Upstream" regardless of direction relative to the focal node. For a focal node F, the raw stored relation `(F, X, upstream)` means "F is upstream of X" — i.e. X is downstream of F. The previous output labeled both `F → X` and `X → F` edges as "Upstream", which made bidirectional lineage look like a duplicated relation. Re-orient the label in `convertEdgesToGraphData` based on whether the focal is the edge's source or target: - focal → X → "Downstream" - X → focal → "Upstream" - non-focal-touching edges keep the raw relation label. Reported on a sample-data table with a circular lineage cycle (`dim_customer ↔ fact_orders`) where both directions showed "Upstream". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): close remaining Copilot review gaps Three findings from PR #27999's third review pass — all about failure signals being silently dropped between layers: - `RdfIndexApp.processTask` ignored relationship failures: only `result.failedCount() > 0` was treated as a failure, so partitions whose Fuseki relationship/lineage writes failed (incrementing `relationshipFailureCount` but not `failedCount`) never wrote `jobData.failure`. Switch to `result.hasAnyFailure()` and report the combined count. - `checkAndUpdateJobCompletion` ignored partition `lastError`: a partition can finish COMPLETED with `lastError` set when a relationship bulk write was caught and recorded but didn't bump `failedRecords` or flip the partition to FAILED. The job would then go to COMPLETED even though there were real failures. Treat the presence of any `rdf_index_partition.lastError` as an error signal — promote to COMPLETED_WITH_ERRORS and aggregate sample errors into the job's errorMessage if it was blank. 
- `forwardEquivalentPredicate` mapped to a non-existent `om:DOWNSTREAM` URI: OpenMetadata only stores lineage with `om:UPSTREAM` (forward) and `prov:wasDerivedFrom` (reverse PROV-O pair); there is no `om:DOWNSTREAM` predicate written anywhere — the downstream view is derived by reading the same UPSTREAM edge from the other side. Map both `prov:wasDerivedFrom` and `prov:wasInfluencedBy` to `om:UPSTREAM` (both are reverse-direction causation predicates: in `B wasDerivedFrom A` / `B wasInfluencedBy A` the source is A and effect is B, so the canonical forward predicate is the same). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Fix RDF tag mapper * Fix all the comments Cherry-picked from #27562 (without bin/ autogenerated noise). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Align RdfPropertyMapper tests with refactor and isolate ontology export IT RdfPropertyMapperTest still referenced the removed addVotes helper and expected addStructuredProperty to dispatch votes — both gone after votes was added to IGNORED_PROPERTIES. Update the assertions accordingly. GlossaryOntologyExportIT timed out on the full suite because it flips a global RDF singleton in @BeforeAll and each test blocks a server thread on synchronous Fuseki writes. SAME_THREAD only serialized methods within the class — concurrent classes still raced for server threads. Adding @Isolated matches the pattern already used by RdfResourceIT for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): align addCertification typing + relationType after predicate flip Two findings on PR #27999 from the post-cherry-pick review pass: - `addCertification` mis-typed glossary-source certifications and skipped skos:Concept: it always emitted `om:Tag` regardless of source, even though `resolveTagResource` returns a glossaryTerm URI when the certification points at a glossary term. 
It also didn't add `skos:Concept` (or the `createTypeResource("tag")` `skos:Concept` for classification tags), so SPARQL queries filtering certification targets by `a skos:Concept` missed them while `addTagLabel`-emitted tags were findable. Mirror `addTagLabel`: branch on source (`Glossary` vs `Classification`), emit the right primary type plus `skos:Concept` (glossary) or `om:Tag` (classification), and include `om:tagSource`. - `relationType` left stale after predicate flip: when `parseEntityGraphEdgesFromResults` flipped subject/object for a reverse-direction predicate and rewrote `canonicalPredicate` to `om:UPSTREAM`, it kept the original `relationType` derived from the reverse predicate. So `prov:wasInfluencedBy` produced an EdgeInfo with `relationType=downstream` + `predicate=om:UPSTREAM` — internally inconsistent, and the mismatched `edgeKey` prevented dedup against an existing UPSTREAM edge with the same endpoints. Re-derive `relationType` from the canonical predicate after the flip. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): close 2 review findings + add parser-helper unit tests Two outstanding Copilot findings on PR #27999 plus targeted unit coverage for the helpers that drive lineage canonicalization. Findings: - `colLineageUri` collision risk (RdfRepository): the deterministic key replaced non-alphanumerics in `toColumn` with `_`, so distinct column names (e.g. `a-b` vs `a_b`) collapsed onto the same URI, which would lose / overwrite column-lineage resources during reindex. Append the loop index as a tiebreaker so distinct columns keep distinct URIs. - `createTypeResource` missing dprod prefix (RdfPropertyMapper): the `getNamespace` switch didn't recognize `dprod`, so `RdfUtils.getRdfType("dataProduct")` (returns `dprod:DataProduct`) produced an invalid `dprod:DataProduct` URI on the wire. Added the `DPROD_NS = https://ekgf.github.io/dprod/` constant and a `dprod` case in the switch. 
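The `colLineageUri` collision described above can be illustrated with a minimal Python sketch. The real code lives in the Java `RdfRepository`; the function names, the `urn:lineage` base, and the URI layout here are hypothetical and only show why appending the loop index keeps sanitized column names distinct.

```python
import re


def sanitize(column_name: str) -> str:
    # Replace every non-alphanumeric character with "_", as the commit describes.
    return re.sub(r"[^A-Za-z0-9]", "_", column_name)


def col_lineage_uris(base_uri: str, to_columns: list[str]) -> list[str]:
    # Hypothetical URI layout: the loop index is appended as a tiebreaker so
    # that columns whose sanitized names collide still get distinct URIs.
    return [
        f"{base_uri}/{sanitize(column)}_{index}"
        for index, column in enumerate(to_columns)
    ]
```

Without the index, `a-b` and `a_b` would both map to `.../a_b`; with it, each position in the column list gets its own resource.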
Coverage: - New `RdfParserHelpersTest` exercises the canonicalization helpers via reflection: `isReverseDirectionPredicate` (recognizes PROV-O causation predicates, ignores forward predicates), `forwardEquivalentPredicate` (both `wasDerivedFrom` and `wasInfluencedBy` collapse to `om:UPSTREAM` so dedup works), `relativeRelationLabel` (focal-relative Upstream/Downstream flipping with all the boundary cases — non-focal edges, non-lineage relations, null focal). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): merge array contexts before per-field resolution The third (low-confidence "suppressed") finding on review 4256830399 turned out to be a real duplication: when a field is mapped in one context map of an array context but absent from another, the previous processArrayContext ran processContextMappings once per map. The pass where the field IS mapped emits the proper `om:hasOwner <ref>` triples (plus `prov:wasAttributedTo`); the pass where the field is absent falls through to processUnmappedField and emits an additional `om:owners <ref>` triple. Net: two predicates for the same logical relationship. Verified on the live Fuseki: 113 `om:hasOwner` triples vs 112 `om:owners` triples — one set per pass. Fix: flatten all context maps in the array into a single merged map once, then iterate entity fields exactly once against that combined view (later contexts win on key conflicts, matching JSON-LD context merge semantics). Each field is resolved against the union of mappings, so the unmapped fallback only fires for fields truly absent from every context. Net effect: `prov:wasAttributedTo` count is unchanged, `om:hasOwner` is unchanged, and the redundant `om:owners` triples disappear. 
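As a rough illustration of the merge-then-resolve approach described above, here is a small Python sketch. It is not the Java `processArrayContext` implementation; `merge_contexts`, `resolve_fields`, and the `om:<field>` fallback shape are assumptions made for the example.

```python
def merge_contexts(contexts: list) -> dict:
    # Flatten all context maps into one; later contexts win on key conflicts.
    merged = {}
    for context in contexts:
        if isinstance(context, dict):
            merged.update(context)
    return merged


def resolve_fields(entity: dict, contexts: list) -> list[tuple[str, object]]:
    # Iterate the entity fields exactly once against the merged view, so the
    # unmapped fallback only fires for fields absent from every context.
    merged = merge_contexts(contexts)
    triples = []
    for field, value in entity.items():
        predicate = merged.get(field, f"om:{field}")  # hypothetical fallback
        triples.append((predicate, value))
    return triples
```

With contexts `[{"owners": "om:hasOwner"}, {"name": "rdfs:label"}]`, the `owners` field resolves once to `om:hasOwner` and no extra `om:owners` triple is produced.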
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(rdf): close 2 review findings on coordinator finalization race Two findings from PR #27999 review 4259628860: - `checkAndUpdateJobCompletion` early-returned before lastError check could promote: `refreshAggregatedJob` already marks the job COMPLETED when partitions all finish without `failedRecords`/`failedPartitions`, so `checkAndUpdateJobCompletion`'s subsequent `if (job.isTerminal())` short-circuit silently dropped the lastError signal. Move the partition-lastError check INTO `refreshAggregatedJob` so both code paths produce consistent terminal status — a partition that finished COMPLETED but carries a non-null lastError now correctly promotes the job to COMPLETED_WITH_ERRORS regardless of which finalizer wins the race. - `completePartition` / `failPartition` overwrote CANCELLED state: the unconditional partition row update lost a concurrent Stop's CANCELLED status if a worker finished its batch after the Stop request landed but before noticing it. Add a status-guarded `updateIfProcessing` DAO method (UPDATE ... WHERE id = :id AND status = 'PROCESSING') and have both completion paths use it; if 0 rows update, log and skip the side effects (no server-stat increment, no refreshAggregatedJob call) so the authoritative CANCELLED status stays. Mirrors the pattern SearchIndex's coordinator uses for the same race. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>	2026-05-11 06:14:50 -07:00
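The status-guarded update from the commit above (`UPDATE ... WHERE id = :id AND status = 'PROCESSING'`) can be sketched in Python with the standard-library `sqlite3` module. The real method is a Java DAO call; the column names and the `update_if_processing` helper below are assumptions for illustration only.

```python
import sqlite3


def update_if_processing(conn: sqlite3.Connection, partition_id: str, new_status: str) -> bool:
    # Only rows still in PROCESSING are updated, so a concurrent CANCELLED
    # status written by a Stop request is never overwritten.
    cursor = conn.execute(
        "UPDATE rdf_index_partition SET status = ? "
        "WHERE id = ? AND status = 'PROCESSING'",
        (new_status, partition_id),
    )
    conn.commit()
    # Zero rows updated means another writer won; the caller should log and
    # skip side effects such as stat increments.
    return cursor.rowcount == 1
```

A caller would check the boolean result and skip the aggregation step when it is `False`, leaving the authoritative status in place.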
Harsh Vador	86e1d88386	security: Include branch name in security scan Slack alerts and fail only on high vulnerabilities (#27977) * Add branch context to security scan Slack alerts and upload CSV findings summary * change failing severity from medium to high & address gitar * fix csv formatting * revert flattening changes	2026-05-11 10:41:48 +05:30
Sriharsha Chintalapani	b837ade95a	docs(github): require issue link, design, tests, UI recording in PR template (#27891) Expands `.github/pull_request_template.md` to require a linked issue, a high-level design (for large PRs), a structured Tests section (use cases, unit + coverage %, backend/ingestion integration tests, Playwright, manual steps), and a UI screen recording for any UI change. Adds a `/pr-checklist` skill that walks the template, gathers evidence, and drafts the PR body before opening via `gh pr create`. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 08:05:56 +02:00
mohitdeuex	fefa998b0a	Add MockOidcServer testcontainer for SSO test infrastructure (S0 spike) First step of the SSO-flow testing initiative. Wraps navikt/mock-oauth2-server as a testcontainer wired into the OM Docker network under alias om-mock-idp on port 1080. The same URL — http://om-mock-idp:1080/<issuer> — is used by: - the OM container (via Docker network alias) - the Playwright browser on the host (via /etc/hosts loopback entry) - the iss claim in tokens issued by the mock IdP so token validation, browser redirects, and OIDC discovery all line up against one source of truth — required for the public/id_token flow where the browser receives the token directly and iss is derived from the URL it hit. Setup cost: one /etc/hosts line (127.0.0.1 om-mock-idp), added once per machine. CI workflow does it automatically. MockOidcServer.launch() throws with a clear remediation message if the entry is missing. MockIdpSmokeUIIT validates the network premise end-to-end: starts the container standalone and confirms discovery + JWKS endpoints respond from the host JVM with the expected om-mock-idp issuer URL. Next (S1): SsoProfile sealed interface, ContainerizedServer.launch(profile) overload, and the first Google SSO end-to-end test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 00:49:21 +05:30
mohitdeuex	a0e501ba11	Add openmetadata-java-playwright scenario test module Phase 1 of EPIC #3731: Java-driven E2E scenarios for reindex + UI tests. Reuses TestSuiteBootstrap as a test-jar dependency. Three execution modes: - embedded (in-JVM, fast, backend-only) - containerized (Testcontainers + prod server image, UI capable) - external (connect to a running stack) Includes 3 backend reindex scenarios (full / incremental / orphan cleanup), 1 Playwright UI scenario (search-after-reindex in Explore), and 2 CI workflows (PR path-filter + nightly cron). Satisfies #3767 and #3792. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 22:16:51 +05:30
Ariel Schulz	297c01cea7	Fix (#27660): Re-enable Exasol cli-e2e-tests after fixing issues (#27661) * Re-enable Exasol cli-e2e-tests after fixing issues * Revert accidental changes from branch switch * Adapt exasol.yml for tests * Add get_table_comment setup and re-enable test_vanilla_ingestion * Add type hints to maintain signature * SQLA-E does not include get_all_table_comments and will come later, so ignore for now * Add return type too	2026-05-06 17:11:53 +05:30
Mayur Singal	60a2e6546e	Migrate Databricks from sqlalchemy-databricks to databricks-sqlalchemy (#26896) * Update Databricks Dependency to databricks-sqlalchemy * Update generated TypeScript types * address comments and pyformat * pyformat * fix log filtering * address comments * fix static unit tests * fix rule for static type * pyformat * update baseline * revert basepyright changes --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com>	2026-05-04 18:53:24 +05:30
Sid	ca2d0122db	test(playwright): add nightly SAML session renewal coverage (#27619) * test(playwright): add nightly SAML session renewal spec Covers OM's JWT refresh behavior for SAML sessions end-to-end against the local Keycloak fixture: silent refresh after expiry, concurrent 401s queuing behind a single refresh call, and forced re-login when the server-side SAML HttpSession is gone. Reuses the snapshot/restore mechanism and keycloak-azure-saml provider helper introduced in #27164; shortens samlConfiguration.security.tokenValidity to 10s so the suite observes multiple expiry cycles in <60s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update openmetadata-ui/src/main/resources/ui/playwright/utils/sessionRenewal.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * test(playwright): drop expiry wait from refresh-on-reload SSO specs The reactive 401 refresh path races with the AuthProvider useEffect that wires tokenService.renewToken from authenticatorRef: if the 401 from /users/loggedInUser lands before that effect commits the populated ref, refreshToken() returns null and the user is logged out instead of refreshed. With tokenValidity=10s (< EXPIRY_THRESHOLD_MILLES=60s), the UI's proactive timer in startTokenExpiryTimer fires immediately on every mount, so /auth/refresh is exercised on each reload regardless of expiry state. Assertions on token rotation and session continuity still cover "silent refresh works end-to-end". The SAML-session-gone case still waits for expiry, because it needs to. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(playwright): trigger refresh via SPA nav in SSO renewal specs page.reload() remounts React and re-races the axios interceptor setup in AuthProvider — the useEffect that wires authenticatorRef.renewIdToken onto TokenService has a ref-typed dependency that doesn't reliably re-run, so the first 401 after reload sometimes finds renewToken=null and the interceptor silently logs the user out instead of refreshing. Click the Explore sidebar link instead. The click triggers authenticated API calls while staying inside the already-mounted React tree, so the interceptor always reaches the wired TokenService. Spec now passes 10/10 locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Siddhant <siddhant@MacBook-Pro-621.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-05-04 11:48:45 +05:30
Chirag Madlani	d095413ed1	fix(ci): nightly workflow running stale project getting failed [skip-ci] (#27849)	2026-05-04 10:53:16 +05:30
miriann-uu	7b01731754	GEN-5164: Add cherry pick matrix (#27674)	2026-04-29 10:39:31 +05:30
Teddy	11e5ac95d4	chore: update sqlalchemy to 1.0.0 (#27776)	2026-04-28 11:07:26 -07:00
IceS2	e9c87c6adb	chore(ingestion): drop pylint, expand ruff (#27774 ) * chore(ingestion): drop pylint, expand ruff to Stage 2c Replace pylint with a coherent ruff-only stack (Stage 2c of the modernize roadmap). Pylint is dropped from dev deps and CI workflows; ruff selected ruleset expanded to ~22 families covering style, bug catchers, hygiene, and the pylint port (PLE/PLC/PLW/PLR with the noisy "too-many-X" complexity caps + magic-value disabled). What's selected (with rationale in pyproject.toml): E, W, F, I, N — style + correctness baseline + naming UP — pyupgrade (py>=3.10 modernizations) B, C4, C90, RET, SIM, TRY — bug catchers PIE, ICN, T20, TC, TID, PTH, PERF — hygiene PLE, PLC, PLW, PLR — pylint port (PLR complexity caps ignored) RUF — ruff-native (incl. RUF100 unused-noqa) What's removed: - .pylintrc (root) — duplicate of the ingestion pylint config - [tool.pylint.] block in ingestion/pyproject.toml (~140 lines) - ingestion/plugins/{print_checker,import_checker}.py + tests + README (replaced by built-in T20 + TID251 banned-api respectively) - pylint dep from ingestion/setup.py and openmetadata-airflow-apis/pyproject.toml - `make lint` Makefile target + the pylint invocation in py_format_check - dead pylint TODO comment + ignored test entry in noxfile.py Cwd-stable config: ruff is invoked both from the repo root (pre-commit, CI) and from ingestion/ (`make py_format_check`). The `src`, `extend-exclude`, and per-file-ignores entries are listed twice — once relative to ingestion/ and once with the `ingestion/` prefix — so first-party isort detection and exclusions match in both invocations. Grandfathering: ran `ruff check --add-noqa` once + format-stable iteration. ~12,130 noqa directives across ~1,400 files. Cleanup is deferred to follow-up PRs that drop noqas one rule at a time. 
Documentation sweep: replaced `make lint` references in CLAUDE.md, AGENTS.md, DEVELOPER.md, copilot-instructions, and 6 SKILL files with the apply+verify shape `make py_format && make py_format_check`. `make py_format` is NOT a strict superset of pylint — it only applies auto-fixable violations; `make py_format_check` catches the rest. Basedpyright baseline regenerated: ruff format reflowed multi-line signatures in ~70 files, shifting type-error column positions. The basedpyright baseline matches by (file path, error code, range), so column shifts caused 19 entries to mis-align. Net diff is small (154 lines in/out of the 13MB baseline.json) — purely positional. Verified locally: - make py_format_check → All checks passed - nox --no-venv -s static-checks → 0 errors, 0 warnings, 0 notes chore(ingestion): finish ruff swap — nox lint session + skill docs Three remaining stale-tooling references after Stage 2c: - `ingestion/noxfile.py` `lint` session was still calling `black --check`, `isort --check-only`, `pycln --diff`. Those tools aren't installed anywhere (we dropped them from dev deps). Replace with the ruff equivalents that mirror `make py_format_check`. - `skills/standards/code_style.md`: stack listed as `black + isort + pycln`; line length claimed 88 (black default). Both wrong: stack is ruff, line length is 120. - `skills/connector-building/SKILL.md`: `make py_format` comment said `# black + isort + pycln`. Same swap. * chore(ingestion): keep main's baseline + globally ignore TRY400 Per gitar-bot's review on PR #27774: 1. Main's PR #27728 promoted ~60 `logger.warning()` → `logger.error()` inside `except` blocks. Those changes landed on main with their own baseline updates. Our PR doesn't promote anything — the merge from origin/main brought those `error` calls along with their baseline entries. The bot interpreted the `# noqa: TRY400` we added next to those lines as us silencing the rule case-by-case. 
Cleaner: globally ignore TRY400 in pyproject.toml, with a comment explaining why the codebase's `logger.error(...)` + separate `logger.debug(traceback.format_exc())` pattern is intentional. Strip ~430 per-line `# noqa: TRY400` markers from source. 2. Document that `S101` in `per-file-ignores` is a forward-looking entry — flake8-bandit (`S`) is not yet selected, so the rule is no-op today; the entry stays so when `S` lands later, tests don't immediately error. Reverts the platform pin and Linux Docker–generated baseline. Keep main's baseline intact and let CI surface the exact column-shifted entries; the team will decide whether to fix in-place (revert format on affected files) or add per-line `# pyright: ignore` markers. * chore(ingestion): regen baseline for new connector type debt Main's baseline was stale relative to recently-added connectors (McpConnection, CustomDriveConnection) that lack common attributes like `hostPort`, `database`, `catalog` etc. — all sites that access those attributes via the union-typed `serviceConnection.root.config` fire `reportAttributeAccessIssue` errors that aren't baselined. 71 errors + 58 warnings absorbed. Local macOS regen; pushing to see CI's drift count. Per the basedpyright-baseline-and-ci PR experience, macOS↔Linux column drift on this size of regen has historically been 1-7 residuals.	2026-04-28 07:21:59 +02:00
IceS2	84ed278720	chore(ingestion): enable basedpyright across the codebase via baseline (#27755 ) * chore(ingestion): enable basedpyright across the codebase via baseline Removes the ~25 paths from `[tool.basedpyright] ignore` (which excluded roughly 90% of the codebase from type checking) and grandfathers the existing violations into a baseline file. New violations in any previously-ignored file now fail CI. Changes: - ingestion/pyproject.toml: drop the entire `ignore = [...]` block - ingestion/setup.py: bump `basedpyright~=1.14` to `~=1.39.0` - ingestion/.basedpyright/baseline.json (new, ~13MB): captures the starting violation set (~18.8K errors + ~37.4K warnings) so the migration is behavior-preserving. Regenerate with `cd ingestion && basedpyright -p pyproject.toml --baselinefile .basedpyright/baseline.json --writebaseline`. basedpyright analysis has minor non-determinism (similar to ruff's), so re-running --writebaseline a few times converges the baseline. - ingestion/noxfile.py: pass `--baselinefile .basedpyright/baseline.json` to the basedpyright invocation in the `static-checks` session so CI honors the grandfathering. CI already runs the session via `cd ingestion && nox --no-venv -s static-checks` (py-tests.yml). - ingestion/Makefile: `make static-checks` now delegates to `nox -s static-checks` so local invocations match CI exactly. Also drops the dead Python 3.9 / OM_SKIP_SDK_PY39 branch (we require Python >=3.10 since the previous modernization PR). - .gitignore: add `.serena/` (local language-server cache) * chore(ingestion): add nox to the dev dependency set The static-checks Makefile target and the py-tests CI job both delegate to `nox -s static-checks`, but nox was being installed as a separate side step (`pip install nox` in `install_dev_env`, `uv pip install nox` in the test-environment composite action). Listing it in dev extras means a plain `pip install ingestion[dev]` brings it in. 
* chore(ingestion): pin basedpyright analysis to py3.10; CI runs once Following the basedpyright + multi-Python-version research: - ingestion/pyproject.toml: add `pythonVersion = "3.10"` to [tool.basedpyright] so type-checking always analyzes for the lowest supported Python version. Forward-incompatible code (tomllib usage, PEP 695 generics, etc.) is caught at type-check time regardless of which Python interpreter runs the checker. - .github/workflows/py-tests.yml: gate the "Run Static Checks" step on `matrix.py-version == '3.10'`. With pythonVersion pinned, results are identical across the matrix; running once avoids redundant work and keeps the baseline file deterministic. Unit tests still run on the full 3.10/3.11/3.12 matrix to verify runtime compatibility. - ingestion/.basedpyright/baseline.json: regenerated cleanly with the new pythonVersion config (~18.8K errors / ~37.3K warnings, similar scale to the previous baseline). Aligns with the canonical type-check-on-floor / test-on-matrix pattern used by Pydantic, CPython, and other major Python projects. * chore(ingestion): pin basedpyright pythonPlatform to Linux + regen baseline CI's previous run still surfaced ~9 issues (2 errors + 7 warnings) that weren't in the baseline. Root cause: my local environment differs from CI's in three ways that affect type inference — Python interpreter (3.11 vs 3.10), platform (Darwin vs Linux), and pip-resolved package versions (couchbase, avro, trino, sqlalchemy stubs all differ slightly). This commit closes the platform gap and regenerates the baseline from a fresh CI-equivalent environment: - ingestion/pyproject.toml: add `pythonPlatform = "Linux"` to [tool.basedpyright] so type-checking uses the Linux subset of stdlib / third-party stubs regardless of where the analyzer runs. 
- ingestion/.basedpyright/baseline.json: regenerated against a fresh Python 3.10 venv installed via `uv pip install ingestion[test]` (the same install path CI's setup-openmetadata-test-environment composite action uses). New scale: ~18.7K errors / ~37.5K warnings — same ballpark as the previous baseline, with column positions now matching CI's environment. Local-developer note: when running `make static-checks` from a venv that doesn't mirror CI exactly (e.g. macOS, Python 3.11, different package versions), you may see drift errors. The supported workflow for regenerating the baseline is to mirror CI: python3.10 -m venv /tmp/ci-mirror source /tmp/ci-mirror/bin/activate uv pip install --upgrade pip "setuptools<81" uv pip install --no-build-isolation "cx_Oracle>=8.3.0,<9" uv pip install -e "ingestion[test]" uv pip install "basedpyright~=1.39.0" nox cd ingestion && basedpyright -p pyproject.toml \ --baselinefile .basedpyright/baseline.json --writebaseline * chore(ingestion): drop pythonPlatform pin and regen baseline from CI-mirror The previous attempt added `pythonPlatform = "Linux"` thinking it would make the local-generated baseline match CI. It did the opposite — Linux platform stubs activate additional conditional code paths that weren't analyzed before, so CI saw 101 errors instead of the prior 2 errors. Reverting: - Drop `pythonPlatform = "Linux"` from [tool.basedpyright]. Without it, basedpyright analyzes for the host platform; on CI's ubuntu-latest runner that's Linux automatically, but type-stub coverage stays the same as before (matching the `d9196dff6b` baseline). - Regenerate ingestion/.basedpyright/baseline.json against a fresh Python 3.10 venv installed via `uv pip install ingestion[test]` (mirroring CI's setup-openmetadata-test-environment composite action). ~18.8K errors / 37.7K warnings captured — same scale as the working `d9196dff6b` version. 
Local-developer note: any baseline regeneration done on macOS will drift from CI's Linux env (different transitive package versions, different stubs). The supported local mirror procedure: python3.10 -m venv /tmp/ci-mirror source /tmp/ci-mirror/bin/activate uv pip install --upgrade pip "setuptools<81" uv pip install --no-build-isolation "cx_Oracle>=8.3.0,<9" uv pip install -e "ingestion[test]" uv pip install "basedpyright~=1.39.0" nox cd ingestion && basedpyright -p pyproject.toml \\ --baselinefile .basedpyright/baseline.json --writebaseline * chore(ingestion): regen baseline from full CI install (mac arm64 mirror) Prior CI-mirror only installed [test], skipping [all] and the four --no-deps SA pins (sqlalchemy-redshift/databricks/ibmi, pydoris-custom). That left ~75 connector packages out of the analysis env, so basedpyright couldn't resolve types from databricks.sqlalchemy, GE 0.18 Batch, sklearn BaseEstimator, airflow SQLAlchemy models, pandas/numpy stubs, etc. CI saw 129 errors absent from the baseline. Regenerated against a fresh py3.10 venv that mirrors .github/actions/setup-openmetadata-test-environment exactly: uv pip install ./ingestion[dev] make generate uv pip install "setuptools<81" uv pip install --no-build-isolation "cx_Oracle>=8.3.0,<9" uv pip install --no-deps sqlalchemy-redshift==0.8.14 \ sqlalchemy-databricks==0.2.0 \ sqlalchemy-ibmi==0.9.3 \ pydoris-custom==1.1.0 uv pip install ./ingestion[all] uv pip install ./ingestion[test] uv pip install nox First run: 128 errors, 272 warnings — within 1 error of CI's 129/272. Wrote baseline with 56,100 entries across 1,035 files. Verify run with the new baseline reports 0/0/0. macOS arm64 vs Linux x86_64 wheel resolution may leave a small residual (~3-7 errors per the `d9196dff6b` precedent). Re-run --writebaseline 2-3x if any show up in CI. 
* chore(ingestion): silence avro.py:95 basedpyright residual CI's Linux fastavro stub returns Schema as `str \| List[Any]`, while the macOS arm64 wheel narrows to `str` — the only error not absorbed by the regenerated baseline. Add a targeted pyright: ignore on the parse_avro_schema call instead of broadening behavior. * chore(ingestion): tolerate cross-platform pyright ignore drift CI's `--baselinemode=lock` (default) requires the baseline to match exactly — neither up nor down. Two related issues: 1. The avro.py noqa silenced not just the surfaced error but 10 cascading entries at line 95 (sub-errors propagating from the unresolved `schema` arg type). Baseline went `down by 10` → lock violated → exit 3 even with `0 errors` reported. Regenerate baseline so the 10 stale entries are dropped. 2. The macOS arm64 fastavro stub doesn't surface that error in the first place, so basedpyright treats the noqa as `reportUnnecessaryTypeIgnoreComment` locally — causing the opposite lock mismatch on CI (a warning entry that doesn't exist there). Disable the rule so platform-specific residuals can land without flapping between local and CI. * chore(ingestion): use --baselinemode=discard for cross-platform tolerance CI's implicit default is `lock`, which fails on any baseline change in either direction (errors going up or down) via console.error → exit 3. That cannot accommodate macOS arm64 vs Linux x86_64 stub drift: a baseline regenerated locally always carries some entries that don't fire on CI (and vice versa). `auto` would tolerate the drift but silently overwrites the baseline file — unacceptable in CI, where unreviewed changes never get committed back. `discard` is the right balance: - New errors not in the baseline still fail the run (early-return path in BaselineHandler.write before the lock/discard branch). - Stale baseline entries (errors that no longer fire on the current platform) print an info message and exit 0. - The baseline file is never modified.	
2026-04-27 17:15:44 +02:00
IceS2	1fa0c79d27	chore(github): migrate issue templates to structured forms (#27710 ) * chore(github): migrate issue templates to structured forms - Convert bug_report, feature_request, doc_update to GitHub issue forms (YAML) - Add connector_bug form with free-text Connector field - Drop epic and feature_task templates (stale since 2022, no usage evidence) - Add auto-label workflow that maps the Connector field to a namespaced connector:<name> label, falling back to connector:other on 0 or 2+ matches - Labels are applied exclusively and auto-created with a grey "Connector" description when missing * chore(github): drop redundant pipeline type field from connector_bug form Feature area already covers metadata/lineage/profiler/usage distinction. * fix(github): address PR review feedback - bug_report.yml: add labels: ["bug"] for pattern consistency - label-connector.yml: add contents: read permission (needed by checkout) - label_connector.py: raise on unexpected HTTP status; accept 404 for idempotent GET-label and DELETE-label-from-issue; stop echoing the raw Connector field value into workflow logs	2026-04-24 14:08:20 +02:00
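The auto-label rule from the commit above (exactly one connector match yields `connector:<name>`, zero or two-plus matches fall back to `connector:other`) might look roughly like this in Python. This is not the contents of `label_connector.py`; the token-based matching and the `connector_label` helper are assumptions for illustration.

```python
import re


def connector_label(field_value: str, known_connectors: list[str]) -> str:
    # Tokenize the free-text Connector field (hypothetical matching strategy).
    tokens = set(re.findall(r"[a-z0-9]+", field_value.lower()))
    matches = [name for name in known_connectors if name in tokens]
    # Exactly one match gives a namespaced label; 0 or 2+ matches fall back.
    if len(matches) == 1:
        return f"connector:{matches[0]}"
    return "connector:other"
```

Falling back on ambiguous input keeps the label exclusive: an issue never ends up with two `connector:` labels from a single free-text field.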
Mayur Singal	878421a644	fix: enable subprocess coverage tracking for CLI E2E tests (#27329 ) * fix: enable subprocess coverage tracking for CLI E2E tests CLI E2E tests run connectors via `subprocess.Popen("metadata ingest")` but the subprocess coverage data was silently lost. Two issues: 1. Missing `parallel = true` in coverage config — parent pytest process and child subprocess both wrote to the same `.coverage` file, causing data collision. With parallel mode, each process writes to its own `.coverage.<pid>` file that `coverage combine` can merge. 2. `COVERAGE_PROCESS_START` used a relative path (`ingestion/pyproject.toml`) in sitecustomize.py. Resolved to absolute using `GITHUB_WORKSPACE`. Evidence: Metabase (zero unit tests, only E2E) shows 53.6% on SonarCloud with client.py at 17.2% — inspection of .coverage.metabase confirms only import-time + in-process setup lines are present, with zero method body coverage from the subprocess execution. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove -a (append) flags incompatible with parallel coverage mode `coverage run -a` and `coverage combine -a` conflict with `parallel = true` in the coverage config. In parallel mode each process writes to its own `.coverage.<pid>` file, and `coverage combine` merges them — no append needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * MINOR: Fix snowflake e2e (#26677) * MINOR: Fix snowflake e2e * fix pyformat * improve snowflake test * fix count * mark flaky auto classification test * improve test address comment --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-23 06:57:30 +02:00
Sid	0a98f5bf32	test(playwright): add nightly SSO login spec starting (#27164 ) * test(playwright): add nightly SSO login spec starting with Okta Extends Playwright coverage end-to-end for SSO login flows. Today's SSO coverage (Features/SSOConfiguration.spec.ts) only asserts the config form UI. This adds a new suite that configures OpenMetadata to an external identity provider, drives a real login through the provider's hosted UI, and validates the resulting session against the OM API. Phase 1 ships Okta only (integrator-9351624.okta.com). Additional providers (Auth0, Azure, Cognito, SAML, Google) plug into the same dispatcher by adding a ProviderHelper implementation. ## What's new - playwright/e2e/Auth/SSOLogin.spec.ts — two-test suite tagged @sso 1. Asserts the SSO sign-in button renders on /signin with the correct brand label and that the basic-auth form is not shown. 2. Clicks the button, drives the provider's login widget, follows the OAuth callback, completes first-run self-signup when needed, lands on /my-data, then verifies the JWT by calling GET /api/v1/users/loggedInUser and asserting the returned email matches SSO_USERNAME. - playwright/utils/ssoAuth.ts — provider-agnostic orchestration: applyProviderConfig (PUT /api/v1/system/security/config), restoreBasicAuth, buildAuthContextFromJwt, verifyLoggedInUserMatches. Composes existing getApiContext/getAuthContext/getToken helpers — no token extraction or HTTP plumbing is reimplemented. - playwright/utils/sso-providers/{index,okta}.ts — ProviderHelper interface plus the Okta Identity Engine widget driver. Defaults the dev tenant values from the committed openmetadata.yaml snippet so the spec only needs SSO_USERNAME/SSO_PASSWORD to run locally. - playwright/constant/ssoAuth.ts — env var key constants, PROVIDER_BUTTON_TEXT map, and the BASIC_AUTH_CONFIG payload used for cleanup. 
- playwright.config.ts — new 'sso-auth' project matching playwright/e2e/Auth/*/.spec.ts with its own serial workers, and '/Auth/' added to the chromium project's testIgnore so these tests never run in the default suite. ## How provider switching works beforeAll logs in as admin via basic auth, captures the admin JWT via getToken(page) BEFORE the swap, then PUTs the Okta config. The admin JWT survives the provider swap because OM's internal JWKS stays in publicKeyUrls and the admin user's isAdmin flag is persisted in the DB. afterAll rebuilds an API context from that JWT and restores basic auth, making the spec fully idempotent — the same OM instance can run the suite repeatedly without any manual cleanup. ## Running locally export SSO_PROVIDER_TYPE=okta export SSO_USERNAME='<okta-test-user>' export SSO_PASSWORD='<okta-test-password>' npx playwright test playwright/e2e/Auth/SSOLogin.spec.ts \ --project=sso-auth --workers=1 Verified end-to-end against integrator-9351624.okta.com — both tests pass in ~12s on an already-provisioned user, ~14s on first-run self-signup. Cleanup leaves the server in basic-auth mode. ## Notes for reviewers - The existing .github/workflows/playwright-sso-tests.yml already wires up the CI matrix and secret names; this change intentionally does NOT enable the cron schedule. That lands in a follow-up once one provider is stable for a few nightly runs. - OKTA_SSO_CLIENT_ID / OKTA_SSO_DOMAIN / OKTA_SSO_PRINCIPAL_DOMAIN env vars can override the baked-in dev tenant defaults if a different Okta tenant is used in CI. * ci: add dedicated SSO Login Nightly workflow Adds .github/workflows/playwright-sso-login-nightly.yml, a standalone workflow that runs the new SSOLogin spec nightly at 03:00 UTC instead of piggy-backing on playwright-sso-tests.yml. The existing playwright-sso-tests.yml is left untouched — it still covers the SSO configuration form UI via SSOConfiguration.spec.ts and its matrix/secrets wiring is unchanged. 
The new workflow complements it with a real end-to-end login round-trip: - Schedule: cron '0 3 * * *' - Provider matrix: okta only for Phase 1 (extended as helpers ship) - Invokes playwright/e2e/Auth/SSOLogin.spec.ts under the new sso-auth Playwright project with workers=1 - Wires provider credentials via secrets with the existing {PROVIDER}_SSO_USERNAME / {PROVIDER}_SSO_PASSWORD convention plus optional OKTA_SSO_CLIENT_ID / OKTA_SSO_DOMAIN / OKTA_SSO_PRINCIPAL_DOMAIN overrides - Uses the shared setup-openmetadata-test-environment composite action, PostgreSQL, ingestion disabled — matching the existing SSO tests workflow - Uploads the HTML report as an artifact on every run and cleans up the docker stack in a final always-run step refactor(playwright): simplify ssoAuth helpers - verifyLoggedInUserMatches now asserts directly on the lowercased email field instead of building a candidate array and feeding it a long stringified failure message. The assertion failure already shows expected vs received, so the wrapper string was just noise. - Drop buildAuthContextFromJwt — it was a one-line wrapper around getAuthContext. The spec calls getAuthContext directly now. * refactor(playwright): address SSO suite review feedback - Extract OM_BASE_URL from PLAYWRIGHT_TEST_BASE_URL (with the same http://localhost:8585 default as playwright.config.ts) and export it from constant/ssoAuth.ts. okta.ts and BASIC_AUTH_CONFIG both consume it, so callbackUrl, the OM JWKS entry in publicKeyUrls, and the basic-auth restore payload all match the test target — including CI runs against non-default hosts. - Drop PROVIDER_BUTTON_TEXT. It was exported but never imported; the ProviderHelper.expectedButtonText field is the only source of truth for the SSO sign-in button label and the spec already reads from it. - Restore the OM convention adminPrincipals: ['admin'] in the Okta config (matches conf/openmetadata.yaml's AUTHORIZER_ADMIN_PRINCIPALS default). 
The previous code was granting admin to whichever IdP user ran the suite — verifyLoggedInUserMatches only needs an authenticated session, not admin, so the elevation was unnecessary. This also drops the now-unused requireEnv on SSO_USERNAME inside okta.ts; the spec itself still gates on the env var via test.skip. - Set workers: 1 on the sso-auth Playwright project. fullyParallel: false alone wasn't enough — the global workers: 3 on CI could still fan out across multiple Auth/*/.spec.ts files in the future. The explicit limit enforces full isolation as more provider specs land. * ci: avoid CodeQL "Excessive Secrets Exposure" in SSO Login Nightly Replaces the dynamic secret lookup secrets[format('{0}_SSO_USERNAME', upper(matrix.provider))] with a static reference secrets.OKTA_SSO_USERNAME CodeQL flagged the dynamic indexing because GitHub Actions can only mask & scope secrets that are referenced statically. With a computed key, the runner has no way to know which single secret is needed and conservatively materializes EVERY org and repo secret into the step's environment — even though the test only reads OKTA_SSO_. Static references let GitHub expose only the two credentials this step actually uses. Phase 1's matrix is okta-only so the change is two lines. The added inline comment documents the convention for future providers: add a sibling step gated by `if: matrix.provider == '<provider>'` with that provider's static secret references — do not bring back the secrets[format(...)] pattern. refactor(playwright): capture/restore real security config in SSO suite - Snapshot /system/security/config in beforeAll, restore exact payload in afterAll instead of PUTting a hand-rolled basic-auth baseline (preserves allowedDomains, forceSecureSessionCookie, adminPrincipals, etc.) 
- Strip ldap/saml subtrees from the snapshot: GET returns empty-string placeholders the PUT validator rejects - Require OKTA_SSO_{CLIENT_ID,DOMAIN,PRINCIPAL_DOMAIN} via getRequiredEnv; no more hardcoded tenant defaults - Fail fast in beforeAll if admin JWT capture returns empty string so the server is never left stuck in SSO mode - Shrink Okta provider override to just the fields Okta needs; sibling authorizer fields come from the captured snapshot Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci(sso-login): extract per-provider composite action Restructures the nightly workflow so provider credentials stay statically referenced for CodeQL while making it trivial to add new providers: - New composite action .github/actions/sso-login-run bundles all shared setup + test-run logic; pulls non-secret provider config from the caller's vars context dynamically (${PROVIDER_UPPER}_SSO_) - playwright-sso-login-nightly.yml becomes a thin dispatcher with one real job per provider. Each job declares environment: test so it can resolve its password via a static secrets.<PROVIDER>_SSO_PASSWORD reference (no secrets[format(...)] dynamic lookup, CodeQL clean) - Adding a provider = copy the okta job stanza, swap the secret name, add the provider to the dispatch input choices, register the helper in sso-providers/index.ts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> refactor(playwright): move Okta tenant config to a repo constant The Okta tenant identifiers (clientId, domain, principalDomain) are non-secret OAuth public values — visible on the hosted login page during any sign-in. Keeping them in GitHub environment variables cost setup friction (5 env vars to configure locally, each a potential typo) without any security benefit. Move them back to a committed OKTA_TENANT constant in okta.ts where a reviewer can see exactly which tenant the suite is exercising. 
Net effect: - Local runs only need SSO_PROVIDER_TYPE, SSO_USERNAME, SSO_PASSWORD. - The test environment in GH Actions keeps OKTA_SSO_USERNAME (variable) and OKTA_SSO_PASSWORD (secret); the three tenant variables are no longer consumed. - Composite action drops the jq-based dynamic var extraction; the caller passes sso_username directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci(sso-login): move timeout-minutes from composite step to job level Composite actions don't support timeout-minutes on individual steps — that's a runner job field only. Move the 30-minute test timeout up to the dispatcher job and bump to 45 minutes to cover docker + maven setup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci(sso-login): consolidate dispatcher + composite action into one file Collapse the dispatcher workflow + composite action split into a single ~115-line workflow using a strategy matrix and dynamic vars[format(...)] / secrets[format(...)] credential resolution keyed on the matrix provider name. Trade-off: - CodeQL "Excessive Secrets Exposure" (low severity) will re-flag the dynamic secret lookup. Accepted in exchange for a single source of truth and true zero-workflow-churn multi-provider support. Onboarding a new provider is now: 1. Add its name to the matrix array + dispatch options list. 2. Add <PROVIDER>_SSO_USERNAME (variable) + <PROVIDER>_SSO_PASSWORD (secret) in the test environment. 3. Register the helper in sso-providers/index.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci(sso-login): drop provider-prefix bash step; use case-insensitive lookup GitHub secret and variable names are case-insensitive, so format('{0}_SSO_PASSWORD', matrix.provider) with the lowercase matrix value resolves correctly against the uppercase conventional names like OKTA_SSO_PASSWORD. That removes the need for a separate "Compute provider prefix" step and its cross-step env-context plumbing. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci(sso-login): drop redundant case-insensitivity comment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * ci(sso-login): pin playwright install to 1.57.0 to match package.json The previous 1.51.1 pin was stale vs. the @playwright/test version in package.json. The mismatch caused browser cache path divergence — the install step wrote browsers under 1.51.1's cache and the test run looked for them under 1.57.0's cache and failed with "browsers not installed." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(playwright): address SSO suite review comments [skip ci] - Drive Okta tenant (clientId, domain, principalDomain) from env vars, falling back to the existing nightly tenant values as defaults - Use redirectToHomePage as the final assertion in the SSO login step - Document why the /signup vs /my-data branch is conditional Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * saml * test(playwright): add SAML providers to SSO login nightly Extend the nightly SSO login matrix with Azure AD SAML and a self-contained Keycloak SAML fixture (Azure-profile + Google-profile realms), so the suite exercises the full SAML flow end-to-end without relying on a hosted IdP. - docker/local-sso/keycloak-saml: Keycloak 26.3.3 compose + pre-imported realms bound to OM at localhost:8585, port-overridable via KEYCLOAK_SAML_PORT. - playwright sso-providers: azure-saml helper (hosted tenant, non-secret federation metadata committed) and keycloak-saml factory that fetches the realm's IdP X509 at runtime. - SSO assertion matches OM's actual SAML sign-in label ("Sign in with SAML SSO"), since providerName isn't propagated into the store for the SAML provider branch of getAuthConfig. - Workflow starts/stops the Keycloak stack only for keycloak-* matrix rows and injects the fixture credentials inline. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(playwright): fetch Azure SAML IdP cert at runtime Drop the committed Azure Federated SSO X509 certificate and the AZURE_SAML_IDP_CERTIFICATE env fallback from the azure-saml provider. The cert now comes from Azure's federation metadata XML endpoint at test start, mirroring how the Keycloak provider resolves its realm cert, so the suite stays aligned with Azure's ~3-year cert rotations automatically. - New saml-metadata.ts exporting fetchIdpX509Certificate(descriptorUrl, label), reused by azure-saml and keycloak-saml. - azure-saml.buildConfigPayload is now async and pulls the cert from https://login.microsoftonline.com/<tenantId>/federationmetadata/2007-06/federationmetadata.xml before building the SAML payload. - keycloak-saml drops its inline cert-fetching helpers and delegates to the shared util. - Trim narration comments across the SSO suite to keep only the non-obvious rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * refactor(playwright): drop hosted Azure SAML provider The nightly Keycloak SAML fixture with Azure-profile attribute claims exercises the same OM SAML code path as the hosted Azure AD tenant. The hosted provider added external tenant/cert coupling without unique coverage, so this removes it. Drops the azure-saml helper, its env keys (AZURE_SAML_TENANT_ID / AZURE_SAML_PRINCIPAL_DOMAIN), the dispatcher registration, and the workflow dispatch option. Keycloak Azure/Google realms remain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(playwright): cover SSO session lifecycle end-to-end Extends the SSO login spec beyond "can you log in" to the full session round-trip: reload survives, same-context tabs inherit auth, sidebar logout (with modal confirm) lands on /signin, and post-logout refresh stays signed out. 
Adds a describe-scoped userContext/userPage created in beforeAll so tests 2-6 inherit the IdP-backed session; test 1 keeps its fresh fixture for the unauthenticated assertion. Cleanup closes the user context before restoring the server security config. Verified locally against keycloak-azure-saml and keycloak-google-saml realms: 6 passed each (was 2). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * remove slow from individual spec * remove slow from beforeAll * style(playwright): fix SSOLogin spec prettier issues Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(playwright): tighten SSO sign-in locator and await logout response Address Copilot review comments on PR #27164: - Use button.signin-button to match the pattern in SSOAuthentication.spec.ts. - Await /api/v1/users/logout POST alongside the /signin navigation in the logout test to remove the race against the server response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix * Update openmetadata-ui/src/main/resources/ui/playwright/e2e/Auth/SSOLogin.spec.ts Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix * test(playwright): resolve SSO creds via env vars, drop keycloak-google-saml Route Keycloak credentials through the same `vars[format(...)]` / `secrets[format(...)]` indirection as Okta via an `env_prefix` matrix column, removing the hardcoded fixture literals from the workflow. Password lookup falls back `vars \|\| secrets` so fixture passwords can live as vars while real provider secrets stay in secrets. Also drop the keycloak-google-saml variant — same IdP and realm shape as the Azure variant, so it adds CI cost without meaningful coverage. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * test(playwright): post SSO login nightly results to Slack Adds a per-provider Slack notification step mirroring the pattern used by the postgresql/mysql nightly workflows — reuses the existing `slack-cli.config.json` and `playwright-slack-report` CLI against the `results.json` that the global JSON reporter already emits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(playwright): drop logout response wait in SSO spec OktaAuthenticator.logout clears tokens locally with no backend call, and GenericAuthenticator (SAML) hits `GET /auth/logout` — neither triggers the `POST /api/v1/users/logout` the test was waiting on. The listener never matched, so `Promise.all` hung past the 180s test timeout even though the page had already navigated to /signin. Rely on `waitForURL('**/signin')` + the signin button assertion, which are the actual cross-provider success signals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Siddhant <siddhant@MacBook-Pro-457.local> Co-authored-by: Siddhant <siddhant@MacBook-Pro-529.local> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Siddhant <siddhant@MacBook-Pro-621.local> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-04-17 13:09:54 +05:30
Aniket Katkar	12ce3b614d	Chore(UI): consolidated UI checkstyle fix commands and modify workflow comment (#27402 ) * feat: add consolidated UI checkstyle commands for all and changed files * update prt to pr * test commit to fail ui-checkstyle * update the comment * Revert "test commit to fail ui-checkstyle" This reverts commit `ed056f0629`. * Revert "update prt to pr" This reverts commit `0666fa51a3`. * Worked on comments * pull request target remove * Revert "pull request target remove" This reverts commit `b61e98c16b`. * Worked on comments	2026-04-16 17:18:22 +05:30
Teddy	50c17502cf	MINOR - Enable merge group GH event (#27371 ) * chore: added merge_group for github merge queue * chore: remove unnecessary merge group on team labeler * fix: added gates for merge queue and pull request events	2026-04-15 07:42:08 -07:00
Pere Miquel Brull	1dedc0cf15	Add k8s-operator unit tests to PR CI (#27387 ) * Add k8s-operator unit tests to PR CI pipeline The k8s operator tests only ran during manual release builds. Add a path-filtered job so they run on PRs touching openmetadata-k8s-operator/*, following the same Detect Changes pattern used by the service unit tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Remove -DfailIfNoTests=false — we want to catch missing tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix k8s-operator tests: add surefire includes and remove unnecessary stub Parent POM surefire includes only match org.openmetadata.service., so operator tests under org.openmetadata.operator. were silently skipped. Override with */Test.java in the operator pom.xml. Also remove unused KubernetesClient mock stub from CronOMJobReconcilerTest.setUp — no test reaches the code path that calls context.getClient(), causing UnnecessaryStubbingException. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Rename k8s-operator to k8s_operator in workflow outputs Hyphens in output names are parsed as subtraction in GitHub Actions expressions dot notation, so the job condition would never trigger. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix filesystem paths — underscore rename only applies to output keys The replace_all incorrectly changed directory names from openmetadata-k8s-operator to openmetadata-k8s_operator. Only the GitHub Actions output key needs the underscore; all file paths must use the actual hyphenated directory name. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Drop -am flag from k8s-operator test command openmetadata-service is a provided-scope dependency, so -am tries to compile it including shaded ES/OS jars that aren't available in a clean CI environment. The operator module compiles fine on its own. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix invalid YAML in conf/openmetadata.yaml The CSP policy line has unescaped colons inside the value which the YAML parser interprets as mapping indicators. Use a folded block scalar (>-) so the value is parsed as a plain string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Build k8s-operator deps before running tests The operator depends on openmetadata-service (provided scope) which won't be in the Maven cache on a cold CI runner. Build with -am -DskipTests first, then run operator tests separately — same pattern as docker-k8s-operator.yml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Reintroduce lenient client mock to prevent flaky NPE The reconcile flow is time-dependent — tests using "0 * * * *" can reach context.getClient() near the top of the hour. Stub the full client.resources().inNamespace().resource().create() chain as lenient so early-return tests aren't penalized but happy-path tests won't NPE. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Revert conf/openmetadata.yaml — fix belongs in a separate PR Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:48:18 +02:00
Harsh Vador	f4c939869d	ci(security): add Retire.js workflow to detect bundled JS vulnerabilities (#27315 ) * ci(security): add Retire.js workflow to detect bundled JS vulnerabilities * address gitar * add om existing security scan workflow * address gitar * add slack support & remove PR check * address gitar * change job name * address comment * address comment	2026-04-15 19:12:53 +05:30
Sriharsha Chintalapani	bb0daa180e	RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex (#26902 ) * RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex * Update generated TypeScript types * Address comments from copilot * Update generated TypeScript types * fix test issues * Fix minor UI bugs * Add the missing filters * Fix RDF export API error * Add export functionality * Fix ui-checkstyle * Fix java checkstyle * Fix unit tests * Fix and increase the coverage for KnowledgeGraph.spec.ts * Fix tests * Remove rdf as default in playwright and local docker * fix ui-checkstyle * Address comments * Potential fix for pull request finding 'CodeQL / Artifact poisoning' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * Address copilot comments * Address copilot comments * FIx tests * FIx docker * Update openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/distributed/DistributedRdfIndexCoordinator.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Address copilot review comments: license headers, JSON escaping, type safety, border-color, stop semantics Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/c026e52e-162b-4c9a-9874-43791d4aaac1 Co-authored-by: harshach <38649+harshach@users.noreply.github.com> * Show error toast for unsupported export format in KnowledgeGraph Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/c026e52e-162b-4c9a-9874-43791d4aaac1 Co-authored-by: harshach <38649+harshach@users.noreply.github.com> * Fix docker * Fix docker for playwright * Fix docker for playwright * Fix tests * Fix tests * Fix docker * Fix docker * Fix glossary and pagination spec flakiness * update the missing translations * Fix docker * Fix docker * Fix integration test * Fix fuseki not starting * Fixed the run local docker script * worked on comments 
* Fix flakiness in knowledge graph tests * Fix checkstyle --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com> Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: harshach <38649+harshach@users.noreply.github.com>	2026-04-14 13:24:41 -07:00
Chirag Madlani	4f7be5f014	fix(ci): filter blob pattern causing failure to sonarcloud (#27357 ) * fix(ci): filter blob pattern causing failure to sonarcloud * fix(ci): add missing backslash continuation in sonar-scanner command Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/88d229f2-81dd-4662-8295-a3bb0df03815 Co-authored-by: chirag-madlani <12962843+chirag-madlani@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>	2026-04-14 20:06:21 +05:30
Aniket Katkar	3428dfbd6a	Chore(UI): Fix rbac tests not running on PR checks (#26994 ) * Fix rbac tests not running on PR checks * update the dependency * Update the SearchRBAC dependency	2026-04-14 17:53:59 +05:30
Pere Miquel Brull	f6258819e7	ci: reduce checkout history footprint in PR workflows (#27221 ) * ci: reduce checkout history footprint in PR workflows Optimize actions/checkout usage to avoid downloading the full repo blob history on every PR run. The repo is large, so cloning everything just to run tests wastes minutes of CI time per job. - py-operator-build-test.yml: drop fetch-depth: 0 (no history needed) - openmetadata-service-unit-tests.yml: drop fetch-depth: 0 (Sonar is explicitly skipped via -Dsonar.skip=true); shallow-fetch PR base ref - airflow-apis-tests.yml, py-tests.yml, yarn-coverage.yml: add filter: blob:none to Sonar jobs so commits/trees remain available for blame while blobs are fetched lazily on demand - ui-checkstyle.yml: add filter: blob:none to all jobs that rely on tj-actions/changed-files (needs commit/tree metadata, not blobs) * ci: drop fetch-depth: 0 from jobs that don't walk history Follow-up audit after the initial pass. Four jobs were still declaring fetch-depth: 0 (plus filter: blob:none in two cases) without actually needing any history beyond HEAD. ui-checkstyle.yml - i18n-sync: runs 'yarn i18n' then 'git status --porcelain'. git status compares the working tree to HEAD; no history walk. Default depth 1 is sufficient. - app-docs: same pattern with 'yarn generate:app-docs'. py-sonarcloud-nightly.yml - py-unit-tests: only uploads a coverage artifact, no Sonar invocation. - py-integration-tests: same. - py-combine-coverage: does run SonarSource/sonarqube-scan-action, so it genuinely needs the commit graph — added filter: blob:none for parity with the PR Sonar jobs. * ci: remove unused 'Fetch PR base branch' step from service unit tests Copilot review flagged that the step was using --depth=1 while the main checkout is also shallow, which would break any merge-base operation. On investigation, nothing downstream actually uses the base ref: the only command that runs after the checkout is 'mvn ... 
-Dsonar.skip=true', which has no git dependency. The step was preserved defensively in the previous commit, but it's dead code — cleanest fix is to delete it.	2026-04-13 10:46:17 -07:00
Chirag Madlani	917a36c6a4	Potential fix for code scanning alert no. 1842: Artifact poisoning (#27220 ) * Potential fix for code scanning alert no. 1842: Artifact poisoning Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * Pin Yarn version to 1.22.18 to fix artifact poisoning alert Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/29aebdb5-eef0-4a2a-be01-489deef48d2b Co-authored-by: chirag-madlani <12962843+chirag-madlani@users.noreply.github.com> * Fix artifact poisoning in update-playwright-e2e-docs.yml: replace npm install -g yarn with pinned corepack Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/550fba5a-bb13-45da-a144-b67599c9eaa4 Co-authored-by: chirag-madlani <12962843+chirag-madlani@users.noreply.github.com> * Remove corepack prepare to eliminate artifact poisoning: use only corepack enable (bundled yarn) Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/90f6ed8d-3f2b-4c3d-9a34-cd1f57c4d89c Co-authored-by: chirag-madlani <12962843+chirag-madlani@users.noreply.github.com> --------- Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>	2026-04-10 16:12:28 +05:30
Sriharsha Chintalapani	b2b49db75e	MSAL Token Renewal Fix — Safari Session Loss (#27214 ) * MSAL Token Renewal Fix — Safari Session Loss * MSAL Token Renewal Fix — Safari Session Loss * MSAL Token Renewal Fix — Safari Session Loss * apply lint * MSAL Token Renewal Fix — OIDC fix * wait for token update * fix unit tests * Add SSO playwright tests * Add tests --------- Co-authored-by: Chirag Madlani <12962843+chirag-madlani@users.noreply.github.com>	2026-04-09 17:45:00 -07:00
Mohit Yadav	3ec31e3e68	Make OpenMetadata Service Unit Test Required (#27099 )	2026-04-09 15:58:50 -07:00

761 commits