Commit graph

235 commits

Author SHA1 Message Date
Sid
28ffe76785
fix(alerts): guard isBot() against deleted users to avoid aborting change-event batches (#28304)
* fix(alerts): guard isBot() against deleted users to avoid aborting batches

AlertsRuleEvaluator.isBot() called Entity.getEntityByName with
Include.NON_DELETED without catching EntityNotFoundException. When a
change event's userName referenced a user that had been deleted (common
for short-lived test fixtures torn down by afterAll), the exception
escaped the SpEL filter and was caught in AbstractEventConsumer.execute
as "Error in polling events for alert : Entity not found: user ...".

That outer catch logs the error and lets the finally block advance the
offset by batchSize, silently dropping every other event in the batch.
The ActivityFeedAlert subscription was hit hardest because every event
flows through its isBot() filter rule.

Catch EntityNotFoundException and treat an unresolvable actor as not-a-bot.

* fix(alerts): null-safe getIsBot() + IT + unblock ActivityAPI spec

- AlertsRuleEvaluator.isBot(): use Boolean.TRUE.equals(user.getIsBot())
  so a null isBot field on the resolved user doesn't NPE on auto-unbox.
- Add AlertsRuleEvaluatorResourceIT#test_isBot_returnsFalseWhenActorUserDeleted
  covering the original bug — create a user, evaluate isBot() (false),
  delete the user, evaluate isBot() again and assert it still returns
  false instead of throwing.
- Remove test.describe.fixme from ActivityAPI.spec.ts so the spec runs
  again now that the underlying batch-abort is fixed.

---------

Co-authored-by: Siddhant <siddhant@MacBook-Pro-751.local>
2026-05-22 12:08:22 +05:30
sonika-shah
dbc0d997bb
fix(table,dataModel): persist inline column.extension from POST/PUT-create (#28328)
POST/PUT-create with column.extension inlined in the request body never
wrote a row to entity_extension — the data only landed in the table JSON,
while every reader (and the PATCH original reload) looks in entity_extension
via getColumnExtension(). The override clobbered the inline data with null
on every read, leaving the Custom Properties tab blank and causing any
JsonPatch op walking /columns/N/extension/... to fail with "no mapping for
the name 'extension'" on the reloaded original.

One shared single-column writer for both create and update paths:

- Add EntityRepository.storeColumnExtension(UUID, Column) — the upsert that
  used to live inside EntityUpdater.updateColumnExtension, lifted out so the
  create paths can reach it.
- Add EntityRepository.storeColumnExtensions(UUID, List<Column>) — single
  entity walker that flattens via EntityUtil.getFlattenedEntityField and
  calls the primitive per column.
- Add EntityRepository.getColumnsForExtensionPersistence(T) hook overridden
  in TableRepository and DashboardDataModelRepository to return getColumns().
- Wire into createNewEntityFlush alongside storeExtension.

Update path simplifications:

- updateColumns now uses the shared storeColumnExtension primitive.
- updateColumnExtension (the EntityUpdater-private duplicate writer) is
  deleted.
- Existing-column branch skips the upsert when stored.extension equals
  updated.extension — previously every update pass rewrote identical JSON
  for every column with a non-null extension.
- Added-columns branch now persists extensions via the walker. updateColumns'
  main loop skips columns with no stored match, so inline column.extension
  on a freshly-added column was silently dropped on PUT-update too.

Regression coverage in ColumnCustomPropertiesIT covers both tableColumn
and dashboardDataModelColumn inline-on-create persistence, plus the
PUT-update-adds-column path.
2026-05-22 12:00:23 +05:30
Sriharsha Chintalapani
b62db6224f
feat(tasks): policy-driven authorization with self-approval guard (#28315)
* feat(tasks): policy-driven authorization with self-approval guard

Moves Task resolve/close/reassign authorization from ~150 lines of custom
Java in TaskRepository into the policy engine. Adds ResolveTask, CloseTask,
ReassignTask MetadataOperation values, isTaskFiler/isTaskAssignee/isTaskReviewer
SpEL conditions, and a new TaskAuthorPolicy seed. Closes the self-approval
gap where a filer who was also in the assignees list could approve their
own task (now denied via deny rule). TaskResourceContext.getOwners now
returns target entity owners so isOwner() retains its conventional meaning;
v200 migration backfills the new policy attachment on the DataConsumer role
for upgrades.
2026-05-21 22:20:46 -07:00
Mohit Yadav
81d30a7498
Fix Failing LineageImpactAnlaysisTest (#28340) 2026-05-21 11:10:25 -07:00
Sriharsha Chintalapani
7042153e32
feat(context-center): search indexing + vector body text for memories/pages (#28314)
* feat(context-center): enable search indexing + vector body text for memories/pages

ContextMemory was indexable in the schema layer but supportsSearch was false,
so live indexing and bulk reindex did not include it. Vector embeddings for
ContextMemory and Page fell through to the default description-only body text
extractor, which produced near-empty embeddings since the actual content lives
in title/question/answer (ContextMemory) and displayName/page payload (Page).

Changes:
- Add ES/OS index mapping for context_memory_search_index across en/ru/zh/jp
- Register contextMemory in indexMapping.json with parentAliases=[all]
- ContextMemoryIndex (TaggableIndex) flattens shareConfig into visibility +
  sharedWithIds, normalizes source UUIDs, and populates entity refs with
  display names
- Wire SearchIndexFactory.buildIndex() + flip ContextMemoryRepository
  supportsSearch=true so create/update/delete fire live indexing
- Flip supportsSearchIndex=true in ContextMemoryIT to inherit BaseEntityIT's
  4 search-index tests
- ContextMemoryBodyTextContributor concatenates title/summary/question/answer/
  description for the vector embedding instead of just description
- PageBodyTextContributor adds title (displayName) and, for QuickLink pages,
  the destination URL alongside the markdown description
- Register both contributors via static initializers in their owning
  EntityRepositories, per the VectorBodyTextContributor convention

Tests: 25 new unit tests across ContextMemoryIndexTest (10),
ContextMemoryBodyTextContributorTest (6), PageBodyTextContributorTest (9).
All passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-center): address Copilot review feedback on indexing PR

- PageBodyTextContributor: fall back to page.getName() when displayName is
  null/blank so vectors always have a title (matches the convention in
  SearchIndex.populateCommonFields)
- PageBodyTextContributor: log the exception object (not e.getMessage()) so
  the stack trace is available when debug logging is on
- ContextMemoryIndex: null-guard each principal entry in shareConfig.sharedWith
  before dereferencing, so a malformed payload cannot NPE the indexer

Added 2 tests covering both behaviors; existing tests adjusted for the new
title-fallback default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 12:41:02 +02:00
Mohit Yadav
04e8ed1a1f
Improve Vector Embedding recalculation logic (#28313)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Remove Vector Embedding recalculation logic

* Add test

* Address Review Comments

* Address comment

* ADdress review comments

* Fix VectorEmbeddingIntegartionIT

* More review addressed
2026-05-21 08:26:34 +02:00
Sriharsha Chintalapani
fd8d66c899
perf(glossary): batch related-term hydration via /glossaryTerms/byIds (fix N+1) (#28279)
* perf(glossary): batch related-term hydration via new /glossaryTerms/byIds endpoint

Sentry on release-1-13 flagged an N+1 on the Glossary Term Relations Graph
tab (transaction /glossary/.../relations_graph, p95 ~1.4s) — eight
sequential GET /api/v1/glossaryTerms/{id}?fields=relatedTerms,children,
parent,owners calls at ~180ms each. Two recursive resolution loops in
useOntologyExplorer.ts (loadNextTermPage:658-683 and fetchGraphDataFromDatabase:822-847)
fan out per-Id getGlossaryTermsById calls to hydrate cross-glossary
related terms after the initial paginated load, recursing up to 5 levels
deep. The customer hit a depth-1 cascade that produced ~8+ HTTP round
trips for a single page visit.

Adds GET /v1/glossaryTerms/byIds?ids=u1,u2,...&fields=... that returns a
single hydrated List<GlossaryTerm>, capped at 200 ids per request to
stay well under URL length limits and to isolate a single bad Id to one
batch. Missing/deleted/unauthorized Ids are silently dropped, matching
the old Promise.allSettled semantics so callers don't need to change
their error handling. Both resolution loops now call the batch endpoint
once per BATCH_SIZE (100) chunk instead of fanning out per-Id; depth-1
goes from 8 round trips to 1.

Tests: backend IT covers happy-path, fields hydration, silent-skip of
missing ids, and empty-input semantics. Playwright spec opens the
Relations Graph tab on a term with a cross-glossary relation and
asserts zero per-Id /glossaryTerms/{id}?fields=relatedTerms... requests
fire — failing if anyone re-introduces the resolution N+1. The new
batch endpoint is asserted to be called at least once, evidencing the
new path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(glossary): address review feedback on byIds endpoint and N+1 fix

Reviewers (gitar-bot + Copilot) raised 7 findings across the backend, the
client hook, and the Playwright spec. Addressed all:

Backend (GlossaryTermResource.byIds):
- Catch AuthorizationException alongside EntityNotFoundException so a
  single unauthorized id can't 403 the whole batch — matches the
  documented "silently omit missing/unauthorized" contract.
- Switched the OpenAPI response schema from a single GlossaryTerm to
  @ArraySchema so generated clients see the correct array shape.
- Dropped `required = true` on the `ids` query param: the implementation
  tolerates blank/missing (returns []) and the IT pins that behavior, so
  the spec was lying. Description now states the contract explicitly.

Backend tests:
- Added the two negative tests the PR description claimed but the file
  was missing: malformed UUID -> 400 and >200 ids -> 400.

Frontend (useOntologyExplorer resolution loops):
- On a whole-batch failure, set `aborted = true`, break out of the
  current chunk loop, and clear missingIds before the next pass. Before
  this, the same failing batch was silently retried up to
  MAX_RESOLUTION_DEPTH - 1 more times for no benefit.

Playwright (GlossaryRelationsGraphPerf):
- Authenticate the new page via AdminClass.login(page) before the
  request listener attaches; previously `browser.newPage()` + a
  separate `performAdminLogin(browser)` left the test page unauth'd, so
  the request listener never saw the API calls and the spec hung on
  the wait-for-response timeout.
- Fixed the `relatedTerms` JSON-Patch shape (it stores TermRelation
  objects, not bare EntityReferences). With the old shape the relation
  never landed, the resolution loop never fired, byIds was never
  called, and the spec hit a wait-for-response timeout (the 1.2-minute
  retries observed in CI).
- Replaced the dual `/rdf/glossary/graph` OR `/glossaryTerms/byIds`
  signal with byIds-only: for `scope === 'term'` the rdf graph endpoint
  isn't called even when rdfEnabled, so listening for it just added
  flake. Bumped the timeout to 60s for cold-CI runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(glossary): IT failures on byIds — wrong JSON path + URL header limit

Two integration tests added in the previous commit failed in CI:

1) getGlossaryTermsByIds_honorsFieldsParam_hydratesRelatedTerms
   The assertion read `relatedTerms.get(0).path("id")` but glossary term
   relatedTerms is a List<TermRelation> serialized as
   {relationType, term: {id, ...}} — the id is nested under `term`, not
   at the top of the array element. Fixed to `.path("term").path("id")`.

2) getGlossaryTermsByIds_tooManyIds_returns400
   Sending 201 UUIDs in a query string puts the URL at ~7.5 KB which
   trips Jetty's 8 KB request-header limit; the request was rejected
   with 431 Request Header Fields Too Large before reaching the
   server-side cap check, so the test never saw the documented 400.

   Two-part fix:
     - Lower MAX_BATCH_BY_IDS from 200 to 100. 100 * 37 chars per UUID +
       separators is ~3.7 KB, well below 8 KB. This also matches the
       client's BATCH_SIZE in useOntologyExplorer.ts (so the client can
       now use the whole window without hitting the cap defensively).
     - Test uses 101 ids (still tiny URL) so the cap check actually
       fires and returns the documented 400.
   Updated Javadoc and the client-side BATCH_SIZE comment to reflect
   the new alignment.

The python failure on test_validations_datalake.py reproduces across
3.10/3.11/3.12 in the same parameterized case and is unrelated to the
glossary changes in this PR — pre-existing on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(glossary): address two new review comments on byIds

1) glossaryAPI.ts comment said the batch endpoint supports up to 200 ids
   but the backend cap was lowered to 100 in the previous commit. Updated
   the comment to match (and link the rationale — 100 keeps the URL
   under Jetty's 8 KB header limit).

2) The two 400-response IT tests asserted `contains("400") || contains
   (<substring>)`, which would let a 500 with "invalid" or "too many"
   in the response body silently pass. Tightened both to require BOTH
   the HTTP 400 status AND the expected message substring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(glossary): assert HTTP 400 by exception type, not message substring

The previous tightening required `e.getMessage().contains("400")`, but
the SDK's OpenMetadataHttpClient.handleErrorResponse puts ONLY the
error body in the exception message (no "HTTP 400" prefix in the parsed
path), so the assertion failed on real 400 responses with bodies like
"ids parameter contains an invalid UUID" / "Too many ids: 101 (max 100)".

Use assertThrows(InvalidRequestException.class, ...) instead — the SDK
throws InvalidRequestException ONLY for HTTP 400 (other statuses surface
as ApiException or status-specific subclasses like ForbiddenException),
so the type assertion locks the status code as strongly as a body
substring check would. Substring check stays for the body content.

Removes the no-longer-used `Assertions.fail` static import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(glossary): address four review comments on byIds

Backend (GlossaryTermResource.java):
- Add includeRelations query param for parity with GET /{id} so callers
  can pass per-relation include controls (owners:non-deleted, etc.)
  through the batch path. Forwards to the existing 6-arg getInternal.
- Add a catch (RuntimeException) per-id so unexpected failures
  (validation, downstream 5xx surfaced as WebApplicationException, etc.)
  don't fail the whole batch. EntityNotFoundException /
  AuthorizationException stay on debug; the broader catch logs at warn
  so a real bug isn't silently swallowed.

Frontend (useOntologyExplorer.ts):
- Extract the duplicated related-term resolution loop into a top-level
  `resolveRelatedTerms(terms)` helper. Both call sites
  (loadNextTermPage, fetchGraphDataFromDatabase) now do
  `await resolveRelatedTerms(...)`, ~100 fewer LoC and no risk of the
  two implementations drifting.
- Change the failure semantics from "abort the whole resolution on
  first batch failure" to "remember the failed Ids in a skip set, keep
  going". This restores best-effort hydration (matching the old
  Promise.allSettled behavior on the client) without falling back into
  the gitar-bot-flagged retry-the-same-batch-MAX_DEPTH-times footgun:
  the skip set causes collectMissingRelatedTermIds to never hand the
  same Ids back on subsequent depth passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(glossary): pin cross-glossary relation surfacing in graph + list paths

User reported the glossary graph endpoint returning partial data (focused
term + a sibling, both marked glossaryTermIsolated, edges = []) when the
focused term has cross-glossary related terms. Adds two ITs that
exercise the exact scenarios.

1) GlossaryTermResourceIT#listGlossaryTerms_hydratesCrossGlossaryRelatedTerms
   Hits GET /v1/glossaryTerms?glossary=<A>&fields=relatedTerms with a
   focused term in glossary A pointing at a related term in glossary B.
   Asserts BOTH the single-entity GET and the bulk-list endpoint return
   the cross-glossary relation — pins the bulk hydration path
   (GlossaryTermRepository.setFieldsInBulk -> fetchAndSetRelatedTerms)
   against the single-entity path (setFields -> getRelatedTerms) as
   producing the same shape.

2) RdfGlossaryGraphIT#crossGlossaryRelationSurfacesAsEdgeAndNodeInScopedGraph
   Creates focused (in glossary A) + relatedAcross (in glossary B),
   adds the relation, hits GET /v1/rdf/glossary/graph?glossaryId=<A>,
   asserts the response contains: focused NOT marked
   glossaryTermIsolated, relatedAcross as a secondary node, edge
   between them.

Both tests pass against current main on a clean DB, which means the
user's reported failure mode does not reproduce from a freshly seeded
fixture. Likely causes:
  - Stale RDF data in the user's deployment (Fuseki has the terms but
    is missing the relation triples; SPARQL returns nodes but 0 edges
    and the nodes.isEmpty() fallback in RdfRepository.java:1618
    doesn't fire because nodes ARE present)
  - The deployment may be on a code version that predates the canonical
    relation storage fix from PR #25886
  - User-specific relation type / glossary structure not captured here

The tests now stand as regression guards: any future change that
breaks cross-glossary surfacing on either path will fail CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "test(glossary): pin cross-glossary relation surfacing in graph + list paths"

This reverts commit 460459b00c.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 17:59:19 -07:00
Sriharsha Chintalapani
09af9fc801
Fixes #4003: bulk + async restore for large entity hierarchies (#27997)
* fix(restore): bulk + async restore for large entity hierarchies

EntityRepository.restoreEntity walked descendants synchronously, taking
4+ minutes on a 12k-table database and exceeding typical proxy timeouts.
restoreChildren now groups CONTAINS children by type and dispatches one
bulkRestoreSubtree per type, batching DB writes, version history,
change events, and cache invalidation; the existing ES cascade handles
descendant index updates in one update_by_query.

Adds an async option (?async=true) on the deep-hierarchy restore
endpoints that returns 202 Accepted with a job id and runs the restore
on AsyncService, emitting WebSocket notifications on
restoreEntityChannel. Java SDK adds .restore().async().execute() fluent
builders on Tables/Databases plus restoreServerAsync on
EntityServiceBase; Python SDK mirrors this with
restore_request().with_async().execute() and restore_async() helpers
on BaseEntity, exposing a new AsyncJobResponse type.

Tests: EntityRepositoryRestoreTest verifies the per-type grouping and
bulk dispatch path; RestoreFluentAPITest covers the Java SDK fluent
behavior; RestoreHierarchyIT exercises sync and async restore against a
real DB→schemas→tables tree end-to-end; test_restore_async.py covers
the Python SDK paths.

Fixes #4003
2026-05-20 17:57:40 -07:00
Sriharsha Chintalapani
86e4127e9b
feat(context-center): /v1/contextCenter namespace + Java/Python SDK support (#28237)
* refactor(context-center): move endpoints under /v1/contextCenter namespace

Renames the Context Center API paths to disambiguate from the external Drive
Service connector. Singular /v1/drive/* and plural /v1/drives/* were trivially
easy to confuse; the new prefix makes namespace ownership obvious.

  /v1/drive/files     -> /v1/contextCenter/files
  /v1/drive/folders   -> /v1/contextCenter/folders
  /v1/knowledgeCenter -> /v1/contextCenter/pages

Drive Service endpoints (/v1/drives/*, /v1/services/driveServices) and the
generic /v1/attachments endpoint are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(context-center): nest drive files/folders under /drive sub-namespace

Keeps the Drive concept visible in the URL while staying under the
contextCenter prefix.

  /v1/contextCenter/files   -> /v1/contextCenter/drive/files
  /v1/contextCenter/folders -> /v1/contextCenter/drive/folders

Pages stays at /v1/contextCenter/pages (not a Drive concept).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(context-center): add Java + Python SDK support for files, folders, pages

Wires the Context Center entities into both SDK fluent surfaces so callers do
not have to hit /v1/contextCenter/drive/files, /v1/contextCenter/drive/folders,
or /v1/contextCenter/pages directly.

Java (openmetadata-sdk):
  * ContextFileService + PageService alongside the existing FolderService.
    Standard CRUD comes from EntityServiceBase; entity-specific methods
    expose move (files), vote / follower / hierarchy (pages), and folder
    contents (folders).
  * Folders, ContextFiles, Pages fluent wrappers with Creator and Finder
    builders, plus static convenience methods.
  * OpenMetadataClient.folders() / contextFiles() / pages() accessors.
  * OM.Folder / OM.ContextFile / OM.Page entry points.

Python (ingestion):
  * Folders, ContextFiles, Pages entity facades over BaseEntity for CRUD,
    list, retrieve, and search via the existing ometa client.
  * Top-level metadata.sdk re-exports + lowercase aliases.

Binary download and multipart upload are intentionally not exposed yet — those
endpoints need streaming / multipart support that the SDK HTTP layer does not
currently provide. Page voting / follower / hierarchy operations are Java-only
for the same reason (no underlying ometa methods).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: apply spotless line wrap in DriveFileUploadIT

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix playwright test

* fix(context-center): route entity delete + playwright regex through new path

DeleteWidgetClassBase still returned the old 'knowledgeCenter' URL segment for
KnowledgePage / KnowledgeCenter entity deletes, so the UI was issuing
DELETE /api/v1/knowledgeCenter/{id}?hardDelete=... — which the server no
longer serves, returning 404. ContextCenter playwright spec also had a leftover
escaped regex (/\/api\/v1\/knowledgeCenter\/.../) matching the same legacy URL.

Point both at /api/v1/contextCenter/pages so the Knowledge Center / Context
Center delete flows reach the server again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-center): re-align playwright delete matchers with new path

A prior fix-up commit (844a8736d2) reverted two Playwright waitForResponse
matchers back to /api/v1/knowledgeCenter, but the UI now correctly issues
the request against /api/v1/contextCenter/pages (after the DeleteWidgetClassBase
fix in c34552217f). Restoring the matchers so they line up with the live
server endpoint and the tests can observe the response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(context-center): widen hard-delete poll window to 30s

testHardDeleteRemovesObjectFromMinIO polled for the file to disappear within
10 seconds, but the hard-delete chain (soft-delete -> async background worker
-> search/relationship cleanup -> row drop -> MinIO unlink) regularly exceeds
that window in CI. Awaitility was returning the entity still present at
status 200 inside the deadline.

Bump atMost to 30s with a 200ms poll interval (matching FolderResourceIT's
hardDeleteEntity pattern) so the test reflects the real async budget instead
of a tight, machine-dependent guess.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix playwright tests

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Rohit0301 <rj03012002@gmail.com>
Co-authored-by: Rohit Jain <60229265+Rohit0301@users.noreply.github.com>
2026-05-20 08:28:10 -07:00
Adrià Manero
0d5792e2ed
refactor(data-insights): isolate enricher step failures + per-step EntityStats (#28264)
* fix(data-insights): resolve owner by id in processTeam to handle bare historical refs

* refactor(data-insights): isolate enricher step failures + per-step EntityStats

* test(data-insights): add enricher behavior IT + rename equivalence IT

* refactor(data-insights): extract VersionResolver + SnapshotMaterializer

* refactor(data-insights): extract enricher concerns into step classes + bounded owner cache + rate-limited loss warns

* refactor(data-insights): apply review feedback — tighten enricher docs, drop dead code

* docs(data-insights): drop unsupported 'intentional' claim from TierStep prefix-match docstring

* fix(data-insights): guard VersionResolver against null updatedAt on historical version rows

* refactor(data-insights): tighten VersionResolver walk — consume isFirst at top, drop historyDone flag
2026-05-20 16:07:20 +02:00
Adrià Manero
d744e397da
fix(alerts): strict literal matching across alert filter functions (#27987)
* fix(alerts): make matchAnyEntityFqn match FQNs literally, not as regex

* fix(alerts): strict literal matching across alert filter functions

* fix(policy): AST-based SpEL validation so string-literal content is not parsed as syntax

* fix(data-insight): evaluate chart formulas outside the policy/alert validator

* test(playwright): ObservabilityAlerts must pick filter options by FQN, not bare name

* test(playwright): focus combobox before fill in alert action input loop

* test(playwright): extend inputValueId pattern to alert action inputs

* refactor(alerts): move UUID regex to constants and use size variable
2026-05-20 14:05:36 +02:00
Ram Narayan Balaji
b1781f7324
feat(workflow): add inputPorts, outputPorts, glossaryTerms as WorkflowTriggerFields (#28120)
* feat(workflow): add inputPorts, outputPorts, glossaryTerms as WorkflowTriggerFields

- Add inputPorts, outputPorts, glossaryTerms to workflowTriggerFields.json enum
- Unit tests in FilterEntityImplTest verifying all three fields pass the
  passesFieldBasedFilter check, including include/exclude config
- Integration test scenario appended to test_CustomApprovalWorkflowForNewEntities:
  adds table as inputPort, removes it, adds as outputPort — each port change
  must produce an approval task proving the workflow trigger fires correctly

* Update generated TypeScript types

* fix(workflow): fix inputPorts/outputPorts changes not triggering governance workflows

- Add inputPorts, outputPorts, glossaryTerms to WorkflowTriggerFields enum
- Fix executeBulkPortsOperation to update entity changeDescription so
  FilterEntityImpl reads the correct changed fields
- Add governance-bot impersonation to applyPatchEntityFieldAction to
  prevent entityStatus updates from triggering spurious workflow signals
- Pass caller username through port endpoints to fix self-approval check
  incorrectly removing the reviewer when entity updatedBy was set to the
  reviewer from a prior task resolution

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-20 09:23:34 +05:30
Sriharsha Chintalapani
d388a5264b
Fixes 28267: Preserve test case suite search membership (#28271)
* Fix test case suite search hydration

* Stabilize test case repository unit test
2026-05-19 12:54:01 -07:00
Ram Narayan Balaji
fc74d08cb0
fix(deps): remove explicit jersey version pins to inherit 3.1.11 from BOM (#28106)
jersey-client and jersey-apache-connector were pinned to 3.1.9 in
openmetadata-sdk and openmetadata-integration-tests. The root pom.xml
jersey-bom already manages all org.glassfish.jersey.* artifacts at 3.1.11.
Removing the explicit pins lets both modules inherit 3.1.11 from the BOM,
consistent with the rest of the project.
2026-05-19 17:19:04 +00:00
Pere Miquel Brull
7485c5b421
feat: add ContextMemory entity (Context Center memories) (#28224)
* feat(spec): add ContextMemory + CreateContextMemory JSON schemas

* feat(jdbi3): add ContextMemoryDAO

* feat: register contextMemory entity type constant

* feat(service): add ContextMemory repository, resource, mapper

* feat(bootstrap): add context_memory table DDL

* test(service): ContextMemory resource CRUD test

* fix(context-memory): address review (relationship types, stable FQN, status msg, test name)

- storeRelationships: rootMemory -> Relationship.CONTAINS, parentMemory -> Relationship.HAS
  so the root-ancestor and direct-parent hierarchies are distinguishable.
- setFullyQualifiedName: derive from the immutable name only (drop mutable
  primaryEntity/owner derivation that destabilized nameHash on update).
- validateStatusTransition: separate "no transitions defined" from "disallowed transition".
- Rename ContextMemoryResourceTest -> ContextMemoryStatusTransitionTest (pure unit test).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(context-memory): add ContextMemoryIT + SDK ContextMemoryService

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(spec): register contextMemory in EntityLink.g4 ENTITY_TYPE grammar

EntityLinkGrammarTest.testAllEntityTypesHaveGrammarOrExclusion enumerates every
Entity.java constant and requires each to be in the EntityLink grammar or the
test's exclusion list. ContextMemory is a normal EntityRepository-backed
top-level entity (like learningResource / contextFile), so it belongs in the
ENTITY_TYPE rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(context-memory): override owner ITs for creator-as-owner default

ContextMemoryMapper.defaultOwners() intentionally assigns the creating
user as owner when the create request omits owners. BaseEntityIT's
patch_entityUpdateOwner_200 and patch_entityUpdateOwnerFromNull_200
assert "no owner initially" for any supportsOwners entity, so both
failed for ContextMemory.

Override both in ContextMemoryIT: keep the PATCH-replace-owner contract,
change only the precondition to expect the creator as the sole initial
owner (asserted by count, not a hardcoded principal). Mapper unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update generated TypeScript types

Add the generated ContextMemory TS types (entity/context/contextMemory.ts,
api/context/createContextMemory.ts). The schemas were on the branch but their
generated types were missing, failing the TypeScript Type Generation check on
this fork PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-memory): address review (relationship cleanup, owner scope, validations)

Copilot review on the ContextMemory entity:
- #1 record primaryEntity/relatedEntities/root/parent/source*/machineRepresentation
  in version history; usageCount/lastUsedAt documented as untracked telemetry
- #2 clear stale HAS/RELATED_TO/CONTAINS edges before re-adding in storeRelationships
- #4 default creator as owner only on create; PUT without owners no longer
  silently replaces previously set owners
- #5 schema documents that any status is allowed at creation; transitions
  enforced only on update
- #6 setFullyQualifiedName via FullyQualifiedName.build with skip-if-set guard
- #7 validate shared principal type is user/team/domain
- #8 reject self-reference for parentMemory/rootMemory
- #10 inline Entity.CONTEXT_MEMORY, drop redundant constant

Regenerate ContextMemory TS types for the schema doc change; add IT coverage
for the self-reference and invalid-shared-principal validations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-memory): don't blanket-delete relationships (domain data loss)

The #2 cleanup via deleteTo(memory, CONTEXT_MEMORY, HAS, null) also matched the
framework's domain --HAS--> memory edge (storeDomains runs before
storeRelationships in storeRelationshipsInternal, on every create and update),
silently dropping domain assignments.

storeRelationships is now add-only (addRelationship upserts, so re-running on
update is idempotent). Stale-edge cleanup moved to ContextMemoryUpdater using
the framework's updateFromRelationship(s) helpers, which delete only the
specific changed refs and record the version change. parentMemory now uses
Relationship.PARENT_OF (distinct from primaryEntity's HAS and the framework's
domain HAS) so the parent edge can be maintained without collision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(bootstrap): move context_memory DDL from 2.0.1 to 2.0.0

The context_memory table belongs in the 2.0.0 migration. Relocated the
MySQL and Postgres DDL verbatim; the 2.0.1 schemaChanges.sql files are
restored to their original task_migration_mapping-only content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(bootstrap): add ENGINE=InnoDB to context_memory MySQL DDL

Explicit engine clause, consistent with the task/search-index tables in the
same migration and robust to any server default change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(context-memory): preserve sanitized/validated fields; validate relatedEntities

Review follow-ups:
- ContextMemoryMapper no longer re-sets description/owners/domains/tags/displayName
  after copy(). copy() sanitizes description (stored-XSS) and validates owners and
  domains; re-setting the raw request values bypassed both. Only ContextMemory-
  specific fields are set now.
- prepare() now assigns the result of EntityUtil.populateEntityReferences back onto
  relatedEntities so orphaned/invalid refs are filtered instead of persisted.
- ContextMemoryIT Javadoc now references ContextMemoryRepository#setCreatorAsDefaultOwner
  (the defaultOwners mapper method no longer exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 18:10:46 +02:00
Sriharsha Chintalapani
d6cec7ed1c
Fixes #28229: cascade Table certification PATCH to child search docs (#28236)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Maven Collate Tests / maven-collate-ci (push) Waiting to run
Java Checkstyle / java-checkstyle (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Fixes #28229: cascade Table certification PATCH to child search docs

When a Table's certification is added, changed, or removed via PATCH the
existing cascadeCertificationToChildren path never fired because
SearchRepository.requiresPropagation returned false on a cert-only
ChangeDescription. TableRepository did not list certification in its
propagation descriptors, so the gate stayed closed and the DQ
dashboard's Certification filter kept returning the stale cert on
test_case / test_case_result / test_case_resolution_status / test_suite /
column docs until a full reindex.

Add an EXTERNAL_HANDLER PropagationType for fields whose cascade is
driven by a dedicated SearchRepository handler instead of the generic
descriptor-driven script (cert needs full-object replace on add/update
and explicit removal on delete, which RAW_REPLACE can't express because
it restores the old value on delete). Register certification with this
type on TableRepository so the gate opens and the existing
cascadeCertificationToChildren method runs. Add no-op cases in the
three appendAdd/Update/DeleteScript switches so the new enum value
doesn't accidentally trigger generic auto-propagation.
2026-05-19 08:18:02 -07:00
Sriharsha Chintalapani
0ab31cb647
fix(rdf): reclaim Fuseki disk via compaction + upgrade Jena 4.10 → 5.6.0 (#28242)
* fix(rdf): reclaim Fuseki disk via compaction + upgrade Jena 4.10 → 5.6.0

PR #28117's SPARQL cleanup converged the logical RDF state but never freed
disk: TDB2 deletes only mark blocks free and the journal grows monotonically
until /$/compact runs. RdfIndexApp.clearRdfData() now calls a new
RdfStorageInterface.compactStorage() between clearAll() and reloadOntologies()
so each recreate run reclaims to a fresh dataset directory. JenaFusekiStorage
posts to /$/compact/{dataset}?deleteOld=true and polls /$/tasks/{id} until
finished, with failures logged and swallowed (disk hygiene, not correctness).

Also unifies the Jena classpath: openmetadata-service was on 4.10.0 and
openmetadata-integration-tests on 5.0.0. Both now pin to 5.6.0 via a single
root pom property, dropping the apache-jena-libs BOM in favour of explicit
jena-core/arq/rdfconnection deps (we're a remote-Fuseki client and never
embed TDB; pulling jena-tdb1/2 triggers a Jena 5/6 static-init regression).
Picks up CVE-2025-49656 and CVE-2025-50151 (admin-side fixes shipped in
Jena 5.5.0). Jena 6.x parked: both 6.0.0 and 6.1.0 hit a recursive clinit
bug where TypeMapper.reset reads RDF.dtLangString before RDF.<clinit>
completes.

Fuseki server bumped 4.10/5.0 → 5.6.0 across all in-repo Dockerfiles; the
unmaintained stain/jena-fuseki:* image references in dev compose files
switched to building from docker/rdf-store/Dockerfile, and Testcontainers
moved to secoresearch/fuseki:5.5.0 (maintained, CVE-fixed; the dataset is
created by JenaFusekiStorage.ensureDatasetExists() so the stain-only
FUSEKI_DATASET_1 env var is no longer needed).
2026-05-18 23:08:46 -07:00
Sriharsha Chintalapani
5c53151d16
Fixes #24294: support re-parenting a Container via PATCH (#28201)
* feat(container): support re-parenting a Container via PATCH (#24294)

Allow the PATCH API to update a Container's `parent`, cascading the FQN
change to every descendant container, nested column FQN, tag-usage row,
entity-link, policy condition, and search-index document — same shape as
GlossaryTerm re-parenting. Scoped to same-StorageService moves; cross-service
parents are rejected with HTTP 400. Adds parent-aware fluent SDK methods in
Java (`Containers.under(...)`, `FluentContainer.withParent(...)`/`withoutParent()`)
and Python (`Containers.set_parent`, `Containers.clear_parent`), unit tests
for the validation logic, 11 integration tests covering the cascade and
rejection paths, and 9 + 2 SDK tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(container): reject re-parent PATCH on oversized subtrees (#24294)

The single-transaction cascade in updateParent locks every descendant row
in storage_container_entity, rewrites their JSON, renames their tag_usage
rows, and issues an Elasticsearch updateByQuery — all while holding
row locks that block any concurrent write on the subtree. At ~10k+
descendants this becomes a multi-minute outage on the cluster.

Add an indexed COUNT(*) preflight in ContainerUpdater.updateParent that
short-circuits with HTTP 400 when the moved subtree exceeds
openmetadata.container.maxReparentDescendants (default 10000), pointing
the operator at the system property if they have measured the impact
and accept it. Run BEFORE invalidateCacheForRenameCascade so a rejected
request pays no cache-eviction cost.

Tests: 5 new unit tests in ContainerRepositoryParentValidationTest cover
under/at/over-limit and the system-property override; 2 new IT methods
in ContainerResourceIT exercise the end-to-end reject path with a
test-scoped low threshold and confirm at-limit moves still succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(container): address PR #28201 review — perf, cycle detection, IT race (#24294)

Four fixes from the gitar-bot review:

1. Performance — validateContainerParent short-circuits when the proposed
   parent id matches the original. Skips an unnecessary Entity.getEntity
   round-trip on every container PATCH/PUT that doesn't touch the parent
   field (description edits, tag adds, etc.).

2. Cycle detection — second-line ID-based ancestor-chain walk added in
   ContainerUpdater.validateAncestorChainCycle. Uses
   relationshipDAO.findFrom (direct DB) so a descendant with a briefly
   stale FQN can't bypass the FQN-prefix check. Visited-set bounded.

3. IT race condition — drop System.setProperty in
   patch_containerParent_rejectsOversizedSubtree_400 and
   patch_containerParent_allowsMoveAtConfiguredLimit_200. Add a
   package-accessible test-override field (set/clear via public static
   methods) plus @ResourceLock(MAX_REPARENT_DESCENDANTS_TEST_LOCK) to
   serialize any test that mutates the override, even though the class
   runs methods concurrently.

4. SQL build comment — document why ContainerDAO.updateFqn interpolates
   values via String.format (mirrors EntityDAO pattern, FQN values are
   server-computed, escapeApostrophe handles the only SQL metacharacter
   that can appear in a validated entity name).

Tests: ContainerRepositoryParentValidationTest extended to 12 cases
(adds parent-unchanged short-circuit assertion + override-priority
coverage). Full ContainerResourceIT still 255/255.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(container): address PR #28201 copilot review (#24294)

Two findings from copilot-pull-request-reviewer:

1. setFields perf regression — restored the conditional
   fields.contains(FIELD_PARENT) ? getContainerParent(c) : c.getParent()
   guard. The unconditional load forced an extra entity_relationship
   lookup on every container GET, which is a measurable regression on
   the hot path. The PATCH flow still loads parent because
   CONTAINER_PATCH_FIELDS includes parent (so fields.contains is true
   there). Full ContainerResourceIT still 255/255.

2. updateEntityLinks shallow walk — previously only iterated direct
   children, leaving deep descendants' (grandchildren+) legacy feed
   thread entityLinks pointing at the old FQN and breaking activity-
   feed navigation after a multi-level move. Now takes the
   renamedContainers snapshot captured by invalidateCacheForRenameCascade
   and rewrites each descendant's entityLink by swapping the FQN
   prefix (oldFqn → newFqn) — consistent with the same prefix-substitution
   ContainerDAO.updateFqn applies to the JSON/fqnHash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 07:01:40 -07:00
Sriharsha Chintalapani
b69a4026fd
feat(context-center): sortBy on lists, drive-file move, IT coverage (#28096)
* feat(context-center): add sortBy, drive-file move, and IT coverage

- Add `sortBy`/`sortOrder`/`offset` to the article list (`/v1/knowledgeCenter`)
  and drive-file list (`/v1/drive/files`). When `sortBy` is set, the request is
  routed through `listInternalFromSearch` against OpenSearch; cursor pagination
  (`before`/`after`) is rejected with HTTP 400 to keep the contract explicit.
  Supported values: `name`, `createdAt` (aliased to `updatedAt` for v1),
  `updatedAt`. Default `sortOrder` is `desc` when `sortBy` is set.
- Add `PUT /v1/drive/files/{id}/move` for moving a drive file between folders
  or to the drive root. Sync handler (ContextFile is a leaf, no FQN cascade).
  Body: new `MoveContextFileRequest` schema with an optional `folder`
  EntityReference.
- Fix `ContextFileUpdater.entitySpecificUpdate` so folder changes are recorded
  in the change description and the `CONTAINS` relationship edge is rewired
  to the new folder. Without this, the move would silently update JSON only.
- Integration tests in `ContextFileIT` (move between folders, move to root,
  non-existent folder, unprivileged-user 403, sort by updatedAt/name, cursor
  combo rejection) and `KnowledgeCenterIT` (sort by updatedAt, name, createdAt
  alias, cursor combo rejection).
2026-05-18 06:28:53 -07:00
Sriharsha Chintalapani
d3430d6abb
fix(logging): non-blocking streamable log handler + split flush executors (#28198)
* fix(logging): non-blocking streamable log handler + split flush executors

Client side (ingestion):
- Rewrite StreamableLogHandler as fire-and-forget: producer never blocks,
  bounded buffer drops on overflow, daemon drainer + single daemon sender
  thread, thread-local recursion guard, isolated REST client (own
  Session/connection pool), no-wait close() that orders /close behind
  in-flight batches via a sentinel marker.
- Add quiet client path: REST.post_best_effort and OMetaLogsMixin's
  send_logs_batch_best_effort / send_close_best_effort. Bypasses the
  retry/sleep loop, the connection-error second-try, and all
  ometa_logger emissions (which would otherwise feed back into the
  streamable handler attached to the metadata logger).
- Backwards-compatible timeout=/retries= kwargs threaded through
  REST._request / REST.post for callers that want per-call overrides
  without going through the best-effort path.

Server side:
- Split the single s3-log-cleanup ScheduledExecutorService into two:
  s3-log-partial-flush (the path that gets logs onto S3) and
  s3-log-abandoned-cleanup. A stuck cleanup task can no longer starve
  partial.txt writes.
- Wrap scheduled tasks with safeScheduledTask(...) so a Throwable from
  one tick can never silently de-schedule the executor.
- StreamableLogsMetrics: add lastPartialFlushTimestamp,
  partialFlushHeartbeat, abandonedCleanupHeartbeat, pendingStreamsCount,
  pendingFlushBytes, pendingFlushLines, flushFailuresCounter. Wired
  into writePartialLogs / writePartialLogsForStreamLocked /
  cleanupAbandonedStreams / recordFlushFailure.

Tests: 35 Python unit tests (overflow non-blocking, recursion guard,
saturated drop, daemon threads, isolated session, no-wait close,
close ordering, PR #28160 regression) and 46 S3LogStorageTest unit
tests (executor split, safeScheduledTask Throwable guard, per-stream
failure does not poison the loop, metric updates) — all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(logging): address PR #28198 review — auth race, close ordering, visibility

Three review findings:

1. Race: _build_request_headers no longer refreshes the auth token. The
   streamable handler shares ClientConfig with the main metadata client,
   so two threads could concurrently call _auth_token() and mutate
   config.access_token / config.expires_in. The sender thread now reads
   the current access_token only; refresh remains the exclusive
   responsibility of REST._request() on the main thread. Avoids
   double-refresh against rate-limited / single-use refresh tokens and
   the TOCTOU on expires_in.

2. close() race: _stop.set() is now called AFTER _drain_remaining_buffer
   and the CLOSE_MARKER enqueue. Previously the sender's get(timeout=1s)
   could return Empty in the narrow window between set() and enqueue,
   hit the if-stop-return branch, and exit without ever processing the
   marker — silently dropping the /close notification under light load.

3. Visibility: safeScheduledTask is now private. It's an internal
   scheduler-boundary wrapper, not part of the class API.

Regression tests:
- test_build_request_headers_does_not_refresh_token
- TestCloseOrderingRegression.test_close_enqueues_marker_before_setting_stop
- TestCloseOrderingRegression.test_close_marker_arrives_even_when_sender_was_idle

All 60 Python tests and 46 S3LogStorageTest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(logging): trim verbose code comments + PR-number references

Strip the design-doc-style docstrings and inline narration that crept
into the previous commits. Keep only comments where the WHY is not
obvious from the code: lock-ordering rationale in close(), the auth
non-refresh rationale in _build_request_headers, the executor-split
rationale in S3LogStorage. Everything else (what the function does,
what counter increments, "regression for PR #N", numbered design
contracts) is removed — that context belongs in PR history, not source.

No behavior change. 44 Python tests + 46 Java tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(logging): add type annotations on new kwargs flagged by basedpyright

The static-check warning shown on the PR view (reportMissingParameterType
on `timeout=None` and `client=None`) for the new fire-and-forget log path
kwargs. Adding the explicit Optional[Union[float, tuple[float, float]]] /
Optional[int] / Optional[REST] annotations to client.post,
client.post_best_effort, client._request, OMetaLogsMixin.send_logs_to_s3,
send_logs_batch, send_logs_batch_best_effort, send_close_best_effort.

logs_mixin.py is now 0 errors / 0 warnings on basedpyright.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(client): type rest of REST.post/post_best_effort + harden self._retry guard

basedpyright was still flagging on the PR view:
- path / data / json / headers untyped on post, post_best_effort,
  _request, _build_request_headers (reportMissingParameterType)
- self._retry > 0 with self._retry: Optional[int] (reportOptionalOperand)
- headers=headers passing Optional to non-Optional _request param
  (reportArgumentType)

All three classes of warning are fixed. baseline.json reduction: 20.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: restore baseline.json — should not have been touched in this PR

The previous commit auto-pruned 160 lines from .basedpyright/baseline.json
as a side effect of running basedpyright locally. Baseline regeneration
belongs in its own PR; mixing it with a feature PR risks masking other
in-flight PRs' regressions.

The source-code type fixes in d5551580ac stay; only the baseline change
is reverted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(logging): address remaining PR review on drain/close race + counter split

Two unresolved review findings on streamable_logger.py:

1. Drainer no longer drops a batch it already collected just because
   close() flipped self._closed. The check was racing
   _drain_remaining_buffer: drainer grabs items from _buffer, sees
   _closed, drops them; meanwhile _drain_remaining_buffer misses
   those same items. Removed the check — once _stop is set the loop
   exits on the next iteration, and any in-flight batch reaches
   _send_queue normally.

2. dropped_overflow used to conflate three operationally distinct
   failure modes:
   - buffer full at put_nowait  → still dropped_overflow
   - emit() after close()        → now dropped_after_close
   - formatter raised            → now dropped_format_error
   An operator seeing high dropped_overflow can now distinguish
   'buffer too small' from 'logs arriving after shutdown' from
   'real bug in the formatter' without further investigation.

New test test_emit_after_close_increments_dedicated_counter pins
the split; existing format-failure test updated to assert on the
new dropped_format_error counter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(client): SourceApiClient.post — propagate new timeout/retries kwargs

CI static-check failure on this PR:

  source_api_client.py:122 - error: Method "post" overrides class "REST"
  in an incompatible manner

REST.post gained timeout=/retries= in this PR; the SourceApiClient
subclass that overrides post() for per-call latency tracking still had
the old signature, so basedpyright flagged the override as incompatible
in strict mode. Propagating the new kwargs and matching the parent's
type annotations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 12:07:48 +02:00
Adrià Manero
4350922707
fix(ingestion-pipeline): inherit owners from service and authorize trigger (#28109)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* fix(ingestion-pipeline): inherit owners from service and authorize trigger (#27962)

* fix(policy): grant Trigger to default bot + steward policies for /trigger authz

* fix(test): use java.time.Instant instead of Joda-Time per code review

* fix(policy): use a dedicated TriggerRule on DataStewardPolicy instead of mixing into EditRule

* fix(migration): drop generic exception catch + inline rule description per code review

* fix(policy): add trailing newline to DataStewardPolicy.json per code review
2026-05-18 08:35:10 +02:00
sonika-shah
2ca2677bdd
fix(users): Online Users "All time" filter no longer returns empty (#28145)
The /v1/users/online endpoint computed `now - timeWindow*60*1000` for the
SQL threshold. When the UI sent `timeWindow=0` for the "All time" option,
the threshold collapsed to `now`, producing `lastActivityTime > now` —
which matched zero rows. Skip the time predicate entirely when
`timeWindow <= 0` so the filter falls through to "all non-bot users".

Reported via #27993.
2026-05-18 10:32:56 +05:30
sonika-shah
a160fcb145
fix(entity): null-safe updateColumns for entities without columns (#28047)
* fix(entity): null-safe updateColumns for entities without columns

PATCH on a File without columns (PDF/image/etc.) NPEs in
ColumnEntityUpdater.updateColumns because FileRepository.entitySpecificUpdate
unconditionally invokes the columns updater. recordListChange already
null-coalesces internally, but the subsequent `for (Column updated :
updatedColumns)` iteration does not — any updater calling updateColumns
with a null list hits the same path.

Null-coalesce origColumns and updatedColumns at the top of
ColumnEntityUpdater.updateColumns so any optional-columns entity is safe.

* test(entity): generic PATCH-add-tag and PATCH-add-glossary-term coverage

Add two generic tests to BaseEntityIT that any entity inheriting from it
must pass:

- patch_addClassificationTag_200_OK
- patch_addGlossaryTerm_200_OK

Both create a minimal entity, set a single tag/glossary-term label,
PATCH, and assert the label is present. Gated on supportsTags &&
supportsPatch, so an entity that opts out keeps doing so.

This gives every new EntityRepository subclass (e.g. the next entity
type someone adds) automatic defense against the class of bug where
the tag/glossary PATCH path NPEs on an unrelated optional field
(updateColumns with null columns being the original instance).

* test(it): migrate drive entity ITs to BaseEntityIT, add FolderResourceIT

Migrate FileResourceIT, DirectoryResourceIT, and SpreadsheetResourceIT
onto BaseEntityIT<T, K> so they automatically inherit the ~60 generic
entity tests (CRUD, owners, tags PATCH, glossary PATCH, soft-delete,
versions, custom extensions, etc.). Add a brand-new FolderResourceIT
covering the previously untested folder entity type.

The previous standalone harnesses were a migration gap from the bulk
"Faster tests" PR (#24948) — only WorksheetResourceIT got the full
BaseEntityIT plumbing; the rest of the drive family did not. This meant
bugs like the updateColumns NPE fixed in this PR did not surface in
the IT suite for File and friends.

Each entity sets feature flags conservatively:
- supportsFollowers/Domains/DataProducts/CustomExtension/BulkAPI/
  DataContract = false (matches the existing minimal surface area)
- Folder additionally sets supportsVersionHistory/GetByVersion = false
  since the FolderResource doesn't expose /versions

Entity-specific tests (column handling, directory hierarchy, root
filter, FQN structure, etc.) are preserved. Existing redundant smoke
tests (createMinimal, deleteById, getByName, ID/name not-found) are
removed since BaseEntityIT covers them.

Also add openmetadata-sdk/.../drives/FolderService.java so Folder has
SDK access (basePath /v1/drive/folders).

* fix(it): make FolderResourceIT pass inherited BaseEntityIT tests

Three real issues surfaced by the inherited generic tests after the
drive-IT migration:

1. FolderResource.list was missing @Min(0) / @Max(1000000) on the limit
   query param, so negative or excessive values were silently accepted.
   Add validation to match DirectoryResource/WorksheetResource. This
   fixes BaseEntityIT.get_entityListWithInvalidLimit_4xx.

2. FolderResource hard-delete is asynchronous — it kicks off
   deleteByIdAsync and returns 200 before the row is gone. The generic
   delete_entityAsAdmin_hardDelete_200 test asserts the entity is no
   longer fetchable immediately. Override hardDeleteEntity in
   FolderResourceIT to poll on include=deleted until the async delete
   completes.

3. BaseEntityIT.get_deletedEntityVersion_200 calls getVersion(...) but
   only gated on supportsSoftDelete/supportsPatch, missing the
   supportsGetByVersion gate. For entities like Folder that don't expose
   /versions, the test threw UnsupportedOperationException from the
   subclass override. Add the missing gate.

* chore: address gitar-bot review on PR #28047

- SpreadsheetResourceIT.test_listSpreadsheetsWithRootParameter was
  vacuously passing if the list came back empty (the for-loop body
  silently runs zero times). Add an explicit
  assertFalse(getData().isEmpty()) so a regression in the ?root=true
  filter actually fails the test.

- FolderResource.list parameter description said "1 to 1000000" but
  the annotation is @Min(0) — 0 is a valid limit (returns empty page).
  Align the description with the annotation.

* test(it): restore entity-specific drive tests dropped during BaseEntityIT migration

The previous migration commit was too aggressive in deleting tests it
assumed BaseEntityIT covers. Restore the entity-specific tests:

SpreadsheetResourceIT (8 restored):
- test_createSpreadsheetWithOptionalFields (displayName + description)
- test_updateSpreadsheet (Spreadsheet-specific path/size fields)
- test_spreadsheetWithWorksheets (@Disabled — worksheet relationship)
- test_listSpreadsheetsByService (?service filter)
- test_spreadsheetFQNPatterns (nested directory FQN construction)
- test_spreadsheetsWithAndWithoutDirectory (directory-presence variations)
- test_listSpreadsheetsWithRootParameterAndPagination (root + pagination)
- test_listSpreadsheetsWithRootParameterEmptyResult
- test_listSpreadsheetsWithRootParameterAcrossMultipleServices (@Disabled)

FileResourceIT (2 restored):
- test_createFileWithDisplayName
- test_fileWithAllOptionalFields

DirectoryResourceIT (1 restored):
- test_createDirectoryWithAllFields

BaseEntityIT still covers the genuinely redundant tests that stayed
deleted (CRUD smoke tests, get-by-id, get-by-name with fields, delete,
non-existent ID/FQN, fluent-SDK find variants).

* chore: address remaining gitar-bot + copilot review threads

FolderResource:
- Grammar: "Limit the number folders" -> "Limit the number of folders"

FolderResourceIT:
- Javadoc said Folder "supports ... followers" but supportsFollowers=false.
  Align doc with actual capability flags.
- Awaitility hard-delete poll was catching `Exception` and treating any
  failure as success (would mask transient 500s). Narrow to ApiException
  with statusCode==404; re-throw everything else so Awaitility surfaces
  real errors.

FileResourceIT:
- Pass String IDs to service.update for consistency with the rest of the
  IT suite (createdFile.getId().toString()).
- Compare service references via getFullyQualifiedName() instead of
  getName(); the latter only matches for top-level services and would
  silently break if the reference schema changes.

SpreadsheetResourceIT:
- Import @Disabled and use the short annotation form instead of
  @org.junit.jupiter.api.Disabled (project standards prohibit FQNs in
  annotations).
- Strengthen test_listSpreadsheetsWithRootParameter: assert the root
  spreadsheet we created appears in ?root=true results AND the child
  spreadsheet does NOT, so a broken root filter actually fails the test
  instead of passing on an empty list.

* test(it): restore all remaining originally-deleted drive tests

Per reviewer request, restore the rest of the tests that the migration
commit removed under the (incorrect) assumption they were fully
redundant with BaseEntityIT.

FileResourceIT (+8): test_createFileMinimal, test_createFileWithDescription,
  test_deleteFile, test_findFileById, test_findFileByNameWithFields,
  test_getFileByNameWithFields, test_getFileWithNonExistentId_shouldFail,
  test_getFileByNameWithNonExistentFQN_shouldFail.

DirectoryResourceIT (+10): test_createDirectoryMinimalRequest, test_getByName,
  test_getByNameWithFields, test_deleteDirectory, test_findDirectoryById,
  test_findDirectoryByName, test_findDirectoryWithFields,
  test_createMultipleDirectories, test_getNonExistentDirectory_fails,
  test_getByNameNonExistent_fails.

SpreadsheetResourceIT (+10): test_createSpreadsheet, test_createSpreadsheetMinimal,
  test_getSpreadsheetById, test_getSpreadsheetByName, test_deleteSpreadsheet,
  test_finderWithFields, test_finderByNameWithFields, test_getByNameWithFields,
  test_createMultipleSpreadsheetsUnderSameService, test_patchSpreadsheetAttributes.

BaseEntityIT generic coverage stays intact — the subclass tests and the
inherited tests now coexist (deliberate overlap, opted in by reviewer).

Counts vs ab535900da (original): File 14→18 (+4 column tests added),
Directory 15→15, Spreadsheet 27→27.

* test(it): restore exact original test bodies (assertNotNull lines included)

Prior commits restored test methods by name but trimmed bodies — e.g.,
dropped `assertNotNull(created)`, `assertNotNull(driveService)`, and other
redundant-but-original assertions. Reviewer asked for the tests "as is",
so this commit replays each restored method body byte-for-byte from
ab535900da (the introducing commit, PR #24948).

Preserved on top of the verbatim bodies:
- FileResourceIT: getFullyQualifiedName() comparison instead of getName()
  in test_createAndGetFile / test_fileWithAllOptionalFields (gitar-bot
  review fix).
- SpreadsheetResourceIT: @Disabled (short form) instead of
  @org.junit.jupiter.api.Disabled (gitar-bot review fix). The skip state
  and reasons on test_spreadsheetWithWorksheets and
  test_listSpreadsheetsWithRootParameterAcrossMultipleServices match
  the original — they were disabled in PR #24948 due to backend gaps,
  not by this PR.

* fix(entity): secondary NPE + logic bug in ColumnEntityUpdater.updateColumns

Copilot review caught two pre-existing bugs in the same method this PR
already touches:

1. NPE on added column tags. `added.getTags().stream()` NPEs when a
   column has no tags (Column schema defaults tags to null). Wrap with
   listOrEmpty(added.getTags()) so the stream is safe.

2. Inverted carry-forward condition. The original
       if (nullOrEmpty(addedColumn.getTags()) && nullOrEmpty(deleted.getTags()))
           addedColumn.setTags(deleted.getTags());
   only copied when BOTH sides were empty — a no-op. The intent is "if
   the added column has no tags but the deleted column did, preserve
   them". Flip to !nullOrEmpty(deleted.getTags()).

Both fit the scope of this PR (null-safety in updateColumns) so
including them rather than spinning a separate PR.

* revert: keep ColumnEntityUpdater tag-carry-forward condition untouched

Earlier commit (4a9a639) flipped
  nullOrEmpty(addedColumn.getTags()) && nullOrEmpty(deleted.getTags())
to
  nullOrEmpty(addedColumn.getTags()) && !nullOrEmpty(deleted.getTags())
based on a reviewer-bot suggestion. That logic change is out of scope
for this PR (the PR is null-safety for updateColumns, not semantics of
the carry-forward branch).

The line dates back to the original ColumnEntityUpdater implementation
and changing its behaviour silently — without a dedicated test or
release-note — could affect any flow where a column is "redefined"
(same name, different dataType/ordinal) on PUT or PATCH. Reverting to
the pre-existing form. If the carry-forward really is broken, that
deserves its own PR with regression coverage.

The NPE fix (listOrEmpty(added.getTags()) on the .stream() call)
remains.
2026-05-18 03:27:45 +00:00
sonika-shah
9498cc9a11
fix(bulk-assets): enforce dryRun on tag/glossary/dataProduct/team remove (#28005)
* fix(tags): enforce dryRun on PUT /v1/tags/{id}/assets/remove (#27954)

The bulk remove endpoint hardcoded dryRun=false and ignored the flag from
the request, so callers passing dryRun=true still had the tag removed from
the target asset. Mirror the dryRun guard already present in the bulk add
flow: read the flag, propagate it onto the BulkOperationResult, and short
circuit before any deletes when dryRun is true.

Adds TagBulkAssetsDryRunIT covering dryRun=true, default (omitted) dryRun,
and dryRun=false against PUT /v1/tags/{id}/assets/remove.

* fix(bulk-assets): enforce dryRun across glossary/dataProduct/team remove

The same dryRun bug fixed for tags in the previous commit also exists in
the other bulk-asset endpoints. Each site hardcoded withDryRun(false) on
the result and went straight to the destructive path:

- GlossaryTermRepository.bulkRemoveGlossaryToAssets — mirrors the tag-add
  guard: read dryRun, propagate it onto the result, short-circuit before
  any deletes.
- EntityRepository.bulkAssetsOperation (base; used by Team) — propagate
  dryRun, skip relationship writes and the change-event emit when true.
- DataProductRepository.bulkAssetsOperation (override) — same pattern.
  Validation still runs for adds so dryRun returns the same errors a real
  call would, but no relationship writes / cache invalidations / change
  events occur.

Adds BulkAssetsRemoveDryRunIT with dryRun=true and dryRun=false coverage
for the three synchronous remove endpoints. The existing
DomainBulkAssetsDryRunIT continues to cover the domain override and
TagBulkAssetsDryRunIT covers the async tag endpoint.

* test(bulk-assets): move dryRun remove tests into existing IT files

Address review feedback:

- Drop the standalone TagBulkAssetsDryRunIT / BulkAssetsRemoveDryRunIT
  and fold the dryRun=true / dryRun=false coverage into the per-entity
  IT files that already own each endpoint:
    * TagResourceIT — tag bulk remove (async) with Awaitility during(...)
      window
    * GlossaryTermResourceIT — glossary term bulk remove (sync)
    * DataProductResourceIT — data product bulk remove (sync), domains
      properly aligned so add validation passes
    * TeamResourceIT — team bulk remove (sync), reusing the existing
      sanitized createTestUser helper to avoid the email-too-long /
      invalid-chars failures the standalone file hit
- Rename the misleading addTagToAssetsRequest local in
  TagRepository.bulkRemoveAndValidateTagsToAssets to assetsRequest
  (copilot review)
- Make the dryRun branch in TagRepository / GlossaryTermRepository
  iterate per asset and return one BulkResponse per request entry,
  matching the DataProduct/Team paths so callers get a real preview
  with rowsProcessed / rowsPassed counts (gitar-bot review)

* test(bulk-assets): add dryRun-omitted regression cases for bulk remove

Address Copilot review on PR #28005. The two bulk-asset request schemas
have asymmetric dryRun defaults:

- addTagToAssetsRequest.json / addGlossaryToAssetsRequest.json: default true
  (omitted -> preview, no mutation)
- bulkAssets.json (data product / team / domain): default false
  (omitted -> destructive)

Add a regression test per endpoint that posts a raw JSON body without the
dryRun field, pinning the full chain (schema default -> jsonschema2pojo
field initializer -> Jackson deserialization -> server-side
Boolean.TRUE.equals check). Guards against any of those layers silently
flipping the per-endpoint default.

* fix(bulk-assets): tag-add dryRun preview + add-side IT coverage

TagRepository.bulkAddAndValidateTagsToAssets short-circuited on dryRun=true
with a single "Nothing to Validate." message and zero row counts — the same
degraded-preview anti-pattern gitar-bot flagged on the remove side and that
was fixed in 59e6572. Mirror that fix: only short-circuit on empty assets;
on dryRun=true iterate per asset and emit one BulkResponse with proper
processed/passed counts. No mutation either way.

Add add+dryRun regression ITs for all four endpoints:

- TagResourceIT — async, Awaitility during(20s) asserting the tag is never
  applied to the asset
- GlossaryTermResourceIT — sync, asserts counts + tag absence on the table
- DataProductResourceIT — sync, asserts counts + table not attached to the
  data product
- TeamResourceIT — sync, asserts counts + user not joined to the team

DataProduct/Team add share bulkAssetsOperation with the remove path, so
these tests are belt-and-suspenders coverage rather than catching a new
regression; Tag/Glossary add use separate methods so the coverage is real.

Refactor TagResourceIT and GlossaryTermResourceIT to extract createBareTable
helper so add-side tests don't need to pre-tag the table.

* fix(bulk-assets): guard EntityRepository.bulkAssetsOperation against null assets

Four of the five bulk-asset code paths already early-return on
nullOrEmpty(request.getAssets()) (Tag add/remove, Glossary add/remove,
DataProduct via listOrEmpty). EntityRepository.bulkAssetsOperation
(used by the Team endpoint) did not, so a request like {"dryRun": true}
with no assets field — valid per the BulkAssets schema, which doesn't
mark assets as required — would NPE at the enhanced-for loop.

Mirror the early-return pattern from the per-tag-usage repositories.
Adds a TeamResourceIT case that posts a body with omitted assets and
verifies the server responds with zero counts instead of 500.

Flagged by Copilot review on #28005.

* fix(tags): run validation during dryRun add, only skip the writes

Flagged by Copilot review on #28005. The previous dryRun branch in
TagRepository.bulkAddAndValidateTagsToAssets returned per-asset
successes immediately after populateEntityReferences, bypassing the
mutually-exclusive tag check and the column FQN/lookup validation. A
dryRun=true call would falsely report that all assets would pass even
when the real call would fail validation, violating the dryRun contract
("validated but no changes").

Mirror the GlossaryTermRepository.bulkAddAndValidateGlossaryToAssets
pattern: drop the early-return dryRun branch, let the existing
validation loop run, and gate only the write side-effects (applyTags,
searchRepository.updateEntity, tag_usage writes) on !dryRun. Same
treatment for addTagToColumn.

The Tag add endpoint is async (returns jobId, ships BulkOperationResult
via WebSocket), so end-to-end IT coverage for the validation outcome
would require WebSocket subscription. The existing
test_bulkAddTagToAssets_dryRunTrue_doesNotApply continues to cover the
no-mutation invariant. The validation-runs invariant is covered by the
gold-standard sync glossary pattern this fix now mirrors.

* fix(bulk-assets): guard DomainRepository.bulkAssetsOperation against null assets

Same NPE pattern Copilot flagged on EntityRepository.bulkAssetsOperation:
the for-loop iterates request.getAssets() directly, so a request like
{"dryRun": true} (assets is optional per the BulkAssets schema) would
NPE on the enhanced-for. populateEntityReferences(null) is a no-op
thanks to its internal nullOrEmpty guard, but the loop on the next line
isn't safe.

Mirror the same early-return pattern now used by EntityRepository and
the four per-tag-usage repositories.

* fix(bulk-assets): run validation during dryRun on Tag+Glossary remove

Flagged by Copilot review on #28005. The dryRun branches in
TagRepository.bulkRemoveAndValidateTagsToAssets and
GlossaryTermRepository.bulkRemoveGlossaryToAssets short-circuited after
populateEntityReferences and reported all assets as passed, skipping
the same validations the real remove loop runs:

- column refs: removeTagFromColumn resolves the column FQN and looks up
  the parent table — throws on malformed FQN or missing table
- non-column refs: entityRepository.get(null, ref.getId(), ...) throws
  when the asset doesn't exist

So dryRun=true could falsely report success for inputs that would fail
the real call. Same fix shape as the Tag add side (commit 03c6f5002f):
drop the early-return dryRun branch, let the validation loop run, gate
only the destructive side-effects (tag_usage delete +
searchRepository.updateEntity) on !dryRun. removeTagFromColumn (both
Tag and Glossary versions) gets a dryRun parameter for the same
treatment inside.

The EntityRepository.bulkAssetsOperation case Copilot also flagged is
not bundled — addRelationship/deleteRelationship don't have a
separable validate phase, populateEntityReferences silently drops
orphan refs rather than throwing, and the Domain override (the
pre-existing gold standard) has the same gap. Fixing that properly
needs a broader design decision about preview semantics across the
entity_relationship path; tracked as a follow-up.

* fix(bulk-assets): guard TeamRepository validateAllRefUsers against null assets

Same NPE pattern Copilot flagged on EntityRepository and DomainRepository:
validateAllRefUsers iterates request.getAssets() directly, so a request
like {"dryRun":true} (assets is optional per the BulkAssets schema) would
NPE on the enhanced-for at "refs.iterator()" before reaching
bulkAssetsOperation (which already has its own null guard).

Add the standard nullOrEmpty early-return inside validateAllRefUsers so
both bulkAddAssets and bulkRemoveAssets pass through cleanly. Also drop
the duplicate inline Entity.USER check from bulkAddAssets — it asserted
the same invariant validateAllRefUsers already enforces.

Fixes test_bulkAssets_omittedAssets_returnsNothingToValidate which now
returns numberOfRowsProcessed=0, numberOfRowsPassed=0 instead of NPE-ing
at TeamRepository.java:452.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(bulk-assets): run dryRun add when glossary term has no parent tags

bulkAddAndValidateGlossaryToAssets short-circuited on
"dryRun && (glossary.getTags() empty || assets empty)" and returned
numberOfRowsProcessed=0. The glossary-tags-empty branch was wrong: the
loop applies the *term's* FQN to assets, and the glossary tags only feed
the mutual-exclusivity check. When the parent glossary has no inherited
tags (the normal case for a freshly-created glossary), assets should
still be processed and counted on dryRun.

Mirror the gold-standard TagRepository.bulkAddAndValidateTagsToAssets
(line 356) and sibling bulkRemoveGlossaryToAssets (line 1375) pattern:
short-circuit only when there are no assets to process.

Fixes test_bulkAddGlossaryToAssets_dryRunTrue_doesNotApply which now
returns numberOfRowsProcessed=1, numberOfRowsPassed=1, no tag applied —
instead of 0/0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(team): rename misleading domainName param to teamName in bulkRemoveAssets

Copilot review comment on #28005: the parameter was named domainName but
this method operates on Teams (path /v1/teams/{name}/assets/remove).
Pre-existing naming, surfaced because the dryRun diff touched these
lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 01:34:16 +00:00
Sriharsha Chintalapani
6c8441a06d
fix(search): skip soft-delete script propagation to time-series child aliases (#28064)
* Entity Time Series

* test case results

* Add Playwright

* fix: declare 'summary' on TestSuiteIndex so reindex preserves lastResultTimestamp

TestSuiteIndex.buildSearchIndexDocInternal computes the top-level
lastResultTimestamp field from testSuite.getTestCaseResultSummary().
TestSuiteRepository registers a fetcher for that under the field name
"summary"; the reindex path only invokes fetchers whose field is in
getRequiredReindexFields(). Without "summary" declared, the fetcher does
not run, testCaseResultSummary stays null on the entity, and the Index
takes its else branch (TestSuiteIndex.java:41) writing
lastResultTimestamp=0L on every reindexed suite.

That field is exactly what the DQ /data-quality/test-suites list page
sorts by (TestSuites.component.tsx:175), so a full reindex collapses
every basic suite to the 1970 epoch and "most recently run first" stops
working.



Co-authored-by: mohitdeuex <mohit.y@deuexsolutions.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-05-17 09:55:51 -07:00
Sriharsha Chintalapani
ac5211451a
fix(rdf): scope glossary graph by membership and surface glossary labels (#28199)
* fix(rdf): scope glossary graph by membership and surface glossary labels

The /rdf/glossary/graph endpoint silently ignored its `glossaryId` filter
and returned every term in every glossary, because the SPARQL query bound
`?glossary` via `OPTIONAL { ?term1 om:belongsTo ?glossary }` while the
JSON-LD context (governance.jsonld) maps GlossaryTerm.glossary to
`om:belongsToGlossary`. The downstream `FILTER(?glossary = <…>)` was a
no-op, and the UI hierarchy view rendered every glossary's terms grouped
by their raw UUID — most visibly for cross-glossary scenarios.
2026-05-17 08:00:23 -07:00
Sriharsha Chintalapani
383d05b575
fix(cache): seal hard-delete stale-read window via NotFoundCache marker (#28197)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* fix(cache): mark hard-deleted entities in NotFoundCache to seal stale-read window

EntityRepository.cleanup() invalidates Guava L1, Redis L2, and pub/sub before
and after the dao.delete() inside its transaction. A concurrent reader that
arrives between the two invalidate passes can still snapshot the row from
the not-yet-deleted DB state, write-through it back into Redis (and Guava on
the loader callback), and silently re-populate the cache past the DELETE
response.

The NotFoundCache (notFoundTtlSeconds, default 30 s) is already consulted by
find(UUID,…) and findByName(fqn,…) on L1 miss as a known-not-found
short-circuit. Populate it explicitly when cleanup() finishes: the next
read sees a stale L1 entry → false → checks the loader path → loader is
short-circuited by isMarkedNotFound* → 404. Recreate-with-same-id paths
already clear the marker via CacheBundle.invalidateEntity() in postCreate,
so this doesn't shadow legitimate reuses.

Read-path order is unchanged (L1 → negative cache → loader) so cache-hit
GETs still pay zero extra Redis round-trips.
2026-05-16 16:12:12 -07:00
Sriharsha Chintalapani
4bb6574815
fix(glossary): preserve all relation types between same term pair (#28172)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* fix(glossary): preserve all relation types between same term pair

The entity_relationship primary key (fromId, toId, relation) caused the
second INSERT for the same (term, term) pair to UPSERT and overwrite the
json discriminator, silently dropping any previously stored relationType.
Adding the same target term with "synonym" then "seeAlso" left only one
relation visible on GET.

Extend the PK to (fromId, toId, relation, relationType) so each typed
relation lives in its own row. The new column defaults to '' for every
non-glossary edge, leaving existing call sites and queries semantically
unchanged. CollectionDAO.deleteWithRelationType and countByRelationType
now filter on the column directly instead of JSON_EXTRACT. The 1.13.0
schemaChanges migration backfills relationType from the existing json
for glossaryTerm RELATED_TO rows, then atomically swaps the PK.

Adds three integration tests in GlossaryTermResourceIT covering the
add path, the targeted-removal path, and tag-usage cleanup when a
table tagged with a glossary term is hard-deleted.
2026-05-16 11:37:21 -07:00
Sriharsha Chintalapani
1352d67cf4
feat(dar): Granted lifecycle, filters, sort, and self-service create policy (#28044)
* feat(dar): add Granted lifecycle, filters, sort, and self-service create policy

Splits the Data Access Request lifecycle into Approved (awaiting grant) and
Granted (active access) so the UI can show an "approved – awaiting grant"
banner that clears once an admin marks the request as granted. Adds an
indexed approvedBy/approvedById/approvedAt on Task, captured at the approve
transition through a new direct-persist helper. Introduces a dedicated
/v1/tasks/dataAccessRequests endpoint pre-scoped to category=DataAccess with
DAR filters (dataset, service, status, requestedBy, approver, accessType)
and an asc/desc sort on createdAt; generic /v1/tasks gains service/approver
filters too. DataConsumerPolicy now grants Create on resource=task so
authenticated non-admins can file a DAR (fixes "operations [Create] not
allowed"). Reworks the workflow handler so transitions whose targetTaskStatus
is non-terminal (Approved, Granted) don't close the task, and updates
CreateTask.isTerminalTaskStatus to allow advancing between Approved →
Granted stages. Adds a new "active" statusGroup that includes the DAR
lifecycle states while preserving the existing open/closed semantics that
Glossary-style workflows depend on. Includes a Postgres + MySQL migration
for the indexed approvedById generated column and integration coverage in
DataAccessRequestIT spanning the new lifecycle, filters, sorting, approver
capture, and the non-admin policy path.

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: anuj-kumary <anujf0510@gmail.com>
Co-authored-by: Ram Narayan Balaji <ramnarayanb3005@gmail.com>
Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com>
2026-05-16 07:35:15 -07:00
Sriharsha Chintalapani
5696286b27
Address Transitive vulnerabilities (#28169)
* Address transitive vulnerabilities

* Address transitive vulnerabilities

* fix(deps): resolve pyOpenSSL/cryptography conflict and align constraint pins

CI dependency resolution failed because pyOpenSSL~=24.1.0 caps cryptography
at <43, conflicting with the cryptography>=44.0.1 bump. Widens pyOpenSSL to
>=24.3.0 (first version compatible with cryptography 44.x) and aligns the
airflow constraint file pins for cryptography and GitPython with the
upstream setup.py bumps so pip install -c can resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 00:02:49 -07:00
Sriharsha Chintalapani
92a212a3a5
fix(search): scope column-grid metadata-status ITs + surface ES shard-failure reason (#28162)
`test_getColumnGrid_withMetadataStatusIncomplete` has been failing consistently on
the new Integration Tests - PostgreSQL + Elasticsearch + Redis lane (added in
#28012), even with @Isolated + 2 GB ES heap on the existing fix-cache-it-failures
branch — so it's not parallel-test contention or generic heap pressure. The
underlying ES `[search_phase_execution_exception]` reason is being swallowed by
the aggregator's catch block, leaving us guessing.

Two changes:

1. `ElasticSearchColumnAggregator` — add `logShardFailureDetails(e, indexes, query)`
   to the four `catch (ElasticsearchException)` sites. Logs the serialized query
   plus `e.error().rootCause()` before rethrowing, so the next failed CI run will
   reveal the actual shard-level reason (circuit_breaker, too_many_buckets,
   mapping conflict, etc.) instead of just "all shards failed".

2. `ColumnGridResourceIT` — scope the three `metadataStatus=*` tests
   (`Missing`/`Complete`/`Incomplete`) to their own `serviceName` like every other
   test in the class. They were the only tests in the file that aggregated over
   the entire index instead of just their own service's data — a defensive fix
   that drops the per-test query workload to ~1 doc regardless of how much
   accumulated data is in the cluster.
2026-05-15 19:54:42 -07:00
Sriharsha Chintalapani
0949bcc3ec
fix(cache): close rename-cascade race and bot-task delete eviction gap (#28100)
* fix(cache): close rename-cascade race and bot-task delete eviction gap
2026-05-15 18:50:28 -07:00
Pere Miquel Brull
4217e6db8d
fix(log-storage): plug clobber bugs in streamable S3 logs (partial.txt + logs.txt) (#27926)
* fix(api): make closeStream idempotent when log storage is not configured

closeStream used to throw IllegalStateException("Log storage is not
configured") which the resource layer translates to a 500 response.
That made the contract surprising for callers: any defensive cleanup
path (exit handlers, retry logic, generic teardown) had to know in
advance whether streaming was configured before calling close, or eat
spurious server errors.

Closing a stream is naturally idempotent — same shape as DELETE on a
non-existent resource. When log storage is not configured, return
silently with a debug log so callers can call close() defensively
without checking state first.

Adds a unit test covering the no-op path.

* Add design spec for streamable logs stability fix

Captures the design discussion for fixing partial.txt and logs.txt
clobber bugs in S3LogStorage when ingestion runs hit idle gaps longer
than the 5-minute stream timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add full design flow doc for streamable ingestion logs

End-to-end documentation of the streamable logs feature: architecture,
storage layout, run lifecycle, read paths, abandoned-run recovery,
configuration, concurrency model, and observability. Reflects the
post-fix design captured in the streamable-logs-stability spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add implementation plan for streamable-logs stability fix

Step-by-step TDD plan grouped into 8 PR-sized tasks: config schema
additions, per-stream lock, pendingFlush + merge-always flush, multipart
removal, sweeper rewrite, /close rewrite, read-path correction, and
integration tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(log-storage): add config fields for streamable-logs stability fix

Adds streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures. Deprecates
streamTimeoutMinutes in favor of streamTimeoutHours. Pure schema-only
change; no Java code consumes these fields yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(log-storage): add deprecated:true keyword and clarify watermark unit

Addresses code review on Task 1: project convention uses the JSON Schema
deprecated keyword alongside description annotation. Also clarifies that
earlyFlushWatermarkBytes default (5242880) equals 5 MB.

* feat(log-storage): wire new stability-fix config fields into S3LogStorage

Reads streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures from
LogStorageConfiguration with sane defaults. No behavioral change yet —
values are stored but not consumed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(log-storage): broaden streamTimeoutMinutes deprecation warning + drop FQN

Addresses code review on Task 2: warning now fires whenever
streamTimeoutMinutes is set (not only for values < 30 min), since the
field is deprecated for all deployments. Also imports java.lang.reflect.Field
in the test helper instead of using a fully-qualified name (CLAUDE.md
no-FQN rule).

* refactor(log-storage): add per-stream ReentrantLock for S3LogStorage

Introduces streamLocks map and acquire/release helpers. appendLogs,
writePartialLogsForStream, closeStream, and cleanupExpiredStreams all
serialize on the per-stream lock. No behavior change; locking is
pure mutual-exclusion at this point.

* fix(log-storage): close iterator.remove race in cleanupExpiredStreams

Move iterator.remove() inside the per-stream lock to prevent a window
where a concurrent appendLogs sees the still-present closed StreamContext
and writes to a closed stream. Also clarifies the comment on flush(fqn,runId)
ordering and documents that streamLocks accumulates monotonically until
Tasks 7 and 8 add cleanup.

* feat(log-storage): track pendingFlush queue and totalLinesAppended counter

Each appendLogs now also populates per-stream pendingFlush (lines awaiting
flush) and totalLinesAppended (monotonic logical line counter). State is
written but not yet consumed; the new flush logic in the next commit reads it.

* fix(log-storage): document thread-safety + lifecycle on Task 4 maps, add test

Addresses review on Task 4: documents that pendingFlush ArrayList values
may only be accessed under the per-stream lock; clarifies that
consecutiveFlushFailures is written and consumed in Task 5 (not just
consumed); aligns its type with AtomicInteger for consistency with
the other counters; adds a test for the trailing-newline trim path.

* fix(log-storage): merge-always partial.txt PUT and persist offset in S3 metadata

Replaces the old writePartialLogsForStream that skipped the read-merge step
when partialLogOffsets[streamKey] was 0 (the canonical 80MB->KB clobber bug).
The new flush always reads existing partial.txt, appends a snapshot of
pendingFlush, and PUTs with offset state in S3 user-defined metadata.

Also adds an early-flush watermark trigger so high-burst writes don't
pile up unbounded in pendingFlush.

Closes the partial.txt-clobber half of the streamable-logs-stability spec.

* fix(log-storage): replace task-number comments with intent-describing language

Addresses code review on Task 5: production code comments should describe
invariants, not the planning-doc task that filled the gap. Also clarifies
the parse-before-lock and the byte-counter atomicity assumption.

* refactor(log-storage): remove MultipartS3OutputStream, rewrite closeStream as server-side copy

appendLogs no longer initiates a multipart upload; bytes flow only through
pendingFlush -> partial.txt PUTs.

closeStream now: (1) drains pendingFlush via final partial.txt PUT,
(2) issues CopyObjectRequest from partial.txt to logs.txt server-side,
(3) deletes partial.txt and the .active marker, (4) drops in-memory state.

Idempotent: a second /close sees no partial.txt (NoSuchKeyException) and
returns gracefully.

Closes the logs.txt-clobber half of the streamable-logs-stability spec
and finalizes the canonical /close flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(log-storage): plug listener/lock leaks, propagate SSE on copy, recover counter from metadata

Addresses code review on combined Tasks 6+8:
- dropStreamState now removes activeListeners entries (SSE listener leak fix).
- cleanupExpiredStreams now removes streamLocks entries on expire (lock leak fix).
- copyPartialToLogs applies SSE configuration to CopyObjectRequest (was unencrypted on copy).
- writePartialLogsForStreamLocked reads last-flushed-line metadata from existing
  partial.txt and uses it to keep totalLinesAppended monotonic across restarts.
- consecutiveFlushFailures reset uses computeIfAbsent + set(0) instead of allocating
  a new AtomicInteger every successful flush.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(log-storage): rewrite sweeper as cleanupAbandonedStreams (24h/1h)

Bumps the idle threshold from 5 min to streamTimeoutHours (default 24h)
and the poll interval from 1 min to cleanupIntervalMinutes (default 1h).
On expire, finalizes the abandoned run by copying partial.txt -> logs.txt
server-side, deleting partial.txt, and dropping in-memory state — same
end-state as closeStream.

Also wires partialFlushIntervalMinutes into the periodic flush schedule
and removes the legacy streamTimeoutMs field that no longer drives behavior.

* fix(log-storage): preserve streamLocks entry on cleanup retry path

Addresses code review on Task 7: streamLocks.remove was unconditionally
in the finally block of finalizeAbandonedStream, so it ran even when the
sweeper returned early to retry next tick on a copy failure. That meant
the next sweep tick would create a fresh ReentrantLock, and any
concurrent appendLogs in the meantime would contend on a different lock
object than the retry, defeating mutual exclusion.

Now we only remove the lock entry once finalization has succeeded
(after dropStreamState). The retry path leaves the lock in place so
the next tick and any concurrent appendLogs see the same lock identity.

* fix(log-storage): include pendingFlush snapshot in mid-run reads

getCombinedLogsForActiveStream now appends the in-memory pendingFlush
snapshot to the partial.txt body when reading mid-run, so the UI's
paginated GET surfaces the most recent tail even before the next
scheduled flush has happened.

Only appends pendingFlush when a partial.txt file exists, avoiding
duplication in the fallback path where recentLogsCache already
includes those lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(log-storage): tighten Task 9 read path safety + invariant comment

Addresses review on Task 9: the unsafe null-lock fallback in the
pendingFlush append path is removed (it was structurally unreachable
but a latent hazard for future lifecycle changes). The pendingFlush
read now happens entirely under the per-stream lock, with a
conservative skip if no lock entry exists.

Also documents the recentLogsCache-vs-pendingFlush invariant in the
fallback path and adds a total-count assertion to the new test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(log-storage): add bug-reproducer ITs for streamable-logs stability

- testIdleGapDoesNotClobberPartial: two log bursts within an open run;
  asserts both are present in the read response.
- testCloseProducesLogsTxtMatchingPartial: write, close, read; asserts
  content survives the close.
- testCloseIsIdempotent: a second /close is a graceful no-op.

Tests are tolerant of the storage backend in the test environment
(DefaultLogStorage in CI may not persist; S3LogStorage in S3-configured
environments). Deep behavioral coverage is in S3LogStorageTest unit tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(log-storage): address final-review critical bugs

- closeStream and finalizeAbandonedStream now propagate PUT failures
  from writePartialLogsForStreamLocked (which returns boolean).
  closeStream throws IOException; the sweeper retains state for retry.
  Fixes silent data loss when the final flush PUT fails.
- streamLocks entries are no longer removed; this prevents an
  acquire-vs-remove race that would break mutual exclusion. Memory
  growth is bounded by maxConcurrentStreams in practice.
- cleanupAbandonedStreams re-checks expiration inside the per-stream
  lock so a stream that was bumped by appendLogs between the scan
  and the lock acquisition is not finalized.
- deleteLogs now acquires the per-stream lock before mutating state.
- getCombinedLogsForActiveStream appends pendingFlush in BOTH the
  S3-found and memory-fallback branches, so reads aren't truncated
  when recentLogsCache evicts oldest at its 1000-line cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(log-storage): use pendingFlush as canonical mid-run read source (no duplicates)

The previous Issue 5 fix appended pendingFlush unconditionally, which
caused duplicate lines in the read response when the fallback branch
used recentLogsCache (since both are populated by the same appendLogs).

Now: in the foundPartialFile branch, append pendingFlush AFTER the S3
body (non-overlapping by construction). In the fallback branch
(no partial.txt yet), use pendingFlush directly as the canonical
source — this is more complete than recentLogsCache (1000-line cap)
and avoids the duplicate issue. recentLogsCache remains a defensive
fallback for the rare case where pendingFlush is empty in the fallback
path.

* Update generated TypeScript types

* chore(log-storage): drop dead abortIncompleteMultipartUpload lifecycle rule

The multipart upload write path was removed; the bucket lifecycle's
abortIncompleteMultipartUpload(7 days) rule served only as migration
cleanup for in-flight uploads from the old code at deploy time. After
the migration window it does nothing.

Drops the rule from configureLifecyclePolicy, the AWS SDK import, the
"7 days multipart cleanup" string in the startup log, and the
corresponding bullet in docs/streamable-logs.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: ignore docs/superpowers/

Local-only working notes (specs, plans) live there and shouldn't be tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(log-storage): tolerate DefaultLogStorage in CI for streamable-logs ITs

CI runs the integration tests against the bootstrap config which uses
DefaultLogStorage (delegates to k8s/Airflow which isn't running). The
storage returns:
- "No pods found for this pipeline" sentinel for getLogs
- non-2xx status (the SDK wraps it as statusCode -1) for /close

Adjustments:
- testIdleGapDoesNotClobberPartial: parse JSON, only assert when total>0.
  When storage actually persists (S3 deployments), assert BOTH bursts
  are present — that's the real "no clobber" check.
- postClose helper: tolerate any exception from the close call
  (idempotency is the contract; transient errors are acceptable).

The deep behavioural coverage continues to live in S3LogStorageTest unit
tests where mock S3 is the storage backend.

* test

* fix

* Update generated TypeScript types

* fix

* Update generated TypeScript types

* fix(log-storage): record UTF-8 byte length in partial.txt total-bytes metadata

String.length() returns UTF-16 code units; for non-ASCII content this
diverged from the actual S3 object size, breaking the drift cross-check
documented in docs/streamable-logs.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(log-storage): address PR review findings on S3LogStorage

Plumbs the documented timing knobs (cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures) through LogStorageConfiguration
so operators can actually tune them. Replaces the unbounded streamLocks ConcurrentHashMap
with a Guava Striped<Lock> capped at 256 stripes, eliminating the per-(fqn, runId) memory
leak and the acquire-vs-remove race that a per-key map would have. Adds a Multipart
Upload + UploadPartCopy concatenation path for partial.txt >= 5 MB, avoiding the O(n^2)
total transfer and full in-JVM body merge that the prior GET+PUT-everything strategy hit
on long-running pipelines. Realigns docs/streamable-logs.md with the actual schema and
implementation, drops the broken superpowers/* spec link, and renames the misleading
testIdleGapDoesNotClobberPartial IT (which posted bursts back-to-back without simulating
any gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-15 11:02:19 +02:00
sonika-shah
17c3b8b9b2
fix(profiler): N+1 / missing-index regression on /tables/.../columns?fields=profile (#3488) (#27746)
* fix(profiler): N+1 / missing-index regression on /tables/.../columns?fields=profile (#3488)

Root cause
----------
The 1.9.9 migration introduced two separate index regressions on
`profiler_data_time_series`:

1. **PostgreSQL**: `schemaChanges.sql` explicitly dropped the unique
   constraint `profiler_data_time_series_unique_hash_extension_ts`
   (entityFQNHash, extension, operation, timestamp) to allow altering the
   generated `operation` column expression, but never recreated it.  After
   the migration the table kept only the `(extension, timestamp)` index,
   which is useless for queries filtering by `entityFQNHash`.

2. **MySQL/both**: `postDataMigrationSQLScript.sql` created temporary indexes
   (idx_pdts_entityFQNHash, idx_pdts_composite, etc.) for its bulk UPDATE
   pass and then dropped **all** of them, including the only index covering
   `entityFQNHash`.

The batch query issued by `getLatestExtensionsBatch()` when
`fields=profile` is requested:

  SELECT entityFQNHash, MAX(timestamp) FROM profiler_data_time_series
  WHERE entityFQNHash IN (...N hashes...) AND extension = 'table.columnProfile'
  GROUP BY entityFQNHash

required an `(entityFQNHash, extension, timestamp)` index.  Without it the
database performs a full table scan.  On production deployments with
millions of profiler rows this caused 100+ second response times (Grafana:
106 770 ms; 99 % in DB; 93 dbOps).  Without `profile` in the fields param
the same endpoint returned in ~150-220 ms.

A secondary N+1 bug existed independently of the index: `customMetrics`
in fields called `getCustomMetrics(table, column)` once per paginated
column, issuing up to N identical queries against `entity_extension` and
then filtering in Java.

Fix
---
* **migration 2.0.2** (MySQL + PostgreSQL): `CREATE INDEX IF NOT EXISTS
  idx_pdts_fqnhash_ext_ts ON profiler_data_time_series(entityFQNHash,
  extension, timestamp)`.  The `IF NOT EXISTS` guard makes the migration
  safe to re-run and handles both upgrade and fresh-install paths.

* **`getTableColumnsInternal`** — `customMetrics` block: fetch all column
  custom metrics for the table in one query, group by column name in Java,
  then distribute.  Reduces N queries to 1.

* **`getTableColumnsInternal`** — `profile` block: skip the duplicate
  `populateEntityFieldTags` call when `tags` was already fetched earlier in
  the same request, saving one prefix-scan on `tag_usage` per request.

Related: PR #26855 (fixed N+1 tag queries on the list-tables path but left
the profiler-index and customMetrics N+1 untouched on the columns sub-path).

* fix(profiler): restore unique constraint on profiler_data_time_series + batch column extension/customMetrics fetch

Move the migration from 2.0.2/ to 1.12.8/ and switch from a non-unique
covering index to restoring the original unique constraint dropped in
1.9.9. The two-phase CREATE UNIQUE INDEX CONCURRENTLY + ADD CONSTRAINT
USING INDEX pattern avoids the ACCESS EXCLUSIVE lock on the hot
profiler_data_time_series table during the upgrade. Closes the 1.9.9
regression and brings Postgres back in line with MySQL (which never lost
the constraint). The leading (entityFQNHash, extension) prefix serves
the column-profile batch query — same shape MySQL has been running
without 504s. MySQL needs no migration.

Java side, eliminates two more N+1 patterns that compound the latency at
customer scale:

* getTableColumnsInternal extension block: replaced per-column
  getColumnExtension() loop with a single getExtensionsByJsonSchema()
  call, grouped by column FQN-hash in Java.
* searchTableColumnsInternal customMetrics block: applied the same
  batch-fetch pattern already used in getTableColumnsInternal, replacing
  per-column getCustomMetrics() with one getExtensions() call.

New DAO method on EntityExtensionDAO:
  getExtensionsByJsonSchema(id, jsonSchema) — selects extensions for a
  table id filtered by the jsonschema discriminator. Required because
  column extensions are stored with MD5-hashed extension keys and have
  no shared prefix the existing getExtensions(id, prefix) could use.

* chore(profiler): address review feedback — empty-list literal + accurate test comments

* Replace `new ArrayList<>()` default in `metricsByColumn.getOrDefault(...)`
  with `List.of()` at both call sites in `TableRepository` (getTableColumnsInternal
  and searchTableColumnsInternal). `getOrDefault` evaluates its default eagerly,
  so the new ArrayList allocates per-column even when the key is found —
  unnecessary work on a hot path.

* Reword two stale test comments in `test_getColumnsWithProfileField_correctnessAndNoBatchRegression`:
  - "all four field combinations" → "the three field combinations exercised below"
  - "(c) duplicate populateEntityFieldTags must not run twice" → describe the
    observable contract the assertions actually verify (tags + profile both
    present), not the internal call count.

* fix(profiler): force outer index scan in getLatestExtensionsBatch by pushing IN list to the join

The getLatestExtensionsBatch query was the right shape for correctness but
the planner — on Postgres at customer scale, with the new unique constraint
in place — was still choosing a parallel sequential scan over the full
profiler_data_time_series table for the outer side of the JOIN, rather than
a merge join with index scan on both sides.

Inner subquery: filtered by `entityFQNHash IN (...)`, used the index.
Outer: only filtered by `p.extension = :extension`, no IN list, planner
couldn't infer the transitive constraint that p.entityFQNHash must equal
one of the inner hashes (because it's enforced through the JOIN ON clause,
not a WHERE predicate). Result: full table scan reading 6.7M+ rows even
when the actual answer is 23 rows.

Adding the redundant `AND p.entityFQNHash IN (<entityFQNHashes>)` to the
outer WHERE makes the constraint explicit. The result set is unchanged
(implied by the join condition), but the planner can now use the unique
index for the outer access too.

Verified on the AUT dump (6.94M-row pdts):
  EXPLAIN of the batch query: 7,234ms → 79ms (Hash Join + Parallel Seq
  Scan → Merge Join + Index Only Scan).
  Live API /columns?fields=profile&include=all: 6-36 seconds → 22-28ms
  (warm) / 1.9s (very first call). 250-1000x improvement, depending on
  cache state.

Same SQL works on both engines; no @ConnectionAwareSqlQuery split needed.

* test(profiler): shorten classification/tag fixture names in IT to fit varchar(256)

The IT fixture for test_getColumnsWithProfileField_correctnessAndNoBatchRegression
was building a tagFQN of `<classification>.<tag>` where each part went through
TestNamespace.prefix(). With the descriptive method name (62 chars) + class
name (15 chars) + namespace UUID (32 chars) plus the `profile_test_cls` /
`profile_test_tag` base names (16 chars each), the resulting tagFQN was 263
characters — over the tag_usage.tagFQN VARCHAR(256) limit:

  ERROR: value too long for type character varying(256)

Shorten the fixture base names from `profile_test_cls`/`profile_test_tag` to
`cls`/`tag`. The namespace prefix already encodes test isolation (class +
method + UUID), so the base name doesn't need to repeat that context.

New tagFQN length: 237 chars (cls__<32>__TableResourceIT__<62>.tag__<32>__TableResourceIT__<62>),
comfortably under 256.

* fix(table): include extensionKey in column-extension deserialize warn log

Addresses gitar-bot review on PR #27746: the warning log on failed column-
extension deserialization only had table.getId(), so operators could not
pinpoint which row was bad. Add record.extensionName() (the entity_extension
row key) to the log. No extra iteration - record is already in scope inside
the catch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(migration): move profiler unique-constraint migration to 1.12.9

1.12.8 was already published with the PII classification fix from #27910.
Move the profiler_data_time_series unique-constraint restore (this PR's
postgres migration) to 1.12.9 so customers upgrading past the published
1.12.8 still pick it up.

Add a MySQL placeholder schemaChanges.sql for 1.12.9 consistent with the
1.12.7 convention — MySQL was unaffected by the 1.9.9 regression
(MODIFY COLUMN re-evaluates generated expressions in place without
touching the constraint, so MySQL still has the constraint from 1.1.5).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(table): extract batchFetchCustomMetricsByColumn helper

Addresses PR #27746 Copilot review:
- Dedupe custom-metric batch logic between getTableColumnsInternal and
  searchTableColumnsInternal.
- Reword IT inline comment to reflect what the test actually validates
  (completes within timeout + correct profiles) instead of claiming it
  inspects query plans.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-05-14 06:56:39 -07:00
Sriharsha Chintalapani
64f49c1747
Cache improvements: lineage + search layers, observability, CI gate (#28012)
* cache: lineage cache, per-type metrics, invalidation registry, search-cache

Add Redis-backed lineage response cache and search response cache, both
gated by the existing CACHE_PROVIDER toggle and falling through to direct
computation when the cache is unavailable. The cache remains optional —
verified end-to-end by toggling CACHE_PROVIDER=none on a live stack and
confirming all paths continue to work (just without the L2 hit).

Coverage:
- CachedLineage wraps LineageRepository.getLineage with hybrid TTL +
  direct invalidation (60s default). Direct edits invalidate the affected
  root cache entries; transitive changes fall through to TTL.
- CachedSearchLayer wraps /api/v1/search/query with auth-aware caching
  (cache key includes principal so users with different ACLs don't share
  results). 30s default TTL.

Observability:
- /api/v1/system/cache/stats response now includes a metrics block with
  hits/misses/hitRatio/evictions/errors/writes plus read/write latency
  Timers, and a byType breakdown so coverage gaps are visible per
  entity-type and per cache-layer.

Correctness:
- New Invalidatable interface + CacheBundle registry + invalidateEntity
  helper so future cache layers plug in by implementing one method
  instead of editing multiple mutation paths.
- Edge mutations in LineageRepository.addLineage/deleteLineage invalidate
  both endpoints; entity mutations in EntityRepository.postUpdate /
  postDelete / restoreEntity invalidate the lineage rooted at the entity.
- Pub/sub handler in CacheBundle iterates registered Invalidatables so
  remote-pod evictions flow to all layers automatically.

Tooling:
- docker-compose.cache-off.yml overlay flips CACHE_PROVIDER=none for
  local A/B testing without tearing down DB/ES volumes.
- CachedSearchLayerIT exercises hit-on-second-call, distinct-query
  misses, distinct-page-size misses, and byType shape via the metrics
  endpoint. Each test gracefully no-ops when the cluster runs cache-off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cache: phase 2 ops + correctness — single-flight, slow-read, negative cache, admin endpoints

Builds on the phase 1 commit (c20a29b11b) with operability and correctness
items from .context/cache-improvements-design.md. All four pieces respect
the optional-cache contract: with CACHE_PROVIDER=none they no-op cleanly.

P2.3 — Single-flight on CachedSearchLayer
  Striped<Lock> keyed by SHA-cache-key. 100 concurrent users hitting the
  same uncached query collapse to one ES call instead of N. SearchResource
  now uses loadOrCompute so the lock-and-recheck pattern lives inside the
  cache layer; the supplier is the actual ES call kept tight to minimize
  lock-hold time. Non-200 upstreams bypass cache and refetch.

P2.6 — Slow cache reads logged
  RedisCacheProvider.get/hget timing checked against
  cache.slowReadThresholdMs (default 50ms). Exceeding fires a WARN log
  and bumps a new cache.reads.slow Micrometer counter exposed in
  /cache/stats.metrics.slowReads. Leading indicator of Redis pressure /
  network glitch / hot-key contention.

P2.4 — Negative caching for not-found entities
  NotFoundCache marks "we looked, no such entity" with a short TTL
  (default 30s) so repeated 404 lookups (typo'd FQNs, references to
  deleted entities) don't hammer the DB. Wired into
  EntityRepository.find(UUID) and findByName for the !fromCache path.
  Implements Invalidatable so the postCreate fan-out drops the marker
  on entity create — without that, create-then-immediately-read would
  404 for up to TTL.

  Added CacheBundle.invalidateEntity to EntityRepository.postCreate so
  newly-created entities reach every Invalidatable registry layer.

P2.5 — Admin cache ops endpoints
  GET  /api/v1/system/cache/keys?pattern=...      — SCAN keys, returns count
  POST /api/v1/system/cache/invalidate?pattern=.. — SCAN+UNLINK, returns deleted
  POST /api/v1/system/cache/invalidate/entity?type=&id=&fqn=
                                                  — fan to all Invalidatables

  All admin-only. Pattern endpoints document the "no broad globs" rule —
  we never want a SCAN over om:prod:* on a busy cluster. Per-entity
  endpoint goes through the existing Invalidatable registry so future
  cache layers are reachable from ops without ever touching this code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cache: pipelined mget on CacheProvider + CachedReadBundle.getBatch

Adds a foundational batch-read primitive at the provider layer:

  CacheProvider.mget(List<String>) -> List<Optional<String>>

Default implementation does sequential per-key gets (correct, no batching
benefit). RedisCacheProvider overrides with a true pipelined version: all
GETs are queued under setAutoFlushCommands(false), then flushed once and
awaited as a single TCP round-trip. Records hits/misses through the
existing CacheMetrics counters and respects the slow-read threshold.

Per-key pipelining over true MGET — Redis Cluster requires same-slot keys
for MGET; pipelined per-key GETs work transparently across slots without
the constraint, at the same network cost.

CachedReadBundle.getBatch(entityType, ids) consumes the new primitive
for prefetch use cases (UI prefetch on hover, list-then-detail
navigation warmup). The list endpoint hot path itself does NOT use this
layer — list responses are SQL-batched via EntityRepository.setFieldsInBulk
which calls fieldFetchers in bulk, not per-row CachedReadBundle.get.
That's why bench3 showed list endpoints at neutral cache_off-to-on
ratio: lists already amortize at the SQL layer.

The mget primitive is what later phases will plug into when wiring
batch-prefetch to specific UI flows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cache): use unique query in sameQueryHitsCacheOnSecondCall to avoid state pollution

Sequential test run on postgres-os-redis caught a flake: the test issued
3 identical "q=*" calls expecting at least 1 cold-write. By the time it
ran, prior tests in the same JVM session had already cached
(q=*, index=table_search_index, size=10), so call 1 was a hit, call 2
hit, call 3 hit — total writes=0, asserts failed.

Switching to a per-invocation nonce ensures we always start cold,
matching the pattern the other 3 tests in this class already use.

Confirmed via subsequent parallel-pass run on the same suite where the
test passed (different test ordering, fresh cache for that key).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cache: drop search cache TTL from 30s to 2s for create-then-search freshness

Integration tests on the postgres-os-redis profile caught a real correctness
regression: tests that create an entity and Awaitility-poll for it to appear
in search timed out at 30s because our 30s search TTL pinned the
pre-create empty result for the entire test window. Same issue surfaces
in production: a user creates a domain / table / dashboard and immediately
searches for it would see "no results" for up to 30s.

2s caps the staleness while still catching the dominant UI access pattern:
multiple components in the same render frame fire identical search queries.
Those happen within milliseconds, well inside any reasonable TTL.

The longer-term fix is search-cache invalidation on entity writes (a
generation counter per entity-type, search keys include the generation,
writes bump the generation). That's design-doc-tracked in
.context/cache-improvements-design.md but deferred — the 2s TTL is good
enough for now, and the more complete invalidation strategy can be a
follow-up PR with its own dedicated tests.

Failing tests under 30s TTL that this fixes:
  - DomainAssetsColumnExclusionIT (domain create-then-search)
  - LineageImpactAnalysisIT (owner removal reflected in search)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: cache-tests profile runs full IT suite + new postgres+es+redis CI workflow

The cache-tests Maven profile previously ran only the four cache/* IT
classes — too narrow to catch cache-correctness regressions in the rest
of the codebase. Expanded it to mirror the mysql-elasticsearch profile
shape: sequential + parallel failsafe executions, full **/*IT.java
inclusion, postgres + elasticsearch + redis backend, with
cacheProvider=redis system property added so every test path exercises
the cache layer.

Locally, the focused-cache-only run is preserved via
  mvn verify -P cache-tests -Dit.test='**/cache/*IT'

New CI workflow integration-tests-postgres-elasticsearch-redis.yml
mirrors the structure of integration-tests-postgres-opensearch.yml:

  - Same triggers (push to main, PR target, merge_group, workflow_dispatch)
  - Same path filters (openmetadata-service/**, integration-tests/**, etc.)
  - Same Maven cache + JDK 21 setup
  - Runs `mvn verify -pl :openmetadata-integration-tests -Pcache-tests`
  - Surefire-report publication with fail_on_test_failures

Result: PRs touching cache code (or any read path) get automatic CI
coverage with redis enabled. Cache-invalidation and stale-data bugs
that previously only surfaced in production now have a CI gate before
merge — same protection that mysql-elasticsearch and postgres-opensearch
provide for the no-cache code paths.

Smoke verified locally: `mvn verify -P cache-tests -Dit.test='**/cache/*IT'`
runs both sequential and parallel passes (6 tests each), all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): address PR review feedback for cache improvements

Nine review-driven fixes spanning the cache PR (#28012):

RedisCacheProvider.mget (bug):
  - Restructured the auto-flush window so `setAutoFlushCommands(true)` is
    in the OUTER `finally` of the entire op. The previous structure had
    the restoration in an inner finally that only fired around the
    awaitAll call; an exception in the queueing loop or flushCommands()
    would leave the SHARED connection in auto-flush=false mode, making
    every subsequent op from any caller silently buffer indefinitely.

SearchResource (bug):
  - Removed the double-call on the non-cacheable response path. The
    supplier now captures the upstream Response object so the outer code
    can return it directly when the body isn't cacheable (non-200 or
    non-String entity) — previously the caller re-invoked
    searchRepository.search() on every error/non-200, doubling backend
    load for failing queries.

EntityRepository negative cache (edge case):
  - Hoisted the NotFoundCache fast-path OUTSIDE the `!fromCache` guard in
    both `find(UUID,...)` and `findByName(...)`. Default callers go in
    via `find(id, include)` which delegates with fromCache=true; the
    previous gate made the fast-path unreachable for the most common
    caller. Also added negative-cache population from the cached path's
    ExecutionException so repeated requests for a non-existent id do
    short-circuit after the first miss.

SystemResource cache endpoints (security + style):
  - `/cache/keys` and `/cache/invalidate` now validate the glob pattern
    via `validateCachePattern` — rejects pure wildcards or patterns with
    fewer than 6 literal characters before the first wildcard. Stops a
    careless or malicious admin from issuing `*` or `om:*` that would
    block the Redis cluster on a large keyspace. ReDoS-safe: linear
    char scan, no regex backtracking.
  - `/cache/invalidate/entity` now also calls
    `EntityRepository.invalidateCacheForEntity(...)` to evict the Guava
    L1 caches (`CACHE_WITH_ID`, `CACHE_WITH_NAME`) and propagate via the
    existing pub-sub channel — the previous code only invalidated the
    `INVALIDATABLES` registry layers, leaving stale L1 entries.
  - Replaced fully-qualified class names (`org.openmetadata.service.
    cache.CacheMetrics`, `jakarta.ws.rs.QueryParam`, `java.util.UUID`)
    with proper imports per the project style guide.

CachedLineage (edge case):
  - Single-flight stripe lock now keys on the FULL cache key
    `(rootId, upstreamDepth, downstreamDepth, includeDeleted)` instead
    of `rootId` alone. Concurrent requests for different depths or
    include-deleted flags on the same root no longer block each other.

CachedSearchLayer (doc):
  - Javadoc now correctly says default TTL is 2s (was incorrectly 30s)
    and explains why — see commit 41489056ff which dropped it from 30s
    after IT regressions where users couldn't see their own writes for
    half a minute.

CI workflow (bugs + security mitigation note):
  - Removed `if: steps.cache-output.outputs.exit-code == 0` from the
    `Set up JDK 21` and `Install Ubuntu dependencies` steps.
    `actions/cache@v4` exposes `cache-hit`, never `exit-code`; the
    expression always evaluated to false and those steps NEVER ran.
    Maven was using whatever JDK shipped with the runner.
  - Added explicit security note in the workflow header AND on the
    `Checkout` step documenting why `pull_request_target` is intentional
    and what the `safe to test` label gate accomplishes — CodeQL flags
    the pattern, the label gate is the accepted mitigation that mirrors
    every other integration-tests-*.yml workflow in this repo.

Verified:
  - mvn compile -pl openmetadata-service → BUILD SUCCESS
  - mvn test -pl openmetadata-service -Dtest=OpenMetadataAssetServletTest
    → 9/9 pass
  - mvn spotless:apply ran clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): only negative-cache on real EntityNotFoundException

The previous code caught every ExecutionException / UncheckedExecutionException
from the Guava cache loader and (a) populated NotFoundCache for 30s, (b)
rethrew as EntityNotFoundException. That conflated three very different
failure modes:

  1. Entity truly doesn't exist     → loader throws EntityNotFoundException
  2. Entity exists but is invalid   → loader throws IllegalStateException
  3. Transient DB / deser failure   → loader throws JdbiException, IOException

Cases 2 and 3 would poison the negative cache, turning a momentary DB
hiccup or a single bad row into a sustained 30s 404 for every caller that
asks for that id/fqn. Worse, the original cause was masked behind a
synthetic EntityNotFoundException, so logs and clients never saw the real
failure.

This change inspects e.getCause() and:
  - On EntityNotFoundException: populate NotFoundCache, rethrow the
    original (not a synthetic) so the caller's `instanceof` checks and
    message text still work.
  - On any other RuntimeException: rethrow unchanged — DB blips return
    5xx as before, validation errors surface, and the next request can
    re-attempt without hitting a poisoned cache.
  - On checked Throwable cause (rare for these loaders): wrap in
    RuntimeException so the contract is preserved.

Applied symmetrically to find(UUID, …) and findByName(String, …).

Addresses gitar-bot review on PR #28012:
https://github.com/open-metadata/OpenMetadata/pull/28012#discussion_r... (negative cache poisoning)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): copilot review — blank param, javadoc, mget hardening

Four review comments from PR #28012 review 4266159401:

SystemResource.invalidateCacheForEntity (line 1069 → blank query params):
  `?type=X&id=&fqn=` slipped past the required-params check because only
  `null` was treated as absent. Normalize blank id/fqn to null up front
  so the missing-both branch fires correctly and the downstream
  CacheBundle / EntityRepository calls receive a clean null instead of
  an empty string.

CacheKeys.search/childrenPage (line 116 → orphaned Javadoc):
  When the search() helper was added between the children-page Javadoc
  and the childrenPage() method, the Javadoc got stranded above the
  wrong method. Move it back so javadoc tooling generates accurate docs.

RedisCacheProvider.mget (line 610 → shared-connection auto-flush race):
  setAutoFlushCommands(false) toggles state on the shared Lettuce
  connection — two concurrent mgets could overlap and one caller's
  commands would buffer until the other restored auto-flush, surfacing
  as latency spikes / hangs on other paths sharing the connection.
  Wrap the pipeline in a new instance-level ReentrantLock so only one
  mget runs the auto-flush dance at a time. try/finally still restores
  auto-flush unconditionally; lock release sits in an outer finally.

RedisCacheProvider.mget (line 621 → unbounded f.get() on timeout):
  Previously LettuceFutures.awaitAll(...) returned a boolean we ignored;
  if it timed out, the subsequent f.get() calls were unbounded and would
  block the request thread until the Lettuce event loop eventually gave
  up. Capture the boolean, cancel non-done futures on timeout (so f.get()
  returns CancellationException instead of blocking), and log a warning
  with the timeout value and key count for operators.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): mget partial timeout must trip the circuit breaker

The previous mget rewrite cancelled in-flight futures on awaitAll timeout
but still called recordSuccess() at the end of the happy-path. That fed
consecutiveSuccesses on every partial timeout, so a Redis instance that
was consistently slow (answering some keys, dropping others) would
*never* trip the breaker — masking real backend degradation as healthy.

Branch on the captured allCompleted boolean:

  - all futures completed → recordSuccess() as before
  - partial timeout → recordFailure(TimeoutException) and bump
    CacheMetrics.recordError() so the breaker's sliding-window failure
    detector picks it up and the metric reflects the degraded state

No other behaviour change — the per-key fallback Optionals still surface
to callers either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): mget shorter critical section + cache/stats + cache/keys doc

Three review comments from PR #28012 second copilot pass:

RedisCacheProvider.mget (RedisCacheProvider.java:624 — shared-connection
hold time): previous code held setAutoFlushCommands(false) for the entire
queue+flush+await window. Other paths (single get/set/hget on the same
Lettuce connection) would buffer until our await finished. Shrink the
critical section to just queue+flush: once flushCommands() returns, the
batch is on the wire and we can restore auto-flush and release the
pipelineLock before awaiting. A slow Redis now blocks only the calling
thread, not every concurrent caller using the shared connection.
Cancel-on-timeout and breaker accounting are unchanged.

SystemResource.getCacheStats (line 962 — noisy WARN when cache disabled):
CacheMetrics.getInstance() logs WARN every call when the metrics singleton
isn't initialized, which happens whenever CACHE_PROVIDER=none. An ops
dashboard polling /system/cache/stats on a cache-off deployment would
spam the log. Gate the metrics call on cacheProvider.available() so the
WARN never fires in that configuration. Stats payload still includes
provider-level fields; just no `metrics` key when cache is off.

SystemResource.scanCacheKeys (line 1006 — OpenAPI lies about count param):
Description claimed "bounded by the count parameter" but no count param
exists; scanCount() walks the full cursor. Rewrote the description to
state the actual safety mechanism: the validateCachePattern enforces a
6-character literal prefix before any wildcard, so '*' and 'om:*' are
rejected at validation. Reflects what the endpoint actually does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): copilot review pass 3 — hot-path L1 check + lineage hash + cleanups

Eight comments from the latest copilot review on PR #28012:

1. SystemResource.getCacheStats: gate metrics on cacheConfig.provider != none
   instead of cacheProvider.available(). When Redis is configured but the
   circuit breaker is tripped, app-level counters are exactly what an
   operator needs to diagnose the outage — suppressing them while the
   provider is "down but configured" hides the diagnostic signal. Also
   downgrade CacheMetrics.getInstance() WARN → DEBUG so a poller loop
   doesn't spam logs in the entirely-normal cache-off state.

2. CachedReadBundle.getBatch contract: the method is documented as
   returning a list 1:1 with entityIds, but bypass returned
   Collections.emptyList() and callers indexing by position would shift
   off the rails. Return a same-size list of nulls under bypass so the
   positional contract holds regardless of cache state.

3+4. CacheBundle.invalidateEntity / Invalidatable.invalidate javadocs
   claimed they were called from EntityRepository.postUpdate / postDelete
   / restoreEntity. They are NOT (only postCreate, the pub-sub handler,
   and the admin endpoint reach this path). Updated both javadocs to
   reflect actual call sites so future Invalidatables aren't built on a
   wrong invalidation contract.

5+6. EntityRepository.find / findByName: check Guava L1 (getIfPresent)
   FIRST, NotFoundCache only on L1 miss. The previous shape consulted
   NotFoundCache before L1, adding one Redis GET per cached read — a
   regression on the hottest read path. L1 hit now serves with zero
   Redis traffic; the negative cache short-circuits only when the loader
   would otherwise pay for a DB / Redis-L2 round trip.

7. CachedLineage redesign: variants for one root now live as fields of a
   single Redis hash (HSET / HGET) instead of separate keys. Invalidate
   is one DEL — O(1) — instead of SCAN-and-iterate (O(N) over keyspace).
   This matters because invalidate fires on the hot write path (entity
   updates and lineage-edge mutations) and the SCAN cost grew linearly
   with cache size. CacheKeys.lineageGraphPattern is gone; new helpers
   are lineageGraphHash(rootId) and lineageGraphField(up, down, incDel).

8. SystemResource.invalidateCacheForEntity: when only fqn is supplied,
   resolve to id server-side via Entity.getEntityRepository(type).
   findByName(...) before fanning out. Id-keyed cache layers (lineage,
   CACHE_WITH_ID, NotFoundCache id-side) need the UUID; the previous
   shape silently skipped them. Lookup failures are logged at DEBUG and
   the request still proceeds with fqn-only invalidation — admin
   force-invalidate is best-effort by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): lineage hash TTL claimed only by first writer (EXPIRE NX)

Previous shape called `hset(hashKey, fields, ttl)` which translated to
HSET + EXPIRE under the hood. Every variant write therefore reset the
hash's expiry — variant A cached at T=0 with TTL=60, variant B cached at
T=55, and A's effective lifetime jumped to 115s instead of the intended
60s. Under a constant trickle of variant writes on a hot root, the
"stale" variant could effectively live forever.

Split the operation:

  - CacheProvider.hset(key, fields) — new overload, no TTL touch.
    Defaults to a 365-day TTL so providers that don't override get
    a long-lived key rather than an immortal one.
  - CacheProvider.expireIfAbsent(key, ttl) — EXPIRE … NX semantics:
    set the TTL only when the key has no prior expiry. Default
    returns false (providers that can't express NX get no extension
    benefit, but no regression).
  - RedisCacheProvider implements both: HSET without expire, then
    EXPIRE with ExpireArgs.Builder.nx(). Falls back gracefully on
    Redis < 7.0 (logs at DEBUG, returns false).

CachedLineage.safeHset now uses the split shape — the first writer
to seed a hash establishes the 60s window; subsequent variant writes
leave the expiry alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): mget unavailable-path alignment + lineage deser fallback

Two copilot review comments on PR #28012:

RedisCacheProvider.mget (line 646): when `available == false` we returned
`Collections.emptyList()`, violating the 1:1 positional contract that
callers (CachedReadBundle.getBatch and friends) rely on. Match the
error-fallback branch: return one Optional.empty() per requested key so
caller-side indexing stays aligned regardless of provider health.
Truly-empty input keeps returning empty list (no positions to align).

LineageRepository.getLineage (line 1345): unconditional readValue on the
cached JSON would throw and fail the request if Redis held a
partial/corrupted/old-schema value — turning cache corruption into a
persistent 500 until TTL expiry. Wrap the deserialize in try/catch; on
failure log WARN with the root id and depth, invalidate the affected
root's lineage hash, and fall through to a fresh computeLineage(). User
sees the same answer as cache-off; subsequent requests repopulate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): expireIfAbsent falls back to plain EXPIRE on NX failure

The previous shape returned false silently when EXPIRE … NX wasn't
supported (Redis < 7.0 syntax error, transient failure). That meant the
preceding HSET-without-ttl call could leave the lineage hash key with no
expiry at all, accumulating in Redis memory until the next manual
invalidation.

Catch the NX failure, log at DEBUG, and issue a plain EXPIRE so the key
still gets a bounded lifetime. The trade-off: on older Redis, every
variant write extends the expiry — strictly worse than the NX semantics
on a 7.0+ deployment, but vastly better than the alternative of
permanent unbounded keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): copilot review pass 5 — dedicated mget conn + breaker + IT isolation + key collision

Five comments from the latest copilot review on PR #28012:

RedisCacheProvider.expireIfAbsent breaker bookkeeping (line 432, gitar-bot):
the NX-fallback path issued a plain EXPIRE without recordSuccess() /
recordFailure(), so a real network blip there was invisible to the
sliding-window failure detector. Both success and failure now feed the
breaker, consistent with every other Redis-calling method in the class.

RedisCacheProvider.mget shared-connection hazard (line 692): even with
pipelineLock, single-key callers using syncCommands/asyncCommands on the
*same* connection had their commands buffered for the duration of the
auto-flush-off window. Switched to a dedicated `pipelineConnection` /
`pipelineAsyncCommands` created at init time and closed on shutdown. The
shared connection's auto-flush is never toggled now, so unrelated request
paths can't be starved by mget. pipelineLock still serializes mget vs
mget on the dedicated connection.

SystemResource.invalidateCacheForEntity fqn→id resolution (line 1113):
the resolution call used `findByName(fqn, ALL, fromCache=true)`. That
path consults NotFoundCache and the L1/L2 caches, which an admin force-
invalidate is explicitly trying to recover from — a poisoned negative
entry would short-circuit the resolution and silently skip every id-keyed
cache layer. Switched to fromCache=false so the resolution always goes
to the DB; only then can we trust the id we hand to CacheBundle /
EntityRepository invalidation.

CachedSearchLayerIT.java parallel-execution flakiness (line 50): the
test assertions depend on deltas in the *global* /system/cache/stats
counters. Under @Execution(CONCURRENT) other ITs issuing searches in
parallel inflate the counters and the deltas either don't show up (false
negative) or come from someone else's hits (false positive that masks
broken cache keying). Marked @Isolated + ExecutionMode.SAME_THREAD so
the class runs alone within its window.

CachedSearchLayer.buildKey ambiguous encoding (line 220): fields were
joined with a raw `|` delimiter, no escaping. A query string containing
`|idx=foo` would produce the same preimage as a different (principal,
index, query) tuple — cache-key collision → wrong cached response served
to the wrong user. Added length-prefixed field encoding
(`name=<utf8-bytes>:value|`); two distinct logical tuples can no longer
serialize to the same hash input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
2026-05-13 06:41:09 -07:00
Sriharsha Chintalapani
ed5e68c2a5
Make SearchIndexing distributed-only (#27971)
* Make search indexing distributed-only

* Update generated TypeScript types

* Address search index review comments

* Normalize search index entity aliases

* Return defensive search index config copies

* Update search indexing application docs

* Share staged reindex context mapping

* Speed up distributed job polling discovery

* Use database polling for distributed job discovery

* Address distributed search indexing review comments

* Address distributed indexing polling review

* Add SearchIndex promotion test coverage

* Fix distributed reindex finalization review comments

* Fix SearchIndex app Playwright run history parsing

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-12 08:22:59 -07:00
Sriharsha Chintalapani
2859cbeeb5
Add db-tune ops subcommand + production RDS runbook (#27890)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Add db-tune ops command + production RDS runbook

Generate a per-table autovacuum (Postgres) / InnoDB stats (MySQL)
tuning report from the live database; --apply executes the ALTER
statements after operator confirmation; --analyze refreshes planner
stats on the changed tables. Heuristic skips small dev tables
(row-count gated) and recognizes already-tightened settings as OK.

Catalog values come from the 600k-container production audit. The
companion runbook in docs/rds-postgres-production-runbook.md captures
the parameter-group settings, extensions, and the indexes that must be
applied by hand because the migration runner can't dedup cross-version.

Tests: 38 unit tests on the heuristic + report formatter; DbTuneIT
covers analyze / apply / analyzeOne / dry-run against the IT
bootstrap's live Postgres or MySQL Testcontainer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove file

* db-tune: address review 4230257768 — direction-agnostic status, empty/match disambiguation, script path

- ServerParamCheck.STATUS_UNDERSIZED -> STATUS_MISMATCH. Some recommended
  values are deliberately lower than engine defaults (random_page_cost
  1.1 vs 4.0, autovacuum scale factors). Labelling those as "UNDERSIZED"
  was wrong. Added a regression test that pins random_page_cost as
  MISMATCH when the current value is higher than recommended.

- dbTune --apply: distinguish "no tracked tables on this DB" from "all
  tables already match" in the Nothing-to-apply log line. Mirrors the
  same disambiguation already in DbTuneReport.appendNextSteps.

- DbTuneReport next-steps now points at ./bootstrap/openmetadata-ops.sh
  (the actual location of the wrapper script) instead of the bare
  ./openmetadata-ops.sh which would be file-not-found from the repo root.

- DbTuneIT analyzeReturnsRecommendationsForKnownTables: assertion
  message now references TEST_TABLE so it stays correct after the
  switch from storage_container_entity to entity_relationship.

Tests: 40 unit tests passing (+1 for the higher-than-recommended MISMATCH case).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* DbTuneIT: contain apply path to a private isolated table

The earlier IT had applyChangesReloptionsAndIsIdempotent + analyzeOne
running ALTER TABLE / ANALYZE TABLE against entity_relationship — a
real catalog table that the rest of the suite writes to. On MySQL this
broke ~17 downstream ITs (GlossaryResourceIT, GlossaryTermRelationsIT,
DomainResourceIT, LineageBrokenReferenceIT, OpenLineageLineageResolutionIT)
with "Cannot create a JSON value from a string with CHARACTER SET 'binary'"
on inserts into entity_relationship.json.

Root cause: ALTER TABLE bumps MySQL's per-table metadata version, which
invalidates JDBC prepared-statement caches across the whole shared
Testcontainer. After re-prepare, the JSON column metadata sometimes
comes back as VARBINARY in Connector/J's cache, breaking subsequent
JSON inserts. Confirmed via CI log timing: DbTuneIT finished at
17:58:12, GlossaryTermRelationsIT failed starting 17:58:53.

The recommendations themselves are correct — STATS_PERSISTENT=1,
STATS_AUTO_RECALC=1, STATS_SAMPLE_PAGES=100 are safe and modern InnoDB
defaults except for SAMPLE_PAGES. The test just cannot afford the side
effect on a shared DB.

Fix: introduce dbtune_it_isolated_table created/dropped per test in
@BeforeEach/@AfterEach, build a TableRecommendation for it directly
(bypassing the static catalog), and run apply + analyzeOne against it.
The read-only tests (analyze, dry-run) continue to use the real catalog
since they don't mutate anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* db-tune: add --diagnose for read-only DBA findings

The tuning report we already had is a static playbook (table → recipe + row-count gate). Operators rightly asked whether db-tune actually inspects the live database for *its own* signals — unused indexes, bloat, slow queries, cache hit ratios — and surfaces recommendations from those measurements. Until now: no.

This adds a parallel read-only path:

  ./bootstrap/openmetadata-ops.sh db-tune --diagnose

Postgres categories (each isolated in its own try block; missing extension or permissions surfaces in DbTuneDiagnosis.notes rather than failing the run):

- UNUSED_INDEX — pg_stat_user_indexes idx_scan=0, size > 10 MB, non-unique non-pkey
- HIGH_DEAD_TUPLES — n_dead_tup/n_live_tup > 0.2 with n_live_tup > 10k (autovacuum falling behind)
- LOW_CACHE_HIT — heap_blks_read > 1000 AND hit ratio < 90%
- STALE_STATS — last_autoanalyze NULL or > 14 days, n_live_tup > 1000
- SEQ_SCAN_HEAVY — seq_scan/idx_scan > 10 with > 1000 seq scans (suggests missing index)
- SLOW_QUERY — pg_stat_statements top 10 by mean_exec_time, calls > 100 (gracefully skipped if extension absent)

MySQL categories:

- UNUSED_INDEX — sys.schema_unused_indexes filtered to current schema
- LOW_BUFFER_POOL_HIT — Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests < 99%
- SLOW_QUERY — performance_schema.events_statements_summary_by_digest top 10 by avg_timer_wait
- FULL_TABLE_SCAN — sys.statements_with_full_table_scans

Concept stays separate from AutoTuner: Diagnostic is read-only and never participates in --apply. Categories with zero findings are suppressed in the report; notes capture what couldn't be checked. Each finding is structured as (category, severity, attributes) with the attribute keys driven by DiagnosticCategory.columns() so the renderer dispatches a category-specific layout.

Tests: 50 unit tests passing (40 → 50, +10 in DiagnosticReportTest covering EnumMap grouping order, empty/non-empty rendering, suppression, notes appending, query truncation, and column-list immutability). DbTuneIT gains diagnoseCompletesWithoutErrorAndReturnsStructuredResult that exercises the full end-to-end against the live Testcontainer (read-only, safe).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* db-tune: restore apply DB verification + include idx_scan=0 in seqScanHeavy

Two findings from the review round on 2d9f047b71:

1. DbTuneIT.applyExecutesAndIsIdempotent stopped verifying that apply()
   actually persisted the settings to the DB after the rewrite to an
   isolated test table — it only asserted the ALTER statement
   contained the table name and that apply() didn't throw. The earlier
   shape (assertSettingsMatch) was load-bearing for catching regressions
   where apply() silently no-ops. Restored end-to-end verification.

   Added AutoTuner.currentSettingsForTable(Handle, String) so the test
   can read back settings on a table that is NOT in the static catalog
   (the isolated dbtune_it_isolated_table). Implemented for both
   engines by reusing the existing parseReloptions / parseCreateOptions
   logic. Falls out cleanly from a single pg_class / INFORMATION_SCHEMA
   query; no duplication.

   The IT now: builds the recommendation, asserts the generated SQL is
   well-formed, applies it, reads back via currentSettingsForTable,
   asserts each recommended key/value persisted numerically (handles
   PG lowercase / MySQL uppercase key conventions via case-insensitive
   lookup), then re-applies and asserts the snapshot is byte-equal.

2. PostgresDiagnostic.seqScanHeavy WHERE clause used
   seq_scan::numeric / NULLIF(idx_scan, 0) > :ratio, which silently
   excludes tables with idx_scan=0 — exactly the worst-case candidates
   for a missing index. The Java mapper even has a dead-code branch
   handling idx_scan=0 ("∞" ratio display) that could never trigger.

   Changed predicate to (idx_scan = 0 OR seq_scan::numeric/idx_scan > :ratio).
   The mapper's ∞ branch now actually executes. Inline comment added
   explaining the choice so a future audit doesn't undo it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* db-tune: address review 4264516128 — Enter dismisses prompt, NULL renders blank, ratio uses double

Four findings from the round on 4951a04776:

1. confirmApply prompt blocked on bare Enter
   scanner.hasNext()/next() are token-based, so a user pressing Enter to
   accept the [y/N] default would hang waiting for a non-whitespace
   token. Switched to scanner.nextLine() and treat empty input (and EOF)
   as "no" — matches the [y/N] convention.

2. & 3. NULL timestamps rendered as literal "null"
   highDeadTuples.last_vacuum and staleStats.last_analyzed are nullable
   columns; String.valueOf(rs.getString(...)) maps SQL NULL to the four-
   character string "null", which the renderer then prints verbatim in
   the diagnostic table. Added nullSafe() helper that returns "" for
   null and replaced both call sites.

4. seqScanHeavy ratio used integer division
   seq_scan / idx_scan as longs reports 15/2 → 7, masking how
   seq-scan-heavy a table actually is. Extracted formatSeqIdxRatio()
   that does (double) seqScan / idxScan formatted to one decimal, and
   still returns "∞" for the idx_scan=0 worst case. The previous fix to
   include idx_scan=0 in the SQL (4951a04776) means the "∞" branch now
   actually fires.

Tests: 53 passing (+3 for the new helpers — nullSafe + formatSeqIdxRatio
across normal, zero, and one-decimal cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 14:05:22 -07:00
Sriharsha Chintalapani
d3bbbefe37
fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999)
* fix(rdf): dedupe lineage edges and broaden PROV-O coverage

The RDF Knowledge Graph endpoint was emitting two edges per lineage
relationship — once as `om:UPSTREAM` (forward) and once as
`prov:wasDerivedFrom` (reverse) — because the parser preserved each
predicate's native subject/object orientation instead of canonicalizing
both into a single `(upstream, downstream)` edge.

Also extend PROV-O coverage so external SPARQL clients can use the W3C
Provenance vocabulary directly:
- `prov:Entity` / `prov:Activity` / `prov:Agent` class typing on
  datasets / pipelines / users
- `prov:wasAttributedTo` mirror of `om:owners`
- `prov:generated` (inverse of existing `wasGeneratedBy`) and `prov:used`
  on lineageDetails so the Entity → Activity → Entity chain is complete
- `prov:hadPlan` + `prov:Plan` for SQL transformation recipes
- `prov:startedAtTime` / `prov:endedAtTime` on Activity instances
- `prov:wasAssociatedWith` Activity → Agent linking
- `prov:invalidatedAtTime` on soft-deleted entities

Other RDF cleanups in the same area:
- LineageDetails URIs are now deterministic (driven by from/to ids
  instead of a timestamp), so re-indexing collapses duplicate Activity
  resources via the existing DELETE+INSERT idempotency
- Skip emitting the redundant `om:owners` JSON-string literal — the
  mapped path already produces clean `om:hasOwner <agent>` triples
- Skip empty `[]` array literals in the unmapped path
- Propagate failures from `RdfRepository.{addRelationship,
  addLineageWithDetails, bulkAddRelationships,
  bulkAddGlossaryTermRelations}` instead of silently swallowing them,
  so downstream callers can surface the failure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): surface Fuseki failures in app run record

Per-entity and per-batch failures from the RDF index app used to be
logged via SLF4J only — they never made it into the AppRunRecord, so
the UI/run history showed "completed" even when every entity had
silently failed to write to Fuseki.

- `RdfBatchProcessor.processEntities` now captures the last error per
  entity, returns it in `BatchProcessingResult.lastError`, and
  accumulates relationship-processing failures into the same result.
- Relationship and lineage processing methods (`processBatchRelationships`,
  `processLineageRelationship`, `processGlossaryTermRelations`) return
  structured results with failure counts and last-error messages instead
  of `void`, so failures are visible to the partition worker.
- `RdfIndexApp` records the failure on `jobData` for both the
  distributed and non-distributed code paths, so users see a real
  error message in the run history (e.g.
  "Failed to write entity X to Fuseki: ConnectException").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* perf(rdf-index-app): port distributed-mode improvements from SearchIndex

The RDF distributed-indexing fork was lagging behind several SearchIndex
improvements that addressed concrete reliability and throughput issues.
Port them across:

Core perf / reliability
- Precomputed partition start cursors: coordinator walks each entity
  once via keyset pagination at job init and caches the boundary cursor
  per (jobId, entityType, rangeStart). Workers consult the cache before
  falling back to the OFFSET-based path. Eliminates the previous O(N²)
  per-partition cursor lookup.
- `cancelInFlightPartitions` + `requestStop` + `checkAndUpdateJobCompletion`
  on the coordinator. Stop now cancels both PENDING and PROCESSING
  partitions in a single SQL update and immediately drives the job
  status from STOPPING → STOPPED, so the UI status no longer hangs
  while workers drain.
- Selective field hydration: `RdfPartitionWorker.readEntitiesKeyset`
  uses `ReindexingUtil.getSearchIndexFields(entityType)` instead of
  `List.of("*")`, avoiding expensive fetchers (e.g. fetchAndSetOwns)
  per batch.
- Partition heartbeat thread: virtual thread refreshes
  `lastUpdateAt` every 30s for partitions actively being processed by
  this server, so the stale reclaimer no longer interrupts active work.
- `MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5` backpressure: claim path
  rejects when the server already holds 5 PROCESSING partitions, giving
  fair distribution across pods. Verified the existing claim DAO uses
  `FOR UPDATE SKIP LOCKED` for both MySQL and Postgres.
- Gate WebSocket stat broadcasts during the STOPPING phase so the
  Quartz-scheduler-driven STOPPED status push isn't overwritten.

Multi-server scaffolding (single-pod is unaffected)
- `RdfPollingJobNotifier`: DB-polling discovery for other server pods
  to find an in-flight RDF reindex they can join.
- `RdfEntityCompletionTracker`: per-entity-type partition tracking with
  callback firing once all partitions for an entity complete, foundation
  for early per-entity index promotion.

Tests: precomputed-cursor cache lookup, in-flight backpressure,
cancelInFlight delegation, completion tracker callback semantics,
notifier start/stop.

DAO additions on `rdf_index_partition`:
- `cancelInFlightPartitions(jobId, now)` — covers both PENDING and
  PROCESSING in one statement
- `countInFlightPartitionsForServer(jobId, serverId)` — backpressure
- `countPartitionsByStatus(jobId, status)` — used by completion check

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ui-apps): hide misleading data on synthetic 'CurrentConfig' row

When an app has no run history, AppRunsHistory fabricated a synthetic
placeholder row that looked like a real run — `runType: "CurrentConfig"`,
a fake `Run At` timestamp pulled from `appData.updatedAt`, an
ever-growing `Duration` (`now − updatedAt`), and an active `Stop` button
that targeted nothing.

Render `--` for `Run At`, `Run Type`, and `Duration` on synthetic rows,
and hide the `Stop` button so users no longer see "Run now → 19-minute
Running with Stop button" when the actual job never registered. Real
app runs are unaffected — they still display `runType` from the
backend (OnDemandJob, Hourly, Daily, Custom, etc.).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address PR review findings

Four issues raised in PR #27999 review:

- **Cursor format consistency in walkAndRecord** (bug):
  The defensive branch produced cursors via a custom `{name, id}` map
  while the regular path used `repo.getCursorValue()`. For entities
  with quoted names these encodings diverge — a quoted-name entity
  could land in the cache with a cursor incompatible with what the
  worker fetches via keyset pagination. Track the last seen entity
  reference and run it through `repo.getCursorValue()` in both paths.
  `encodeBoundaryCursor` is removed.

- **Adaptive scheduling in RdfPollingJobNotifier** (perf):
  The previous implementation woke the scheduler thread every 1s and
  short-circuited inside the poll method when idle. Reschedule the
  task at the appropriate interval (1s active / 30s idle) when
  `setParticipating` flips, so the thread genuinely sleeps when idle.

- **Cursor cache cleanup on startup recovery** (edge case):
  `partitionStartCursors` was only evicted by `refreshAggregatedJob`
  / `checkAndUpdateJobCompletion`. If a coordinator crashed mid-job
  and never reached either, the cache entry leaked until process
  restart. Add `evictStaleCursorCacheEntries()` invoked by
  `performStartupRecovery` that drops entries for jobs that no longer
  exist in the DB or are already terminal.

- **Consolidate describeError helpers** (quality):
  `describeError`, `describeBulkError`, and `describeLineageError` in
  `RdfBatchProcessor` all walked the cause chain and formatted a
  prefixed message with the same logic. Reduced to a single
  `describeError(prefix, error)` plus a thin `describeEntityError`
  adapter for the per-entity call site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): avoid double workerExecutor.shutdownNow() in stop()

stop() called workerExecutor.shutdownNow() inline AND through
cleanupLocalExecution -> shutdownWorkerExecutor, which broke the
DistributedRdfIndexExecutorTest.stopAndCoordinatorCleanupOnlyTearDownLocalExecutionOnce
verify(workerExecutor, times(1)).shutdownNow() expectation. Drop the
inline call — cleanupLocalExecution is the single owner of the
shutdown path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: drop redundant DB matrix from openmetadata-service unit tests

The {mysql, postgresql} strategy matrix on openmetadata-service unit
tests doubled CI cost without adding signal: both jobs ran the same
surefire suite. The `-Pmysql` / `-Ppostgresql` profiles are defined
only in `openmetadata-sdk/pom.xml` (lines 190-206), set a single
`test.database` property, and that property is consumed exclusively by
the failsafe plugin (integration tests `*IT.java` / `*IntegrationTest.java`),
which only runs under `-Pintegration-tests` — not enabled here.

`openmetadata-service` itself has zero tests that read `test.database`
or use `MySQLContainer`/`PostgreSQLContainer` (verified by grep). The
only testcontainer-based DB code in the repo lives in
`openmetadata-integration-tests`, a different module that this workflow
doesn't build.

Run the unit suite once. The `openmetadata-service-unit-tests-status`
required-check aggregator is unaffected (it depends on the renamed job
which still has the same name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address Copilot PR review findings

Six correctness issues raised on PR #27999:

- **Lineage-details DELETE was too broad** (RdfRepository): the cleanup
  step deleted *all* `<fromUri> om:hasLineageDetails ?d` triples,
  so reindexing one (fromId, toId) edge wiped lineage-details links
  for every other downstream of the same source entity. Pin the
  delete to the specific `<fromUri> om:hasLineageDetails <detailsUri>`
  triple. Same with prov:generated cleanup — anchor it to the
  specific detailsUri instead of any details resource.

- **Predicate not flipped during canonicalization** (RdfRepository):
  `parseEntityGraphEdgesFromResults` swapped subject/object for
  reverse-direction predicates (`prov:wasDerivedFrom`,
  `prov:wasInfluencedBy`) but kept the original predicate URI on the
  resulting EdgeInfo. Exported graphs could carry semantically
  invalid triples like `<upstream> prov:wasDerivedFrom <downstream>`.
  Add `forwardEquivalentPredicate` to substitute the OM-native
  forward predicate when the direction flips.

- **`dct:modified` was an invalid xsd:dateTime** (RdfPropertyMapper):
  `entity.getUpdatedAt().toString()` returns the epoch-millis Long as
  a string, but the literal was tagged `xsd:dateTime`. Convert via
  `Instant.ofEpochMilli(...).toString()` so the lexical form matches
  the type — same fix already in place for prov:invalidatedAtTime.

- **Unmapped EntityReference arrays were dropped entirely**
  (RdfPropertyMapper): the previous fix to skip noisy JSON-string
  literals also dropped fields like `domains`, `reviewers`, `voters`
  for entity contexts that don't have a JSON-LD mapping for them —
  the unmapped path was the only path emitting them, so nothing
  landed in RDF. Expand each array element through
  `addEntityReference` so the data still produces proper
  `om:<fieldName> <ref>` triples; mapped-path duplicates are
  collapsed by Jena's Model dedupe.

- **Partition failure detection missed reader errors**
  (DistributedRdfIndexExecutor): the EntityCompletionTracker was fed
  `result.errorMessage() != null`, but `RdfPartitionWorker` can
  increment `failedCount` from `readerErrors` without ever setting
  `lastError`. Use `result.failedCount() > 0` so partitions whose
  failures came from `ResultList.getErrors()` are also marked as
  failed when promoting an entity.

- **`COMPLETED_WITH_ERRORS` was hidden when failedRecords == 0**
  (RdfIndexApp): the coordinator marks a job COMPLETED_WITH_ERRORS
  whenever any partition is FAILED or CANCELLED, including for
  user-initiated stops where no record-level failures accrued. The
  monitor's `completedWithErrors` gate required `failedRecords > 0`,
  so those terminal states never hit `jobData.setFailure(...)` and
  the run record showed success. Drop the failedRecords precondition
  and tailor the fallback message based on whether there are
  record-level failures or partition-level only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): separate relationship failures + type lineage as prov:Activity

Two more PR review findings on #27999:

- **Relationship failures inflated failedRecords stat**: `processEntities`
  was folding relationship/lineage edge failures into `failedCount`,
  which becomes `failedRecords` in the index stats. Records there mean
  entities, computed from entity counts in `totalRecords`. Counting
  per-edge relationship failures could push `failedRecords` above
  `processedRecords`/`totalRecords` and produce nonsensical
  per-entity stats.

  Track them separately: add `relationshipFailureCount` to
  `BatchProcessingResult` and `PartitionResult`. `failedCount` now stays
  entity-level. The completion tracker is fed the broader
  `result.hasAnyFailure()` so partitions where relationship triples
  failed don't get prematurely promoted as success even though their
  entity writes succeeded.

- **`detailsResource` wasn't typed as prov:Activity**: the resource
  carries Activity-shaped predicates (prov:startedAtTime,
  prov:endedAtTime, prov:used, prov:hadPlan, prov:wasGeneratedBy,
  prov:wasAssociatedWith) but only the OM-specific
  `om:LineageDetails` rdf:type. Add an explicit
  `rdf:type prov:Activity` so PROV-O reasoners and federated SPARQL
  clients recognize it as an Activity without having to learn the
  OM type.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): label lineage edges relative to focal node

The Knowledge Graph view was labeling every edge with relation
type "upstream" as "Upstream" regardless of direction relative to the
focal node. For a focal node F, the raw stored relation `(F, X, upstream)`
means "F is upstream of X" — i.e. X is *downstream* of F. The previous
output labeled both `F → X` and `X → F` edges as "Upstream", which made
bidirectional lineage look like a duplicated relation.

Re-orient the label in `convertEdgesToGraphData` based on whether the
focal is the edge's source or target:
- focal → X → "Downstream"
- X → focal → "Upstream"
- non-focal-touching edges keep the raw relation label.

Reported on a sample-data table with a circular lineage cycle
(`dim_customer ↔ fact_orders`) where both directions showed "Upstream".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close remaining Copilot review gaps

Three findings from PR #27999's third review pass — all about failure
signals being silently dropped between layers:

- **`RdfIndexApp.processTask` ignored relationship failures**: only
  `result.failedCount() > 0` was treated as a failure, so partitions
  whose Fuseki relationship/lineage writes failed (incrementing
  `relationshipFailureCount` but not `failedCount`) never wrote
  `jobData.failure`. Switch to `result.hasAnyFailure()` and report the
  combined count.

- **`checkAndUpdateJobCompletion` ignored partition `lastError`**: a
  partition can finish COMPLETED with `lastError` set when a relationship
  bulk write was caught and recorded but didn't bump `failedRecords` or
  flip the partition to FAILED. The job would then go to COMPLETED even
  though there were real failures. Treat the presence of any
  `rdf_index_partition.lastError` as an error signal — promote to
  COMPLETED_WITH_ERRORS and aggregate sample errors into the job's
  errorMessage if it was blank.

- **`forwardEquivalentPredicate` mapped to a non-existent
  `om:DOWNSTREAM` URI**: OpenMetadata only stores lineage with
  `om:UPSTREAM` (forward) and `prov:wasDerivedFrom` (reverse PROV-O
  pair); there is no `om:DOWNSTREAM` predicate written anywhere — the
  downstream view is derived by reading the same UPSTREAM edge from the
  other side. Map both `prov:wasDerivedFrom` and `prov:wasInfluencedBy`
  to `om:UPSTREAM` (both are reverse-direction causation predicates: in
  `B wasDerivedFrom A` / `B wasInfluencedBy A` the source is A and
  effect is B, so the canonical forward predicate is the same).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix RDF tag mapper

* Fix all the comments

Cherry-picked from #27562 (without bin/ autogenerated noise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Align RdfPropertyMapper tests with refactor and isolate ontology export IT

RdfPropertyMapperTest still referenced the removed addVotes helper and
expected addStructuredProperty to dispatch votes — both gone after votes
was added to IGNORED_PROPERTIES. Update the assertions accordingly.

GlossaryOntologyExportIT timed out on the full suite because it flips a
global RDF singleton in @BeforeAll and each test blocks a server thread on
synchronous Fuseki writes. SAME_THREAD only serialized methods within the
class — concurrent classes still raced for server threads. Adding @Isolated
matches the pattern already used by RdfResourceIT for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): align addCertification typing + relationType after predicate flip

Two findings on PR #27999 from the post-cherry-pick review pass:

- **`addCertification` mis-typed glossary-source certifications and
  skipped skos:Concept**: it always emitted `om:Tag` regardless of
  source, even though `resolveTagResource` returns a glossaryTerm URI
  when the certification points at a glossary term. It also didn't add
  `skos:Concept` (or the `createTypeResource("tag")` `skos:Concept` for
  classification tags), so SPARQL queries filtering certification
  targets by `a skos:Concept` missed them while `addTagLabel`-emitted
  tags were findable. Mirror `addTagLabel`: branch on source
  (`Glossary` vs `Classification`), emit the right primary type plus
  `skos:Concept` (glossary) or `om:Tag` (classification), and include
  `om:tagSource`.

- **`relationType` left stale after predicate flip**: when
  `parseEntityGraphEdgesFromResults` flipped subject/object for a
  reverse-direction predicate and rewrote `canonicalPredicate` to
  `om:UPSTREAM`, it kept the original `relationType` derived from the
  reverse predicate. So `prov:wasInfluencedBy` produced an EdgeInfo
  with `relationType=downstream` + `predicate=om:UPSTREAM` —
  internally inconsistent, and the mismatched `edgeKey` prevented
  dedup against an existing UPSTREAM edge with the same endpoints.
  Re-derive `relationType` from the canonical predicate after the
  flip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings + add parser-helper unit tests

Two outstanding Copilot findings on PR #27999 plus targeted unit
coverage for the helpers that drive lineage canonicalization.

Findings:

- **`colLineageUri` collision risk** (RdfRepository): the deterministic
  key replaced non-alphanumerics in `toColumn` with `_`, so distinct
  column names (e.g. `a-b` vs `a_b`) collapsed onto the same URI, which
  would lose / overwrite column-lineage resources during reindex.
  Append the loop index as a tiebreaker so distinct columns keep
  distinct URIs.

- **`createTypeResource` missing dprod prefix** (RdfPropertyMapper):
  the `getNamespace` switch didn't recognize `dprod`, so
  `RdfUtils.getRdfType("dataProduct")` (returns `dprod:DataProduct`)
  produced an invalid `dprod:DataProduct` URI on the wire. Added the
  `DPROD_NS = https://ekgf.github.io/dprod/` constant and a `dprod`
  case in the switch.

Coverage:

- New `RdfParserHelpersTest` exercises the canonicalization helpers
  via reflection: `isReverseDirectionPredicate` (recognizes
  PROV-O causation predicates, ignores forward predicates),
  `forwardEquivalentPredicate` (both `wasDerivedFrom` and
  `wasInfluencedBy` collapse to `om:UPSTREAM` so dedup works),
  `relativeRelationLabel` (focal-relative Upstream/Downstream
  flipping with all the boundary cases — non-focal edges,
  non-lineage relations, null focal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): merge array contexts before per-field resolution

The third (low-confidence "suppressed") finding on review 4256830399
turned out to be a real duplication: when a field is mapped in one
context map of an array context but absent from another, the previous
processArrayContext ran processContextMappings once per map. The pass
where the field IS mapped emits the proper `om:hasOwner <ref>` triples
(plus `prov:wasAttributedTo`); the pass where the field is absent
falls through to processUnmappedField and emits an additional
`om:owners <ref>` triple. Net: two predicates for the same logical
relationship.

Verified on the live Fuseki: 113 `om:hasOwner` triples vs 112
`om:owners` triples — one set per pass.

Fix: flatten all context maps in the array into a single merged map
once, then iterate entity fields exactly once against that combined
view (later contexts win on key conflicts, matching JSON-LD context
merge semantics). Each field is resolved against the union of
mappings, so the unmapped fallback only fires for fields truly absent
from every context. Net effect: `prov:wasAttributedTo` count is
unchanged, `om:hasOwner` is unchanged, and the redundant `om:owners`
triples disappear.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings on coordinator finalization race

Two findings from PR #27999 review 4259628860:

- **`checkAndUpdateJobCompletion` early-returned before lastError check
  could promote**: `refreshAggregatedJob` already marks the job COMPLETED
  when partitions all finish without `failedRecords`/`failedPartitions`,
  so `checkAndUpdateJobCompletion`'s subsequent `if (job.isTerminal())`
  short-circuit silently dropped the lastError signal. Move the
  partition-lastError check INTO `refreshAggregatedJob` so both code
  paths produce consistent terminal status — a partition that finished
  COMPLETED but carries a non-null lastError now correctly promotes the
  job to COMPLETED_WITH_ERRORS regardless of which finalizer wins the
  race.

- **`completePartition` / `failPartition` overwrote CANCELLED state**:
  the unconditional partition row update lost a concurrent Stop's
  CANCELLED status if a worker finished its batch after the Stop
  request landed but before noticing it. Add a status-guarded
  `updateIfProcessing` DAO method (UPDATE ... WHERE id = :id AND
  status = 'PROCESSING') and have both completion paths use it; if 0
  rows update, log and skip the side effects (no server-stat increment,
  no refreshAggregatedJob call) so the authoritative CANCELLED status
  stays. Mirrors the pattern SearchIndex's coordinator uses for the
  same race.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
2026-05-11 06:14:50 -07:00
Sriharsha Chintalapani
22a6c10072
Context center (#27558)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (mysql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (postgresql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Add Context Center: Migrate Knowledge Center , Images/ PDFs document support

* Add Context Center: Migrate Knowledge Center , Images/ PDFs document support

* Address PR #27558 review comments

- KnowledgePageRepository: null-safe pageType in getHierarchyWithSearch
  and getHierarchyWithSearchForActivePage so the /search/hierarchy
  endpoint no longer NPEs when the pageType query param is omitted. The
  ES/OS client helpers already skip the pageType term when the value is
  null or empty, so this is a pure null-guard.
- ContextFileResource.uploadFile: when a failure happens after the
  ContextFileContent row is created (e.g. inside extractionService.submit),
  the cleanup path now hard-deletes that content row so the DB is not
  left with an orphaned record.
- ContextFileResource: replace the raw Content-Disposition string with a
  buildContentDisposition helper that emits both the legacy quoted
  filename= and the RFC 5987 filename*=UTF-8'' parameter with
  percent-encoded bytes, so international filenames round-trip while
  staying header-injection safe. sanitizeFileName also falls back to
  "download" on null/blank input.
- ContextFileResourceTest: new cases for sanitizeFileName null/blank
  fallbacks and for buildContentDisposition ASCII/unicode/space/injection
  behaviour (18 tests, all passing).

* Address copilot review comments on PR #27558

- AssetRepository.getByFqnPrefix: swap arguments so (assetType, fqnPrefix)
  matches the DAO signature — previous ordering always missed the index.
- FolderResource / ContextFileResource getEntitySpecificOperations: return
  List.of() instead of null so callers iterating the returned list cannot
  NPE.
- SearchUtils.getPageHierarchy: replace UUID.fromString with a parseUuid
  helper that returns null for missing/malformed values and logs a warning
  instead of failing the whole hierarchy response.
- DaoListFilter: qualify the pageType column with the caller-provided
  tableName, rename getArticleCondition to getPageTypeCondition (legacy
  no-arg method kept as @Deprecated wrapper for compatibility).
- Elastic/OpenSearch client processPageHierarchyHits: replace the per-hit
  getChildrenCountForPage search (N+1) with a single pass over the batch
  that derives childrenCount from pages whose parent is in the same
  result set. Also drops the now-unused helper and its throws clause.
- openmetadata-sdk/pom.xml: mark JWT, JAX-RS client, Apache HttpClient,
  jakarta.json, parsson, and JUnit Jupiter as <optional>true</optional>
  so they don't leak into SDK consumers that only use the core client.
- InMemoryAssetService: use the shared AsyncService executor for upload
  /read/delete instead of the JVM common ForkJoinPool.
- sample-pricing.xlsx: replace the plain-text placeholder with a real
  minimal XLSX workbook so detection-based and extraction-based code
  paths see a valid Microsoft Excel 2007+ file.

* Use one filters aggregation for page hierarchy childrenCount

Follow-up to b8458e2868. The previous fix derived childrenCount from
pages whose parent appeared in the same batch — that worked for
listPageHierarchyForActivePage (which fetches all depths) but always
returned 0 on the plain listPageHierarchy path (which only fetches one
depth), so top-level listings lost the count semantically.

Replace with a single `filters` aggregation keyed by page id: each
named bucket matches descendants via a fullyQualifiedName prefix query
against the page's FQN. That gives accurate direct-descendant counts
for every returned page in one aggregation round-trip, still O(1)
additional search requests regardless of batch size.

* Add allowedFields entries for contextFile, folder, page

Fixes SearchSettingsHandlerTest.testEveryAssetTypeHasCorrespondingAllowedFields.

searchSettings.json already had assetTypeConfigurations for contextFile,
folder, and page but no matching allowedFields entries, so the test that
asserts every assetType has a corresponding allowedFields block failed
with 'Asset type contextFile has no corresponding allowedFields entry'.

Adds the three missing blocks with the fields that each index actually
exposes — name / displayName (with .keyword and .ngram variants),
description, fqn, fqnParts, tags/tier/domains/dataProducts, plus
entity-specific fields (fileType/contentType/extractedText for
contextFile, parent.displayName for folder/page, pageType for page).

* Fix ui checkstyle

* Fix Java checkstyle

* Address PR #27558 copilot review round 2

- ES/OS populateChildrenCounts: add fqnDepth == parentDepth + 1 to the
  per-page filter so childrenCount is direct children only, matching the
  field name and the UI's isLeaf check semantics. Previously matched all
  descendants.
- ES/OS buildPageNestedSearchHierarchy: filter out hits with a null id
  before Collectors.toMap, which would otherwise NPE when SearchUtils
  drops a malformed UUID.
- SearchUtils.getPageHierarchy: wrap PageType.fromValue in a parsePageType
  helper that logs and returns null on unknown values, so a single bad
  hit can no longer break the whole hierarchy response.
- TestSuiteBootstrap.setupMinIO: pin minio/minio to
  RELEASE.2024-01-16T16-07-38Z instead of :latest so a newly-published
  image cannot break integration tests without a code change.
- createContextFile.json: rewrite the assetId description to be provider
  agnostic (S3 / Azure Blob / in-memory / no-op) and flag it as the legacy
  path, preferring headContentId / ContextFileContent.

* Update generated TypeScript types

* Address PR #27558 copilot review round 3

- bootstrap/sql/migrations/native/2.0.0/mysql/schemaChanges.sql:
  - asset_entity: add PRIMARY KEY (id); mark all generated columns STORED
    for consistency with the other drive/knowledge tables in the same
    migration; compute deleted as a real boolean via
    IFNULL(JSON_EXTRACT(json, '$.deleted'), FALSE) so the boolean index
    behaves correctly.
  - knowledge_center: mark name, updatedAt, updatedBy, pageType as STORED
    and apply the same deleted expression so the existing indexes on
    name and (fqnHash, deleted) are reliable on fresh installs.
  - drive_folder / context_file / context_file_content: update the
    deleted generated column to use the same boolean-safe expression.
- ElasticSearch/OpenSearch hierarchy search: add an explicit sort on
  fullyQualifiedName ASC with _id ASC as tiebreaker so from/size
  pagination is deterministic and cannot skip/duplicate pages between
  requests.

* Fix UI checkstyle

* Address PR #27558 copilot review round 4

- createPage.json: rewrite the field descriptions for name, displayName,
  owners, reviewers, and entityStatus. They were copy/pasted from other
  schemas ('query', 'tag') and were misleading in generated docs and
  clients.
- NoOpAssetService.generateDownloadUrlWithExpiry: return asset.getUrl()
  instead of a synthetic 'https://cdn.example.com/...' URL. The old
  behaviour let clients attempt downloads that would never resolve when
  object storage was disabled; returning the asset's own (empty) URL
  surfaces the misconfiguration cleanly.
- AzureAssetService: normalize the prefix path the same way S3 does.
  Previously a null/blank prefix produced the literal 'null/' prefix,
  writing blobs under the wrong key. New formatPrefix returns "" for
  null/blank and ensures exactly one trailing '/' for a real prefix.
- AssetRepository.getByFQN: treat null *or* empty list as 'not found',
  matching getByFqnPrefix. Callers previously received an empty list
  silently when the DAO returned [] instead of a 404.

* Update generated TypeScript types

* Fix UI checkstyle

* Address PR #27558 copilot review round 5

- AssetDAO.update / AssetRepository.update: switch the UPDATE target from
  fqnHash to id. Two assets can share the same fullyQualifiedName (e.g.
  successive revisions of the same context file), so the old SQL could
  silently update sibling rows.
- ContextFileExtractionService: run the extraction pipeline on a
  dedicated fixed thread pool instead of AsyncService.getExecutorService.
  process() blocks on assetService.read(...).join(), and S3/Azure reads
  are themselves scheduled on AsyncService — sharing the same bounded
  pool risks starving those reads (and deadlocking) once every thread is
  busy running extractions.
- postgres/schemaChanges.sql: wrap the generated deleted column in
  COALESCE((json ->> 'deleted')::boolean, false) (and the asset_entity
  CAST variant) so an absent 'deleted' key is stored as FALSE, not NULL.
  Otherwise "non-deleted" filters based on the boolean index drop rows
  silently. Matches the MySQL IFNULL(..., FALSE) side of the migration.
- ContextFileUploadSupport.sanitizeEntityName: treat null/blank input as
  'file' instead of NPE-ing on replaceAll. Multipart uploads can arrive
  without filename metadata; the upload should still succeed with a
  stable generated name.

* Remove macOS-only @rollup/rollup-darwin-arm64 dev dep

I pinned this during local troubleshooting to get a Vite dev server
running on macOS (rollup's optional native binary was missing). CI runs
on Linux, where yarn install --frozen-lockfile refuses the package
('The platform \"linux\" is incompatible with this module'), which
broke license-header, lint-src, lint-playwright, i18n-sync, app-docs,
and ui-coverage-tests for PR #27558.

rollup re-resolves its native binary per platform — there's no reason
to pin the darwin one. Remove it from package.json and drop the
matching '@rollup/rollup-darwin-arm64@^4.60.2' block from yarn.lock.

* Re-declare optional SDK test deps on integration-tests classpath

KnowledgeCenterIT failed in CI with
'java.lang.NoClassDefFoundError: org/glassfish/jersey/apache/connector/ApacheConnectorProvider'
after I marked the JAX-RS client stack in openmetadata-sdk as
<optional>true</optional> during review round 2. That change stops the
deps from leaking to every SDK consumer, but integration-tests actually
uses org.openmetadata.sdk.test.util.RestClient, so the optional deps
must be re-declared on its own classpath.

Adds jakarta.ws.rs-api, jersey-client, jersey-apache-connector,
httpclient, jakarta.json-api, and parsson to
openmetadata-integration-tests/pom.xml as <scope>test</scope>.

* Fix IT failures from CI integration-tests-mysql-elasticsearch

1. MySQL deleted column: revert the IFNULL wrapper to plain
   (json -> '$.deleted'). My earlier
   IFNULL(JSON_EXTRACT(json, '$.deleted'), FALSE) hit
   'Incorrect integer value: false for column deleted' on fresh installs
   because MySQL cannot coerce the resulting JSON scalar into TINYINT(1)
   when the column is STORED. The bare '(json -> '$.deleted')' form is
   what other OM tables already use, and MySQL converts JSON true/false
   to 1/0 directly for the BOOLEAN column. STORED + PRIMARY KEY stay
   in place.
2. DriveFileUploadIT: raise the four short atMost(5s) awaits to 20s
   with explicit pollDelay(ZERO) + pollInterval(200ms).
   K8sOMJobOperatorIT sets a global Awaitility pollInterval of 5s at
   class setup; any subsequent test with atMost <= 5s hits
   'Timeout must be greater than the poll delay'. Overriding the
   per-call poll settings insulates these asserts from the global
   leak.

* Document SDK test-utility optional deps

In review round 2 we marked jersey-client, jersey-apache-connector,
jakarta.ws.rs-api, httpclient, jakarta.json-api, parsson, java-jwt, and
junit-jupiter-api as <optional>true</optional> on openmetadata-sdk so
that core SDK consumers don't inherit a heavy JAX-RS + JUnit stack.
openmetadata-integration-tests hit this immediately with
NoClassDefFoundError from RestClient; its own pom now re-declares the
deps.

Add a "Test utilities" section to the SDK README that lists the
optional deps downstream test-utility consumers must re-declare (with
the concrete <scope>test</scope> XML snippet) and explains the error
they'd otherwise see.

* NoOpAssetService: never return null from generateDownloadUrlWithExpiry

In review round 4 I changed this method to return asset.getUrl() when
the asset is non-null. But Asset.url is optional in the schema, so
asset.getUrl() itself can be null — which breaks the implied "never
returns null" contract downstream callers rely on (AttachmentResource
only null-checks defensively).

Normalize null and blank URLs to an empty string so the method's
non-null, non-blank contract holds even when storage is disabled and
the asset was never populated with a URL.

* AssetServiceFactory: swap to NoOp when re-initialized with storage off

init(...) previously only assigned NoOpAssetService when instance was
null. On a re-init with object storage toggled off (config reload, test
teardown, etc.), the previously wired S3/Azure/InMemory provider stayed
live and kept serving real IO against a backend the operator thought
was disabled.

Replace the instance with a fresh NoOp when storage is disabled unless
the instance is already a NoOp (idempotent on repeated disabled
inits).

* Type create-request domains arrays as fullyQualifiedEntityName

The three new KC/Drive create schemas (createFolder, createContextFile,
createPage) had domains as an array of unconstrained strings. The rest
of the OM API models domain references as FQNs, and the shared
basic.json#/definitions/fullyQualifiedEntityName is the convention for
this.

Point all three items refs at fullyQualifiedEntityName so generated
clients see a consistent FQN type and requests get validated for
non-empty length/format rather than any string.

* Update generated TypeScript types

* Address PR #27558 copilot review 4144965142

- ContextFileExtractionService: switch the default thread pool to
  a static final DEFAULT_EXECUTOR, so every production instance of the
  service reuses the same pool instead of leaking a fresh fixed pool
  per construction (tests especially create multiple instances).
  Threads remain daemons, so the pool never blocks JVM shutdown.
- ObjectDeleteQueueService: when queueCapacity is 0, use a
  SynchronousQueue so "reject-if-all-workers-busy, no buffering" holds.
  Previous Math.max(1, queueCapacity) silently allocated a 1-slot
  ArrayBlockingQueue, contradicting the caller's stated capacity and
  potentially buffering one task past the semaphore's accounting.

Not fixing:
- SearchUtils @Slf4j 'LOG' vs 'log'. OM's openmetadata-service/lombok.config
  sets 'lombok.log.fieldName = LOG', so @Slf4j correctly generates
  'LOG' for every class in this module. The reviewer's concern only
  applies to projects without that directive. Verified clean compile.

* Address PR #27558 copilot review 4144917449

- knowledgeCenterTags.json: change mutuallyExclusive from the string
  "false" to the JSON boolean false. The Classification schema declares
  this as `"type": "boolean"`; jackson's lenient string->boolean
  coercion masked it until now, but strict validators would reject and
  the other OM bootstrap tag files that use the correct boolean
  (piiTagsWithRecognizers.json) model what this should look like.

- ContextFileExtractionService.process: guard the updateContent
  updater with the same head-content check already used in
  updateFile. Previously, if headContentId flipped between the
  initial check and the status writes, updateFile would no-op while
  updateContent still marked the now-stale content "Analyzing",
  leaving it stuck once the later early-return fires.

- AzureAssetService.upload: stream the InputStream straight to the
  blob using the known asset.getSize() instead of reading the whole
  payload into a byte[] via IOUtils.toByteArray. Matches the S3
  streaming behaviour and avoids full-file heap pressure / OOM risk
  on larger files. Buffered fallback retained when size is unknown.

- Size fields modeled as integer: flip fileSize / size on
  createContextFile.json, contextFile.json, asset.json,
  createAsset.json, and contextFileContent.json from
  "type": "number" to "type": "integer" with "format": "int64" and
  "minimum": 0. Byte counts are inherently whole numbers; floating
  point loses precision above 2^53 and makes validation murky.
  Update the (double) call sites in ContextFileResource,
  ContextFileUploadSupport, and AttachmentResource to match.

Not fixing:
- ContextEntityPromptService "unused Authorizer import" — false
  positive, the class uses it in the constructor.
- NoOpAssetService.generateDownloadUrlWithExpiry null return — already
  fixed earlier in commit a4a2dcc91d (returns "" when url is
  null/blank).

* AssetService.read: run inline instead of hopping through AsyncService

Every caller of AssetService.read(...) immediately .join()s on the
returned future:

- ContextFileExtractionService.process reads + extracts
- ContextFileResource.downloadFile reads + streams back
- AttachmentResource.serveAsset reads + streams back
- QueuedDeleteAssetService just delegates

None of them exploit the async nature, but the S3/Azure/InMemory
implementations all wrapped the blocking fetch in
AsyncService.executeAsync or CompletableFuture.supplyAsync on a
bounded pool. That created a starvation path when any caller thread
was already running on AsyncService (or could monopolize it under
load) — join() would block the caller while the submitted read
task fought for a free worker.

Switch S3, Azure, and InMemory read() to execute on the caller's
thread and return CompletableFuture.completedFuture(...). Interface
is unchanged so existing .join() callers keep working; the extra
thread hop and the potential for AsyncService starvation are both
gone. Combined with the dedicated context-file-extraction pool, the
extraction pipeline no longer touches AsyncService for any
asset-read step.

* Address PR #27558 copilot review 4151211562

- FolderIndex / ContextFileIndex: stop re-setting entityType, deleted,
  owners, totalVotes inside buildSearchIndexDocInternal. Those common
  fields are populated by populateCommonFields in the SearchIndex
  template method (Phase 1) before Phase 3 calls the entity-specific
  internal builder, so the explicit puts were redundant and silently
  overrode the template output. Aligns with PageIndex convention and
  updates the unit tests to assert the internal builder sets only
  entity-specific fields.

- ContextFileTextExtractor: bound the Tika BodyContentHandler at
  MAX_CANONICAL_TEXT_LENGTH instead of passing -1 (unbounded) so a
  pathological image cannot drive OCR to accumulate arbitrary output
  on the heap.

- ContextFileExtractionService: replace the unbounded
  Executors.newFixedThreadPool backing queue with a ThreadPoolExecutor
  using an ArrayBlockingQueue + AbortPolicy. Without a bounded queue
  the RejectedExecutionException handling in submit(...) was dead
  code; with it, an overloaded server surfaces a "retry later"
  failure status instead of silently accumulating work.

- S3AssetService / AssetService / AssetServiceFactory /
  QueuedDeleteAssetService: make AssetService extend AutoCloseable
  with a default no-op, override close() in S3AssetService to release
  the S3Client and S3Presigner connection pools, and register a
  shutdown hook in AssetServiceFactory that closes the current
  provider on JVM exit (and on re-init when the provider changes).

- bootstrap 2.0.0 MySQL schemaChanges: change the deleted generated
  column from (json -> '$.deleted') to
  (JSON_EXTRACT(json, '$.deleted') IS TRUE) so rows where the JSON
  key is absent resolve to FALSE instead of NULL. Avoids filter misses
  on the composite (fqnHash, deleted) index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix Java checkstyle

* Fix integration test compile + S3 generateDownloadURL

ContextFileIT / DriveFileUploadIT compile failures came from the
fileSize schema switch to integer/int64 — the generated setter/getter
is now Integer. Replace the double literals with ints and the
assertEquals(double, ...) sites with intValue() so the (int, int)
overload resolves unambiguously.

Also override S3AssetService.generateDownloadURL to return a
short-lived presigned URL (mirroring AzureAssetService) instead of
inheriting the default, which would return the raw S3 key from
asset.url. Addresses review 4151282021.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert MySQL deleted column back to bare json -> expression

The JSON_EXTRACT(...) IS TRUE form broke integration tests — GET after
create started returning 404, consistent with MySQL evaluating the
IS TRUE predicate against the JSON scalar in a way that stored 1
instead of 0 for freshly-created rows (deleted=false).

Restoring the bare (json -> '$.deleted') expression used pre-review.
Rows with the key missing will store NULL on the generated column,
which is a theoretical concern the review flagged but does not affect
current code paths (all inserts write json.deleted explicitly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix Transi18next import path in KnowledgeCenter components

Two KnowledgeCenter files imported Transi18next from
'utils/CommonUtils', which is where Collate's UI re-exports it from.
OpenMetadata core exports Transi18next from 'utils/i18next/LocalUtil'
(same path every other core file uses). The Collate-style import
broke the production Vite/Rollup build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Harden ContextFileIT.testFileAppearsInSearch against async indexing

The test used a fixed Thread.sleep(2000) then a single assertEquals
on the status code. That was flaky two ways: ES indexing is async
and the 2s window is not always enough, and on a fresh cluster the
context_file_search_index itself may not exist yet at first query
(yielding 500).

Replace with an await() loop that polls every 200ms for up to 30s
and asserts both status==200 AND that the newly-created file's UUID
appears in the response. Matches the assertSearchContainsFile
helper in DriveFileUploadIT.

Also URL-encode the namespaced query string so the uniqueName
does not break the query parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Make playwright editor shortcuts platform-aware

The SHORTCUTS constant in playwright/constant/KnowledgeCenter.constant.ts
hard-coded "Meta+b" / "Meta+z" / etc. On macOS Meta is Cmd and those
shortcuts trigger bold / undo / copy as expected, but on the Linux CI
runners Meta is the Super (Windows) key — so every ProseMirror
formatting and history test just pressed Super+b, which does nothing,
and the test then fails waiting for the <strong>…</strong> element
(or for the undone text to disappear).

Detect the runner platform and use Meta on macOS, Control everywhere
else — matching the same pattern in src/constants/KnowledgeCenter.constant.ts.

Unblocks the 6 KnowledgeCenterTextEditor failures across Admin / Data
Consumer / Data Steward roles (Text Formatting + Undo/Redo). Slash
commands keep passing because they don't depend on modifier keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Run prettier on DateTimeUtils.ts

CI's lint-src job fails because ESLint+Prettier --fix produces a
non-empty diff against the committed tree. Local prettier pass
trimmed the indentation and added a trailing comma in the imports
block. No behavioral change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix Knowledge Page entity-link + DAO filter regressions from the port

Downloaded the failing playwright traces from the PR's postgres e2e run
and walked each one. Three distinct bugs, all present because the
Collate-side overrides (overrides/EntityUtilClassCollate.ts and the
DaoExtension.KnowledgeExtensionDAO custom SQL) were not carried over
into OpenMetadata core when KnowledgeCenter was merged up.

1) CollectionDAO.KnowledgePageDAO: override listCount / listBefore /
   listAfter (plus helper SQL queries) so that
   `GET /v1/knowledgeCenter?entityId=X&entityType=topic` actually INNER
   JOINs entity_relationship and returns only pages whose
   relatedEntities contains the target entity. Without this the base
   EntityDAO ignored entityId/entityType entirely and returned every
   page, which is why the "Knowledge Articles" widget on a data asset
   page showed the 15 fixture articles instead of the one just attached
   — and why updateDataAsset timed out waiting for the linked article.
   Uses OWNS relation for user/team filters (same semantics Collate
   uses) and HAS for every other entity type.

2) EntityUtilClassBase + EntityUtils.getEntityLinkFromType: add
   EntityType.KNOWLEDGE_PAGE cases that route to getKnowledgePagePath.
   Before this, mention notifications for Knowledge Pages fell through
   to the default `/table/<fqn>` branch (confirmed in the captured
   page-snapshot: the mention link pointed at `/table/Article_eEqrWeeU`),
   which 404'd on the Table API and rendered an error page — so the
   entity-header-display-name textarea never appeared and the User
   Mentions test timed out. Search results on Explore had the same
   problem, rendering every Knowledge Page result card with href="/".

3) EntityUtilClassBase.getEntityByFqn / ENTITY_PATCH_API_MAP /
   getResourceEntityFromEntityType: handle KNOWLEDGE_PAGE end-to-end so
   the detail-page fetch, patches, and policy lookups all route through
   the knowledgeCenter REST API rather than falling back to the generic
   entity utilities (which don't know about the 'page' entity type).

Verified against the real trace artifacts from CI run 24790718035:
- shard 3 Knowledge Center page test — widget shows 10 unrelated
  "Article_*" fixture items instead of the created one → root cause
  is the missing DAO JOIN (#1).
- shard 3 User Mentions test — notification link is /table/, not
  /knowledge-center/ (#2).
- shard 3 Reviewer Workflow — data consumer's knowledge-center goto
  renders "No data available" because getEntityByFqn fell back to a
  table fetch for a page FQN (#3).
- shard 5 ExplorePageRightPanel_KnowledgeCenter (22 failures) —
  search result card links are "/explore/" (empty), same root cause
  as (#2) inside getEntityLinkFromType default branch.

Compiles: mvn -pl openmetadata-service -q -DskipTests compile passes;
tsc --noEmit reports no new errors in the touched files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address remaining PR #27558 review feedback

Seven actionable fixes drawn from the still-open review threads; the
rest of the open threads in copilot's bot reviews are either already
addressed in earlier commits or stale against the current code and
are being resolved on the review UI alongside this commit.

- AssetRepository.getByFQN: the LOG.error message said "asset with id"
  but was printing the FQN. Relabel to "asset with FQN" for accurate
  troubleshooting (thread #42).

- KnowledgePageMapper.createToEntity: stop mutating the inbound
  CreatePage by calling create.withRelatedEntities(...). Build the
  effective list as a local variable and pass it to copy(...). Prevents
  the Organization fallback from leaking into the caller's request
  object, which is surprising when the request is re-used or logged
  (thread #43).

- FolderIndex: default childrenCount to 0 when the entity hasn't yet
  had its children recomputed (e.g. a freshly created folder). Prevents
  the numeric field from being indexed as missing, which broke range
  and sort queries that assume it is always present (thread #46).

- NoOpAssetService and InMemoryAssetService: override
  generateDownloadURL to delegate to generateDownloadUrlWithExpiry,
  matching S3/Azure. Without this, callers using the non-expiry API
  got asset.getUrl() (often empty for these providers), yielding broken
  download links (threads #39, #45).

- ObjectDeleteQueueService: register a JVM shutdown hook in the
  singleton's initializer that calls stop(). The service already
  implements Dropwizard Managed, but nothing currently wires it into
  the application lifecycle, so non-daemon delete-worker threads were
  at risk of keeping the JVM alive after ungraceful termination. The
  hook is a belt-and-suspenders fallback to the Managed path
  (threads #52, #53).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add java-checkstyle skill for Claude + Codex agents

CI keeps surfacing "Java checkstyle failed — please run mvn spotless:apply"
comments on PRs (including this branch). CLAUDE.md and AGENTS.md already
mentioned the command, but a one-line prose note in the middle of each file
wasn't enough to make it a reliable habit.

This commit:

- Adds a dedicated invocable skill at .claude/skills/java-checkstyle/SKILL.md
  (for the Claude Code harness) and a mirror at
  .agents/skills/java-checkstyle/SKILL.md (for Codex-style agents). Both
  describe the same procedure: when / why to run spotless, the `-pl <module>`
  scoping option, the verify-only `spotless:check` form, the expected
  diff shape, and the rule to never hand-edit formatting around a plugin
  error.

- Promotes the existing one-liners in CLAUDE.md and AGENTS.md to explicit
  "run before finishing any Java task" instructions, pointing at the skill so
  agents have a reusable procedure to invoke rather than improvising.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Harden AttachmentResource upload/download against three regressions

Carried over from the latest AttachmentResource review. Three issues:

1. Content-Disposition header injection (security) — downloadAsset() built
   the header by direct string interpolation of asset.getFileName(). A
   filename containing double-quotes or CRLF could inject arbitrary HTTP
   headers. ContextFileResource already has a sanitize + RFC-5987 encode
   helper; rather than duplicate it, promote
   ContextFileUploadSupport.sanitizeFileName / buildContentDisposition to
   public, delete the duplicates from ContextFileResource (now delegators),
   and reuse the shared helpers from AttachmentResource.

2. Unbounded upload buffering (performance / DoS) — createAssetFromUpload
   read the full multipart body into a byte[] via IOUtils.toByteArray
   before checking against MAX_FILE_SIZE. An attacker could send an
   arbitrarily large body and exhaust heap before the validation ran.
   Replace with ContextFileUploadSupport.bufferUpload(), which streams to
   a bounded temp file and throws MaxFileSizeExceededException the moment
   the configured limit is passed; translate that into the same
   AttachmentException size-validation error the previous code raised.
   Promoted BufferedUpload and MaxFileSizeExceededException to public so
   the attachments package can consume them.

3. Startup NPE when objectStorage is null (bug) — initialize() called
   config.getObjectStorage().getMaxFileSize() without a null guard, so a
   deployment that doesn't configure object storage would NPE on server
   start. Added the same guard ContextFileResource.initialize() already
   uses, gave MAX_FILE_SIZE a safe 5 MiB default, and also null-guarded
   the S3-configuration branch of the CDN URL lookup so a pure-Azure or
   pure-NoOp setup doesn't fall off the end of the ternary.

Ran mvn spotless:apply — picks up formatting-only changes in
CollectionDAO.java and FolderIndex.java as a side effect of the shared
helper additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add ui-checkstyle skill + fix residual import-order drift

CI's UI Checkstyle workflow has three per-area jobs (lint-src,
lint-playwright, lint-core-components) that reformat the files changed
in the PR and fail if the reformat produces a diff. CLAUDE.md and
AGENTS.md didn't previously document this flow, so re-running the fix
was a guessing game — the two lint-core-components and lint-playwright
failures on this branch came from stale import order left over from the
main→context_center merge.

This commit:

- Adds a dedicated invocable skill at .claude/skills/ui-checkstyle/SKILL.md
  (Claude Code harness) and a mirror at .agents/skills/ui-checkstyle/SKILL.md
  (Codex-style agents). Both describe the exact three-command sequence CI
  runs — organize-imports-cli → eslint --fix → prettier --write — the
  per-area file scoping, the `--check` dry-run mode, and the rule that
  organize-imports must run BEFORE prettier (otherwise the indentation /
  trailing-comma round-trip leaves a dirty diff).

- Promotes the existing one-liner in CLAUDE.md and AGENTS.md to an explicit
  "run before finishing any UI task" instruction that points at the skill.

- Fixes two residual import-order drifts (KnowledgePagesHierarchy.tsx,
  EntityUtilClassBase.ts) surfaced by running the skill's sequence locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix UI checkstyle on EntityUtilClassBase.ts

ESLint --fix inserted a blank line between the KNOWLEDGE_PAGE guard and the
fallback return in getEntityByFqn. Committing the formatted version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix ContextFileIT.testFileAppearsInSearch flaky 500 from query_string parsing

The previous polling search used the namespaced unique name as a free-text
q= argument. The namespace prefix contains '-' which the ES 9.x query_string
parser treats as a NOT operator, producing a deterministic 500 across the
full 30s polling window even when the document was indexed.

Switch to the direct get-by-id endpoint (/v1/search/get/{index}/doc/{id}),
which performs a real-time ES GET with no query_string parsing and no
analyzer involvement — the most reliable signal that the document was
indexed. Bump the timeout to 60s and capture the response body on any
non-200 so future regressions surface the real ES error.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix knowledge center icon

* update knowledge center to context center

Co-authored-by: Copilot <copilot@github.com>

* Revert "update knowledge center to context center"

This reverts commit f0cca5fd65.

* Fix UI checkstyle: sort tag*-related imports in SearchClassBase

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix Jest coverage failures in KnowledgeCenter Layout and right panel

KnowledgeCenterLayout was importing i18n directly from LocalUtil, but the
global setupTests mock for that module only exposes t/on. Switch to the
useTranslation() hook so it picks up the react-i18next mock that already
provides i18n.dir(), matching how LeftSidebar and RichTextEditor use the
direction.

EntityRightPanelClassBase.getKnowLedgeArticlesWidget now returns the
KnowledgePages component instead of null. Update the corresponding test
case to assert the new return value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix playwright tests and bugs

Co-authored-by: Copilot <copilot@github.com>

* Fix checkstyle

* Fix /knowledgeCenter/search/hierarchy 500 by removing _id sort

ES 9.x and OpenSearch 3.x reject sorts on the _id field by default
(indices.id_field_data.enabled is false), causing every call to
listPageHierarchy{,ForActivePage} to fail the search_phase_execution_exception
"all shards failed" we see in the screenshot. The _id sort was added
in 4a75852a7e as a tiebreaker for from/size pagination, but
fullyQualifiedName is already a keyword field with doc_values and is
unique per page (name is unique within a parent's children) — so no
tiebreaker is needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Cascade hard-delete to descendant pages in search index

KnowledgeCenter pages are nested via FQN (parent.fqn -> parent.fqn.child),
not via a parent.id field on the child doc. The default deleteOrUpdateChildren
case for entity type "page" uses page.id field matching, which doesn't exist
on child page docs — so a recursive hard-delete on the parent removed the
parent from search but left every descendant orphaned in the index. Stale
docs only disappeared on a full reindex.

This logic was overridden in the collate fork's SearchRepositoryExt; it was
lost during the migration when the override class was removed. Fold the
override into the base SearchRepository as a Page-specific case that calls
deleteEntityByFQNPrefix, which deletes by fullyQualifiedName.keyword prefix
match — covering every descendant.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Add page/folder/contextFile/securityService to SearchIndexingApp picker

The Search Indexing Application's "Entities" picker shows "No data" when
typing "Page" because the enum in src/utils/ApplicationSchemas/SearchIndexingApplication.json
does not include the Knowledge Center / Drive entity types added on this
branch. The collate fork carried these in SearchIndexingApplication-collate.json
(included page); folder, contextFile and securityService are new on this
branch and never made it into the picker enum during the migration.

Without them in the enum, users cannot select these entity types for
targeted reindex, even though every other reindex code path supports them.

src/jsons/applicationSchemas/* is generated by parseSchemas.js from
src/utils/ApplicationSchemas/* at build time and is gitignored, so only
the source schema is updated here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Restore live index settings on per-entity distributed-promote path

DefaultRecreateHandler exposes two finalization paths:

  - finalizeReindex(...)        — centralized end-of-job promotion. Calls
                                  applyLiveServingSettings + maybeForceMerge
                                  before the alias swap, reverting the bulk
                                  overrides (refresh_interval=-1, replicas=0,
                                  async translog) back to live values
                                  (refresh=1s, replicas=1, durable translog).

  - promoteEntityIndex(ctx, ok) — per-entity promotion. Used by the distributed
                                  search-indexer's "promote as soon as all
                                  partitions for an entity complete" callback
                                  (DistributedSearchIndexExecutor.promoteEntityIndex).
                                  Swaps the alias and cleans up old indices —
                                  but never restored live settings.

When an entity finishes its partitions before the final reconciliation
(typically the smallest entities — e.g. knowledge `page` with ~11 rows),
its index is promoted via the per-entity path, the alias swap succeeds,
and the bulk-build overrides become the new live settings. refresh_interval
stays at -1 in production, so live writes after the reindex are buffered in
the translog and never reach searchable segments until a manual _refresh.
Externally this surfaces as "create an article, hierarchy is empty until I
re-trigger reindex" — exactly the user-reported bug.

Mirror the finalizeReindex sequence by calling applyLiveServingSettings
(and maybeForceMerge for parity) at the top of the promote block in
promoteEntityIndex, before the alias swap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Wire jobData into per-entity reindex promotion handler

DefaultRecreateHandler.applyLiveServingSettings reads from the handler's
jobData field (live + bulk index-settings overrides on the EventPublisherJob).
The per-entity distributed-promotion path in DistributedSearchIndexExecutor
created its own DefaultRecreateHandler instance and never called
withJobData(jobData) on it. With jobData=null, buildRevertJson returns null
and applyLiveServingSettings silently no-ops — meaning the previous fix
(b272de85f9) never actually re-applied live settings on the per-entity
promote path, even though the call was reached.

currentJob.getJobConfiguration() is the EventPublisherJob the strategy
created. Wire it into the new handler at construction time, mirroring the
withJobData call DistributedIndexingStrategy already makes on the strategy's
own handler instance.

With this change, the per-entity promote path now logs

  "Applying live serving settings to staged index '...' for entity 'page':
   {\"number_of_replicas\":1,\"refresh_interval\":\"1s\", ...}"

before the alias swap, and post-promotion `_settings` show
refresh_interval=1s instead of the stuck -1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix delete failure

* Fix java checkstyle

* Fix article deletion issue

* refactor(test): streamline Knowledge Center List setup and teardown processes

* Fix GlossaryTags

* Add missing pieces in knowledge articles

* Fix checkstyle

* Remove reviewer workflow spec

* remove unused util

* Fix the localization changes

* Fix unit tests

* deleted unused svg

* added missing svg

* improved ux of save button & autofocus on title

* lint fixes

* Update page index

* Make calculateFqnDepth static

* fixed the kc imports

* import fixes

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: mohitdeuex <mohit.y@deuexsolutions.com>
Co-authored-by: Rohit0301 <rj03012002@gmail.com>
Co-authored-by: Rohit Jain <60229265+Rohit0301@users.noreply.github.com>
Co-authored-by: Harsh Vador <harsh.vador@somaiya.edu>
2026-05-08 10:56:04 -07:00
Teddy
219c5683fa
ISSUE #3032 (#27912)
* feat: move flat sampling to sampling config + dynamic sampling option

* feat: move flat sampling on the backend to sample profile conifg object

* feat: fix circular import

* feat: align UI with new profiler config

* feat: fix json schema

* feat: align python imports with new schema path

* feat: update migration to look at extension

* feat: remove enable

* feat: remove enable

* feat: added titles to sample config

* feat: generated ts classes

* feat: addressed comments

* feat: change sample config instantiation to match new structure

* feat: removed backward compatible fields

* feat: ran java linting

* feat: updated imports to point to generated files

* feat: added dynamic sampler resolution logic

* feat: ran python linting

* feat: remove duplicate migration

* chore: merge upstream and clean conflicts

* feat: update logic to support dynamic and static sampling

* feat: adjust sample config call

* feat: test for statis, dynamic, row count and tier methods

* feat: more sample config unit tests

* feat: added tests for metric and sampling

* feat: added tests to validate fallback is not called i nmetric computers

* feat: strengthen profiler validation tests

* feat: fix sampling config

* feat: fix sampling config

* feat: fix sampling config

* feat: generated typescript models

* feat: fixed missing dq pipeline migration

* feat: fixed static check

* feat: fixed ci failures

* feat: fixed ci failures

* feat: fixed unit tests faioure and linting

* feat: fixed integration tests failures

* chore: fixe burstiq refactor

* chore: fix trino ci failures

* chore: revert baseline.json file

* chore: fix sampler availabl burst iq changes

* feat: added smart sampling radio button

* feat: ignore static checks errors

* feat: ran ts linting

* feat burstiq infinite recursion issue with dynamic as default

* feat: translate i8n keys

* feat: fix failing tests
2026-05-07 09:01:18 -07:00
Anujkumar Yadav
80375a7dc6
Add data access request support (#27879)
* Add DAR tasks

* Removed UI related changes of DAR

* nit

* Update generated TypeScript types

* fix linting issue

* Removed all languages changes

* nit

* removed white space

* add request data access button with owner/status conditions

* fix lint issue

* fix minor validation for data access button

* fix lint issue

* fix data access button visiable condition

* fix java lint checks and fix test cases

* nit

* fix test

* fix(tasks): model CreateTask.about as entityLink, validate target entity

Replace `about` (FQN string) + `aboutType` (string) with a single
`about` field of type entityLink (`<#E::{entityType}::{fqn}>`). The
resource layer parses the link and resolves it via
`Entity.getEntityReferenceByName(type, fqn, NON_DELETED)`, which
guarantees the target asset exists and is not soft-deleted.

Why: long-FQN data assets were rejected with `[query param name size
must be between 1 and 256]` because the modal was constructing a Task
`name` from the FQN. The `about` was modelled as a free string with
no schema validation that the target was a real, non-deleted entity.
The Threads API already uses entityLink for this exact purpose; tasks
now align with that pattern. The link is supplied as a hidden field
by the UI — users never see it.

Also fixes the missing `@ExtendWith(TestNamespaceExtension.class)` on
`DataAccessRequestIT` that caused four test failures in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix unit test failure

* fix(test): await workflow stage transition in DataAccessRequestIT

The workflow advances the task from pending-workflow-start to review
asynchronously. Asserting on the object returned by create() was a
race condition. Use Awaitility to poll until the stage is review,
matching the pattern in IncidentTaskIntegrationIT.

---------

Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Ram Narayan Balaji <ramnarayanb3005@gmail.com>
Co-authored-by: Ram Narayan Balaji <81347100+yan-3005@users.noreply.github.com>
2026-05-07 17:56:44 +05:30
sonika-shah
e91c90c144
fix: validate custom property name charset (#27808)
* fix: validate custom property name charset

Tighten custom property name validation to block characters that break
downstream parsers, with verified empirical reproduction:

- `"` causes HTTP 500 on PUT /metadata/types/{id}
- `:` breaks CSV import — exporter writes `key:value;key:value`, importer
  splits at first colon, treats prefix as the field name
- `^` breaks OpenSearch query when the name is in
  searchSettings.searchFields — Lucene reads `^` as the boost separator
  in `field^boost`
- `$` breaks CSV import via java.util.regex.Matcher.replaceAll which
  interprets `$<letter>` as a backreference

Adds a `customPropertyName` definition in basic.json and switches
customProperty.json to reference it. Adds a defensive regex check in
TypeRepository.validateProperty so the API returns 400 with a clear
error message even if schema validation is bypassed.

Tests cover allowed-charset acceptance, the four blocked characters,
leading-character validation, max-length enforcement, and unbalanced
brackets.

* Update generated TypeScript types

* test: add schema-vs-Java consistency test for custom property name

Guards against drift between basic.json#customPropertyName and the
TypeRepository regex/length constants. If either side is updated
without the other, CI fails with a message pointing to both files.

The Java validator is kept (better error message + covers internal
callers that bypass the HTTP layer); the consistency test guarantees
the two definitions cannot drift.

* fix: extend custom property name charset after gap-coverage matrix

Re-ran the matrix on previously-untested chars (+ ? * ~ ` \) across all
17 property types × create/patch/CSV/search:

- + ? * ~ ` all pass cleanly on every operation × every property type — add to allow list
- \ fails CSV roundtrip for entityReference and entityReferenceList types
  (escape inconsistency in CSV serialization) — add to block list

Updates the regex, schema description, Java validator error message, and
adds the new chars to the allow/block integration tests. Consistency
unit tests in TypeRepositoryTest continue to pass.

Final allow set: alphanumeric _ - . / & % # @ ! , ; = | ' + ? * ~ `
                 space ( ) < > [ ] { }
Final block set: " : ^ $ \

* Update generated TypeScript types

* updated the custom property name validation

* added name suffix in custom property name

* lint fixes

* include backslash in invalid char

Co-authored-by: Copilot <copilot@github.com>

* fixed the playwright issue

Co-authored-by: Copilot <copilot@github.com>

* lint fix

* fix check style

* Drop redundant Java validator for custom property name; tighten IT assertions

Schema is the single source of truth: jsonschema2pojo emits @Pattern + @Size
on CustomProperty.name from basic.json#/definitions/customPropertyName, and
@Valid on TypeResource.addOrUpdateProperty enforces them at the HTTP boundary.
The hand-written Pattern constant, validateCustomPropertyName, and the
schema-vs-Java sync test were duplicating that rule and could never reach the
HTTP user (Bean Validation always fires first via @Valid).

Tighten the new TypeResourceIT cases from assertThrows(Exception.class) to
assertThrows(InvalidRequestException.class) so a regression to a different
exception type or status code fails loudly.

* restrict few more special characters from Cp name

* minor fix

* Disallow & < > in custom property names; align IT cases

Schema-side counterpart to the UI changes in the previous two commits:
basic.json#/definitions/customPropertyName now blocks &, <, > alongside the
existing " : ^ $ \\. The DOMPurify pass on the UI sanitizes &, <, > into HTML
entities, which produced inconsistent persisted values; rejecting them at the
schema layer prevents that drift across all write paths.

IT updates:
- Drop &, <, > from the allowed-charset cases (and the "withMatched(pair)And<more>" composite)
- Add &, <, > to the disallowed-charset cases
- Drop "<" leading-character case (now covered as a disallowed character)
- Drop "<" / ">" unbalanced-bracket cases

* Update generated TypeScript types

* Close PATCH bypass for custom property name validation on Type

Bean Validation runs for the dedicated PUT /types/{id} (addOrUpdateProperty)
because the resource declares @Valid CustomProperty, and the createOrUpdate
path can't carry customProperties at all (CreateType schema doesn't include
the field). PATCH /types/{id} accepts an opaque JsonPatch, so @Valid never
reaches into the resulting customProperties[] — a JSON Patch like
[{"op":"add","path":"/customProperties/-","value":{"name":"bad:colon",...}}]
persisted bad-named properties (verified live: HTTP 200 before this fix).

Run Hibernate Validator programmatically inside TypeRepository.prepare() so
every write path enforces the schema-derived @Pattern / @Size / @NotNull on
each CustomProperty. The rule still lives only in basic.json — picked up via
the generated @Pattern annotation, executed via ValidatorUtil.validate.

Tests in TypeResourceIT:
- test_patchCannotAddCustomPropertyWithDisallowedName — seeds a valid property
  to ensure /customProperties exists, then PATCHes appending a name with ':',
  asserts InvalidRequestException and verifies the bad name is not persisted
- test_patchCanAddCustomPropertyWithValidName — guards against the fix
  rejecting valid PATCH-driven additions

* Block * in custom property names — breaks ES field-path lookup

When the property name appears in extension.<propertyName>^boost entries of
searchSettings.searchFields, OpenSearch treats * as a field-path wildcard.
The literal * field never matches its own wildcard pattern, so the field
gets silently skipped from the query and Explore search returns no hit for
the value. Bisected against the running server: of 12 candidate Lucene-special
chars, only * actually breaks the mainline UI search flow. ? ~ ( ) { } [ ] /
! and space all returned hits via the searchFields path because OS looks up
the field literally and only treats * as a wildcard at that layer.

Updates the regex + description in basic.json/customProperty.json, the UI
regex in regex.constants.ts, the validation message across 19 locales, the
generated TS docstrings, the Playwright invalid-name fixtures and spec, and
the IT TypeResourceIT case (with*asterisk moves from allowed to disallowed).

* Validate only newly-added custom properties; isolate PATCH IT to fresh types

prepare() previously validated the entire customProperties[] on every Type
write. An upgraded instance with a legacy property whose name contained a
now-banned char would then reject any subsequent PUT/PATCH on that type,
even when the write only adds a different valid property. Move the name
validation into TypeUpdater.updateCustomProperties() and scope it to the
`added` list computed by recordListChange against the original entity. New
properties are still validated; pre-existing names are left alone.

Replace the IT PATCH cases' shared `topic` Type with a fresh, namespaced
entity-category Type per test (createEntityTypeForTest). The shared `topic`
was mutated concurrently by other tests in the class — combined with
PATCH's lack of per-type locking, that produced lost-update races and
flaky asserts. The fresh per-test type has customProperties: [] from
creation, so the patch sets the array directly without a seed property.

* chore: prettier formatting on the new asterisk-rejection test

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* docs: add + ? ~ ` to JSDoc allow-list to match the regex

* fix(it): request customProperties field on read-back in PATCH IT

Type.customProperties is a lazy field — TypeRepository.setFields only
populates it when the request URL includes ?fields=customProperties. The
default getTypeById helper omits the param, so the read-back always saw
customProperties == null. That made test_patchCanAdd... fail (the just-
persisted property wasn't visible) and made test_patchCannotAdd... pass
for the wrong reason (would have stayed green even if the bad name had
slipped through validation).

Add a fields-aware getTypeById overload and use it in both PATCH cases.
Empirically verified against the live server: good name returns 200 +
appears in customProperties, bad name returns 400 + does not.

* minor fix

* playwright test fix

* removed unecessary test

* blocked ~ and / from custom property name

* lint-fix

* Block / and ~ in custom property names (JSON Pointer reservations)

Forward slash and tilde are reserved by JSON Pointer (RFC 6901): / is the
path separator and ~ is the escape lead-in (~0 = ~, ~1 = /). Allowing
them in a property name shifts the burden onto every caller that builds
a JSON Patch by string interpolation; a raw `/extension/${propertyName}`
either splits into the wrong number of segments or contains an invalid
escape sequence, and the server applies the patch to the wrong key (or
400s outright).

This surfaced as a reproducible failure in the table-cp Playwright suite:
the preceding test ended with `path: \`/extension/${propertyName}\`` where
propertyName ended in `/`. The server addressed extension[name-without-/][""]
instead of extension[name-with-/], returned 400, and TableClass.patch
overwrote entityResponseData with the error body — stripping id and FQN.
The next test fell into the search-based navigation path with an empty
search term and timed out at 180s.

Tighten the schema regex in openmetadata-spec/.../basic.json#customPropertyName
to drop / and ~ from the allowed set; update the human-readable description
in basic.json and customProperty.json to call out the RFC 6901 reservation.
Move the with/slash and with~tilde cases from the allowed-charset IT to
the disallowed-charset IT in TypeResourceIT.

* Update generated TypeScript types

* Use fresh per-test Type in custom-property name validation IT

The five charset/length/lead-char tests added in this PR previously mutated
the shared built-in TABLE_ENTITY_TYPE under @Execution(CONCURRENT). The
PUT path acquires TYPE_PROPERTY_LOCKS so concurrent writes serialize, but
relying on that lock for test isolation is fragile — the PATCH-driven IT
in the same class already uses a per-test fresh Type via
createEntityTypeForTest(client, ns, ...) for exactly this reason
(see 1864b0a6ac). Switch the five PUT tests to the same pattern so no
test mutates a shared Type, eliminating cross-test coupling regardless of
whether the server-side lock is in place.

Tests affected:
- test_customPropertyNameAllowedCharacters_succeeds
- test_customPropertyNameDisallowedCharacters_fails
- test_customPropertyNameMustStartWithAlphanumeric_fails
- test_customPropertyNameTooLong_fails
- test_customPropertyNameUnbalancedBrackets_succeeds

* Align UI artifacts with the tightened custom-property-name regex

Three small follow-ups flagged by reviewers:

- regex.constants.ts: JSDoc above CUSTOM_PROPERTY_NAME_REGEX still listed
  / and ~ as allowed even though the pattern below was tightened to drop
  them. Update the comment to match the actual regex and call out the
  RFC 6901 reason so future edits don't reintroduce them.

- CustomProperties.spec.ts: the "should accept a valid name with allowed
  special characters" test fed a hardcoded string containing ~ and /,
  which the new regex rejects — the assertion would fail. Drop those two
  characters so the input stays in the allowed set.

- zh-cn.json: the Simplified Chinese translation of
  custom-property-name-validation was double-escaped (\\\" and \\\\),
  which would render to users as literal \" and \\ rather than " and \.
  Match the escaping pattern used by the other 18 locales.

* addressed gitar comment

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Rohit0301 <rj03012002@gmail.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Rohit Jain <60229265+Rohit0301@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-07 15:35:43 +05:30
Sriharsha Chintalapani
f9d3c85d20
fix(search): restore live settings on per-entity promote path (#27920)
* Restore live index settings on per-entity distributed-promote path

DefaultRecreateHandler exposes two finalization paths:

  - finalizeReindex(...)        — centralized end-of-job promotion. Calls
                                  applyLiveServingSettings + maybeForceMerge
                                  before the alias swap, reverting the bulk
                                  overrides (refresh_interval=-1, replicas=0,
                                  async translog) back to live values
                                  (refresh=1s, replicas=1, durable translog).

  - promoteEntityIndex(ctx, ok) — per-entity promotion. Used by the distributed
                                  search-indexer's "promote as soon as all
                                  partitions for an entity complete" callback
                                  (DistributedSearchIndexExecutor.promoteEntityIndex).
                                  Swaps the alias and cleans up old indices —
                                  but never restored live settings.

When an entity finishes its partitions before the final reconciliation
(typically the smallest entities — e.g. knowledge `page` with ~11 rows),
its index is promoted via the per-entity path, the alias swap succeeds,
and the bulk-build overrides become the new live settings. refresh_interval
stays at -1 in production, so live writes after the reindex are buffered in
the translog and never reach searchable segments until a manual _refresh.
Externally this surfaces as "create an article, hierarchy is empty until I
re-trigger reindex" — exactly the user-reported bug.

Mirror the finalizeReindex sequence by calling applyLiveServingSettings
(and maybeForceMerge for parity) at the top of the promote block in
promoteEntityIndex, before the alias swap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Wire jobData into per-entity reindex promotion handler

DefaultRecreateHandler.applyLiveServingSettings reads from the handler's
jobData field (live + bulk index-settings overrides on the EventPublisherJob).
The per-entity distributed-promotion path in DistributedSearchIndexExecutor
created its own DefaultRecreateHandler instance and never called
withJobData(jobData) on it. With jobData=null, buildRevertJson returns null
and applyLiveServingSettings silently no-ops — meaning the previous fix
(b272de85f9) never actually re-applied live settings on the per-entity
promote path, even though the call was reached.

currentJob.getJobConfiguration() is the EventPublisherJob the strategy
created. Wire it into the new handler at construction time, mirroring the
withJobData call DistributedIndexingStrategy already makes on the strategy's
own handler instance.

With this change, the per-entity promote path now logs

  "Applying live serving settings to staged index '...' for entity 'page':
   {\"number_of_replicas\":1,\"refresh_interval\":\"1s\", ...}"

before the alias swap, and post-promotion `_settings` show
refresh_interval=1s instead of the stuck -1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Add regression test for live serving settings on per-entity promote

Asserts that DefaultRecreateHandler.promoteEntityIndex calls
searchClient.updateIndexSettings with the live-revert JSON
(refresh_interval=1s, replicas=1, translog.durability=request) before
swapping the alias, given a handler with bulk overrides wired through
withJobData. Without the two preceding fixes the assertion fails with
"Wanted but not invoked" — applyLiveServingSettings was never reached
on the per-entity promotion path, so the staged index inherited
refresh_interval=-1 and post-promotion live writes never became
searchable until a manual _refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Expand unit coverage around the per-entity promotion contract

DefaultRecreateHandlerTest.PromoteEntityIndexTests:
  - testPromoteEntityIndexAppliesSettingsBeforeAliasSwap: InOrder
    verification that updateIndexSettings runs BEFORE swapAliases. Catches a
    swap-then-revert misordering (which would briefly serve live reads against
    refresh=-1 settings).
  - testPromoteEntityIndexForceMergesWhenConfigured: forceMerge(staged, 1) is
    invoked when bulkIndexSettings.forceMergeOnPromote=true. Catches a
    regression where the force-merge call gets dropped without anyone
    noticing.
  - testPromoteEntityIndexSkipsSettingsWithoutJobData: locks in the safe no-op
    behavior when a handler is constructed without withJobData. Documents that
    no-jobData → no settings call (vs. crash or silent revert to defaults).

DistributedSearchIndexExecutorTest:
  - initializeEntityTrackerWiresJobDataIntoDefaultRecreateHandler: triggers
    the private initializeEntityTracker with currentJob holding a populated
    jobConfiguration and verifies recreateHandler.withJobData(jobData) is
    called on the per-entity handler. This catches the second half of the
    original regression: even if applyLiveServingSettings is reached on
    promoteEntityIndex, jobData=null makes it a silent no-op. Future edits
    that drop the wiring or move handler construction elsewhere will fail
    here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add integration test for live settings restoration after alias promotion

Triggers SearchIndexingApplication with bulkIndexSettings configured
(refresh_interval=-1, number_of_replicas=0, translog.durability=async),
waits for the run to terminate, then queries _settings on the promoted
table_search_index alias against the real OpenSearch/Elasticsearch
container (via TestSuiteBootstrap.createSearchClient()). Asserts that
each concrete index resolved by the alias has the live values applied
(refresh=1s, replicas=1, translog.durability=request) and not the bulk
overrides.

This is the end-to-end counterpart to the unit-level regression test in
DefaultRecreateHandlerTest. Catches the same class of bug at the layer
where it actually surfaced in production: an alias swap that completed
successfully according to logs but left the new live index unsearchable
because refresh was disabled and writes were buffered indefinitely.

Modeled on SearchIndexingFieldsParityIT for run-trigger / poll structure;
adds the post-completion _settings verification step that no other IT
performs today.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address PR review: harden settings revert + lock InOrder + drop redundant test

DefaultRecreateHandler: move applyLiveServingSettings + maybeForceMerge
inside the try/finally that unregisters the staged index. Without this, a
transient OS/ES failure during _settings update or _forcemerge propagated out
before the finally ran, leaving searchRepository.unregisterStagedIndex
permanently registered — so live writes kept routing to a staged index
nothing reads from. Same fix applied to finalizeReindex for consistency
(its window is shorter since it runs at end-of-job, but the leak shape is
identical). Per gitar-bot review.

DefaultRecreateHandlerTest:
  - testPromoteEntityIndexAppliesLiveServingSettingsBeforeSwap: replace
    independent verify() calls with InOrder so the test actually locks the
    "settings before alias swap" ordering its name and the PR description
    promise. A swap-then-revert refactor would have passed before this.
    Per Copilot review.
  - Drop testPromoteEntityIndexAppliesSettingsBeforeAliasSwap (the standalone
    InOrder test added in the previous commit) — folded back into the test
    above, which now covers both ordering and JSON content in one place.
  - Add testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure —
    regression test for the gitar-bot fix above. Verified to fail with
    "IllegalState connection reset" when the calls are moved back outside
    the try block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop verbose explanatory comments from promote-path edits

The why-it's-in-a-try and why-jobData-is-wired blocks read like commit
messages, not code annotations. Tests and commit history carry the
rationale; the code itself reads fine without them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Close Rest5Client in IT _settings helper

readIndexSettings opened a Rest5Client and never closed it, leaking
HTTP connections on test re-runs. Wrap in try-with-resources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Tighten SearchIndexAliasPromotionIT against false-positive runs

Two reasons the IT could pass without the regression:

  - waitForLatestRunSuccess accepted activeError, which maps to
    COMPLETED_WITH_ERRORS. In that path EntityCompletionTracker can
    invoke promoteEntityIndex(..., false), the staged index is deleted,
    and the alias stays on the pre-existing live index. The
    pre-existing index already has live settings, so the _settings
    assertions pass against it without exercising the promotion path.

  - readIndexSettings on the alias would resolve to that pre-existing
    concrete index even after a no-op promotion, so the assertions
    were never actually checking the staged index.

Reject anything other than success/completed, and assert the alias
resolves to a *_rebuild_* index — proving the swap moved the alias to
a freshly staged index.

Per Copilot review on PR #27920.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Harden alias promotion: defer canonical delete, hard-fail on empty aliases, surface in job status

Four coupled changes that fix latent silent-failure paths in the per-entity
and end-of-job promotion code (predates the recent regression but lands
together since they touch the same blocks).

1. Empty aliases is a hard failure
   `getAliasesFromMapping` returning an empty set used to fall through to a
   skip-swap WARN, then log "Promoted staged index..." and record
   `promoteSuccess`. Canonical was already pre-deleted at that point, so the
   alias resolved to nothing and operators got no error signal. Now: log
   structured ERROR, call `recordPromotionFailure`, return without deleting
   canonical or claiming success. Same fix in both `promoteEntityIndex` and
   `finalizeReindex`.

2. Defer canonical-index deletion until swap success
   Old order: delete canonical → swapAliases (with canonical-name in the
   alias set). If swap fails after delete, the canonical name has nothing to
   resolve to → live data unavailable.

   New order:
     - swapAliases of all non-canonical-name aliases (atomic move from
       canonical to staged). If this fails, canonical still serves with all
       original aliases — no data loss.
     - Delete canonical (only if its name is needed as alias). If this
       fails, parent aliases work; canonical-name lookups still hit the old
       index until retry — degraded, not lost.
     - addAliases for the canonical name on staged. If this fails (data
       loss path: canonical was deleted but alias-add failed), mark
       dataLossPromotions; operator alarm.

3. Promotion outcomes affect job status
   `DefaultRecreateHandler` tracks `failedPromotions` and `dataLossPromotions`
   sets. `RecreateIndexHandler` interface exposes them.
   `SearchIndexExecutor.determineStatus` and
   `DistributedIndexingStrategy.determineStatus` now consult both:
     - any data-loss promotion → ExecutionResult.Status.FAILED
     - any failed promotion (no data loss) → COMPLETED_WITH_ERRORS
   Distributed path checks both the strategy's handler and the per-entity
   executor's handler (different instances, both can record failures).

4. Structured failure log markers
   Replace single-line ERROR with `[ALIAS_PROMOTE_FAILED phase=... entity=...
   stagedIndex=... canonicalIndex=... aliases=...]` markers at every
   promotion-fail exit (empty-aliases / swap1 / delete-canonical / swap2 /
   exception). Each line states whether canonical was deleted and what the
   blast radius is, so operators can grep and triage without reading code.

Tests:
  - testPromoteEntityIndexEmptyAliasesIsHardFailure
  - testPromoteEntityIndexCanonicalNotDeletedWhenStep1Fails
  - testPromoteEntityIndexThreeStepSwapOrder (InOrder swap1 → delete → swap2)
  - testPromoteEntityIndexFlagsDataLossOnAddAliasFailure
  - SearchIndexExecutor: determineStatusFlagsPromotionFailuresAndDataLoss
  - DistributedIndexingStrategy: determineStatusFlagsPromotionFailuresFromEitherHandler

All 146 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Consolidate finalizeReindex and promoteEntityIndex into one core path

Previously two separate methods with ~80 lines each that did the same thing.
PR #25594 forked the code (per-entity vs end-of-job), PR #27865 added a
critical settings-revert step to one half but missed the other — that was
the original regression. The duplication itself is the regression source,
so collapse it: both methods are now thin wrappers around a single private
`promote(EntityReindexContext, boolean)` core. Future features land in one
place by construction; can't drift.

Behavior consolidation:
  - Single source of truth for aliasesToAttach: the EntityReindexContext
    fields (existingAliases ∪ canonicalAliases ∪ parentAliases). These are
    populated by recreateIndexFromMapping at stage-create time and read
    aliases from the live index — strictly a superset of what the old
    getAliasesFromMapping derived from IndexMapping.json (preserves
    operator-added aliases).
  - getAliasesFromMapping deleted. promoteEntityIndex no longer fetches
    IndexMapping at promote time; it reads everything from the context the
    caller built.
  - shouldPromote / staged-delete-on-failure / settings revert / empty-
    aliases hard-fail / three-step swap / cleanup / unregister-staged: all
    in one method body now.

Caller-side semantic change: any code path that called promoteEntityIndex
with a context that had populated existingAliases/canonicalAliases/
parentAliases (which is all production callers — DistributedSearchIndex
Executor and SearchIndexExecutor both already populate them) is unaffected.
A caller that built a bare context with only entity/canonical/staged set
would previously have re-derived aliases from IndexMapping; now it hits
the empty-aliases hard fail. This is strictly safer (fail loud beats
silent skip-with-success) and we've audited callers.

Tests:
  - All 17 existing PromoteEntityIndexTests updated to populate the context
    with the alias fields they previously depended on getAliasesFromMapping
    to produce. One test ("Should handle null indexMapping gracefully")
    rewritten to "Empty aliases on context is handled" — same behavior,
    new wording for the new model.
  - Old GetAliasesFromMappingTests nested class deleted — exercised the
    removed method.
  - New EntryPointParityTests nested class with 3 tests that explicitly
    run the same EntityReindexContext through both finalizeReindex and
    promoteEntityIndex and assert byte-for-byte identical alias state,
    deleted-indices set, and failure-tracking fields. These pin the two
    entry points together against future drift.

Integration test:
  - Added perEntityPromotionIsIdempotentAcrossRepeatedRuns to
    SearchIndexAliasPromotionIT. Triggers the app twice, asserts the
    second run produces a different *_rebuild_* concrete index and that
    live settings still apply — exercises the full
    pre-existing-canonical → three-step-swap path which is what production
    actually does.

Total: 160 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address PR review: post-state checks, FAILED listener, hermetic IT, InOrder

Five additional behavioral fixes from Copilot review on PR #27920.

1. delete-canonical and addAliases: detect failure via post-state, not via
   try/catch. ElasticSearchIndexManager#deleteIndexWithBackoff and
   #addAliasesInternal both swallow transport exceptions and return void
   (same shape on the OS side), so the existing try/catch could never
   observe a failed delete or alias-add. After deleteIndexWithBackoff,
   verify the canonical no longer exists; after addAliases, verify the
   canonical-name alias is actually attached. If either post-condition
   fails, log [ALIAS_PROMOTE_FAILED reason=delete-not-acknowledged] /
   [reason=alias-not-attached] and mark the promotion failed (data loss
   for the alias-not-attached path).

2. SearchIndexExecutor.executeSingleServer: wire onJobFailed for the new
   FAILED status. Previously the listener chain only had callbacks for
   COMPLETED / COMPLETED_WITH_ERRORS / STOPPED, so promotion-driven FAILED
   ended without populating jobData.failure or notifying observers. Pass
   in an IllegalStateException naming the data-loss entities so the app
   run record carries the right failure context.

3. SearchIndexAliasPromotionIT trigger payload: explicitly set
   liveIndexSettings (1s/1/request), liveIndexSettingsByEntity (empty),
   and useDistributedIndexing=true. /v1/apps/trigger/X merges the payload
   into the persisted config rather than replacing it, so without these
   the test could be affected by previous local config or silently exercise
   the single-server path. The hard-coded post-promotion assertions are
   now anchored to values the test itself supplies.

4. testPromoteEntityIndexForceMergesWhenConfigured: replace standalone
   verify() with InOrder(forceMerge → swapAliases) so a refactor that
   swaps aliases first and merges afterward fails the test instead of
   passing.

All 148 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Wrap post-state checks: indexExists / getAliases throws no longer escape

Post-state verification I added in 9a7fa49494 (indexExists after delete,
getAliases.contains(canonical) after add) called the search client
directly. If those calls themselves threw — network timeout, transport
error — the exception escaped promoteWithDeferredCanonicalDelete and was
caught by the outer phase=exception handler with markPromotionFailed(
dataLoss=false). For the getAliases case the canonical index has already
been deleted at that point, so dataLoss=false misclassifies a real data-
unavailability state.

Three small helpers:
  - safeIndexExists(client, index, entityType): gate-time check; returns
    false on throw (conservative — skip delete attempt; if canonical
    actually exists, step-3 addAliases will fail with name collision and
    the alias-attached post-check will record the right blast radius).
  - checkIndexExists(client, index): tristate Boolean for post-delete
    check; null on throw means "couldn't determine state".
  - checkAliasAttached(client, staged, alias): tristate Boolean for
    post-add check; null on throw means "couldn't determine state".

Caller logic:
  - delete-canonical post-check returning null → markPromotionFailed(
    dataLoss=false). Conservative: we don't know if delete actually took.
  - add-aliases post-check returning null → markPromotionFailed(
    dataLoss=true). Canonical IS deleted; alias state unknown is the
    worst case.

Tests:
  - testPromoteEntityIndexHandlesIndexExistsPostCheckThrow: gate returns
    true, post-delete check throws → failed but NOT data loss.
  - testPromoteEntityIndexHandlesGetAliasesPostCheckThrow: post-add check
    throws → failed AND data loss (canonical already gone).

Per gitar-bot review on PR #27920. All 40 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review 4232747647: positive-evidence dataLoss, hermetic IT

Six findings from copilot-pull-request-reviewer.

1+4. Track canonicalDeleted via positive evidence only (DefaultRecreateHandler).
   ES/OS indexExists/getAliases/listIndicesByPrefix all swallow transport
   errors and return false/empty. A "negative" probe result cannot be
   distinguished from "probe failed". The previous shape blindly trusted
   probe values and could misclassify a transient failure as data loss
   when canonical was actually still serving (gate-says-no-but-actually-yes
   case) or, conversely, as not-data-loss when canonical was actually
   gone. Now: a `canonicalDeleted` boolean defaults to false and only
   flips to true after both the delete call and a positive post-state
   check confirm the index is gone. dataLoss classification uses this
   flag — never claim data loss without positive evidence canonical was
   deleted. Added regression test
   `testPromoteEntityIndexAmbiguousGateProbeIsNotDataLoss` for the gate-
   ambiguity case.

2. Wire onJobFailed in DistributedIndexingStrategy.
   Previously only SearchIndexExecutor emitted the listener callback for
   FAILED. The distributed strategy returned the status but never invoked
   listeners.onJobFailed, so jobData.failure stayed empty and the
   AppRunRecord/WebSocket update had no failure context. Now mirrors the
   single-server behavior with an IllegalStateException naming the
   data-loss entities.

3. IT: assertEquals("request", durability) instead of assertNotEquals
   ("async"). The non-equals assertion would pass if a silent translog-
   revert drop left durability at any non-async cluster default, missing
   the regression. Pin the exact configured value.

5. IT: assert exactly one concrete index resolves the alias, not
   at-least-one. A broken swap that leaves the alias attached to BOTH
   the pre-existing live index AND the new *_rebuild_* index would
   satisfy "any rebuild present" but duplicate search results in
   production. Use assertEquals(1, settingsByIndex.size()).

6. IT hermeticity: snapshot/restore SearchIndexingApplication's
   appConfiguration. /v1/apps/trigger/{name} merges payload into the
   persisted config, so without restore later tests in the suite
   inherit this test's bulkIndexSettings / liveIndexSettings /
   useDistributedIndexing values — making suite ordering change what
   they exercise. Both IT methods now do a try/finally with
   snapshotAppConfig() + restoreAppConfig().

All 151 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Wait for restore-triggered run to settle in SearchIndexAliasPromotionIT

restoreAppConfig POSTs to /v1/apps/trigger/{name}, which both merges the
body into the persisted config AND starts a new run. Returning without
waiting left SearchIndexingApplication running into the next test class
(AppsResourceIT.test_triggerApp_200), which then timed out for 2 minutes
on "already running".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix AppsResourceIT.waitForAppJobCompletion case mismatch and timeout

The terminal-state check compared status against uppercase "SUCCESS",
"FAILED", "COMPLETED", but appRunRecord.json defines the status enum
in lowercase ("success", "failed", "completed", ...). The check never
matched and the 30s wait silently fell through to the catch block,
making it a no-op. test_triggerApp_200 then relied on its 2-minute
"already running" trigger retry, which timed out whenever a longer
reindex (e.g. SearchIndexingFieldsParityIT's "all entities" reindex)
was still in flight.

Switch the terminal check to "not running and not started"
case-insensitively, and raise the ceiling to 5 minutes so the wait
actually covers a long in-flight reindex.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Run SearchIndexAliasPromotionIT in the sequential bucket

The test triggers SearchIndexingApplication and waits for it to
complete, but during the parallel-tests slot other classes can also
trigger the same app concurrently. The resulting in-flight run
then leaks into AppsResourceIT.test_triggerApp_200 (which runs in
the sequential slot) and exhausts its 2-minute "already running"
trigger Awaitility.

AppsResourceIT is already in the sequential bucket for the same
reason. Mirror it for SearchIndexAliasPromotionIT across all seven
failsafe profiles (mysql/postgres × elasticsearch/opensearch and the
RDF profile) — include in sequential-tests, exclude from
parallel-tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot PR review 4233452655

DefaultRecreateHandler.promote: when canonicalIndex is null but
stagedIndex is non-null, the early-return previously skipped the
finally clause that unregisters the staged index. Live writes could
stay routed to the staged index after we bail. Release the
registration explicitly before returning. Extend the existing
testPromoteWithNullCanonicalIndex unit test to assert the unregister
call.

SearchIndexAliasPromotionIT.snapshotAppConfig: distinguish "snapshot
failed" from "config absent". The previous Map.of() return on
exception caused restoreAppConfig to POST an empty body to
/v1/apps/trigger/{name}, which is a no-op merge that silently leaks
this test's bulk/live setting overrides into downstream tests and
starts a spurious app run. Return null on failure so the caller
short-circuits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Remove SearchIndexAliasPromotionIT in favor of unit test coverage

The IT triggered the bundled SearchIndexingApplication, which made
it expensive (full reindex per run, ~1-2 minutes) and a source of
bleed-through into AppsResourceIT.test_triggerApp_200 even after
moving to the sequential failsafe bucket.

The same surface is already covered by unit tests:
- DefaultRecreateHandlerTest "Should restore live serving settings on
  staged index before alias swap" (and the per-step swap-order,
  data-loss-flagging, post-state-check tests)
- DistributedSearchIndexExecutorTest verifies the
  withJobData(jobConfig) wiring on the per-entity promotion handler

Drop the IT and revert its pom.xml include/exclude entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot PR review 4236718653

DefaultRecreateHandler.promote: validate aliasesToAttach BEFORE
applyLiveServingSettings/maybeForceMerge so the empty-aliases error
path skips wasted I/O and segment churn on a staged index that will
never be swapped in. Extend testPromoteWithEmptyContextAliases to
verify updateIndexSettings and forceMerge are not invoked. Adjust
testPromoteEntityIndexUnregistersStagedIndexOnSettingsFailure to
populate canonicalAliases so it still reaches applyLiveServingSettings.

Refresh checkAliasAttached Javadoc: with positive-evidence
canonicalDeleted tracking, a null result is classified as data loss
only when canonicalDeleted=true; otherwise it is degraded (retryable).
The previous wording claimed every null was data loss, which no longer
matches the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix per-entity promote when canonical is an alias, not a concrete index

On every reindex after the first, the canonical name (e.g.
openmetadata_table_search_index) is an alias on the previous staged
index — not a concrete index. The three-step swap then attempted
deleteIndexWithBackoff(canonicalAlias), which OS/ES rejects with
illegal_argument_exception ("matches an alias, specify the
corresponding concrete indices instead"). The 6-attempt exponential
backoff burned ~31s per entity. With 30+ default entity types the
SearchIndexingApplication ran past CI's 300s setup budget, blocking
every Playwright shard and python-integration setup at "Setup
OpenMetadata Test Environment".

Two-pronged fix in DefaultRecreateHandler.promote:

1. Detect canonicalIsAlias via getIndicesByAlias(canonicalIndex).
   When true, take the new promoteByAtomicAliasSwap path: a single
   swapAliases call atomically moves every alias (parents + canonical)
   from old → new staged. No name collision, no per-entity
   delete-by-alias-name, no degraded/data-loss windows.

2. listIndicesByPrefix returns the canonical alias name itself among
   its results (alongside concrete *_rebuild_* indices). Filter that
   out of oldIndicesToCleanup when canonicalIsAlias, so the cleanup
   loop's deleteIndexWithBackoff doesn't replay the same 31s backoff.
   Keep canonical in cleanup when it is concrete — that's where the
   first-reindex flow drops the original concrete after the three-step
   swap moves aliases off it.

Local repro: SearchIndexingApplication now completes in ~7s instead
of hanging past 300s. New unit test
testPromoteEntityIndexAtomicSwapWhenCanonicalIsAlias locks the new
shape (single swap, no delete-by-alias, old concrete cleaned up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add ALIAS_PROMOTE_BEGIN diagnostic log per entity

Local repro on the CI-equivalent stack (mysql + elasticsearch + sample
data + ingestion) completes in ~30s with all 60 entities going through
the atomic-swap path. CI on the same commit still hangs past 300s.
Add a structured log line at the top of every promote() so the next
CI run shows which entities reach promotion and what shape (atomic
vs three-step) was selected — pinpoints whether a specific entity
gets stuck or if reindex never reaches promote().

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop heavy alias-promotion refactor; rely on PR #27930 fix already in main

The original "live settings not restored on staged after promote"
regression is fixed by PR #27930 (commit e56abb80d5), which is already
on main: applyLiveServingSettings + maybeForceMerge run before alias
swap, and DistributedSearchIndexExecutor wires job configuration into
the per-entity DefaultRecreateHandler. That minimal fix is sufficient
for the original symptom (newly created entities not searchable until
manual _refresh).

This commit reverts every file we refactored further on top of that
minimal fix back to origin/main:

  - DefaultRecreateHandler.java         — drop deferred-canonical-delete
                                           three-step swap, post-state
                                           checks, dataLoss tracking,
                                           safeIndexExists/checkIndexExists/
                                           checkAliasAttached helpers,
                                           promoteByAtomicAliasSwap path,
                                           ALIAS_PROMOTE_BEGIN diagnostic
  - RecreateIndexHandler.java           — drop getFailedPromotions /
                                           getDataLossPromotions defaults
  - DistributedIndexingStrategy.java    — drop FAILED-status emit and
                                           dataLoss aggregation
  - SearchIndexExecutor.java            — drop FAILED-status emit
  - DistributedSearchIndexExecutor.java — drop @Getter on
                                           recreateIndexHandler
  - The matching unit tests in DefaultRecreateHandlerTest /
    DistributedIndexingStrategyTest / SearchIndexExecutorControlFlowTest /
    DistributedSearchIndexExecutorTest all revert to origin/main.

The branch's only remaining contribution is the AppsResourceIT
case-mismatch fix in waitForAppJobCompletion (a pre-existing bug
discovered while diagnosing).

Reason: the further refactor has consistently failed CI on this branch
since the first commit. Local repro on the CI-equivalent stack
(./docker/run_local_docker.sh -d mysql -m no-ui -s true -i true) showed
the canonical-is-alias fix working end-to-end (~30s, 60 entities), but
CI still timed out at 300s. Without server-side logs from CI we can't
target the remaining gap. PR #27930 already addresses the user-visible
regression, so the safest move is to drop the further refactor and ship
just the case-mismatch fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Skip delete-by-alias-name when canonical is currently an alias

After the first reindex, the canonical name (e.g. table_search_index)
is an alias on the previous staged, not a concrete index. OpenSearch's
listIndicesByPrefix returns the alias name as one of its result keys,
which then drives a deleteIndexWithBackoff(canonicalIndex) attempt
that fails with "[illegal_argument_exception] matches an alias,
specify the corresponding concrete indices instead". The 6-attempt
exponential backoff burns ~31s per entity (1+2+4+8+16s); on a 60-entity
reindex that wastes ~30 minutes in cleanup with search degraded
throughout.

Drop the alias name from oldIndicesToDelete when getIndicesByAlias
proves canonical is an alias right now. The atomic swapAliases call
moves the canonical alias from the old concrete to the new staged in
one step; the underlying old concrete is already in oldIndicesToDelete
and gets cleaned up normally by the post-swap loop. No three-step swap
or deferred-canonical-delete restructure needed.

Adjusts testFinalizeReindexPromotesPartialData to use a realistic
canonical-concrete setup (no self-alias on its own name — that state
cannot exist in real OS/ES) so the new guard does not misfire on the
existing test fixture.

New unit test testFinalizeReindexSkipsDeleteWhenCanonicalIsAlias locks
the new behavior: when canonical is an alias, deleteIndexWithBackoff
is never called with the canonical name, and the old concrete rebuild
is cleaned up via the swap path.

Verified locally on the CI-equivalent stack
(./docker/run_local_docker.sh -d mysql -m no-ui -s true -i true): both
first reindex (canonical-concrete) and second reindex (canonical-alias)
complete in ~8s with no "matches an alias" errors and clean cleanup
logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-05-06 15:16:57 -07:00
Ram Narayan Balaji
8dd98fa765
fix(it): Stabilize Flaky integration tests (#27546)
* fix(it): stabilize three flaky integration tests

- TagResourceIT.test_searchTagByClassificationDisplayName: raise Awaitility
  timeout from 30s to 90s — under full-suite concurrent load the tag search
  index can lag well past 30s before the tag is discoverable by classification
  display name
- GlossaryOntologyExportIT.testExportGlossaryAsRdfXml: replace legacy
  model.write("RDF/XML") with RDFDataMgr.write(RDFXML_PLAIN) — the legacy
  Jena API attempts external DTD/entity resolution from w3.org, hanging ~104s
  in network-isolated CI before the client times out at 60s; RIOT writes
  purely in-memory with no network I/O
- SearchResourceIT.testExportWithFromAndSizeForPagination: add _id as a
  final tiebreaker sort on export requests in both ElasticSearch and
  OpenSearch managers; from/size pagination without a unique tiebreaker
  produces duplicate rows across pages when concurrent CI tests mutate the
  same index between requests; also deduplicate the redundant name.keyword
  secondary sort when the caller already sorts by name.keyword

* fix(search): use id.keyword instead of _id for export sort tiebreaker

_id is an Elasticsearch meta-field that requires fielddata to sort on,
disabled by default. Use the indexed id.keyword sub-field instead, which
is a proper keyword field with doc values and is sortable without any
cluster setting changes.

* fix(it): retry pagination assertion in Awaitility to tolerate transient index shifts

from/size pagination on a shared search index can return duplicate rows
across two consecutive requests when concurrent tests mutate the same
index in between. Wrapping both page fetches and the assertion in
untilAsserted lets the check retry until the index stabilises rather
than failing on the first transient collision.

* revert(search): drop id.keyword tiebreaker; rely on test-side Awaitility retry

* fix(search): strengthen pagination test assertions and restore id.keyword tiebreaker sort

* fix(it): revert RdfRepository prod change; increase GlossaryOntologyExportIT timeout to 150s for Jena DTD stall in CI

* fix(it): restore tag search index aliases in IndexTemplateIT after index deletion

testDocUpdateOnDeletedIndexUsesTemplateNotAutoInference deletes the physical
openmetadata_tag_search_index and previously restored it with a bare PUT — leaving
all aliases (openmetadata_tag, openmetadata_classification, openmetadata_all)
missing for the remainder of the run. This caused TagResourceIT.checkCreatedEntity
to time out (searches on tag_search_index hit an empty bare index) and delete_by_query
cleanup ops to fail with index_not_found_exception on openmetadata_tag.

Fix: replace the bare PUT with Entity.getSearchRepository().createIndex() which
recreates the physical index with proper OpenMetadata mappings and restores all aliases.

* fix(it): isolate IndexTemplateIT tag test to avoid wiping production search index

testDocUpdateOnDeletedIndexUsesTemplateNotAutoInference was deleting the
production openmetadata_tag_search_index backing index, racing with
TagResourceIT.test_searchTagByClassificationDisplayName which polls that
index for 90s. Use a test-scoped index name matching the template pattern
instead, consistent with the other tests in this class.

* fix(it): make testClaimPendingIncludesRetryStatuses race-tolerant

The production SearchIndexRetryWorker (4 daemon threads, 5s poll) races
the test by calling the same global claimPending SQL. Replace the brittle
size-based assertion with an Awaitility loop that checks claimedAt != null
for each inserted record — proving claimPending's SQL filter accepted the
record's status regardless of which thread won the race.

* fix(it): avoid stale entityStatus in patch_addDeleteReviewers

The GlossaryTermApprovalWorkflow fires asynchronously when reviewers are
added, setting entityStatus=IN_REVIEW. The final patch sent the stale
entityStatus=APPROVED from the previous response, causing a spurious
IN_REVIEW→APPROVED transition in the diff which requires the caller to
be a reviewer — admin is not. Re-fetch the entity before the reviewer
removal so the diff contains only the reviewer change.

* fix(it): handle claimedAt reset in testClaimPendingIncludesRetryStatuses

updateFailureAndRetryCount sets claimedAt=NULL after the worker processes
a record. Add retryCount > 0 as a secondary proof-of-claim signal so
records that were claimed, processed, and had claimedAt reset are still
counted — covers the FAILED exhaustion path and intermediate PENDING_RETRY_*
states where claimedAt is temporarily null.

* fix(it): avoid governance workflow race in testApplyFeedback_withRecognizerMetadata

repository.create() publishes a ChangeEvent that triggers
ApplyRecognizerFeedbackImpl asynchronously. That workflow call races
with the direct applyFeedback below: by the time the workflow runs, the
GENERATED tag is already removed by the direct call, so
getRecognizerIdFromTagLabel returns null and the workflow falls back to
ALL recognizers, contaminating recognizer2.

Fix: insert directly to DAO (bypassing publishChangeEvent) so the
governance workflow is never triggered for this unit-level test.

* fix(it): handle worker deleteByEntity path in testClaimPendingIncludesRetryStatuses

The worker's processRecord takes the delete path (removeStaleEntityById +
deleteByEntity) when resolveEntityReference returns null but entityId is
non-empty — which applies to our fake UUID test records. If the worker
wins and deletes a record, findByStatus finds nothing and the assertion
fails.

Fix: track which IDs are still visible in any status. An ID absent from
all statuses was deleted by the worker after a successful claim —
deleteByEntity is only reached after claimPending accepted the record,
so absence is equally valid proof that claimPending's SQL filter worked.

* fix(it): re-fetch before reviewer patches in test_glossaryTermReviewersMultipleUpdates

Same root cause as patch_addDeleteReviewers: GlossaryTermApprovalWorkflow
fires asynchronously after reviewers are added, setting entityStatus=IN_REVIEW.
Subsequent patches using the stale APPROVED status from the previous response
trigger a spurious IN_REVIEW→APPROVED transition, rejected because admin is
not a reviewer. Re-fetch before each subsequent patch to avoid the stale status.

---------

Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
2026-05-06 17:50:13 +00:00
Sriharsha Chintalapani
7039be31e9
Container children: search + Deleted toggle, rename tab to Containers (#27893)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (mysql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (postgresql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Container children: search + Deleted toggle, rename tab to Containers

Adds case-insensitive substring search via ?q= and a Deleted include
toggle to /v1/containers/.../children, mirroring the controls already on
the service-level Containers tab. Renames the Container detail page's
"Children" tab to "Containers" (URL slug unchanged) for consistency with
the file-system-style nested navigation.

Backend
- ContainerResource.listChildren accepts ?q=; ContainerRepository
  forwards it as a LOWER(name) LIKE ESCAPE '!' bind. '!' is preferred
  over backslash because JDBI's ColonPrefixSqlParser mishandles literal
  '\\' inside SQL string literals and silently leaves a downstream
  :includeDeleted bind un-substituted.
- Searches bypass ChildrenPageCache (cache hit rate per substring ≈ 0).

UI
- ContainerChildren.tsx adds a SearchBar and a Deleted Switch (existing
  Table component pattern); state changes reset to page 1.
- ContainerDetailUtils + ContainerDetailsClassBase render the tab with
  label.container-plural; EntityTabs.CHILDREN slug preserved.

Tests
- Backend IT: 3 new ContainerResourceIT tests cover q= substring match,
  LIKE-wildcard escaping, and per-level scoping of include=deleted (a
  soft-deleted descendant must NOT bubble up to ancestor /children
  listings).
- Playwright: 5 new tests across two describe blocks — search scoping
  vs sibling parent, Deleted toggle reveal/hide, search + toggle
  composition, plus multi-level scoping (grandparent toggle ON returns
  empty; parent toggle ON reveals the deleted leaf).
- Updated ContainerChildren.test.tsx and ContainerPage.test.tsx for the
  new include= bind on the API call and the renamed tab label.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address PR review: 5 Copilot comments on container children

- ContainerChildren.tsx: replace FQN-keyed stale-response guard with a
  monotonic token. The old guard caught navigation-between-containers but
  not overlapping fetches against the SAME container — a slow request from
  before a search/toggle change could overwrite a newer filtered result.
- ContainerChildren.tsx: reset searchValue + showDeleted on
  decodedContainerName change. Without this, a Deleted toggle ON or a
  search query carried over to the next container the user navigated to,
  making it look empty with no URL state to explain why.
- ContainerDetailUtils.tsx: honor labelMap for the CHILDREN tab in the
  dataModel-non-empty branch (was hard-coded to label.container-plural;
  the dataModel-empty branch already honored labelMap).
- ContainerDetailsClassBase.ts: override displayName for EntityTabs.CHILDREN
  in the customize-page editor — TAB_LABEL_MAP renders it as "Children"
  globally (still correct for Directory entities), but the live Container
  detail page now shows "Containers", and the editor needs to match.
- ContainerResourceIT.java: docstring for filterByQuery_escapesLikeWildcards
  referenced backslash escaping; updated to match the actual ESCAPE '!'
  implementation in buildNameLikeBind.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:09:12 -07:00
Sriharsha Chintalapani
b5374f9fec
Reindex robustness: selective fields, cache fail-fast, stop actually stops (#27876)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (mysql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (postgresql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Reindex robustness: selective fields, cache fail-fast, stop actually stops

Three independent fixes that all surfaced from the same incident: a 580k-
container reindex that froze for hours, then refused to actually stop when
the user clicked Stop.

Selective fields in the distributed reader path. PartitionWorker was
hardcoding List.of("*"), triggering every fieldFetcher in setFieldsInBulk —
including fetchAndSetOwns on Team/User where every owned entity becomes a
getEntityReferenceById round-trip. PR #27723 fixed this for EntityReader
(single-server) but the distributed pipeline never picked it up. Lifted the
field-resolution into ReindexingUtil so both paths share one source of
truth.

Cache layer no longer flaps on a single Redis hiccup. RedisCacheProvider
used to flip the whole provider unavailable on the first 300 ms timeout and
flip back on the next PING success — which combined with a 1 s health-check
made the indexer pay one timeout per cycle indefinitely. Replaced with a
sliding-window failure detector (5 failures in 30 s to trip, 3 consecutive
successes to recover) on the BulkCircuitBreaker pattern.

CacheWarmupApp parsed user config as EventPublisherJob (the SearchIndex
schema), which broke the Configuration page once cacheWarmupAppConfig.json
gained a type discriminator. Switched to CacheWarmupAppConfig in all four
parse sites and decoupled runtime status/stats from the parsed config.
Removed the readAppConfigFlags() workaround that read warmBundles /
enableDistributedClaim out of a raw map. Bails with ACTIVE_ERROR (not
COMPLETED) when an entity type is only partially warmed; retries on
transient cache unavailability instead of giving up on the first miss.

Stop actually stops. Three pieces:
- DistributedJobStatsAggregator skips the WebSocket status broadcast while
  the job is STOPPING so it doesn't overwrite the AppRunRecord.STOPPED that
  AppScheduler.updateAndBroadcastStoppedStatus pushed. Self-stops after a
  30 s grace if the executor never gets to call stop() on it.
- DistributedSearchIndexExecutor.stop() now calls workerExecutor.shutdownNow()
  after flagging workers, so threads parked inside the bulk-sink semaphore,
  initializeKeysetCursor, or waitForSinkOperations (5-min deadline) get
  interrupted instead of grinding for minutes.
- OpenSearchBulkSink replaces concurrentRequestSemaphore.acquire() with a
  60-second tryAcquire, recording permanent failure on timeout. A leaked
  bulk future (callback never fires) can no longer permanently freeze every
  subsequent flush at a fixed record count.
2026-05-04 13:22:15 -07:00
Sriharsha Chintalapani
8cec97b52c
Containers: FQN-driven hierarchy listings + cascade-delete orphan fix (#27878)
* Containers: FQN-driven hierarchy listings + cascade-delete orphan fix

Stops `?root=true&service=...` and `/containers/.../children` from leaking
deeply-nested orphans, fixes the source bug that produced them, and corrects
the 1.13.0 fqnHash pattern index opclass.

Listing path
- ListFilter.getFqnPrefixCondition now binds both <param>Hash and
  <param>HashChild ('<hash>.%' and '<hash>.%.%') so depth-aware listings
  can require "exactly one segment below the prefix" via a single LIKE +
  NOT LIKE pair on fqnHash. Same shape works at any tree depth.
- ContainerDAO.listRoot{Before,After,Count} swap the NOT EXISTS anti-join
  on entity_relationship for fqnHash NOT LIKE :serviceHashChild. The FQN
  is the canonical hierarchy in OpenMetadata; the relationship table is
  no longer consulted for hierarchical listings.
- ContainerRepository.listChildren rewritten: no parent-by-name lookup, no
  findToWithOffset/countFindTo on entity_relationship, no second-hop
  hydration. Single SQL roundtrip + slim projection via
  listDirectChildSummariesByParentHash. Orphans whose parent CONTAINS row
  is missing are now correctly placed under their FQN-implied parent.
- Both endpoints honour ?include=non-deleted|all|deleted; ChildrenPageCache
  key includes the include tag so toggling the UI Deleted switch doesn't
  return a stale page from the other side.
- ContainerResource.listChildren accepts ?include= for parity with the
  root listing.

Cascade-delete orphan source (EntityRepository.processDeletionBatch)
- Removed the redundant pre-batch-delete of relationships and the
  swallow-all try/catch in the per-child loop. cleanup() per entity now
  owns row removal AND relationship deletion atomically; exceptions
  propagate so the loop stops on first failure with per-child atomicity.
  Stops the orphan-without-relationships pattern that the listing change
  defends against.

Migration correction (1.13.0 postgres fqnHash pattern indexes)
- Recreate 23 idx_*_fqnhash_pattern indexes with text_pattern_ops instead
  of varchar_pattern_ops. The planner casts the column to text when the
  LIKE RHS is text-typed (every JDBC setString call), so
  varchar_pattern_ops doesn't match the resulting (fqnhash)::text ~~
  expression. Confirmed via EXPLAIN ANALYZE on a 580k-row table: the same
  query drops from ~470ms cold (Parallel Seq Scan) to <1ms (Index Scan).

Tests
- ListFilterTest: 3 unit tests covering both binds, dotted/quoted service
  name special-char handling, and include= flowing through alongside the
  service prefix.
- ContainerResourceIT: 8 integration tests covering depth correctness at
  every level (5-level chain), orphan exclusion at root, orphan
  discoverability under FQN-implied parent, sibling subtree isolation,
  the include toggle on both endpoints, and large-batch hard-delete
  leaving no orphan rows or relationships.

Closes #27870 (subset of its listing-side intent shipped here as a single
FQN-depth predicate; PR's cascade fix and both new tests picked up
verbatim).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review comments on #27878

- ContainerDAO.listRoot* override now defaults :serviceHashChild to '%.%.%'
  via rootListingParams() when ?service= is absent. Previous code
  unconditionally referenced the bind, so ?root=true without a service
  filter crashed at runtime with a missing-named-parameter error.
- Migration 1.13.0/postgres/schemaChanges.sql now DROP INDEX CONCURRENTLY
  IF EXISTS before each CREATE so already-upgraded environments (which
  have the original varchar_pattern_ops indexes) get the index recreated
  with text_pattern_ops on next deploy. Fresh installs see the DROP as
  a no-op. Comment block updated to record the recreate intent.
- ChildrenPageCache include tag for ALL changed from "all" to "a" so the
  CacheKeys.childrenPage Javadoc's "1-2 char" promise holds (now nd/a/d
  are all <=2 chars).
- ContainerRepository.includeToBindString Javadoc corrected: it described
  the SQL as a CASE expression, but listDirectChildSummariesByParentHash
  actually uses a three-branch OR chain.
- ListFilterTest: added test_noServiceFilter_doesNotBindServicePatterns
  as a regression guard for the missing-bind bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix java style

* Address second review pass on #27878

- EntityRepository.processDeletionBatch wraps per-child cleanup exceptions
  with entityType + entityId context before re-throwing. The exception
  still propagates (so the loop still stops, failure-semantics contract
  unchanged); operators now get a stack trace that names the row that
  blocked a large recursive delete instead of an opaque error.
- CacheKeys.childrenPage Javadoc now lists the actual include tags
  ("nd" / "a" / "d") and points at ChildrenPageCache.includeTag as the
  authoritative source. Earlier comment still mentioned "all" after the
  switch to single-letter tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Test: ?root=true without service filter end-to-end (#27878 review)

Adds test_rootListing_withoutServiceFilter_returnsRootsAcrossAllServices
to ContainerResourceIT. Creates two distinct storage services, each with
a root container and a child container, then asserts that GET
/containers?root=true (no service filter):

- Succeeds (rootListingParams() defaults :serviceHashChild to '%.%.%' so
  the SQL has its bind even when ListFilter.getServiceCondition didn't
  add it).
- Includes root containers from both services (cross-service listing
  works without a service prefix narrowing the candidate set).
- Excludes child containers from either service (depth check still
  applied via the default bind).

Regression guard for the bug Copilot's review pass flagged at
CollectionDAO.java:784: 'GET /containers?root=true (no service) crashes
at runtime due to a missing named parameter.'

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Use generated name column instead of JSON extract in container summary queries

storage_container_entity has 'name' as a STORED generated column derived
from json->>'name' (see bootstrap/sql/schema/postgres.sql). Both slim
projection queries (findContainerSummaryRows and listDirectChildSummariesByParentHash)
were redundantly extracting it via JSON_UNQUOTE(JSON_EXTRACT(...)) on MySQL
and json->>'name' on Postgres — work the database had already done at insert
time.

Reading 'name' as a column directly:
  - Saves one JSON op per row on every page fetch
  - Lets ORDER BY name sort on the indexed generated column rather than a
    per-row JSON-extracted expression

displayName, fullyQualifiedName, and description stay as JSON extracts —
they aren't generated columns. (description in particular shouldn't be:
free-text fields can be many KB and a STORED generated column would
double the row size on disk.)

Row mapper unchanged — column labels in the SELECT list still match.

* Fix inaccurate ListFilterTest comment and Javadoc link to private method

ListFilterTest: the prefix-pattern comment said the LIKE patterns 'exclude'
direct/grandchildren — patterns themselves match, the SQL's NOT LIKE is
what excludes. Rewrote to show how ContainerDAO.listRoot* combines LIKE
and NOT LIKE on the two binds.

CacheKeys.childrenPage: the @link pointed at ChildrenPageCache#includeTag
which is private static; Javadoc tooling renders that as an unresolved
link. Redirected to the public Include enum the tag is derived from.

* Log original exception in recursive batch delete catch before wrapping

Wrapping the caught RuntimeException into a new one (with entity context
in the message) preserves the original via the cause chain, but the outer
exception mapper sees the wrapper and renders a generic 500 — the original
type information doesn't surface to operators investigating a failed
delete.

Adds a LOG.error before the wrap so the original exception (with full type
and stack) lands in the logs adjacent to the entity context, giving
operators enough signal to diagnose what actually blocked the delete.

* Restore failure-semantics comment block on recursive batch delete wrap

* use Entity.SEPARATOR instead of hard-coding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix check style

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sonika-shah <58761340+sonika-shah@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
2026-05-04 18:44:42 +05:30
Sriharsha Chintalapani
2a11bd53b0
test: cover glossary CSV typed relations across unit, IT, Playwright (#27740)
* test: cover glossary CSV typed relations across unit, integration, and Playwright

Add round-trip and validation coverage for typed glossary term relations
(synonym, broader, narrower, custom types from settings) in CSV
import/export. Unit tests in CsvUtilTest exercise addTermRelations
serialization edge cases. Integration tests in GlossaryCsvRelationTypesIT
add a true export-reimport round-trip, mixed-format per-term API
verification, asymmetric relation visibility from both sides, and a
custom-relation-type round-trip via GlossaryTermRelationSettings.
Playwright specs in GlossaryImportExport.spec.ts drive the UI bulk
import/export flow with typed relations and assert the validation
surface rejects unknown relation types.
2026-05-02 18:08:56 -07:00
Sriharsha Chintalapani
72e528c695
Fix flaky IT tests: incident-id tracking + GlossaryOntologyExportIT isolation (#27867)
* Fix flaky IT tests by snapshotting create/update/patch responses in SDK
2026-05-02 17:50:11 -07:00
Sriharsha Chintalapani
620d1b6ad9
Cache audit fixes: tag rename + relationship invalidation, bundle warmup (#27864)
* Cache audit fixes: tag rename invalidation, relationship invalidation, bundle warmup

Fix two write-through cache correctness bugs and enhance the warmup app:

Bug A — Tag/Glossary/Classification rename now invalidates the cached entity
JSON for every entity tagged with the renamed tag. Adds
invalidateCacheForTaggedEntities helpers in EntityRepository that use the
search index (same source updateClassificationTagByFqnPrefix already uses) to
enumerate affected entities, then call invalidateCacheForEntity for each.
Wired into TagRepository, ClassificationRepository, and both rename + parent-
move paths in GlossaryTermRepository.

Bug B — Direct addRelationship/deleteRelationship and bulk variants now
invalidate the cached bundle/owners/domains on both sides. Bot/Domain/Data
Product (already in UNCACHED_ENTITY_TYPES) short-circuit in
invalidateCacheForEntity so cascade-heavy delete paths don't pay for Redis
ops on keys that were never written.

Warmup enhancements — new BundleWarmupBatcher pre-warms the per-entity bundle
cache (tags + certification) using batched tag_usage queries; relations stay
lazy. CacheWarmupApp adds per-entity-type checkpoint resume, opt-in
SETNX-based distributed claim for multi-instance deploys, and per-entity-type
cache.warmup.coverage / cache.warmup.bundle.coverage gauges. CacheProvider
gains scanCount; CacheMetrics gains coverage gauges and a completed_runs
counter. cacheWarmupAppConfig.json drops dead consumerThreads/queueSize and
adds warmBundles + enableDistributedClaim flags.

Tests — BundleWarmupBatcherTest covers the batcher with mocked DAO/cache.
TagRenameCacheIT and RelationshipCacheInvalidationIT cover the bug fixes
end-to-end under the cache-tests / postgres-os-redis profile.


* Update generated TypeScript types

* Address PR review: claim ownership, config wiring, coverage, keyspace

Six follow-ups against PR #27864:

- releaseClaim now does a compare-and-delete: GET the owner first and only
  DEL when it matches our instanceId. Previously a 10-min TTL could expire
  mid-warm, another instance could claim, and we'd blindly delete its lock.
- warmBundles / enableDistributedClaim are now read from the user-supplied
  app config map (matching the cacheWarmupAppConfig.json schema fields)
  instead of being JVM-system-property-only. System properties remain a
  fallback for bootstrap / unedited records.
- Per-entity-type checkpoint and claim keys now embed cacheConfig.redis
  .keyspace so two environments sharing one Redis with different keyspaces
  no longer collide on warmup metadata.
- reportCoverage now prefers scanCount over current-run success delta when
  available, so resumed runs report end-state coverage instead of the
  artificially low partial number.
- CacheWarmupApp class Javadoc updated to reflect that bundle pre-warm is
  on by default; intentionally-not-done text was stale.
- RelationshipCacheInvalidationIT Javadoc clarifies that the assertion is
  on the cacheable table side; Domain itself is in UNCACHED_ENTITY_TYPES.
- CacheProvider.scanCount Javadoc documents the O(DBSIZE) cost so callers
  don't drop it on the request path.
- EntityRepository.invalidateCacheForTaggedEntities Javadoc strengthens
  the search-lag tradeoff section with the actual fallback (entity TTL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* PR review: accept string and Boolean forms for warmup config flags

readAppConfigFlags previously only accepted Boolean.TRUE / Boolean.FALSE.
Depending on how the app config arrives (typed POJO, raw JSON, YAML
env-var override, API string body) the same logical value can land as a
"true"/"false" string — and the instanceof Boolean check would silently
ignore it and fall back to the JVM system property. Operators setting the
flag in the UI would see no effect. Added a small parseBooleanFlag helper
that handles both shapes.
2026-05-02 14:50:01 -07:00