mirror of
https://github.com/open-metadata/OpenMetadata
synced 2026-05-24 09:39:11 +00:00
235 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
28ffe76785
|
fix(alerts): guard isBot() against deleted users to avoid aborting change-event batches (#28304)
* fix(alerts): guard isBot() against deleted users to avoid aborting batches AlertsRuleEvaluator.isBot() called Entity.getEntityByName with Include.NON_DELETED without catching EntityNotFoundException. When a change event's userName referenced a user that had been deleted (common for short-lived test fixtures torn down by afterAll), the exception escaped the SpEL filter and was caught in AbstractEventConsumer.execute as "Error in polling events for alert : Entity not found: user ...". That outer catch logs the error and lets the finally block advance the offset by batchSize, silently dropping every other event in the batch. The ActivityFeedAlert subscription was hit hardest because every event flows through its isBot() filter rule. Catch EntityNotFoundException and treat an unresolvable actor as not-a-bot. * fix(alerts): null-safe getIsBot() + IT + unblock ActivityAPI spec - AlertsRuleEvaluator.isBot(): use Boolean.TRUE.equals(user.getIsBot()) so a null isBot field on the resolved user doesn't NPE on auto-unbox. - Add AlertsRuleEvaluatorResourceIT#test_isBot_returnsFalseWhenActorUserDeleted covering the original bug — create a user, evaluate isBot() (false), delete the user, evaluate isBot() again and assert it still returns false instead of throwing. - Remove test.describe.fixme from ActivityAPI.spec.ts so the spec runs again now that the underlying batch-abort is fixed. --------- Co-authored-by: Siddhant <siddhant@MacBook-Pro-751.local> |
||
|
|
dbc0d997bb
|
fix(table,dataModel): persist inline column.extension from POST/PUT-create (#28328)
POST/PUT-create with column.extension inlined in the request body never wrote a row to entity_extension — the data only landed in the table JSON, while every reader (and the PATCH original reload) looks in entity_extension via getColumnExtension(). The override clobbered the inline data with null on every read, leaving the Custom Properties tab blank and causing any JsonPatch op walking /columns/N/extension/... to fail with "no mapping for the name 'extension'" on the reloaded original. One shared single-column writer for both create and update paths: - Add EntityRepository.storeColumnExtension(UUID, Column) — the upsert that used to live inside EntityUpdater.updateColumnExtension, lifted out so the create paths can reach it. - Add EntityRepository.storeColumnExtensions(UUID, List<Column>) — single entity walker that flattens via EntityUtil.getFlattenedEntityField and calls the primitive per column. - Add EntityRepository.getColumnsForExtensionPersistence(T) hook overridden in TableRepository and DashboardDataModelRepository to return getColumns(). - Wire into createNewEntityFlush alongside storeExtension. Update path simplifications: - updateColumns now uses the shared storeColumnExtension primitive. - updateColumnExtension (the EntityUpdater-private duplicate writer) is deleted. - Existing-column branch skips the upsert when stored.extension equals updated.extension — previously every update pass rewrote identical JSON for every column with a non-null extension. - Added-columns branch now persists extensions via the walker. updateColumns' main loop skips columns with no stored match, so inline column.extension on a freshly-added column was silently dropped on PUT-update too. Regression coverage in ColumnCustomPropertiesIT covers both tableColumn and dashboardDataModelColumn inline-on-create persistence, plus the PUT-update-adds-column path. |
||
|
|
b62db6224f
|
feat(tasks): policy-driven authorization with self-approval guard (#28315)
* feat(tasks): policy-driven authorization with self-approval guard Moves Task resolve/close/reassign authorization from ~150 lines of custom Java in TaskRepository into the policy engine. Adds ResolveTask, CloseTask, ReassignTask MetadataOperation values, isTaskFiler/isTaskAssignee/isTaskReviewer SpEL conditions, and a new TaskAuthorPolicy seed. Closes the self-approval gap where a filer who was also in the assignees list could approve their own task (now denied via deny rule). TaskResourceContext.getOwners now returns target entity owners so isOwner() retains its conventional meaning; v200 migration backfills the new policy attachment on the DataConsumer role for upgrades. |
||
|
|
81d30a7498
|
Fix Failing LineageImpactAnlaysisTest (#28340) | ||
|
|
7042153e32
|
feat(context-center): search indexing + vector body text for memories/pages (#28314)
* feat(context-center): enable search indexing + vector body text for memories/pages ContextMemory was indexable in the schema layer but supportsSearch was false, so live indexing and bulk reindex did not include it. Vector embeddings for ContextMemory and Page fell through to the default description-only body text extractor, which produced near-empty embeddings since the actual content lives in title/question/answer (ContextMemory) and displayName/page payload (Page). Changes: - Add ES/OS index mapping for context_memory_search_index across en/ru/zh/jp - Register contextMemory in indexMapping.json with parentAliases=[all] - ContextMemoryIndex (TaggableIndex) flattens shareConfig into visibility + sharedWithIds, normalizes source UUIDs, and populates entity refs with display names - Wire SearchIndexFactory.buildIndex() + flip ContextMemoryRepository supportsSearch=true so create/update/delete fire live indexing - Flip supportsSearchIndex=true in ContextMemoryIT to inherit BaseEntityIT's 4 search-index tests - ContextMemoryBodyTextContributor concatenates title/summary/question/answer/ description for the vector embedding instead of just description - PageBodyTextContributor adds title (displayName) and, for QuickLink pages, the destination URL alongside the markdown description - Register both contributors via static initializers in their owning EntityRepositories, per the VectorBodyTextContributor convention Tests: 25 new unit tests across ContextMemoryIndexTest (10), ContextMemoryBodyTextContributorTest (6), PageBodyTextContributorTest (9). All passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(context-center): address Copilot review feedback on indexing PR - PageBodyTextContributor: fall back to page.getName() when displayName is null/blank so vectors always have a title (matches the convention in SearchIndex.populateCommonFields) - PageBodyTextContributor: log the exception object (not e.getMessage()) so the stack trace is available when debug logging is on - ContextMemoryIndex: null-guard each principal entry in shareConfig.sharedWith before dereferencing, so a malformed payload cannot NPE the indexer Added 2 tests covering both behaviors; existing tests adjusted for the new title-fallback default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
04e8ed1a1f
|
Improve Vector Embedding recalculation logic (#28313)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Remove Vector Embedding recalculation logic * Add test * Address Review Comments * Address comment * ADdress review comments * Fix VectorEmbeddingIntegartionIT * More review addressed |
||
|
|
fd8d66c899
|
perf(glossary): batch related-term hydration via /glossaryTerms/byIds (fix N+1) (#28279)
* perf(glossary): batch related-term hydration via new /glossaryTerms/byIds endpoint
Sentry on release-1-13 flagged an N+1 on the Glossary Term Relations Graph
tab (transaction /glossary/.../relations_graph, p95 ~1.4s) — eight
sequential GET /api/v1/glossaryTerms/{id}?fields=relatedTerms,children,
parent,owners calls at ~180ms each. Two recursive resolution loops in
useOntologyExplorer.ts (loadNextTermPage:658-683 and fetchGraphDataFromDatabase:822-847)
fan out per-Id getGlossaryTermsById calls to hydrate cross-glossary
related terms after the initial paginated load, recursing up to 5 levels
deep. The customer hit a depth-1 cascade that produced ~8+ HTTP round
trips for a single page visit.
Adds GET /v1/glossaryTerms/byIds?ids=u1,u2,...&fields=... that returns a
single hydrated List<GlossaryTerm>, capped at 200 ids per request to
stay well under URL length limits and to isolate a single bad Id to one
batch. Missing/deleted/unauthorized Ids are silently dropped, matching
the old Promise.allSettled semantics so callers don't need to change
their error handling. Both resolution loops now call the batch endpoint
once per BATCH_SIZE (100) chunk instead of fanning out per-Id; depth-1
goes from 8 round trips to 1.
Tests: backend IT covers happy-path, fields hydration, silent-skip of
missing ids, and empty-input semantics. Playwright spec opens the
Relations Graph tab on a term with a cross-glossary relation and
asserts zero per-Id /glossaryTerms/{id}?fields=relatedTerms... requests
fire — failing if anyone re-introduces the resolution N+1. The new
batch endpoint is asserted to be called at least once, evidencing the
new path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(glossary): address review feedback on byIds endpoint and N+1 fix
Reviewers (gitar-bot + Copilot) raised 7 findings across the backend, the
client hook, and the Playwright spec. Addressed all:
Backend (GlossaryTermResource.byIds):
- Catch AuthorizationException alongside EntityNotFoundException so a
single unauthorized id can't 403 the whole batch — matches the
documented "silently omit missing/unauthorized" contract.
- Switched the OpenAPI response schema from a single GlossaryTerm to
@ArraySchema so generated clients see the correct array shape.
- Dropped `required = true` on the `ids` query param: the implementation
tolerates blank/missing (returns []) and the IT pins that behavior, so
the spec was lying. Description now states the contract explicitly.
Backend tests:
- Added the two negative tests the PR description claimed but the file
was missing: malformed UUID -> 400 and >200 ids -> 400.
Frontend (useOntologyExplorer resolution loops):
- On a whole-batch failure, set `aborted = true`, break out of the
current chunk loop, and clear missingIds before the next pass. Before
this, the same failing batch was silently retried up to
MAX_RESOLUTION_DEPTH - 1 more times for no benefit.
Playwright (GlossaryRelationsGraphPerf):
- Authenticate the new page via AdminClass.login(page) before the
request listener attaches; previously `browser.newPage()` + a
separate `performAdminLogin(browser)` left the test page unauth'd, so
the request listener never saw the API calls and the spec hung on
the wait-for-response timeout.
- Fixed the `relatedTerms` JSON-Patch shape (it stores TermRelation
objects, not bare EntityReferences). With the old shape the relation
never landed, the resolution loop never fired, byIds was never
called, and the spec hit a wait-for-response timeout (the 1.2-minute
retries observed in CI).
- Replaced the dual `/rdf/glossary/graph` OR `/glossaryTerms/byIds`
signal with byIds-only: for `scope === 'term'` the rdf graph endpoint
isn't called even when rdfEnabled, so listening for it just added
flake. Bumped the timeout to 60s for cold-CI runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(glossary): IT failures on byIds — wrong JSON path + URL header limit
Two integration tests added in the previous commit failed in CI:
1) getGlossaryTermsByIds_honorsFieldsParam_hydratesRelatedTerms
The assertion read `relatedTerms.get(0).path("id")` but glossary term
relatedTerms is a List<TermRelation> serialized as
{relationType, term: {id, ...}} — the id is nested under `term`, not
at the top of the array element. Fixed to `.path("term").path("id")`.
2) getGlossaryTermsByIds_tooManyIds_returns400
Sending 201 UUIDs in a query string puts the URL at ~7.5 KB which
trips Jetty's 8 KB request-header limit; the request was rejected
with 431 Request Header Fields Too Large before reaching the
server-side cap check, so the test never saw the documented 400.
Two-part fix:
- Lower MAX_BATCH_BY_IDS from 200 to 100. 100 * 37 chars per UUID +
separators is ~3.7 KB, well below 8 KB. This also matches the
client's BATCH_SIZE in useOntologyExplorer.ts (so the client can
now use the whole window without hitting the cap defensively).
- Test uses 101 ids (still tiny URL) so the cap check actually
fires and returns the documented 400.
Updated Javadoc and the client-side BATCH_SIZE comment to reflect
the new alignment.
The python failure on test_validations_datalake.py reproduces across
3.10/3.11/3.12 in the same parameterized case and is unrelated to the
glossary changes in this PR — pre-existing on main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(glossary): address two new review comments on byIds
1) glossaryAPI.ts comment said the batch endpoint supports up to 200 ids
but the backend cap was lowered to 100 in the previous commit. Updated
the comment to match (and link the rationale — 100 keeps the URL
under Jetty's 8 KB header limit).
2) The two 400-response IT tests asserted `contains("400") || contains
(<substring>)`, which would let a 500 with "invalid" or "too many"
in the response body silently pass. Tightened both to require BOTH
the HTTP 400 status AND the expected message substring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(glossary): assert HTTP 400 by exception type, not message substring
The previous tightening required `e.getMessage().contains("400")`, but
the SDK's OpenMetadataHttpClient.handleErrorResponse puts ONLY the
error body in the exception message (no "HTTP 400" prefix in the parsed
path), so the assertion failed on real 400 responses with bodies like
"ids parameter contains an invalid UUID" / "Too many ids: 101 (max 100)".
Use assertThrows(InvalidRequestException.class, ...) instead — the SDK
throws InvalidRequestException ONLY for HTTP 400 (other statuses surface
as ApiException or status-specific subclasses like ForbiddenException),
so the type assertion locks the status code as strongly as a body
substring check would. Substring check stays for the body content.
Removes the no-longer-used `Assertions.fail` static import.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(glossary): address four review comments on byIds
Backend (GlossaryTermResource.java):
- Add includeRelations query param for parity with GET /{id} so callers
can pass per-relation include controls (owners:non-deleted, etc.)
through the batch path. Forwards to the existing 6-arg getInternal.
- Add a catch (RuntimeException) per-id so unexpected failures
(validation, downstream 5xx surfaced as WebApplicationException, etc.)
don't fail the whole batch. EntityNotFoundException /
AuthorizationException stay on debug; the broader catch logs at warn
so a real bug isn't silently swallowed.
Frontend (useOntologyExplorer.ts):
- Extract the duplicated related-term resolution loop into a top-level
`resolveRelatedTerms(terms)` helper. Both call sites
(loadNextTermPage, fetchGraphDataFromDatabase) now do
`await resolveRelatedTerms(...)`, ~100 fewer LoC and no risk of the
two implementations drifting.
- Change the failure semantics from "abort the whole resolution on
first batch failure" to "remember the failed Ids in a skip set, keep
going". This restores best-effort hydration (matching the old
Promise.allSettled behavior on the client) without falling back into
the gitar-bot-flagged retry-the-same-batch-MAX_DEPTH-times footgun:
the skip set causes collectMissingRelatedTermIds to never hand the
same Ids back on subsequent depth passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(glossary): pin cross-glossary relation surfacing in graph + list paths
User reported the glossary graph endpoint returning partial data (focused
term + a sibling, both marked glossaryTermIsolated, edges = []) when the
focused term has cross-glossary related terms. Adds two ITs that
exercise the exact scenarios.
1) GlossaryTermResourceIT#listGlossaryTerms_hydratesCrossGlossaryRelatedTerms
Hits GET /v1/glossaryTerms?glossary=<A>&fields=relatedTerms with a
focused term in glossary A pointing at a related term in glossary B.
Asserts BOTH the single-entity GET and the bulk-list endpoint return
the cross-glossary relation — pins the bulk hydration path
(GlossaryTermRepository.setFieldsInBulk -> fetchAndSetRelatedTerms)
against the single-entity path (setFields -> getRelatedTerms) as
producing the same shape.
2) RdfGlossaryGraphIT#crossGlossaryRelationSurfacesAsEdgeAndNodeInScopedGraph
Creates focused (in glossary A) + relatedAcross (in glossary B),
adds the relation, hits GET /v1/rdf/glossary/graph?glossaryId=<A>,
asserts the response contains: focused NOT marked
glossaryTermIsolated, relatedAcross as a secondary node, edge
between them.
Both tests pass against current main on a clean DB, which means the
user's reported failure mode does not reproduce from a freshly seeded
fixture. Likely causes:
- Stale RDF data in the user's deployment (Fuseki has the terms but
is missing the relation triples; SPARQL returns nodes but 0 edges
and the nodes.isEmpty() fallback in RdfRepository.java:1618
doesn't fire because nodes ARE present)
- The deployment may be on a code version that predates the canonical
relation storage fix from PR #25886
- User-specific relation type / glossary structure not captured here
The tests now stand as regression guards: any future change that
breaks cross-glossary surfacing on either path will fail CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Revert "test(glossary): pin cross-glossary relation surfacing in graph + list paths"
This reverts commit
|
||
|
|
09af9fc801
|
Fixes #4003: bulk + async restore for large entity hierarchies (#27997)
* fix(restore): bulk + async restore for large entity hierarchies EntityRepository.restoreEntity walked descendants synchronously, taking 4+ minutes on a 12k-table database and exceeding typical proxy timeouts. restoreChildren now groups CONTAINS children by type and dispatches one bulkRestoreSubtree per type, batching DB writes, version history, change events, and cache invalidation; the existing ES cascade handles descendant index updates in one update_by_query. Adds an async option (?async=true) on the deep-hierarchy restore endpoints that returns 202 Accepted with a job id and runs the restore on AsyncService, emitting WebSocket notifications on restoreEntityChannel. Java SDK adds .restore().async().execute() fluent builders on Tables/Databases plus restoreServerAsync on EntityServiceBase; Python SDK mirrors this with restore_request().with_async().execute() and restore_async() helpers on BaseEntity, exposing a new AsyncJobResponse type. Tests: EntityRepositoryRestoreTest verifies the per-type grouping and bulk dispatch path; RestoreFluentAPITest covers the Java SDK fluent behavior; RestoreHierarchyIT exercises sync and async restore against a real DB→schemas→tables tree end-to-end; test_restore_async.py covers the Python SDK paths. Fixes #4003 |
||
|
|
86e4127e9b
|
feat(context-center): /v1/contextCenter namespace + Java/Python SDK support (#28237)
* refactor(context-center): move endpoints under /v1/contextCenter namespace
Renames the Context Center API paths to disambiguate from the external Drive
Service connector. Singular /v1/drive/* and plural /v1/drives/* were trivially
easy to confuse; the new prefix makes namespace ownership obvious.
/v1/drive/files -> /v1/contextCenter/files
/v1/drive/folders -> /v1/contextCenter/folders
/v1/knowledgeCenter -> /v1/contextCenter/pages
Drive Service endpoints (/v1/drives/*, /v1/services/driveServices) and the
generic /v1/attachments endpoint are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(context-center): nest drive files/folders under /drive sub-namespace
Keeps the Drive concept visible in the URL while staying under the
contextCenter prefix.
/v1/contextCenter/files -> /v1/contextCenter/drive/files
/v1/contextCenter/folders -> /v1/contextCenter/drive/folders
Pages stays at /v1/contextCenter/pages (not a Drive concept).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(context-center): add Java + Python SDK support for files, folders, pages
Wires the Context Center entities into both SDK fluent surfaces so callers do
not have to hit /v1/contextCenter/drive/files, /v1/contextCenter/drive/folders,
or /v1/contextCenter/pages directly.
Java (openmetadata-sdk):
* ContextFileService + PageService alongside the existing FolderService.
Standard CRUD comes from EntityServiceBase; entity-specific methods
expose move (files), vote / follower / hierarchy (pages), and folder
contents (folders).
* Folders, ContextFiles, Pages fluent wrappers with Creator and Finder
builders, plus static convenience methods.
* OpenMetadataClient.folders() / contextFiles() / pages() accessors.
* OM.Folder / OM.ContextFile / OM.Page entry points.
Python (ingestion):
* Folders, ContextFiles, Pages entity facades over BaseEntity for CRUD,
list, retrieve, and search via the existing ometa client.
* Top-level metadata.sdk re-exports + lowercase aliases.
Binary download and multipart upload are intentionally not exposed yet — those
endpoints need streaming / multipart support that the SDK HTTP layer does not
currently provide. Page voting / follower / hierarchy operations are Java-only
for the same reason (no underlying ometa methods).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* style: apply spotless line wrap in DriveFileUploadIT
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix playwright test
* fix(context-center): route entity delete + playwright regex through new path
DeleteWidgetClassBase still returned the old 'knowledgeCenter' URL segment for
KnowledgePage / KnowledgeCenter entity deletes, so the UI was issuing
DELETE /api/v1/knowledgeCenter/{id}?hardDelete=... — which the server no
longer serves, returning 404. ContextCenter playwright spec also had a leftover
escaped regex (/\/api\/v1\/knowledgeCenter\/.../) matching the same legacy URL.
Point both at /api/v1/contextCenter/pages so the Knowledge Center / Context
Center delete flows reach the server again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(context-center): re-align playwright delete matchers with new path
A prior fix-up commit (
|
||
|
|
0d5792e2ed
|
refactor(data-insights): isolate enricher step failures + per-step EntityStats (#28264)
* fix(data-insights): resolve owner by id in processTeam to handle bare historical refs * refactor(data-insights): isolate enricher step failures + per-step EntityStats * test(data-insights): add enricher behavior IT + rename equivalence IT * refactor(data-insights): extract VersionResolver + SnapshotMaterializer * refactor(data-insights): extract enricher concerns into step classes + bounded owner cache + rate-limited loss warns * refactor(data-insights): apply review feedback — tighten enricher docs, drop dead code * docs(data-insights): drop unsupported 'intentional' claim from TierStep prefix-match docstring * fix(data-insights): guard VersionResolver against null updatedAt on historical version rows * refactor(data-insights): tighten VersionResolver walk — consume isFirst at top, drop historyDone flag |
||
|
|
d744e397da
|
fix(alerts): strict literal matching across alert filter functions (#27987)
* fix(alerts): make matchAnyEntityFqn match FQNs literally, not as regex * fix(alerts): strict literal matching across alert filter functions * fix(policy): AST-based SpEL validation so string-literal content is not parsed as syntax * fix(data-insight): evaluate chart formulas outside the policy/alert validator * test(playwright): ObservabilityAlerts must pick filter options by FQN, not bare name * test(playwright): focus combobox before fill in alert action input loop * test(playwright): extend inputValueId pattern to alert action inputs * refactor(alerts): move UUID regex to constants and use size variable |
||
|
|
b1781f7324
|
feat(workflow): add inputPorts, outputPorts, glossaryTerms as WorkflowTriggerFields (#28120)
* feat(workflow): add inputPorts, outputPorts, glossaryTerms as WorkflowTriggerFields - Add inputPorts, outputPorts, glossaryTerms to workflowTriggerFields.json enum - Unit tests in FilterEntityImplTest verifying all three fields pass the passesFieldBasedFilter check, including include/exclude config - Integration test scenario appended to test_CustomApprovalWorkflowForNewEntities: adds table as inputPort, removes it, adds as outputPort — each port change must produce an approval task proving the workflow trigger fires correctly * Update generated TypeScript types * fix(workflow): fix inputPorts/outputPorts changes not triggering governance workflows - Add inputPorts, outputPorts, glossaryTerms to WorkflowTriggerFields enum - Fix executeBulkPortsOperation to update entity changeDescription so FilterEntityImpl reads the correct changed fields - Add governance-bot impersonation to applyPatchEntityFieldAction to prevent entityStatus updates from triggering spurious workflow signals - Pass caller username through port endpoints to fix self-approval check incorrectly removing the reviewer when entity updatedBy was set to the reviewer from a prior task resolution --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> |
||
|
|
d388a5264b
|
Fixes 28267: Preserve test case suite search membership (#28271)
* Fix test case suite search hydration * Stabilize test case repository unit test |
||
|
|
fc74d08cb0
|
fix(deps): remove explicit jersey version pins to inherit 3.1.11 from BOM (#28106)
jersey-client and jersey-apache-connector were pinned to 3.1.9 in openmetadata-sdk and openmetadata-integration-tests. The root pom.xml jersey-bom already manages all org.glassfish.jersey.* artifacts at 3.1.11. Removing the explicit pins lets both modules inherit 3.1.11 from the BOM, consistent with the rest of the project. |
||
|
|
7485c5b421
|
feat: add ContextMemory entity (Context Center memories) (#28224)
* feat(spec): add ContextMemory + CreateContextMemory JSON schemas * feat(jdbi3): add ContextMemoryDAO * feat: register contextMemory entity type constant * feat(service): add ContextMemory repository, resource, mapper * feat(bootstrap): add context_memory table DDL * test(service): ContextMemory resource CRUD test * fix(context-memory): address review (relationship types, stable FQN, status msg, test name) - storeRelationships: rootMemory -> Relationship.CONTAINS, parentMemory -> Relationship.HAS so the root-ancestor and direct-parent hierarchies are distinguishable. - setFullyQualifiedName: derive from the immutable name only (drop mutable primaryEntity/owner derivation that destabilized nameHash on update). - validateStatusTransition: separate "no transitions defined" from "disallowed transition". - Rename ContextMemoryResourceTest -> ContextMemoryStatusTransitionTest (pure unit test). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(context-memory): add ContextMemoryIT + SDK ContextMemoryService Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(spec): register contextMemory in EntityLink.g4 ENTITY_TYPE grammar EntityLinkGrammarTest.testAllEntityTypesHaveGrammarOrExclusion enumerates every Entity.java constant and requires each to be in the EntityLink grammar or the test's exclusion list. ContextMemory is a normal EntityRepository-backed top-level entity (like learningResource / contextFile), so it belongs in the ENTITY_TYPE rule. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(context-memory): override owner ITs for creator-as-owner default ContextMemoryMapper.defaultOwners() intentionally assigns the creating user as owner when the create request omits owners. BaseEntityIT's patch_entityUpdateOwner_200 and patch_entityUpdateOwnerFromNull_200 assert "no owner initially" for any supportsOwners entity, so both failed for ContextMemory. Override both in ContextMemoryIT: keep the PATCH-replace-owner contract, change only the precondition to expect the creator as the sole initial owner (asserted by count, not a hardcoded principal). Mapper unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Update generated TypeScript types Add the generated ContextMemory TS types (entity/context/contextMemory.ts, api/context/createContextMemory.ts). The schemas were on the branch but their generated types were missing, failing the TypeScript Type Generation check on this fork PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(context-memory): address review (relationship cleanup, owner scope, validations) Copilot review on the ContextMemory entity: - #1 record primaryEntity/relatedEntities/root/parent/source*/machineRepresentation in version history; usageCount/lastUsedAt documented as untracked telemetry - #2 clear stale HAS/RELATED_TO/CONTAINS edges before re-adding in storeRelationships - #4 default creator as owner only on create; PUT without owners no longer silently replaces previously set owners - #5 schema documents that any status is allowed at creation; transitions enforced only on update - #6 setFullyQualifiedName via FullyQualifiedName.build with skip-if-set guard - #7 validate shared principal type is user/team/domain - #8 reject self-reference for parentMemory/rootMemory - #10 inline Entity.CONTEXT_MEMORY, drop redundant constant Regenerate ContextMemory TS types for the schema doc change; add IT coverage for the self-reference and invalid-shared-principal validations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(context-memory): don't blanket-delete relationships (domain data loss) The #2 cleanup via deleteTo(memory, CONTEXT_MEMORY, HAS, null) also matched the framework's domain --HAS--> memory edge (storeDomains runs before storeRelationships in storeRelationshipsInternal, on every create and update), silently dropping domain assignments. storeRelationships is now add-only (addRelationship upserts, so re-running on update is idempotent). Stale-edge cleanup moved to ContextMemoryUpdater using the framework's updateFromRelationship(s) helpers, which delete only the specific changed refs and record the version change. parentMemory now uses Relationship.PARENT_OF (distinct from primaryEntity's HAS and the framework's domain HAS) so the parent edge can be maintained without collision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(bootstrap): move context_memory DDL from 2.0.1 to 2.0.0 The context_memory table belongs in the 2.0.0 migration. Relocated the MySQL and Postgres DDL verbatim; the 2.0.1 schemaChanges.sql files are restored to their original task_migration_mapping-only content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(bootstrap): add ENGINE=InnoDB to context_memory MySQL DDL Explicit engine clause, consistent with the task/search-index tables in the same migration and robust to any server default change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(context-memory): preserve sanitized/validated fields; validate relatedEntities Review follow-ups: - ContextMemoryMapper no longer re-sets description/owners/domains/tags/displayName after copy(). copy() sanitizes description (stored-XSS) and validates owners and domains; re-setting the raw request values bypassed both. Only ContextMemory- specific fields are set now. - prepare() now assigns the result of EntityUtil.populateEntityReferences back onto relatedEntities so orphaned/invalid refs are filtered instead of persisted. - ContextMemoryIT Javadoc now references ContextMemoryRepository#setCreatorAsDefaultOwner (the defaultOwners mapper method no longer exists). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
d6cec7ed1c
|
Fixes #28229: cascade Table certification PATCH to child search docs (#28236)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Maven Collate Tests / maven-collate-ci (push) Waiting to run
Java Checkstyle / java-checkstyle (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Fixes #28229: cascade Table certification PATCH to child search docs When a Table's certification is added, changed, or removed via PATCH the existing cascadeCertificationToChildren path never fired because SearchRepository.requiresPropagation returned false on a cert-only ChangeDescription. TableRepository did not list certification in its propagation descriptors, so the gate stayed closed and the DQ dashboard's Certification filter kept returning the stale cert on test_case / test_case_result / test_case_resolution_status / test_suite / column docs until a full reindex. Add an EXTERNAL_HANDLER PropagationType for fields whose cascade is driven by a dedicated SearchRepository handler instead of the generic descriptor-driven script (cert needs full-object replace on add/update and explicit removal on delete, which RAW_REPLACE can't express because it restores the old value on delete). Register certification with this type on TableRepository so the gate opens and the existing cascadeCertificationToChildren method runs. Add no-op cases in the three appendAdd/Update/DeleteScript switches so the new enum value doesn't accidentally trigger generic auto-propagation. |
||
|
|
0ab31cb647
|
fix(rdf): reclaim Fuseki disk via compaction + upgrade Jena 4.10 → 5.6.0 (#28242)
* fix(rdf): reclaim Fuseki disk via compaction + upgrade Jena 4.10 → 5.6.0 PR #28117's SPARQL cleanup converged the logical RDF state but never freed disk: TDB2 deletes only mark blocks free and the journal grows monotonically until /$/compact runs. RdfIndexApp.clearRdfData() now calls a new RdfStorageInterface.compactStorage() between clearAll() and reloadOntologies() so each recreate run reclaims to a fresh dataset directory. JenaFusekiStorage posts to /$/compact/{dataset}?deleteOld=true and polls /$/tasks/{id} until finished, with failures logged and swallowed (disk hygiene, not correctness). Also unifies the Jena classpath: openmetadata-service was on 4.10.0 and openmetadata-integration-tests on 5.0.0. Both now pin to 5.6.0 via a single root pom property, dropping the apache-jena-libs BOM in favour of explicit jena-core/arq/rdfconnection deps (we're a remote-Fuseki client and never embed TDB; pulling jena-tdb1/2 triggers a Jena 5/6 static-init regression). Picks up CVE-2025-49656 and CVE-2025-50151 (admin-side fixes shipped in Jena 5.5.0). Jena 6.x parked: both 6.0.0 and 6.1.0 hit a recursive clinit bug where TypeMapper.reset reads RDF.dtLangString before RDF.<clinit> completes. Fuseki server bumped 4.10/5.0 → 5.6.0 across all in-repo Dockerfiles; the unmaintained stain/jena-fuseki:* image references in dev compose files switched to building from docker/rdf-store/Dockerfile, and Testcontainers moved to secoresearch/fuseki:5.5.0 (maintained, CVE-fixed; the dataset is created by JenaFusekiStorage.ensureDatasetExists() so the stain-only FUSEKI_DATASET_1 env var is no longer needed). |
||
|
|
5c53151d16
|
Fixes #24294: support re-parenting a Container via PATCH (#28201)
* feat(container): support re-parenting a Container via PATCH (#24294) Allow the PATCH API to update a Container's `parent`, cascading the FQN change to every descendant container, nested column FQN, tag-usage row, entity-link, policy condition, and search-index document — same shape as GlossaryTerm re-parenting. Scoped to same-StorageService moves; cross-service parents are rejected with HTTP 400. Adds parent-aware fluent SDK methods in Java (`Containers.under(...)`, `FluentContainer.withParent(...)`/`withoutParent()`) and Python (`Containers.set_parent`, `Containers.clear_parent`), unit tests for the validation logic, 11 integration tests covering the cascade and rejection paths, and 9 + 2 SDK tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(container): reject re-parent PATCH on oversized subtrees (#24294) The single-transaction cascade in updateParent locks every descendant row in storage_container_entity, rewrites their JSON, renames their tag_usage rows, and issues an Elasticsearch updateByQuery — all while holding row locks that block any concurrent write on the subtree. At ~10k+ descendants this becomes a multi-minute outage on the cluster. Add an indexed COUNT(*) preflight in ContainerUpdater.updateParent that short-circuits with HTTP 400 when the moved subtree exceeds openmetadata.container.maxReparentDescendants (default 10000), pointing the operator at the system property if they have measured the impact and accept it. Run BEFORE invalidateCacheForRenameCascade so a rejected request pays no cache-eviction cost. Tests: 5 new unit tests in ContainerRepositoryParentValidationTest cover under/at/over-limit and the system-property override; 2 new IT methods in ContainerResourceIT exercise the end-to-end reject path with a test-scoped low threshold and confirm at-limit moves still succeed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(container): address PR #28201 review — perf, cycle detection, IT race (#24294) Four fixes from the gitar-bot review: 1. Performance — validateContainerParent short-circuits when the proposed parent id matches the original. Skips an unnecessary Entity.getEntity round-trip on every container PATCH/PUT that doesn't touch the parent field (description edits, tag adds, etc.). 2. Cycle detection — second-line ID-based ancestor-chain walk added in ContainerUpdater.validateAncestorChainCycle. Uses relationshipDAO.findFrom (direct DB) so a descendant with a briefly stale FQN can't bypass the FQN-prefix check. Visited-set bounded. 3. IT race condition — drop System.setProperty in patch_containerParent_rejectsOversizedSubtree_400 and patch_containerParent_allowsMoveAtConfiguredLimit_200. Add a package-accessible test-override field (set/clear via public static methods) plus @ResourceLock(MAX_REPARENT_DESCENDANTS_TEST_LOCK) to serialize any test that mutates the override, even though the class runs methods concurrently. 4. SQL build comment — document why ContainerDAO.updateFqn interpolates values via String.format (mirrors EntityDAO pattern, FQN values are server-computed, escapeApostrophe handles the only SQL metacharacter that can appear in a validated entity name). Tests: ContainerRepositoryParentValidationTest extended to 12 cases (adds parent-unchanged short-circuit assertion + override-priority coverage). Full ContainerResourceIT still 255/255. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(container): address PR #28201 copilot review (#24294) Two findings from copilot-pull-request-reviewer: 1. setFields perf regression — restored the conditional fields.contains(FIELD_PARENT) ? getContainerParent(c) : c.getParent() guard. The unconditional load forced an extra entity_relationship lookup on every container GET, which is a measurable regression on the hot path. The PATCH flow still loads parent because CONTAINER_PATCH_FIELDS includes parent (so fields.contains is true there). Full ContainerResourceIT still 255/255. 2. updateEntityLinks shallow walk — previously only iterated direct children, leaving deep descendants' (grandchildren+) legacy feed thread entityLinks pointing at the old FQN and breaking activity- feed navigation after a multi-level move. Now takes the renamedContainers snapshot captured by invalidateCacheForRenameCascade and rewrites each descendant's entityLink by swapping the FQN prefix (oldFqn → newFqn) — consistent with the same prefix-substitution ContainerDAO.updateFqn applies to the JSON/fqnHash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b69a4026fd
|
feat(context-center): sortBy on lists, drive-file move, IT coverage (#28096)
* feat(context-center): add sortBy, drive-file move, and IT coverage
- Add `sortBy`/`sortOrder`/`offset` to the article list (`/v1/knowledgeCenter`)
and drive-file list (`/v1/drive/files`). When `sortBy` is set, the request is
routed through `listInternalFromSearch` against OpenSearch; cursor pagination
(`before`/`after`) is rejected with HTTP 400 to keep the contract explicit.
Supported values: `name`, `createdAt` (aliased to `updatedAt` for v1),
`updatedAt`. Default `sortOrder` is `desc` when `sortBy` is set.
- Add `PUT /v1/drive/files/{id}/move` for moving a drive file between folders
or to the drive root. Sync handler (ContextFile is a leaf, no FQN cascade).
Body: new `MoveContextFileRequest` schema with an optional `folder`
EntityReference.
- Fix `ContextFileUpdater.entitySpecificUpdate` so folder changes are recorded
in the change description and the `CONTAINS` relationship edge is rewired
to the new folder. Without this, the move would silently update JSON only.
- Integration tests in `ContextFileIT` (move between folders, move to root,
non-existent folder, unprivileged-user 403, sort by updatedAt/name, cursor
combo rejection) and `KnowledgeCenterIT` (sort by updatedAt, name, createdAt
alias, cursor combo rejection).
|
||
|
|
d3430d6abb
|
fix(logging): non-blocking streamable log handler + split flush executors (#28198)
* fix(logging): non-blocking streamable log handler + split flush executors
Client side (ingestion):
- Rewrite StreamableLogHandler as fire-and-forget: producer never blocks,
bounded buffer drops on overflow, daemon drainer + single daemon sender
thread, thread-local recursion guard, isolated REST client (own
Session/connection pool), no-wait close() that orders /close behind
in-flight batches via a sentinel marker.
- Add quiet client path: REST.post_best_effort and OMetaLogsMixin's
send_logs_batch_best_effort / send_close_best_effort. Bypasses the
retry/sleep loop, the connection-error second-try, and all
ometa_logger emissions (which would otherwise feed back into the
streamable handler attached to the metadata logger).
- Backwards-compatible timeout=/retries= kwargs threaded through
REST._request / REST.post for callers that want per-call overrides
without going through the best-effort path.
Server side:
- Split the single s3-log-cleanup ScheduledExecutorService into two:
s3-log-partial-flush (the path that gets logs onto S3) and
s3-log-abandoned-cleanup. A stuck cleanup task can no longer starve
partial.txt writes.
- Wrap scheduled tasks with safeScheduledTask(...) so a Throwable from
one tick can never silently de-schedule the executor.
- StreamableLogsMetrics: add lastPartialFlushTimestamp,
partialFlushHeartbeat, abandonedCleanupHeartbeat, pendingStreamsCount,
pendingFlushBytes, pendingFlushLines, flushFailuresCounter. Wired
into writePartialLogs / writePartialLogsForStreamLocked /
cleanupAbandonedStreams / recordFlushFailure.
Tests: 35 Python unit tests (overflow non-blocking, recursion guard,
saturated drop, daemon threads, isolated session, no-wait close,
close ordering, PR #28160 regression) and 46 S3LogStorageTest unit
tests (executor split, safeScheduledTask Throwable guard, per-stream
failure does not poison the loop, metric updates) — all passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(logging): address PR #28198 review — auth race, close ordering, visibility
Three review findings:
1. Race: _build_request_headers no longer refreshes the auth token. The
streamable handler shares ClientConfig with the main metadata client,
so two threads could concurrently call _auth_token() and mutate
config.access_token / config.expires_in. The sender thread now reads
the current access_token only; refresh remains the exclusive
responsibility of REST._request() on the main thread. Avoids
double-refresh against rate-limited / single-use refresh tokens and
the TOCTOU on expires_in.
2. close() race: _stop.set() is now called AFTER _drain_remaining_buffer
and the CLOSE_MARKER enqueue. Previously the sender's get(timeout=1s)
could return Empty in the narrow window between set() and enqueue,
hit the if-stop-return branch, and exit without ever processing the
marker — silently dropping the /close notification under light load.
3. Visibility: safeScheduledTask is now private. It's an internal
scheduler-boundary wrapper, not part of the class API.
Regression tests:
- test_build_request_headers_does_not_refresh_token
- TestCloseOrderingRegression.test_close_enqueues_marker_before_setting_stop
- TestCloseOrderingRegression.test_close_marker_arrives_even_when_sender_was_idle
All 60 Python tests and 46 S3LogStorageTest tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(logging): trim verbose code comments + PR-number references
Strip the design-doc-style docstrings and inline narration that crept
into the previous commits. Keep only comments where the WHY is not
obvious from the code: lock-ordering rationale in close(), the auth
non-refresh rationale in _build_request_headers, the executor-split
rationale in S3LogStorage. Everything else (what the function does,
what counter increments, "regression for PR #N", numbered design
contracts) is removed — that context belongs in PR history, not source.
No behavior change. 44 Python tests + 46 Java tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(logging): add type annotations on new kwargs flagged by basedpyright
The static-check warning shown on the PR view (reportMissingParameterType
on `timeout=None` and `client=None`) for the new fire-and-forget log path
kwargs. Adding the explicit Optional[Union[float, tuple[float, float]]] /
Optional[int] / Optional[REST] annotations to client.post,
client.post_best_effort, client._request, OMetaLogsMixin.send_logs_to_s3,
send_logs_batch, send_logs_batch_best_effort, send_close_best_effort.
logs_mixin.py is now 0 errors / 0 warnings on basedpyright.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(client): type rest of REST.post/post_best_effort + harden self._retry guard
basedpyright was still flagging on the PR view:
- path / data / json / headers untyped on post, post_best_effort,
_request, _build_request_headers (reportMissingParameterType)
- self._retry > 0 with self._retry: Optional[int] (reportOptionalOperand)
- headers=headers passing Optional to non-Optional _request param
(reportArgumentType)
All three classes of warning are fixed. baseline.json reduction: 20.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: restore baseline.json — should not have been touched in this PR
The previous commit auto-pruned 160 lines from .basedpyright/baseline.json
as a side effect of running basedpyright locally. Baseline regeneration
belongs in its own PR; mixing it with a feature PR risks masking other
in-flight PRs' regressions.
The source-code type fixes in
|
||
|
|
4350922707
|
fix(ingestion-pipeline): inherit owners from service and authorize trigger (#28109)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* fix(ingestion-pipeline): inherit owners from service and authorize trigger (#27962) * fix(policy): grant Trigger to default bot + steward policies for /trigger authz * fix(test): use java.time.Instant instead of Joda-Time per code review * fix(policy): use a dedicated TriggerRule on DataStewardPolicy instead of mixing into EditRule * fix(migration): drop generic exception catch + inline rule description per code review * fix(policy): add trailing newline to DataStewardPolicy.json per code review |
||
|
|
2ca2677bdd
|
fix(users): Online Users "All time" filter no longer returns empty (#28145)
The /v1/users/online endpoint computed `now - timeWindow*60*1000` for the SQL threshold. When the UI sent `timeWindow=0` for the "All time" option, the threshold collapsed to `now`, producing `lastActivityTime > now` — which matched zero rows. Skip the time predicate entirely when `timeWindow <= 0` so the filter falls through to "all non-bot users". Reported via #27993. |
||
|
|
a160fcb145
|
fix(entity): null-safe updateColumns for entities without columns (#28047)
* fix(entity): null-safe updateColumns for entities without columns PATCH on a File without columns (PDF/image/etc.) NPEs in ColumnEntityUpdater.updateColumns because FileRepository.entitySpecificUpdate unconditionally invokes the columns updater. recordListChange already null-coalesces internally, but the subsequent `for (Column updated : updatedColumns)` iteration does not — any updater calling updateColumns with a null list hits the same path. Null-coalesce origColumns and updatedColumns at the top of ColumnEntityUpdater.updateColumns so any optional-columns entity is safe. * test(entity): generic PATCH-add-tag and PATCH-add-glossary-term coverage Add two generic tests to BaseEntityIT that any entity inheriting from it must pass: - patch_addClassificationTag_200_OK - patch_addGlossaryTerm_200_OK Both create a minimal entity, set a single tag/glossary-term label, PATCH, and assert the label is present. Gated on supportsTags && supportsPatch, so an entity that opts out keeps doing so. This gives every new EntityRepository subclass (e.g. the next entity type someone adds) automatic defense against the class of bug where the tag/glossary PATCH path NPEs on an unrelated optional field (updateColumns with null columns being the original instance). * test(it): migrate drive entity ITs to BaseEntityIT, add FolderResourceIT Migrate FileResourceIT, DirectoryResourceIT, and SpreadsheetResourceIT onto BaseEntityIT<T, K> so they automatically inherit the ~60 generic entity tests (CRUD, owners, tags PATCH, glossary PATCH, soft-delete, versions, custom extensions, etc.). Add a brand-new FolderResourceIT covering the previously untested folder entity type. The previous standalone harnesses were a migration gap from the bulk "Faster tests" PR (#24948) — only WorksheetResourceIT got the full BaseEntityIT plumbing; the rest of the drive family did not. This meant bugs like the updateColumns NPE fixed in this PR did not surface in the IT suite for File and friends. Each entity sets feature flags conservatively: - supportsFollowers/Domains/DataProducts/CustomExtension/BulkAPI/ DataContract = false (matches the existing minimal surface area) - Folder additionally sets supportsVersionHistory/GetByVersion = false since the FolderResource doesn't expose /versions Entity-specific tests (column handling, directory hierarchy, root filter, FQN structure, etc.) are preserved. Existing redundant smoke tests (createMinimal, deleteById, getByName, ID/name not-found) are removed since BaseEntityIT covers them. Also add openmetadata-sdk/.../drives/FolderService.java so Folder has SDK access (basePath /v1/drive/folders). * fix(it): make FolderResourceIT pass inherited BaseEntityIT tests Three real issues surfaced by the inherited generic tests after the drive-IT migration: 1. FolderResource.list was missing @Min(0) / @Max(1000000) on the limit query param, so negative or excessive values were silently accepted. Add validation to match DirectoryResource/WorksheetResource. This fixes BaseEntityIT.get_entityListWithInvalidLimit_4xx. 2. FolderResource hard-delete is asynchronous — it kicks off deleteByIdAsync and returns 200 before the row is gone. The generic delete_entityAsAdmin_hardDelete_200 test asserts the entity is no longer fetchable immediately. Override hardDeleteEntity in FolderResourceIT to poll on include=deleted until the async delete completes. 3. BaseEntityIT.get_deletedEntityVersion_200 calls getVersion(...) but only gated on supportsSoftDelete/supportsPatch, missing the supportsGetByVersion gate. For entities like Folder that don't expose /versions, the test threw UnsupportedOperationException from the subclass override. Add the missing gate. * chore: address gitar-bot review on PR #28047 - SpreadsheetResourceIT.test_listSpreadsheetsWithRootParameter was vacuously passing if the list came back empty (the for-loop body silently runs zero times). Add an explicit assertFalse(getData().isEmpty()) so a regression in the ?root=true filter actually fails the test. - FolderResource.list parameter description said "1 to 1000000" but the annotation is @Min(0) — 0 is a valid limit (returns empty page). Align the description with the annotation. * test(it): restore entity-specific drive tests dropped during BaseEntityIT migration The previous migration commit was too aggressive in deleting tests it assumed BaseEntityIT covers. Restore the entity-specific tests: SpreadsheetResourceIT (8 restored): - test_createSpreadsheetWithOptionalFields (displayName + description) - test_updateSpreadsheet (Spreadsheet-specific path/size fields) - test_spreadsheetWithWorksheets (@Disabled — worksheet relationship) - test_listSpreadsheetsByService (?service filter) - test_spreadsheetFQNPatterns (nested directory FQN construction) - test_spreadsheetsWithAndWithoutDirectory (directory-presence variations) - test_listSpreadsheetsWithRootParameterAndPagination (root + pagination) - test_listSpreadsheetsWithRootParameterEmptyResult - test_listSpreadsheetsWithRootParameterAcrossMultipleServices (@Disabled) FileResourceIT (2 restored): - test_createFileWithDisplayName - test_fileWithAllOptionalFields DirectoryResourceIT (1 restored): - test_createDirectoryWithAllFields BaseEntityIT still covers the genuinely redundant tests that stayed deleted (CRUD smoke tests, get-by-id, get-by-name with fields, delete, non-existent ID/FQN, fluent-SDK find variants). * chore: address remaining gitar-bot + copilot review threads FolderResource: - Grammar: "Limit the number folders" -> "Limit the number of folders" FolderResourceIT: - Javadoc said Folder "supports ... followers" but supportsFollowers=false. Align doc with actual capability flags. - Awaitility hard-delete poll was catching `Exception` and treating any failure as success (would mask transient 500s). Narrow to ApiException with statusCode==404; re-throw everything else so Awaitility surfaces real errors. FileResourceIT: - Pass String IDs to service.update for consistency with the rest of the IT suite (createdFile.getId().toString()). - Compare service references via getFullyQualifiedName() instead of getName(); the latter only matches for top-level services and would silently break if the reference schema changes. SpreadsheetResourceIT: - Import @Disabled and use the short annotation form instead of @org.junit.jupiter.api.Disabled (project standards prohibit FQNs in annotations). - Strengthen test_listSpreadsheetsWithRootParameter: assert the root spreadsheet we created appears in ?root=true results AND the child spreadsheet does NOT, so a broken root filter actually fails the test instead of passing on an empty list. * test(it): restore all remaining originally-deleted drive tests Per reviewer request, restore the rest of the tests that the migration commit removed under the (incorrect) assumption they were fully redundant with BaseEntityIT. FileResourceIT (+8): test_createFileMinimal, test_createFileWithDescription, test_deleteFile, test_findFileById, test_findFileByNameWithFields, test_getFileByNameWithFields, test_getFileWithNonExistentId_shouldFail, test_getFileByNameWithNonExistentFQN_shouldFail. DirectoryResourceIT (+10): test_createDirectoryMinimalRequest, test_getByName, test_getByNameWithFields, test_deleteDirectory, test_findDirectoryById, test_findDirectoryByName, test_findDirectoryWithFields, test_createMultipleDirectories, test_getNonExistentDirectory_fails, test_getByNameNonExistent_fails. SpreadsheetResourceIT (+10): test_createSpreadsheet, test_createSpreadsheetMinimal, test_getSpreadsheetById, test_getSpreadsheetByName, test_deleteSpreadsheet, test_finderWithFields, test_finderByNameWithFields, test_getByNameWithFields, test_createMultipleSpreadsheetsUnderSameService, test_patchSpreadsheetAttributes. BaseEntityIT generic coverage stays intact — the subclass tests and the inherited tests now coexist (deliberate overlap, opted in by reviewer). Counts vs |
||
|
|
9498cc9a11
|
fix(bulk-assets): enforce dryRun on tag/glossary/dataProduct/team remove (#28005)
* fix(tags): enforce dryRun on PUT /v1/tags/{id}/assets/remove (#27954)
The bulk remove endpoint hardcoded dryRun=false and ignored the flag from
the request, so callers passing dryRun=true still had the tag removed from
the target asset. Mirror the dryRun guard already present in the bulk add
flow: read the flag, propagate it onto the BulkOperationResult, and short
circuit before any deletes when dryRun is true.
Adds TagBulkAssetsDryRunIT covering dryRun=true, default (omitted) dryRun,
and dryRun=false against PUT /v1/tags/{id}/assets/remove.
* fix(bulk-assets): enforce dryRun across glossary/dataProduct/team remove
The same dryRun bug fixed for tags in the previous commit also exists in
the other bulk-asset endpoints. Each site hardcoded withDryRun(false) on
the result and went straight to the destructive path:
- GlossaryTermRepository.bulkRemoveGlossaryToAssets — mirrors the tag-add
guard: read dryRun, propagate it onto the result, short-circuit before
any deletes.
- EntityRepository.bulkAssetsOperation (base; used by Team) — propagate
dryRun, skip relationship writes and the change-event emit when true.
- DataProductRepository.bulkAssetsOperation (override) — same pattern.
Validation still runs for adds so dryRun returns the same errors a real
call would, but no relationship writes / cache invalidations / change
events occur.
Adds BulkAssetsRemoveDryRunIT with dryRun=true and dryRun=false coverage
for the three synchronous remove endpoints. The existing
DomainBulkAssetsDryRunIT continues to cover the domain override and
TagBulkAssetsDryRunIT covers the async tag endpoint.
* test(bulk-assets): move dryRun remove tests into existing IT files
Address review feedback:
- Drop the standalone TagBulkAssetsDryRunIT / BulkAssetsRemoveDryRunIT
and fold the dryRun=true / dryRun=false coverage into the per-entity
IT files that already own each endpoint:
* TagResourceIT — tag bulk remove (async) with Awaitility during(...)
window
* GlossaryTermResourceIT — glossary term bulk remove (sync)
* DataProductResourceIT — data product bulk remove (sync), domains
properly aligned so add validation passes
* TeamResourceIT — team bulk remove (sync), reusing the existing
sanitized createTestUser helper to avoid the email-too-long /
invalid-chars failures the standalone file hit
- Rename the misleading addTagToAssetsRequest local in
TagRepository.bulkRemoveAndValidateTagsToAssets to assetsRequest
(copilot review)
- Make the dryRun branch in TagRepository / GlossaryTermRepository
iterate per asset and return one BulkResponse per request entry,
matching the DataProduct/Team paths so callers get a real preview
with rowsProcessed / rowsPassed counts (gitar-bot review)
* test(bulk-assets): add dryRun-omitted regression cases for bulk remove
Address Copilot review on PR #28005. The two bulk-asset request schemas
have asymmetric dryRun defaults:
- addTagToAssetsRequest.json / addGlossaryToAssetsRequest.json: default true
(omitted -> preview, no mutation)
- bulkAssets.json (data product / team / domain): default false
(omitted -> destructive)
Add a regression test per endpoint that posts a raw JSON body without the
dryRun field, pinning the full chain (schema default -> jsonschema2pojo
field initializer -> Jackson deserialization -> server-side
Boolean.TRUE.equals check). Guards against any of those layers silently
flipping the per-endpoint default.
* fix(bulk-assets): tag-add dryRun preview + add-side IT coverage
TagRepository.bulkAddAndValidateTagsToAssets short-circuited on dryRun=true
with a single "Nothing to Validate." message and zero row counts — the same
degraded-preview anti-pattern gitar-bot flagged on the remove side and that
was fixed in
|
||
|
|
6c8441a06d
|
fix(search): skip soft-delete script propagation to time-series child aliases (#28064)
* Entity Time Series * test case results * Add Playwright * fix: declare 'summary' on TestSuiteIndex so reindex preserves lastResultTimestamp TestSuiteIndex.buildSearchIndexDocInternal computes the top-level lastResultTimestamp field from testSuite.getTestCaseResultSummary(). TestSuiteRepository registers a fetcher for that under the field name "summary"; the reindex path only invokes fetchers whose field is in getRequiredReindexFields(). Without "summary" declared, the fetcher does not run, testCaseResultSummary stays null on the entity, and the Index takes its else branch (TestSuiteIndex.java:41) writing lastResultTimestamp=0L on every reindexed suite. That field is exactly what the DQ /data-quality/test-suites list page sorts by (TestSuites.component.tsx:175), so a full reindex collapses every basic suite to the 1970 epoch and "most recently run first" stops working. Co-authored-by: mohitdeuex <mohit.y@deuexsolutions.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com> |
||
|
|
ac5211451a
|
fix(rdf): scope glossary graph by membership and surface glossary labels (#28199)
* fix(rdf): scope glossary graph by membership and surface glossary labels
The /rdf/glossary/graph endpoint silently ignored its `glossaryId` filter
and returned every term in every glossary, because the SPARQL query bound
`?glossary` via `OPTIONAL { ?term1 om:belongsTo ?glossary }` while the
JSON-LD context (governance.jsonld) maps GlossaryTerm.glossary to
`om:belongsToGlossary`. The downstream `FILTER(?glossary = <…>)` was a
no-op, and the UI hierarchy view rendered every glossary's terms grouped
by their raw UUID — most visibly for cross-glossary scenarios.
|
||
|
|
383d05b575
|
fix(cache): seal hard-delete stale-read window via NotFoundCache marker (#28197)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* fix(cache): mark hard-deleted entities in NotFoundCache to seal stale-read window EntityRepository.cleanup() invalidates Guava L1, Redis L2, and pub/sub before and after the dao.delete() inside its transaction. A concurrent reader that arrives between the two invalidate passes can still snapshot the row from the not-yet-deleted DB state, write-through it back into Redis (and Guava on the loader callback), and silently re-populate the cache past the DELETE response. The NotFoundCache (notFoundTtlSeconds, default 30 s) is already consulted by find(UUID,…) and findByName(fqn,…) on L1 miss as a known-not-found short-circuit. Populate it explicitly when cleanup() finishes: the next read sees a stale L1 entry → false → checks the loader path → loader is short-circuited by isMarkedNotFound* → 404. Recreate-with-same-id paths already clear the marker via CacheBundle.invalidateEntity() in postCreate, so this doesn't shadow legitimate reuses. Read-path order is unchanged (L1 → negative cache → loader) so cache-hit GETs still pay zero extra Redis round-trips. |
||
|
|
4bb6574815
|
fix(glossary): preserve all relation types between same term pair (#28172)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + Elasticsearch + Redis / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + Elasticsearch + Redis / integration-tests-postgres-elasticsearch-redis (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* fix(glossary): preserve all relation types between same term pair The entity_relationship primary key (fromId, toId, relation) caused the second INSERT for the same (term, term) pair to UPSERT and overwrite the json discriminator, silently dropping any previously stored relationType. Adding the same target term with "synonym" then "seeAlso" left only one relation visible on GET. Extend the PK to (fromId, toId, relation, relationType) so each typed relation lives in its own row. The new column defaults to '' for every non-glossary edge, leaving existing call sites and queries semantically unchanged. CollectionDAO.deleteWithRelationType and countByRelationType now filter on the column directly instead of JSON_EXTRACT. The 1.13.0 schemaChanges migration backfills relationType from the existing json for glossaryTerm RELATED_TO rows, then atomically swaps the PK. Adds three integration tests in GlossaryTermResourceIT covering the add path, the targeted-removal path, and tag-usage cleanup when a table tagged with a glossary term is hard-deleted. |
||
|
|
1352d67cf4
|
feat(dar): Granted lifecycle, filters, sort, and self-service create policy (#28044)
* feat(dar): add Granted lifecycle, filters, sort, and self-service create policy Splits the Data Access Request lifecycle into Approved (awaiting grant) and Granted (active access) so the UI can show an "approved – awaiting grant" banner that clears once an admin marks the request as granted. Adds an indexed approvedBy/approvedById/approvedAt on Task, captured at the approve transition through a new direct-persist helper. Introduces a dedicated /v1/tasks/dataAccessRequests endpoint pre-scoped to category=DataAccess with DAR filters (dataset, service, status, requestedBy, approver, accessType) and an asc/desc sort on createdAt; generic /v1/tasks gains service/approver filters too. DataConsumerPolicy now grants Create on resource=task so authenticated non-admins can file a DAR (fixes "operations [Create] not allowed"). Reworks the workflow handler so transitions whose targetTaskStatus is non-terminal (Approved, Granted) don't close the task, and updates CreateTask.isTerminalTaskStatus to allow advancing between Approved → Granted stages. Adds a new "active" statusGroup that includes the DAR lifecycle states while preserving the existing open/closed semantics that Glossary-style workflows depend on. Includes a Postgres + MySQL migration for the indexed approvedById generated column and integration coverage in DataAccessRequestIT spanning the new lifecycle, filters, sorting, approver capture, and the non-admin policy path. Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: anuj-kumary <anujf0510@gmail.com> Co-authored-by: Ram Narayan Balaji <ramnarayanb3005@gmail.com> Co-authored-by: Shailesh Parmar <shailesh.parmar.webdev@gmail.com> |
||
|
|
5696286b27
|
Address Transitive vulnerabilities (#28169)
* Address transitive vulnerabilities * Address transitive vulnerabilities * fix(deps): resolve pyOpenSSL/cryptography conflict and align constraint pins CI dependency resolution failed because pyOpenSSL~=24.1.0 caps cryptography at <43, conflicting with the cryptography>=44.0.1 bump. Widens pyOpenSSL to >=24.3.0 (first version compatible with cryptography 44.x) and aligns the airflow constraint file pins for cryptography and GitPython with the upstream setup.py bumps so pip install -c can resolve. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
92a212a3a5
|
fix(search): scope column-grid metadata-status ITs + surface ES shard-failure reason (#28162)
`test_getColumnGrid_withMetadataStatusIncomplete` has been failing consistently on the new Integration Tests - PostgreSQL + Elasticsearch + Redis lane (added in #28012), even with @Isolated + 2 GB ES heap on the existing fix-cache-it-failures branch — so it's not parallel-test contention or generic heap pressure. The underlying ES `[search_phase_execution_exception]` reason is being swallowed by the aggregator's catch block, leaving us guessing. Two changes: 1. `ElasticSearchColumnAggregator` — add `logShardFailureDetails(e, indexes, query)` to the four `catch (ElasticsearchException)` sites. Logs the serialized query plus `e.error().rootCause()` before rethrowing, so the next failed CI run will reveal the actual shard-level reason (circuit_breaker, too_many_buckets, mapping conflict, etc.) instead of just "all shards failed". 2. `ColumnGridResourceIT` — scope the three `metadataStatus=*` tests (`Missing`/`Complete`/`Incomplete`) to their own `serviceName` like every other test in the class. They were the only tests in the file that aggregated over the entire index instead of just their own service's data — a defensive fix that drops the per-test query workload to ~1 doc regardless of how much accumulated data is in the cluster. |
||
|
|
0949bcc3ec
|
fix(cache): close rename-cascade race and bot-task delete eviction gap (#28100)
* fix(cache): close rename-cascade race and bot-task delete eviction gap |
||
|
|
4217e6db8d
|
fix(log-storage): plug clobber bugs in streamable S3 logs (partial.txt + logs.txt) (#27926)
* fix(api): make closeStream idempotent when log storage is not configured
closeStream used to throw IllegalStateException("Log storage is not
configured") which the resource layer translates to a 500 response.
That made the contract surprising for callers: any defensive cleanup
path (exit handlers, retry logic, generic teardown) had to know in
advance whether streaming was configured before calling close, or eat
spurious server errors.
Closing a stream is naturally idempotent — same shape as DELETE on a
non-existent resource. When log storage is not configured, return
silently with a debug log so callers can call close() defensively
without checking state first.
Adds a unit test covering the no-op path.
* Add design spec for streamable logs stability fix
Captures the design discussion for fixing partial.txt and logs.txt
clobber bugs in S3LogStorage when ingestion runs hit idle gaps longer
than the 5-minute stream timeout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add full design flow doc for streamable ingestion logs
End-to-end documentation of the streamable logs feature: architecture,
storage layout, run lifecycle, read paths, abandoned-run recovery,
configuration, concurrency model, and observability. Reflects the
post-fix design captured in the streamable-logs-stability spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add implementation plan for streamable-logs stability fix
Step-by-step TDD plan grouped into 8 PR-sized tasks: config schema
additions, per-stream lock, pendingFlush + merge-always flush, multipart
removal, sweeper rewrite, /close rewrite, read-path correction, and
integration tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(log-storage): add config fields for streamable-logs stability fix
Adds streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures. Deprecates
streamTimeoutMinutes in favor of streamTimeoutHours. Pure schema-only
change; no Java code consumes these fields yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): add deprecated:true keyword and clarify watermark unit
Addresses code review on Task 1: project convention uses the JSON Schema
deprecated keyword alongside description annotation. Also clarifies that
earlyFlushWatermarkBytes default (5242880) equals 5 MB.
* feat(log-storage): wire new stability-fix config fields into S3LogStorage
Reads streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures from
LogStorageConfiguration with sane defaults. No behavioral change yet —
values are stored but not consumed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): broaden streamTimeoutMinutes deprecation warning + drop FQN
Addresses code review on Task 2: warning now fires whenever
streamTimeoutMinutes is set (not only for values < 30 min), since the
field is deprecated for all deployments. Also imports java.lang.reflect.Field
in the test helper instead of using a fully-qualified name (CLAUDE.md
no-FQN rule).
* refactor(log-storage): add per-stream ReentrantLock for S3LogStorage
Introduces streamLocks map and acquire/release helpers. appendLogs,
writePartialLogsForStream, closeStream, and cleanupExpiredStreams all
serialize on the per-stream lock. No behavior change; locking is
pure mutual-exclusion at this point.
* fix(log-storage): close iterator.remove race in cleanupExpiredStreams
Move iterator.remove() inside the per-stream lock to prevent a window
where a concurrent appendLogs sees the still-present closed StreamContext
and writes to a closed stream. Also clarifies the comment on flush(fqn,runId)
ordering and documents that streamLocks accumulates monotonically until
Tasks 7 and 8 add cleanup.
* feat(log-storage): track pendingFlush queue and totalLinesAppended counter
Each appendLogs now also populates per-stream pendingFlush (lines awaiting
flush) and totalLinesAppended (monotonic logical line counter). State is
written but not yet consumed; the new flush logic in the next commit reads it.
* fix(log-storage): document thread-safety + lifecycle on Task 4 maps, add test
Addresses review on Task 4: documents that pendingFlush ArrayList values
may only be accessed under the per-stream lock; clarifies that
consecutiveFlushFailures is written and consumed in Task 5 (not just
consumed); aligns its type with AtomicInteger for consistency with
the other counters; adds a test for the trailing-newline trim path.
* fix(log-storage): merge-always partial.txt PUT and persist offset in S3 metadata
Replaces the old writePartialLogsForStream that skipped the read-merge step
when partialLogOffsets[streamKey] was 0 (the canonical 80MB->KB clobber bug).
The new flush always reads existing partial.txt, appends a snapshot of
pendingFlush, and PUTs with offset state in S3 user-defined metadata.
Also adds an early-flush watermark trigger so high-burst writes don't
pile up unbounded in pendingFlush.
Closes the partial.txt-clobber half of the streamable-logs-stability spec.
* fix(log-storage): replace task-number comments with intent-describing language
Addresses code review on Task 5: production code comments should describe
invariants, not the planning-doc task that filled the gap. Also clarifies
the parse-before-lock and the byte-counter atomicity assumption.
* refactor(log-storage): remove MultipartS3OutputStream, rewrite closeStream as server-side copy
appendLogs no longer initiates a multipart upload; bytes flow only through
pendingFlush -> partial.txt PUTs.
closeStream now: (1) drains pendingFlush via final partial.txt PUT,
(2) issues CopyObjectRequest from partial.txt to logs.txt server-side,
(3) deletes partial.txt and the .active marker, (4) drops in-memory state.
Idempotent: a second /close sees no partial.txt (NoSuchKeyException) and
returns gracefully.
Closes the logs.txt-clobber half of the streamable-logs-stability spec
and finalizes the canonical /close flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): plug listener/lock leaks, propagate SSE on copy, recover counter from metadata
Addresses code review on combined Tasks 6+8:
- dropStreamState now removes activeListeners entries (SSE listener leak fix).
- cleanupExpiredStreams now removes streamLocks entries on expire (lock leak fix).
- copyPartialToLogs applies SSE configuration to CopyObjectRequest (was unencrypted on copy).
- writePartialLogsForStreamLocked reads last-flushed-line metadata from existing
partial.txt and uses it to keep totalLinesAppended monotonic across restarts.
- consecutiveFlushFailures reset uses computeIfAbsent + set(0) instead of allocating
a new AtomicInteger every successful flush.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(log-storage): rewrite sweeper as cleanupAbandonedStreams (24h/1h)
Bumps the idle threshold from 5 min to streamTimeoutHours (default 24h)
and the poll interval from 1 min to cleanupIntervalMinutes (default 1h).
On expire, finalizes the abandoned run by copying partial.txt -> logs.txt
server-side, deleting partial.txt, and dropping in-memory state — same
end-state as closeStream.
Also wires partialFlushIntervalMinutes into the periodic flush schedule
and removes the legacy streamTimeoutMs field that no longer drives behavior.
* fix(log-storage): preserve streamLocks entry on cleanup retry path
Addresses code review on Task 7: streamLocks.remove was unconditionally
in the finally block of finalizeAbandonedStream, so it ran even when the
sweeper returned early to retry next tick on a copy failure. That meant
the next sweep tick would create a fresh ReentrantLock, and any
concurrent appendLogs in the meantime would contend on a different lock
object than the retry, defeating mutual exclusion.
Now we only remove the lock entry once finalization has succeeded
(after dropStreamState). The retry path leaves the lock in place so
the next tick and any concurrent appendLogs see the same lock identity.
* fix(log-storage): include pendingFlush snapshot in mid-run reads
getCombinedLogsForActiveStream now appends the in-memory pendingFlush
snapshot to the partial.txt body when reading mid-run, so the UI's
paginated GET surfaces the most recent tail even before the next
scheduled flush has happened.
Only appends pendingFlush when a partial.txt file exists, avoiding
duplication in the fallback path where recentLogsCache already
includes those lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): tighten Task 9 read path safety + invariant comment
Addresses review on Task 9: the unsafe null-lock fallback in the
pendingFlush append path is removed (it was structurally unreachable
but a latent hazard for future lifecycle changes). The pendingFlush
read now happens entirely under the per-stream lock, with a
conservative skip if no lock entry exists.
Also documents the recentLogsCache-vs-pendingFlush invariant in the
fallback path and adds a total-count assertion to the new test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(log-storage): add bug-reproducer ITs for streamable-logs stability
- testIdleGapDoesNotClobberPartial: two log bursts within an open run;
asserts both are present in the read response.
- testCloseProducesLogsTxtMatchingPartial: write, close, read; asserts
content survives the close.
- testCloseIsIdempotent: a second /close is a graceful no-op.
Tests are tolerant of the storage backend in the test environment
(DefaultLogStorage in CI may not persist; S3LogStorage in S3-configured
environments). Deep behavioral coverage is in S3LogStorageTest unit tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): address final-review critical bugs
- closeStream and finalizeAbandonedStream now propagate PUT failures
from writePartialLogsForStreamLocked (which returns boolean).
closeStream throws IOException; the sweeper retains state for retry.
Fixes silent data loss when the final flush PUT fails.
- streamLocks entries are no longer removed; this prevents an
acquire-vs-remove race that would break mutual exclusion. Memory
growth is bounded by maxConcurrentStreams in practice.
- cleanupAbandonedStreams re-checks expiration inside the per-stream
lock so a stream that was bumped by appendLogs between the scan
and the lock acquisition is not finalized.
- deleteLogs now acquires the per-stream lock before mutating state.
- getCombinedLogsForActiveStream appends pendingFlush in BOTH the
S3-found and memory-fallback branches, so reads aren't truncated
when recentLogsCache evicts oldest at its 1000-line cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): use pendingFlush as canonical mid-run read source (no duplicates)
The previous Issue 5 fix appended pendingFlush unconditionally, which
caused duplicate lines in the read response when the fallback branch
used recentLogsCache (since both are populated by the same appendLogs).
Now: in the foundPartialFile branch, append pendingFlush AFTER the S3
body (non-overlapping by construction). In the fallback branch
(no partial.txt yet), use pendingFlush directly as the canonical
source — this is more complete than recentLogsCache (1000-line cap)
and avoids the duplicate issue. recentLogsCache remains a defensive
fallback for the rare case where pendingFlush is empty in the fallback
path.
* Update generated TypeScript types
* chore(log-storage): drop dead abortIncompleteMultipartUpload lifecycle rule
The multipart upload write path was removed; the bucket lifecycle's
abortIncompleteMultipartUpload(7 days) rule served only as migration
cleanup for in-flight uploads from the old code at deploy time. After
the migration window it does nothing.
Drops the rule from configureLifecyclePolicy, the AWS SDK import, the
"7 days multipart cleanup" string in the startup log, and the
corresponding bullet in docs/streamable-logs.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: ignore docs/superpowers/
Local-only working notes (specs, plans) live there and shouldn't be tracked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(log-storage): tolerate DefaultLogStorage in CI for streamable-logs ITs
CI runs the integration tests against the bootstrap config which uses
DefaultLogStorage (delegates to k8s/Airflow which isn't running). The
storage returns:
- "No pods found for this pipeline" sentinel for getLogs
- non-2xx status (the SDK wraps it as statusCode -1) for /close
Adjustments:
- testIdleGapDoesNotClobberPartial: parse JSON, only assert when total>0.
When storage actually persists (S3 deployments), assert BOTH bursts
are present — that's the real "no clobber" check.
- postClose helper: tolerate any exception from the close call
(idempotency is the contract; transient errors are acceptable).
The deep behavioural coverage continues to live in S3LogStorageTest unit
tests where mock S3 is the storage backend.
* test
* fix
* Update generated TypeScript types
* fix
* Update generated TypeScript types
* fix(log-storage): record UTF-8 byte length in partial.txt total-bytes metadata
String.length() returns UTF-16 code units; for non-ASCII content this
diverged from the actual S3 object size, breaking the drift cross-check
documented in docs/streamable-logs.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): address PR review findings on S3LogStorage
Plumbs the documented timing knobs (cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures) through LogStorageConfiguration
so operators can actually tune them. Replaces the unbounded streamLocks ConcurrentHashMap
with a Guava Striped<Lock> capped at 256 stripes, eliminating the per-(fqn, runId) memory
leak and the acquire-vs-remove race that a per-key map would have. Adds a Multipart
Upload + UploadPartCopy concatenation path for partial.txt >= 5 MB, avoiding the O(n^2)
total transfer and full in-JVM body merge that the prior GET+PUT-everything strategy hit
on long-running pipelines. Realigns docs/streamable-logs.md with the actual schema and
implementation, drops the broken superpowers/* spec link, and renames the misleading
testIdleGapDoesNotClobberPartial IT (which posted bursts back-to-back without simulating
any gap).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|
||
|
|
17c3b8b9b2
|
fix(profiler): N+1 / missing-index regression on /tables/.../columns?fields=profile (#3488) (#27746)
* fix(profiler): N+1 / missing-index regression on /tables/.../columns?fields=profile (#3488) Root cause ---------- The 1.9.9 migration introduced two separate index regressions on `profiler_data_time_series`: 1. **PostgreSQL**: `schemaChanges.sql` explicitly dropped the unique constraint `profiler_data_time_series_unique_hash_extension_ts` (entityFQNHash, extension, operation, timestamp) to allow altering the generated `operation` column expression, but never recreated it. After the migration the table kept only the `(extension, timestamp)` index, which is useless for queries filtering by `entityFQNHash`. 2. **MySQL/both**: `postDataMigrationSQLScript.sql` created temporary indexes (idx_pdts_entityFQNHash, idx_pdts_composite, etc.) for its bulk UPDATE pass and then dropped **all** of them, including the only index covering `entityFQNHash`. The batch query issued by `getLatestExtensionsBatch()` when `fields=profile` is requested: SELECT entityFQNHash, MAX(timestamp) FROM profiler_data_time_series WHERE entityFQNHash IN (...N hashes...) AND extension = 'table.columnProfile' GROUP BY entityFQNHash required an `(entityFQNHash, extension, timestamp)` index. Without it the database performs a full table scan. On production deployments with millions of profiler rows this caused 100+ second response times (Grafana: 106 770 ms; 99 % in DB; 93 dbOps). Without `profile` in the fields param the same endpoint returned in ~150-220 ms. A secondary N+1 bug existed independently of the index: `customMetrics` in fields called `getCustomMetrics(table, column)` once per paginated column, issuing up to N identical queries against `entity_extension` and then filtering in Java. Fix --- * **migration 2.0.2** (MySQL + PostgreSQL): `CREATE INDEX IF NOT EXISTS idx_pdts_fqnhash_ext_ts ON profiler_data_time_series(entityFQNHash, extension, timestamp)`. The `IF NOT EXISTS` guard makes the migration safe to re-run and handles both upgrade and fresh-install paths. * **`getTableColumnsInternal`** — `customMetrics` block: fetch all column custom metrics for the table in one query, group by column name in Java, then distribute. Reduces N queries to 1. * **`getTableColumnsInternal`** — `profile` block: skip the duplicate `populateEntityFieldTags` call when `tags` was already fetched earlier in the same request, saving one prefix-scan on `tag_usage` per request. Related: PR #26855 (fixed N+1 tag queries on the list-tables path but left the profiler-index and customMetrics N+1 untouched on the columns sub-path). * fix(profiler): restore unique constraint on profiler_data_time_series + batch column extension/customMetrics fetch Move the migration from 2.0.2/ to 1.12.8/ and switch from a non-unique covering index to restoring the original unique constraint dropped in 1.9.9. The two-phase CREATE UNIQUE INDEX CONCURRENTLY + ADD CONSTRAINT USING INDEX pattern avoids the ACCESS EXCLUSIVE lock on the hot profiler_data_time_series table during the upgrade. Closes the 1.9.9 regression and brings Postgres back in line with MySQL (which never lost the constraint). The leading (entityFQNHash, extension) prefix serves the column-profile batch query — same shape MySQL has been running without 504s. MySQL needs no migration. Java side, eliminates two more N+1 patterns that compound the latency at customer scale: * getTableColumnsInternal extension block: replaced per-column getColumnExtension() loop with a single getExtensionsByJsonSchema() call, grouped by column FQN-hash in Java. * searchTableColumnsInternal customMetrics block: applied the same batch-fetch pattern already used in getTableColumnsInternal, replacing per-column getCustomMetrics() with one getExtensions() call. New DAO method on EntityExtensionDAO: getExtensionsByJsonSchema(id, jsonSchema) — selects extensions for a table id filtered by the jsonschema discriminator. Required because column extensions are stored with MD5-hashed extension keys and have no shared prefix the existing getExtensions(id, prefix) could use. * chore(profiler): address review feedback — empty-list literal + accurate test comments * Replace `new ArrayList<>()` default in `metricsByColumn.getOrDefault(...)` with `List.of()` at both call sites in `TableRepository` (getTableColumnsInternal and searchTableColumnsInternal). `getOrDefault` evaluates its default eagerly, so the new ArrayList allocates per-column even when the key is found — unnecessary work on a hot path. * Reword two stale test comments in `test_getColumnsWithProfileField_correctnessAndNoBatchRegression`: - "all four field combinations" → "the three field combinations exercised below" - "(c) duplicate populateEntityFieldTags must not run twice" → describe the observable contract the assertions actually verify (tags + profile both present), not the internal call count. * fix(profiler): force outer index scan in getLatestExtensionsBatch by pushing IN list to the join The getLatestExtensionsBatch query was the right shape for correctness but the planner — on Postgres at customer scale, with the new unique constraint in place — was still choosing a parallel sequential scan over the full profiler_data_time_series table for the outer side of the JOIN, rather than a merge join with index scan on both sides. Inner subquery: filtered by `entityFQNHash IN (...)`, used the index. Outer: only filtered by `p.extension = :extension`, no IN list, planner couldn't infer the transitive constraint that p.entityFQNHash must equal one of the inner hashes (because it's enforced through the JOIN ON clause, not a WHERE predicate). Result: full table scan reading 6.7M+ rows even when the actual answer is 23 rows. Adding the redundant `AND p.entityFQNHash IN (<entityFQNHashes>)` to the outer WHERE makes the constraint explicit. The result set is unchanged (implied by the join condition), but the planner can now use the unique index for the outer access too. Verified on the AUT dump (6.94M-row pdts): EXPLAIN of the batch query: 7,234ms → 79ms (Hash Join + Parallel Seq Scan → Merge Join + Index Only Scan). Live API /columns?fields=profile&include=all: 6-36 seconds → 22-28ms (warm) / 1.9s (very first call). 250-1000x improvement, depending on cache state. Same SQL works on both engines; no @ConnectionAwareSqlQuery split needed. * test(profiler): shorten classification/tag fixture names in IT to fit varchar(256) The IT fixture for test_getColumnsWithProfileField_correctnessAndNoBatchRegression was building a tagFQN of `<classification>.<tag>` where each part went through TestNamespace.prefix(). With the descriptive method name (62 chars) + class name (15 chars) + namespace UUID (32 chars) plus the `profile_test_cls` / `profile_test_tag` base names (16 chars each), the resulting tagFQN was 263 characters — over the tag_usage.tagFQN VARCHAR(256) limit: ERROR: value too long for type character varying(256) Shorten the fixture base names from `profile_test_cls`/`profile_test_tag` to `cls`/`tag`. The namespace prefix already encodes test isolation (class + method + UUID), so the base name doesn't need to repeat that context. New tagFQN length: 237 chars (cls__<32>__TableResourceIT__<62>.tag__<32>__TableResourceIT__<62>), comfortably under 256. * fix(table): include extensionKey in column-extension deserialize warn log Addresses gitar-bot review on PR #27746: the warning log on failed column- extension deserialization only had table.getId(), so operators could not pinpoint which row was bad. Add record.extensionName() (the entity_extension row key) to the log. No extra iteration - record is already in scope inside the catch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(migration): move profiler unique-constraint migration to 1.12.9 1.12.8 was already published with the PII classification fix from #27910. Move the profiler_data_time_series unique-constraint restore (this PR's postgres migration) to 1.12.9 so customers upgrading past the published 1.12.8 still pick it up. Add a MySQL placeholder schemaChanges.sql for 1.12.9 consistent with the 1.12.7 convention — MySQL was unaffected by the 1.9.9 regression (MODIFY COLUMN re-evaluates generated expressions in place without touching the constraint, so MySQL still has the constraint from 1.1.5). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(table): extract batchFetchCustomMetricsByColumn helper Addresses PR #27746 Copilot review: - Dedupe custom-metric batch logic between getTableColumnsInternal and searchTableColumnsInternal. - Reword IT inline comment to reflect what the test actually validates (completes within timeout + correct profiles) instead of claiming it inspects query plans. --------- Co-authored-by: Claude <noreply@anthropic.com> |
||
|
|
64f49c1747
|
Cache improvements: lineage + search layers, observability, CI gate (#28012)
* cache: lineage cache, per-type metrics, invalidation registry, search-cache Add Redis-backed lineage response cache and search response cache, both gated by the existing CACHE_PROVIDER toggle and falling through to direct computation when the cache is unavailable. The cache remains optional — verified end-to-end by toggling CACHE_PROVIDER=none on a live stack and confirming all paths continue to work (just without the L2 hit). Coverage: - CachedLineage wraps LineageRepository.getLineage with hybrid TTL + direct invalidation (60s default). Direct edits invalidate the affected root cache entries; transitive changes fall through to TTL. - CachedSearchLayer wraps /api/v1/search/query with auth-aware caching (cache key includes principal so users with different ACLs don't share results). 30s default TTL. Observability: - /api/v1/system/cache/stats response now includes a metrics block with hits/misses/hitRatio/evictions/errors/writes plus read/write latency Timers, and a byType breakdown so coverage gaps are visible per entity-type and per cache-layer. Correctness: - New Invalidatable interface + CacheBundle registry + invalidateEntity helper so future cache layers plug in by implementing one method instead of editing multiple mutation paths. - Edge mutations in LineageRepository.addLineage/deleteLineage invalidate both endpoints; entity mutations in EntityRepository.postUpdate / postDelete / restoreEntity invalidate the lineage rooted at the entity. - Pub/sub handler in CacheBundle iterates registered Invalidatables so remote-pod evictions flow to all layers automatically. Tooling: - docker-compose.cache-off.yml overlay flips CACHE_PROVIDER=none for local A/B testing without tearing down DB/ES volumes. - CachedSearchLayerIT exercises hit-on-second-call, distinct-query misses, distinct-page-size misses, and byType shape via the metrics endpoint. Each test gracefully no-ops when the cluster runs cache-off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cache: phase 2 ops + correctness — single-flight, slow-read, negative cache, admin endpoints Builds on the phase 1 commit ( |
||
|
|
ed5e68c2a5
|
Make SearchIndexing distributed-only (#27971)
* Make search indexing distributed-only * Update generated TypeScript types * Address search index review comments * Normalize search index entity aliases * Return defensive search index config copies * Update search indexing application docs * Share staged reindex context mapping * Speed up distributed job polling discovery * Use database polling for distributed job discovery * Address distributed search indexing review comments * Address distributed indexing polling review * Add SearchIndex promotion test coverage * Fix distributed reindex finalization review comments * Fix SearchIndex app Playwright run history parsing --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> |
||
|
|
2859cbeeb5
|
Add db-tune ops subcommand + production RDS runbook (#27890)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Add db-tune ops command + production RDS runbook Generate a per-table autovacuum (Postgres) / InnoDB stats (MySQL) tuning report from the live database; --apply executes the ALTER statements after operator confirmation; --analyze refreshes planner stats on the changed tables. Heuristic skips small dev tables (row-count gated) and recognizes already-tightened settings as OK. Catalog values come from the 600k-container production audit. The companion runbook in docs/rds-postgres-production-runbook.md captures the parameter-group settings, extensions, and the indexes that must be applied by hand because the migration runner can't dedup cross-version. Tests: 38 unit tests on the heuristic + report formatter; DbTuneIT covers analyze / apply / analyzeOne / dry-run against the IT bootstrap's live Postgres or MySQL Testcontainer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Remove file * db-tune: address review 4230257768 — direction-agnostic status, empty/match disambiguation, script path - ServerParamCheck.STATUS_UNDERSIZED -> STATUS_MISMATCH. Some recommended values are deliberately lower than engine defaults (random_page_cost 1.1 vs 4.0, autovacuum scale factors). Labelling those as "UNDERSIZED" was wrong. Added a regression test that pins random_page_cost as MISMATCH when the current value is higher than recommended. - dbTune --apply: distinguish "no tracked tables on this DB" from "all tables already match" in the Nothing-to-apply log line. Mirrors the same disambiguation already in DbTuneReport.appendNextSteps. - DbTuneReport next-steps now points at ./bootstrap/openmetadata-ops.sh (the actual location of the wrapper script) instead of the bare ./openmetadata-ops.sh which would be file-not-found from the repo root. - DbTuneIT analyzeReturnsRecommendationsForKnownTables: assertion message now references TEST_TABLE so it stays correct after the switch from storage_container_entity to entity_relationship. Tests: 40 unit tests passing (+1 for the higher-than-recommended MISMATCH case). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * DbTuneIT: contain apply path to a private isolated table The earlier IT had applyChangesReloptionsAndIsIdempotent + analyzeOne running ALTER TABLE / ANALYZE TABLE against entity_relationship — a real catalog table that the rest of the suite writes to. On MySQL this broke ~17 downstream ITs (GlossaryResourceIT, GlossaryTermRelationsIT, DomainResourceIT, LineageBrokenReferenceIT, OpenLineageLineageResolutionIT) with "Cannot create a JSON value from a string with CHARACTER SET 'binary'" on inserts into entity_relationship.json. Root cause: ALTER TABLE bumps MySQL's per-table metadata version, which invalidates JDBC prepared-statement caches across the whole shared Testcontainer. After re-prepare, the JSON column metadata sometimes comes back as VARBINARY in Connector/J's cache, breaking subsequent JSON inserts. Confirmed via CI log timing: DbTuneIT finished at 17:58:12, GlossaryTermRelationsIT failed starting 17:58:53. The recommendations themselves are correct — STATS_PERSISTENT=1, STATS_AUTO_RECALC=1, STATS_SAMPLE_PAGES=100 are safe and modern InnoDB defaults except for SAMPLE_PAGES. The test just cannot afford the side effect on a shared DB. Fix: introduce dbtune_it_isolated_table created/dropped per test in @BeforeEach/@AfterEach, build a TableRecommendation for it directly (bypassing the static catalog), and run apply + analyzeOne against it. The read-only tests (analyze, dry-run) continue to use the real catalog since they don't mutate anything. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * db-tune: add --diagnose for read-only DBA findings The tuning report we already had is a static playbook (table → recipe + row-count gate). Operators rightly asked whether db-tune actually inspects the live database for *its own* signals — unused indexes, bloat, slow queries, cache hit ratios — and surfaces recommendations from those measurements. Until now: no. This adds a parallel read-only path: ./bootstrap/openmetadata-ops.sh db-tune --diagnose Postgres categories (each isolated in its own try block; missing extension or permissions surfaces in DbTuneDiagnosis.notes rather than failing the run): - UNUSED_INDEX — pg_stat_user_indexes idx_scan=0, size > 10 MB, non-unique non-pkey - HIGH_DEAD_TUPLES — n_dead_tup/n_live_tup > 0.2 with n_live_tup > 10k (autovacuum falling behind) - LOW_CACHE_HIT — heap_blks_read > 1000 AND hit ratio < 90% - STALE_STATS — last_autoanalyze NULL or > 14 days, n_live_tup > 1000 - SEQ_SCAN_HEAVY — seq_scan/idx_scan > 10 with > 1000 seq scans (suggests missing index) - SLOW_QUERY — pg_stat_statements top 10 by mean_exec_time, calls > 100 (gracefully skipped if extension absent) MySQL categories: - UNUSED_INDEX — sys.schema_unused_indexes filtered to current schema - LOW_BUFFER_POOL_HIT — Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests < 99% - SLOW_QUERY — performance_schema.events_statements_summary_by_digest top 10 by avg_timer_wait - FULL_TABLE_SCAN — sys.statements_with_full_table_scans Concept stays separate from AutoTuner: Diagnostic is read-only and never participates in --apply. Categories with zero findings are suppressed in the report; notes capture what couldn't be checked. Each finding is structured as (category, severity, attributes) with the attribute keys driven by DiagnosticCategory.columns() so the renderer dispatches a category-specific layout. Tests: 50 unit tests passing (40 → 50, +10 in DiagnosticReportTest covering EnumMap grouping order, empty/non-empty rendering, suppression, notes appending, query truncation, and column-list immutability). DbTuneIT gains diagnoseCompletesWithoutErrorAndReturnsStructuredResult that exercises the full end-to-end against the live Testcontainer (read-only, safe). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * db-tune: restore apply DB verification + include idx_scan=0 in seqScanHeavy Two findings from the review round on |
||
|
|
d3bbbefe37
|
fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999)
* fix(rdf): dedupe lineage edges and broaden PROV-O coverage
The RDF Knowledge Graph endpoint was emitting two edges per lineage
relationship — once as `om:UPSTREAM` (forward) and once as
`prov:wasDerivedFrom` (reverse) — because the parser preserved each
predicate's native subject/object orientation instead of canonicalizing
both into a single `(upstream, downstream)` edge.
Also extend PROV-O coverage so external SPARQL clients can use the W3C
Provenance vocabulary directly:
- `prov:Entity` / `prov:Activity` / `prov:Agent` class typing on
datasets / pipelines / users
- `prov:wasAttributedTo` mirror of `om:owners`
- `prov:generated` (inverse of existing `wasGeneratedBy`) and `prov:used`
on lineageDetails so the Entity → Activity → Entity chain is complete
- `prov:hadPlan` + `prov:Plan` for SQL transformation recipes
- `prov:startedAtTime` / `prov:endedAtTime` on Activity instances
- `prov:wasAssociatedWith` Activity → Agent linking
- `prov:invalidatedAtTime` on soft-deleted entities
Other RDF cleanups in the same area:
- LineageDetails URIs are now deterministic (driven by from/to ids
instead of a timestamp), so re-indexing collapses duplicate Activity
resources via the existing DELETE+INSERT idempotency
- Skip emitting the redundant `om:owners` JSON-string literal — the
mapped path already produces clean `om:hasOwner <agent>` triples
- Skip empty `[]` array literals in the unmapped path
- Propagate failures from `RdfRepository.{addRelationship,
addLineageWithDetails, bulkAddRelationships,
bulkAddGlossaryTermRelations}` instead of silently swallowing them,
so downstream callers can surface the failure
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf-index-app): surface Fuseki failures in app run record
Per-entity and per-batch failures from the RDF index app used to be
logged via SLF4J only — they never made it into the AppRunRecord, so
the UI/run history showed "completed" even when every entity had
silently failed to write to Fuseki.
- `RdfBatchProcessor.processEntities` now captures the last error per
entity, returns it in `BatchProcessingResult.lastError`, and
accumulates relationship-processing failures into the same result.
- Relationship and lineage processing methods (`processBatchRelationships`,
`processLineageRelationship`, `processGlossaryTermRelations`) return
structured results with failure counts and last-error messages instead
of `void`, so failures are visible to the partition worker.
- `RdfIndexApp` records the failure on `jobData` for both the
distributed and non-distributed code paths, so users see a real
error message in the run history (e.g.
"Failed to write entity X to Fuseki: ConnectException").
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* perf(rdf-index-app): port distributed-mode improvements from SearchIndex
The RDF distributed-indexing fork was lagging behind several SearchIndex
improvements that addressed concrete reliability and throughput issues.
Port them across:
Core perf / reliability
- Precomputed partition start cursors: coordinator walks each entity
once via keyset pagination at job init and caches the boundary cursor
per (jobId, entityType, rangeStart). Workers consult the cache before
falling back to the OFFSET-based path. Eliminates the previous O(N²)
per-partition cursor lookup.
- `cancelInFlightPartitions` + `requestStop` + `checkAndUpdateJobCompletion`
on the coordinator. Stop now cancels both PENDING and PROCESSING
partitions in a single SQL update and immediately drives the job
status from STOPPING → STOPPED, so the UI status no longer hangs
while workers drain.
- Selective field hydration: `RdfPartitionWorker.readEntitiesKeyset`
uses `ReindexingUtil.getSearchIndexFields(entityType)` instead of
`List.of("*")`, avoiding expensive fetchers (e.g. fetchAndSetOwns)
per batch.
- Partition heartbeat thread: virtual thread refreshes
`lastUpdateAt` every 30s for partitions actively being processed by
this server, so the stale reclaimer no longer interrupts active work.
- `MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5` backpressure: claim path
rejects when the server already holds 5 PROCESSING partitions, giving
fair distribution across pods. Verified the existing claim DAO uses
`FOR UPDATE SKIP LOCKED` for both MySQL and Postgres.
- Gate WebSocket stat broadcasts during the STOPPING phase so the
Quartz-scheduler-driven STOPPED status push isn't overwritten.
Multi-server scaffolding (single-pod is unaffected)
- `RdfPollingJobNotifier`: DB-polling discovery for other server pods
to find an in-flight RDF reindex they can join.
- `RdfEntityCompletionTracker`: per-entity-type partition tracking with
callback firing once all partitions for an entity complete, foundation
for early per-entity index promotion.
Tests: precomputed-cursor cache lookup, in-flight backpressure,
cancelInFlight delegation, completion tracker callback semantics,
notifier start/stop.
DAO additions on `rdf_index_partition`:
- `cancelInFlightPartitions(jobId, now)` — covers both PENDING and
PROCESSING in one statement
- `countInFlightPartitionsForServer(jobId, serverId)` — backpressure
- `countPartitionsByStatus(jobId, status)` — used by completion check
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(ui-apps): hide misleading data on synthetic 'CurrentConfig' row
When an app has no run history, AppRunsHistory fabricated a synthetic
placeholder row that looked like a real run — `runType: "CurrentConfig"`,
a fake `Run At` timestamp pulled from `appData.updatedAt`, an
ever-growing `Duration` (`now − updatedAt`), and an active `Stop` button
that targeted nothing.
Render `--` for `Run At`, `Run Type`, and `Duration` on synthetic rows,
and hide the `Stop` button so users no longer see "Run now → 19-minute
Running with Stop button" when the actual job never registered. Real
app runs are unaffected — they still display `runType` from the
backend (OnDemandJob, Hourly, Daily, Custom, etc.).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): address PR review findings
Four issues raised in PR #27999 review:
- **Cursor format consistency in walkAndRecord** (bug):
The defensive branch produced cursors via a custom `{name, id}` map
while the regular path used `repo.getCursorValue()`. For entities
with quoted names these encodings diverge — a quoted-name entity
could land in the cache with a cursor incompatible with what the
worker fetches via keyset pagination. Track the last seen entity
reference and run it through `repo.getCursorValue()` in both paths.
`encodeBoundaryCursor` is removed.
- **Adaptive scheduling in RdfPollingJobNotifier** (perf):
The previous implementation woke the scheduler thread every 1s and
short-circuited inside the poll method when idle. Reschedule the
task at the appropriate interval (1s active / 30s idle) when
`setParticipating` flips, so the thread genuinely sleeps when idle.
- **Cursor cache cleanup on startup recovery** (edge case):
`partitionStartCursors` was only evicted by `refreshAggregatedJob`
/ `checkAndUpdateJobCompletion`. If a coordinator crashed mid-job
and never reached either, the cache entry leaked until process
restart. Add `evictStaleCursorCacheEntries()` invoked by
`performStartupRecovery` that drops entries for jobs that no longer
exist in the DB or are already terminal.
- **Consolidate describeError helpers** (quality):
`describeError`, `describeBulkError`, and `describeLineageError` in
`RdfBatchProcessor` all walked the cause chain and formatted a
prefixed message with the same logic. Reduced to a single
`describeError(prefix, error)` plus a thin `describeEntityError`
adapter for the per-entity call site.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf-index-app): avoid double workerExecutor.shutdownNow() in stop()
stop() called workerExecutor.shutdownNow() inline AND through
cleanupLocalExecution -> shutdownWorkerExecutor, which broke the
DistributedRdfIndexExecutorTest.stopAndCoordinatorCleanupOnlyTearDownLocalExecutionOnce
verify(workerExecutor, times(1)).shutdownNow() expectation. Drop the
inline call — cleanupLocalExecution is the single owner of the
shutdown path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* ci: drop redundant DB matrix from openmetadata-service unit tests
The {mysql, postgresql} strategy matrix on openmetadata-service unit
tests doubled CI cost without adding signal: both jobs ran the same
surefire suite. The `-Pmysql` / `-Ppostgresql` profiles are defined
only in `openmetadata-sdk/pom.xml` (lines 190-206), set a single
`test.database` property, and that property is consumed exclusively by
the failsafe plugin (integration tests `*IT.java` / `*IntegrationTest.java`),
which only runs under `-Pintegration-tests` — not enabled here.
`openmetadata-service` itself has zero tests that read `test.database`
or use `MySQLContainer`/`PostgreSQLContainer` (verified by grep). The
only testcontainer-based DB code in the repo lives in
`openmetadata-integration-tests`, a different module that this workflow
doesn't build.
Run the unit suite once. The `openmetadata-service-unit-tests-status`
required-check aggregator is unaffected (it depends on the renamed job
which still has the same name).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): address Copilot PR review findings
Six correctness issues raised on PR #27999:
- **Lineage-details DELETE was too broad** (RdfRepository): the cleanup
step deleted *all* `<fromUri> om:hasLineageDetails ?d` triples,
so reindexing one (fromId, toId) edge wiped lineage-details links
for every other downstream of the same source entity. Pin the
delete to the specific `<fromUri> om:hasLineageDetails <detailsUri>`
triple. Same with prov:generated cleanup — anchor it to the
specific detailsUri instead of any details resource.
- **Predicate not flipped during canonicalization** (RdfRepository):
`parseEntityGraphEdgesFromResults` swapped subject/object for
reverse-direction predicates (`prov:wasDerivedFrom`,
`prov:wasInfluencedBy`) but kept the original predicate URI on the
resulting EdgeInfo. Exported graphs could carry semantically
invalid triples like `<upstream> prov:wasDerivedFrom <downstream>`.
Add `forwardEquivalentPredicate` to substitute the OM-native
forward predicate when the direction flips.
- **`dct:modified` was an invalid xsd:dateTime** (RdfPropertyMapper):
`entity.getUpdatedAt().toString()` returns the epoch-millis Long as
a string, but the literal was tagged `xsd:dateTime`. Convert via
`Instant.ofEpochMilli(...).toString()` so the lexical form matches
the type — same fix already in place for prov:invalidatedAtTime.
- **Unmapped EntityReference arrays were dropped entirely**
(RdfPropertyMapper): the previous fix to skip noisy JSON-string
literals also dropped fields like `domains`, `reviewers`, `voters`
for entity contexts that don't have a JSON-LD mapping for them —
the unmapped path was the only path emitting them, so nothing
landed in RDF. Expand each array element through
`addEntityReference` so the data still produces proper
`om:<fieldName> <ref>` triples; mapped-path duplicates are
collapsed by Jena's Model dedupe.
- **Partition failure detection missed reader errors**
(DistributedRdfIndexExecutor): the EntityCompletionTracker was fed
`result.errorMessage() != null`, but `RdfPartitionWorker` can
increment `failedCount` from `readerErrors` without ever setting
`lastError`. Use `result.failedCount() > 0` so partitions whose
failures came from `ResultList.getErrors()` are also marked as
failed when promoting an entity.
- **`COMPLETED_WITH_ERRORS` was hidden when failedRecords == 0**
(RdfIndexApp): the coordinator marks a job COMPLETED_WITH_ERRORS
whenever any partition is FAILED or CANCELLED, including for
user-initiated stops where no record-level failures accrued. The
monitor's `completedWithErrors` gate required `failedRecords > 0`,
so those terminal states never hit `jobData.setFailure(...)` and
the run record showed success. Drop the failedRecords precondition
and tailor the fallback message based on whether there are
record-level failures or partition-level only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): separate relationship failures + type lineage as prov:Activity
Two more PR review findings on #27999:
- **Relationship failures inflated failedRecords stat**: `processEntities`
was folding relationship/lineage edge failures into `failedCount`,
which becomes `failedRecords` in the index stats. Records there mean
entities, computed from entity counts in `totalRecords`. Counting
per-edge relationship failures could push `failedRecords` above
`processedRecords`/`totalRecords` and produce nonsensical
per-entity stats.
Track them separately: add `relationshipFailureCount` to
`BatchProcessingResult` and `PartitionResult`. `failedCount` now stays
entity-level. The completion tracker is fed the broader
`result.hasAnyFailure()` so partitions where relationship triples
failed don't get prematurely promoted as success even though their
entity writes succeeded.
- **`detailsResource` wasn't typed as prov:Activity**: the resource
carries Activity-shaped predicates (prov:startedAtTime,
prov:endedAtTime, prov:used, prov:hadPlan, prov:wasGeneratedBy,
prov:wasAssociatedWith) but only the OM-specific
`om:LineageDetails` rdf:type. Add an explicit
`rdf:type prov:Activity` so PROV-O reasoners and federated SPARQL
clients recognize it as an Activity without having to learn the
OM type.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): label lineage edges relative to focal node
The Knowledge Graph view was labeling every edge with relation
type "upstream" as "Upstream" regardless of direction relative to the
focal node. For a focal node F, the raw stored relation `(F, X, upstream)`
means "F is upstream of X" — i.e. X is *downstream* of F. The previous
output labeled both `F → X` and `X → F` edges as "Upstream", which made
bidirectional lineage look like a duplicated relation.
Re-orient the label in `convertEdgesToGraphData` based on whether the
focal is the edge's source or target:
- focal → X → "Downstream"
- X → focal → "Upstream"
- non-focal-touching edges keep the raw relation label.
Reported on a sample-data table with a circular lineage cycle
(`dim_customer ↔ fact_orders`) where both directions showed "Upstream".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): close remaining Copilot review gaps
Three findings from PR #27999's third review pass — all about failure
signals being silently dropped between layers:
- **`RdfIndexApp.processTask` ignored relationship failures**: only
`result.failedCount() > 0` was treated as a failure, so partitions
whose Fuseki relationship/lineage writes failed (incrementing
`relationshipFailureCount` but not `failedCount`) never wrote
`jobData.failure`. Switch to `result.hasAnyFailure()` and report the
combined count.
- **`checkAndUpdateJobCompletion` ignored partition `lastError`**: a
partition can finish COMPLETED with `lastError` set when a relationship
bulk write was caught and recorded but didn't bump `failedRecords` or
flip the partition to FAILED. The job would then go to COMPLETED even
though there were real failures. Treat the presence of any
`rdf_index_partition.lastError` as an error signal — promote to
COMPLETED_WITH_ERRORS and aggregate sample errors into the job's
errorMessage if it was blank.
- **`forwardEquivalentPredicate` mapped to a non-existent
`om:DOWNSTREAM` URI**: OpenMetadata only stores lineage with
`om:UPSTREAM` (forward) and `prov:wasDerivedFrom` (reverse PROV-O
pair); there is no `om:DOWNSTREAM` predicate written anywhere — the
downstream view is derived by reading the same UPSTREAM edge from the
other side. Map both `prov:wasDerivedFrom` and `prov:wasInfluencedBy`
to `om:UPSTREAM` (both are reverse-direction causation predicates: in
`B wasDerivedFrom A` / `B wasInfluencedBy A` the source is A and
effect is B, so the canonical forward predicate is the same).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Fix RDF tag mapper
* Fix all the comments
Cherry-picked from #27562 (without bin/ autogenerated noise).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Align RdfPropertyMapper tests with refactor and isolate ontology export IT
RdfPropertyMapperTest still referenced the removed addVotes helper and
expected addStructuredProperty to dispatch votes — both gone after votes
was added to IGNORED_PROPERTIES. Update the assertions accordingly.
GlossaryOntologyExportIT timed out on the full suite because it flips a
global RDF singleton in @BeforeAll and each test blocks a server thread on
synchronous Fuseki writes. SAME_THREAD only serialized methods within the
class — concurrent classes still raced for server threads. Adding @Isolated
matches the pattern already used by RdfResourceIT for the same reason.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(rdf): align addCertification typing + relationType after predicate flip
Two findings on PR #27999 from the post-cherry-pick review pass:
- **`addCertification` mis-typed glossary-source certifications and
skipped skos:Concept**: it always emitted `om:Tag` regardless of
source, even though `resolveTagResource` returns a glossaryTerm URI
when the certification points at a glossary term. It also didn't add
`skos:Concept` (or the `createTypeResource("tag")` `skos:Concept` for
classification tags), so SPARQL queries filtering certification
targets by `a skos:Concept` missed them while `addTagLabel`-emitted
tags were findable. Mirror `addTagLabel`: branch on source
(`Glossary` vs `Classification`), emit the right primary type plus
`skos:Concept` (glossary) or `om:Tag` (classification), and include
`om:tagSource`.
- **`relationType` left stale after predicate flip**: when
`parseEntityGraphEdgesFromResults` flipped subject/object for a
reverse-direction predicate and rewrote `canonicalPredicate` to
`om:UPSTREAM`, it kept the original `relationType` derived from the
reverse predicate. So `prov:wasInfluencedBy` produced an EdgeInfo
with `relationType=downstream` + `predicate=om:UPSTREAM` —
internally inconsistent, and the mismatched `edgeKey` prevented
dedup against an existing UPSTREAM edge with the same endpoints.
Re-derive `relationType` from the canonical predicate after the
flip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): close 2 review findings + add parser-helper unit tests
Two outstanding Copilot findings on PR #27999 plus targeted unit
coverage for the helpers that drive lineage canonicalization.
Findings:
- **`colLineageUri` collision risk** (RdfRepository): the deterministic
key replaced non-alphanumerics in `toColumn` with `_`, so distinct
column names (e.g. `a-b` vs `a_b`) collapsed onto the same URI, which
would lose / overwrite column-lineage resources during reindex.
Append the loop index as a tiebreaker so distinct columns keep
distinct URIs.
- **`createTypeResource` missing dprod prefix** (RdfPropertyMapper):
the `getNamespace` switch didn't recognize `dprod`, so
`RdfUtils.getRdfType("dataProduct")` (returns `dprod:DataProduct`)
produced an invalid `dprod:DataProduct` URI on the wire. Added the
`DPROD_NS = https://ekgf.github.io/dprod/` constant and a `dprod`
case in the switch.
Coverage:
- New `RdfParserHelpersTest` exercises the canonicalization helpers
via reflection: `isReverseDirectionPredicate` (recognizes
PROV-O causation predicates, ignores forward predicates),
`forwardEquivalentPredicate` (both `wasDerivedFrom` and
`wasInfluencedBy` collapse to `om:UPSTREAM` so dedup works),
`relativeRelationLabel` (focal-relative Upstream/Downstream
flipping with all the boundary cases — non-focal edges,
non-lineage relations, null focal).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): merge array contexts before per-field resolution
The third (low-confidence "suppressed") finding on review 4256830399
turned out to be a real duplication: when a field is mapped in one
context map of an array context but absent from another, the previous
processArrayContext ran processContextMappings once per map. The pass
where the field IS mapped emits the proper `om:hasOwner <ref>` triples
(plus `prov:wasAttributedTo`); the pass where the field is absent
falls through to processUnmappedField and emits an additional
`om:owners <ref>` triple. Net: two predicates for the same logical
relationship.
Verified on the live Fuseki: 113 `om:hasOwner` triples vs 112
`om:owners` triples — one set per pass.
Fix: flatten all context maps in the array into a single merged map
once, then iterate entity fields exactly once against that combined
view (later contexts win on key conflicts, matching JSON-LD context
merge semantics). Each field is resolved against the union of
mappings, so the unmapped fallback only fires for fields truly absent
from every context. Net effect: `prov:wasAttributedTo` count is
unchanged, `om:hasOwner` is unchanged, and the redundant `om:owners`
triples disappear.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): close 2 review findings on coordinator finalization race
Two findings from PR #27999 review 4259628860:
- **`checkAndUpdateJobCompletion` early-returned before lastError check
could promote**: `refreshAggregatedJob` already marks the job COMPLETED
when partitions all finish without `failedRecords`/`failedPartitions`,
so `checkAndUpdateJobCompletion`'s subsequent `if (job.isTerminal())`
short-circuit silently dropped the lastError signal. Move the
partition-lastError check INTO `refreshAggregatedJob` so both code
paths produce consistent terminal status — a partition that finished
COMPLETED but carries a non-null lastError now correctly promotes the
job to COMPLETED_WITH_ERRORS regardless of which finalizer wins the
race.
- **`completePartition` / `failPartition` overwrote CANCELLED state**:
the unconditional partition row update lost a concurrent Stop's
CANCELLED status if a worker finished its batch after the Stop
request landed but before noticing it. Add a status-guarded
`updateIfProcessing` DAO method (UPDATE ... WHERE id = :id AND
status = 'PROCESSING') and have both completion paths use it; if 0
rows update, log and skip the side effects (no server-stat increment,
no refreshAggregatedJob call) so the authoritative CANCELLED status
stays. Mirrors the pattern SearchIndex's coordinator uses for the
same race.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
|
||
|
|
22a6c10072
|
Context center (#27558)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (mysql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (postgresql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Add Context Center: Migrate Knowledge Center , Images/ PDFs document support * Add Context Center: Migrate Knowledge Center , Images/ PDFs document support * Address PR #27558 review comments - KnowledgePageRepository: null-safe pageType in getHierarchyWithSearch and getHierarchyWithSearchForActivePage so the /search/hierarchy endpoint no longer NPEs when the pageType query param is omitted. The ES/OS client helpers already skip the pageType term when the value is null or empty, so this is a pure null-guard. - ContextFileResource.uploadFile: when a failure happens after the ContextFileContent row is created (e.g. inside extractionService.submit), the cleanup path now hard-deletes that content row so the DB is not left with an orphaned record. - ContextFileResource: replace the raw Content-Disposition string with a buildContentDisposition helper that emits both the legacy quoted filename= and the RFC 5987 filename*=UTF-8'' parameter with percent-encoded bytes, so international filenames round-trip while staying header-injection safe. sanitizeFileName also falls back to "download" on null/blank input. - ContextFileResourceTest: new cases for sanitizeFileName null/blank fallbacks and for buildContentDisposition ASCII/unicode/space/injection behaviour (18 tests, all passing). * Address copilot review comments on PR #27558 - AssetRepository.getByFqnPrefix: swap arguments so (assetType, fqnPrefix) matches the DAO signature — previous ordering always missed the index. - FolderResource / ContextFileResource getEntitySpecificOperations: return List.of() instead of null so callers iterating the returned list cannot NPE. - SearchUtils.getPageHierarchy: replace UUID.fromString with a parseUuid helper that returns null for missing/malformed values and logs a warning instead of failing the whole hierarchy response. - DaoListFilter: qualify the pageType column with the caller-provided tableName, rename getArticleCondition to getPageTypeCondition (legacy no-arg method kept as @Deprecated wrapper for compatibility). - Elastic/OpenSearch client processPageHierarchyHits: replace the per-hit getChildrenCountForPage search (N+1) with a single pass over the batch that derives childrenCount from pages whose parent is in the same result set. Also drops the now-unused helper and its throws clause. - openmetadata-sdk/pom.xml: mark JWT, JAX-RS client, Apache HttpClient, jakarta.json, parsson, and JUnit Jupiter as <optional>true</optional> so they don't leak into SDK consumers that only use the core client. - InMemoryAssetService: use the shared AsyncService executor for upload /read/delete instead of the JVM common ForkJoinPool. - sample-pricing.xlsx: replace the plain-text placeholder with a real minimal XLSX workbook so detection-based and extraction-based code paths see a valid Microsoft Excel 2007+ file. * Use one filters aggregation for page hierarchy childrenCount Follow-up to |
||
|
|
219c5683fa
|
ISSUE #3032 (#27912)
* feat: move flat sampling to sampling config + dynamic sampling option * feat: move flat sampling on the backend to sample profile conifg object * feat: fix circular import * feat: align UI with new profiler config * feat: fix json schema * feat: align python imports with new schema path * feat: update migration to look at extension * feat: remove enable * feat: remove enable * feat: added titles to sample config * feat: generated ts classes * feat: addressed comments * feat: change sample config instantiation to match new structure * feat: removed backward compatible fields * feat: ran java linting * feat: updated imports to point to generated files * feat: added dynamic sampler resolution logic * feat: ran python linting * feat: remove duplicate migration * chore: merge upstream and clean conflicts * feat: update logic to support dynamic and static sampling * feat: adjust sample config call * feat: test for statis, dynamic, row count and tier methods * feat: more sample config unit tests * feat: added tests for metric and sampling * feat: added tests to validate fallback is not called i nmetric computers * feat: strengthen profiler validation tests * feat: fix sampling config * feat: fix sampling config * feat: fix sampling config * feat: generated typescript models * feat: fixed missing dq pipeline migration * feat: fixed static check * feat: fixed ci failures * feat: fixed ci failures * feat: fixed unit tests faioure and linting * feat: fixed integration tests failures * chore: fixe burstiq refactor * chore: fix trino ci failures * chore: revert baseline.json file * chore: fix sampler availabl burst iq changes * feat: added smart sampling radio button * feat: ignore static checks errors * feat: ran ts linting * feat burstiq infinite recursion issue with dynamic as default * feat: translate i8n keys * feat: fix failing tests |
||
|
|
80375a7dc6
|
Add data access request support (#27879)
* Add DAR tasks
* Removed UI related changes of DAR
* nit
* Update generated TypeScript types
* fix linting issue
* Removed all languages changes
* nit
* removed white space
* add request data access button with owner/status conditions
* fix lint issue
* fix minor validation for data access button
* fix lint issue
* fix data access button visiable condition
* fix java lint checks and fix test cases
* nit
* fix test
* fix(tasks): model CreateTask.about as entityLink, validate target entity
Replace `about` (FQN string) + `aboutType` (string) with a single
`about` field of type entityLink (`<#E::{entityType}::{fqn}>`). The
resource layer parses the link and resolves it via
`Entity.getEntityReferenceByName(type, fqn, NON_DELETED)`, which
guarantees the target asset exists and is not soft-deleted.
Why: long-FQN data assets were rejected with `[query param name size
must be between 1 and 256]` because the modal was constructing a Task
`name` from the FQN. The `about` was modelled as a free string with
no schema validation that the target was a real, non-deleted entity.
The Threads API already uses entityLink for this exact purpose; tasks
now align with that pattern. The link is supplied as a hidden field
by the UI — users never see it.
Also fixes the missing `@ExtendWith(TestNamespaceExtension.class)` on
`DataAccessRequestIT` that caused four test failures in CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix unit test failure
* fix(test): await workflow stage transition in DataAccessRequestIT
The workflow advances the task from pending-workflow-start to review
asynchronously. Asserting on the object returned by create() was a
race condition. Use Awaitility to poll until the stage is review,
matching the pattern in IncidentTaskIntegrationIT.
---------
Co-authored-by: Sriharsha Chintalapani <harsha@getcollate.io>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Ram Narayan Balaji <ramnarayanb3005@gmail.com>
Co-authored-by: Ram Narayan Balaji <81347100+yan-3005@users.noreply.github.com>
|
||
|
|
e91c90c144
|
fix: validate custom property name charset (#27808)
* fix: validate custom property name charset
Tighten custom property name validation to block characters that break
downstream parsers, with verified empirical reproduction:
- `"` causes HTTP 500 on PUT /metadata/types/{id}
- `:` breaks CSV import — exporter writes `key:value;key:value`, importer
splits at first colon, treats prefix as the field name
- `^` breaks OpenSearch query when the name is in
searchSettings.searchFields — Lucene reads `^` as the boost separator
in `field^boost`
- `$` breaks CSV import via java.util.regex.Matcher.replaceAll which
interprets `$<letter>` as a backreference
Adds a `customPropertyName` definition in basic.json and switches
customProperty.json to reference it. Adds a defensive regex check in
TypeRepository.validateProperty so the API returns 400 with a clear
error message even if schema validation is bypassed.
Tests cover allowed-charset acceptance, the four blocked characters,
leading-character validation, max-length enforcement, and unbalanced
brackets.
* Update generated TypeScript types
* test: add schema-vs-Java consistency test for custom property name
Guards against drift between basic.json#customPropertyName and the
TypeRepository regex/length constants. If either side is updated
without the other, CI fails with a message pointing to both files.
The Java validator is kept (better error message + covers internal
callers that bypass the HTTP layer); the consistency test guarantees
the two definitions cannot drift.
* fix: extend custom property name charset after gap-coverage matrix
Re-ran the matrix on previously-untested chars (+ ? * ~ ` \) across all
17 property types × create/patch/CSV/search:
- + ? * ~ ` all pass cleanly on every operation × every property type — add to allow list
- \ fails CSV roundtrip for entityReference and entityReferenceList types
(escape inconsistency in CSV serialization) — add to block list
Updates the regex, schema description, Java validator error message, and
adds the new chars to the allow/block integration tests. Consistency
unit tests in TypeRepositoryTest continue to pass.
Final allow set: alphanumeric _ - . / & % # @ ! , ; = | ' + ? * ~ `
space ( ) < > [ ] { }
Final block set: " : ^ $ \
* Update generated TypeScript types
* updated the custom property name validation
* added name suffix in custom property name
* lint fixes
* include backslash in invalid char
Co-authored-by: Copilot <copilot@github.com>
* fixed the playwright issue
Co-authored-by: Copilot <copilot@github.com>
* lint fix
* fix check style
* Drop redundant Java validator for custom property name; tighten IT assertions
Schema is the single source of truth: jsonschema2pojo emits @Pattern + @Size
on CustomProperty.name from basic.json#/definitions/customPropertyName, and
@Valid on TypeResource.addOrUpdateProperty enforces them at the HTTP boundary.
The hand-written Pattern constant, validateCustomPropertyName, and the
schema-vs-Java sync test were duplicating that rule and could never reach the
HTTP user (Bean Validation always fires first via @Valid).
Tighten the new TypeResourceIT cases from assertThrows(Exception.class) to
assertThrows(InvalidRequestException.class) so a regression to a different
exception type or status code fails loudly.
* restrict few more special characters from Cp name
* minor fix
* Disallow & < > in custom property names; align IT cases
Schema-side counterpart to the UI changes in the previous two commits:
basic.json#/definitions/customPropertyName now blocks &, <, > alongside the
existing " : ^ $ \\. The DOMPurify pass on the UI sanitizes &, <, > into HTML
entities, which produced inconsistent persisted values; rejecting them at the
schema layer prevents that drift across all write paths.
IT updates:
- Drop &, <, > from the allowed-charset cases (and the "withMatched(pair)And<more>" composite)
- Add &, <, > to the disallowed-charset cases
- Drop "<" leading-character case (now covered as a disallowed character)
- Drop "<" / ">" unbalanced-bracket cases
* Update generated TypeScript types
* Close PATCH bypass for custom property name validation on Type
Bean Validation runs for the dedicated PUT /types/{id} (addOrUpdateProperty)
because the resource declares @Valid CustomProperty, and the createOrUpdate
path can't carry customProperties at all (CreateType schema doesn't include
the field). PATCH /types/{id} accepts an opaque JsonPatch, so @Valid never
reaches into the resulting customProperties[] — a JSON Patch like
[{"op":"add","path":"/customProperties/-","value":{"name":"bad:colon",...}}]
persisted bad-named properties (verified live: HTTP 200 before this fix).
Run Hibernate Validator programmatically inside TypeRepository.prepare() so
every write path enforces the schema-derived @Pattern / @Size / @NotNull on
each CustomProperty. The rule still lives only in basic.json — picked up via
the generated @Pattern annotation, executed via ValidatorUtil.validate.
Tests in TypeResourceIT:
- test_patchCannotAddCustomPropertyWithDisallowedName — seeds a valid property
to ensure /customProperties exists, then PATCHes appending a name with ':',
asserts InvalidRequestException and verifies the bad name is not persisted
- test_patchCanAddCustomPropertyWithValidName — guards against the fix
rejecting valid PATCH-driven additions
* Block * in custom property names — breaks ES field-path lookup
When the property name appears in extension.<propertyName>^boost entries of
searchSettings.searchFields, OpenSearch treats * as a field-path wildcard.
The literal * field never matches its own wildcard pattern, so the field
gets silently skipped from the query and Explore search returns no hit for
the value. Bisected against the running server: of 12 candidate Lucene-special
chars, only * actually breaks the mainline UI search flow. ? ~ ( ) { } [ ] /
! and space all returned hits via the searchFields path because OS looks up
the field literally and only treats * as a wildcard at that layer.
Updates the regex + description in basic.json/customProperty.json, the UI
regex in regex.constants.ts, the validation message across 19 locales, the
generated TS docstrings, the Playwright invalid-name fixtures and spec, and
the IT TypeResourceIT case (with*asterisk moves from allowed to disallowed).
* Validate only newly-added custom properties; isolate PATCH IT to fresh types
prepare() previously validated the entire customProperties[] on every Type
write. An upgraded instance with a legacy property whose name contained a
now-banned char would then reject any subsequent PUT/PATCH on that type,
even when the write only adds a different valid property. Move the name
validation into TypeUpdater.updateCustomProperties() and scope it to the
`added` list computed by recordListChange against the original entity. New
properties are still validated; pre-existing names are left alone.
Replace the IT PATCH cases' shared `topic` Type with a fresh, namespaced
entity-category Type per test (createEntityTypeForTest). The shared `topic`
was mutated concurrently by other tests in the class — combined with
PATCH's lack of per-type locking, that produced lost-update races and
flaky asserts. The fresh per-test type has customProperties: [] from
creation, so the patch sets the array directly without a seed property.
* chore: prettier formatting on the new asterisk-rejection test
* Potential fix for pull request finding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* docs: add + ? ~ ` to JSDoc allow-list to match the regex
* fix(it): request customProperties field on read-back in PATCH IT
Type.customProperties is a lazy field — TypeRepository.setFields only
populates it when the request URL includes ?fields=customProperties. The
default getTypeById helper omits the param, so the read-back always saw
customProperties == null. That made test_patchCanAdd... fail (the just-
persisted property wasn't visible) and made test_patchCannotAdd... pass
for the wrong reason (would have stayed green even if the bad name had
slipped through validation).
Add a fields-aware getTypeById overload and use it in both PATCH cases.
Empirically verified against the live server: good name returns 200 +
appears in customProperties, bad name returns 400 + does not.
* minor fix
* playwright test fix
* removed unecessary test
* blocked ~ and / from custom property name
* lint-fix
* Block / and ~ in custom property names (JSON Pointer reservations)
Forward slash and tilde are reserved by JSON Pointer (RFC 6901): / is the
path separator and ~ is the escape lead-in (~0 = ~, ~1 = /). Allowing
them in a property name shifts the burden onto every caller that builds
a JSON Patch by string interpolation; a raw `/extension/${propertyName}`
either splits into the wrong number of segments or contains an invalid
escape sequence, and the server applies the patch to the wrong key (or
400s outright).
This surfaced as a reproducible failure in the table-cp Playwright suite:
the preceding test ended with `path: \`/extension/${propertyName}\`` where
propertyName ended in `/`. The server addressed extension[name-without-/][""]
instead of extension[name-with-/], returned 400, and TableClass.patch
overwrote entityResponseData with the error body — stripping id and FQN.
The next test fell into the search-based navigation path with an empty
search term and timed out at 180s.
Tighten the schema regex in openmetadata-spec/.../basic.json#customPropertyName
to drop / and ~ from the allowed set; update the human-readable description
in basic.json and customProperty.json to call out the RFC 6901 reservation.
Move the with/slash and with~tilde cases from the allowed-charset IT to
the disallowed-charset IT in TypeResourceIT.
* Update generated TypeScript types
* Use fresh per-test Type in custom-property name validation IT
The five charset/length/lead-char tests added in this PR previously mutated
the shared built-in TABLE_ENTITY_TYPE under @Execution(CONCURRENT). The
PUT path acquires TYPE_PROPERTY_LOCKS so concurrent writes serialize, but
relying on that lock for test isolation is fragile — the PATCH-driven IT
in the same class already uses a per-test fresh Type via
createEntityTypeForTest(client, ns, ...) for exactly this reason
(see
|
||
|
|
f9d3c85d20
|
fix(search): restore live settings on per-entity promote path (#27920)
* Restore live index settings on per-entity distributed-promote path
DefaultRecreateHandler exposes two finalization paths:
- finalizeReindex(...) — centralized end-of-job promotion. Calls
applyLiveServingSettings + maybeForceMerge
before the alias swap, reverting the bulk
overrides (refresh_interval=-1, replicas=0,
async translog) back to live values
(refresh=1s, replicas=1, durable translog).
- promoteEntityIndex(ctx, ok) — per-entity promotion. Used by the distributed
search-indexer's "promote as soon as all
partitions for an entity complete" callback
(DistributedSearchIndexExecutor.promoteEntityIndex).
Swaps the alias and cleans up old indices —
but never restored live settings.
When an entity finishes its partitions before the final reconciliation
(typically the smallest entities — e.g. knowledge `page` with ~11 rows),
its index is promoted via the per-entity path, the alias swap succeeds,
and the bulk-build overrides become the new live settings. refresh_interval
stays at -1 in production, so live writes after the reindex are buffered in
the translog and never reach searchable segments until a manual _refresh.
Externally this surfaces as "create an article, hierarchy is empty until I
re-trigger reindex" — exactly the user-reported bug.
Mirror the finalizeReindex sequence by calling applyLiveServingSettings
(and maybeForceMerge for parity) at the top of the promote block in
promoteEntityIndex, before the alias swap.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Wire jobData into per-entity reindex promotion handler
DefaultRecreateHandler.applyLiveServingSettings reads from the handler's
jobData field (live + bulk index-settings overrides on the EventPublisherJob).
The per-entity distributed-promotion path in DistributedSearchIndexExecutor
created its own DefaultRecreateHandler instance and never called
withJobData(jobData) on it. With jobData=null, buildRevertJson returns null
and applyLiveServingSettings silently no-ops — meaning the previous fix
(
|
||
|
|
8dd98fa765
|
fix(it): Stabilize Flaky integration tests (#27546)
* fix(it): stabilize three flaky integration tests
- TagResourceIT.test_searchTagByClassificationDisplayName: raise Awaitility
timeout from 30s to 90s — under full-suite concurrent load the tag search
index can lag well past 30s before the tag is discoverable by classification
display name
- GlossaryOntologyExportIT.testExportGlossaryAsRdfXml: replace legacy
model.write("RDF/XML") with RDFDataMgr.write(RDFXML_PLAIN) — the legacy
Jena API attempts external DTD/entity resolution from w3.org, hanging ~104s
in network-isolated CI before the client times out at 60s; RIOT writes
purely in-memory with no network I/O
- SearchResourceIT.testExportWithFromAndSizeForPagination: add _id as a
final tiebreaker sort on export requests in both ElasticSearch and
OpenSearch managers; from/size pagination without a unique tiebreaker
produces duplicate rows across pages when concurrent CI tests mutate the
same index between requests; also deduplicate the redundant name.keyword
secondary sort when the caller already sorts by name.keyword
* fix(search): use id.keyword instead of _id for export sort tiebreaker
_id is an Elasticsearch meta-field that requires fielddata to sort on,
disabled by default. Use the indexed id.keyword sub-field instead, which
is a proper keyword field with doc values and is sortable without any
cluster setting changes.
* fix(it): retry pagination assertion in Awaitility to tolerate transient index shifts
from/size pagination on a shared search index can return duplicate rows
across two consecutive requests when concurrent tests mutate the same
index in between. Wrapping both page fetches and the assertion in
untilAsserted lets the check retry until the index stabilises rather
than failing on the first transient collision.
* revert(search): drop id.keyword tiebreaker; rely on test-side Awaitility retry
* fix(search): strengthen pagination test assertions and restore id.keyword tiebreaker sort
* fix(it): revert RdfRepository prod change; increase GlossaryOntologyExportIT timeout to 150s for Jena DTD stall in CI
* fix(it): restore tag search index aliases in IndexTemplateIT after index deletion
testDocUpdateOnDeletedIndexUsesTemplateNotAutoInference deletes the physical
openmetadata_tag_search_index and previously restored it with a bare PUT — leaving
all aliases (openmetadata_tag, openmetadata_classification, openmetadata_all)
missing for the remainder of the run. This caused TagResourceIT.checkCreatedEntity
to time out (searches on tag_search_index hit an empty bare index) and delete_by_query
cleanup ops to fail with index_not_found_exception on openmetadata_tag.
Fix: replace the bare PUT with Entity.getSearchRepository().createIndex() which
recreates the physical index with proper OpenMetadata mappings and restores all aliases.
* fix(it): isolate IndexTemplateIT tag test to avoid wiping production search index
testDocUpdateOnDeletedIndexUsesTemplateNotAutoInference was deleting the
production openmetadata_tag_search_index backing index, racing with
TagResourceIT.test_searchTagByClassificationDisplayName which polls that
index for 90s. Use a test-scoped index name matching the template pattern
instead, consistent with the other tests in this class.
* fix(it): make testClaimPendingIncludesRetryStatuses race-tolerant
The production SearchIndexRetryWorker (4 daemon threads, 5s poll) races
the test by calling the same global claimPending SQL. Replace the brittle
size-based assertion with an Awaitility loop that checks claimedAt != null
for each inserted record — proving claimPending's SQL filter accepted the
record's status regardless of which thread won the race.
* fix(it): avoid stale entityStatus in patch_addDeleteReviewers
The GlossaryTermApprovalWorkflow fires asynchronously when reviewers are
added, setting entityStatus=IN_REVIEW. The final patch sent the stale
entityStatus=APPROVED from the previous response, causing a spurious
IN_REVIEW→APPROVED transition in the diff which requires the caller to
be a reviewer — admin is not. Re-fetch the entity before the reviewer
removal so the diff contains only the reviewer change.
* fix(it): handle claimedAt reset in testClaimPendingIncludesRetryStatuses
updateFailureAndRetryCount sets claimedAt=NULL after the worker processes
a record. Add retryCount > 0 as a secondary proof-of-claim signal so
records that were claimed, processed, and had claimedAt reset are still
counted — covers the FAILED exhaustion path and intermediate PENDING_RETRY_*
states where claimedAt is temporarily null.
* fix(it): avoid governance workflow race in testApplyFeedback_withRecognizerMetadata
repository.create() publishes a ChangeEvent that triggers
ApplyRecognizerFeedbackImpl asynchronously. That workflow call races
with the direct applyFeedback below: by the time the workflow runs, the
GENERATED tag is already removed by the direct call, so
getRecognizerIdFromTagLabel returns null and the workflow falls back to
ALL recognizers, contaminating recognizer2.
Fix: insert directly to DAO (bypassing publishChangeEvent) so the
governance workflow is never triggered for this unit-level test.
* fix(it): handle worker deleteByEntity path in testClaimPendingIncludesRetryStatuses
The worker's processRecord takes the delete path (removeStaleEntityById +
deleteByEntity) when resolveEntityReference returns null but entityId is
non-empty — which applies to our fake UUID test records. If the worker
wins and deletes a record, findByStatus finds nothing and the assertion
fails.
Fix: track which IDs are still visible in any status. An ID absent from
all statuses was deleted by the worker after a successful claim —
deleteByEntity is only reached after claimPending accepted the record,
so absence is equally valid proof that claimPending's SQL filter worked.
* fix(it): re-fetch before reviewer patches in test_glossaryTermReviewersMultipleUpdates
Same root cause as patch_addDeleteReviewers: GlossaryTermApprovalWorkflow
fires asynchronously after reviewers are added, setting entityStatus=IN_REVIEW.
Subsequent patches using the stale APPROVED status from the previous response
trigger a spurious IN_REVIEW→APPROVED transition, rejected because admin is
not a reviewer. Re-fetch before each subsequent patch to avoid the stale status.
---------
Co-authored-by: Mohit Yadav <105265192+mohityadav766@users.noreply.github.com>
|
||
|
|
7039be31e9
|
Container children: search + Deleted toggle, rename tab to Containers (#27893)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (mysql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (postgresql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Container children: search + Deleted toggle, rename tab to Containers Adds case-insensitive substring search via ?q= and a Deleted include toggle to /v1/containers/.../children, mirroring the controls already on the service-level Containers tab. Renames the Container detail page's "Children" tab to "Containers" (URL slug unchanged) for consistency with the file-system-style nested navigation. Backend - ContainerResource.listChildren accepts ?q=; ContainerRepository forwards it as a LOWER(name) LIKE ESCAPE '!' bind. '!' is preferred over backslash because JDBI's ColonPrefixSqlParser mishandles literal '\\' inside SQL string literals and silently leaves a downstream :includeDeleted bind un-substituted. - Searches bypass ChildrenPageCache (cache hit rate per substring ≈ 0). UI - ContainerChildren.tsx adds a SearchBar and a Deleted Switch (existing Table component pattern); state changes reset to page 1. - ContainerDetailUtils + ContainerDetailsClassBase render the tab with label.container-plural; EntityTabs.CHILDREN slug preserved. Tests - Backend IT: 3 new ContainerResourceIT tests cover q= substring match, LIKE-wildcard escaping, and per-level scoping of include=deleted (a soft-deleted descendant must NOT bubble up to ancestor /children listings). - Playwright: 5 new tests across two describe blocks — search scoping vs sibling parent, Deleted toggle reveal/hide, search + toggle composition, plus multi-level scoping (grandparent toggle ON returns empty; parent toggle ON reveals the deleted leaf). - Updated ContainerChildren.test.tsx and ContainerPage.test.tsx for the new include= bind on the API call and the renamed tab label. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address PR review: 5 Copilot comments on container children - ContainerChildren.tsx: replace FQN-keyed stale-response guard with a monotonic token. The old guard caught navigation-between-containers but not overlapping fetches against the SAME container — a slow request from before a search/toggle change could overwrite a newer filtered result. - ContainerChildren.tsx: reset searchValue + showDeleted on decodedContainerName change. Without this, a Deleted toggle ON or a search query carried over to the next container the user navigated to, making it look empty with no URL state to explain why. - ContainerDetailUtils.tsx: honor labelMap for the CHILDREN tab in the dataModel-non-empty branch (was hard-coded to label.container-plural; the dataModel-empty branch already honored labelMap). - ContainerDetailsClassBase.ts: override displayName for EntityTabs.CHILDREN in the customize-page editor — TAB_LABEL_MAP renders it as "Children" globally (still correct for Directory entities), but the live Container detail page now shows "Containers", and the editor needs to match. - ContainerResourceIT.java: docstring for filterByQuery_escapesLikeWildcards referenced backslash escaping; updated to match the actual ESCAPE '!' implementation in buildNameLikeBind. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b5374f9fec
|
Reindex robustness: selective fields, cache fail-fast, stop actually stops (#27876)
Some checks are pending
Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Waiting to run
Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Blocked by required conditions
Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Waiting to run
Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Blocked by required conditions
Java Checkstyle / java-checkstyle (push) Waiting to run
Maven Collate Tests / maven-collate-ci (push) Waiting to run
OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (mysql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (postgresql) (push) Blocked by required conditions
OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions
OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions
Publish Package to Maven Central Repository / publish-maven-packages (push) Waiting to run
* Reindex robustness: selective fields, cache fail-fast, stop actually stops
Three independent fixes that all surfaced from the same incident: a 580k-
container reindex that froze for hours, then refused to actually stop when
the user clicked Stop.
Selective fields in the distributed reader path. PartitionWorker was
hardcoding List.of("*"), triggering every fieldFetcher in setFieldsInBulk —
including fetchAndSetOwns on Team/User where every owned entity becomes a
getEntityReferenceById round-trip. PR #27723 fixed this for EntityReader
(single-server) but the distributed pipeline never picked it up. Lifted the
field-resolution into ReindexingUtil so both paths share one source of
truth.
Cache layer no longer flaps on a single Redis hiccup. RedisCacheProvider
used to flip the whole provider unavailable on the first 300 ms timeout and
flip back on the next PING success — which combined with a 1 s health-check
made the indexer pay one timeout per cycle indefinitely. Replaced with a
sliding-window failure detector (5 failures in 30 s to trip, 3 consecutive
successes to recover) on the BulkCircuitBreaker pattern.
CacheWarmupApp parsed user config as EventPublisherJob (the SearchIndex
schema), which broke the Configuration page once cacheWarmupAppConfig.json
gained a type discriminator. Switched to CacheWarmupAppConfig in all four
parse sites and decoupled runtime status/stats from the parsed config.
Removed the readAppConfigFlags() workaround that read warmBundles /
enableDistributedClaim out of a raw map. Bails with ACTIVE_ERROR (not
COMPLETED) when an entity type is only partially warmed; retries on
transient cache unavailability instead of giving up on the first miss.
Stop actually stops. Three pieces:
- DistributedJobStatsAggregator skips the WebSocket status broadcast while
the job is STOPPING so it doesn't overwrite the AppRunRecord.STOPPED that
AppScheduler.updateAndBroadcastStoppedStatus pushed. Self-stops after a
30 s grace if the executor never gets to call stop() on it.
- DistributedSearchIndexExecutor.stop() now calls workerExecutor.shutdownNow()
after flagging workers, so threads parked inside the bulk-sink semaphore,
initializeKeysetCursor, or waitForSinkOperations (5-min deadline) get
interrupted instead of grinding for minutes.
- OpenSearchBulkSink replaces concurrentRequestSemaphore.acquire() with a
60-second tryAcquire, recording permanent failure on timeout. A leaked
bulk future (callback never fires) can no longer permanently freeze every
subsequent flush at a fixed record count.
|
||
|
|
8cec97b52c
|
Containers: FQN-driven hierarchy listings + cascade-delete orphan fix (#27878)
* Containers: FQN-driven hierarchy listings + cascade-delete orphan fix
Stops `?root=true&service=...` and `/containers/.../children` from leaking
deeply-nested orphans, fixes the source bug that produced them, and corrects
the 1.13.0 fqnHash pattern index opclass.
Listing path
- ListFilter.getFqnPrefixCondition now binds both <param>Hash and
<param>HashChild ('<hash>.%' and '<hash>.%.%') so depth-aware listings
can require "exactly one segment below the prefix" via a single LIKE +
NOT LIKE pair on fqnHash. Same shape works at any tree depth.
- ContainerDAO.listRoot{Before,After,Count} swap the NOT EXISTS anti-join
on entity_relationship for fqnHash NOT LIKE :serviceHashChild. The FQN
is the canonical hierarchy in OpenMetadata; the relationship table is
no longer consulted for hierarchical listings.
- ContainerRepository.listChildren rewritten: no parent-by-name lookup, no
findToWithOffset/countFindTo on entity_relationship, no second-hop
hydration. Single SQL roundtrip + slim projection via
listDirectChildSummariesByParentHash. Orphans whose parent CONTAINS row
is missing are now correctly placed under their FQN-implied parent.
- Both endpoints honour ?include=non-deleted|all|deleted; ChildrenPageCache
key includes the include tag so toggling the UI Deleted switch doesn't
return a stale page from the other side.
- ContainerResource.listChildren accepts ?include= for parity with the
root listing.
Cascade-delete orphan source (EntityRepository.processDeletionBatch)
- Removed the redundant pre-batch-delete of relationships and the
swallow-all try/catch in the per-child loop. cleanup() per entity now
owns row removal AND relationship deletion atomically; exceptions
propagate so the loop stops on first failure with per-child atomicity.
Stops the orphan-without-relationships pattern that the listing change
defends against.
Migration correction (1.13.0 postgres fqnHash pattern indexes)
- Recreate 23 idx_*_fqnhash_pattern indexes with text_pattern_ops instead
of varchar_pattern_ops. The planner casts the column to text when the
LIKE RHS is text-typed (every JDBC setString call), so
varchar_pattern_ops doesn't match the resulting (fqnhash)::text ~~
expression. Confirmed via EXPLAIN ANALYZE on a 580k-row table: the same
query drops from ~470ms cold (Parallel Seq Scan) to <1ms (Index Scan).
Tests
- ListFilterTest: 3 unit tests covering both binds, dotted/quoted service
name special-char handling, and include= flowing through alongside the
service prefix.
- ContainerResourceIT: 8 integration tests covering depth correctness at
every level (5-level chain), orphan exclusion at root, orphan
discoverability under FQN-implied parent, sibling subtree isolation,
the include toggle on both endpoints, and large-batch hard-delete
leaving no orphan rows or relationships.
Closes #27870 (subset of its listing-side intent shipped here as a single
FQN-depth predicate; PR's cascade fix and both new tests picked up
verbatim).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address review comments on #27878
- ContainerDAO.listRoot* override now defaults :serviceHashChild to '%.%.%'
via rootListingParams() when ?service= is absent. Previous code
unconditionally referenced the bind, so ?root=true without a service
filter crashed at runtime with a missing-named-parameter error.
- Migration 1.13.0/postgres/schemaChanges.sql now DROP INDEX CONCURRENTLY
IF EXISTS before each CREATE so already-upgraded environments (which
have the original varchar_pattern_ops indexes) get the index recreated
with text_pattern_ops on next deploy. Fresh installs see the DROP as
a no-op. Comment block updated to record the recreate intent.
- ChildrenPageCache include tag for ALL changed from "all" to "a" so the
CacheKeys.childrenPage Javadoc's "1-2 char" promise holds (now nd/a/d
are all <=2 chars).
- ContainerRepository.includeToBindString Javadoc corrected: it described
the SQL as a CASE expression, but listDirectChildSummariesByParentHash
actually uses a three-branch OR chain.
- ListFilterTest: added test_noServiceFilter_doesNotBindServicePatterns
as a regression guard for the missing-bind bug.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Fix java style
* Address second review pass on #27878
- EntityRepository.processDeletionBatch wraps per-child cleanup exceptions
with entityType + entityId context before re-throwing. The exception
still propagates (so the loop still stops, failure-semantics contract
unchanged); operators now get a stack trace that names the row that
blocked a large recursive delete instead of an opaque error.
- CacheKeys.childrenPage Javadoc now lists the actual include tags
("nd" / "a" / "d") and points at ChildrenPageCache.includeTag as the
authoritative source. Earlier comment still mentioned "all" after the
switch to single-letter tags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Test: ?root=true without service filter end-to-end (#27878 review)
Adds test_rootListing_withoutServiceFilter_returnsRootsAcrossAllServices
to ContainerResourceIT. Creates two distinct storage services, each with
a root container and a child container, then asserts that GET
/containers?root=true (no service filter):
- Succeeds (rootListingParams() defaults :serviceHashChild to '%.%.%' so
the SQL has its bind even when ListFilter.getServiceCondition didn't
add it).
- Includes root containers from both services (cross-service listing
works without a service prefix narrowing the candidate set).
- Excludes child containers from either service (depth check still
applied via the default bind).
Regression guard for the bug Copilot's review pass flagged at
CollectionDAO.java:784: 'GET /containers?root=true (no service) crashes
at runtime due to a missing named parameter.'
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Use generated name column instead of JSON extract in container summary queries
storage_container_entity has 'name' as a STORED generated column derived
from json->>'name' (see bootstrap/sql/schema/postgres.sql). Both slim
projection queries (findContainerSummaryRows and listDirectChildSummariesByParentHash)
were redundantly extracting it via JSON_UNQUOTE(JSON_EXTRACT(...)) on MySQL
and json->>'name' on Postgres — work the database had already done at insert
time.
Reading 'name' as a column directly:
- Saves one JSON op per row on every page fetch
- Lets ORDER BY name sort on the indexed generated column rather than a
per-row JSON-extracted expression
displayName, fullyQualifiedName, and description stay as JSON extracts —
they aren't generated columns. (description in particular shouldn't be:
free-text fields can be many KB and a STORED generated column would
double the row size on disk.)
Row mapper unchanged — column labels in the SELECT list still match.
* Fix inaccurate ListFilterTest comment and Javadoc link to private method
ListFilterTest: the prefix-pattern comment said the LIKE patterns 'exclude'
direct/grandchildren — patterns themselves match, the SQL's NOT LIKE is
what excludes. Rewrote to show how ContainerDAO.listRoot* combines LIKE
and NOT LIKE on the two binds.
CacheKeys.childrenPage: the @link pointed at ChildrenPageCache#includeTag
which is private static; Javadoc tooling renders that as an unresolved
link. Redirected to the public Include enum the tag is derived from.
* Log original exception in recursive batch delete catch before wrapping
Wrapping the caught RuntimeException into a new one (with entity context
in the message) preserves the original via the cause chain, but the outer
exception mapper sees the wrapper and renders a generic 500 — the original
type information doesn't surface to operators investigating a failed
delete.
Adds a LOG.error before the wrap so the original exception (with full type
and stack) lands in the logs adjacent to the entity context, giving
operators enough signal to diagnose what actually blocked the delete.
* Restore failure-semantics comment block on recursive batch delete wrap
* use Entity.SEPARATOR instead of hard-coding
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
* fix check style
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: sonika-shah <58761340+sonika-shah@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
||
|
|
2a11bd53b0
|
test: cover glossary CSV typed relations across unit, IT, Playwright (#27740)
* test: cover glossary CSV typed relations across unit, integration, and Playwright Add round-trip and validation coverage for typed glossary term relations (synonym, broader, narrower, custom types from settings) in CSV import/export. Unit tests in CsvUtilTest exercise addTermRelations serialization edge cases. Integration tests in GlossaryCsvRelationTypesIT add a true export-reimport round-trip, mixed-format per-term API verification, asymmetric relation visibility from both sides, and a custom-relation-type round-trip via GlossaryTermRelationSettings. Playwright specs in GlossaryImportExport.spec.ts drive the UI bulk import/export flow with typed relations and assert the validation surface rejects unknown relation types. |
||
|
|
72e528c695
|
Fix flaky IT tests: incident-id tracking + GlossaryOntologyExportIT isolation (#27867)
* Fix flaky IT tests by snapshotting create/update/patch responses in SDK |
||
|
|
620d1b6ad9
|
Cache audit fixes: tag rename + relationship invalidation, bundle warmup (#27864)
* Cache audit fixes: tag rename invalidation, relationship invalidation, bundle warmup Fix two write-through cache correctness bugs and enhance the warmup app: Bug A — Tag/Glossary/Classification rename now invalidates the cached entity JSON for every entity tagged with the renamed tag. Adds invalidateCacheForTaggedEntities helpers in EntityRepository that use the search index (same source updateClassificationTagByFqnPrefix already uses) to enumerate affected entities, then call invalidateCacheForEntity for each. Wired into TagRepository, ClassificationRepository, and both rename + parent- move paths in GlossaryTermRepository. Bug B — Direct addRelationship/deleteRelationship and bulk variants now invalidate the cached bundle/owners/domains on both sides. Bot/Domain/Data Product (already in UNCACHED_ENTITY_TYPES) short-circuit in invalidateCacheForEntity so cascade-heavy delete paths don't pay for Redis ops on keys that were never written. Warmup enhancements — new BundleWarmupBatcher pre-warms the per-entity bundle cache (tags + certification) using batched tag_usage queries; relations stay lazy. CacheWarmupApp adds per-entity-type checkpoint resume, opt-in SETNX-based distributed claim for multi-instance deploys, and per-entity-type cache.warmup.coverage / cache.warmup.bundle.coverage gauges. CacheProvider gains scanCount; CacheMetrics gains coverage gauges and a completed_runs counter. cacheWarmupAppConfig.json drops dead consumerThreads/queueSize and adds warmBundles + enableDistributedClaim flags. Tests — BundleWarmupBatcherTest covers the batcher with mocked DAO/cache. TagRenameCacheIT and RelationshipCacheInvalidationIT cover the bug fixes end-to-end under the cache-tests / postgres-os-redis profile. * Update generated TypeScript types * Address PR review: claim ownership, config wiring, coverage, keyspace Six follow-ups against PR #27864: - releaseClaim now does a compare-and-delete: GET the owner first and only DEL when it matches our instanceId. Previously a 10-min TTL could expire mid-warm, another instance could claim, and we'd blindly delete its lock. - warmBundles / enableDistributedClaim are now read from the user-supplied app config map (matching the cacheWarmupAppConfig.json schema fields) instead of being JVM-system-property-only. System properties remain a fallback for bootstrap / unedited records. - Per-entity-type checkpoint and claim keys now embed cacheConfig.redis .keyspace so two environments sharing one Redis with different keyspaces no longer collide on warmup metadata. - reportCoverage now prefers scanCount over current-run success delta when available, so resumed runs report end-state coverage instead of the artificially low partial number. - CacheWarmupApp class Javadoc updated to reflect that bundle pre-warm is on by default; intentionally-not-done text was stale. - RelationshipCacheInvalidationIT Javadoc clarifies that the assertion is on the cacheable table side; Domain itself is in UNCACHED_ENTITY_TYPES. - CacheProvider.scanCount Javadoc documents the O(DBSIZE) cost so callers don't drop it on the request path. - EntityRepository.invalidateCacheForTaggedEntities Javadoc strengthens the search-lag tradeoff section with the actual fallback (entity TTL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * PR review: accept string and Boolean forms for warmup config flags readAppConfigFlags previously only accepted Boolean.TRUE / Boolean.FALSE. Depending on how the app config arrives (typed POJO, raw JSON, YAML env-var override, API string body) the same logical value can land as a "true"/"false" string — and the instanceof Boolean check would silently ignore it and fall back to the JVM system property. Operators setting the flag in the UI would see no effect. Added a small parseBooleanFlag helper that handles both shapes. |