OpenMetadata

mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

Sriharsha Chintalapani d3bbbefe37 fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999)

* fix(rdf): dedupe lineage edges and broaden PROV-O coverage

The RDF Knowledge Graph endpoint was emitting two edges per lineage relationship, once as `om:UPSTREAM` (forward) and once as `prov:wasDerivedFrom` (reverse), because the parser preserved each predicate's native subject/object orientation instead of canonicalizing both into a single `(upstream, downstream)` edge.

Also extend PROV-O coverage so external SPARQL clients can use the W3C Provenance vocabulary directly:
- `prov:Entity` / `prov:Activity` / `prov:Agent` class typing on datasets / pipelines / users
- `prov:wasAttributedTo` mirror of `om:owners`
- `prov:generated` (inverse of existing `wasGeneratedBy`) and `prov:used` on lineageDetails so the Entity → Activity → Entity chain is complete
- `prov:hadPlan` + `prov:Plan` for SQL transformation recipes
- `prov:startedAtTime` / `prov:endedAtTime` on Activity instances
- `prov:wasAssociatedWith` Activity → Agent linking
- `prov:invalidatedAtTime` on soft-deleted entities

Other RDF cleanups in the same area:
- LineageDetails URIs are now deterministic (driven by from/to ids instead of a timestamp), so re-indexing collapses duplicate Activity resources via the existing DELETE+INSERT idempotency
- Skip emitting the redundant `om:owners` JSON-string literal; the mapped path already produces clean `om:hasOwner <agent>` triples
- Skip empty `[]` array literals in the unmapped path
- Propagate failures from `RdfRepository.{addRelationship, addLineageWithDetails, bulkAddRelationships, bulkAddGlossaryTermRelations}` instead of silently swallowing them, so downstream callers can surface the failure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): surface Fuseki failures in app run record

Per-entity and per-batch failures from the RDF index app used to be logged via SLF4J only; they never made it
into the AppRunRecord, so the UI/run history showed "completed" even when every entity had silently failed to write to Fuseki.
- `RdfBatchProcessor.processEntities` now captures the last error per entity, returns it in `BatchProcessingResult.lastError`, and accumulates relationship-processing failures into the same result.
- Relationship and lineage processing methods (`processBatchRelationships`, `processLineageRelationship`, `processGlossaryTermRelations`) return structured results with failure counts and last-error messages instead of `void`, so failures are visible to the partition worker.
- `RdfIndexApp` records the failure on `jobData` for both the distributed and non-distributed code paths, so users see a real error message in the run history (e.g. "Failed to write entity X to Fuseki: ConnectException").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* perf(rdf-index-app): port distributed-mode improvements from SearchIndex

The RDF distributed-indexing fork was lagging behind several SearchIndex improvements that addressed concrete reliability and throughput issues. Port them across:

Core perf / reliability
- Precomputed partition start cursors: coordinator walks each entity once via keyset pagination at job init and caches the boundary cursor per (jobId, entityType, rangeStart). Workers consult the cache before falling back to the OFFSET-based path. Eliminates the previous O(N²) per-partition cursor lookup.
- `cancelInFlightPartitions` + `requestStop` + `checkAndUpdateJobCompletion` on the coordinator. Stop now cancels both PENDING and PROCESSING partitions in a single SQL update and immediately drives the job status from STOPPING → STOPPED, so the UI status no longer hangs while workers drain.
- Selective field hydration: `RdfPartitionWorker.readEntitiesKeyset` uses `ReindexingUtil.getSearchIndexFields(entityType)` instead of `List.of("")`, avoiding expensive fetchers (e.g. fetchAndSetOwns) per batch.
- Partition heartbeat thread: virtual thread refreshes `lastUpdateAt` every 30s for partitions actively being processed by this server, so the stale reclaimer no longer interrupts active work.
- `MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5` backpressure: claim path rejects when the server already holds 5 PROCESSING partitions, giving fair distribution across pods. Verified the existing claim DAO uses `FOR UPDATE SKIP LOCKED` for both MySQL and Postgres.
- Gate WebSocket stat broadcasts during the STOPPING phase so the Quartz-scheduler-driven STOPPED status push isn't overwritten.

Multi-server scaffolding (single-pod is unaffected)
- `RdfPollingJobNotifier`: DB-polling discovery for other server pods to find an in-flight RDF reindex they can join.
- `RdfEntityCompletionTracker`: per-entity-type partition tracking with callback firing once all partitions for an entity complete, foundation for early per-entity index promotion.

Tests: precomputed-cursor cache lookup, in-flight backpressure, cancelInFlight delegation, completion tracker callback semantics, notifier start/stop.

DAO additions on `rdf_index_partition`:
- `cancelInFlightPartitions(jobId, now)`: covers both PENDING and PROCESSING in one statement
- `countInFlightPartitionsForServer(jobId, serverId)`: backpressure
- `countPartitionsByStatus(jobId, status)`: used by completion check

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix(ui-apps): hide misleading data on synthetic 'CurrentConfig' row

When an app has no run history, AppRunsHistory fabricated a synthetic placeholder row that looked like a real run: `runType: "CurrentConfig"`, a fake `Run At` timestamp pulled from `appData.updatedAt`, an ever-growing `Duration` (`now − updatedAt`), and an active `Stop` button that targeted nothing.

Render `--` for `Run At`, `Run Type`, and `Duration` on synthetic rows, and hide the `Stop` button so users no longer see "Run now → 19-minute Running with Stop button" when the actual job never registered.
Real app runs are unaffected; they still display `runType` from the backend (OnDemandJob, Hourly, Daily, Custom, etc.).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address PR review findings

Four issues raised in PR #27999 review:
- Cursor format consistency in walkAndRecord (bug): The defensive branch produced cursors via a custom `{name, id}` map while the regular path used `repo.getCursorValue()`. For entities with quoted names these encodings diverge: a quoted-name entity could land in the cache with a cursor incompatible with what the worker fetches via keyset pagination. Track the last seen entity reference and run it through `repo.getCursorValue()` in both paths. `encodeBoundaryCursor` is removed.
- Adaptive scheduling in RdfPollingJobNotifier (perf): The previous implementation woke the scheduler thread every 1s and short-circuited inside the poll method when idle. Reschedule the task at the appropriate interval (1s active / 30s idle) when `setParticipating` flips, so the thread genuinely sleeps when idle.
- Cursor cache cleanup on startup recovery (edge case): `partitionStartCursors` was only evicted by `refreshAggregatedJob` / `checkAndUpdateJobCompletion`. If a coordinator crashed mid-job and never reached either, the cache entry leaked until process restart. Add `evictStaleCursorCacheEntries()` invoked by `performStartupRecovery` that drops entries for jobs that no longer exist in the DB or are already terminal.
- Consolidate describeError helpers (quality): `describeError`, `describeBulkError`, and `describeLineageError` in `RdfBatchProcessor` all walked the cause chain and formatted a prefixed message with the same logic. Reduced to a single `describeError(prefix, error)` plus a thin `describeEntityError` adapter for the per-entity call site.
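
The consolidated helper in the last item can be pictured roughly as follows. This is a minimal sketch, not the code from the PR: the class name `ErrorDescriptions`, the "prefix: ExceptionType: message" output format, the 32-hop bound, and the "unknown error" fallback are all assumptions made for illustration. Only the method names `describeError(prefix, error)` and `describeEntityError` come from the commit message.

```java
// Hypothetical sketch of a single cause-chain formatter plus a thin per-entity adapter.
final class ErrorDescriptions {
  private ErrorDescriptions() {}

  /** Walks getCause() to the root and returns "prefix: RootType[: message]". */
  static String describeError(String prefix, Throwable error) {
    if (error == null) {
      return prefix + ": unknown error"; // assumed fallback wording
    }
    Throwable root = error;
    int hops = 0; // bound the walk so a cyclic cause chain cannot loop forever
    while (root.getCause() != null && root.getCause() != root && hops++ < 32) {
      root = root.getCause();
    }
    String message = root.getMessage();
    String detail = root.getClass().getSimpleName() + (message == null ? "" : ": " + message);
    return prefix + ": " + detail;
  }

  /** Per-entity adapter: it only builds the prefix and delegates. */
  static String describeEntityError(String entityName, Throwable error) {
    return describeError("Failed to write entity " + entityName + " to Fuseki", error);
  }
}
```

Under these assumptions, `ErrorDescriptions.describeEntityError("X", new RuntimeException("outer", new java.net.ConnectException()))` yields the same shape as the run-history example quoted earlier in this commit message.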
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): avoid double workerExecutor.shutdownNow() in stop()

stop() called workerExecutor.shutdownNow() inline AND through cleanupLocalExecution -> shutdownWorkerExecutor, which broke the DistributedRdfIndexExecutorTest.stopAndCoordinatorCleanupOnlyTearDownLocalExecutionOnce verify(workerExecutor, times(1)).shutdownNow() expectation. Drop the inline call; cleanupLocalExecution is the single owner of the shutdown path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: drop redundant DB matrix from openmetadata-service unit tests

The {mysql, postgresql} strategy matrix on openmetadata-service unit tests doubled CI cost without adding signal: both jobs ran the same surefire suite. The `-Pmysql` / `-Ppostgresql` profiles are defined only in `openmetadata-sdk/pom.xml` (lines 190-206), set a single `test.database` property, and that property is consumed exclusively by the failsafe plugin (integration tests `IT.java` / `IntegrationTest.java`), which only runs under `-Pintegration-tests`, which is not enabled here.

`openmetadata-service` itself has zero tests that read `test.database` or use `MySQLContainer`/`PostgreSQLContainer` (verified by grep). The only testcontainer-based DB code in the repo lives in `openmetadata-integration-tests`, a different module that this workflow doesn't build.

Run the unit suite once. The `openmetadata-service-unit-tests-status` required-check aggregator is unaffected (it depends on the renamed job which still has the same name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address Copilot PR review findings

Six correctness issues raised on PR #27999:
- Lineage-details DELETE was too broad (RdfRepository): the cleanup step deleted all `<fromUri> om:hasLineageDetails ?d` triples, so reindexing one (fromId, toId) edge wiped lineage-details links for every other downstream of the same source entity.
Pin the delete to the specific `<fromUri> om:hasLineageDetails <detailsUri>` triple. Same with prov:generated cleanup: anchor it to the specific detailsUri instead of any details resource.
- Predicate not flipped during canonicalization (RdfRepository): `parseEntityGraphEdgesFromResults` swapped subject/object for reverse-direction predicates (`prov:wasDerivedFrom`, `prov:wasInfluencedBy`) but kept the original predicate URI on the resulting EdgeInfo. Exported graphs could carry semantically invalid triples like `<upstream> prov:wasDerivedFrom <downstream>`. Add `forwardEquivalentPredicate` to substitute the OM-native forward predicate when the direction flips.
- `dct:modified` was an invalid xsd:dateTime (RdfPropertyMapper): `entity.getUpdatedAt().toString()` returns the epoch-millis Long as a string, but the literal was tagged `xsd:dateTime`. Convert via `Instant.ofEpochMilli(...).toString()` so the lexical form matches the type; the same fix is already in place for prov:invalidatedAtTime.
- Unmapped EntityReference arrays were dropped entirely (RdfPropertyMapper): the previous fix to skip noisy JSON-string literals also dropped fields like `domains`, `reviewers`, `voters` for entity contexts that don't have a JSON-LD mapping for them; the unmapped path was the only path emitting them, so nothing landed in RDF. Expand each array element through `addEntityReference` so the data still produces proper `om:<fieldName> <ref>` triples; mapped-path duplicates are collapsed by Jena's Model dedupe.
- Partition failure detection missed reader errors (DistributedRdfIndexExecutor): the EntityCompletionTracker was fed `result.errorMessage() != null`, but `RdfPartitionWorker` can increment `failedCount` from `readerErrors` without ever setting `lastError`. Use `result.failedCount() > 0` so partitions whose failures came from `ResultList.getErrors()` are also marked as failed when promoting an entity.
- `COMPLETED_WITH_ERRORS` was hidden when failedRecords == 0 (RdfIndexApp): the coordinator marks a job COMPLETED_WITH_ERRORS whenever any partition is FAILED or CANCELLED, including for user-initiated stops where no record-level failures accrued. The monitor's `completedWithErrors` gate required `failedRecords > 0`, so those terminal states never hit `jobData.setFailure(...)` and the run record showed success. Drop the failedRecords precondition and tailor the fallback message based on whether there are record-level failures or partition-level only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): separate relationship failures + type lineage as prov:Activity

Two more PR review findings on #27999:
- Relationship failures inflated failedRecords stat: `processEntities` was folding relationship/lineage edge failures into `failedCount`, which becomes `failedRecords` in the index stats. Records there mean entities, computed from entity counts in `totalRecords`. Counting per-edge relationship failures could push `failedRecords` above `processedRecords`/`totalRecords` and produce nonsensical per-entity stats. Track them separately: add `relationshipFailureCount` to `BatchProcessingResult` and `PartitionResult`. `failedCount` now stays entity-level. The completion tracker is fed the broader `result.hasAnyFailure()` so partitions where relationship triples failed don't get prematurely promoted as success even though their entity writes succeeded.
- `detailsResource` wasn't typed as prov:Activity: the resource carries Activity-shaped predicates (prov:startedAtTime, prov:endedAtTime, prov:used, prov:hadPlan, prov:wasGeneratedBy, prov:wasAssociatedWith) but only the OM-specific `om:LineageDetails` rdf:type. Add an explicit `rdf:type prov:Activity` so PROV-O reasoners and federated SPARQL clients recognize it as an Activity without having to learn the OM type.
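
The split between entity-level and edge-level failure counts in the first item might look something like the record below. It is a sketch under assumptions: the real `BatchProcessingResult` almost certainly carries more fields, and the `successCount` component and the exact body of `hasAnyFailure()` are guesses. The point it illustrates is that `failedCount` stays an entity count while the broader predicate still reports relationship failures.

```java
// Hypothetical shape; only failedCount, relationshipFailureCount, lastError and
// hasAnyFailure() are named in the commit message, the rest is assumed.
record BatchProcessingResult(
    int successCount, int failedCount, int relationshipFailureCount, String lastError) {

  /** True when either entity writes or relationship/lineage edge writes failed. */
  boolean hasAnyFailure() {
    return failedCount > 0 || relationshipFailureCount > 0;
  }
}
```

With this shape, a batch of 10 entities whose writes all succeeded but with 3 failed edge writes reports `failedCount() == 0`, so per-entity stats stay bounded by the entity total, while `hasAnyFailure()` still returns `true` for the completion tracker.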
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): label lineage edges relative to focal node

The Knowledge Graph view was labeling every edge with relation type "upstream" as "Upstream" regardless of direction relative to the focal node. For a focal node F, the raw stored relation `(F, X, upstream)` means "F is upstream of X", i.e. X is downstream of F. The previous output labeled both `F → X` and `X → F` edges as "Upstream", which made bidirectional lineage look like a duplicated relation.

Re-orient the label in `convertEdgesToGraphData` based on whether the focal is the edge's source or target:
- focal → X → "Downstream"
- X → focal → "Upstream"
- non-focal-touching edges keep the raw relation label.

Reported on a sample-data table with a circular lineage cycle (`dim_customer ↔ fact_orders`) where both directions showed "Upstream".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close remaining Copilot review gaps

Three findings from PR #27999's third review pass, all about failure signals being silently dropped between layers:
- `RdfIndexApp.processTask` ignored relationship failures: only `result.failedCount() > 0` was treated as a failure, so partitions whose Fuseki relationship/lineage writes failed (incrementing `relationshipFailureCount` but not `failedCount`) never wrote `jobData.failure`. Switch to `result.hasAnyFailure()` and report the combined count.
- `checkAndUpdateJobCompletion` ignored partition `lastError`: a partition can finish COMPLETED with `lastError` set when a relationship bulk write was caught and recorded but didn't bump `failedRecords` or flip the partition to FAILED. The job would then go to COMPLETED even though there were real failures. Treat the presence of any `rdf_index_partition.lastError` as an error signal: promote to COMPLETED_WITH_ERRORS and aggregate sample errors into the job's errorMessage if it was blank.
- `forwardEquivalentPredicate` mapped to a non-existent `om:DOWNSTREAM` URI: OpenMetadata only stores lineage with `om:UPSTREAM` (forward) and `prov:wasDerivedFrom` (reverse PROV-O pair); there is no `om:DOWNSTREAM` predicate written anywhere; the downstream view is derived by reading the same UPSTREAM edge from the other side. Map both `prov:wasDerivedFrom` and `prov:wasInfluencedBy` to `om:UPSTREAM` (both are reverse-direction causation predicates: in `B wasDerivedFrom A` / `B wasInfluencedBy A` the source is A and effect is B, so the canonical forward predicate is the same).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix RDF tag mapper

* Fix all the comments

Cherry-picked from #27562 (without bin/ autogenerated noise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Align RdfPropertyMapper tests with refactor and isolate ontology export IT

RdfPropertyMapperTest still referenced the removed addVotes helper and expected addStructuredProperty to dispatch votes, both gone after votes was added to IGNORED_PROPERTIES. Update the assertions accordingly.

GlossaryOntologyExportIT timed out on the full suite because it flips a global RDF singleton in @BeforeAll and each test blocks a server thread on synchronous Fuseki writes. SAME_THREAD only serialized methods within the class; concurrent classes still raced for server threads. Adding @Isolated matches the pattern already used by RdfResourceIT for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): align addCertification typing + relationType after predicate flip

Two findings on PR #27999 from the post-cherry-pick review pass:
- `addCertification` mis-typed glossary-source certifications and skipped skos:Concept: it always emitted `om:Tag` regardless of source, even though `resolveTagResource` returns a glossaryTerm URI when the certification points at a glossary term.
It also didn't add `skos:Concept` (or the `createTypeResource("tag")` `skos:Concept` for classification tags), so SPARQL queries filtering certification targets by `a skos:Concept` missed them while `addTagLabel`-emitted tags were findable. Mirror `addTagLabel`: branch on source (`Glossary` vs `Classification`), emit the right primary type plus `skos:Concept` (glossary) or `om:Tag` (classification), and include `om:tagSource`.
- `relationType` left stale after predicate flip: when `parseEntityGraphEdgesFromResults` flipped subject/object for a reverse-direction predicate and rewrote `canonicalPredicate` to `om:UPSTREAM`, it kept the original `relationType` derived from the reverse predicate. So `prov:wasInfluencedBy` produced an EdgeInfo with `relationType=downstream` + `predicate=om:UPSTREAM`: internally inconsistent, and the mismatched `edgeKey` prevented dedup against an existing UPSTREAM edge with the same endpoints. Re-derive `relationType` from the canonical predicate after the flip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings + add parser-helper unit tests

Two outstanding Copilot findings on PR #27999 plus targeted unit coverage for the helpers that drive lineage canonicalization.

Findings:
- `colLineageUri` collision risk (RdfRepository): the deterministic key replaced non-alphanumerics in `toColumn` with `_`, so distinct column names (e.g. `a-b` vs `a_b`) collapsed onto the same URI, which would lose / overwrite column-lineage resources during reindex. Append the loop index as a tiebreaker so distinct columns keep distinct URIs.
- `createTypeResource` missing dprod prefix (RdfPropertyMapper): the `getNamespace` switch didn't recognize `dprod`, so `RdfUtils.getRdfType("dataProduct")` (returns `dprod:DataProduct`) produced an invalid `dprod:DataProduct` URI on the wire. Added the `DPROD_NS = https://ekgf.github.io/dprod/` constant and a `dprod` case in the switch.
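
The `colLineageUri` collision and its index tiebreaker can be reproduced in a few lines. The sketch below is illustrative only: the class name, the `/column/` path segment, and the `urn:details` base are hypothetical, and the real key layout in `RdfRepository` may differ. It shows why sanitizing alone collides and how appending the loop index separates the keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the collision and the tiebreaker fix.
final class ColumnLineageKeys {
  private ColumnLineageKeys() {}

  /** Replaces every non-alphanumeric character with '_', as the old key did. */
  static String sanitize(String column) {
    return column.replaceAll("[^A-Za-z0-9]", "_");
  }

  /** Builds one key per target column, appending the loop index as a tiebreaker. */
  static List<String> keysFor(String detailsUri, List<String> toColumns) {
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < toColumns.size(); i++) {
      keys.add(detailsUri + "/column/" + sanitize(toColumns.get(i)) + "_" + i);
    }
    return keys;
  }
}
```

Here `sanitize("a-b")` and `sanitize("a_b")` both give `a_b`, but `keysFor("urn:details", List.of("a-b", "a_b"))` produces two different keys. One consequence worth keeping in mind is that an index-based tiebreaker stays deterministic across reindex runs only if the column order is stable.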

Coverage:
- New `RdfParserHelpersTest` exercises the canonicalization helpers via reflection: `isReverseDirectionPredicate` (recognizes PROV-O causation predicates, ignores forward predicates), `forwardEquivalentPredicate` (both `wasDerivedFrom` and `wasInfluencedBy` collapse to `om:UPSTREAM` so dedup works), `relativeRelationLabel` (focal-relative Upstream/Downstream flipping with all the boundary cases: non-focal edges, non-lineage relations, null focal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): merge array contexts before per-field resolution

The third (low-confidence "suppressed") finding on review 4256830399 turned out to be a real duplication: when a field is mapped in one context map of an array context but absent from another, the previous processArrayContext ran processContextMappings once per map. The pass where the field IS mapped emits the proper `om:hasOwner <ref>` triples (plus `prov:wasAttributedTo`); the pass where the field is absent falls through to processUnmappedField and emits an additional `om:owners <ref>` triple. Net: two predicates for the same logical relationship. Verified on the live Fuseki: 113 `om:hasOwner` triples vs 112 `om:owners` triples, one set per pass.

Fix: flatten all context maps in the array into a single merged map once, then iterate entity fields exactly once against that combined view (later contexts win on key conflicts, matching JSON-LD context merge semantics). Each field is resolved against the union of mappings, so the unmapped fallback only fires for fields truly absent from every context.

Net effect: `prov:wasAttributedTo` count is unchanged, `om:hasOwner` is unchanged, and the redundant `om:owners` triples disappear.
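
The flatten-then-resolve approach described above can be sketched as follows. This is a simplified illustration, not the mapper's actual code: contexts are modeled as plain `Map<String, String>` from field name to predicate, and `ContextMerge`, `mergeContexts`, and `predicateFor` are hypothetical names. The `om:<fieldName>` fallback mirrors the unmapped-path behavior described earlier in this commit message.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge all context maps once, then resolve each field once.
final class ContextMerge {
  private ContextMerge() {}

  /** Flattens the array of contexts; later contexts win on key conflicts. */
  static Map<String, String> mergeContexts(List<Map<String, String>> contexts) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> context : contexts) {
      merged.putAll(context); // putAll overwrites earlier entries with the same key
    }
    return merged;
  }

  /** Resolves a field against the merged view; falls back only if no context maps it. */
  static String predicateFor(String field, Map<String, String> merged) {
    String mapped = merged.get(field);
    return mapped != null ? mapped : "om:" + field; // unmapped fallback
  }
}
```

Because `owners` is looked up once against the union, a context that maps it to `om:hasOwner` suppresses the `om:owners` fallback even when a sibling context in the array does not mention the field at all.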

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings on coordinator finalization race

Two findings from PR #27999 review 4259628860:
- `checkAndUpdateJobCompletion` early-returned before lastError check could promote: `refreshAggregatedJob` already marks the job COMPLETED when partitions all finish without `failedRecords`/`failedPartitions`, so `checkAndUpdateJobCompletion`'s subsequent `if (job.isTerminal())` short-circuit silently dropped the lastError signal. Move the partition-lastError check INTO `refreshAggregatedJob` so both code paths produce consistent terminal status; a partition that finished COMPLETED but carries a non-null lastError now correctly promotes the job to COMPLETED_WITH_ERRORS regardless of which finalizer wins the race.
- `completePartition` / `failPartition` overwrote CANCELLED state: the unconditional partition row update lost a concurrent Stop's CANCELLED status if a worker finished its batch after the Stop request landed but before noticing it. Add a status-guarded `updateIfProcessing` DAO method (UPDATE ... WHERE id = :id AND status = 'PROCESSING') and have both completion paths use it; if 0 rows update, log and skip the side effects (no server-stat increment, no refreshAggregatedJob call) so the authoritative CANCELLED status stays. Mirrors the pattern SearchIndex's coordinator uses for the same race.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>

2026-05-11 06:14:50 -07:00
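
The status-guarded update from the last finding in the commit message is a compare-and-set on the partition row. The sketch below is an in-memory analogy rather than the DAO itself: it swaps the SQL `UPDATE ... WHERE id = :id AND status = 'PROCESSING'` for `ConcurrentMap.replace(key, expected, updated)`, and `PartitionStatusStore` and its methods other than `updateIfProcessing` are hypothetical names. The return value plays the role of the affected-row count.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the partition table; the real code issues a guarded SQL UPDATE.
final class PartitionStatusStore {
  enum Status { PENDING, PROCESSING, COMPLETED, FAILED, CANCELLED }

  private final ConcurrentMap<String, Status> statusById = new ConcurrentHashMap<>();

  void put(String partitionId, Status status) {
    statusById.put(partitionId, status);
  }

  Status get(String partitionId) {
    return statusById.get(partitionId);
  }

  /** Updates only if the partition is still PROCESSING; returns 1 if updated, else 0. */
  int updateIfProcessing(String partitionId, Status newStatus) {
    return statusById.replace(partitionId, Status.PROCESSING, newStatus) ? 1 : 0;
  }
}
```

A worker that finishes after a Stop has already flipped the partition to CANCELLED gets `0` back and can skip its side effects, so the CANCELLED status is not overwritten.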
src	fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999)	2026-05-11 06:14:50 -07:00
.swp	Fixes #9259 Change Tags APIs to conform with rest of the APIs (#9260)	2022-12-26 12:32:17 -08:00
LEARNING_RESOURCES_DESIGN.md	Learning Resources (#25005)	2026-01-25 07:20:14 -08:00
lombok.config	Issue-19251: Upgrade dropwizard to 4.x and Jetty to 11.x (#19252)	2025-05-27 20:31:59 +05:30
pom.xml	fix(security): pin libthrift, provided jsonschema2pojo, bump azure-kv/sjm/reactor-netty, exclude netty-epoll (#28010)	2026-05-11 14:08:26 +05:30