OpenMetadata/bootstrap/sql/migrations/native/2.0.1/postgres/postDataMigrationSQLScript.sql


Task redesign (#25894) * Task Redesign: Add Task entity & tests * Task Redesign: Add Task entity & tests * Task Redesign: Add Permissions checks for Task APIs * Task UI changed to the new APIs * Migrate UI and APIs to new tasks system inlcuding suggestions * Add Suggestions integration * Activity Feed Refactor * ActivityFeed -> ActivityStream publisher * Activity Feed redesign * Activity Feed redesign, adding tests * Incident Manager update * Migrate Incidents to new tasks * Migrate Incidents to new tasks * Update generated TypeScript types * Update generated TypeScript types * feat(tasks): add domain-aware task cutover and workflow v2 migration * test(tasks): cover domain filters and task feed visibility flows * Address comments * Fix workflow tests to use new Task entity API and fix UserApprovalTaskV2 candidate transformation Migrated 9 WorkflowDefinitionResourceIT tests from legacy Feed/Thread API to the new Task entity API (UserApprovalTaskV2 creates Task entities, not Thread entities). Fixed a bug in UserApprovalTaskV2 where candidates were passed as raw EntityReferences instead of being transformed into users/teams FQN arrays for SetApprovalAssigneesImpl. 
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix tests * refactor: stabilize task entity workflows * refactor: finish task entity cutover and activity migration * refactor: migrate legacy thread feed during cutover * refactor: split legacy thread rename and archive migrations * Merge main; fix tests * Update generated TypeScript types * feat: advance task redesign through phase 2 * Merge main; fix tests * Update generated TypeScript types * Fix failing tests * Update generated TypeScript types * fininsh phase 6 of the design, configurable task forms * Update generated TypeScript types * Update generated TypeScript types * Fix linting * Address gitar comments * Address gitar comments * Fix build * Address giar comments * fix build * Add task custom forms * Fix tests * Address tests * Apply UI lint autofixes * Fix tess * Fix linter * Fix task patching * Fix tests * Fix playwright tests * fix java checkstyle * Add python sdk support for tasks, annoucements * Fix playwright tests * Fix playwright tests * Fix playwright tests * Fix python tests * Fix python tests * Fix linting workflows * fix pycheck * fix pycheck * Fix tests * Fix build * Address deviations from main and fix tests * Fix integration tests * Fix integration tests * Fix integration tests * Update generated TypeScript types * Fix Playwright tests * Fix Playwright tests * feat(incident): wire incident manager to task-first architecture (#27369) * feat(incident): wire incident manager to task-first architecture Connect the incident manager to the task redesign so it works end-to-end: resolve data persistence, backward transitions, reopen from resolved, and incident discovery via TCRS. * Update generated TypeScript types * refactor: single-query incident task lookup with parameterized statuses Replace two sequential queries (Open, InProgress) in getOrCreateIncident with one findByAboutAndTypeAndStatuses query using @BindList for status IN (...). 
--------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * Fix Playwright tests * Update generated TypeScript types * Fix linter * Fix tests * Fix tests * Fix checkstyle * Fix tests * Fix checkstyle * Update FeedResourceIT.java * Update TableRepository.java * fix tests * Update ActivityFeedProvider.tsx * fix tests * fix tests * Address Task comments * Fix unit test * Fix the feed summary panel showing on landing page * Fix comment functionality * Fix pytests * Fix failing playwright tests * Fix test flakiness * Fix ui-checkstyle * Fix advanced search spec failure * Fix playwright tests Co-authored-by: Copilot <copilot@github.com> * Fix checkstyle * Fix the flaky tests Co-authored-by: Copilot <copilot@github.com> * fix checkstyle * Reduce the workflow polling * Update generated TypeScript types * skip failing tests Co-authored-by: Copilot <copilot@github.com> * Fix ui-checkstyle --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com> Co-authored-by: IceS2 <pablo.takara@getcollate.io> Co-authored-by: karanh37 <karanh37@gmail.com> Co-authored-by: Karan Hotchandani <33024356+karanh37@users.noreply.github.com> Co-authored-by: Copilot <copilot@github.com>
2026-04-23 13:52:30 +00:00
-- Post data migration script for Task workflow cutover - OpenMetadata 2.0.1
fix(rdf): converge Fuseki state on weekly rebuilds and isolate API latency (#28117) * fix(rdf): converge Fuseki state on weekly rebuilds and isolate API latency RdfIndexApp ran daily and never reconciled removed relationships, so triples grew unboundedly across runs. When Fuseki crash-looped on the resulting disk pressure, every entity-write hook blocked synchronously on the unreachable server (no HTTP connect timeout, 3-retry loop on ConnectException), saturating the bounded AsyncService pool and pushing login to ~45s. Storage-side fixes (stop growth): - Drop the extractRelationshipTriples "preserve forward" path in RdfRepository.createOrUpdate; the translator is the source of truth and the surrounding orchestration already rewrites the current relationship set. This also removes a wasted CONSTRUCT round-trip per entity write. - bulkStoreRelationships now does per-source-entity DELETE WHERE with a predicate-exclusion FILTER for lineage edges, so relationships that no longer exist actually leave the store. - Wire RdfRepository.clearAllGlossaryTermRelations() into RdfIndexApp's initializeJob (the method existed but had no callers). - Flip recreateIndex default to true and move the cron to Saturday midnight ("0 0 * * 6"). Add reloadOntologies() so CLEAR ALL doesn't leave the ontology graph empty before indexing starts. - Include a 2.0.1 post-data migration that updates existing installed_apps rows; the app loader is insert-only on upgrade. Connectivity / concurrency fixes (isolate API latency from Fuseki health): - Add 2s connectTimeout to every JenaFusekiStorage HttpClient and fast-fail on ConnectException / ClosedChannelException / HttpConnectTimeoutException instead of retrying. Introduce a 5-failure/30s circuit breaker. - Route all RdfUpdater mutators through AsyncService.execute with a bounded pendingWrites gate (cap 1000, drop-on-overflow with logged warning) so a dead Fuseki can no longer block request threads or starve the AsyncService pool. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): address PR review — preserve relationships, scope DELETEs, surface ontology failures PR #28117 review feedback. Addresses 13 findings across gitar-bot and Copilot: Storage correctness: - JenaFusekiStorage.storeEntity now keeps URI-valued triples (relationships) and only refreshes literal-valued triples. A metadata-only PATCH would otherwise wipe every inter-entity edge until the next weekly recreate-index, and async ordering between updateEntity and addRelationship could leave the graph missing edges (Copilot #1, #2). - RdfRepository.removeRelationship wraps the DELETE in the knowledge named graph and uses getRelationshipPredicate so the predicate URI matches what addRelationship actually wrote (e.g. UPSTREAM → prov:wasDerivedFrom). The previous bare DELETE in the default graph was a silent no-op (Copilot #3). - RdfBatchProcessor now calls a new RdfRepository.clearOutgoingEntityRelationships for every entity in the batch, not just those with current edges. An entity whose last outgoing relationship was removed in MySQL contributes zero RelationshipData entries, so bulkStoreRelationships' per-source DELETE never fired for it (Copilot #4). - bulkStoreRelationships no longer swallows non-connect DELETE errors — DELETE WHERE on a source with no edges is a no-op, so exceptions there are real failures (malformed SPARQL, auth, server errors) and should surface (Copilot #5). Visibility: - reloadOntologies() now checks areOntologiesLoaded() after load and throws if still empty. OntologyLoader.loadOntologies catches internally, so the old reloadOntologies always appeared to succeed (Copilot #6). - clearAllGlossaryTermRelations rethrows on failure instead of silently logging — the indexer's caller can now react to cleanup failures (Copilot #10). - clearAllGlossaryTermRelations pulls custom predicate URIs from GlossaryTermRelationSettings and includes them in the DELETE FILTER. 
The hardcoded list missed any custom predicates an admin configured (Copilot #7). Quality: - Set / LinkedHashSet imported instead of using java.util.* fully qualified in JenaFusekiStorage and RdfBatchProcessor (gitar-bot #2). - RdfIndexAppTest uses InOrder to assert clearAll → reloadOntologies ordering — a plain verify would have accepted a future change that reordered the calls (Copilot #9). - Documented the residual gap that HttpClient.connectTimeout only bounds TCP connect, not request bodies; circuit breaker + bounded pendingWrites contain the blast radius (Copilot #8). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(rdf): expect per-source clear on batches whose relationships are all filtered The two EventSubscription-skip tests used verifyNoInteractions on the RDF repository mock, which was valid before because filtered batches never touched RDF. The new per-source reconciliation clear in RdfBatchProcessor.processBatchRelationships now runs for every batch entity regardless of whether its relationships survive filtering — that's deliberate, since stale RDF state for those source entities still needs to be reconciled even when their current MySQL edges all point to excluded entity types. Switch the assertions to verify clearOutgoingEntityRelationships is the sole interaction (no bulkAdd, no addRelationship). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): address remaining PR review nits Three findings from the second gitar-bot review pass: - Replace the fully qualified `org.openmetadata.schema.configuration.GlossaryTermRelationSettings` / `SettingsType` / `SettingsCache` references in clearAllGlossaryTermRelations with imports, matching the project's existing convention. Other pre-existing FQN usages in the same file are left alone (not part of this PR's scope). - Make expandPredicateCurie throw IllegalArgumentException on null/empty input instead of silently defaulting to `om:relatedTo`. 
The current caller already null-guards so the path is unreachable today, but a future caller could otherwise silently miss-clean a misconfigured predicate. - Document why the lineage predicate URIs in the reconciliation DELETE filter (UPSTREAM / hasLineageDetails) are literal-hardcoded rather than baseUri-derived: they match what addLineageWithDetails actually writes (also hardcoded at RdfRepository.java:423,435). Switching the filter to be baseUri-derived would stop matching the stored lineage triples on non-default baseUri deployments and would incorrectly delete them. Comment added in both clearOutgoingEntityRelationships and bulkStoreRelationships so the next reader doesn't get nudged into "fixing" it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): surface cleanup failures, sync fallback predicates, time-bound reads Addresses the three unresolved Copilot findings from review 4295208187: - Drop the try/catch around clearAllGlossaryTermRelations in initializeJob. clearAllGlossaryTermRelations rethrows specifically so the indexer can fail loudly; wrapping it again let an unreconciled graph slip past as a "successful" run. The outer execute() handler will now mark the run FAILED. - Sync DEFAULT_GLOSSARY_TERM_RELATION_PREDICATES with what SettingsCache actually bootstraps (SettingsCache.java:355-486): adds skos:exactMatch (the real default for `synonym`), om:antonym, om:partOf, om:hasPart, rdfs:seeAlso. Keeps legacy om:* URIs from the stale getGlossaryTermRelationPredicateUri switch so a cleanup run still scrubs pre-SettingsCache data. - Apply READ_TIMEOUT_MS (10s) via QueryExecution.setTimeout on every read path (executeSparqlQuery for SELECT/CONSTRUCT/ASK/DESCRIBE, getEntity, getAllGraphs, getTripleCount, testConnection, the ontology presence check). A Fuseki that accepts the TCP connection but stalls mid-query no longer hangs reads indefinitely. 
UPDATE-side calls still rely on the connect timeout + circuit breaker + bounded pendingWrites since Jena's RDFConnection.update API doesn't expose a per-request timeout cleanly; comment near the constant notes the gap and a viable follow-up via UpdateExecHTTPBuilder.timeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): qualify EntityRelationship in test to fix compile RdfIndexAppTest references EntityRelationship.class in two verify() calls that I added in the previous commit, but the class was never imported into the test file. CI's openmetadata-service test compile fails with "cannot find symbol class EntityRelationship", which cascades into 11 dependent checks (build x2, openmetadata-service-unit-tests, three Java integration test workflows, two Python integration test shards that build OM as a setup step, Test Report aggregate, maven-sonarcloud-ci, and the unit-test status gate). Use the fully qualified org.openmetadata.schema.type.EntityRelationship to match how every other reference in this file already spells it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): drop QueryExecution.setTimeout — removed in Jena 5 used by IT classpath GlossaryOntologyExportIT was failing on RdfUpdater.initialize with NoSuchMethodError: 'void org.apache.jena.query.QueryExecution.setTimeout(long, java.util.concurrent.TimeUnit)'. openmetadata-service builds against Jena 4.10 (apache-jena-libs), but openmetadata-integration-tests directly pulls in jena-core/jena-arq 5.0.0, and Jena 5 removed the setTimeout overloads from the QueryExecution interface. Compile passes, integration test JVM links the 5.x class and bombs at the first read path (loadOntology's ASK check). Strip the nine setTimeout calls and the READ_TIMEOUT_MS constant. 
A clean read-side timeout that works on both Jena 4 and 5 needs to be plumbed via QueryExecutionHTTPBuilder.timeout / UpdateExecHTTPBuilder.timeout instead of RDFConnection — bigger change than this PR should carry. The comment near CONNECT_TIMEOUT now records that history so the next reader knows why we don't simply re-add setTimeout. Protection against a stalled-but-accepting Fuseki still relies on the 5-failure circuit breaker + bounded pendingWrites gate in RdfUpdater. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): align ontology-loaded check, predicate URIs, and CURIE fallback Three real bugs flagged by Copilot's later review passes: - areOntologiesLoaded() looked for `"boolean" : true` (space before colon) but JenaFusekiStorage formats ASK results without that space, so the check never matched and reloadOntologies() always threw. recreateIndex=true (now the default) ran into this on the very first scheduled run. Normalise whitespace before checking. - bulkAddRelationships wrote `om:<lowercase relationshipType>` directly, while removeRelationship uses getRelationshipPredicate which maps a handful of types to prov:* (UPSTREAM → prov:wasDerivedFrom, USES → prov:used, etc.). Triples written by the indexer were therefore unreachable by the live remove hook. Pre-compute predicateUri via getRelationshipPredicate in bulkAddRelationships and pass it through a new field on RelationshipData so JenaFusekiStorage uses the same URI both paths agree on. The legacy RelationshipData(5-arg) ctor still works for callers that don't have a predicate handy; bulkStoreRelationships falls back to the old shape there. - expandPredicateCurie returned bare strings like `customRel` unchanged, but createPropertyFromUri's default branch writes `<baseUri>ontology/customRel`. Custom relation predicates expressed as local names would never match the cleanup FILTER. 
Mirror createPropertyFromUri: full URIs pass through, bare local names get the OM-ontology prefix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): schema default + migration force entities=[all] for safe full reindex - rdfIndexingAppConfig.json: flip recreateIndex.default from false to true so any UI form / config generation path that surfaces the schema default agrees with the install JSON files and the new full-rebuild semantics. - 2.0.1 migration (MySQL + Postgres): in addition to flipping recreateIndex=true and the weekly Saturday cron, also rewrite appConfiguration.entities to ["all"]. Pre-upgrade an operator could have narrowed RDF indexing to a subset of entity types; the new recreateIndex=true semantics issues CLEAR ALL before indexing, which would otherwise wipe triples for excluded entity types and leave the graph permanently missing them. Forcing entities back to ["all"] ensures the post-CLEAR-ALL run repopulates the graph fully. Operators can re-narrow after the migration if they need partial indexing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): scope storeEntity DELETE to translator-managed predicates Replace the literal-only FILTER(!isIRI(?o)) in JenaFusekiStorage.storeEntity with a predicate-scoped DELETE so translator-emitted URI triples (tags, glossary terms, owner, domain, tier, data products, structured sub-resources) are refreshed from the new model on every entity write, while hook-managed predicates (om:UPSTREAM, om:hasLineageDetails, om:owns / om:contains / ...) stay intact. Previously, with !isIRI(?o), every URI-valued triple survived storeEntity forever — when a tag was removed or an owner changed, the old URI coexisted with the new one because no hook ever cleans those up (tags live in tag_usage, not entity_relationship; owners' translator-side predicate om:hasOwner is not what the OWNS hook writes). 
The DELETE set is the union of: - RdfPropertyMapper.TRANSLATOR_MANAGED_DIRECT_PREDICATES, a static list of predicates that may shrink to empty between writes (so the current model walk wouldn't see them) — rdf:type, om:hasOwner, prov:wasAttributedTo, om:hasTag, om:hasGlossaryTerm, om:hasTier, om:belongsToDomain, om:hasDataProduct, dct:source, om:sourceUrl, plus the structured-resource attachment predicates (om:hasLifeCycle / hasCertification / hasExtension / hasCustomProperty). - the predicates the current model actually emits for the entity subject, covering JSON-LD context-driven predicates that aren't in the static list. Added two coverage tests on RdfPropertyMapperTest: the static set contains the documented core predicates, and never contains lineage-hook predicates (om:UPSTREAM, prov:wasDerivedFrom, om:hasLineageDetails) — that overlap would let storeEntity wipe lineage edges on every entity update. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): scope reconciliation DELETE to relationship-hook predicates only Both clearOutgoingEntityRelationships (in RdfRepository) and the per-source DELETE inside JenaFusekiStorage.bulkStoreRelationships used to clear ANY outgoing edge whose object was a baseUri/entity/ URI (with only the three lineage predicates excluded). That swept up translator-managed URI triples (om:hasTag, om:hasGlossaryTerm, om:hasOwner, om:belongsToDomain, …) which bulkAddRelationships does not re-emit, so reconciliation runs were permanently destroying tag/owner/domain links. Switch the filter to opt-in: only delete triples whose predicate is in RELATIONSHIP_HOOK_PREDICATES, derived from the Relationship enum via the existing getRelationshipPredicate mapping. The set excludes the lineage predicates by skipping the UPSTREAM enum value (managed by addLineageWithDetails). 
Translator-managed predicates aren't relationship types so they're naturally not in the set; the new RdfPredicatePartitionTest enforces the partition. Refactored getRelationshipPredicate into a static getRelationshipPredicateUri so it can be reused at class-init time to build the predicate set without an instance. Added a small buildPredicateInList helper exposed at package level for JenaFusekiStorage to reuse. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): scope bulk reconciliation to batch entities, not all relationship sources bulkStoreRelationships used to compute its per-source DELETE set from the relationships list, so any source URI mentioned by any row in the batch was reconciled. RdfBatchProcessor passes BOTH outgoing relationships (sources inside the batch) and incoming UPSTREAM lineage (sources outside the batch where this batch's entity is the target). The outside-batch sources had their OTHER outgoing edges wiped, even though the indexer never planned to re-index them. Add a 2-arg overload to RdfStorageInterface.bulkStoreRelationships that takes an explicit Set<String> sourcesToReconcile. The default 1-arg method keeps the legacy "derive from relationships" behavior for any plugin caller that hasn't migrated. RdfRepository.bulkAddRelationships gains a matching overload taking Set<EntitySourceRef>; RdfBatchProcessor passes its batchSources (the entities IT is actually indexing in this pass). JenaFusekiStorage.bulkStoreRelationships now iterates sourcesToReconcile for the per-source DELETE instead of computing distinctSources from relationships. The new buildEntityUri helper on the interface lets callers (or the default delegate) build consistent source URIs. QLeverStorage stubs the new overload (still UnsupportedOperationException). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): time-bound HTTP request bodies via CompletableFuture wrapper Wrap every blocking RDFConnection call in the hot read/write paths (storeEntity DELETE+LOAD, storeRelationship, bulkStoreRelationships, getEntity, deleteEntity, executeSparqlQuery, executeSparqlUpdate) with a CompletableFuture-based 10s request timeout. When Fuseki accepts the TCP connection and then stalls on the response, the caller thread now frees after 10s instead of waiting until the OS gives up on the socket (~60s). We chose CompletableFuture over Jena's QueryExecution.setTimeout because that overload was removed in Jena 5 (broke integration tests already once in this PR), and over Jena's QueryExecutionHTTPBuilder / UpdateExecHTTPBuilder because their API surface differs between Jena 4 and Jena 5 and our two classpaths use different versions. The CompletableFuture wrapper is Jena- API-agnostic. On timeout the underlying HTTP request still leaks its (virtual) thread until OS-level TCP give-up; that's bounded by the existing circuit breaker (after 5 timeouts the breaker opens for 30s, short-circuiting subsequent traffic). Lower-traffic paths (loadTurtleFile, clearGraph, getAllGraphs, getTripleCount, loadOntology, testConnection) keep using the direct connection.update / connection.query / connection.load calls — they're protected by the circuit breaker and the connect timeout, and adding wrappers there is churn without proportional benefit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(rdf): document RdfUpdater async-ordering trade-off in submitAsync Add a comment block in RdfUpdater.submitAsync explaining why we accept the loss of per-entity ordering when submitting through AsyncService: - EntityUpdater diff-applies changes per request, so add-then-remove of the same edge within one API call nets to no-op (no hooks fire). 
- Cross-request races reconcile at the next weekly recreate-index, which rebuilds from MySQL. - The alternative (per-entity striped lock) costs memory and adds latency for the no-contention common case. - Pointers for the future maintainer if an observed-in-production race emerges: gate via ConcurrentHashMap<UUID, Semaphore>. No behavior change. The two open Copilot threads on this trade-off (M6CQYup, M6CYbM2) stay open so a future PR can pick them up if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): atomic clear+insert, broader fallback predicate set, close temp models Three follow-up findings from the latest Copilot pass: - Atomicity (3249716506): clearOutgoingEntityRelationships + bulkAddRelationships ran as two separate SPARQL updates. If bulkAddRelationships failed after the clear succeeded, the batch entities had their relationships wiped without the replacement edges in place — they stayed gone until the next weekly recreate-index. Combine the per-source DELETE and the INSERT DATA into a single SPARQL update inside JenaFusekiStorage.bulkStoreRelationships and drop the now-redundant separate clear call from RdfBatchProcessor. Either the whole reconciliation commits or none of it does. Also let bulkStoreRelationships handle the zero-edge case (relationships empty, sourcesToReconcile non-empty) so RdfBatchProcessor doesn't need a separate clear for entities whose relationships were all removed in MySQL. - Fallback predicate set (3249716532): when SettingsCache returns null, getGlossaryTermRelationPredicate falls back to literal `https://open-metadata.org/ontology/<relationType>` — so `broader` / `narrower` / `exactMatch` get written as om:broader/om:narrower/om:exactMatch, not skos:* equivalents. Without those URIs in DEFAULT_GLOSSARY_TERM_RELATION_ PREDICATES, a cleanup run during a transient settings-cache outage would miss them. 
Added the three om:* fallback variants alongside the existing skos:*/rdfs:* bootstrap defaults. - Temp Model leaks (3249319886): bulkAddRelationships and removeRelationship each create an ephemeral Jena Model just to mint property URIs. Wrapped both in try/finally close() so the in-memory graphs are released right after use. Jena 4's Model has a close() method but doesn't implement java.lang.AutoCloseable so try-with-resources isn't possible there. Copilot's "still deleting only non-IRI" finding (3249716480) is a stale- snapshot false positive — JenaFusekiStorage.storeEntity has used predicate- scoped DELETE via TRANSLATOR_MANAGED_DIRECT_PREDICATES since 22d5825c. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): make buildPredicateInList public so JenaFusekiStorage can use it JenaFusekiStorage (org.openmetadata.service.rdf.storage) lives in a different package than RdfRepository (org.openmetadata.service.rdf), so the package-private buildPredicateInList helper introduced in 857c09 couldn't be called from JenaFusekiStorage.bulkStoreRelationships — CI was failing with: [ERROR] JenaFusekiStorage.java:[606,51] buildPredicateInList(Set<String>) is not public in RdfRepository; cannot be accessed from outside package Promote it to public alongside RELATIONSHIP_HOOK_PREDICATES (which is the only data this helper renders) so the cross-package call resolves. Local javac across the touched RDF files now reports zero new errors; the only remaining build failures are the pre-existing es.co.elastic.clients shading issues unrelated to this PR. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): normalise sourcesToReconcile to empty-set to prevent NPE in iteration bulkStoreRelationships' early-return guard accepts sourcesToReconcile == null as a valid input, but the subsequent per-source DELETE loop iterates sourcesToReconcile directly — so a caller passing null with a non-empty relationships list would skip the guard and crash at the for-loop. Today no caller hits this path (RdfRepository.bulkAddRelationships always passes non-null, and the 1-arg default interface method derives a set), but the null-check in the guard explicitly encodes null as supported, so the contract should match the iteration. Normalise once after the guard: Set<String> effectiveSources = sourcesToReconcile != null ? sourcesToReconcile : Set.of(); and use effectiveSources for both the loop and the success-log size. Local filtered compile passes cleanly (zero NEW errors from RDF files; remaining errors are the pre-existing es.co.elastic.clients shading mess). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(rdf): update RdfIndexAppTest verifications for the new bulkAddRelationships 2-arg signature Three test failures after the Fix-I / atomic-clear-insert changes: - testProcessBatchRelationshipsStoresResults verified `bulkAddRelationships(captor.capture())` (1-arg) but RdfBatchProcessor now calls the 2-arg `bulkAddRelationships(relationships, batchSources)` — Mockito surfaced this as "different arguments" because the actual call had a Set<EntitySourceRef> tail. Updated the verify to `bulkAddRelationships(captor.capture(), anySet())`. - The two event-subscription skip tests previously verified `clearOutgoingEntityRelationships(anySet())` as the only interaction; that method is no longer called from RdfBatchProcessor (the clear was folded into bulkAddRelationships' atomic SPARQL transaction for safety). 
Replace with `verify(mockRdfRepository).bulkAddRelationships(eq(List.of()), anySet())` — bulkAdd is still invoked with an empty list to drive the per-source reconciliation for the batch entity, even when the only fetched relationship pointed at an excluded entity type. Filtered local compile + test-compile passes cleanly (no NEW errors from RDF files; only pre-existing es.co.elastic.clients shading errors remain). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(rdf): four follow-up findings from Copilot review 4299978111 - collectTranslatorPredicates over-broad (3249798300): RdfRepository.addRelationship passes storeEntity a model loaded from Fuseki PLUS the new relationship, so the dynamic walk was pulling hook-managed predicates (om:owns, etc.) into the DELETE scope. With async writes, two concurrent additions for the same source could each read the old model and each storeEntity wipe the other's relationship. Exclude RELATIONSHIP_HOOK_PREDICATES from the walk result (and defensively from the static-set union too). - ForkJoinPool.commonPool starvation (3249798327): runWithTimeout used CompletableFuture.supplyAsync's default executor, so a Fuseki that stalls would leak workers on the JVM-wide commonPool and starve unrelated CompletableFuture / parallel-stream work. Introduce a dedicated virtual-thread executor (Thread.ofVirtual().name("rdf-storage-timeout-", 0)) and route all timeout wrappers through it — virtual threads are cheap to leak and the circuit breaker bounds the pile-up. - Shrink-to-empty for literal predicates (3249798383): the predicate-scoped DELETE no longer caught stale literals when a literal-valued field (description / displayName / …) was cleared and the new model simply omitted the triple. Chain a "DELETE … FILTER(!isIRI(?o))" pass with the URI-scoped pass so hook-managed URI triples stay intact while stale literals get swept on every write. 
- UI schema default (3249798439): the UI form schema at utils/ApplicationSchemas/RdfIndexApp.json still declared recreateIndex.default = false. Flipped to true to match the backend openmetadata-spec schema and the install JSON files. (The sibling jsons/applicationSchemas/ is gitignored generated output, no source change needed there.) Local verification before push: spotless:apply, filtered compile + test-compile (zero new errors), and `mvn test -Dtest='RdfIndexAppTest,RdfPropertyMapperTest, RdfPredicatePartitionTest,RdfStorageIdempotencyTest'` — 64 tests, 0 failures. The "buildPredicateInList package-private" finding from the same review (3249798351) is already addressed in 03c5d4f6dc and surfaces here only because Copilot reviewed an earlier commit. The "lineage incremental cleanup" finding (3249798415) is a known architectural trade-off: addLineageWithDetails handles current lineage rows but removed edges have no row to trigger a per-edge delete, and adding UPSTREAM/wasDerivedFrom to RELATIONSHIP_HOOK_PREDICATES would conflict with the inline addLineageWithDetails call that runs BEFORE bulkAddRelationships in RdfBatchProcessor. The weekly recreateIndex=true run (the new default) wipes and rebuilds from MySQL, which reconciles stale lineage; left this thread open as a documented gap rather than reordering processBatchRelationships in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 00:36:06 +00:00
-- RdfIndexApp: switch to a weekly Saturday cron and a full rebuild on every run.
-- The previous defaults (daily, incremental) produced unbounded triple growth
-- because relationship-removal paths were not fully reconciled. With a per-run
-- CLEAR ALL the dataset always converges to the state of the relational
-- database; the weekly cadence keeps per-run cost from saturating Fuseki.
--
-- Also rewrite `entities` to `["all"]`. Before the upgrade, an operator could
-- have narrowed RDF indexing to a subset of entity types; with
-- recreateIndex=true each run issues a CLEAR ALL before indexing, which would
-- otherwise wipe triples for entity types that are still in the database but
-- missing from the subset list. Forcing the list back to `["all"]` ensures the
-- post-CLEAR-ALL run repopulates the graph fully; operators can re-narrow
-- after the migration if they need partial indexing.
UPDATE installed_apps
SET json = jsonb_set(
    jsonb_set(
        jsonb_set(json::jsonb, '{appConfiguration,recreateIndex}', 'true'),
        '{appSchedule,cronExpression}',
        '"0 0 * * 6"'
    ),
    '{appConfiguration,entities}',
    '["all"]'::jsonb
)
WHERE name = 'RdfIndexApp';
UPDATE apps_marketplace
SET json = jsonb_set(json::jsonb, '{appConfiguration,recreateIndex}', 'true')
WHERE name = 'RdfIndexApp';
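
-- Illustrative sketch only, not part of the migration: the same nested
-- jsonb_set chain applied to a hand-written sample document, to show which
-- three keys the installed_apps update rewrites. The sample JSON below is an
-- assumption made for illustration; real installed_apps rows carry many more
-- fields. It reads no tables, so it can be run on its own (for example in
-- psql) to inspect the result.
SELECT jsonb_set(
    jsonb_set(
        -- 1. force a full rebuild on every run
        jsonb_set(doc, '{appConfiguration,recreateIndex}', 'true'),
        -- 2. move the schedule to Saturday midnight
        '{appSchedule,cronExpression}',
        '"0 0 * * 6"'
    ),
    -- 3. widen a narrowed entity list back to all entity types
    '{appConfiguration,entities}',
    '["all"]'::jsonb
) AS migrated
FROM (
    SELECT '{"appConfiguration": {"recreateIndex": false, "entities": ["table"]},
             "appSchedule": {"cronExpression": "0 0 * * *"}}'::jsonb AS doc
) AS sample;
-- Note: jsonb_set only creates the final key of a path. If a parent object
-- such as appConfiguration were absent, the document would come back unchanged.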