Elgato_dark/OpenMetadata: OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column-level lineage, and seamless team collaboration.

mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

Sriharsha Chintalapani d3bbbefe37 fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999)	2026-05-11 06:14:50 -07:00

* fix(rdf): dedupe lineage edges and broaden PROV-O coverage

The RDF Knowledge Graph endpoint was emitting two edges per lineage relationship — once as `om:UPSTREAM` (forward) and once as `prov:wasDerivedFrom` (reverse) — because the parser preserved each predicate's native subject/object orientation instead of canonicalizing both into a single `(upstream, downstream)` edge.

Also extend PROV-O coverage so external SPARQL clients can use the W3C Provenance vocabulary directly:
- `prov:Entity` / `prov:Activity` / `prov:Agent` class typing on datasets / pipelines / users
- `prov:wasAttributedTo` mirror of `om:owners`
- `prov:generated` (inverse of existing `wasGeneratedBy`) and `prov:used` on lineageDetails so the Entity → Activity → Entity chain is complete
- `prov:hadPlan` + `prov:Plan` for SQL transformation recipes
- `prov:startedAtTime` / `prov:endedAtTime` on Activity instances
- `prov:wasAssociatedWith` Activity → Agent linking
- `prov:invalidatedAtTime` on soft-deleted entities

Other RDF cleanups in the same area:
- LineageDetails URIs are now deterministic (driven by from/to ids instead of a timestamp), so re-indexing collapses duplicate Activity resources via the existing DELETE+INSERT idempotency
- Skip emitting the redundant `om:owners` JSON-string literal — the mapped path already produces clean `om:hasOwner <agent>` triples
- Skip empty `[]` array literals in the unmapped path
- Propagate failures from `RdfRepository.{addRelationship, addLineageWithDetails, bulkAddRelationships, bulkAddGlossaryTermRelations}` instead of silently swallowing them, so downstream callers can surface the failure

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): surface Fuseki failures in app run record

Per-entity and per-batch failures from the RDF index app used to be logged via SLF4J only — they never made it into the AppRunRecord, so the UI/run history showed "completed" even when every entity had silently failed to write to Fuseki.

- `RdfBatchProcessor.processEntities` now captures the last error per entity, returns it in `BatchProcessingResult.lastError`, and accumulates relationship-processing failures into the same result.
- Relationship and lineage processing methods (`processBatchRelationships`, `processLineageRelationship`, `processGlossaryTermRelations`) return structured results with failure counts and last-error messages instead of `void`, so failures are visible to the partition worker.
- `RdfIndexApp` records the failure on `jobData` for both the distributed and non-distributed code paths, so users see a real error message in the run history (e.g. "Failed to write entity X to Fuseki: ConnectException").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* perf(rdf-index-app): port distributed-mode improvements from SearchIndex

The RDF distributed-indexing fork was lagging behind several SearchIndex improvements that addressed concrete reliability and throughput issues. Port them across:

Core perf / reliability
- Precomputed partition start cursors: coordinator walks each entity once via keyset pagination at job init and caches the boundary cursor per (jobId, entityType, rangeStart). Workers consult the cache before falling back to the OFFSET-based path. Eliminates the previous O(N²) per-partition cursor lookup.
- `cancelInFlightPartitions` + `requestStop` + `checkAndUpdateJobCompletion` on the coordinator. Stop now cancels both PENDING and PROCESSING partitions in a single SQL update and immediately drives the job status from STOPPING → STOPPED, so the UI status no longer hangs while workers drain.
- Selective field hydration: `RdfPartitionWorker.readEntitiesKeyset` uses `ReindexingUtil.getSearchIndexFields(entityType)` instead of `List.of("")`, avoiding expensive fetchers (e.g. fetchAndSetOwns) per batch.
- Partition heartbeat thread: virtual thread refreshes `lastUpdateAt` every 30s for partitions actively being processed by this server, so the stale reclaimer no longer interrupts active work.
- `MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5` backpressure: claim path rejects when the server already holds 5 PROCESSING partitions, giving fair distribution across pods. Verified the existing claim DAO uses `FOR UPDATE SKIP LOCKED` for both MySQL and Postgres.
- Gate WebSocket stat broadcasts during the STOPPING phase so the Quartz-scheduler-driven STOPPED status push isn't overwritten.

Multi-server scaffolding (single-pod is unaffected)
- `RdfPollingJobNotifier`: DB-polling discovery for other server pods to find an in-flight RDF reindex they can join.
- `RdfEntityCompletionTracker`: per-entity-type partition tracking with callback firing once all partitions for an entity complete, foundation for early per-entity index promotion.

Tests: precomputed-cursor cache lookup, in-flight backpressure, cancelInFlight delegation, completion tracker callback semantics, notifier start/stop.

DAO additions on `rdf_index_partition`:
- `cancelInFlightPartitions(jobId, now)` — covers both PENDING and PROCESSING in one statement
- `countInFlightPartitionsForServer(jobId, serverId)` — backpressure
- `countPartitionsByStatus(jobId, status)` — used by completion check

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

fix(ui-apps): hide misleading data on synthetic 'CurrentConfig' row

When an app has no run history, AppRunsHistory fabricated a synthetic placeholder row that looked like a real run — `runType: "CurrentConfig"`, a fake `Run At` timestamp pulled from `appData.updatedAt`, an ever-growing `Duration` (`now − updatedAt`), and an active `Stop` button that targeted nothing.

Render `--` for `Run At`, `Run Type`, and `Duration` on synthetic rows, and hide the `Stop` button so users no longer see "Run now → 19-minute Running with Stop button" when the actual job never registered.

Real app runs are unaffected — they still display `runType` from the backend (OnDemandJob, Hourly, Daily, Custom, etc.).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address PR review findings

Four issues raised in PR #27999 review:

- Cursor format consistency in walkAndRecord (bug): The defensive branch produced cursors via a custom `{name, id}` map while the regular path used `repo.getCursorValue()`. For entities with quoted names these encodings diverge — a quoted-name entity could land in the cache with a cursor incompatible with what the worker fetches via keyset pagination. Track the last seen entity reference and run it through `repo.getCursorValue()` in both paths. `encodeBoundaryCursor` is removed.
- Adaptive scheduling in RdfPollingJobNotifier (perf): The previous implementation woke the scheduler thread every 1s and short-circuited inside the poll method when idle. Reschedule the task at the appropriate interval (1s active / 30s idle) when `setParticipating` flips, so the thread genuinely sleeps when idle.
- Cursor cache cleanup on startup recovery (edge case): `partitionStartCursors` was only evicted by `refreshAggregatedJob` / `checkAndUpdateJobCompletion`. If a coordinator crashed mid-job and never reached either, the cache entry leaked until process restart. Add `evictStaleCursorCacheEntries()` invoked by `performStartupRecovery` that drops entries for jobs that no longer exist in the DB or are already terminal.
- Consolidate describeError helpers (quality): `describeError`, `describeBulkError`, and `describeLineageError` in `RdfBatchProcessor` all walked the cause chain and formatted a prefixed message with the same logic. Reduced to a single `describeError(prefix, error)` plus a thin `describeEntityError` adapter for the per-entity call site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): avoid double workerExecutor.shutdownNow() in stop()

stop() called workerExecutor.shutdownNow() inline AND through cleanupLocalExecution -> shutdownWorkerExecutor, which broke the DistributedRdfIndexExecutorTest.stopAndCoordinatorCleanupOnlyTearDownLocalExecutionOnce verify(workerExecutor, times(1)).shutdownNow() expectation. Drop the inline call — cleanupLocalExecution is the single owner of the shutdown path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: drop redundant DB matrix from openmetadata-service unit tests

The {mysql, postgresql} strategy matrix on openmetadata-service unit tests doubled CI cost without adding signal: both jobs ran the same surefire suite.

The `-Pmysql` / `-Ppostgresql` profiles are defined only in `openmetadata-sdk/pom.xml` (lines 190-206), set a single `test.database` property, and that property is consumed exclusively by the failsafe plugin (integration tests `IT.java` / `IntegrationTest.java`), which only runs under `-Pintegration-tests` — not enabled here. `openmetadata-service` itself has zero tests that read `test.database` or use `MySQLContainer`/`PostgreSQLContainer` (verified by grep). The only testcontainer-based DB code in the repo lives in `openmetadata-integration-tests`, a different module that this workflow doesn't build.

Run the unit suite once. The `openmetadata-service-unit-tests-status` required-check aggregator is unaffected (it depends on the renamed job which still has the same name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address Copilot PR review findings

Six correctness issues raised on PR #27999:

- Lineage-details DELETE was too broad (RdfRepository): the cleanup step deleted all `<fromUri> om:hasLineageDetails ?d` triples, so reindexing one (fromId, toId) edge wiped lineage-details links for every other downstream of the same source entity. Pin the delete to the specific `<fromUri> om:hasLineageDetails <detailsUri>` triple. Same with prov:generated cleanup — anchor it to the specific detailsUri instead of any details resource.
- Predicate not flipped during canonicalization (RdfRepository): `parseEntityGraphEdgesFromResults` swapped subject/object for reverse-direction predicates (`prov:wasDerivedFrom`, `prov:wasInfluencedBy`) but kept the original predicate URI on the resulting EdgeInfo. Exported graphs could carry semantically invalid triples like `<upstream> prov:wasDerivedFrom <downstream>`. Add `forwardEquivalentPredicate` to substitute the OM-native forward predicate when the direction flips.
- `dct:modified` was an invalid xsd:dateTime (RdfPropertyMapper): `entity.getUpdatedAt().toString()` returns the epoch-millis Long as a string, but the literal was tagged `xsd:dateTime`. Convert via `Instant.ofEpochMilli(...).toString()` so the lexical form matches the type — same fix already in place for prov:invalidatedAtTime.
- Unmapped EntityReference arrays were dropped entirely (RdfPropertyMapper): the previous fix to skip noisy JSON-string literals also dropped fields like `domains`, `reviewers`, `voters` for entity contexts that don't have a JSON-LD mapping for them — the unmapped path was the only path emitting them, so nothing landed in RDF. Expand each array element through `addEntityReference` so the data still produces proper `om:<fieldName> <ref>` triples; mapped-path duplicates are collapsed by Jena's Model dedupe.
- Partition failure detection missed reader errors (DistributedRdfIndexExecutor): the EntityCompletionTracker was fed `result.errorMessage() != null`, but `RdfPartitionWorker` can increment `failedCount` from `readerErrors` without ever setting `lastError`. Use `result.failedCount() > 0` so partitions whose failures came from `ResultList.getErrors()` are also marked as failed when promoting an entity.
- `COMPLETED_WITH_ERRORS` was hidden when failedRecords == 0 (RdfIndexApp): the coordinator marks a job COMPLETED_WITH_ERRORS whenever any partition is FAILED or CANCELLED, including for user-initiated stops where no record-level failures accrued. The monitor's `completedWithErrors` gate required `failedRecords > 0`, so those terminal states never hit `jobData.setFailure(...)` and the run record showed success. Drop the failedRecords precondition and tailor the fallback message based on whether there are record-level failures or partition-level only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): separate relationship failures + type lineage as prov:Activity

Two more PR review findings on #27999:

- Relationship failures inflated failedRecords stat: `processEntities` was folding relationship/lineage edge failures into `failedCount`, which becomes `failedRecords` in the index stats. Records there mean entities, computed from entity counts in `totalRecords`. Counting per-edge relationship failures could push `failedRecords` above `processedRecords`/`totalRecords` and produce nonsensical per-entity stats. Track them separately: add `relationshipFailureCount` to `BatchProcessingResult` and `PartitionResult`. `failedCount` now stays entity-level. The completion tracker is fed the broader `result.hasAnyFailure()` so partitions where relationship triples failed don't get prematurely promoted as success even though their entity writes succeeded.
- `detailsResource` wasn't typed as prov:Activity: the resource carries Activity-shaped predicates (prov:startedAtTime, prov:endedAtTime, prov:used, prov:hadPlan, prov:wasGeneratedBy, prov:wasAssociatedWith) but only the OM-specific `om:LineageDetails` rdf:type. Add an explicit `rdf:type prov:Activity` so PROV-O reasoners and federated SPARQL clients recognize it as an Activity without having to learn the OM type.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): label lineage edges relative to focal node

The Knowledge Graph view was labeling every edge with relation type "upstream" as "Upstream" regardless of direction relative to the focal node. For a focal node F, the raw stored relation `(F, X, upstream)` means "F is upstream of X" — i.e. X is downstream of F. The previous output labeled both `F → X` and `X → F` edges as "Upstream", which made bidirectional lineage look like a duplicated relation.

Re-orient the label in `convertEdgesToGraphData` based on whether the focal is the edge's source or target:
- focal → X → "Downstream"
- X → focal → "Upstream"
- non-focal-touching edges keep the raw relation label.

Reported on a sample-data table with a circular lineage cycle (`dim_customer ↔ fact_orders`) where both directions showed "Upstream".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close remaining Copilot review gaps

Three findings from PR #27999's third review pass — all about failure signals being silently dropped between layers:

- `RdfIndexApp.processTask` ignored relationship failures: only `result.failedCount() > 0` was treated as a failure, so partitions whose Fuseki relationship/lineage writes failed (incrementing `relationshipFailureCount` but not `failedCount`) never wrote `jobData.failure`. Switch to `result.hasAnyFailure()` and report the combined count.
- `checkAndUpdateJobCompletion` ignored partition `lastError`: a partition can finish COMPLETED with `lastError` set when a relationship bulk write was caught and recorded but didn't bump `failedRecords` or flip the partition to FAILED. The job would then go to COMPLETED even though there were real failures. Treat the presence of any `rdf_index_partition.lastError` as an error signal — promote to COMPLETED_WITH_ERRORS and aggregate sample errors into the job's errorMessage if it was blank.
- `forwardEquivalentPredicate` mapped to a non-existent `om:DOWNSTREAM` URI: OpenMetadata only stores lineage with `om:UPSTREAM` (forward) and `prov:wasDerivedFrom` (reverse PROV-O pair); there is no `om:DOWNSTREAM` predicate written anywhere — the downstream view is derived by reading the same UPSTREAM edge from the other side. Map both `prov:wasDerivedFrom` and `prov:wasInfluencedBy` to `om:UPSTREAM` (both are reverse-direction causation predicates: in `B wasDerivedFrom A` / `B wasInfluencedBy A` the source is A and effect is B, so the canonical forward predicate is the same).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix RDF tag mapper

* Fix all the comments

Cherry-picked from #27562 (without bin/ autogenerated noise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Align RdfPropertyMapper tests with refactor and isolate ontology export IT

RdfPropertyMapperTest still referenced the removed addVotes helper and expected addStructuredProperty to dispatch votes — both gone after votes was added to IGNORED_PROPERTIES. Update the assertions accordingly.

GlossaryOntologyExportIT timed out on the full suite because it flips a global RDF singleton in @BeforeAll and each test blocks a server thread on synchronous Fuseki writes. SAME_THREAD only serialized methods within the class — concurrent classes still raced for server threads. Adding @Isolated matches the pattern already used by RdfResourceIT for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): align addCertification typing + relationType after predicate flip

Two findings on PR #27999 from the post-cherry-pick review pass:

- `addCertification` mis-typed glossary-source certifications and skipped skos:Concept: it always emitted `om:Tag` regardless of source, even though `resolveTagResource` returns a glossaryTerm URI when the certification points at a glossary term. It also didn't add `skos:Concept` (or the `createTypeResource("tag")` `skos:Concept` for classification tags), so SPARQL queries filtering certification targets by `a skos:Concept` missed them while `addTagLabel`-emitted tags were findable. Mirror `addTagLabel`: branch on source (`Glossary` vs `Classification`), emit the right primary type plus `skos:Concept` (glossary) or `om:Tag` (classification), and include `om:tagSource`.
- `relationType` left stale after predicate flip: when `parseEntityGraphEdgesFromResults` flipped subject/object for a reverse-direction predicate and rewrote `canonicalPredicate` to `om:UPSTREAM`, it kept the original `relationType` derived from the reverse predicate. So `prov:wasInfluencedBy` produced an EdgeInfo with `relationType=downstream` + `predicate=om:UPSTREAM` — internally inconsistent, and the mismatched `edgeKey` prevented dedup against an existing UPSTREAM edge with the same endpoints. Re-derive `relationType` from the canonical predicate after the flip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings + add parser-helper unit tests

Two outstanding Copilot findings on PR #27999 plus targeted unit coverage for the helpers that drive lineage canonicalization.

Findings:
- `colLineageUri` collision risk (RdfRepository): the deterministic key replaced non-alphanumerics in `toColumn` with `_`, so distinct column names (e.g. `a-b` vs `a_b`) collapsed onto the same URI, which would lose / overwrite column-lineage resources during reindex. Append the loop index as a tiebreaker so distinct columns keep distinct URIs.
- `createTypeResource` missing dprod prefix (RdfPropertyMapper): the `getNamespace` switch didn't recognize `dprod`, so `RdfUtils.getRdfType("dataProduct")` (returns `dprod:DataProduct`) produced an invalid `dprod:DataProduct` URI on the wire. Added the `DPROD_NS = https://ekgf.github.io/dprod/` constant and a `dprod` case in the switch.

Coverage:
- New `RdfParserHelpersTest` exercises the canonicalization helpers via reflection: `isReverseDirectionPredicate` (recognizes PROV-O causation predicates, ignores forward predicates), `forwardEquivalentPredicate` (both `wasDerivedFrom` and `wasInfluencedBy` collapse to `om:UPSTREAM` so dedup works), `relativeRelationLabel` (focal-relative Upstream/Downstream flipping with all the boundary cases — non-focal edges, non-lineage relations, null focal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): merge array contexts before per-field resolution

The third (low-confidence "suppressed") finding on review 4256830399 turned out to be a real duplication: when a field is mapped in one context map of an array context but absent from another, the previous processArrayContext ran processContextMappings once per map. The pass where the field IS mapped emits the proper `om:hasOwner <ref>` triples (plus `prov:wasAttributedTo`); the pass where the field is absent falls through to processUnmappedField and emits an additional `om:owners <ref>` triple. Net: two predicates for the same logical relationship. Verified on the live Fuseki: 113 `om:hasOwner` triples vs 112 `om:owners` triples — one set per pass.

Fix: flatten all context maps in the array into a single merged map once, then iterate entity fields exactly once against that combined view (later contexts win on key conflicts, matching JSON-LD context merge semantics). Each field is resolved against the union of mappings, so the unmapped fallback only fires for fields truly absent from every context. Net effect: `prov:wasAttributedTo` count is unchanged, `om:hasOwner` is unchanged, and the redundant `om:owners` triples disappear.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings on coordinator finalization race

Two findings from PR #27999 review 4259628860:

- `checkAndUpdateJobCompletion` early-returned before lastError check could promote: `refreshAggregatedJob` already marks the job COMPLETED when partitions all finish without `failedRecords`/`failedPartitions`, so `checkAndUpdateJobCompletion`'s subsequent `if (job.isTerminal())` short-circuit silently dropped the lastError signal. Move the partition-lastError check INTO `refreshAggregatedJob` so both code paths produce consistent terminal status — a partition that finished COMPLETED but carries a non-null lastError now correctly promotes the job to COMPLETED_WITH_ERRORS regardless of which finalizer wins the race.
- `completePartition` / `failPartition` overwrote CANCELLED state: the unconditional partition row update lost a concurrent Stop's CANCELLED status if a worker finished its batch after the Stop request landed but before noticing it. Add a status-guarded `updateIfProcessing` DAO method (UPDATE ... WHERE id = :id AND status = 'PROCESSING') and have both completion paths use it; if 0 rows update, log and skip the side effects (no server-stat increment, no refreshAggregatedJob call) so the authoritative CANCELLED status stays. Mirrors the pattern SearchIndex's coordinator uses for the same race.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
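The focal-relative labeling rule from the commit message above ("label lineage edges relative to focal node") can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the class name `LineageLabelSketch`, the method signature, and the use of plain string ids are assumptions made for the example, and it presumes edges are already canonicalized so that the source is upstream of the target.

```java
// Hypothetical sketch: class name, signature, and string ids are illustrative
// assumptions, not OpenMetadata's actual code.
final class LineageLabelSketch {

  private LineageLabelSketch() {}

  /**
   * Labels a canonical (source is upstream of target) edge relative to a focal node.
   * Returns "Downstream" for focal -> X, "Upstream" for X -> focal, and the raw
   * relation label in every other case.
   */
  static String relativeRelationLabel(
      String focalId, String sourceId, String targetId, String rawRelation) {
    // A missing focal node or a non-lineage relation keeps the raw label.
    if (focalId == null || !"upstream".equalsIgnoreCase(rawRelation)) {
      return rawRelation;
    }
    if (focalId.equals(sourceId)) {
      return "Downstream"; // focal -> X: X is downstream of the focal node
    }
    if (focalId.equals(targetId)) {
      return "Upstream"; // X -> focal: X is upstream of the focal node
    }
    return rawRelation; // the edge does not touch the focal node
  }

  public static void main(String[] args) {
    // The circular case from the commit message, viewed from dim_customer.
    String focal = "dim_customer";
    System.out.println(
        relativeRelationLabel(focal, "dim_customer", "fact_orders", "upstream")); // Downstream
    System.out.println(
        relativeRelationLabel(focal, "fact_orders", "dim_customer", "upstream")); // Upstream
  }
}
```

With this rule the two directions of a cycle such as `dim_customer ↔ fact_orders` receive different labels, so bidirectional lineage no longer looks like one relation listed twice.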
.agents/skills	Context center (#27558 )	2026-05-08 10:56:04 -07:00
.claude	Context center (#27558 )	2026-05-08 10:56:04 -07:00
.devcontainer	MINOR - DevContainer Setup for contribution (#26623 )	2026-03-20 08:27:30 +01:00
.github	fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999 )	2026-05-11 06:14:50 -07:00
bin	Set Indexing related executor threads priority to LOW (#27153 )	2026-04-15 11:28:47 -07:00
bootstrap	Context center (#27558 )	2026-05-08 10:56:04 -07:00
common	fix(security): pin libthrift, provided jsonschema2pojo, bump azure-kv/sjm/reactor-netty, exclude netty-epoll (#28010 )	2026-05-11 14:08:26 +05:30
conf	feat(search): add Google Gemini embedding provider (#27974 )	2026-05-10 16:37:53 +02:00
docker	Perf/redis cache metrics and indexes (#27499 )	2026-04-23 12:18:53 +02:00
docs	feat: Add auto-classification support for storage service containers (#26495 )	2026-04-24 06:29:16 -07:00
examples/python-sdk/data-quality	Create documentation resources for Data Quality as Code (closes #23800 ) (#24169 )	2025-11-11 10:25:42 +00:00
ingestion	Fixes #27950 : [Datalake] JSON columns incorrectly typed as STRING for empty dict values (#27951 )	2026-05-11 18:02:06 +05:30
openmetadata-airflow-apis	chore(ingestion): drop pylint, expand ruff (#27774 )	2026-04-28 07:21:59 +02:00
openmetadata-clients	fix(security): upgrade Java dependencies to resolve CRITICAL and HIGH CVEs (#27940 )	2026-05-07 09:19:10 +00:00
openmetadata-dist	Deprecate OpenMetadata Java client in favor of new Java SDK (#26388 )	2026-03-10 21:30:39 -07:00
openmetadata-integration-tests	fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999 )	2026-05-11 06:14:50 -07:00
openmetadata-k8s-operator	Fixes #27852 : propagate tolerations from CronOMJob to scheduled OMJob (#27955 )	2026-05-07 14:38:49 +02:00
openmetadata-mcp	chore(mcp): add server.json for MCP Registry publishing (#27982 )	2026-05-08 10:14:31 +02:00
openmetadata-sdk	Context center (#27558 )	2026-05-08 10:56:04 -07:00
openmetadata-service	fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999 )	2026-05-11 06:14:50 -07:00
openmetadata-shaded-deps	fix(security): upgrade Java dependencies to resolve CRITICAL and HIGH CVEs (#27940 )	2026-05-07 09:19:10 +00:00
openmetadata-spec	feat(ingestion): add QuestDB database connector (#27604 )	2026-05-11 13:02:32 +05:30
openmetadata-ui	fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999 )	2026-05-11 06:14:50 -07:00
openmetadata-ui-core-components	Fix fast-uri Dependabot vulnerabilities in UI core components (#28020 )	2026-05-11 08:30:58 +00:00
openspec	Task redesign (#25894 )	2026-04-23 15:52:30 +02:00
scripts	Reindex robustness: selective fields, cache fail-fast, stop actually stops (#27876 )	2026-05-04 13:22:15 -07:00
skills	feat(ingestion): add QuestDB database connector (#27604 )	2026-05-11 13:02:32 +05:30
.dockerignore	RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex (#26902 )	2026-04-14 13:24:41 -07:00
.git-blame-ignore-revs	Minor: update git-blmae-ignore-revs, and uncomment ClassificationResourceTest tests code (#14431 )	2023-12-18 19:16:29 -08:00
.gitignore	chore(ingestion): enable basedpyright across the codebase via baseline (#27755 )	2026-04-27 17:15:44 +02:00
.nojekyll	shahsank3t published a site update	2021-08-04 06:23:29 +00:00
.pre-commit-config.yaml	chore(ingestion): migrate to ruff for format + isort + unused-import (#27739 )	2026-04-27 10:05:28 +02:00
.snyk	Ignore _openmetadata_testutils from snyk (#21168 )	2025-05-13 18:01:05 +05:30
adr-incident-manager-governance-workflows.md	Task redesign (#25894 )	2026-04-23 15:52:30 +02:00
AGENTS.md	Context center (#27558 )	2026-05-08 10:56:04 -07:00
APPLICATION.md	Rename app 'preview' property to 'enabled' (#26170 )	2026-03-05 08:29:54 +01:00
CLAUDE.md	Context center (#27558 )	2026-05-08 10:56:04 -07:00
CODE_OF_CONDUCT.md	Fix #412 - Add code of conduct for OpenMetadata community	2021-09-06 18:57:17 -07:00
CONTRIBUTING.md	addded more detail on issue creation in contributors page (#16583 )	2024-06-09 14:02:36 -07:00
DEVELOPER.md	chore(ingestion): drop pylint, expand ruff (#27774 )	2026-04-28 07:21:59 +02:00
generate_ts.sh	Feature: Generate TS From JSON (#19823 )	2025-02-25 18:18:02 +05:30
INCIDENT_RESPONSE.md	Add threat model and incident response (#23603 )	2025-09-28 13:17:23 -07:00
LICENSE	OpenMetadata snapshot release 0.3	2021-08-01 14:27:44 -07:00
Makefile	security: Include branch name in security scan Slack alerts and fail only on high vulnerabilities (#27977 )	2026-05-11 10:41:48 +05:30
NOTICE	OpenMetadata snapshot release 0.3	2021-08-01 14:27:44 -07:00
package.json	fix: Resolve frontend security vulnerabilities in lodash and lodash-es (#27105 )	2026-04-07 07:55:25 +00:00
pom.xml	Fixes #22916 : Add chart-level lineage for Metabase connector (#26778 )	2026-05-11 16:40:49 +05:30
README.md	Update README.md for column-level consistency (#24670 )	2025-12-03 07:59:18 -08:00
SECURITY.md	Update vulnerability reporting instructions in SECURITY.md (#25651 )	2026-01-30 14:03:09 -08:00
THREAT_MODEL.md	Add threat model and incident response (#23603 )	2025-09-28 13:17:23 -07:00
yarn.lock	fix: Resolve frontend security vulnerabilities in lodash and lodash-es (#27105 )	2026-04-07 07:55:25 +00:00

README.md

Empower your Data Journey with OpenMetadata

What is OpenMetadata?

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column-level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.

Contents:

Features
Try our Sandbox
Install & Run
Roadmap
Documentation and Support
Contributors

OpenMetadata Consists of Four Main Components:

Metadata Schemas: These are the core definitions and vocabulary for metadata based on common abstractions and types. They also allow for custom extensions and properties to suit different use cases and domains.
Metadata Store: This is the central repository for storing and managing the metadata graph, which connects data assets, users, and tool-generated metadata in a unified way.
Metadata APIs: These are the interfaces for producing and consuming metadata, built on top of the metadata schemas. They enable seamless integration of user interfaces and tools, systems, and services with the metadata store.
Ingestion Framework: This is a pluggable framework for ingesting metadata from various sources and tools into the metadata store. It supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.

Key Features of OpenMetadata

Data Discovery: Find and explore all your data assets in a single place using various strategies, such as keyword search, data associations, and advanced queries. You can search across tables, topics, dashboards, pipelines, and services.

Data Collaboration: Communicate, converse, and cooperate with other users and teams on data assets. You can get event notifications, send alerts, add announcements, create tasks, and use conversation threads.

Data Quality and Profiler: Measure and monitor data quality with no-code tools to build trust in your data. You can define and run data quality tests, group them into test suites, and view the results in an interactive dashboard. With powerful collaboration features, data quality becomes a shared responsibility in your organization.

Data Governance: Enforce data policies and standards across your organization. You can define data domains and data products, assign owners and stakeholders, and classify data assets using tags and terms. Use powerful automation features to auto-classify your data.

Data Insights and KPIs: Use reports and platform analytics to understand how your organization's data is doing. Data Insights provides a single-pane view of the key metrics that best reflect the state of your data. Define Key Performance Indicators (KPIs) and set goals within OpenMetadata to work towards better documentation, ownership, and tiering. You can set alerts against the KPIs and receive them on a specified schedule.

Data Lineage: Track and visualize the origin and transformation of your data assets end-to-end. You can view column-level lineage, filter queries, and edit lineage manually using a no-code editor.

Data Documentation: Document your data assets and metadata entities using rich text, images, and links. You can also add comments and annotations and generate data dictionaries and data catalogs.

Data Observability: Monitor the health and performance of your data assets and pipelines. You can view metrics such as data freshness, data volume, data quality, and data latency. You can also set up alerts and notifications for any anomalies or failures.

Data Security: Secure your data and metadata using various authentication and authorization mechanisms. You can integrate with different identity providers for single sign-on and define roles and policies for access control.

Webhooks: Integrate with external applications and services using webhooks. You can register URLs to receive metadata event notifications and integrate with Slack, Microsoft Teams, and Google Chat.

Connectors: Ingest metadata from various sources and tools using connectors. OpenMetadata supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.

Try our Sandbox

Take a look and play with sample data at http://sandbox.open-metadata.org

Install and Run OpenMetadata

Get up and running in a few minutes. See the OpenMetadata documentation for installation instructions.

Documentation and Support

We're here to help and make OpenMetadata even better! Check out the OpenMetadata documentation for a complete description of OpenMetadata's features. Join our Slack Community to get in touch with us if you want to chat, need help, or want to discuss new feature requirements.

Contributors

We ❤️ all contributions, big and small! Check out our CONTRIBUTING guide to get started, and let us know how we can help.

Don't want to miss anything? Give the project a ⭐ 🚀

A HUGE THANK YOU to all our supporters!

License

OpenMetadata is released under the Apache License, Version 2.0.