OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column-level lineage, and seamless team collaboration.
Sriharsha Chintalapani d3bbbefe37
fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999)
* fix(rdf): dedupe lineage edges and broaden PROV-O coverage

The RDF Knowledge Graph endpoint was emitting two edges per lineage
relationship — once as `om:UPSTREAM` (forward) and once as
`prov:wasDerivedFrom` (reverse) — because the parser preserved each
predicate's native subject/object orientation instead of canonicalizing
both into a single `(upstream, downstream)` edge.
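
A minimal sketch of the canonicalization described above, written in Python for brevity (the actual change lives in the Java service and is not shown here). The function names and the tuple layout are hypothetical; only the two predicate names come from the text.

```python
# Predicate names are taken from the commit message; everything else is illustrative.
REVERSE_PREDICATES = {"prov:wasDerivedFrom"}


def canonicalize(subject, predicate, obj):
    """Return an (upstream, downstream) pair for one lineage triple."""
    if predicate in REVERSE_PREDICATES:
        # "B prov:wasDerivedFrom A" means A is upstream of B, so swap the ends.
        return (obj, subject)
    return (subject, obj)


def dedupe_edges(triples):
    """Collapse forward and reverse triples into unique canonical edges."""
    seen = set()
    edges = []
    for subject, predicate, obj in triples:
        edge = canonicalize(subject, predicate, obj)
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)
    return edges


triples = [
    ("table:orders_raw", "om:UPSTREAM", "table:orders_clean"),
    ("table:orders_clean", "prov:wasDerivedFrom", "table:orders_raw"),
]
edges = dedupe_edges(triples)
```

With the two sample triples, both orientations reduce to the same pair, so only one edge survives.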

Also extend PROV-O coverage so external SPARQL clients can use the W3C
Provenance vocabulary directly:
- `prov:Entity` / `prov:Activity` / `prov:Agent` class typing on
  datasets / pipelines / users
- `prov:wasAttributedTo` mirror of `om:owners`
- `prov:generated` (inverse of existing `wasGeneratedBy`) and `prov:used`
  on lineageDetails so the Entity → Activity → Entity chain is complete
- `prov:hadPlan` + `prov:Plan` for SQL transformation recipes
- `prov:startedAtTime` / `prov:endedAtTime` on Activity instances
- `prov:wasAssociatedWith` Activity → Agent linking
- `prov:invalidatedAtTime` on soft-deleted entities

Other RDF cleanups in the same area:
- LineageDetails URIs are now deterministic (driven by from/to ids
  instead of a timestamp), so re-indexing collapses duplicate Activity
  resources via the existing DELETE+INSERT idempotency
- Skip emitting the redundant `om:owners` JSON-string literal — the
  mapped path already produces clean `om:hasOwner <agent>` triples
- Skip empty `[]` array literals in the unmapped path
- Propagate failures from `RdfRepository.{addRelationship,
  addLineageWithDetails, bulkAddRelationships,
  bulkAddGlossaryTermRelations}` instead of silently swallowing them,
  so downstream callers can surface the failure
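
To illustrate why a deterministic LineageDetails URI makes re-indexing idempotent, here is a small Python sketch. The base namespace and the URI layout are assumptions made for the example, not the scheme the service uses.

```python
# Hypothetical namespace, used only for this illustration.
BASE = "https://example.org/lineageDetails/"


def lineage_details_uri(from_id, to_id):
    """Build a URI from the edge endpoints only, with no timestamp component."""
    return f"{BASE}{from_id}/{to_id}"


first_run = lineage_details_uri("1111", "2222")
second_run = lineage_details_uri("1111", "2222")
```

Because the same endpoints always yield the same URI, a second indexing run rewrites the existing resource instead of creating a new one.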

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): surface Fuseki failures in app run record

Per-entity and per-batch failures from the RDF index app used to be
logged via SLF4J only — they never made it into the AppRunRecord, so
the UI/run history showed "completed" even when every entity had
silently failed to write to Fuseki.

- `RdfBatchProcessor.processEntities` now captures the last error per
  entity, returns it in `BatchProcessingResult.lastError`, and
  accumulates relationship-processing failures into the same result.
- Relationship and lineage processing methods (`processBatchRelationships`,
  `processLineageRelationship`, `processGlossaryTermRelations`) return
  structured results with failure counts and last-error messages instead
  of `void`, so failures are visible to the partition worker.
- `RdfIndexApp` records the failure on `jobData` for both the
  distributed and non-distributed code paths, so users see a real
  error message in the run history (e.g.
  "Failed to write entity X to Fuseki: ConnectException").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* perf(rdf-index-app): port distributed-mode improvements from SearchIndex

The RDF distributed-indexing fork was lagging behind several SearchIndex
improvements that addressed concrete reliability and throughput issues.
Port them across:

Core perf / reliability
- Precomputed partition start cursors: coordinator walks each entity
  once via keyset pagination at job init and caches the boundary cursor
  per (jobId, entityType, rangeStart). Workers consult the cache before
  falling back to the OFFSET-based path. Eliminates the previous O(N²)
  per-partition cursor lookup.
- `cancelInFlightPartitions` + `requestStop` + `checkAndUpdateJobCompletion`
  on the coordinator. Stop now cancels both PENDING and PROCESSING
  partitions in a single SQL update and immediately drives the job
  status from STOPPING → STOPPED, so the UI status no longer hangs
  while workers drain.
- Selective field hydration: `RdfPartitionWorker.readEntitiesKeyset`
  uses `ReindexingUtil.getSearchIndexFields(entityType)` instead of
  `List.of("*")`, avoiding expensive fetchers (e.g. fetchAndSetOwns)
  per batch.
- Partition heartbeat thread: virtual thread refreshes
  `lastUpdateAt` every 30s for partitions actively being processed by
  this server, so the stale reclaimer no longer interrupts active work.
- `MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5` backpressure: claim path
  rejects when the server already holds 5 PROCESSING partitions, giving
  fair distribution across pods. Verified the existing claim DAO uses
  `FOR UPDATE SKIP LOCKED` for both MySQL and Postgres.
- Gate WebSocket stat broadcasts during the STOPPING phase so the
  Quartz-scheduler-driven STOPPED status push isn't overwritten.
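
The precomputed start cursors can be sketched as follows. This is an illustrative Python model over an in-memory sorted list, not the coordinator code; the function names are hypothetical, and the cache key mirrors the (jobId, entityType, rangeStart) triple named above.

```python
def precompute_start_cursors(job_id, entity_type, sorted_keys, partition_size):
    """Walk the keys once and record the cursor that precedes each partition."""
    cache = {}
    last_key = None
    for offset, key in enumerate(sorted_keys):
        if offset % partition_size == 0:
            # None means "start from the beginning of the key space".
            cache[(job_id, entity_type, offset)] = last_key
        last_key = key
    return cache


def read_partition(sorted_keys, cursor, limit):
    """Keyset read: up to `limit` keys strictly greater than the cursor."""
    rows = [k for k in sorted_keys if cursor is None or k > cursor]
    return rows[:limit]


keys = ["a", "b", "c", "d", "e"]
cache = precompute_start_cursors("job-1", "table", keys, 2)
second_partition = read_partition(keys, cache[("job-1", "table", 2)], 2)
```

A worker that finds its range start in the cache can begin reading right after the stored key rather than skipping rows by offset.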

Multi-server scaffolding (single-pod is unaffected)
- `RdfPollingJobNotifier`: DB-polling discovery for other server pods
  to find an in-flight RDF reindex they can join.
- `RdfEntityCompletionTracker`: per-entity-type partition tracking with
  callback firing once all partitions for an entity complete, foundation
  for early per-entity index promotion.

Tests: precomputed-cursor cache lookup, in-flight backpressure,
cancelInFlight delegation, completion tracker callback semantics,
notifier start/stop.

DAO additions on `rdf_index_partition`:
- `cancelInFlightPartitions(jobId, now)` — covers both PENDING and
  PROCESSING in one statement
- `countInFlightPartitionsForServer(jobId, serverId)` — backpressure
- `countPartitionsByStatus(jobId, status)` — used by completion check

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ui-apps): hide misleading data on synthetic 'CurrentConfig' row

When an app has no run history, AppRunsHistory fabricated a synthetic
placeholder row that looked like a real run — `runType: "CurrentConfig"`,
a fake `Run At` timestamp pulled from `appData.updatedAt`, an
ever-growing `Duration` (`now − updatedAt`), and an active `Stop` button
that targeted nothing.

Render `--` for `Run At`, `Run Type`, and `Duration` on synthetic rows,
and hide the `Stop` button so users no longer see "Run now → 19-minute
Running with Stop button" when the actual job never registered. Real
app runs are unaffected — they still display `runType` from the
backend (OnDemandJob, Hourly, Daily, Custom, etc.).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address PR review findings

Four issues raised in PR #27999 review:

- **Cursor format consistency in walkAndRecord** (bug):
  The defensive branch produced cursors via a custom `{name, id}` map
  while the regular path used `repo.getCursorValue()`. For entities
  with quoted names these encodings diverge — a quoted-name entity
  could land in the cache with a cursor incompatible with what the
  worker fetches via keyset pagination. Track the last seen entity
  reference and run it through `repo.getCursorValue()` in both paths.
  `encodeBoundaryCursor` is removed.

- **Adaptive scheduling in RdfPollingJobNotifier** (perf):
  The previous implementation woke the scheduler thread every 1s and
  short-circuited inside the poll method when idle. Reschedule the
  task at the appropriate interval (1s active / 30s idle) when
  `setParticipating` flips, so the thread genuinely sleeps when idle.

- **Cursor cache cleanup on startup recovery** (edge case):
  `partitionStartCursors` was only evicted by `refreshAggregatedJob`
  / `checkAndUpdateJobCompletion`. If a coordinator crashed mid-job
  and never reached either, the cache entry leaked until process
  restart. Add `evictStaleCursorCacheEntries()` invoked by
  `performStartupRecovery` that drops entries for jobs that no longer
  exist in the DB or are already terminal.

- **Consolidate describeError helpers** (quality):
  `describeError`, `describeBulkError`, and `describeLineageError` in
  `RdfBatchProcessor` all walked the cause chain and formatted a
  prefixed message with the same logic. Reduced to a single
  `describeError(prefix, error)` plus a thin `describeEntityError`
  adapter for the per-entity call site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf-index-app): avoid double workerExecutor.shutdownNow() in stop()

stop() called workerExecutor.shutdownNow() inline AND through
cleanupLocalExecution -> shutdownWorkerExecutor, which broke the
DistributedRdfIndexExecutorTest.stopAndCoordinatorCleanupOnlyTearDownLocalExecutionOnce
verify(workerExecutor, times(1)).shutdownNow() expectation. Drop the
inline call — cleanupLocalExecution is the single owner of the
shutdown path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: drop redundant DB matrix from openmetadata-service unit tests

The {mysql, postgresql} strategy matrix on openmetadata-service unit
tests doubled CI cost without adding signal: both jobs ran the same
surefire suite. The `-Pmysql` / `-Ppostgresql` profiles are defined
only in `openmetadata-sdk/pom.xml` (lines 190-206), set a single
`test.database` property, and that property is consumed exclusively by
the failsafe plugin (integration tests `*IT.java` / `*IntegrationTest.java`),
which only runs under `-Pintegration-tests` — not enabled here.

`openmetadata-service` itself has zero tests that read `test.database`
or use `MySQLContainer`/`PostgreSQLContainer` (verified by grep). The
only testcontainer-based DB code in the repo lives in
`openmetadata-integration-tests`, a different module that this workflow
doesn't build.

Run the unit suite once. The `openmetadata-service-unit-tests-status`
required-check aggregator is unaffected (it depends on the renamed job
which still has the same name).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): address Copilot PR review findings

Six correctness issues raised on PR #27999:

- **Lineage-details DELETE was too broad** (RdfRepository): the cleanup
  step deleted *all* `<fromUri> om:hasLineageDetails ?d` triples,
  so reindexing one (fromId, toId) edge wiped lineage-details links
  for every other downstream of the same source entity. Pin the
  delete to the specific `<fromUri> om:hasLineageDetails <detailsUri>`
  triple. Same with prov:generated cleanup — anchor it to the
  specific detailsUri instead of any details resource.

- **Predicate not flipped during canonicalization** (RdfRepository):
  `parseEntityGraphEdgesFromResults` swapped subject/object for
  reverse-direction predicates (`prov:wasDerivedFrom`,
  `prov:wasInfluencedBy`) but kept the original predicate URI on the
  resulting EdgeInfo. Exported graphs could carry semantically
  invalid triples like `<upstream> prov:wasDerivedFrom <downstream>`.
  Add `forwardEquivalentPredicate` to substitute the OM-native
  forward predicate when the direction flips.

- **`dct:modified` was an invalid xsd:dateTime** (RdfPropertyMapper):
  `entity.getUpdatedAt().toString()` returns the epoch-millis Long as
  a string, but the literal was tagged `xsd:dateTime`. Convert via
  `Instant.ofEpochMilli(...).toString()` so the lexical form matches
  the type — same fix already in place for prov:invalidatedAtTime.

- **Unmapped EntityReference arrays were dropped entirely**
  (RdfPropertyMapper): the previous fix to skip noisy JSON-string
  literals also dropped fields like `domains`, `reviewers`, `voters`
  for entity contexts that don't have a JSON-LD mapping for them —
  the unmapped path was the only path emitting them, so nothing
  landed in RDF. Expand each array element through
  `addEntityReference` so the data still produces proper
  `om:<fieldName> <ref>` triples; mapped-path duplicates are
  collapsed by Jena's Model dedupe.

- **Partition failure detection missed reader errors**
  (DistributedRdfIndexExecutor): the EntityCompletionTracker was fed
  `result.errorMessage() != null`, but `RdfPartitionWorker` can
  increment `failedCount` from `readerErrors` without ever setting
  `lastError`. Use `result.failedCount() > 0` so partitions whose
  failures came from `ResultList.getErrors()` are also marked as
  failed when promoting an entity.

- **`COMPLETED_WITH_ERRORS` was hidden when failedRecords == 0**
  (RdfIndexApp): the coordinator marks a job COMPLETED_WITH_ERRORS
  whenever any partition is FAILED or CANCELLED, including for
  user-initiated stops where no record-level failures accrued. The
  monitor's `completedWithErrors` gate required `failedRecords > 0`,
  so those terminal states never hit `jobData.setFailure(...)` and
  the run record showed success. Drop the failedRecords precondition
  and tailor the fallback message based on whether there are
  record-level failures or partition-level only.
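
For the `dct:modified` item above, the difference between the epoch-millis string and a valid `xsd:dateTime` lexical form can be shown in a few lines. This is a Python illustration of the idea, not the Java fix itself; the helper name is hypothetical, and the exact formatting may differ from what `Instant.toString()` produces.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_xsd_datetime(millis):
    """Convert epoch milliseconds to an ISO 8601 UTC timestamp string."""
    moment = EPOCH + timedelta(milliseconds=millis)
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")


invalid_literal = str(1500)                         # "1500" is not a dateTime
valid_literal = epoch_millis_to_xsd_datetime(1500)  # "1970-01-01T00:00:01.500Z"
```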

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): separate relationship failures + type lineage as prov:Activity

Two more PR review findings on #27999:

- **Relationship failures inflated failedRecords stat**: `processEntities`
  was folding relationship/lineage edge failures into `failedCount`,
  which becomes `failedRecords` in the index stats. Records there mean
  entities, computed from entity counts in `totalRecords`. Counting
  per-edge relationship failures could push `failedRecords` above
  `processedRecords`/`totalRecords` and produce nonsensical
  per-entity stats.

  Track them separately: add `relationshipFailureCount` to
  `BatchProcessingResult` and `PartitionResult`. `failedCount` now stays
  entity-level. The completion tracker is fed the broader
  `result.hasAnyFailure()` so partitions where relationship triples
  failed don't get prematurely promoted as success even though their
  entity writes succeeded.

- **`detailsResource` wasn't typed as prov:Activity**: the resource
  carries Activity-shaped predicates (prov:startedAtTime,
  prov:endedAtTime, prov:used, prov:hadPlan, prov:wasGeneratedBy,
  prov:wasAssociatedWith) but only the OM-specific
  `om:LineageDetails` rdf:type. Add an explicit
  `rdf:type prov:Activity` so PROV-O reasoners and federated SPARQL
  clients recognize it as an Activity without having to learn the
  OM type.
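
The split between entity-level and relationship-level failures can be pictured with a small result type. This Python dataclass is only a sketch; the field and method names echo `relationshipFailureCount` and `hasAnyFailure()` from the text, and the rest is assumed.

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    success_count: int = 0
    failed_count: int = 0                # entity-level failures only
    relationship_failure_count: int = 0  # per-edge failures, kept separate

    def has_any_failure(self):
        """True when either kind of failure occurred."""
        return self.failed_count > 0 or self.relationship_failure_count > 0


# Ten entities written successfully, but three relationship writes failed.
result = BatchResult(success_count=10, relationship_failure_count=3)
```

Here `failed_count` stays at zero, so entity stats remain consistent, while `has_any_failure()` still reports the partition as not cleanly successful.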

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): label lineage edges relative to focal node

The Knowledge Graph view was labeling every edge with relation
type "upstream" as "Upstream" regardless of direction relative to the
focal node. For a focal node F, the raw stored relation `(F, X, upstream)`
means "F is upstream of X" — i.e. X is *downstream* of F. The previous
output labeled both `F → X` and `X → F` edges as "Upstream", which made
bidirectional lineage look like a duplicated relation.

Re-orient the label in `convertEdgesToGraphData` based on whether the
focal is the edge's source or target:
- focal → X → "Downstream"
- X → focal → "Upstream"
- non-focal-touching edges keep the raw relation label.

Reported on a sample-data table with a circular lineage cycle
(`dim_customer ↔ fact_orders`) where both directions showed "Upstream".
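
The re-orientation rule can be written down as a short function. This is a Python sketch of the three cases listed above; the function name and argument order are hypothetical.

```python
def relative_relation_label(source, target, relation, focal):
    """Label an edge relative to the focal node, per the rules above."""
    if relation != "upstream" or focal is None:
        return relation  # non-lineage relations and a missing focal stay raw
    if source == focal:
        return "Downstream"  # focal -> X: X is downstream of the focal node
    if target == focal:
        return "Upstream"    # X -> focal: X is upstream of the focal node
    return relation          # edge does not touch the focal node


labels = [
    relative_relation_label("dim_customer", "fact_orders", "upstream", "dim_customer"),
    relative_relation_label("fact_orders", "dim_customer", "upstream", "dim_customer"),
]
```

For the circular `dim_customer ↔ fact_orders` case, the two edges now receive different labels.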

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close remaining Copilot review gaps

Three findings from PR #27999's third review pass — all about failure
signals being silently dropped between layers:

- **`RdfIndexApp.processTask` ignored relationship failures**: only
  `result.failedCount() > 0` was treated as a failure, so partitions
  whose Fuseki relationship/lineage writes failed (incrementing
  `relationshipFailureCount` but not `failedCount`) never wrote
  `jobData.failure`. Switch to `result.hasAnyFailure()` and report the
  combined count.

- **`checkAndUpdateJobCompletion` ignored partition `lastError`**: a
  partition can finish COMPLETED with `lastError` set when a relationship
  bulk write was caught and recorded but didn't bump `failedRecords` or
  flip the partition to FAILED. The job would then go to COMPLETED even
  though there were real failures. Treat the presence of any
  `rdf_index_partition.lastError` as an error signal — promote to
  COMPLETED_WITH_ERRORS and aggregate sample errors into the job's
  errorMessage if it was blank.

- **`forwardEquivalentPredicate` mapped to a non-existent
  `om:DOWNSTREAM` URI**: OpenMetadata only stores lineage with
  `om:UPSTREAM` (forward) and `prov:wasDerivedFrom` (reverse PROV-O
  pair); there is no `om:DOWNSTREAM` predicate written anywhere — the
  downstream view is derived by reading the same UPSTREAM edge from the
  other side. Map both `prov:wasDerivedFrom` and `prov:wasInfluencedBy`
  to `om:UPSTREAM` (both are reverse-direction causation predicates: in
  `B wasDerivedFrom A` / `B wasInfluencedBy A` the source is A and
  effect is B, so the canonical forward predicate is the same).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix RDF tag mapper

* Fix all the comments

Cherry-picked from #27562 (without bin/ autogenerated noise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Align RdfPropertyMapper tests with refactor and isolate ontology export IT

RdfPropertyMapperTest still referenced the removed addVotes helper and
expected addStructuredProperty to dispatch votes — both gone after votes
was added to IGNORED_PROPERTIES. Update the assertions accordingly.

GlossaryOntologyExportIT timed out on the full suite because it flips a
global RDF singleton in @BeforeAll and each test blocks a server thread on
synchronous Fuseki writes. SAME_THREAD only serialized methods within the
class — concurrent classes still raced for server threads. Adding @Isolated
matches the pattern already used by RdfResourceIT for the same reason.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(rdf): align addCertification typing + relationType after predicate flip

Two findings on PR #27999 from the post-cherry-pick review pass:

- **`addCertification` mis-typed glossary-source certifications and
  skipped skos:Concept**: it always emitted `om:Tag` regardless of
  source, even though `resolveTagResource` returns a glossaryTerm URI
  when the certification points at a glossary term. It also didn't add
  `skos:Concept` (or the `createTypeResource("tag")` `skos:Concept` for
  classification tags), so SPARQL queries filtering certification
  targets by `a skos:Concept` missed them while `addTagLabel`-emitted
  tags were findable. Mirror `addTagLabel`: branch on source
  (`Glossary` vs `Classification`), emit the right primary type plus
  `skos:Concept` (glossary) or `om:Tag` (classification), and include
  `om:tagSource`.

- **`relationType` left stale after predicate flip**: when
  `parseEntityGraphEdgesFromResults` flipped subject/object for a
  reverse-direction predicate and rewrote `canonicalPredicate` to
  `om:UPSTREAM`, it kept the original `relationType` derived from the
  reverse predicate. So `prov:wasInfluencedBy` produced an EdgeInfo
  with `relationType=downstream` + `predicate=om:UPSTREAM` —
  internally inconsistent, and the mismatched `edgeKey` prevented
  dedup against an existing UPSTREAM edge with the same endpoints.
  Re-derive `relationType` from the canonical predicate after the
  flip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings + add parser-helper unit tests

Two outstanding Copilot findings on PR #27999 plus targeted unit
coverage for the helpers that drive lineage canonicalization.

Findings:

- **`colLineageUri` collision risk** (RdfRepository): the deterministic
  key replaced non-alphanumerics in `toColumn` with `_`, so distinct
  column names (e.g. `a-b` vs `a_b`) collapsed onto the same URI, which
  would lose / overwrite column-lineage resources during reindex.
  Append the loop index as a tiebreaker so distinct columns keep
  distinct URIs.

- **`createTypeResource` missing dprod prefix** (RdfPropertyMapper):
  the `getNamespace` switch didn't recognize `dprod`, so
  `RdfUtils.getRdfType("dataProduct")` (returns `dprod:DataProduct`)
  produced an invalid `dprod:DataProduct` URI on the wire. Added the
  `DPROD_NS = https://ekgf.github.io/dprod/` constant and a `dprod`
  case in the switch.
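
The `colLineageUri` collision in the first finding is easy to reproduce. The Python sketch below assumes a simple regex-based sanitizer and a suffix format chosen for illustration; the real key format is not shown in the text.

```python
import re


def sanitize(column):
    """Replace every non-alphanumeric character with an underscore."""
    return re.sub(r"[^A-Za-z0-9]", "_", column)


def col_lineage_key(column, index):
    """Append the loop index so distinct columns keep distinct keys."""
    return f"{sanitize(column)}_{index}"


columns = ["a-b", "a_b"]
old_keys = [sanitize(c) for c in columns]
new_keys = [col_lineage_key(c, i) for i, c in enumerate(columns)]
```

`old_keys` contains the same value twice, while `new_keys` keeps the two columns apart.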

Coverage:

- New `RdfParserHelpersTest` exercises the canonicalization helpers
  via reflection: `isReverseDirectionPredicate` (recognizes
  PROV-O causation predicates, ignores forward predicates),
  `forwardEquivalentPredicate` (both `wasDerivedFrom` and
  `wasInfluencedBy` collapse to `om:UPSTREAM` so dedup works),
  `relativeRelationLabel` (focal-relative Upstream/Downstream
  flipping with all the boundary cases — non-focal edges,
  non-lineage relations, null focal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): merge array contexts before per-field resolution

The third (low-confidence "suppressed") finding on review 4256830399
turned out to be a real duplication: when a field is mapped in one
context map of an array context but absent from another, the previous
processArrayContext ran processContextMappings once per map. The pass
where the field IS mapped emits the proper `om:hasOwner <ref>` triples
(plus `prov:wasAttributedTo`); the pass where the field is absent
falls through to processUnmappedField and emits an additional
`om:owners <ref>` triple. Net: two predicates for the same logical
relationship.

Verified on the live Fuseki: 113 `om:hasOwner` triples vs 112
`om:owners` triples — one set per pass.

Fix: flatten all context maps in the array into a single merged map
once, then iterate entity fields exactly once against that combined
view (later contexts win on key conflicts, matching JSON-LD context
merge semantics). Each field is resolved against the union of
mappings, so the unmapped fallback only fires for fields truly absent
from every context. Net effect: `prov:wasAttributedTo` count is
unchanged, `om:hasOwner` is unchanged, and the redundant `om:owners`
triples disappear.
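
A compact way to see the fix is to model each context as a plain dictionary. This Python sketch is an assumption-laden simplification: the helper names are hypothetical, and the fallback predicate format `om:<fieldName>` is the only detail taken from the text.

```python
def merge_contexts(contexts):
    """Flatten an array of context maps; later maps win on key conflicts."""
    merged = {}
    for context in contexts:
        merged.update(context)
    return merged


def resolve_fields(entity, contexts):
    """Resolve every entity field exactly once against the merged view."""
    merged = merge_contexts(contexts)
    triples = []
    for field, value in entity.items():
        # The unmapped fallback fires only when no context maps the field.
        predicate = merged.get(field, f"om:{field}")
        triples.append((predicate, value))
    return triples


contexts = [{"owners": "om:hasOwner"}, {"name": "rdfs:label"}]
entity = {"owners": "user:alice", "custom": "x"}
triples = resolve_fields(entity, contexts)
```

The `owners` field yields a single `om:hasOwner` triple even though the second context does not mention it, and only `custom` takes the fallback path.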

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(rdf): close 2 review findings on coordinator finalization race

Two findings from PR #27999 review 4259628860:

- **`checkAndUpdateJobCompletion` early-returned before lastError check
  could promote**: `refreshAggregatedJob` already marks the job COMPLETED
  when partitions all finish without `failedRecords`/`failedPartitions`,
  so `checkAndUpdateJobCompletion`'s subsequent `if (job.isTerminal())`
  short-circuit silently dropped the lastError signal. Move the
  partition-lastError check INTO `refreshAggregatedJob` so both code
  paths produce consistent terminal status — a partition that finished
  COMPLETED but carries a non-null lastError now correctly promotes the
  job to COMPLETED_WITH_ERRORS regardless of which finalizer wins the
  race.

- **`completePartition` / `failPartition` overwrote CANCELLED state**:
  the unconditional partition row update lost a concurrent Stop's
  CANCELLED status if a worker finished its batch after the Stop
  request landed but before noticing it. Add a status-guarded
  `updateIfProcessing` DAO method (UPDATE ... WHERE id = :id AND
  status = 'PROCESSING') and have both completion paths use it; if 0
  rows update, log and skip the side effects (no server-stat increment,
  no refreshAggregatedJob call) so the authoritative CANCELLED status
  stays. Mirrors the pattern SearchIndex's coordinator uses for the
  same race.
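
The status-guarded update in the second finding can be demonstrated with an in-memory SQLite table. The table name comes from the text; the column set and the Python helper are assumptions for the sketch, and the real DAO is a Java interface against MySQL or Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rdf_index_partition (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO rdf_index_partition VALUES ('p1', 'PROCESSING')")
conn.execute("INSERT INTO rdf_index_partition VALUES ('p2', 'CANCELLED')")


def update_if_processing(connection, partition_id, new_status):
    """Update the row only while it is still PROCESSING; report whether it changed."""
    cursor = connection.execute(
        "UPDATE rdf_index_partition SET status = ? "
        "WHERE id = ? AND status = 'PROCESSING'",
        (new_status, partition_id),
    )
    return cursor.rowcount == 1


completed_p1 = update_if_processing(conn, "p1", "COMPLETED")
completed_p2 = update_if_processing(conn, "p2", "COMPLETED")
p2_status = conn.execute(
    "SELECT status FROM rdf_index_partition WHERE id = 'p2'"
).fetchone()[0]
```

When the helper returns `False`, the caller skips its side effects and the CANCELLED status written by the Stop request remains in place.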

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
2026-05-11 06:14:50 -07:00
.agents/skills Context center (#27558) 2026-05-08 10:56:04 -07:00
.claude Context center (#27558) 2026-05-08 10:56:04 -07:00
.devcontainer MINOR - DevContainer Setup for contribution (#26623) 2026-03-20 08:27:30 +01:00
.github fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999) 2026-05-11 06:14:50 -07:00
bin Set Indexing related executor threads priority to LOW (#27153) 2026-04-15 11:28:47 -07:00
bootstrap Context center (#27558) 2026-05-08 10:56:04 -07:00
common fix(security): pin libthrift, provided jsonschema2pojo, bump azure-kv/sjm/reactor-netty, exclude netty-epoll (#28010) 2026-05-11 14:08:26 +05:30
conf feat(search): add Google Gemini embedding provider (#27974) 2026-05-10 16:37:53 +02:00
docker Perf/redis cache metrics and indexes (#27499) 2026-04-23 12:18:53 +02:00
docs feat: Add auto-classification support for storage service containers (#26495) 2026-04-24 06:29:16 -07:00
examples/python-sdk/data-quality Create documentation resources for Data Quality as Code (closes #23800) (#24169) 2025-11-11 10:25:42 +00:00
ingestion Fixes #27950: [Datalake] JSON columns incorrectly typed as STRING for empty dict values (#27951) 2026-05-11 18:02:06 +05:30
openmetadata-airflow-apis chore(ingestion): drop pylint, expand ruff (#27774) 2026-04-28 07:21:59 +02:00
openmetadata-clients fix(security): upgrade Java dependencies to resolve CRITICAL and HIGH CVEs (#27940) 2026-05-07 09:19:10 +00:00
openmetadata-dist Deprecate OpenMetadata Java client in favor of new Java SDK (#26388) 2026-03-10 21:30:39 -07:00
openmetadata-integration-tests fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999) 2026-05-11 06:14:50 -07:00
openmetadata-k8s-operator Fixes #27852: propagate tolerations from CronOMJob to scheduled OMJob (#27955) 2026-05-07 14:38:49 +02:00
openmetadata-mcp chore(mcp): add server.json for MCP Registry publishing (#27982) 2026-05-08 10:14:31 +02:00
openmetadata-sdk Context center (#27558) 2026-05-08 10:56:04 -07:00
openmetadata-service fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999) 2026-05-11 06:14:50 -07:00
openmetadata-shaded-deps fix(security): upgrade Java dependencies to resolve CRITICAL and HIGH CVEs (#27940) 2026-05-07 09:19:10 +00:00
openmetadata-spec feat(ingestion): add QuestDB database connector (#27604) 2026-05-11 13:02:32 +05:30
openmetadata-ui fix(rdf): dedupe lineage edges, surface Fuseki failures, port distributed-mode improvements (#27999) 2026-05-11 06:14:50 -07:00
openmetadata-ui-core-components Fix fast-uri Dependabot vulnerabilities in UI core components (#28020) 2026-05-11 08:30:58 +00:00
openspec Task redesign (#25894) 2026-04-23 15:52:30 +02:00
scripts Reindex robustness: selective fields, cache fail-fast, stop actually stops (#27876) 2026-05-04 13:22:15 -07:00
skills feat(ingestion): add QuestDB database connector (#27604) 2026-05-11 13:02:32 +05:30
.dockerignore RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex (#26902) 2026-04-14 13:24:41 -07:00
.git-blame-ignore-revs Minor: update git-blmae-ignore-revs, and uncomment ClassificationResourceTest tests code (#14431) 2023-12-18 19:16:29 -08:00
.gitignore chore(ingestion): enable basedpyright across the codebase via baseline (#27755) 2026-04-27 17:15:44 +02:00
.nojekyll shahsank3t published a site update 2021-08-04 06:23:29 +00:00
.pre-commit-config.yaml chore(ingestion): migrate to ruff for format + isort + unused-import (#27739) 2026-04-27 10:05:28 +02:00
.snyk Ignore _openmetadata_testutils from snyk (#21168) 2025-05-13 18:01:05 +05:30
adr-incident-manager-governance-workflows.md Task redesign (#25894) 2026-04-23 15:52:30 +02:00
AGENTS.md Context center (#27558) 2026-05-08 10:56:04 -07:00
APPLICATION.md Rename app 'preview' property to 'enabled' (#26170) 2026-03-05 08:29:54 +01:00
CLAUDE.md Context center (#27558) 2026-05-08 10:56:04 -07:00
CODE_OF_CONDUCT.md Fix #412 - Add code of conduct for OpenMetadata community 2021-09-06 18:57:17 -07:00
CONTRIBUTING.md addded more detail on issue creation in contributors page (#16583) 2024-06-09 14:02:36 -07:00
DEVELOPER.md chore(ingestion): drop pylint, expand ruff (#27774) 2026-04-28 07:21:59 +02:00
generate_ts.sh Feature: Generate TS From JSON (#19823) 2025-02-25 18:18:02 +05:30
INCIDENT_RESPONSE.md Add threat model and incident response (#23603) 2025-09-28 13:17:23 -07:00
LICENSE OpenMetadata snapshot release 0.3 2021-08-01 14:27:44 -07:00
Makefile security: Include branch name in security scan Slack alerts and fail only on high vulnerabilities (#27977) 2026-05-11 10:41:48 +05:30
NOTICE OpenMetadata snapshot release 0.3 2021-08-01 14:27:44 -07:00
package.json fix: Resolve frontend security vulnerabilities in lodash and lodash-es (#27105) 2026-04-07 07:55:25 +00:00
pom.xml Fixes #22916: Add chart-level lineage for Metabase connector (#26778) 2026-05-11 16:40:49 +05:30
README.md Update README.md for column-level consistency (#24670) 2025-12-03 07:59:18 -08:00
SECURITY.md Update vulnerability reporting instructions in SECURITY.md (#25651) 2026-01-30 14:03:09 -08:00
THREAT_MODEL.md Add threat model and incident response (#23603) 2025-09-28 13:17:23 -07:00
yarn.lock fix: Resolve frontend security vulnerabilities in lodash and lodash-es (#27105) 2026-04-07 07:55:25 +00:00




Empower your Data Journey with OpenMetadata


What is OpenMetadata?

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column-level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.



OpenMetadata Consists of Four Main Components:

  • Metadata Schemas: These are the core definitions and vocabulary for metadata based on common abstractions and types. They also allow for custom extensions and properties to suit different use cases and domains.
  • Metadata Store: This is the central repository for storing and managing the metadata graph, which connects data assets, users, and tool-generated metadata in a unified way.
  • Metadata APIs: These are the interfaces for producing and consuming metadata, built on top of the metadata schemas. They enable seamless integration of user interfaces and tools, systems, and services with the metadata store.
  • Ingestion Framework: This is a pluggable framework for ingesting metadata from various sources and tools into the metadata store. It supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.

Key Features of OpenMetadata

Data Discovery: Find and explore all your data assets in a single place using various strategies, such as keyword search, data associations, and advanced queries. You can search across tables, topics, dashboards, pipelines, and services.



Data Collaboration: Communicate, converse, and cooperate with other users and teams on data assets. You can get event notifications, send alerts, add announcements, create tasks, and use conversation threads.



Data Quality and Profiler: Measure and monitor data quality with no-code tests to build trust in your data. You can define and run data quality tests, group them into test suites, and view the results in an interactive dashboard. With powerful collaboration, make data quality a shared responsibility in your organization.



Data Governance: Enforce data policies and standards across your organization. You can define data domains and data products, assign owners and stakeholders, and classify data assets using tags and terms. Use powerful automation features to auto-classify your data.



Data Insights and KPIs: Use reports and platform analytics to understand how your organization's data is doing. Data Insights provides a single-pane view of the key metrics that best reflect the state of your data. Define Key Performance Indicators (KPIs) and set goals within OpenMetadata to work towards better documentation, ownership, and tiering. Alerts can be set against the KPIs and delivered on a specified schedule.



Data Lineage: Track and visualize the origin and transformation of your data assets end-to-end. You can view column-level lineage, filter queries, and edit lineage manually using a no-code editor.

Data Documentation: Document your data assets and metadata entities using rich text, images, and links. You can also add comments and annotations and generate data dictionaries and data catalogs.

Data Observability: Monitor the health and performance of your data assets and pipelines. You can view metrics such as data freshness, data volume, data quality, and data latency. You can also set up alerts and notifications for any anomalies or failures.

Data Security: Secure your data and metadata using various authentication and authorization mechanisms. You can integrate with different identity providers for single sign-on and define roles and policies for access control.

Webhooks: Integrate with external applications and services using webhooks. You can register URLs to receive metadata event notifications and integrate with Slack, Microsoft Teams, and Google Chat.

Connectors: Ingest metadata from various sources and tools using connectors. OpenMetadata supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.

Try our Sandbox

Take a look and play with sample data at http://sandbox.open-metadata.org

Install and Run OpenMetadata

Get up and running in a few minutes. See the OpenMetadata documentation for installation instructions.

Documentation and Support

We're here to help and make OpenMetadata even better! Check out the OpenMetadata documentation for a complete description of OpenMetadata's features. Join our Slack Community to get in touch with us if you want to chat, need help, or want to discuss new feature requirements.

Contributors

We ❤️ all contributions, big and small! Check out our CONTRIBUTING guide to get started, and let us know how we can help.

Don't want to miss anything? Give the project a 🚀

A HUGE THANK YOU to all our supporters!


License

OpenMetadata is released under the Apache License, Version 2.0.