* fix(rdf): dedupe lineage edges and broaden PROV-O coverage
The RDF Knowledge Graph endpoint was emitting two edges per lineage
relationship — once as `om:UPSTREAM` (forward) and once as
`prov:wasDerivedFrom` (reverse) — because the parser preserved each
predicate's native subject/object orientation instead of canonicalizing
both into a single `(upstream, downstream)` edge.
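A minimal sketch of the canonicalization described above. The class, record, and method names (`LineageEdges`, `Edge`, `canonicalize`, `dedupe`) are hypothetical and only illustrate the idea; the actual parser in `RdfRepository` is not shown in this message and may differ.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the actual RdfRepository parser.
final class LineageEdges {
  // Records have value-based equals/hashCode, so a Set collapses duplicates.
  record Edge(String upstream, String downstream) {}

  // om:UPSTREAM is read as (upstream, downstream); prov:wasDerivedFrom is
  // read as (downstream, upstream), so its endpoints are swapped.
  static Edge canonicalize(String subject, String predicate, String object) {
    boolean reverse = predicate.equals("prov:wasDerivedFrom");
    return reverse ? new Edge(object, subject) : new Edge(subject, object);
  }

  // Each triple is {subject, predicate, object}.
  static Set<Edge> dedupe(List<String[]> triples) {
    Set<Edge> edges = new LinkedHashSet<>();
    for (String[] t : triples) {
      edges.add(canonicalize(t[0], t[1], t[2]));
    }
    return edges;
  }
}
```

With this shape, `A om:UPSTREAM B` and `B prov:wasDerivedFrom A` reduce to the same `Edge("A", "B")`, so only one edge is emitted.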
Also extend PROV-O coverage so external SPARQL clients can use the W3C
Provenance vocabulary directly:
- `prov:Entity` / `prov:Activity` / `prov:Agent` class typing on
datasets / pipelines / users
- `prov:wasAttributedTo` mirror of `om:owners`
- `prov:generated` (inverse of existing `wasGeneratedBy`) and `prov:used`
on lineageDetails so the Entity → Activity → Entity chain is complete
- `prov:hadPlan` + `prov:Plan` for SQL transformation recipes
- `prov:startedAtTime` / `prov:endedAtTime` on Activity instances
- `prov:wasAssociatedWith` Activity → Agent linking
- `prov:invalidatedAtTime` on soft-deleted entities
Other RDF cleanups in the same area:
- LineageDetails URIs are now deterministic (driven by from/to ids
instead of a timestamp), so re-indexing collapses duplicate Activity
resources via the existing DELETE+INSERT idempotency
- Skip emitting the redundant `om:owners` JSON-string literal — the
mapped path already produces clean `om:hasOwner <agent>` triples
- Skip empty `[]` array literals in the unmapped path
- Propagate failures from `RdfRepository.{addRelationship,
addLineageWithDetails, bulkAddRelationships,
bulkAddGlossaryTermRelations}` instead of silently swallowing them,
so downstream callers can surface the failure
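To illustrate the deterministic LineageDetails URIs mentioned in the cleanups above, here is a sketch that derives a stable URI from the from/to ids. The base namespace, the `LineageDetailsUri` class, and the use of a name-based UUID are assumptions made for the example, not the scheme the code actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical illustration of a timestamp-free, repeatable URI.
final class LineageDetailsUri {
  // Placeholder namespace, for illustration only.
  static final String BASE = "https://example.org/lineageDetails/";

  // The same (fromId, toId) pair always yields the same URI, so a re-index
  // rewrites the existing resource instead of minting a new one.
  static String of(UUID fromId, UUID toId) {
    String key = fromId + "->" + toId;
    UUID stable = UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8));
    return BASE + stable;
  }
}
```

Because the key is ordered, `of(a, b)` and `of(b, a)` produce different URIs, which keeps the two directions of an edge apart.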
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf-index-app): surface Fuseki failures in app run record
Per-entity and per-batch failures from the RDF index app used to be
logged via SLF4J only — they never made it into the AppRunRecord, so
the UI/run history showed "completed" even when every entity had
silently failed to write to Fuseki.
- `RdfBatchProcessor.processEntities` now captures the last error per
entity, returns it in `BatchProcessingResult.lastError`, and
accumulates relationship-processing failures into the same result.
- Relationship and lineage processing methods (`processBatchRelationships`,
`processLineageRelationship`, `processGlossaryTermRelations`) return
structured results with failure counts and last-error messages instead
of `void`, so failures are visible to the partition worker.
- `RdfIndexApp` records the failure on `jobData` for both the
distributed and non-distributed code paths, so users see a real
error message in the run history (e.g.
"Failed to write entity X to Fuseki: ConnectException").
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* perf(rdf-index-app): port distributed-mode improvements from SearchIndex
The RDF distributed-indexing fork was lagging behind several SearchIndex
improvements that addressed concrete reliability and throughput issues.
Port them across:
Core perf / reliability
- Precomputed partition start cursors: coordinator walks each entity
once via keyset pagination at job init and caches the boundary cursor
per (jobId, entityType, rangeStart). Workers consult the cache before
falling back to the OFFSET-based path. Eliminates the previous O(N²)
per-partition cursor lookup.
- `cancelInFlightPartitions` + `requestStop` + `checkAndUpdateJobCompletion`
on the coordinator. Stop now cancels both PENDING and PROCESSING
partitions in a single SQL update and immediately drives the job
status from STOPPING → STOPPED, so the UI status no longer hangs
while workers drain.
- Selective field hydration: `RdfPartitionWorker.readEntitiesKeyset`
uses `ReindexingUtil.getSearchIndexFields(entityType)` instead of
`List.of("*")`, avoiding expensive fetchers (e.g. fetchAndSetOwns)
per batch.
- Partition heartbeat thread: virtual thread refreshes
`lastUpdateAt` every 30s for partitions actively being processed by
this server, so the stale reclaimer no longer interrupts active work.
- `MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5` backpressure: claim path
rejects when the server already holds 5 PROCESSING partitions, giving
fair distribution across pods. Verified the existing claim DAO uses
`FOR UPDATE SKIP LOCKED` for both MySQL and Postgres.
- Gate WebSocket stat broadcasts during the STOPPING phase so the
Quartz-scheduler-driven STOPPED status push isn't overwritten.
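The precomputed partition start cursors in the list above can be sketched as a single pass over the sorted keys. `BoundaryCursors` and `precompute` are hypothetical names, and an in-memory list stands in for the keyset-paginated walk the coordinator would actually perform.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of boundary-cursor precomputation.
final class BoundaryCursors {
  // For each partition start offset, records the key of the entity just
  // before it. A worker can then resume with "WHERE key > cursor" instead of
  // an OFFSET scan. The first partition (offset 0) needs no cursor.
  static Map<Integer, String> precompute(List<String> sortedKeys, int partitionSize) {
    Map<Integer, String> cursors = new LinkedHashMap<>();
    for (int start = partitionSize; start < sortedKeys.size(); start += partitionSize) {
      cursors.put(start, sortedKeys.get(start - 1));
    }
    return cursors;
  }
}
```

For five keys `a` to `e` and a partition size of 2, the cache holds `b` for the partition starting at offset 2 and `d` for the one starting at offset 4.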
Multi-server scaffolding (single-pod is unaffected)
- `RdfPollingJobNotifier`: DB-polling discovery for other server pods
to find an in-flight RDF reindex they can join.
- `RdfEntityCompletionTracker`: per-entity-type partition tracking with
callback firing once all partitions for an entity complete, foundation
for early per-entity index promotion.
Tests: precomputed-cursor cache lookup, in-flight backpressure,
cancelInFlight delegation, completion tracker callback semantics,
notifier start/stop.
DAO additions on `rdf_index_partition`:
- `cancelInFlightPartitions(jobId, now)` — covers both PENDING and
PROCESSING in one statement
- `countInFlightPartitionsForServer(jobId, serverId)` — backpressure
- `countPartitionsByStatus(jobId, status)` — used by completion check
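The backpressure check that `countInFlightPartitionsForServer` supports could look roughly like the sketch below. `ClaimGate` and `mayClaim` are hypothetical names; the supplier stands in for the DAO count, whose real signature beyond `(jobId, serverId)` is not shown here.

```java
import java.util.function.IntSupplier;

// Hypothetical illustration of the per-server claim limit.
final class ClaimGate {
  static final int MAX_IN_FLIGHT_PARTITIONS_PER_SERVER = 5;

  // inFlightCount stands in for a countInFlightPartitionsForServer lookup.
  // A claim is allowed only while the server holds fewer than the limit.
  static boolean mayClaim(IntSupplier inFlightCount) {
    return inFlightCount.getAsInt() < MAX_IN_FLIGHT_PARTITIONS_PER_SERVER;
  }
}
```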
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(ui-apps): hide misleading data on synthetic 'CurrentConfig' row
When an app has no run history, AppRunsHistory fabricated a synthetic
placeholder row that looked like a real run — `runType: "CurrentConfig"`,
a fake `Run At` timestamp pulled from `appData.updatedAt`, an
ever-growing `Duration` (`now − updatedAt`), and an active `Stop` button
that targeted nothing.
Render `--` for `Run At`, `Run Type`, and `Duration` on synthetic rows,
and hide the `Stop` button so users no longer see "Run now → 19-minute
Running with Stop button" when the actual job never registered. Real
app runs are unaffected — they still display `runType` from the
backend (OnDemandJob, Hourly, Daily, Custom, etc.).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): address PR review findings
Four issues raised in PR #27999 review:
- **Cursor format consistency in walkAndRecord** (bug):
The defensive branch produced cursors via a custom `{name, id}` map
while the regular path used `repo.getCursorValue()`. For entities
with quoted names these encodings diverge — a quoted-name entity
could land in the cache with a cursor incompatible with what the
worker fetches via keyset pagination. Track the last seen entity
reference and run it through `repo.getCursorValue()` in both paths.
`encodeBoundaryCursor` is removed.
- **Adaptive scheduling in RdfPollingJobNotifier** (perf):
The previous implementation woke the scheduler thread every 1s and
short-circuited inside the poll method when idle. Reschedule the
task at the appropriate interval (1s active / 30s idle) when
`setParticipating` flips, so the thread genuinely sleeps when idle.
- **Cursor cache cleanup on startup recovery** (edge case):
`partitionStartCursors` was only evicted by `refreshAggregatedJob`
/ `checkAndUpdateJobCompletion`. If a coordinator crashed mid-job
and never reached either, the cache entry leaked until process
restart. Add `evictStaleCursorCacheEntries()` invoked by
`performStartupRecovery` that drops entries for jobs that no longer
exist in the DB or are already terminal.
- **Consolidate describeError helpers** (quality):
`describeError`, `describeBulkError`, and `describeLineageError` in
`RdfBatchProcessor` all walked the cause chain and formatted a
prefixed message with the same logic. Reduced to a single
`describeError(prefix, error)` plus a thin `describeEntityError`
adapter for the per-entity call site.
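As an illustration of the consolidated helper from the last finding, a single `describeError(prefix, error)` that walks the cause chain might look like this. The output format (`prefix: ExceptionType: message`) is an assumption; the message only says the helpers produced a prefixed message.

```java
// Hypothetical illustration; the real helper lives in RdfBatchProcessor.
final class ErrorDescriber {
  // Walks to the root cause, then formats "prefix: ExceptionType: message".
  static String describeError(String prefix, Throwable error) {
    Throwable root = error;
    while (root.getCause() != null && root.getCause() != root) {
      root = root.getCause();
    }
    String message = root.getMessage();
    String detail = root.getClass().getSimpleName() + (message == null ? "" : ": " + message);
    return prefix + ": " + detail;
  }
}
```

Per-call-site adapters can then be one-liners that only supply the prefix.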
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf-index-app): avoid double workerExecutor.shutdownNow() in stop()
stop() called workerExecutor.shutdownNow() inline AND through
cleanupLocalExecution -> shutdownWorkerExecutor, which broke the
DistributedRdfIndexExecutorTest.stopAndCoordinatorCleanupOnlyTearDownLocalExecutionOnce
verify(workerExecutor, times(1)).shutdownNow() expectation. Drop the
inline call — cleanupLocalExecution is the single owner of the
shutdown path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* ci: drop redundant DB matrix from openmetadata-service unit tests
The {mysql, postgresql} strategy matrix on openmetadata-service unit
tests doubled CI cost without adding signal: both jobs ran the same
surefire suite. The `-Pmysql` / `-Ppostgresql` profiles are defined
only in `openmetadata-sdk/pom.xml` (lines 190-206), set a single
`test.database` property, and that property is consumed exclusively by
the failsafe plugin (integration tests `*IT.java` / `*IntegrationTest.java`),
which only runs under `-Pintegration-tests` — not enabled here.
`openmetadata-service` itself has zero tests that read `test.database`
or use `MySQLContainer`/`PostgreSQLContainer` (verified by grep). The
only testcontainer-based DB code in the repo lives in
`openmetadata-integration-tests`, a different module that this workflow
doesn't build.
Run the unit suite once. The `openmetadata-service-unit-tests-status`
required-check aggregator is unaffected (it depends on the renamed job
which still has the same name).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): address Copilot PR review findings
Six correctness issues raised on PR #27999:
- **Lineage-details DELETE was too broad** (RdfRepository): the cleanup
step deleted *all* `<fromUri> om:hasLineageDetails ?d` triples,
so reindexing one (fromId, toId) edge wiped lineage-details links
for every other downstream of the same source entity. Pin the
delete to the specific `<fromUri> om:hasLineageDetails <detailsUri>`
triple. Same with prov:generated cleanup — anchor it to the
specific detailsUri instead of any details resource.
- **Predicate not flipped during canonicalization** (RdfRepository):
`parseEntityGraphEdgesFromResults` swapped subject/object for
reverse-direction predicates (`prov:wasDerivedFrom`,
`prov:wasInfluencedBy`) but kept the original predicate URI on the
resulting EdgeInfo. Exported graphs could carry semantically
invalid triples like `<upstream> prov:wasDerivedFrom <downstream>`.
Add `forwardEquivalentPredicate` to substitute the OM-native
forward predicate when the direction flips.
- **`dct:modified` was an invalid xsd:dateTime** (RdfPropertyMapper):
`entity.getUpdatedAt().toString()` returns the epoch-millis Long as
a string, but the literal was tagged `xsd:dateTime`. Convert via
`Instant.ofEpochMilli(...).toString()` so the lexical form matches
the type — same fix already in place for prov:invalidatedAtTime.
- **Unmapped EntityReference arrays were dropped entirely**
(RdfPropertyMapper): the previous fix to skip noisy JSON-string
literals also dropped fields like `domains`, `reviewers`, `voters`
for entity contexts that don't have a JSON-LD mapping for them —
the unmapped path was the only path emitting them, so nothing
landed in RDF. Expand each array element through
`addEntityReference` so the data still produces proper
`om:<fieldName> <ref>` triples; mapped-path duplicates are
collapsed by Jena's Model dedupe.
- **Partition failure detection missed reader errors**
(DistributedRdfIndexExecutor): the EntityCompletionTracker was fed
`result.errorMessage() != null`, but `RdfPartitionWorker` can
increment `failedCount` from `readerErrors` without ever setting
`lastError`. Use `result.failedCount() > 0` so partitions whose
failures came from `ResultList.getErrors()` are also marked as
failed when promoting an entity.
- **`COMPLETED_WITH_ERRORS` was hidden when failedRecords == 0**
(RdfIndexApp): the coordinator marks a job COMPLETED_WITH_ERRORS
whenever any partition is FAILED or CANCELLED, including for
user-initiated stops where no record-level failures accrued. The
monitor's `completedWithErrors` gate required `failedRecords > 0`,
so those terminal states never hit `jobData.setFailure(...)` and
the run record showed success. Drop the failedRecords precondition
and tailor the fallback message based on whether there are
record-level failures or partition-level only.
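The `dct:modified` finding above comes down to the difference between a `Long`'s string form and an ISO-8601 instant. The sketch below uses only `java.time.Instant` from the standard library; the `DateTimeLiteral` wrapper is a hypothetical name.

```java
import java.time.Instant;

// Hypothetical wrapper around the conversion named in the finding.
final class DateTimeLiteral {
  // Converts epoch milliseconds to an ISO-8601 string, which is a valid
  // xsd:dateTime lexical form. Long.toString would yield digits only.
  static String fromEpochMillis(long epochMillis) {
    return Instant.ofEpochMilli(epochMillis).toString();
  }
}
```

For example, `fromEpochMillis(0L)` returns `1970-01-01T00:00:00Z`, whereas `Long.toString(0L)` returns `0`, which is not a valid `xsd:dateTime`.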
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): separate relationship failures + type lineage as prov:Activity
Two more PR review findings on #27999:
- **Relationship failures inflated failedRecords stat**: `processEntities`
was folding relationship/lineage edge failures into `failedCount`,
which becomes `failedRecords` in the index stats. Records there mean
entities, computed from entity counts in `totalRecords`. Counting
per-edge relationship failures could push `failedRecords` above
`processedRecords`/`totalRecords` and produce nonsensical
per-entity stats.
Track them separately: add `relationshipFailureCount` to
`BatchProcessingResult` and `PartitionResult`. `failedCount` now stays
entity-level. The completion tracker is fed the broader
`result.hasAnyFailure()` so partitions where relationship triples
failed don't get prematurely promoted as success even though their
entity writes succeeded.
- **`detailsResource` wasn't typed as prov:Activity**: the resource
carries Activity-shaped predicates (prov:startedAtTime,
prov:endedAtTime, prov:used, prov:hadPlan, prov:wasGeneratedBy,
prov:wasAssociatedWith) but only the OM-specific
`om:LineageDetails` rdf:type. Add an explicit
`rdf:type prov:Activity` so PROV-O reasoners and federated SPARQL
clients recognize it as an Activity without having to learn the
OM type.
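The split between entity-level and relationship-level failure counts from the first finding can be sketched as a small result type. `PartitionOutcome` is a hypothetical stand-in for `BatchProcessingResult` / `PartitionResult`, and the exact conditions inside `hasAnyFailure()` are an assumption.

```java
// Hypothetical illustration of a result that keeps the two counts apart.
record PartitionOutcome(int failedCount, int relationshipFailureCount, String lastError) {
  // failedCount stays entity-level and feeds the failedRecords stat.
  // hasAnyFailure is the broader signal used to decide promotion.
  boolean hasAnyFailure() {
    return failedCount > 0 || relationshipFailureCount > 0 || lastError != null;
  }
}
```

A partition whose entities all wrote successfully but whose relationship writes failed then reports `failedCount == 0` while `hasAnyFailure()` is still true.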
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): label lineage edges relative to focal node
The Knowledge Graph view was labeling every edge with relation
type "upstream" as "Upstream" regardless of direction relative to the
focal node. For a focal node F, the raw stored relation `(F, X, upstream)`
means "F is upstream of X" — i.e. X is *downstream* of F. The previous
output labeled both `F → X` and `X → F` edges as "Upstream", which made
bidirectional lineage look like a duplicated relation.
Re-orient the label in `convertEdgesToGraphData` based on whether the
focal is the edge's source or target:
- focal → X → "Downstream"
- X → focal → "Upstream"
- non-focal-touching edges keep the raw relation label.
Reported on a sample-data table with a circular lineage cycle
(`dim_customer ↔ fact_orders`) where both directions showed "Upstream".
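The three labeling rules above can be sketched as follows. `RelationLabels` and `relativeLabel` are hypothetical names, and the sketch assumes it is only called for lineage edges whose source and target already follow the canonical upstream-to-downstream orientation.

```java
// Hypothetical illustration of focal-relative labeling.
final class RelationLabels {
  static String relativeLabel(String focal, String source, String target, String rawLabel) {
    if (focal == null) {
      return rawLabel; // no focal node: keep the raw relation label
    }
    if (focal.equals(source)) {
      return "Downstream"; // focal -> X: X is downstream of the focal node
    }
    if (focal.equals(target)) {
      return "Upstream"; // X -> focal: X is upstream of the focal node
    }
    return rawLabel; // edge does not touch the focal node
  }
}
```

In the `dim_customer ↔ fact_orders` cycle with `dim_customer` as the focal node, the two edges now get different labels instead of both reading "Upstream".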
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): close remaining Copilot review gaps
Three findings from PR #27999's third review pass — all about failure
signals being silently dropped between layers:
- **`RdfIndexApp.processTask` ignored relationship failures**: only
`result.failedCount() > 0` was treated as a failure, so partitions
whose Fuseki relationship/lineage writes failed (incrementing
`relationshipFailureCount` but not `failedCount`) never wrote
`jobData.failure`. Switch to `result.hasAnyFailure()` and report the
combined count.
- **`checkAndUpdateJobCompletion` ignored partition `lastError`**: a
partition can finish COMPLETED with `lastError` set when a relationship
bulk write was caught and recorded but didn't bump `failedRecords` or
flip the partition to FAILED. The job would then go to COMPLETED even
though there were real failures. Treat the presence of any
`rdf_index_partition.lastError` as an error signal — promote to
COMPLETED_WITH_ERRORS and aggregate sample errors into the job's
errorMessage if it was blank.
- **`forwardEquivalentPredicate` mapped to a non-existent
`om:DOWNSTREAM` URI**: OpenMetadata only stores lineage with
`om:UPSTREAM` (forward) and `prov:wasDerivedFrom` (reverse PROV-O
pair); there is no `om:DOWNSTREAM` predicate written anywhere — the
downstream view is derived by reading the same UPSTREAM edge from the
other side. Map both `prov:wasDerivedFrom` and `prov:wasInfluencedBy`
to `om:UPSTREAM` (both are reverse-direction causation predicates: in
`B wasDerivedFrom A` / `B wasInfluencedBy A` the source is A and
effect is B, so the canonical forward predicate is the same).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Fix RDF tag mapper
* Fix all the comments
Cherry-picked from #27562 (without bin/ autogenerated noise).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Align RdfPropertyMapper tests with refactor and isolate ontology export IT
RdfPropertyMapperTest still referenced the removed addVotes helper and
expected addStructuredProperty to dispatch votes — both gone after votes
was added to IGNORED_PROPERTIES. Update the assertions accordingly.
GlossaryOntologyExportIT timed out on the full suite because it flips a
global RDF singleton in @BeforeAll and each test blocks a server thread on
synchronous Fuseki writes. SAME_THREAD only serialized methods within the
class — concurrent classes still raced for server threads. Adding @Isolated
matches the pattern already used by RdfResourceIT for the same reason.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(rdf): align addCertification typing + relationType after predicate flip
Two findings on PR #27999 from the post-cherry-pick review pass:
- **`addCertification` mis-typed glossary-source certifications and
skipped skos:Concept**: it always emitted `om:Tag` regardless of
source, even though `resolveTagResource` returns a glossaryTerm URI
when the certification points at a glossary term. It also didn't add
`skos:Concept` (or the `createTypeResource("tag")` `skos:Concept` for
classification tags), so SPARQL queries filtering certification
targets by `a skos:Concept` missed them while `addTagLabel`-emitted
tags were findable. Mirror `addTagLabel`: branch on source
(`Glossary` vs `Classification`), emit the right primary type plus
`skos:Concept` (glossary) or `om:Tag` (classification), and include
`om:tagSource`.
- **`relationType` left stale after predicate flip**: when
`parseEntityGraphEdgesFromResults` flipped subject/object for a
reverse-direction predicate and rewrote `canonicalPredicate` to
`om:UPSTREAM`, it kept the original `relationType` derived from the
reverse predicate. So `prov:wasInfluencedBy` produced an EdgeInfo
with `relationType=downstream` + `predicate=om:UPSTREAM` —
internally inconsistent, and the mismatched `edgeKey` prevented
dedup against an existing UPSTREAM edge with the same endpoints.
Re-derive `relationType` from the canonical predicate after the
flip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): close 2 review findings + add parser-helper unit tests
Two outstanding Copilot findings on PR #27999 plus targeted unit
coverage for the helpers that drive lineage canonicalization.
Findings:
- **`colLineageUri` collision risk** (RdfRepository): the deterministic
key replaced non-alphanumerics in `toColumn` with `_`, so distinct
column names (e.g. `a-b` vs `a_b`) collapsed onto the same URI, which
would lose / overwrite column-lineage resources during reindex.
Append the loop index as a tiebreaker so distinct columns keep
distinct URIs.
- **`createTypeResource` missing dprod prefix** (RdfPropertyMapper):
the `getNamespace` switch didn't recognize `dprod`, so
`RdfUtils.getRdfType("dataProduct")` (returns `dprod:DataProduct`)
produced an invalid `dprod:DataProduct` URI on the wire. Added the
`DPROD_NS = https://ekgf.github.io/dprod/` constant and a `dprod`
case in the switch.
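The `colLineageUri` collision from the first finding is easy to reproduce in isolation. In this sketch, `ColumnLineageKeys`, `sanitize`, and `keys` are hypothetical names, and the key layout (sanitized name, underscore, index) is an assumption about how a loop-index tiebreaker could be appended.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the collision and the index tiebreaker.
final class ColumnLineageKeys {
  // Replaces every non-alphanumeric character with '_', so "a-b" and "a_b"
  // both become "a_b".
  static String sanitize(String column) {
    return column.replaceAll("[^A-Za-z0-9]", "_");
  }

  // Appends the loop index so columns that sanitize identically keep
  // distinct keys.
  static List<String> keys(List<String> toColumns) {
    List<String> keys = new ArrayList<>();
    for (int i = 0; i < toColumns.size(); i++) {
      keys.add(sanitize(toColumns.get(i)) + "_" + i);
    }
    return keys;
  }
}
```

One property of an index tiebreaker worth keeping in mind is that the resulting key depends on the order of the columns in the list.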
Coverage:
- New `RdfParserHelpersTest` exercises the canonicalization helpers
via reflection: `isReverseDirectionPredicate` (recognizes
PROV-O causation predicates, ignores forward predicates),
`forwardEquivalentPredicate` (both `wasDerivedFrom` and
`wasInfluencedBy` collapse to `om:UPSTREAM` so dedup works),
`relativeRelationLabel` (focal-relative Upstream/Downstream
flipping with all the boundary cases — non-focal edges,
non-lineage relations, null focal).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): merge array contexts before per-field resolution
The third (low-confidence "suppressed") finding on review 4256830399
turned out to be a real duplication: when a field is mapped in one
context map of an array context but absent from another, the previous
processArrayContext ran processContextMappings once per map. The pass
where the field IS mapped emits the proper `om:hasOwner <ref>` triples
(plus `prov:wasAttributedTo`); the pass where the field is absent
falls through to processUnmappedField and emits an additional
`om:owners <ref>` triple. Net: two predicates for the same logical
relationship.
Verified on the live Fuseki: 113 `om:hasOwner` triples vs 112
`om:owners` triples — one set per pass.
Fix: flatten all context maps in the array into a single merged map
once, then iterate entity fields exactly once against that combined
view (later contexts win on key conflicts, matching JSON-LD context
merge semantics). Each field is resolved against the union of
mappings, so the unmapped fallback only fires for fields truly absent
from every context. Net effect: `prov:wasAttributedTo` count is
unchanged, `om:hasOwner` is unchanged, and the redundant `om:owners`
triples disappear.
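The flattening step of the fix can be sketched with plain maps. `ContextMerge` and `merge` are hypothetical names, and string-to-string maps stand in for the real JSON-LD context structures.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of merging an array context into one view.
final class ContextMerge {
  // Later maps overwrite earlier ones on key conflicts.
  static Map<String, String> merge(List<Map<String, String>> contexts) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> context : contexts) {
      merged.putAll(context);
    }
    return merged;
  }
}
```

Resolving each entity field once against the merged map means a field counts as unmapped only if no context in the array mentions it.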
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(rdf): close 2 review findings on coordinator finalization race
Two findings from PR #27999 review 4259628860:
- **`checkAndUpdateJobCompletion` early-returned before lastError check
could promote**: `refreshAggregatedJob` already marks the job COMPLETED
when partitions all finish without `failedRecords`/`failedPartitions`,
so `checkAndUpdateJobCompletion`'s subsequent `if (job.isTerminal())`
short-circuit silently dropped the lastError signal. Move the
partition-lastError check INTO `refreshAggregatedJob` so both code
paths produce consistent terminal status — a partition that finished
COMPLETED but carries a non-null lastError now correctly promotes the
job to COMPLETED_WITH_ERRORS regardless of which finalizer wins the
race.
- **`completePartition` / `failPartition` overwrote CANCELLED state**:
the unconditional partition row update lost a concurrent Stop's
CANCELLED status if a worker finished its batch after the Stop
request landed but before noticing it. Add a status-guarded
`updateIfProcessing` DAO method (UPDATE ... WHERE id = :id AND
status = 'PROCESSING') and have both completion paths use it; if 0
rows update, log and skip the side effects (no server-stat increment,
no refreshAggregatedJob call) so the authoritative CANCELLED status
stays. Mirrors the pattern SearchIndex's coordinator uses for the
same race.
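The status-guarded update from the second finding behaves like a compare-and-set. The sketch below is an in-memory analogue, with a `ConcurrentHashMap` standing in for the `rdf_index_partition` table; `PartitionStatusStore` and its methods are hypothetical names, not the real DAO.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory analogue of the guarded partition update.
final class PartitionStatusStore {
  private final Map<String, String> statusById = new ConcurrentHashMap<>();

  void put(String id, String status) {
    statusById.put(id, status);
  }

  String get(String id) {
    return statusById.get(id);
  }

  // Analogue of: UPDATE ... SET status = :newStatus
  //              WHERE id = :id AND status = 'PROCESSING'
  // Returns true only if the partition was still PROCESSING.
  boolean updateIfProcessing(String id, String newStatus) {
    return statusById.replace(id, "PROCESSING", newStatus);
  }
}
```

A caller that gets `false` back knows another actor, such as a Stop request, already changed the status and can skip its side effects.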
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
| Name | | |
|---|---|---|
| .agents/skills | ||
| .claude | ||
| .devcontainer | ||
| .github | ||
| bin | ||
| bootstrap | ||
| common | ||
| conf | ||
| docker | ||
| docs | ||
| examples/python-sdk/data-quality | ||
| ingestion | ||
| openmetadata-airflow-apis | ||
| openmetadata-clients | ||
| openmetadata-dist | ||
| openmetadata-integration-tests | ||
| openmetadata-k8s-operator | ||
| openmetadata-mcp | ||
| openmetadata-sdk | ||
| openmetadata-service | ||
| openmetadata-shaded-deps | ||
| openmetadata-spec | ||
| openmetadata-ui | ||
| openmetadata-ui-core-components | ||
| openspec | ||
| scripts | ||
| skills | ||
| .dockerignore | ||
| .git-blame-ignore-revs | ||
| .gitignore | ||
| .nojekyll | ||
| .pre-commit-config.yaml | ||
| .snyk | ||
| adr-incident-manager-governance-workflows.md | ||
| AGENTS.md | ||
| APPLICATION.md | ||
| CLAUDE.md | ||
| CODE_OF_CONDUCT.md | ||
| CONTRIBUTING.md | ||
| DEVELOPER.md | ||
| generate_ts.sh | ||
| INCIDENT_RESPONSE.md | ||
| LICENSE | ||
| Makefile | ||
| NOTICE | ||
| package.json | ||
| pom.xml | ||
| README.md | ||
| SECURITY.md | ||
| THREAT_MODEL.md | ||
| yarn.lock | ||
Empower your Data Journey with OpenMetadata
What is OpenMetadata?
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column-level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.
OpenMetadata Consists of Four Main Components:
- Metadata Schemas: These are the core definitions and vocabulary for metadata based on common abstractions and types. They also allow for custom extensions and properties to suit different use cases and domains.
- Metadata Store: This is the central repository for storing and managing the metadata graph, which connects data assets, users, and tool-generated metadata in a unified way.
- Metadata APIs: These are the interfaces for producing and consuming metadata, built on top of the metadata schemas. They enable seamless integration of user interfaces and tools, systems, and services with the metadata store.
- Ingestion Framework: This is a pluggable framework for ingesting metadata from various sources and tools into the metadata store. It supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.
Key Features of OpenMetadata
Data Discovery: Find and explore all your data assets in a single place using various strategies, such as keyword search, data associations, and advanced queries. You can search across tables, topics, dashboards, pipelines, and services.
Data Collaboration: Communicate, converse, and cooperate with other users and teams on data assets. You can get event notifications, send alerts, add announcements, create tasks, and use conversation threads.
Data Quality and Profiler: Measure and monitor data quality with no-code tests to build trust in your data. You can define and run data quality tests, group them into test suites, and view the results in an interactive dashboard. With powerful collaboration features, you can make data quality a shared responsibility in your organization.
Data Governance: Enforce data policies and standards across your organization. You can define data domains and data products, assign owners and stakeholders, and classify data assets using tags and terms. Use powerful automation features to auto-classify your data.
Data Insights and KPIs: Use reports and platform analytics to understand how your organization's data is doing. Data Insights provides a single-pane view of the key metrics that best reflect the state of your data. Define Key Performance Indicators (KPIs) and set goals within OpenMetadata to work towards better documentation, ownership, and tiering. You can set alerts against the KPIs and receive them on a specified schedule.
Data Lineage: Track and visualize the origin and transformation of your data assets end-to-end. You can view column-level lineage, filter queries, and edit lineage manually using a no-code editor.
Data Documentation: Document your data assets and metadata entities using rich text, images, and links. You can also add comments and annotations and generate data dictionaries and data catalogs.
Data Observability: Monitor the health and performance of your data assets and pipelines. You can view metrics such as data freshness, data volume, data quality, and data latency. You can also set up alerts and notifications for any anomalies or failures.
Data Security: Secure your data and metadata using various authentication and authorization mechanisms. You can integrate with different identity providers for single sign-on and define roles and policies for access control.
Webhooks: Integrate with external applications and services using webhooks. You can register URLs to receive metadata event notifications and integrate with Slack, Microsoft Teams, and Google Chat.
Connectors: Ingest metadata from various sources and tools using connectors. OpenMetadata supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.
Try our Sandbox
Take a look and play with sample data at http://sandbox.open-metadata.org
Install and Run OpenMetadata
Get up and running in a few minutes. See the OpenMetadata documentation for installation instructions.
Documentation and Support
We're here to help and make OpenMetadata even better! Check out OpenMetadata documentation for a complete description of OpenMetadata's features. Join our Slack Community to get in touch with us if you want to chat, need help, or discuss new feature requirements.
Contributors
We ❤️ all contributions, big and small! Check out our CONTRIBUTING guide to get started, and let us know how we can help.
Don't want to miss anything? Give the project a ⭐ 🚀
A HUGE THANK YOU to all our supporters!
License
OpenMetadata is released under the Apache License, Version 2.0.