mirror of
https://github.com/open-metadata/OpenMetadata
synced 2026-05-24 09:39:11 +00:00
Merge upstream/main into fix/greenplum-ddl-metadata-reflect-27405
This commit is contained in:
commit
2ebce67d2d
4610 changed files with 523231 additions and 120110 deletions
58
.agents/skills/java-checkstyle/SKILL.md
Normal file
@@ -0,0 +1,58 @@
---
name: java-checkstyle
description: Run `mvn spotless:apply` to fix Java checkstyle / formatting failures and verify the result. Run after authoring or modifying any `.java` files, or when CI reports a "Java checkstyle failed" / "Fix Java checkstyle" issue on a PR.
---

# Java Checkstyle / Spotless (Codex agent)

OpenMetadata enforces Java formatting via the Spotless Maven plugin. Every CI
build runs `mvn spotless:check` and fails the PR if any file is not formatted.

## When to activate

- The user asks to "fix checkstyle", "fix Java formatting", "apply spotless",
  "run spotless", "format Java", or similar.
- CI posts a `Java checkstyle failed` / `Fix Java checkstyle` comment on a PR
  (the bot's exact phrasing is "Please run `mvn spotless:apply` in the root of
  your repository and commit the changes to this PR").
- After you have finished authoring or editing any `.java` files — before
  opening a PR or pushing a commit that touches Java.

## Procedure

1. From the repo root run Spotless:

   ```bash
   mvn spotless:apply # formats everything
   # or
   mvn -pl <module> spotless:apply # scope to a single Maven module for speed
   # or
   mvn spotless:check # verify only, without rewriting files
   ```

   Spotless is fast (seconds, no compilation). If it fails with a plugin error
   rather than a formatting diff, surface the error and stop — do not try to
   hand-edit formatting around the failure.

2. Inspect the diff:

   ```bash
   git status --short
   git diff --stat
   ```

   Expect changes only in `.java` (and possibly `pom.xml`) files. If Spotless
   keeps rewriting a change you just made, re-read the root `pom.xml`'s
   `spotless-maven-plugin` config — Spotless is the source of truth, not the
   IDE.

3. Only commit if the user asked to. Report the changed-file list first so the
   user can decide whether to fold the reformat into the in-progress commit or
   make a separate "Fix Java checkstyle" commit (matches the repo's existing
   history for bot-triggered formatting-only commits).

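The expectation in step 2 of the procedure, that Spotless only touches `.java` and `pom.xml` files, can be checked mechanically. The following is a minimal sketch, not something the repo ships: the function name `unexpected_spotless_changes` is hypothetical, and it assumes the input is a plain list of paths such as the output of `git diff --name-only`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the OpenMetadata repo.
# Reads changed paths on stdin (one per line) and prints every path that is
# neither a .java file nor a pom.xml, i.e. a change Spotless was not expected
# to make.
unexpected_spotless_changes() {
  local path
  while IFS= read -r path; do
    case "$path" in
      '') ;;                     # ignore empty lines
      *.java) ;;                 # expected: formatted Java source
      pom.xml|*/pom.xml) ;;      # expected: Spotless may also format POMs
      *) printf '%s\n' "$path" ;;
    esac
  done
}

# Example, after running `mvn spotless:apply`:
# git diff --name-only | unexpected_spotless_changes
```

An empty result means the diff matches the expectation; any printed path is worth surfacing to the user before a commit is made.
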
## Out of scope

- UI / TypeScript formatting — use `yarn pretty` / ESLint flow (see AGENTS.md
  UI section).
- Python formatting — use `make py_format` (black + isort + pycln).
86
.agents/skills/ui-checkstyle/SKILL.md
Normal file
@@ -0,0 +1,86 @@
---
name: ui-checkstyle
description: Run the ESLint + Prettier + organize-imports sequence that CI's `UI Checkstyle` jobs (`lint-src`, `lint-playwright`, `lint-core-components`) run — on just the files the PR changed — and fail if any file ends up with a diff. Run after authoring or modifying any `.ts`/`.tsx`/`.js`/`.jsx`/`.json` under `openmetadata-ui/src/main/resources/ui/src/`, `.../playwright/`, or `openmetadata-ui-core-components/src/main/resources/ui/src/`, or when CI reports a `UI Checkstyle` failure on a PR.
---

# UI Checkstyle / ESLint + Prettier + organize-imports (Codex agent)

The `UI Checkstyle` workflow (`.github/workflows/ui-checkstyle.yml`) has three
per-area jobs — `lint-src`, `lint-playwright`, `lint-core-components`. Each
reformats the files changed in the PR and fails if the reformat produces a
diff, so the committed tree must already be formatted.

## When to activate

- The user asks to "fix UI checkstyle", "fix UI lint", "run prettier", "run
  eslint", "fix UI format", or similar.
- CI posts a `UI Checkstyle / lint-src|lint-playwright|lint-core-components`
  failure (the bot surfaces the modified files in the job summary).
- After you have finished authoring or editing any `.ts`/`.tsx`/`.js`/
  `.jsx`/`.json` under the three UI trees — before opening a PR or pushing
  a commit that touches UI.

## Procedure

1. Build the file list for each affected area:

   ```bash
   # repo root
   # The brace list sits outside the quotes so bash expands it; git pathspecs
   # have no brace syntax of their own.
   git diff --name-only origin/main...HEAD -- \
     'openmetadata-ui/src/main/resources/ui/src/**/*.'{ts,tsx,js,jsx,json} \
     | sed 's|openmetadata-ui/src/main/resources/ui/||' > /tmp/src_files.txt

   git diff --name-only origin/main...HEAD -- \
     'openmetadata-ui/src/main/resources/ui/playwright/**/*.'{ts,tsx,js,jsx} \
     | sed 's|openmetadata-ui/src/main/resources/ui/||' > /tmp/pw_files.txt

   git diff --name-only origin/main...HEAD -- \
     'openmetadata-ui-core-components/**/*.'{ts,tsx,js,jsx,json} \
     | sed 's|openmetadata-ui-core-components/src/main/resources/ui/||' \
     > /tmp/core_files.txt
   ```

   Skip any empty list — CI won't run that area's job either.

2. From the matching working directory (`openmetadata-ui/src/main/resources/ui`
   or `openmetadata-ui-core-components/src/main/resources/ui`), run the
   three-step sequence that CI runs:

   ```bash
   # 1) imports first
   cat /tmp/src_files.txt | xargs ./node_modules/.bin/organize-imports-cli

   # 2) ESLint --fix (NODE_OPTIONS prefixes xargs so that eslint inherits it;
   #    placed before `cat` it would only apply to `cat`)
   cat /tmp/src_files.txt \
     | NODE_OPTIONS='--max-old-space-size=8192' xargs ./node_modules/.bin/eslint --no-error-on-unmatched-pattern --fix

   # 3) prettier --write — MUST be last, because organize-imports-cli uses
   #    4-space indentation and drops trailing commas; prettier restores them
   #    to the repo's 2-space + trailing-comma style. Reversing the order
   #    leaves CI with a dirty diff.
   cat /tmp/src_files.txt \
     | xargs ./node_modules/.bin/prettier \
       --config './.prettierrc.yaml' --ignore-path './.prettierignore' \
       --write
   ```

   Core-components has no `organize-imports-cli` wired up — skip step 1 there.

3. Check the diff from the repo root:

   ```bash
   git status --short
   git diff --stat
   ```

   If `git status --short` is empty you're done. Otherwise commit the
   reformatting diff as its own `Fix UI checkstyle` commit, matching the
   existing history for bot-triggered formatting-only commits — unless the
   user asked you to fold it into the in-progress commit.

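The "skip any empty list" rule in step 1 matters because `xargs` behaves differently across platforms on empty input (GNU `xargs` runs the command once unless `-r` is given). A minimal sketch of a guard follows; the helper name `run_on_list` is hypothetical and not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the OpenMetadata repo.
# Runs "$@" with the paths listed in file $1 appended as arguments, but only
# when that file exists and is non-empty. An empty list is skipped, the same
# way CI skips an area with no changed files.
run_on_list() {
  local list="$1"
  shift
  if [ ! -s "$list" ]; then
    echo "skip: $list is empty" >&2
    return 0
  fi
  # xargs appends the listed paths to the command's argument list.
  xargs "$@" < "$list"
}

# Example (from openmetadata-ui/src/main/resources/ui):
# run_on_list /tmp/src_files.txt ./node_modules/.bin/prettier --write
```

Checking the file with `[ -s ]` keeps the behavior the same on GNU and BSD systems, at the cost of one extra line per call site.
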
## Out of scope

- TypeScript type-check errors (`tsc`) — different jobs, different failure
  modes, not auto-fixable by this skill.
- Java formatting — use the `java-checkstyle` skill (`mvn spotless:apply`).
- Python formatting — use `make py_format` (black + isort + pycln).
83
.claude/skills/java-checkstyle/SKILL.md
Normal file
@@ -0,0 +1,83 @@
---
name: java-checkstyle
description: Run `mvn spotless:apply` to fix Java checkstyle / formatting failures and verify the result. Invoke after authoring or modifying any `.java` files, or when CI reports a "Java checkstyle failed" or "Fix Java checkstyle" issue on a PR.
user-invocable: true
argument-hint: "[-pl <module>] [--check]"
allowed-tools:
  - Bash
  - Read
  - Grep
  - Glob
---

# Java Checkstyle / Spotless

OpenMetadata enforces Java formatting via the Spotless Maven plugin. Every CI
build runs `mvn spotless:check` and fails the PR if any file is not formatted.
This skill keeps the fix on a single, consistent command so reviewers never have
to ask for it manually again.

## When to activate

- The user asks to "fix checkstyle", "fix Java formatting", "apply spotless",
  "run spotless", "format Java", or similar.
- CI posts a `Java checkstyle failed` / `Fix Java checkstyle` comment on a PR
  (the project's bot phrases the instruction as "Please run
  `mvn spotless:apply` in the root of your repository and commit the changes").
- After the assistant has finished authoring or editing any `.java` files —
  before opening a PR or pushing a commit that touches Java.

## Arguments

- No arguments: run `mvn spotless:apply` at the repo root across all modules.
- `-pl <module>`: scope to a single Maven module (e.g.
  `-pl openmetadata-service`). Useful when only one module changed and you want
  a faster run.
- `--check`: run `mvn spotless:check` instead of `apply`. Use to confirm the
  tree is clean without touching files (e.g. to verify before push).

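To make the argument handling concrete, the mapping from the skill's arguments to a Maven command line could look roughly like this. It is a sketch under assumptions: the function name `spotless_cmd` is hypothetical, and it only prints the command it would run rather than invoking Maven.

```bash
#!/usr/bin/env bash
# Sketch only: maps the skill's arguments onto the Maven command line.
# The function name is hypothetical; the flags are the ones listed in the
# Arguments section (-pl <module>, --check).
spotless_cmd() {
  local goal="spotless:apply" module=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --check)
        goal="spotless:check"      # verify only, do not rewrite files
        ;;
      -pl)
        if [ "$#" -lt 2 ]; then
          echo "-pl needs a module name" >&2
          return 2
        fi
        module="$2"
        shift                      # consume the module name
        ;;
      *)
        echo "unknown argument: $1" >&2
        return 2
        ;;
    esac
    shift
  done
  if [ -n "$module" ]; then
    echo "mvn -pl $module $goal"
  else
    echo "mvn $goal"
  fi
}

# Example: spotless_cmd -pl openmetadata-service --check
```

Printing the command instead of executing it keeps the sketch side-effect free; a real wrapper would run the resulting command from the repo root.
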
## Process

### Step 1: Run Spotless

From the repo root:

```bash
mvn spotless:apply # default — formats everything
# or
mvn -pl <module> spotless:apply # scoped to one module
# or
mvn spotless:check # verify only, don't write
```

Spotless is fast (seconds, no compilation). If it fails with a plugin error
(not a formatting diff), surface the error and stop — do not try to hand-edit
formatting around the failure.

### Step 2: Check what changed

```bash
git status --short
git diff --stat
```

Expect reformatting in `.java` files only. If Spotless touches `pom.xml` or
other non-Java files, that's also fine — Spotless is configured for those too
in this repo.

### Step 3: Stage and commit (only if the user asked to commit)

Do NOT auto-commit. Report the changed file list to the user and let them
decide whether to fold the formatting into the in-progress commit or make a
separate "Fix Java checkstyle" commit. Follow the repo convention: the
existing branch history already uses `Fix Java checkstyle` as the commit title
for bot-triggered formatting-only commits.

## Notes

- Spotless config lives in the root `pom.xml` (`spotless-maven-plugin`
  section). Do not redefine formatting rules inline in source files.
- If Spotless keeps rewriting a change the user just made, re-read the config
  — Spotless is the source of truth, not the IDE.
- The analogous UI command is `yarn pretty` (see the `test-locally` skill /
  CLAUDE.md for the UI lint flow); this skill is Java-only.
132
.claude/skills/ui-checkstyle/SKILL.md
Normal file
@@ -0,0 +1,132 @@
---
name: ui-checkstyle
description: Run the exact ESLint + Prettier + organize-imports sequence that CI's `UI Checkstyle` jobs (`lint-src`, `lint-playwright`, `lint-core-components`) run — on just the files the PR changed — and fail the task if any file ends up with a diff. Invoke after authoring or modifying any `.ts`, `.tsx`, `.js`, `.jsx`, or `.json` file under `openmetadata-ui/src/main/resources/ui/src/`, `.../playwright/`, or `openmetadata-ui-core-components/src/main/resources/ui/src/`, or when CI reports a "UI Checkstyle" job failure on the PR.
user-invocable: true
argument-hint: "[--src] [--playwright] [--core-components] [--all] [--check]"
allowed-tools:
  - Bash
  - Read
  - Grep
  - Glob
---

# UI Checkstyle / ESLint + Prettier + organize-imports

The `UI Checkstyle` GitHub workflow
(`.github/workflows/ui-checkstyle.yml`) runs three per-area jobs:
`lint-src` (`openmetadata-ui/src/main/resources/ui/src/...`),
`lint-playwright` (`.../playwright/...`),
`lint-core-components`
(`openmetadata-ui-core-components/src/main/resources/ui/src/...`). Each job
reformats only the files changed in the PR and fails if the reformat produces
any diff — i.e. the committed tree must already be formatted.

This skill runs the same sequence locally so the CI never has to ask.

## When to activate

- The user asks to "fix UI checkstyle", "fix UI lint", "run prettier", "run
  eslint", "fix the UI format", "apply UI format", or similar.
- CI posts a `UI Checkstyle / lint-src|lint-playwright|lint-core-components`
  failure (the bot lists the modified files in the job summary).
- After the assistant has finished authoring or editing any `.ts`/`.tsx`/
  `.js`/`.jsx`/`.json` under the three UI trees — before opening a PR or
  pushing a commit that touches UI.

## Arguments

- `--src` (default for files under `openmetadata-ui/.../ui/src/`)
- `--playwright` (files under `.../ui/playwright/`)
- `--core-components` (files under `openmetadata-ui-core-components/...`)
- `--all` — run all three areas
- `--check` — verify only: run the sequence in a dry-run pass and report
  which files are still dirty, without writing. Useful before push.

If invoked with no flag, auto-detect the affected areas from
`git diff --name-only origin/main...HEAD` and run only those.

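The auto-detection described above amounts to classifying each changed path by its directory prefix. A minimal sketch, assuming the three prefixes named in this skill; the function name `detect_ui_areas` and its output labels are hypothetical, and it deliberately ignores file extensions for brevity.

```bash
#!/usr/bin/env bash
# Sketch only: the function name and output labels are hypothetical.
# Reads repo-relative paths on stdin and prints each affected UI area once,
# in the fixed order src, playwright, core-components.
detect_ui_areas() {
  local path src=0 pw=0 core=0
  while IFS= read -r path; do
    case "$path" in
      openmetadata-ui/src/main/resources/ui/src/*) src=1 ;;
      openmetadata-ui/src/main/resources/ui/playwright/*) pw=1 ;;
      openmetadata-ui-core-components/*) core=1 ;;
    esac
  done
  if [ "$src" -eq 1 ]; then echo src; fi
  if [ "$pw" -eq 1 ]; then echo playwright; fi
  if [ "$core" -eq 1 ]; then echo core-components; fi
}

# Example:
# git diff --name-only origin/main...HEAD | detect_ui_areas
```

An empty result means no UI area is affected, in which case there is nothing for this skill to run.
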
## Process

### Step 1: Compute the file list

For each area you are running against:

```bash
# from the repo root
# The brace list sits outside the quotes so bash expands it; git pathspecs
# have no brace syntax of their own.
git diff --name-only origin/main...HEAD -- \
  'openmetadata-ui/src/main/resources/ui/src/**/*.'{ts,tsx,js,jsx,json} \
  | sed 's|openmetadata-ui/src/main/resources/ui/||' > /tmp/src_files.txt

git diff --name-only origin/main...HEAD -- \
  'openmetadata-ui/src/main/resources/ui/playwright/**/*.'{ts,tsx,js,jsx} \
  | sed 's|openmetadata-ui/src/main/resources/ui/||' > /tmp/pw_files.txt

git diff --name-only origin/main...HEAD -- \
  'openmetadata-ui-core-components/**/*.'{ts,tsx,js,jsx,json} \
  | sed 's|openmetadata-ui-core-components/src/main/resources/ui/||' \
  > /tmp/core_files.txt
```

Skip any list that is empty — that area has no changes so the CI job for it
wouldn't run anyway.

### Step 2: Run the CI sequence

From the corresponding working directory:

```bash
cd openmetadata-ui/src/main/resources/ui # or .../openmetadata-ui-core-components/src/main/resources/ui

# 1) imports first — organize-imports-cli only exists for the ui module
cat /tmp/src_files.txt | xargs ./node_modules/.bin/organize-imports-cli

# 2) eslint --fix (same flags CI uses). NODE_OPTIONS prefixes xargs so that
#    eslint inherits it; placed before `cat` it would only apply to `cat`.
cat /tmp/src_files.txt \
  | NODE_OPTIONS='--max-old-space-size=8192' xargs ./node_modules/.bin/eslint --no-error-on-unmatched-pattern --fix

# 3) prettier --write — this MUST run after organize-imports because
#    organize-imports uses 4-space indentation / drops trailing commas,
#    and prettier then puts them back to the repo's 2-space + trailing-comma
#    style. Running them in the other order leaves a dirty diff.
cat /tmp/src_files.txt \
  | xargs ./node_modules/.bin/prettier \
    --config './.prettierrc.yaml' --ignore-path './.prettierignore' \
    --write
```

For playwright, use the same three commands on `/tmp/pw_files.txt`.
For core-components, the organize-imports step is skipped (no CLI there) —
just eslint + prettier.

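The per-area differences in Step 2 can be summarized as a small lookup: which tools run, and in what order. The sketch below only prints tool names; the function name `steps_for_area` is hypothetical, and the area labels are assumed to match the skill's `--src`, `--playwright`, and `--core-components` flags.

```bash
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical. Prints, in execution order,
# the tools Step 2 runs for a given area. core-components has no
# organize-imports step; prettier always comes last.
steps_for_area() {
  case "$1" in
    src|playwright)
      printf '%s\n' organize-imports-cli eslint prettier
      ;;
    core-components)
      printf '%s\n' eslint prettier
      ;;
    *)
      echo "unknown area: $1" >&2
      return 2
      ;;
  esac
}

# Example: steps_for_area core-components
```

Keeping the order in one place makes it harder to run prettier before organize-imports by accident, which is the ordering mistake the comments in Step 2 warn about.
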
### Step 3: Report what changed

```bash
cd <repo root>
git status --short # should list only .ts/.tsx/.js/.jsx/.json files
git diff --stat
```

If `git status --short` is empty, the tree is already clean — tell the user
and stop.

### Step 4: Commit (only if the user asked to)

Do NOT auto-commit. Surface the list of modified files to the user; they
decide whether to fold the reformat into the in-progress commit or create a
dedicated "Fix UI checkstyle" commit (matches the repo's existing history for
bot-triggered formatting-only commits).

## Notes

- The `--check` mode mirrors CI's behavior: run the three commands and then
  verify `git status --short` is empty. Revert any writes before exiting so
  the user's working tree isn't touched.
- If ESLint reports hard errors (not warnings, not auto-fixable), stop and
  surface them — they need a real code change, not a format pass. Warnings
  (e.g. `playwright/no-wait-for-selector`) don't fail CI and can be left.
- The analogous Java command is `mvn spotless:apply` — see the
  `java-checkstyle` skill.
- TypeScript type-check errors (`tsc`) are a separate concern and are
  *not* fixed by this skill — the `tsc-src` / `tsc-playwright` jobs are
  currently either skipped or have their own failures surfaced via the CI
  report.
29
.github/ISSUE_TEMPLATE/bug_report.md
vendored
@@ -1,29 +0,0 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
---

**Affected module**
Does it impact the UI, backend or Ingestion Framework?

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**

Screenshots or steps to reproduce

**Expected behavior**
A clear and concise description of what you expected to happen.

**Version:**
- OS: [e.g. iOS]
- Python version:
- OpenMetadata version: [e.g. 0.8]
- OpenMetadata Ingestion package version: [e.g. `openmetadata-ingestion[docker]==XYZ`]

**Additional context**
Add any other context about the problem here.
87
.github/ISSUE_TEMPLATE/bug_report.yml
vendored
Normal file
@@ -0,0 +1,87 @@
name: Bug report
description: Create a report to help us improve
labels: ["bug"]
body:
  - type: markdown
    attributes:
      value: |
        > **Bug in a specific connector?** (Snowflake, Databricks, BigQuery, etc.) — use the **[Connector Bug](https://github.com/open-metadata/OpenMetadata/issues/new?template=connector_bug.yml)** template instead for faster triage.

        Thanks for taking the time to file a bug! Before you go further:
        - Search [existing issues](https://github.com/open-metadata/OpenMetadata/issues) for duplicates.
        - Check the [docs](https://docs.open-metadata.org/) and [Slack](https://slack.open-metadata.org/) for known workarounds.
        - **Redact credentials, hostnames, emails, and other sensitive data** from logs and config before submitting.
  - type: dropdown
    id: affected_module
    attributes:
      label: Affected module
      description: Which area of OpenMetadata does this bug affect?
      options:
        - UI
        - Backend
        - Ingestion Framework
        - Connector
        - Data Quality / Profiler
        - Lineage
        - Search / Discovery
        - Authentication / Security
        - Governance (Glossary / Classification / Domains)
        - Documentation
        - Other
    validations:
      required: true
  - type: textarea
    id: describe
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
    validations:
      required: true
  - type: textarea
    id: reproduce
    attributes:
      label: To Reproduce
      description: Screenshots or steps to reproduce.
    validations:
      required: true
  - type: textarea
    id: expected
    attributes:
      label: Expected behavior
      description: A clear and concise description of what you expected to happen.
    validations:
      required: true
  - type: input
    id: os
    attributes:
      label: OS
      placeholder: "macOS 14.4 / Ubuntu 22.04 / Windows 11"
  - type: input
    id: python_version
    attributes:
      label: Python version
      placeholder: "3.11.7"
  - type: input
    id: om_version
    attributes:
      label: OpenMetadata version
      placeholder: "1.9.2"
  - type: input
    id: ingestion_version
    attributes:
      label: OpenMetadata Ingestion package version
      placeholder: "openmetadata-ingestion==1.9.2"
  - type: textarea
    id: additional_context
    attributes:
      label: Additional context
      description: Add any other context about the problem here. Redact sensitive data.
  - type: checkboxes
    id: checks
    attributes:
      label: Pre-submission checklist
      options:
        - label: I searched for duplicate issues.
          required: true
        - label: I removed credentials, hostnames, emails, and other sensitive data from logs and config.
          required: true
101
.github/ISSUE_TEMPLATE/connector_bug.yml
vendored
Normal file
@@ -0,0 +1,101 @@
name: Connector bug report
description: Bug in a specific data connector (Snowflake, Databricks, BigQuery, etc.)
labels: ["bug", "Ingestion"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for reporting a connector bug! Before you go further:
        - Search [existing issues](https://github.com/open-metadata/OpenMetadata/issues) for duplicates.
        - Check the [connector docs](https://docs.open-metadata.org/latest/connectors) and [Slack](https://slack.open-metadata.org/) for known workarounds.
        - **Redact credentials, hostnames, emails, and other sensitive data** from logs and config before submitting.
  - type: input
    id: connector
    attributes:
      label: Connector
      description: Name of the affected connector. See the [connector docs](https://docs.open-metadata.org/latest/connectors) for the full supported list.
      placeholder: "e.g. Snowflake, Databricks, BigQuery, Power BI"
    validations:
      required: true
  - type: dropdown
    id: feature_area
    attributes:
      label: Feature area
      description: Which part of the connector is broken?
      options:
        - Metadata ingestion
        - Lineage
        - Profiler / Data Quality
        - Usage
        - Test Connection
        - Authentication / Connection
        - Other
    validations:
      required: true
  - type: textarea
    id: describe
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
    validations:
      required: true
  - type: textarea
    id: reproduce
    attributes:
      label: To Reproduce
      description: Steps or screenshots to reproduce.
    validations:
      required: true
  - type: textarea
    id: expected
    attributes:
      label: Expected behavior
      description: A clear and concise description of what you expected to happen.
    validations:
      required: true
  - type: textarea
    id: connection_config
    attributes:
      label: Connection / ingestion config
      description: Paste the relevant YAML. **Redact credentials, hostnames, and other sensitive values.**
      render: yaml
  - type: textarea
    id: logs
    attributes:
      label: Logs
      description: Relevant log output. Redact sensitive data.
      render: shell
  - type: input
    id: os
    attributes:
      label: OS
      placeholder: "macOS 14.4 / Ubuntu 22.04 / Windows 11"
  - type: input
    id: python_version
    attributes:
      label: Python version
      placeholder: "3.11.7"
  - type: input
    id: om_version
    attributes:
      label: OpenMetadata version
      placeholder: "1.9.2"
  - type: input
    id: ingestion_version
    attributes:
      label: OpenMetadata Ingestion package version
      placeholder: "openmetadata-ingestion==1.9.2"
  - type: textarea
    id: additional_context
    attributes:
      label: Additional context
      description: Anything else that helps us understand the problem. Redact sensitive data.
  - type: checkboxes
    id: checks
    attributes:
      label: Pre-submission checklist
      options:
        - label: I searched for duplicate issues.
          required: true
        - label: I removed credentials, hostnames, emails, and other sensitive data from logs and config.
          required: true
16
.github/ISSUE_TEMPLATE/doc_update.md
vendored
@@ -1,16 +0,0 @@
---
name: Documentation Request
about: Let us know what our docs can improve
title: ''
labels: 'documentation'
assignees: ''
---

**Is some content missing, wrong or not clear?**
A clear and concise description of what the problem is and the source URL. Ex. Page [...] is not clear.

**Describe the solution you'd like**
Let us know what could help us improve the docs.

**Additional context**
Add any other context or screenshots about the feature request here.
40
.github/ISSUE_TEMPLATE/doc_update.yml
vendored
Normal file
@@ -0,0 +1,40 @@
name: Documentation Request
description: Let us know what our docs can improve
labels: ["documentation"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for helping us improve the docs! Before you file:
        - Search [existing issues](https://github.com/open-metadata/OpenMetadata/issues) for duplicates.
        - Check the latest [docs](https://docs.open-metadata.org/) — content may have been updated recently.
  - type: input
    id: doc_url
    attributes:
      label: Documentation URL
      description: Link to the page that needs updating. Leave blank if the docs for this topic don't exist yet.
      placeholder: "https://docs.open-metadata.org/... (or leave blank if missing)"
  - type: textarea
    id: problem
    attributes:
      label: Is some content missing, wrong or not clear?
      description: A clear and concise description of what the problem is. Ex. Page [...] is not clear.
    validations:
      required: true
  - type: textarea
    id: solution
    attributes:
      label: Describe the solution you'd like
      description: Let us know what could help us improve the docs.
  - type: textarea
    id: additional_context
    attributes:
      label: Additional context
      description: Add any other context or screenshots about the request here.
  - type: checkboxes
    id: checks
    attributes:
      label: Pre-submission checklist
      options:
        - label: I searched for duplicate doc issues.
          required: true
22
.github/ISSUE_TEMPLATE/epic.md
vendored
@@ -1,22 +0,0 @@
---
name: Epic Feature
about: Roadmap track of features
title: ''
labels: 'epic'
assignees: ''
---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.

**Related issues**
- ...
19
.github/ISSUE_TEMPLATE/feature_request.md
vendored
@@ -1,19 +0,0 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: 'enhancement'
assignees: ''
---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
41
.github/ISSUE_TEMPLATE/feature_request.yml
vendored
Normal file
@@ -0,0 +1,41 @@
name: Feature request
description: Suggest an idea for this project
labels: ["enhancement"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for suggesting an improvement! Before you file:
        - Search [existing issues](https://github.com/open-metadata/OpenMetadata/issues) for duplicates.
        - Check the [roadmap](https://docs.open-metadata.org/) and [Slack](https://slack.open-metadata.org/) to see if it's already planned or discussed.
  - type: textarea
    id: problem
    attributes:
      label: Is your feature request related to a problem? Please describe.
      description: A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
    validations:
      required: true
  - type: textarea
    id: solution
    attributes:
      label: Describe the solution you'd like
      description: A clear and concise description of what you want to happen.
    validations:
      required: true
  - type: textarea
    id: alternatives
    attributes:
      label: Describe alternatives you've considered
      description: A clear and concise description of any alternative solutions or features you've considered.
  - type: textarea
    id: additional_context
    attributes:
      label: Additional context
      description: Add any other context or screenshots about the feature request here.
  - type: checkboxes
    id: checks
    attributes:
      label: Pre-submission checklist
      options:
        - label: I searched for duplicate feature requests.
          required: true
13
.github/ISSUE_TEMPLATE/feature_task.md
vendored
@@ -1,13 +0,0 @@
---
name: Feature task
about: Create a Feature based on an issue
title: ''
labels: ''
assignees: ''
---

**Feature**
Add feature issue reference

**Describe the task**
A clear and concise description of what the bug is.
@@ -84,7 +84,7 @@ runs:
 source env/bin/activate
 uv pip install "setuptools<81"
 uv pip install --no-build-isolation "cx_Oracle>=8.3.0,<9"
-uv pip install --no-deps "sqlalchemy-redshift==0.8.14" "sqlalchemy-databricks==0.2.0" "sqlalchemy-ibmi==0.9.3" "pydoris-custom==1.1.0"
+uv pip install --no-deps "sqlalchemy-redshift==0.8.14" "sqlalchemy-ibmi==0.9.3" "pydoris-custom==1.1.0"
 uv pip install "${{ github.workspace }}/ingestion[all]"
 uv pip install "${{ github.workspace }}/ingestion[test]"
 uv pip install nox
4
.github/copilot-instructions.md
vendored
|
|
@ -448,8 +448,8 @@ yarn pre-commit # Run precommit checks (lint-staged): license headers, i18
|
|||
|
||||
### Python
|
||||
```bash
|
||||
make py_format # Format with black, isort, pycln
|
||||
make lint # Run pylint
|
||||
make py_format # Apply ruff lint-fix + format
|
||||
make py_format_check # Verify lint + format (matches CI; catches non-auto-fixable issues)
|
||||
make static-checks # Run type checking with basedpyright
|
||||
```
|
||||
|
||||
|
|
|
|||
67
.github/dependabot.yml
vendored
Normal file
|
|
@ -0,0 +1,67 @@
version: 2

# NOTE: This file controls Dependabot version-update PRs only.
# It does NOT suppress Dependabot security alerts on the Security tab.
# To auto-dismiss transitive (indirect) alerts, configure auto-triage rules at
# Settings -> Code security -> Dependabot -> "Manage rules".

updates:
  - package-ecosystem: "pip"
    directory: "/ingestion"
    schedule:
      interval: "weekly"
      day: "monday"
    open-pull-requests-limit: 5
    labels:
      - "dependencies"
      - "python"
    groups:
      python-minor-patch:
        update-types:
          - "minor"
          - "patch"
    ignore:
      # urllib3 is pinned <2.0 transitively via tableauserverclient==0.25.
      # See ingestion/setup.py comment on the tableau pin.
      - dependency-name: "urllib3"
        versions: [">=2.0.0"]

  - package-ecosystem: "maven"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
    open-pull-requests-limit: 5
    labels:
      - "dependencies"
      - "java"
    groups:
      maven-minor-patch:
        update-types:
          - "minor"
          - "patch"

  - package-ecosystem: "npm"
    directory: "/openmetadata-ui/src/main/resources/ui"
    schedule:
      interval: "weekly"
      day: "monday"
    open-pull-requests-limit: 5
    labels:
      - "dependencies"
      - "javascript"
    groups:
      npm-minor-patch:
        update-types:
          - "minor"
          - "patch"

  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
      day: "monday"
    open-pull-requests-limit: 3
    labels:
      - "dependencies"
      - "github-actions"
89
.github/pull_request_template.md
vendored
|
|
@ -5,19 +5,21 @@ Unless your change is trivial, please create an issue to discuss the change befo
|
|||
|
||||
### Describe your changes:
|
||||
|
||||
Fixes <issue-number>
|
||||
Fixes #<issue-number>
|
||||
<!--
|
||||
Linking an issue is REQUIRED. Replace <issue-number> with the GitHub issue number this PR addresses
|
||||
(e.g., `Fixes #12345`). GitHub will auto-link it. If no issue exists, please open one first so the
|
||||
problem and design can be discussed before review.
|
||||
-->
|
||||
|
||||
<!--
|
||||
Short blurb explaining:
|
||||
- What changes did you make?
|
||||
- Why did you make them?
|
||||
- How did you test your changes?
|
||||
-->
|
||||
|
||||
I worked on ... because ...
|
||||
|
||||
<!-- For frontend related change, please add screenshots and/or videos of your changes preview! -->
|
||||
|
||||
#
|
||||
### Type of change:
|
||||
<!-- You should choose 1 option and delete options that aren't relevant -->
|
||||
|
|
@ -27,13 +29,90 @@ I worked on ... because ...
|
|||
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
|
||||
- [ ] Documentation
|
||||
|
||||
#
|
||||
### High-level design:
|
||||
<!--
|
||||
REQUIRED for large PRs (new features, refactors, breaking changes, or anything touching >5 files).
|
||||
Skip for small bug fixes and trivial changes.
|
||||
|
||||
Cover:
|
||||
- Architecture / approach you took and why
|
||||
- Key components or files added/changed and how they interact
|
||||
- Alternatives considered and why you rejected them
|
||||
- Any migration, backward-compatibility, or rollout concerns
|
||||
- Diagrams or links to design docs / RFCs if available
|
||||
-->
|
||||
|
||||
N/A — small change. <!-- Or fill in the design above -->
|
||||
|
||||
#
|
||||
### Tests:
|
||||
|
||||
#### Use cases covered
|
||||
<!--
|
||||
List the user-visible scenarios this PR exercises. Example:
|
||||
- User with Admin role can create a Glossary Term with a parent term
|
||||
- Ingestion run for Snowflake correctly extracts row counts for partitioned tables
|
||||
-->
|
||||
|
||||
#### Unit tests
|
||||
<!--
|
||||
- [ ] I added unit tests for the new/changed logic.
|
||||
- Files added/updated:
|
||||
- Coverage on changed classes (run `mvn jacoco:report` for backend, `yarn test:coverage` for UI,
|
||||
`make unit_ingestion` for ingestion). Target is 90% line coverage on changed classes.
|
||||
- Coverage %: <e.g., 92% on EntityRepository.java>
|
||||
-->
|
||||
|
||||
#### Backend integration tests
|
||||
<!--
|
||||
- [ ] I added integration tests in `openmetadata-integration-tests/` for new/changed API endpoints.
|
||||
- [ ] Not applicable (no backend API changes).
|
||||
- Files added/updated:
|
||||
-->
|
||||
|
||||
#### Ingestion integration tests
|
||||
<!--
|
||||
- [ ] I added/updated ingestion integration tests for connector changes.
|
||||
- [ ] Not applicable (no ingestion changes).
|
||||
- Files added/updated:
|
||||
-->
|
||||
|
||||
#### Playwright (UI) tests
|
||||
<!--
|
||||
- [ ] I added Playwright E2E tests under `openmetadata-ui/.../ui/playwright/` for UI changes.
|
||||
- [ ] Not applicable (no UI changes).
|
||||
- Files added/updated:
|
||||
-->
|
||||
|
||||
#### Manual testing performed
|
||||
<!--
|
||||
List the manual test steps you performed before requesting review. Example:
|
||||
1. Started local stack via `./docker/run_local_docker.sh -m ui -d mysql`
|
||||
2. Logged in as admin, created entity X, verified Y appears in the UI
|
||||
3. Triggered ingestion for Snowflake source, confirmed lineage edges in the explore page
|
||||
-->
|
||||
|
||||
#
|
||||
### UI screen recording / screenshots:
|
||||
<!--
|
||||
REQUIRED for any PR that changes the UI. Drag-and-drop a short screen recording (.mov / .mp4 / .gif)
|
||||
demonstrating the change end-to-end, plus before/after screenshots where relevant.
|
||||
Mark "Not applicable" if there are no UI changes.
|
||||
-->
|
||||
|
||||
Not applicable. <!-- Or attach recording/screenshots above -->
|
||||
|
||||
#
|
||||
### Checklist:
|
||||
<!-- add an x in [] if done, don't mark items that you didn't do !-->
|
||||
- [x] I have read the [**CONTRIBUTING**](https://docs.open-metadata.org/developers/contribute) document.
|
||||
- [ ] My PR title is `Fixes <issue-number>: <short explanation>`
|
||||
- [ ] I have commented on my code, particularly in hard-to-understand areas.
|
||||
- [ ] My PR is linked to a GitHub issue via `Fixes #<issue-number>` above.
|
||||
- [ ] I have commented on my code, particularly in hard-to-understand areas.
|
||||
- [ ] For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
|
||||
- [ ] For UI changes: I attached a screen recording and/or screenshots above.
|
||||
- [ ] I have added tests (unit / integration / Playwright as applicable) and listed them above.
|
||||
|
||||
<!-- Based on the type(s) of your change, uncomment the required checklist 👇 -->
|
||||
|
||||
|
|
|
|||
98
.github/scripts/label_connector.py
vendored
Normal file
|
|
@ -0,0 +1,98 @@
"""Auto-label connector bugs.

Reads the "Connector" field from the issue body and applies one connector:* label.
- Exactly one rule matches → that label.
- Zero or multiple matches → connector:other.
Any other connector:* label this script manages is removed.

To add a connector: append one row to RULES.
"""

import json
import os
import re
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import Request, urlopen

RULES = [
    ("connector:mssql", r"\b(mssql|ms ?sql|sql ?server)\b"),
    ("connector:mysql", r"\bmysql\b"),
    ("connector:s3", r"\b(aws )?s3\b"),
    ("connector:bigquery", r"\b(big ?query|gcp bigquery)\b"),
    ("connector:snowflake", r"\bsnowflake\b"),
    ("connector:redshift", r"\b(aws )?redshift\b"),
    ("connector:unity-catalog", r"\bunity ?catalog\b"),
    ("connector:powerbi", r"\bpower ?bi\b"),
    ("connector:postgres", r"\bpostgres(ql)?\b"),
    ("connector:athena", r"\b(aws )?athena\b"),
    ("connector:tableau", r"\btableau\b"),
    ("connector:looker", r"\blooker\b"),
    ("connector:airflow", r"\b(apache )?airflow\b"),
    ("connector:dbt", r"\bdbt( ?cloud| ?core)?\b"),
    ("connector:databricks", r"\bdatabricks\b"),
    ("connector:fabric", r"\b(microsoft |ms )?fabric\b"),
]

OTHER = "connector:other"
MANAGED = {label for label, _ in RULES} | {OTHER}

TOKEN = os.environ["GITHUB_TOKEN"]
REPO = os.environ["GITHUB_REPOSITORY"]


def gh(method, path, body=None, ok=(200, 201, 204)):
    data = json.dumps(body).encode() if body else None
    req = Request(
        f"https://api.github.com/repos/{REPO}{path}",
        data=data, method=method,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
    try:
        with urlopen(req) as r:
            status = r.status
    except HTTPError as e:
        status = e.code
    if status not in ok:
        raise RuntimeError(f"{method} {path} returned HTTP {status}")
    return status


def classify(field_value):
    norm = re.sub(r"\s+", " ", re.sub(r"[_\-/.,()]", " ", field_value.lower())).strip()
    hits = [label for label, pattern in RULES if re.search(pattern, norm)]
    return hits[0] if len(hits) == 1 else OTHER


def main():
    with open(os.environ["GITHUB_EVENT_PATH"]) as f:
        issue = json.load(f)["issue"]

    match = re.search(r"### Connector\s*\n+\s*([^\n]+)", issue.get("body") or "")
    field = match.group(1).strip() if match else ""
    if not field or field == "_No response_":
        print("No Connector field — skipping.")
        return

    target = classify(field)
    current = {label["name"] for label in issue.get("labels", [])}
    print(f'Resolved to "{target}"')

    if gh("GET", f"/labels/{quote(target, safe='')}", ok=(200, 404)) == 404:
        gh("POST", "/labels", {"name": target, "color": "aaaaaa", "description": "Connector"})

    for label in current & MANAGED - {target}:
        gh("DELETE", f"/issues/{issue['number']}/labels/{quote(label, safe='')}", ok=(200, 404))
        print(f'Removed "{label}"')

    if target not in current:
        gh("POST", f"/issues/{issue['number']}/labels", {"labels": [target]})
        print(f'Added "{target}"')


if __name__ == "__main__":
    main()
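The matching rule described in the docstring of `label_connector.py` (exactly one rule hit wins, anything else falls back to `connector:other`) can be tried offline with a small sketch. This is an illustration, not part of the commit: it copies only four rows of `RULES`, the helper name `extract_connector_field` is hypothetical (the script does this step inline in `main()`), and `SAMPLE_BODY` is an invented issue body assumed to look like a rendered GitHub issue form.

```python
import re

# Illustrative subset of the RULES table; the real script defines more rows.
RULES = [
    ("connector:mssql", r"\b(mssql|ms ?sql|sql ?server)\b"),
    ("connector:mysql", r"\bmysql\b"),
    ("connector:bigquery", r"\b(big ?query|gcp bigquery)\b"),
    ("connector:postgres", r"\bpostgres(ql)?\b"),
]
OTHER = "connector:other"


def extract_connector_field(body):
    """Hypothetical helper: first line under the '### Connector' heading, or ''."""
    match = re.search(r"### Connector\s*\n+\s*([^\n]+)", body or "")
    return match.group(1).strip() if match else ""


def classify(field_value):
    """Lowercase, turn punctuation into spaces, then require exactly one rule hit."""
    norm = re.sub(r"\s+", " ", re.sub(r"[_\-/.,()]", " ", field_value.lower())).strip()
    hits = [label for label, pattern in RULES if re.search(pattern, norm)]
    return hits[0] if len(hits) == 1 else OTHER


# Invented issue body, assumed to resemble a rendered issue form.
SAMPLE_BODY = "### Connector\n\nSQL-Server\n\n### Version\n\n1.5.0"

if __name__ == "__main__":
    field = extract_connector_field(SAMPLE_BODY)
    print(field, "->", classify(field))
    for value in ("Big_Query", "PostgreSQL", "MySQL and Postgres", "Kafka"):
        print(value, "->", classify(value))
```

Because punctuation is normalised to spaces before matching, `SQL-Server` and `Big_Query` still resolve to a single label, while a value that names two connectors is treated as ambiguous and falls back to `connector:other` rather than picking the first hit.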
44
.github/workflows/airflow-apis-tests.yml
vendored
|
|
@ -16,13 +16,24 @@ on:
|
|||
types: [labeled, opened, synchronize, reopened, ready_for_review]
|
||||
paths:
|
||||
- 'openmetadata-airflow-apis/**'
|
||||
workflow_dispatch:
|
||||
|
||||
permissions:
|
||||
contents: read
|
||||
|
||||
concurrency:
|
||||
concurrency:
|
||||
group: airflow-apis-tests-${{ github.event.pull_request.number || github.run_id }}
|
||||
cancel-in-progress: true
|
||||
|
||||
env:
|
||||
SONAR_OPTS: >-
|
||||
-Dproject.settings=openmetadata-airflow-apis/sonar-project.properties
|
||||
-Dsonar.pullrequest.key=${{ github.event.pull_request.number }}
|
||||
-Dsonar.pullrequest.branch=${{ github.event.pull_request.head.ref }}
|
||||
-Dsonar.pullrequest.github.repository=OpenMetadata
|
||||
-Dsonar.scm.revision=${{ github.event.pull_request.head.sha }}
|
||||
-Dsonar.pullrequest.provider=github
|
||||
|
||||
jobs:
|
||||
airflow-apis-tests:
|
||||
runs-on: ubuntu-latest
|
||||
|
|
@ -113,26 +124,39 @@ jobs:
|
|||
sed -i 's/openmetadata_managed_apis/\/github\/workspace\/openmetadata-airflow-apis\/openmetadata_managed_apis/g' openmetadata-airflow-apis/ci-coverage.xml
|
||||
|
||||
- name: Push Results in PR to Sonar
|
||||
uses: sonarsource/sonarcloud-github-action@master
|
||||
id: push-to-sonar
|
||||
if: ${{ github.event_name == 'pull_request_target' }}
|
||||
continue-on-error: true
|
||||
uses: SonarSource/sonarqube-scan-action@v7
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
SONAR_TOKEN: ${{ secrets.AIRFLOW_APIS_SONAR_TOKEN }}
|
||||
with:
|
||||
projectBaseDir: openmetadata-airflow-apis/
|
||||
args: >
|
||||
-Dproject.settings=openmetadata-airflow-apis/sonar-project.properties
|
||||
-Dsonar.pullrequest.key=${{ github.event.pull_request.number }}
|
||||
-Dsonar.pullrequest.branch=${{ github.event.pull_request.head.ref }}
|
||||
-Dsonar.pullrequest.github.repository=OpenMetadata
|
||||
-Dsonar.scm.revision=${{ github.event.pull_request.head.sha }}
|
||||
-Dsonar.pullrequest.provider=github
|
||||
args: ${{ env.SONAR_OPTS }}
|
||||
|
||||
# next two steps are for retrying "Push Results in PR to Sonar" step in case it fails
|
||||
- name: Wait to retry 'Push Results in PR to Sonar'
|
||||
if: ${{ github.event_name == 'pull_request_target' && steps.push-to-sonar.outcome != 'success' }}
|
||||
run: sleep 20s
|
||||
shell: bash
|
||||
|
||||
- name: Retry 'Push Results in PR to Sonar'
|
||||
uses: SonarSource/sonarqube-scan-action@v7
|
||||
if: ${{ github.event_name == 'pull_request_target' && steps.push-to-sonar.outcome != 'success' }}
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
SONAR_TOKEN: ${{ secrets.AIRFLOW_APIS_SONAR_TOKEN }}
|
||||
with:
|
||||
projectBaseDir: openmetadata-airflow-apis/
|
||||
args: ${{ env.SONAR_OPTS }}
|
||||
|
||||
- name: Push Results to Sonar
|
||||
uses: sonarsource/sonarcloud-github-action@master
|
||||
uses: SonarSource/sonarqube-scan-action@v7
|
||||
if: ${{ github.event_name == 'push' }}
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
SONAR_TOKEN: ${{ secrets.AIRFLOW_APIS_SONAR_TOKEN }}
|
||||
with:
|
||||
projectBaseDir: openmetadata-airflow-apis/
|
||||
args: -Dproject.settings=openmetadata-airflow-apis/sonar-project.properties
|
||||
|
|
|
|||
|
|
@ -16,49 +16,61 @@ permissions:
|
|||
env:
|
||||
CURRENT_RELEASE_ENDPOINT: ${{ vars.CURRENT_RELEASE_ENDPOINT }} # Endpoint that returns the current release version in json format
|
||||
jobs:
|
||||
cherry_pick_to_release_branch:
|
||||
get_release_branch:
|
||||
if: github.event.pull_request.merged == true &&
|
||||
contains(github.event.pull_request.labels.*.name, 'To release')
|
||||
runs-on: ubuntu-latest
|
||||
outputs:
|
||||
release_branches: ${{ steps.get_release_version.outputs.release_branches }}
|
||||
steps:
|
||||
- name: Get the release version
|
||||
id: get_release_version
|
||||
run: |
|
||||
CURRENT_RELEASE=$(curl -s $CURRENT_RELEASE_ENDPOINT | jq -c '.collate_branches // []')
|
||||
echo "release_branches=${CURRENT_RELEASE}" >> $GITHUB_OUTPUT
|
||||
|
||||
cherry_pick_to_release_branch:
|
||||
needs: get_release_branch
|
||||
if: needs.get_release_branch.outputs.release_branches != '' && needs.get_release_branch.outputs.release_branches != '[]'
|
||||
runs-on: ubuntu-latest # Running it on ubuntu-latest on purpose (we're not using all the free minutes)
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
branch: ${{ fromJson(needs.get_release_branch.outputs.release_branches) }}
|
||||
steps:
|
||||
- name: Checkout main branch
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: main
|
||||
fetch-depth: 0
|
||||
- name: Get the release version
|
||||
id: get_release_version
|
||||
run: |
|
||||
CURRENT_RELEASE=$(curl -s $CURRENT_RELEASE_ENDPOINT | jq -r .om_branch)
|
||||
echo "CURRENT_RELEASE=${CURRENT_RELEASE}" >> $GITHUB_ENV
|
||||
- name: Cherry-pick changes from PR
|
||||
id: cherry_pick
|
||||
continue-on-error: true
|
||||
run: |
|
||||
git config --global user.email "release-bot@open-metadata.org"
|
||||
git config --global user.name "OpenMetadata Release Bot"
|
||||
git fetch origin ${CURRENT_RELEASE}
|
||||
git checkout ${CURRENT_RELEASE}
|
||||
git fetch origin ${{ matrix.branch }}
|
||||
git checkout ${{ matrix.branch }}
|
||||
git cherry-pick -x ${{ github.event.pull_request.merge_commit_sha }}
|
||||
- name: Push changes to release branch
|
||||
id: push_changes
|
||||
continue-on-error: true
|
||||
if: steps.cherry_pick.outcome == 'success'
|
||||
run: |
|
||||
git push origin ${CURRENT_RELEASE}
|
||||
git push origin ${{ matrix.branch }}
|
||||
- name: Post a comment on failure
|
||||
if: steps.cherry_pick.outcome != 'success' || steps.push_changes.outcome != 'success'
|
||||
uses: actions/github-script@v7
|
||||
with:
|
||||
script: |
|
||||
const prNumber = context.payload.pull_request.number;
|
||||
const releaseVersion = process.env.CURRENT_RELEASE;
|
||||
const releaseBranch = '${{ matrix.branch }}';
|
||||
const workflowRunUrl = `${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID}`;
|
||||
github.rest.issues.createComment({
|
||||
owner: context.repo.owner,
|
||||
repo: context.repo.repo,
|
||||
issue_number: prNumber,
|
||||
body: `Failed to cherry-pick changes to the ${releaseVersion} branch.
|
||||
body: `Failed to cherry-pick changes to the ${releaseBranch} branch.
|
||||
Please cherry-pick the changes manually.
|
||||
You can find more details [here](${workflowRunUrl}).`
|
||||
})
|
||||
|
|
@ -68,10 +80,10 @@ jobs:
|
|||
with:
|
||||
script: |
|
||||
const prNumber = context.payload.pull_request.number;
|
||||
const releaseVersion = process.env.CURRENT_RELEASE;
|
||||
const releaseBranch = '${{ matrix.branch }}';
|
||||
github.rest.issues.createComment({
|
||||
owner: context.repo.owner,
|
||||
repo: context.repo.repo,
|
||||
issue_number: prNumber,
|
||||
body: `Changes have been cherry-picked to the ${releaseVersion} branch.`
|
||||
body: `Changes have been cherry-picked to the ${releaseBranch} branch.`
|
||||
})
|
||||
|
|
|
|||
178
.github/workflows/integration-tests-postgres-elasticsearch-redis.yml
vendored
Normal file
|
|
@ -0,0 +1,178 @@
# Copyright 2026 Collate
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Runs the full integration test suite with the Redis cache enabled (postgres + elasticsearch +
# redis), via the cache-tests Maven profile. Catches cache-invalidation and stale-data bugs that
# only surface when every test path goes through the cache layer.
#
# Security note (CodeQL "pull_request_target + checkout untrusted code"):
# This workflow uses `pull_request_target` so PRs from forks can produce a required check.
# CodeQL flags the pattern as risky because it checks out PR-controlled code while having
# access to secrets. The mitigation is the explicit `safe to test` label gate below — the
# verify-pr-label step rejects the workflow run before any PR code is checked out unless a
# maintainer has applied the label. This matches the mitigation used by every other
# integration-tests-*.yml workflow in this repo. If you remove the label gate, you reopen
# the vulnerability.
name: Integration Tests - PostgreSQL + Elasticsearch + Redis

on:
  merge_group:
  workflow_dispatch:
  push:
    branches:
      - main
    paths:
      - "openmetadata-service/**"
      - "openmetadata-integration-tests/**"
      - "openmetadata-spec/src/main/resources/json/schema/**"
      - "openmetadata-sdk/**"
      - "common/**"
      - "pom.xml"
      - "bootstrap/**"
  # `pull_request_target` is intentional and required so the workflow runs against PRs from
  # forks (which `pull_request` cannot for security reasons). The `safe to test` label gate
  # below is what makes this safe — see security note in the file header.
  pull_request_target:
    types: [labeled, opened, synchronize, reopened, ready_for_review]

permissions:
  contents: read
  checks: write

concurrency:
  group: integration-tests-pg-es-redis-${{ github.event.pull_request.number || github.run_id }}
  cancel-in-progress: true
jobs:
  # Detect whether relevant paths changed. When no matching files are modified
  # the downstream job is skipped via its `if` condition.
  # A job skipped by `if` reports as "Success", so required checks still pass.
  changes:
    name: Detect Changes
    runs-on: ubuntu-latest
    if: ${{ !github.event.pull_request.draft }}
    outputs:
      backend: ${{ github.event_name == 'workflow_dispatch' && 'true' || steps.filter.outputs.backend }}
    steps:
      - uses: dorny/paths-filter@v3
        id: filter
        if: ${{ github.event_name != 'workflow_dispatch' }}
        with:
          filters: |
            backend:
              - 'openmetadata-service/**'
              - 'openmetadata-integration-tests/**'
              - 'openmetadata-spec/src/main/resources/json/schema/**'
              - 'openmetadata-sdk/**'
              - 'common/**'
              - 'pom.xml'
              - 'bootstrap/**'

  integration-tests-postgres-elasticsearch-redis:
    needs: changes
    runs-on: ubuntu-latest
    if: ${{ needs.changes.outputs.backend == 'true' }}
    steps:
      - name: Free Disk Space (Ubuntu)
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: true
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          docker-images: false
          swap-storage: true

      - name: Wait for the labeler
        uses: lewagon/wait-on-check-action@v1.3.4
        if: ${{ github.event_name == 'pull_request_target' }}
        with:
          ref: ${{ github.event.pull_request.head.sha }}
          check-name: Team Label
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          wait-interval: 90

      - name: Verify PR labels
        uses: jesusvasquez333/verify-pr-label-action@v1.4.0
        if: ${{ github.event_name == 'pull_request_target' }}
        with:
          github-token: '${{ secrets.GITHUB_TOKEN }}'
          valid-labels: 'safe to test'
          pull-request-number: '${{ github.event.pull_request.number }}'
          disable-reviews: true # To not auto approve changes

      # SECURITY: this step checks out PR-controlled code while the workflow runs with
      # `pull_request_target` privileges (secrets access). The `Verify PR labels` step above
      # gates this — the workflow halts before we get here unless a maintainer has applied
      # the `safe to test` label. CodeQL flags the pattern; the label gate is the accepted
      # mitigation, mirroring how every other integration-tests-*.yml workflow in this repo
      # handles fork PRs.
      - name: Checkout
        uses: actions/checkout@v4
        with:
          ref: ${{ github.event_name == 'merge_group' && github.sha || github.event.pull_request.head.sha }}

      - name: Cache Maven dependencies
        id: cache-output
        uses: actions/cache@v4
        with:
          path: ~/.m2
          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
          restore-keys: |
            ${{ runner.os }}-maven-

      # Run unconditionally. The previous `if: steps.cache-output.outputs.exit-code == 0` was a
      # bug — `actions/cache@v4` exposes `cache-hit` (boolean) and `cache-primary-key`, never
      # `exit-code`. The expression always evaluated to false and the steps never ran. Maven
      # then ran against whatever JDK the runner happened to ship with, masking the issue.
      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Install Ubuntu dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y unixodbc-dev python3-venv librdkafka-dev gcc libsasl2-dev build-essential libssl-dev libffi-dev \
            librdkafka-dev unixodbc-dev libevent-dev jq
          sudo make install_antlr_cli

      - name: Build for Integration Tests (PostgreSQL + Elasticsearch + Redis)
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: mvn -DskipTests clean install -pl :openmetadata-integration-tests -am

      - name: Free build artifacts
        run: |
          rm -rf openmetadata-service/target/lib openmetadata-service/target/classes
          rm -rf openmetadata-spec/target openmetadata-sdk/target common/target
          rm -rf openmetadata-shaded-deps/*/target
          df -h /

      - name: Run Integration Tests (PostgreSQL + Elasticsearch + Redis)
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: mvn verify -pl :openmetadata-integration-tests -Pcache-tests

      - name: Clean Up
        run: |
          cd ./docker/development
          docker compose down --remove-orphans
          sudo rm -rf ${PWD}/docker-volume

      - name: Publish Test Report
        if: ${{ always() }}
        uses: scacap/action-surefire-report@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          fail_on_test_failures: true
          report_paths: 'openmetadata-integration-tests/target/failsafe-reports/TEST-*.xml'
161
.github/workflows/java-playwright-nightly.yml
vendored
Normal file
|
|
@ -0,0 +1,161 @@
# Copyright 2026 Collate
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Full run of the UI integration suite (*UIIT.java) — lives inside
# openmetadata-integration-tests under the `ui-it` Maven profile. Runs the
# external-mode matrix (ES + OS). Tracks EPIC #3731 / tickets #3767, #3792.
#
# The schedule trigger is intentionally disabled while the suite stabilises;
# run on demand via workflow_dispatch (pick the branch to test as the ref).
# Re-add `schedule: - cron: '0 2 * * *'` once the suite is green on main.

name: UI Integration Tests (Nightly)

on:
  workflow_dispatch:

permissions:
  contents: read
  checks: write

jobs:
  ui-it-nightly:
    runs-on: ubuntu-latest
    timeout-minutes: 90
    strategy:
      fail-fast: false
      matrix:
        searchEngine: [opensearch, elasticsearch]
    steps:
      - name: Free Disk Space (Ubuntu)
        uses: jlumbroso/free-disk-space@main
        with:
          tool-cache: true
          android: true
          dotnet: true
          haskell: true
          large-packages: true
          docker-images: false
          swap-storage: true

      - name: Checkout
        uses: actions/checkout@v4
        with:
          # Cron always runs against main. workflow_dispatch honours the ref the
          # workflow was dispatched against so feature branches can validate the
          # nightly matrix before merge (EPIC #3731 / PR #28008).
          ref: ${{ github.event_name == 'workflow_dispatch' && github.ref || 'main' }}

      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Cache Maven dependencies
        uses: actions/cache@v4
        with:
          path: ~/.m2
          key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
          restore-keys: |
            ${{ runner.os }}-maven-

      - name: Install Ubuntu dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y jq

      - name: Add /etc/hosts entry for mock OIDC server
        # The SSO test infrastructure (MockOidcServer) needs `om-mock-idp` to resolve to
        # loopback on the host so the same URL works inside the Docker network and from
        # the host-side Playwright browser, keeping the issued tokens' `iss` claim
        # consistent across actors.
        run: echo "127.0.0.1 om-mock-idp" | sudo tee -a /etc/hosts

      - name: Build dependencies for integration-tests
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: mvn -DskipTests clean install -pl :openmetadata-integration-tests -am

      - name: Install Playwright browsers
        run: |
          mvn -pl :openmetadata-integration-tests dependency:build-classpath -Dmdep.outputFile=/tmp/cp.txt -q
          java -cp "$(cat /tmp/cp.txt)" com.microsoft.playwright.CLI install --with-deps chromium

      - name: Free build artifacts
        run: |
          rm -rf openmetadata-service/target/lib openmetadata-service/target/classes
          rm -rf openmetadata-spec/target openmetadata-sdk/target common/target
          rm -rf openmetadata-shaded-deps/*/target
          df -h /

      - name: Run UI integration tests (${{ matrix.searchEngine }})
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # UiSessionExtension reads PW_VIDEO and records every test's BrowserContext
          # to target/playwright-videos when true. Kept off locally; on in CI for triage.
          PW_VIDEO: 'true'
        run: |
          if [ "${{ matrix.searchEngine }}" = "elasticsearch" ]; then
            mvn verify -P ui-it -pl :openmetadata-integration-tests \
              -DsearchType=elasticsearch \
              -DsearchImage=docker.elastic.co/elasticsearch/elasticsearch:9.3.0
          else
            mvn verify -P ui-it -pl :openmetadata-integration-tests
          fi

      - name: Upload Playwright traces
        if: ${{ always() }}
        uses: actions/upload-artifact@v4
        with:
          name: playwright-traces-${{ matrix.searchEngine }}-${{ github.run_id }}
          path: openmetadata-integration-tests/target/playwright-traces
          if-no-files-found: ignore
          retention-days: 14

      - name: Upload Playwright videos
        if: ${{ always() }}
        uses: actions/upload-artifact@v4
        with:
          name: playwright-videos-${{ matrix.searchEngine }}-${{ github.run_id }}
          path: openmetadata-integration-tests/target/playwright-videos
          if-no-files-found: ignore
          retention-days: 14

      - name: Upload Failsafe Reports
        if: ${{ always() }}
        uses: actions/upload-artifact@v4
        with:
          name: failsafe-reports-${{ matrix.searchEngine }}-${{ github.run_id }}
          path: openmetadata-integration-tests/target/failsafe-reports
          if-no-files-found: ignore
          retention-days: 14

      - name: Publish Test Report
        id: report
        if: ${{ always() }}
        uses: scacap/action-surefire-report@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          fail_on_test_failures: true
          report_paths: 'openmetadata-integration-tests/target/failsafe-reports/TEST-*.xml'

      - name: Slack notification
        if: ${{ always() }}
        uses: slackapi/slack-github-action@v1.23.0
        with:
          payload: |
            {
              "text": "${{ job.status == 'success' && ':white_check_mark:' || ':fire:' }} Java Playwright Nightly (${{ matrix.searchEngine }}): ${{ job.status }}\nRef: ${{ github.ref_name }}\nLogs: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.E2E_SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
25
.github/workflows/label-connector.yml
vendored
Normal file
|
|
@ -0,0 +1,25 @@
|
|||
name: Label connector bug
|
||||
|
||||
on:
|
||||
issues:
|
||||
types: [opened, edited]
|
||||
|
||||
permissions:
|
||||
issues: write
|
||||
contents: read
|
||||
|
||||
jobs:
|
||||
label:
|
||||
if: contains(github.event.issue.labels.*.name, 'Ingestion')
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v4
|
||||
with:
|
||||
sparse-checkout: .github/scripts
|
||||
sparse-checkout-cone-mode: false
|
||||
- uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: '3.11'
|
||||
- run: python .github/scripts/label_connector.py
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
13
.github/workflows/mysql-nightly-e2e.yml
vendored
|
|
@ -88,21 +88,14 @@ jobs:
|
|||
--project=DataAssetRulesEnabled \
|
||||
--project=DataAssetRulesDisabled
|
||||
|
||||
elif [ "${{ matrix.shardIndex }}" -eq "6" ]; then
|
||||
echo "🔹 Running stateful Playwright tests serially on shard 6"
|
||||
npx playwright test \
|
||||
--project=stateful \
|
||||
--workers=1
|
||||
|
||||
else
|
||||
# Shards 2-5 handle chromium tests (4 shards total)
|
||||
# Shards 2-6 handle chromium tests (5 shards total)
|
||||
CHROMIUM_SHARD=$(( ${{ matrix.shardIndex }} - 1 ))
|
||||
echo "🔹 Running all tests (excluding DataAssetRules/stateful) on chromium shard ${CHROMIUM_SHARD}/4"
|
||||
echo "🔹 Running all tests (excluding DataAssetRules) on chromium shard ${CHROMIUM_SHARD}/5"
|
||||
npx playwright test \
|
||||
--project=chromium \
|
||||
--grep-invert @dataAssetRules \
|
||||
--shard=${CHROMIUM_SHARD}/4 \
|
||||
--workers=50%
|
||||
--shard=${CHROMIUM_SHARD}/5
|
||||
fi
|
||||
|
||||
env:
|
||||
|
|
|
|||
|
|
@ -64,15 +64,18 @@ jobs:
|
|||
k8s_operator:
|
||||
- 'openmetadata-k8s-operator/**'
|
||||
|
||||
# The openmetadata-service unit tests are pure JVM tests with no database
|
||||
# interaction (no testcontainers, no JDBC). The {mysql, postgresql} matrix used
|
||||
# to run the suite twice with different `-Pmysql` / `-Ppostgresql` profiles, but
|
||||
# those profiles are only defined in openmetadata-sdk/pom.xml and only affect
|
||||
# failsafe (integration) tests that aren't enabled in this workflow. Result:
|
||||
# both matrix jobs ran an identical surefire suite. DB-specific coverage
|
||||
# belongs in `openmetadata-integration-tests`, not here.
|
||||
openmetadata-service-unit-tests:
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 90
|
||||
needs: changes
|
||||
if: ${{ needs.changes.outputs.java == 'true' }}
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
database: [mysql, postgresql]
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
|
|
@ -100,12 +103,12 @@ jobs:
|
|||
librdkafka-dev unixodbc-dev libevent-dev jq
|
||||
sudo make install_antlr_cli
|
||||
|
||||
- name: Run openmetadata-service unit tests (${{ matrix.database }})
|
||||
- name: Run openmetadata-service unit tests
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
run: |
|
||||
mvn -B clean package -pl openmetadata-service -am \
|
||||
-Pstatic-code-analysis,${{ matrix.database }} \
|
||||
-Pstatic-code-analysis \
|
||||
-DfailIfNoTests=false \
|
||||
-Dsonar.skip=true
|
||||
|
||||
|
|
@ -113,7 +116,7 @@ jobs:
|
|||
if: ${{ failure() && hashFiles('openmetadata-service/target/surefire-reports/TEST-*.xml') != '' }}
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: openmetadata-service-surefire-reports-${{ matrix.database }}
|
||||
name: openmetadata-service-surefire-reports
|
||||
path: openmetadata-service/target/surefire-reports/
|
||||
|
||||
- name: Publish Test Report
|
||||
|
|
@ -123,7 +126,7 @@ jobs:
|
|||
github_token: ${{ secrets.GITHUB_TOKEN }}
|
||||
fail_on_test_failures: true
|
||||
report_paths: "openmetadata-service/target/surefire-reports/TEST-*.xml"
|
||||
check_name: "Test Report (${{ matrix.database }})"
|
||||
check_name: "Test Report"
|
||||
|
||||
k8s_operator-unit-tests:
|
||||
runs-on: ubuntu-latest
|
||||
|
|
|
|||
104
.github/workflows/playwright-search-nightly.yml
vendored
Normal file
|
|
@ -0,0 +1,104 @@
|
|||
# Copyright 2026 Collate
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
name: playwright-search-nightly
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
|
||||
permissions:
|
||||
contents: read
|
||||
|
||||
concurrency:
|
||||
group: playwright-search-nightly-${{ github.ref }}
|
||||
cancel-in-progress: true
|
||||
|
||||
jobs:
|
||||
playwright-search-nightly:
|
||||
runs-on: ubuntu-latest
|
||||
environment: test
|
||||
timeout-minutes: 45
|
||||
steps:
|
||||
- name: Free Disk Space (Ubuntu)
|
||||
uses: jlumbroso/free-disk-space@main
|
||||
with:
|
||||
tool-cache: false
|
||||
android: true
|
||||
dotnet: true
|
||||
haskell: true
|
||||
large-packages: false
|
||||
swap-storage: true
|
||||
docker-images: false
|
||||
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Cache Maven Dependencies
|
||||
uses: actions/cache@v4
|
||||
with:
|
||||
path: ~/.m2
|
||||
key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
|
||||
restore-keys: |
|
||||
${{ runner.os }}-maven-
|
||||
|
||||
- name: Setup OpenMetadata Test Environment
|
||||
uses: ./.github/actions/setup-openmetadata-test-environment
|
||||
with:
|
||||
python-version: '3.10'
|
||||
args: '-d postgresql -i false'
|
||||
ingestion_dependency: 'all'
|
||||
|
||||
- name: Setup Node.js
|
||||
uses: actions/setup-node@v4
|
||||
with:
|
||||
node-version-file: 'openmetadata-ui/src/main/resources/ui/.nvmrc'
|
||||
|
||||
- name: Install dependencies
|
||||
working-directory: openmetadata-ui/src/main/resources/ui/
|
||||
run: yarn --ignore-scripts --frozen-lockfile
|
||||
|
||||
- name: Install Playwright Browsers
|
||||
run: npx playwright@1.57.0 install chromium --with-deps
|
||||
|
||||
- name: Run Search Nightly
|
||||
working-directory: openmetadata-ui/src/main/resources/ui
|
||||
env:
|
||||
PLAYWRIGHT_IS_OSS: true
|
||||
run: |
|
||||
# All search tests live in playwright/e2e/Search/. The search-nightly
|
||||
# project in playwright.config.ts maps testMatch to **/Search/** so only
|
||||
# that folder is picked up. Add new search specs to that folder.
|
||||
npx playwright test --project=search-nightly --workers=1
|
||||
|
||||
- name: Upload HTML report
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: search-nightly-html-report
|
||||
path: openmetadata-ui/src/main/resources/ui/playwright/output/playwright-report
|
||||
retention-days: 5
|
||||
|
||||
- name: Send Slack Notification
|
||||
if: always()
|
||||
working-directory: openmetadata-ui/src/main/resources/ui
|
||||
env:
|
||||
RUN_TITLE: "Playwright Search Nightly (${{ github.ref_name }})"
|
||||
RUN_URL: "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
|
||||
SLACK_BOT_USER_OAUTH_TOKEN: ${{ secrets.E2E_SLACK_BOT_OAUTH_TOKEN }}
|
||||
run: |
|
||||
npx playwright-slack-report -c playwright/slack-cli.config.json -j playwright/output/results.json > slack_report.json
|
||||
|
||||
- name: Clean Up
|
||||
if: always()
|
||||
run: |
|
||||
cd ./docker/development
|
||||
docker compose down --remove-orphans
|
||||
sudo rm -rf ${PWD}/docker-volume
|
||||
142
.github/workflows/playwright-sso-login-nightly.yml
vendored
Normal file
|
|
@ -0,0 +1,142 @@
|
|||
# Copyright 2025 Collate
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
name: SSO Login Nightly
|
||||
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 3 * * *'
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
sso_provider:
|
||||
description: 'SSO provider (or "all")'
|
||||
required: true
|
||||
default: okta
|
||||
type: choice
|
||||
options:
|
||||
- okta
|
||||
- keycloak-azure-saml
|
||||
- all
|
||||
|
||||
permissions:
|
||||
contents: read
|
||||
|
||||
concurrency:
|
||||
group: sso-login-nightly-${{ github.event.inputs.sso_provider || 'scheduled' }}
|
||||
cancel-in-progress: true
|
||||
|
||||
jobs:
|
||||
# To onboard a new provider:
|
||||
# 1. Add a matrix entry below (`name` is the lowercase provider id used by
|
||||
# the Playwright helper; `env_prefix` is the uppercase/underscore form
|
||||
# used to look up credentials). Also add `name` to the dispatch
|
||||
# `options:` list above.
|
||||
# 2. Add <ENV_PREFIX>_SSO_USERNAME (variable) and <ENV_PREFIX>_SSO_PASSWORD
|
||||
# (variable) to the `test` environment. Use a secret instead of a
|
||||
# variable for the password if the provider uses a real (non-fixture)
|
||||
# credential.
|
||||
# 3. Register the helper in playwright/utils/sso-providers/index.ts.
|
||||
sso-login:
|
||||
runs-on: ubuntu-latest
|
||||
environment: test
|
||||
timeout-minutes: 45
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
provider:
|
||||
${{ (github.event_name == 'schedule' || github.event.inputs.sso_provider == 'all')
|
||||
&& fromJSON('[{"name":"okta","env_prefix":"OKTA"},{"name":"keycloak-azure-saml","env_prefix":"KEYCLOAK_AZURE_SAML"}]')
|
||||
|| (github.event.inputs.sso_provider == 'keycloak-azure-saml'
|
||||
&& fromJSON('[{"name":"keycloak-azure-saml","env_prefix":"KEYCLOAK_AZURE_SAML"}]')
|
||||
|| fromJSON('[{"name":"okta","env_prefix":"OKTA"}]')) }}
|
||||
steps:
|
||||
- name: Free Disk Space (Ubuntu)
|
||||
uses: jlumbroso/free-disk-space@main
|
||||
with:
|
||||
tool-cache: false
|
||||
android: true
|
||||
dotnet: true
|
||||
haskell: true
|
||||
large-packages: false
|
||||
swap-storage: true
|
||||
docker-images: false
|
||||
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Cache Maven Dependencies
|
||||
uses: actions/cache@v4
|
||||
with:
|
||||
path: ~/.m2
|
||||
key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
|
||||
restore-keys: |
|
||||
${{ runner.os }}-maven-
|
||||
|
||||
- name: Setup OpenMetadata Test Environment
|
||||
uses: ./.github/actions/setup-openmetadata-test-environment
|
||||
with:
|
||||
python-version: '3.10'
|
||||
args: '-d postgresql -i false'
|
||||
ingestion_dependency: 'all'
|
||||
|
||||
- name: Setup Node.js
|
||||
uses: actions/setup-node@v4
|
||||
with:
|
||||
node-version-file: 'openmetadata-ui/src/main/resources/ui/.nvmrc'
|
||||
|
||||
- name: Install dependencies
|
||||
working-directory: openmetadata-ui/src/main/resources/ui/
|
||||
run: yarn --ignore-scripts --frozen-lockfile
|
||||
|
||||
- name: Install Playwright Browsers
|
||||
run: npx playwright@1.57.0 install chromium --with-deps
|
||||
|
||||
- name: Start Keycloak SAML IdP
|
||||
if: startsWith(matrix.provider.name, 'keycloak-')
|
||||
run: |
|
||||
docker compose -f docker/local-sso/keycloak-saml/docker-compose.yml up -d
|
||||
timeout 180 bash -c 'until curl -fsS http://localhost:8080/realms/om-azure-saml >/dev/null; do sleep 2; done'
|
||||
|
||||
- name: Run SSO Login Spec
|
||||
working-directory: openmetadata-ui/src/main/resources/ui
|
||||
env:
|
||||
SSO_PROVIDER_TYPE: ${{ matrix.provider.name }}
|
||||
SSO_USERNAME: ${{ vars[format('{0}_SSO_USERNAME', matrix.provider.env_prefix)] }}
|
||||
SSO_PASSWORD: ${{ vars[format('{0}_SSO_PASSWORD', matrix.provider.env_prefix)] || secrets[format('{0}_SSO_PASSWORD', matrix.provider.env_prefix)] }}
|
||||
KEYCLOAK_SAML_BASE_URL: http://localhost:8080
|
||||
PLAYWRIGHT_IS_OSS: true
|
||||
run: npx playwright test --project=sso-auth --workers=1
|
||||
|
||||
- name: Upload HTML report
|
||||
if: always()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: sso-login-html-report-${{ matrix.provider.name }}
|
||||
path: openmetadata-ui/src/main/resources/ui/playwright/output/playwright-report
|
||||
retention-days: 5
|
||||
|
||||
- name: Send Slack Notification
|
||||
if: always()
|
||||
working-directory: openmetadata-ui/src/main/resources/ui
|
||||
env:
|
||||
RUN_TITLE: "SSO Login Nightly: ${{ matrix.provider.name }} (${{ github.ref_name }})"
|
||||
RUN_URL: "${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
|
||||
SLACK_BOT_USER_OAUTH_TOKEN: ${{ secrets.E2E_SLACK_BOT_OAUTH_TOKEN }}
|
||||
run: |
|
||||
npx playwright-slack-report -c playwright/slack-cli.config.json -j playwright/output/results.json > slack_report.json
|
||||
|
||||
- name: Clean Up
|
||||
if: always()
|
||||
run: |
|
||||
docker compose -f docker/local-sso/keycloak-saml/docker-compose.yml down --remove-orphans || true
|
||||
cd ./docker/development
|
||||
docker compose down --remove-orphans
|
||||
sudo rm -rf ${PWD}/docker-volume
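The `matrix.provider` expression in the SSO workflow above can be read as a small selection function. The following is a minimal Python sketch of that logic, not code from the repository: the function names `select_providers` and `credential_keys` are hypothetical, and the provider list is copied from the `fromJSON` literals in the workflow.

```python
# Provider entries copied from the fromJSON literals in the workflow.
ALL_PROVIDERS = [
    {"name": "okta", "env_prefix": "OKTA"},
    {"name": "keycloak-azure-saml", "env_prefix": "KEYCLOAK_AZURE_SAML"},
]


def select_providers(event_name, sso_provider=None):
    """Hypothetical helper mirroring the workflow's matrix expression."""
    # Scheduled runs and an explicit "all" dispatch cover every provider.
    if event_name == "schedule" or sso_provider == "all":
        return list(ALL_PROVIDERS)
    # A dispatch for keycloak-azure-saml runs only that provider.
    if sso_provider == "keycloak-azure-saml":
        return [ALL_PROVIDERS[1]]
    # Every other case falls back to okta, the dispatch default.
    return [ALL_PROVIDERS[0]]


def credential_keys(provider):
    """Names looked up via format('{0}_SSO_USERNAME', env_prefix) and its password twin."""
    prefix = provider["env_prefix"]
    return f"{prefix}_SSO_USERNAME", f"{prefix}_SSO_PASSWORD"
```

Under this reading, onboarding a provider means adding one more entry to the list and one more branch, which matches the three onboarding steps described in the workflow comment.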
|
||||
13
.github/workflows/postgresql-nightly-e2e.yml
vendored
|
|
@ -88,21 +88,14 @@ jobs:
|
|||
--project=DataAssetRulesEnabled \
|
||||
--project=DataAssetRulesDisabled
|
||||
|
||||
elif [ "${{ matrix.shardIndex }}" -eq "6" ]; then
|
||||
echo "🔹 Running stateful Playwright tests serially on shard 6"
|
||||
npx playwright test \
|
||||
--project=stateful \
|
||||
--workers=1
|
||||
|
||||
else
|
||||
# Shards 2-5 handle chromium tests (4 shards total)
|
||||
# Shards 2-6 handle chromium tests (5 shards total)
|
||||
CHROMIUM_SHARD=$(( ${{ matrix.shardIndex }} - 1 ))
|
||||
echo "🔹 Running all tests (excluding DataAssetRules/stateful) on chromium shard ${CHROMIUM_SHARD}/4"
|
||||
echo "🔹 Running all tests (excluding DataAssetRules) on chromium shard ${CHROMIUM_SHARD}/5"
|
||||
npx playwright test \
|
||||
--project=chromium \
|
||||
--grep-invert @dataAssetRules \
|
||||
--shard=${CHROMIUM_SHARD}/4 \
|
||||
--workers=50%
|
||||
--shard=${CHROMIUM_SHARD}/5
|
||||
fi
|
||||
|
||||
env:
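The shard arithmetic in the nightly E2E hunks (MySQL and PostgreSQL) can be sketched as follows. This is an illustrative Python model, not repository code: the function name `playwright_args` is hypothetical, and the assumption that shard 1 runs the DataAssetRules projects is inferred from the context lines, since the first branch of the `if` lies outside the hunk.

```python
def playwright_args(shard_index, chromium_shards=5):
    """Hypothetical model of the per-shard Playwright arguments after this change."""
    if shard_index == 1:
        # Assumption: shard 1 is reserved for the DataAssetRules projects.
        return ["--project=DataAssetRulesEnabled", "--project=DataAssetRulesDisabled"]
    # Matrix shards 2..6 map onto chromium shards 1..5.
    chromium_shard = shard_index - 1
    if not 1 <= chromium_shard <= chromium_shards:
        raise ValueError(f"shard index {shard_index} is out of range")
    return [
        "--project=chromium",
        "--grep-invert",
        "@dataAssetRules",
        f"--shard={chromium_shard}/{chromium_shards}",
    ]
```

With the dedicated stateful shard removed, matrix shard 6 becomes chromium shard 5/5 instead of a serial run.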
|
||||
|
|
|
|||
12
.github/workflows/py-cli-e2e-tests.yml
vendored
|
|
@ -18,7 +18,7 @@ on:
|
|||
e2e-tests:
|
||||
description: "E2E Tests to run"
|
||||
required: True
|
||||
default: '["bigquery", "dbt_redshift", "metabase", "mssql", "mysql", "redash", "snowflake", "tableau", "python-unittests", "python-integration", "redshift", "quicksight", "datalake_s3", "postgres", "oracle", "athena", "bigquery_multiple_project"]'
|
||||
default: '["bigquery", "dbt_redshift", "metabase", "mssql", "mysql", "redash", "snowflake", "tableau", "python-unittests", "python-integration", "redshift", "quicksight", "datalake_s3", "postgres", "oracle", "athena", "bigquery_multiple_project", "exasol"]'
|
||||
debug:
|
||||
description: "If Debugging the Pipeline, Slack and Sonar events won't be triggered [default, true or false]. Default will trigger only on main branch."
|
||||
required: False
|
||||
|
|
@ -45,7 +45,7 @@ jobs:
|
|||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
e2e-test: ${{ fromJSON(inputs.e2e-tests || '["bigquery", "dbt_redshift", "metabase", "mssql", "mysql", "redash", "snowflake", "tableau", "python-unittests", "python-integration", "redshift", "quicksight", "datalake_s3", "postgres", "oracle", "athena", "bigquery_multiple_project"]') }}
|
||||
e2e-test: ${{ fromJSON(inputs.e2e-tests || '["bigquery", "dbt_redshift", "metabase", "mssql", "mysql", "redash", "snowflake", "tableau", "python-unittests", "python-integration", "redshift", "quicksight", "datalake_s3", "postgres", "oracle", "athena", "bigquery_multiple_project", "exasol"]') }}
|
||||
environment: test
|
||||
|
||||
steps:
|
||||
|
|
@ -182,12 +182,12 @@ jobs:
|
|||
echo "import os" >> $SITE_CUSTOMIZE_PATH
|
||||
echo "try:" >> $SITE_CUSTOMIZE_PATH
|
||||
echo " import coverage" >> $SITE_CUSTOMIZE_PATH
|
||||
echo " os.environ['COVERAGE_PROCESS_START'] = 'ingestion/pyproject.toml'" >> $SITE_CUSTOMIZE_PATH
|
||||
echo " os.environ['COVERAGE_PROCESS_START'] = os.path.join(os.environ.get('GITHUB_WORKSPACE', os.getcwd()), 'ingestion', 'pyproject.toml')" >> $SITE_CUSTOMIZE_PATH
|
||||
echo " coverage.process_startup()" >> $SITE_CUSTOMIZE_PATH
|
||||
echo "except ImportError:" >> $SITE_CUSTOMIZE_PATH
|
||||
echo " pass" >> $SITE_CUSTOMIZE_PATH
|
||||
coverage run --rcfile ingestion/pyproject.toml -a --branch -m pytest -c ingestion/pyproject.toml --junitxml=ingestion/junit/test-results-$E2E_TEST.xml --ignore=ingestion/tests/unit/source ingestion/tests/cli_e2e/test_cli_$E2E_TEST.py
|
||||
coverage combine --data-file=.coverage.$E2E_TEST --rcfile=ingestion/pyproject.toml --keep -a .coverage*
|
||||
coverage run --rcfile ingestion/pyproject.toml --branch -m pytest -c ingestion/pyproject.toml --junitxml=ingestion/junit/test-results-$E2E_TEST.xml --ignore=ingestion/tests/unit/source ingestion/tests/cli_e2e/test_cli_$E2E_TEST.py
|
||||
coverage combine --data-file=.coverage.$E2E_TEST --rcfile=ingestion/pyproject.toml --keep .coverage*
|
||||
coverage report --rcfile ingestion/pyproject.toml --data-file .coverage.$E2E_TEST || true
|
||||
|
||||
- name: Upload coverage artifact for Python unit tests
|
||||
|
|
@ -293,7 +293,7 @@ jobs:
|
|||
done
|
||||
source env/bin/activate
|
||||
cd ingestion
|
||||
coverage combine --rcfile=pyproject.toml --keep -a .coverage*
|
||||
coverage combine --rcfile=pyproject.toml --keep .coverage*
|
||||
coverage xml --rcfile=pyproject.toml --data-file=.coverage
|
||||
shell: bash
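The `sitecustomize` change above replaces a relative `COVERAGE_PROCESS_START` value with an absolute one, presumably so that subprocesses started from a different working directory still find the coverage configuration. A minimal Python sketch of the path resolution, with the hypothetical helper name `coverage_config_path`:

```python
import os


def coverage_config_path(environ=None, cwd=None):
    """Hypothetical helper: resolve ingestion/pyproject.toml to an absolute path."""
    environ = os.environ if environ is None else environ
    # Prefer the checkout root exported by GitHub Actions; fall back to the
    # current working directory when running outside CI.
    fallback = os.getcwd() if cwd is None else cwd
    root = environ.get("GITHUB_WORKSPACE", fallback)
    return os.path.join(root, "ingestion", "pyproject.toml")
```

The same value is what the generated `sitecustomize` file assigns to `os.environ['COVERAGE_PROCESS_START']` before calling `coverage.process_startup()`.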
|
||||
|
||||
|
|
|
|||
5
.github/workflows/py-tests.yml
vendored
|
|
@ -99,6 +99,11 @@ jobs:
|
|||
install-server: 'false'
|
||||
|
||||
- name: Run Static Checks
|
||||
# basedpyright is configured with `pythonVersion = "3.10"` (the lowest
|
||||
# supported version) so type-checking results are identical across the
|
||||
# 3.10/3.11/3.12 matrix. Run on the lowest version only to avoid
|
||||
# redundant work and keep the baseline file deterministic.
|
||||
if: matrix.py-version == '3.10'
|
||||
run: |
|
||||
source env/bin/activate
|
||||
cd ingestion
|
||||
|
|
|
|||
171
.github/workflows/security-scan.yml
vendored
|
|
@ -12,7 +12,7 @@
|
|||
name: security-scan
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 0 */2 * *'
|
||||
- cron: "0 0 */2 * *"
|
||||
workflow_dispatch:
|
||||
|
||||
jobs:
|
||||
|
|
@ -27,7 +27,7 @@ jobs:
|
|||
- name: Setup Node.js
|
||||
uses: actions/setup-node@v4
|
||||
with:
|
||||
node-version-file: 'openmetadata-ui/src/main/resources/ui/.nvmrc'
|
||||
node-version-file: "openmetadata-ui/src/main/resources/ui/.nvmrc"
|
||||
|
||||
- name: Enable yarn
|
||||
run: corepack enable
|
||||
|
|
@ -43,7 +43,7 @@ jobs:
|
|||
run: |
|
||||
npx retire@5 \
|
||||
--path node_modules/ \
|
||||
--severity medium \
|
||||
--severity high \
|
||||
--outputformat json \
|
||||
--outputpath retire-report.json
|
||||
|
||||
|
|
@ -124,30 +124,6 @@ jobs:
|
|||
print()
|
||||
EOF
|
||||
|
||||
- name: Slack on Failure
|
||||
if: steps.retire-scan.outcome == 'failure'
|
||||
uses: slackapi/slack-github-action@v1.23.0
|
||||
with:
|
||||
channel-id: ${{ secrets.SLACK_CHANNEL_IDS }}
|
||||
payload: |
|
||||
{
|
||||
"text": "🚨 Vulnerability scan failed, please check it <https://github.com/open-metadata/OpenMetadata/actions/runs/${{ github.run_id }}|here>. 🚨"
|
||||
}
|
||||
env:
|
||||
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
|
||||
|
||||
- name: Slack on Success
|
||||
if: steps.retire-scan.outcome == 'success'
|
||||
uses: slackapi/slack-github-action@v1.23.0
|
||||
with:
|
||||
channel-id: ${{ secrets.SLACK_CHANNEL_IDS }}
|
||||
payload: |
|
||||
{
|
||||
"text": "🟢 Vulnerability scan passed for OpenMetadata Repo, please check it <https://github.com/open-metadata/OpenMetadata/actions/runs/${{ github.run_id }}|here>."
|
||||
}
|
||||
env:
|
||||
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
|
||||
|
||||
- name: Force failure on vulnerabilities found
|
||||
if: steps.retire-scan.outcome == 'failure'
|
||||
run: exit 1
|
||||
|
|
@ -163,25 +139,25 @@ jobs:
|
|||
- name: Free Disk Space (Ubuntu)
|
||||
uses: jlumbroso/free-disk-space@main
|
||||
with:
|
||||
tool-cache: false
|
||||
android: true
|
||||
dotnet: true
|
||||
haskell: true
|
||||
large-packages: false
|
||||
docker-images: true
|
||||
swap-storage: true
|
||||
tool-cache: false
|
||||
android: true
|
||||
dotnet: true
|
||||
haskell: true
|
||||
large-packages: false
|
||||
docker-images: true
|
||||
swap-storage: true
|
||||
- uses: actions/checkout@v4
|
||||
|
||||
- name: Set up Python 3.10
|
||||
uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: '3.10'
|
||||
python-version: "3.10"
|
||||
|
||||
- name: Set up JDK 21
|
||||
uses: actions/setup-java@v4
|
||||
with:
|
||||
java-version: '21'
|
||||
distribution: 'temurin'
|
||||
java-version: "21"
|
||||
distribution: "temurin"
|
||||
|
||||
- name: Install Ubuntu dependencies
|
||||
run: |
|
||||
|
|
@ -215,40 +191,111 @@ jobs:
|
|||
continue-on-error: true
|
||||
run: |
|
||||
source env/bin/activate
|
||||
make snyk-report
|
||||
rm -rf security-report
|
||||
mkdir -p security-report
|
||||
# Run snyk subtargets directly; skip `export-snyk-pdf-report` which deletes JSONs after PDF conversion.
|
||||
make snyk-ingestion-report || true
|
||||
make snyk-ingestion-base-slim-report || true
|
||||
make snyk-airflow-apis-report || true
|
||||
make snyk-server-report || true
|
||||
make snyk-ui-report || true
|
||||
|
||||
- name: Slack on Failure
|
||||
if: steps.security-report.outcome != 'success'
|
||||
uses: slackapi/slack-github-action@v1.23.0
|
||||
with:
|
||||
channel-id: ${{ secrets.SLACK_CHANNEL_IDS }}
|
||||
payload: |
|
||||
{
|
||||
"text": "🚨 Security report failed, please check it <https://github.com/open-metadata/OpenMetadata/actions/runs/${{ github.run_id }}|here>. 🚨"
|
||||
}
|
||||
env:
|
||||
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
|
||||
- name: Publish Snyk Summary
|
||||
id: snyk-summary
|
||||
if: always() && steps.maven-build.outcome == 'success'
|
||||
run: |
|
||||
python3 scripts/snyk_summary.py security-report \
|
||||
--counts-file security-report/_counts.json \
|
||||
--slack-file security-report/_slack.txt \
|
||||
>> $GITHUB_STEP_SUMMARY
|
||||
# Expose counts as step output for downstream gating.
|
||||
counts=$(cat security-report/_counts.json)
|
||||
echo "counts=$counts" >> $GITHUB_OUTPUT
|
||||
high=$(jq '.high + .critical' security-report/_counts.json)
|
||||
echo "high_critical=$high" >> $GITHUB_OUTPUT
|
||||
|
||||
- name: Slack on Success
|
||||
if: steps.security-report.outcome == 'success'
|
||||
uses: slackapi/slack-github-action@v1.23.0
|
||||
with:
|
||||
channel-id: ${{ secrets.SLACK_CHANNEL_IDS }}
|
||||
payload: |
|
||||
{
|
||||
"text": "🟢 Security report generated for OpenMetadata Repo , please check it <https://github.com/open-metadata/OpenMetadata/actions/runs/${{ github.run_id }}|here>."
|
||||
}
|
||||
env:
|
||||
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
|
||||
- name: Fail on high/critical Snyk findings
|
||||
if: always() && steps.snyk-summary.outputs.high_critical != '' && steps.snyk-summary.outputs.high_critical != '0'
|
||||
run: |
|
||||
echo "::error::Snyk found ${{ steps.snyk-summary.outputs.high_critical }} high/critical vulnerabilities (see Job Summary)"
|
||||
exit 1
|
||||
|
||||
- name: Upload Snyk Report HTML files
|
||||
if: steps.security-report.outcome == 'success'
|
||||
- name: Generate Snyk HTML/PDF
|
||||
if: always() && steps.maven-build.outcome == 'success'
|
||||
run: |
|
||||
# Back up JSONs because html_to_pdf.py deletes them after PDF conversion.
|
||||
mkdir -p /tmp/snyk-json-backup
|
||||
cp security-report/*.json /tmp/snyk-json-backup/ 2>/dev/null || true
|
||||
make export-snyk-pdf-report || true
|
||||
# Restore JSONs alongside generated PDFs/HTMLs.
|
||||
cp /tmp/snyk-json-backup/*.json security-report/ 2>/dev/null || true
|
||||
|
||||
- name: Upload Snyk Reports
|
||||
if: always() && steps.maven-build.outcome == 'success'
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: security-report
|
||||
path: security-report
|
||||
retention-days: 30
|
||||
|
||||
- name: Force failure
|
||||
if: steps.maven-build.outcome != 'success' || steps.security-report.outcome != 'success'
|
||||
run: |
|
||||
exit 1
|
||||
|
||||
notify:
|
||||
runs-on: ubuntu-latest
|
||||
environment: security-scan
|
||||
needs: [vulnerability-scan, security-scan]
|
||||
if: always()
|
||||
steps:
|
||||
- name: Download Snyk artifact
|
||||
if: needs.security-scan.result != 'skipped'
|
||||
uses: actions/download-artifact@v4
|
||||
with:
|
||||
name: security-report
|
||||
path: security-report
|
||||
continue-on-error: true
|
||||
|
||||
- name: Build Slack payload
|
||||
id: build
|
||||
run: |
|
||||
retire="${{ needs.vulnerability-scan.result }}"
|
||||
snyk="${{ needs.security-scan.result }}"
|
||||
status_icon() {
|
||||
case "$1" in
|
||||
success) echo "✅" ;;
|
||||
cancelled) echo "⚠️ (cancelled)" ;;
|
||||
skipped) echo "⚠️ (skipped)" ;;
|
||||
*) echo "❌" ;;
|
||||
esac
|
||||
}
|
||||
retire_icon=$(status_icon "$retire")
|
||||
snyk_icon=$(status_icon "$snyk")
|
||||
if [ "$retire" = "success" ] && [ "$snyk" = "success" ]; then
|
||||
icon="🟢"
|
||||
elif [ "$retire" = "failure" ] || [ "$snyk" = "failure" ]; then
|
||||
icon="🚨"
|
||||
else
|
||||
icon="⚠️"
|
||||
fi
|
||||
run_url="https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
|
||||
{
|
||||
echo "$icon *Security scan* — *OpenMetadata Repo* on branch \`${{ github.ref_name }}\`"
|
||||
echo "• Vulnerability scan (Retire.js): $retire_icon"
|
||||
echo "• Security scan (Snyk): $snyk_icon"
|
||||
echo "<$run_url|Open run details>"
|
||||
if [ -f security-report/_slack.txt ]; then
|
||||
echo
|
||||
cat security-report/_slack.txt
|
||||
fi
|
||||
} > slack_body.txt
|
||||
jq -Rs '{text: ., mrkdwn: true}' slack_body.txt > payload.json
|
||||
|
||||
- name: Send Slack Notification
|
||||
uses: slackapi/slack-github-action@v1.27.1
|
||||
with:
|
||||
channel-id: ${{ secrets.SLACK_CHANNEL_IDS }}
|
||||
payload-file-path: payload.json
|
||||
env:
|
||||
SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
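The gating and notification decisions in the reworked security-scan workflow can be summarised in a few lines. The Python sketch below mirrors the `jq '.high + .critical'` sum and the `status_icon` shell function from the `notify` job. The function names are hypothetical, and the shape of the counts file is assumed from the jq expression only.

```python
def high_critical(counts):
    """Sum used to gate the job; assumes 'high' and 'critical' keys as in the jq call."""
    return counts["high"] + counts["critical"]


def status_icon(result):
    """Per-job icon, following the case statement in the notify job."""
    icons = {
        "success": "✅",
        "cancelled": "⚠️ (cancelled)",
        "skipped": "⚠️ (skipped)",
    }
    # Any other result (for example "failure") is shown as a cross.
    return icons.get(result, "❌")


def overall_icon(retire, snyk):
    """Headline icon: green only if both jobs succeed, alarm if either fails."""
    if retire == "success" and snyk == "success":
        return "🟢"
    if retire == "failure" or snyk == "failure":
        return "🚨"
    return "⚠️"
```

Consolidating the Slack message into one `notify` job means a single message per run reports both scans, rather than separate success and failure messages per job.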
|
||||
|
|
|
|||
11
.gitignore
vendored
|
|
@ -23,6 +23,8 @@ release.properties
|
|||
dependency-reduced-pom.xml
|
||||
buildNumber.properties
|
||||
.mvn/timing.properties
|
||||
.claude/*
|
||||
.claude
|
||||
.maestro
|
||||
catalog-services/catalog-services.iml
|
||||
|
||||
|
|
@ -161,7 +163,7 @@ ingestion/.nox/
|
|||
_bmad/
|
||||
|
||||
# Claude Flow generated files
|
||||
.claude/settings.local.json
|
||||
.claude/*
|
||||
.mcp.json
|
||||
claude-flow.config.json
|
||||
.swarm/
|
||||
|
|
@ -197,5 +199,12 @@ ingestion/.claude/agents
|
|||
# Connector audit working files — per-session, never committed
|
||||
.claude/audit-results/
|
||||
.claude/connector-audit.json
|
||||
.claude/scheduled_tasks.lock
|
||||
.claude/plans/
|
||||
|
||||
# Serena MCP language-server cache — local tooling, not committed
|
||||
.serena/
|
||||
|
||||
test-results/
|
||||
|
||||
docs/superpowers/*
|
||||
|
|
|
|||
|
|
@ -2,28 +2,30 @@ default_language_version:
|
|||
python: python3
|
||||
repos:
|
||||
- repo: https://github.com/pre-commit/pre-commit-hooks
|
||||
rev: v2.3.0
|
||||
rev: v5.0.0
|
||||
hooks:
|
||||
- id: check-json
|
||||
exclude: vscode
|
||||
- repo: https://github.com/hadialqattan/pycln
|
||||
rev: v2.5.0
|
||||
# TODO: investigate and fix or remove the excluded files. The first
|
||||
# three carry real JSON issues (duplicate keys, malformed/empty
|
||||
# content) that pre-commit-hooks v2.3.0 didn't catch; v5.0.0 does.
|
||||
# The last is an intentionally malformed test fixture.
|
||||
exclude: |
|
||||
(?x)^(
|
||||
.*vscode.*|
|
||||
openmetadata-spec/src/main/resources/rdf/contexts/dataAsset\.jsonld|
|
||||
ingestion/examples/sample_data/pipelines/tasks\.json|
|
||||
openmetadata-service/src/main/resources/dataInsights/opensearch/indexSettingsTemplate\.json|
|
||||
openmetadata-ui/src/main/resources/ui/playwright/test-data/odcs-examples/invalid-malformed\.json
|
||||
)$
|
||||
- repo: https://github.com/astral-sh/ruff-pre-commit
|
||||
rev: v0.15.12
|
||||
hooks:
|
||||
- id: pycln
|
||||
- id: ruff-check
|
||||
files: ^(ingestion|openmetadata-airflow-apis)/
|
||||
args: [ "--config", "ingestion/pyproject.toml" ]
|
||||
- repo: https://github.com/timothycrosley/isort
|
||||
rev: 5.12.0
|
||||
hooks:
|
||||
- id: isort
|
||||
args: ["--fix", "--config", "ingestion/pyproject.toml"]
|
||||
- id: ruff-format
|
||||
files: ^(ingestion|openmetadata-airflow-apis)/
|
||||
args: [ "--settings-file", "ingestion/pyproject.toml" ]
|
||||
- repo: https://github.com/ambv/black
|
||||
rev: 22.3.0
|
||||
hooks:
|
||||
- id: black
|
||||
files: ^(ingestion|openmetadata-airflow-apis)/
|
||||
args: [ "--config", "ingestion/pyproject.toml" ]
|
||||
args: ["--config", "ingestion/pyproject.toml"]
|
||||
- repo: https://github.com/pre-commit/mirrors-prettier
|
||||
rev: v2.5.1
|
||||
hooks:
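The new `check-json` exclude is a verbose-mode regular expression. Since pre-commit evaluates `exclude` patterns as Python regular expressions against repository-relative paths, its effect can be checked with a short sketch. The helper name `is_excluded` is hypothetical; the pattern is copied from the hunk above.

```python
import re

# Pattern copied from the check-json hook; (?x) makes whitespace and
# newlines inside the pattern insignificant.
EXCLUDE = r"""(?x)^(
    .*vscode.*|
    openmetadata-spec/src/main/resources/rdf/contexts/dataAsset\.jsonld|
    ingestion/examples/sample_data/pipelines/tasks\.json|
    openmetadata-service/src/main/resources/dataInsights/opensearch/indexSettingsTemplate\.json|
    openmetadata-ui/src/main/resources/ui/playwright/test-data/odcs-examples/invalid-malformed\.json
)$"""


def is_excluded(path):
    """Hypothetical helper: True when the hook would skip this path."""
    return re.search(EXCLUDE, path) is not None
```

Because the dots are escaped and the alternation is anchored with `^` and `$`, only the listed files and any path containing `vscode` are skipped; other JSON files are still checked.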
|
||||
|
|
|
|||
33
.pylintrc
|
|
@ -1,33 +0,0 @@
|
|||
[BASIC]
|
||||
# W1203: logging-fstring-interpolation - f-string brings better readability and unifies style
|
||||
# W1202: logging-format-interpolation - lazy formatting in logging functions
|
||||
# R0903: too-few-public-methods - False negatives in pydantic classes
|
||||
# W0707: raise-missing-from - Tends to be a false positive as exception are closely encapsulated
|
||||
# R0901: too-many-ancestors - We are already inheriting from SQA classes with a bunch of ancestors
|
||||
# W0703: broad-except - We are dealing with many different source systems, but we want to make sure workflows run until the end
|
||||
# W0511: fixme - These are internal notes and guides
|
||||
# W1518: method-cache-max-size-none - allow us to use LRU Cache with maxsize `None` to speed up certain calls
|
||||
disable=W1203,W1202,R0903,W0707,R0901,W1201,W0703,W0511,W1518
|
||||
|
||||
docstring-min-length=20
|
||||
max-args=7
|
||||
max-attributes=12
|
||||
|
||||
# usual typevar naming
|
||||
good-names=T,C,fn,db,df,i
|
||||
module-rgx=(([a-z_][a-z0-9_]*)|([a-zA-Z0-9]+))$
|
||||
|
||||
[MASTER]
|
||||
fail-under=6.0
|
||||
init-hook='from pylint.config import find_default_config_files; import os, sys; sys.path.append(os.path.dirname(next(find_default_config_files())))'
|
||||
extension-pkg-allow-list=pydantic
|
||||
load-plugins=ingestion.plugins.print_checker,ingestion.plugins.import_checker
|
||||
max-public-methods=25
|
||||
|
||||
[MESSAGES CONTROL]
|
||||
disable=no-name-in-module,import-error,duplicate-code
|
||||
enable=useless-suppression
|
||||
|
||||
[FORMAT]
|
||||
# We all have big monitors now
|
||||
max-line-length=120
|
||||
255
AGENTS.md
Normal file
|
|
@ -0,0 +1,255 @@
|
|||
# AGENTS.md
|
||||
|
||||
This file provides guidance to Codex (Codex.ai/code) when working with code in this repository.
|
||||
|
||||
## About OpenMetadata
|
||||
|
||||
OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance. This is a multi-module project with Java backend services, React frontend, Python ingestion framework, and comprehensive Docker infrastructure.
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
- **Backend**: Java 21 + Dropwizard REST API framework, multi-module Maven project
|
||||
- **Frontend**: React + TypeScript + Ant Design, built with Webpack and Yarn
|
||||
- **Ingestion**: Python 3.10-3.12 with Pydantic 2.x, 75+ data source connectors
|
||||
- **Database**: MySQL (default) or PostgreSQL with Flyway migrations
|
||||
- **Search**: Elasticsearch 7.17+ or OpenSearch 2.6+ for metadata discovery
|
||||
- **Infrastructure**: Apache Airflow for workflow orchestration
|
||||
|
||||
## Essential Development Commands
|
||||
|
||||
### Prerequisites and Setup
|
||||
```bash
make prerequisites              # Check system requirements
make install_dev_env            # Install all development dependencies
make yarn_install_cache         # Install UI dependencies
```
|
||||
|
||||
### Frontend Development
|
||||
```bash
cd openmetadata-ui/src/main/resources/ui
yarn start                      # Start development server on localhost:3000
yarn test                       # Run Jest unit tests
yarn test path/to/test.spec.ts  # Run a specific test file
yarn test:watch                 # Run tests in watch mode
yarn playwright:run             # Run E2E tests
yarn lint                       # ESLint check
yarn lint:fix                   # ESLint with auto-fix
yarn build                      # Production build
```
|
||||
|
||||
### Backend Development
|
||||
```bash
mvn clean package -DskipTests                         # Build without tests
mvn clean package -DonlyBackend -pl !openmetadata-ui  # Backend only
mvn test                                              # Run unit tests
mvn verify                                            # Run integration tests
mvn spotless:apply                                    # Format Java code
```
|
||||
|
||||
### Python Ingestion Development
|
||||
```bash
cd ingestion
make install_dev_env            # Install in development mode
make generate                   # Generate Pydantic models from JSON schemas
make unit_ingestion_dev_env     # Run unit tests
make py_format                  # Apply ruff lint-fix + format
make py_format_check            # Verify lint + format (matches CI; catches non-auto-fixable issues)
make static-checks              # Run type checking with basedpyright
```
|
||||
|
||||
### Full Local Environment
|
||||
```bash
./docker/run_local_docker.sh -m ui -d mysql          # Complete local setup with UI
./docker/run_local_docker.sh -m no-ui -d postgresql  # Backend only with PostgreSQL
./docker/run_local_docker.sh -s true                 # Skip Maven build step
```
|
||||
|
||||
### Testing
|
||||
```bash
make run_e2e_tests              # Full E2E test suite
make unit_ingestion             # Python unit tests with coverage
yarn test:coverage              # Frontend test coverage
```
|
||||
|
||||
## Code Generation and Schemas
|
||||
|
||||
OpenMetadata uses a schema-first approach with JSON Schema definitions driving code generation:
|
||||
|
||||
```bash
make generate                   # Generate all models from schemas
make py_antlr                   # Generate Python ANTLR parsers
make js_antlr                   # Generate JavaScript ANTLR parsers
yarn parse-schema               # Parse JSON schemas for frontend (connection and ingestion schemas)
```
|
||||
|
||||
### Schema Architecture
|
||||
- **Source schemas** in `openmetadata-spec/` define the canonical data models
|
||||
- **Connection schemas** are pre-processed at build time via `parseSchemas.js` to resolve all `$ref` references
|
||||
- **Application schemas** in `openmetadata-ui/.../ApplicationSchemas/` are resolved at runtime using `schemaResolver.ts`
|
||||
- JSON schemas with `$ref` references to external files require resolution before use in forms
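
To make the `$ref` resolution step concrete, here is a minimal Python sketch of what "resolving" means. It is an illustration only, not the logic of `parseSchemas.js` or `schemaResolver.ts`: the `resolve_refs` function and the in-memory `documents` mapping are hypothetical, and the sketch assumes whole-file references with no JSON Pointer fragments and no cyclic references.

```python
def resolve_refs(node, documents):
    """Recursively inline `$ref` entries.

    `documents` is a hypothetical mapping from a reference string
    (for example a file name) to the already-parsed schema it points to.
    Assumes there are no cycles and no `#/...` pointer fragments.
    """
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            # Replace the referencing object with the (resolved) target schema.
            return resolve_refs(documents[ref], documents)
        return {key: resolve_refs(value, documents) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, documents) for item in node]
    return node


# Example data, invented for illustration.
documents = {
    "connection.json": {
        "type": "object",
        "properties": {"hostPort": {"type": "string"}},
    }
}
schema = {
    "type": "object",
    "properties": {"config": {"$ref": "connection.json"}},
}
resolved = resolve_refs(schema, documents)
```

After resolution, `resolved` contains no `$ref` keys, which is the shape a form renderer can consume directly.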
|
||||
|
||||
## Key Directories
|
||||
|
||||
- `openmetadata-service/` - Core Java backend services and REST APIs
|
||||
- `openmetadata-ui/src/main/resources/ui/` - React frontend application
|
||||
- `ingestion/` - Python ingestion framework with connectors
|
||||
- `openmetadata-spec/` - JSON Schema specifications for all entities
|
||||
- `bootstrap/sql/` - Database schema migrations and sample data
|
||||
- `conf/` - Configuration files for different environments
|
||||
- `docker/` - Docker configurations for local and production deployment
|
||||
|
||||
## Development Workflow
|
||||
|
||||
1. **Schema Changes**: Modify JSON schemas in `openmetadata-spec/`, then run `mvn clean install` on openmetadata-spec to update models
|
||||
2. **Backend**: Develop in Java using Dropwizard patterns, test with `mvn test`, format with `mvn spotless:apply`
|
||||
3. **Frontend**: Use React/TypeScript with Ant Design components, test with Jest/Playwright
|
||||
4. **Ingestion**: Python connectors follow plugin pattern, use `make install_dev_env` for development
|
||||
5. **Full Testing**: Use `make run_e2e_tests` before major changes
|
||||
|
||||
## Frontend Architecture Patterns
|
||||
|
||||
### React Component Patterns
|
||||
- **File Naming**: Components use `ComponentName.component.tsx`, interfaces use `ComponentName.interface.ts`
|
||||
- **State Management**: Use `useState` with proper typing, avoid `any`
|
||||
- **Side Effects**: Use `useEffect` with proper dependency arrays
|
||||
- **Performance**: Use `useCallback` for event handlers, `useMemo` for expensive computations
|
||||
- **Custom Hooks**: Prefix with `use`, place in `src/hooks/`, return typed objects
|
||||
- **Internationalization**: Use `useTranslation` hook from react-i18next, access with `t('key')`
|
||||
- **Component Structure**: Functional components only, no class components
|
||||
- **Props**: Define interfaces for all component props, place in `.interface.ts` files
|
||||
- **Loading States**: Use object state for multiple loading states: `useState<Record<string, boolean>>({})`
|
||||
- **Error Handling**: Use `showErrorToast` and `showSuccessToast` utilities from ToastUtils
|
||||
- **Navigation**: Use `useNavigate` from react-router-dom, not direct history manipulation
|
||||
- **Data Fetching**: Async functions with try-catch blocks, update loading states appropriately
|
||||
|
||||
### State Management
|
||||
- Use Zustand stores for global state (e.g., `useLimitStore`, `useWelcomeStore`)
|
||||
- Keep component state local when possible with `useState`
|
||||
- Use context providers for feature-specific shared state (e.g., `ApplicationsProvider`)
|
||||
|
||||
### Styling
|
||||
|
||||
- **MUI Migration**: The project is gradually migrating from Ant Design to Material-UI (MUI) v7.3.1
|
||||
- **Preferred Approach**: Use MUI components v7.3.1 and styles wherever possible for new features
|
||||
- **Theme and Styles**: MUI theme data and styles are defined in `openmetadata-ui-core-components`
|
||||
- **Colors and Design Tokens**: Always reference theme colors and design tokens from the MUI theme, not hardcoded values
|
||||
- **Legacy Components**: Ant Design components remain in existing code but should be replaced with MUI equivalents when refactoring
|
||||
- Do not add unnecessary spacing between logs and code.
|
||||
- In Java, avoid wildcard imports (e.g., use `import java.util.List;` instead of `import java.util.*;`)
|
||||
- Custom styles in `.less` files with component-specific naming (legacy pattern)
|
||||
- Follow BEM naming convention for custom CSS classes
|
||||
- Use CSS modules where appropriate
|
||||
|
||||
### UI considerations
|
||||
|
||||
- Do not use string literals anywhere in the UI. Use the `useTranslation` hook, e.g. `const { t } = useTranslation()`. For example, to render "Run", use `{ t('label.run') }`; the label is defined in the locale files.
|
||||
|
||||
|
||||
### Application Configuration
|
||||
- Applications use `ApplicationsClassBase` for schema loading and configuration
|
||||
- Dynamic imports handle application-specific schemas and assets
|
||||
- Form schemas use React JSON Schema Form (RJSF) with custom UI widgets
|
||||
|
||||
### Service Utilities
|
||||
- Each service type has dedicated utility files (e.g., `DatabaseServiceUtils.tsx`)
|
||||
- Connection schemas are imported statically and pre-resolved
|
||||
- Service configurations use switch statements to map types to schemas
|
||||
|
||||
### Type Safety
|
||||
- All API responses have generated TypeScript interfaces in `generated/`
|
||||
- Custom types extend base interfaces when needed
|
||||
- Avoid type assertions unless absolutely necessary
|
||||
- Use discriminated unions for action types and state variants
|
||||
|
||||
## Database and Migrations
|
||||
|
||||
- Flyway handles schema migrations in `bootstrap/sql/migrations/`
|
||||
- Use Docker containers for local database setup
|
||||
- MySQL is the default; PostgreSQL is supported as an alternative
|
||||
- Sample data loaded automatically in development environment
|
||||
|
||||
## Security and Authentication
|
||||
|
||||
- JWT-based authentication with OAuth2/SAML support
|
||||
- Role-based access control defined in Java entities
|
||||
- Security configurations in `conf/openmetadata.yaml`
|
||||
- Never commit secrets - use environment variables or secure vaults
|
||||
|
||||
## Code Generation Standards
|
||||
|
||||
### Comments Policy
|
||||
- **Do NOT add unnecessary comments** - write self-documenting code
|
||||
- **NEVER add single-line comments that describe what the code obviously does**
|
||||
- Only include comments for:
|
||||
- Complex business logic that isn't obvious
|
||||
- Non-obvious algorithms or workarounds
|
||||
- Public API JavaDoc documentation
|
||||
- TODO/FIXME with ticket references
|
||||
- Bad examples (NEVER do this):
|
||||
- `// Create user` before `createUser()`
|
||||
- `// Get client` before `SdkClients.adminClient()`
|
||||
- `// Verify domain is set` before `assertNotNull(entity.getDomain())`
|
||||
- `// User names are lowercased` when the code `toLowerCase()` makes it obvious
|
||||
- If the code needs a comment to be understood, refactor the code to be clearer instead
|
||||
|
||||
### Java Code Requirements
|
||||
- **Always run `mvn spotless:apply`** before finishing any task that touched
|
||||
`.java` files. CI runs `mvn spotless:check` and will fail the PR otherwise
|
||||
(bot's exact phrasing: "Please run `mvn spotless:apply` in the root of your
|
||||
repository and commit the changes to this PR"). Scope with `-pl <module>`
|
||||
for speed if only one module changed. A reusable procedure is written up at
|
||||
`.agents/skills/java-checkstyle/SKILL.md`.
|
||||
- Use clear, descriptive variable and method names instead of comments
|
||||
- Follow existing project patterns and conventions
|
||||
- Generate production-ready code, not tutorial code
|
||||
- Create integration tests in openmetadata-integration-tests
|
||||
- Do not use fully qualified names in the code (e.g., `org.openmetadata.schema.type.Status`); import the class and use its simple name instead
|
||||
- Do not use wildcard imports; import exactly the classes you need
|
||||
|
||||
### TypeScript/Frontend Code Requirements
|
||||
- **Always run the UI checkstyle sequence** before finishing any task that
|
||||
touched `.ts`/`.tsx`/`.js`/`.jsx`/`.json` under
|
||||
`openmetadata-ui/src/main/resources/ui/src/`, `.../playwright/`, or
|
||||
`openmetadata-ui-core-components/src/main/resources/ui/src/`. CI's
|
||||
`UI Checkstyle / lint-src|lint-playwright|lint-core-components` jobs fail
|
||||
the PR otherwise. Order matters: `organize-imports-cli` → `eslint --fix` →
|
||||
`prettier --write`. A reusable procedure lives at
|
||||
`.agents/skills/ui-checkstyle/SKILL.md`.
|
||||
- **NEVER use `any` type** in TypeScript code - always use proper types
|
||||
- Use `unknown` when the type is truly unknown and add type guards
|
||||
- Import types from existing type definitions (e.g., `RJSFSchema` from `@rjsf/utils`)
|
||||
- Follow ESLint rules strictly - the project enforces no-console, proper formatting
|
||||
- Add `// eslint-disable-next-line` comments only when absolutely necessary
|
||||
- **Import Organization** (in order):
|
||||
1. External libraries (React, Ant Design, etc.)
|
||||
2. Internal absolute imports from `generated/`, `constants/`, `hooks/`, etc.
|
||||
3. Relative imports for utilities and components
|
||||
4. Asset imports (SVGs, styles)
|
||||
5. Type imports grouped separately when needed
|
||||
|
||||
### Python Code Requirements
|
||||
- **Use pytest, not unittest** - write tests using pytest style with plain `assert` statements
|
||||
- Use pytest fixtures for test setup instead of `setUp`/`tearDown` methods
|
||||
- Use `unittest.mock` for mocking (MagicMock, patch) - this is compatible with pytest
|
||||
- Test classes should not inherit from `TestCase` - use plain classes prefixed with `Test`
|
||||
- Use `assert x == y` instead of `self.assertEqual(x, y)`
|
||||
- Use `assert x is None` instead of `self.assertIsNone(x)`
|
||||
- Use `assert "text" in string` instead of `self.assertIn("text", string)`
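
The pytest conventions above can be sketched as follows. This is a minimal illustration, not code from the repository: `fetch_table_names` and the mocked `client` are hypothetical names, and the example uses only `unittest.mock` so that it stays free of third-party imports (pytest discovers the `Test`-prefixed class without any import).

```python
from unittest.mock import MagicMock


def fetch_table_names(client, schema):
    """Hypothetical helper: sorted table names, or None when the schema is empty."""
    tables = client.list_tables(schema)
    return sorted(tables) if tables else None


class TestFetchTableNames:
    """Plain class prefixed with `Test`; it does not inherit from `TestCase`."""

    def test_returns_sorted_names(self):
        client = MagicMock()  # stands in for an external service boundary
        client.list_tables.return_value = ["orders", "customers"]
        # Plain assert instead of self.assertEqual(...)
        assert fetch_table_names(client, "sales") == ["customers", "orders"]

    def test_returns_none_for_empty_schema(self):
        client = MagicMock()
        client.list_tables.return_value = []
        # Plain assert instead of self.assertIsNone(...)
        assert fetch_table_names(client, "empty") is None

    def test_known_table_is_listed(self):
        client = MagicMock()
        client.list_tables.return_value = ["orders", "customers"]
        # Plain assert instead of self.assertIn(...)
        assert "orders" in fetch_table_names(client, "sales")
```

Shared setup such as the mocked client would normally move into a pytest fixture; it is repeated inline here only to keep the sketch dependency-free.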
|
||||
|
||||
### Python Ingestion Connector Guidelines
|
||||
- **Keep connector-specific logic in connector-specific files**, not in generic/shared files like `builders.py`
|
||||
- Example: Redshift IAM auth should be in `ingestion/src/metadata/ingestion/source/database/redshift/connection.py`, not in `ingestion/src/metadata/ingestion/connections/builders.py`
|
||||
- This keeps the codebase modular and prevents generic utilities from becoming cluttered with connector-specific edge cases
|
||||
|
||||
### Testing Philosophy
|
||||
- **Test real behavior, not mock wiring** - if a test requires mocking 3+ classes just to verify a method call, it's testing the wrong thing
|
||||
- **Prefer integration tests** over heavily-mocked unit tests. This project has full integration test infrastructure (OpenMetadataApplicationTest, Docker containers, real OpenSearch). Use it.
|
||||
- **Mocks are for boundaries, not internals** - mock external services (HTTP clients, third-party APIs), not your own classes. If you're mocking static methods left and right to test internal plumbing, write an integration test instead.
|
||||
- **A test that mocks everything proves nothing** - it only verifies that your mocks are wired correctly, not that the system works
|
||||
- **Ask "what breaks if this test passes but the code is wrong?"** - if the answer is "nothing, because everything real is mocked out", delete the test and write a better one
|
||||
- **Test the outcome, not the implementation** - assert on observable results (API responses, database state, stats values) rather than verifying internal method calls with `verify()`
|
||||
|
||||
### Response Format
|
||||
- Provide clean code blocks without unnecessary explanations
|
||||
- Assume readers are experienced developers
|
||||
- Focus on functionality over education
|
||||
104
CLAUDE.md
|
|
@ -120,8 +120,8 @@ cd ingestion
|
|||
make install_dev_env # Install in development mode
|
||||
make generate # Generate Pydantic models from JSON schemas
|
||||
make unit_ingestion_dev_env # Run unit tests
|
||||
make lint # Run pylint
|
||||
make py_format # Format with black, isort, pycln
|
||||
make py_format # Apply ruff lint-fix + format
|
||||
make py_format_check # Verify lint + format (matches CI; catches non-auto-fixable issues)
|
||||
make static-checks # Run type checking with basedpyright
|
||||
```
|
||||
|
||||
|
|
@ -139,6 +139,22 @@ make unit_ingestion # Python unit tests with coverage
|
|||
yarn test:coverage # Frontend test coverage
|
||||
```
|
||||
|
||||
### Backend Integration Tests
|
||||
All backend API integration tests MUST be placed in `openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/` directory. Tests should:
|
||||
- Use naming convention `*IT.java` (Integration Test)
|
||||
- Extend `BaseEntityIT<T, K>` for entity CRUD tests
|
||||
- Be designed to run concurrently (use `@Execution(ExecutionMode.CONCURRENT)`)
|
||||
- Use `TestNamespace` for test isolation
|
||||
- Use `SdkClients` for API calls (e.g., `SdkClients.adminClient().tables().create(...)`)
|
||||
|
||||
```bash
# Run a specific integration test
mvn test -pl openmetadata-integration-tests -Dtest=TaskResourceIT

# Run all integration tests
mvn test -pl openmetadata-integration-tests
```
|
||||
|
||||
## Code Generation and Schemas
|
||||
|
||||
OpenMetadata uses a schema-first approach with JSON Schema definitions driving code generation:
|
||||
|
|
@ -267,11 +283,42 @@ yarn parse-schema # Parse JSON schemas for frontend (connection and
|
|||
|
||||
### Java Code Requirements
|
||||
|
||||
**Always run `mvn spotless:apply` when generating/modifying .java files.**
|
||||
**Always run `mvn spotless:apply` before you finish any task that touched
|
||||
`.java` files.** CI runs `mvn spotless:check` and will fail the PR otherwise —
|
||||
the bot's exact suggestion is "Please run `mvn spotless:apply` in the root of
|
||||
your repository and commit the changes to this PR." Scope the run with
|
||||
`-pl <module>` for speed if only one module changed. When asked to "fix
|
||||
checkstyle" / "fix Java formatting" / "apply spotless", invoke the
|
||||
`java-checkstyle` skill (see `.claude/skills/java-checkstyle/`) rather than
|
||||
hand-editing formatting.
|
||||
|
||||
#### Method Size and Complexity (Kafka-Grade Standards)
|
||||
- **Methods must be 15 lines or fewer** (excluding blank lines and braces). If a method is longer, break it into smaller focused methods with descriptive names.
|
||||
- **Maximum 3 levels of nesting.** Use early returns to reduce nesting:
|
||||
- **Methods must be small and focused — aim for 15 lines or fewer** (excluding blank lines and braces). A method longer than that is almost always hiding multiple responsibilities; break it into smaller methods with descriptive names. "Meaningful" means each method does one nameable thing — if you can't fit the body comfortably on a screen, it's too big.
|
||||
- **One return statement per method, placed at the end.** No early-return guard clauses, no scattered returns in the middle. Initialize a `result` variable, structure the work as `if/else`, or extract a helper — the control flow then stays linear and easy to reason about. (Returns inside `lambda` bodies, `switch` expressions, and anonymous classes are scoped to those constructs and don't count against the outer method.)
|
||||
```java
// BAD: four scattered early returns
Map<UUID, X> compute(List<EntityInterface> entities) {
  if (entities == null) return Collections.emptyMap();
  if (entities.isEmpty()) return Collections.emptyMap();
  if (!supportsX(entities.get(0))) return null;
  Map<UUID, X> prefetched = doWork(entities);
  if (prefetched.isEmpty()) return null;
  return prefetched;
}

// GOOD: single trailing return; guards become extracted helpers + a result variable
Map<UUID, X> compute(List<EntityInterface> entities) {
  Map<UUID, X> result = null;
  if (entities != null && !entities.isEmpty() && supportsX(entities.get(0))) {
    Map<UUID, X> prefetched = doWork(entities);
    if (!prefetched.isEmpty()) {
      result = prefetched;
    }
  }
  return result;
}
```
|
||||
- **Maximum 3 levels of nesting.** Don't flatten by sprinkling early returns — extract a named helper or combine conditions into a single boolean:
|
||||
```java
|
||||
// BAD: deeply nested
|
||||
if (entity != null) {
|
||||
|
|
@ -282,11 +329,14 @@ yarn parse-schema # Parse JSON schemas for frontend (connection and
|
|||
}
|
||||
}
|
||||
|
||||
// GOOD: early returns, flat
|
||||
if (entity == null) return;
|
||||
if (!entity.isActive()) return;
|
||||
if (!hasPermission(entity)) return;
|
||||
process(entity);
|
||||
// GOOD: extract the eligibility check
|
||||
if (isEligibleForProcessing(entity)) {
|
||||
process(entity);
|
||||
}
|
||||
|
||||
private boolean isEligibleForProcessing(Entity entity) {
|
||||
return entity != null && entity.isActive() && hasPermission(entity);
|
||||
}
|
||||
```
|
||||
- **Maximum 10 cyclomatic complexity.** Extract complex conditions into named methods:
|
||||
```java
|
||||
|
|
@ -379,6 +429,19 @@ yarn parse-schema # Parse JSON schemas for frontend (connection and
|
|||
- Use `List.of()`, `Map.of()`, `Set.of()` for immutable collection literals
|
||||
- Use `Optional` correctly: never as a field type, never as a parameter, never assign `null` to it
|
||||
- Use text blocks `"""` for multi-line strings
|
||||
- **Use `SequencedCollection` accessors on Lists/Deques** — `list.getFirst()` / `list.getLast()` (Java 21) instead of `list.get(0)` / `list.get(list.size() - 1)`. Same for `removeFirst()` / `removeLast()`. Reads more clearly and avoids off-by-one indexing.
|
||||
- **Collection emptiness: use the project's `nullOrEmpty(...)` helper** from `org.openmetadata.common.utils.CommonUtil` instead of hand-rolling `coll != null && !coll.isEmpty()` (or its negation). It's the established idiom across this codebase, handles `null` correctly, and reads as a single semantic check. Same applies to `String` checks — use `nullOrEmpty(str)` not `str != null && !str.isEmpty()`.
|
||||
```java
// BAD
if (entities != null && !entities.isEmpty()) {
  process(entities.get(0));
}

// GOOD
if (!nullOrEmpty(entities)) {
  process(entities.getFirst());
}
```
|
||||
|
||||
#### Common Bug Patterns to Avoid
|
||||
- `equals()` without `hashCode()` (or vice versa)
|
||||
|
|
@ -405,6 +468,19 @@ yarn parse-schema # Parse JSON schemas for frontend (connection and
|
|||
- One statement per line — no `if (x) return y;` on one line
|
||||
|
||||
### TypeScript/Frontend Code Requirements
|
||||
|
||||
**Always run the UI checkstyle sequence before you finish any task that
|
||||
touched `.ts`/`.tsx`/`.js`/`.jsx`/`.json` under
|
||||
`openmetadata-ui/src/main/resources/ui/src/`, `.../playwright/`, or
|
||||
`openmetadata-ui-core-components/src/main/resources/ui/src/`.** CI's
|
||||
`UI Checkstyle / lint-src|lint-playwright|lint-core-components` jobs fail the
|
||||
PR otherwise. The order matters — run `organize-imports-cli`, then
|
||||
`eslint --fix`, then `prettier --write`; reversing organize-imports and
|
||||
prettier leaves a dirty diff (organize-imports uses 4-space indentation,
|
||||
prettier uses 2 + trailing commas). When asked to "fix UI checkstyle" / "run
|
||||
prettier" / "fix UI lint", invoke the `ui-checkstyle` skill (see
|
||||
`.claude/skills/ui-checkstyle/`) rather than hand-editing formatting.
|
||||
|
||||
- **NEVER use `any` type** in TypeScript code - always use proper types
|
||||
- Use `unknown` when the type is truly unknown and add type guards
|
||||
- Import types from existing type definitions (e.g., `RJSFSchema` from `@rjsf/utils`)
|
||||
|
|
@ -455,6 +531,14 @@ These checks run automatically in CI. Code that violates them **will not merge**
|
|||
- This keeps the codebase modular and prevents generic utilities from becoming cluttered with connector-specific edge cases
|
||||
- **Use `model_str()` for Pydantic RootModel to string conversion** — OpenMetadata schema types like `ColumnName`, `EntityName`, `FullyQualifiedEntityName`, and `UUID` are Pydantic `RootModel[str]` subclasses where `str()` returns `"root='value'"` instead of the raw value. Always use `model_str()` from `metadata.ingestion.ometa.utils` instead of manual `hasattr(x, "root")` / `str(x.root)` checks.
|
||||
|
||||
### Caching
|
||||
- **All caches MUST be bounded.** Never use a bare `dict` / `HashMap` / `Map` as a cache without an explicit size cap — they grow with the input and cause OOMs on large catalogs/ingestions. The only exception is when the user explicitly asks for an unbounded cache for a specific case.
|
||||
- Pick a sane default (typically 100–1000 entries depending on entity size); if you're unsure, ask the user.
|
||||
- **Python**: use `collections.OrderedDict` with `popitem(last=False)` eviction after insert, `@functools.lru_cache(maxsize=N)`, or `cachetools.LRUCache`. Cache both hits and misses (negative caching) — repeated unresolvable lookups are a common hot path.
|
||||
- **Java**: use Caffeine (`Caffeine.newBuilder().maximumSize(N).build()`) or Guava `CacheBuilder.newBuilder().maximumSize(N).build()`. Never a bare `HashMap`.
|
||||
- **TypeScript**: use `lru-cache` — never a bare `Map` or plain object.
|
||||
- **Before adding a cache, check whether the underlying call is already cached at a lower layer.** Example: `OpenMetadata._search_es_entity` is `@lru_cache(maxsize=512)`, so wrapping `get_entity_from_es` / `es_search_container_by_path` calls in a local dict cache is redundant — drop the local cache and rely on the existing LRU.
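
As a rough illustration of the Python guidance above, the sketch below bounds a cache with `collections.OrderedDict`, evicts with `popitem(last=False)` after each insert, and also caches misses. The `BoundedCache` class, its `loader` callback, and the default `max_size` are hypothetical names chosen for this example, not existing OpenMetadata code. The `move_to_end` call is an addition that gives least-recently-used rather than insertion-order eviction.

```python
from collections import OrderedDict

_MISSING = object()  # sentinel so that a cached `None` (a known miss) is distinguishable


class BoundedCache:
    """Size-capped cache that stores both hits and misses (negative caching)."""

    def __init__(self, loader, max_size=256):
        self._loader = loader      # hypothetical callable: key -> value or None
        self._max_size = max_size
        self._entries = OrderedDict()

    def get(self, key):
        value = self._entries.get(key, _MISSING)
        if value is _MISSING:
            value = self._loader(key)  # may be None; cached anyway
            self._entries[key] = value
            if len(self._entries) > self._max_size:
                self._entries.popitem(last=False)  # evict the oldest entry
        else:
            self._entries.move_to_end(key)  # mark as most recently used
        return value

    def __len__(self):
        return len(self._entries)


# Usage with an invented loader that records how often it is called.
calls = []


def load(key):
    calls.append(key)
    return None if key == "missing" else key.upper()


cache = BoundedCache(load, max_size=2)
cache.get("a")
cache.get("a")        # served from the cache
cache.get("missing")
cache.get("missing")  # the miss is served from the cache too
```

For simple function-level memoization, `@functools.lru_cache(maxsize=N)` covers the same need with less code; an explicit class is mainly useful when the cache must be inspected or cleared selectively.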
|
||||
|
||||
### Testing Philosophy
|
||||
- **Test real behavior, not mock wiring** - if a test requires mocking 3+ classes just to verify a method call, it's testing the wrong thing
|
||||
- **Prefer integration tests** over heavily-mocked unit tests. This project has full integration test infrastructure (OpenMetadataApplicationTest, Docker containers, real OpenSearch). Use it.
|
||||
|
|
|
|||
|
|
@ -74,7 +74,7 @@ For connector-specific development, see [skills/README.md](skills/README.md).
|
|||
2. /connector-standards — Load the relevant standards
|
||||
3. /tdd — Write pytest tests first
|
||||
4. Implement using topology pattern
|
||||
5. make py_format && make lint
|
||||
5. make py_format && make py_format_check
|
||||
6. /test-enforcement — Verify 90% coverage
|
||||
7. /verification — Show test + lint output
|
||||
8. /connector-review — Full review against golden standards (for connectors)
|
||||
|
|
|
|||
8
Makefile
|
|
@ -39,7 +39,7 @@ yarn_start_e2e_ui: ## Run the e2e tests locally in UI mode with Yarn
|
|||
.PHONY: yarn_start_e2e_codegen
|
||||
yarn_start_e2e_codegen: ## generate playwright code
|
||||
cd openmetadata-ui/src/main/resources/ui && yarn playwright:codegen
|
||||
|
||||
|
||||
.PHONY: py_antlr
|
||||
py_antlr: ## Generate the Python code for parsing FQNs
|
||||
antlr4 -Dlanguage=Python3 -o ingestion/src/metadata/generated/antlr ${PWD}/openmetadata-spec/src/main/antlr4/org/openmetadata/schema/*.g4
|
||||
|
|
@ -254,21 +254,21 @@ ui-checkstyle-core-components:
|
|||
cd openmetadata-ui-core-components/src/main/resources/ui && yarn install --frozen-lockfile && yarn lint:fix && yarn pretty
|
||||
|
||||
# Fix linting and formatting errors in changed files in src folder
|
||||
# Changed files are detected based on the current branch against main branch.
|
||||
# Changed files are detected based on the current branch against main branch.
|
||||
# So make sure to run this after rebasing to main to get the correct list of changed files.
|
||||
.PHONY: ui-checkstyle-src-changed
|
||||
ui-checkstyle-src-changed:
|
||||
cd openmetadata-ui/src/main/resources/ui && yarn install --frozen-lockfile && yarn ui-checkstyle:changed
|
||||
|
||||
# Fix linting and formatting errors in changed playwright test files
|
||||
# Changed files are detected based on the current branch against main branch.
|
||||
# Changed files are detected based on the current branch against main branch.
|
||||
# So make sure to run this after rebasing to main to get the correct list of changed files.
|
||||
.PHONY: ui-checkstyle-playwright-changed
|
||||
ui-checkstyle-playwright-changed:
|
||||
cd openmetadata-ui/src/main/resources/ui && yarn install --frozen-lockfile && yarn ui-checkstyle:playwright:changed
|
||||
|
||||
# Fix linting and formatting errors in changed core components files
|
||||
# Changed files are detected based on the current branch against main branch.
|
||||
# Changed files are detected based on the current branch against main branch.
|
||||
# So make sure to run this after rebasing to main to get the correct list of changed files.
|
||||
.PHONY: ui-checkstyle-core-components-changed
|
||||
ui-checkstyle-core-components-changed:
|
||||
|
|
|
|||
898
README.md
|
|
@ -14,84 +14,866 @@
|
|||
|
||||
</div>
|
||||
|
||||
## What is OpenMetadata?
|
||||
[OpenMetadata](https://open-metadata.org/) is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column-level lineage, and seamless team collaboration. It is one of the fastest-growing open-source projects with a vibrant community and adoption by a diverse set of companies in a variety of industry verticals. Based on Open Metadata Standards and APIs, supporting connectors to a wide range of data services, OpenMetadata enables end-to-end metadata management, giving you the freedom to unlock the value of your data assets.
|
||||
<div align="center">
|
||||
<img src="https://github.com/open-metadata/OpenMetadata/assets/40225091/ebfb4ec5-f0a2-4d58-8ce5-a082b5cf0f76" width=800>
|
||||
</div>
|
||||
# OpenMetadata
|
||||
|
||||
<br />
|
||||
Contents:
|
||||
## The Open Semantic Context Platform for Data and AI
|
||||
|
||||
- [Features](#key-features-of-openmetadata)
|
||||
- [Try our Sandbox](#try-our-sandbox)
|
||||
- [Install & Run](#install-and-run-openmetadata)
|
||||
- [Roadmap](https://docs.open-metadata.org/latest/roadmap)
|
||||
- [Documentation and Support](#documentation-and-support)
|
||||
- [Contributors](#contributors)
|
||||
OpenMetadata is the open platform for building trusted data context and business semantics for humans, AI assistants, and agents.
|
||||
|
||||
OpenMetadata Consists of Four Main Components:
|
||||
- **Metadata Schemas**: These are the core definitions and vocabulary for metadata based on common abstractions and types. They also allow for custom extensions and properties to suit different use cases and domains.
|
||||
- **Metadata Store**: This is the central repository for storing and managing the metadata graph, which connects data assets, users, and tool-generated metadata in a unified way.
|
||||
- **Metadata APIs**: These are the interfaces for producing and consuming metadata, built on top of the metadata schemas. They enable seamless integration of user interfaces and tools, systems, and services with the metadata store.
|
||||
- **Ingestion Framework**: This is a pluggable framework for ingesting metadata from various sources and tools into the metadata store. It supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.
|
||||
OpenMetadata connects technical metadata, data quality signals, data lineage, column-level lineage, ownership, usage, policies, conversations, glossaries, classifications, metrics, domains, and data products into a unified metadata knowledge graph. With 120+ connectors, open metadata standards, semantic search, APIs, SDKs, and an MCP server, OpenMetadata gives every user and AI system the governed context it needs to discover, understand, trust, and use data.
|
||||
|
||||
## Key Features of OpenMetadata
|
||||
**Data Discovery**: Find and explore all your data assets in a single place using various strategies, such as keyword search, data associations, and advanced queries. You can search across tables, topics, dashboards, pipelines, and services.
|
||||
AI does not need another raw database connector. AI needs context.
|
||||
|
||||

|
||||
<br><br><br>
|
||||
**Data Collaboration**: Communicate, converse, and cooperate with other users and teams on data assets. You can get event notifications, send alerts, add announcements, create tasks, and use conversation threads.
|
||||
OpenMetadata provides that context:
|
||||
|
||||

|
||||
<br><br><br>
|
||||
**Data Quality and Profiler**: Measure and monitor data quality with **no-code** tests to build trust in your data. You can define and run data quality tests, group them into test suites, and view the results in an interactive dashboard. Built-in collaboration features make data quality a shared responsibility in your organization.
|
||||
- what data exists
|
||||
- what it means
|
||||
- who owns it
|
||||
- how it is used
|
||||
- where it came from
|
||||
- where it flows
|
||||
- whether it is fresh, tested, and trusted
|
||||
- which business concepts, glossary terms, classifications, and policies apply
|
||||
- what downstream assets, dashboards, pipelines, metrics, and ML models depend on it
|
||||
|
||||

|
||||
<br><br><br>
|
||||
**Data Governance**: Enforce data policies and standards across your organization. You can define data domains and data products, assign owners and stakeholders, and classify data assets using tags and terms. Use powerful automation features to auto-classify your data.
|
||||
---
|
||||
|
||||

|
||||
<br><br><br>
|
||||
**Data Insights and KPIs**: Use reports and platform analytics to understand the state of your organization's data. Data Insights provides a single-pane view of the key metrics that best reflect that state. Define Key Performance Indicators (KPIs) and set goals within OpenMetadata to work towards better documentation, ownership, and tiering. Alerts can be set against the KPIs and delivered on a specified schedule.
|
||||
## Contents
|
||||
|
||||

|
||||
<br><br><br>
|
||||
**Data Lineage**: Track and visualize the origin and transformation of your data assets end-to-end. You can view column-level lineage, filter queries, and edit lineage manually using a no-code editor.
|
||||
|
||||
**Data Documentation**: Document your data assets and metadata entities using rich text, images, and links. You can also add comments and annotations and generate data dictionaries and data catalogs.
|
||||
|
||||
**Data Observability**: Monitor the health and performance of your data assets and pipelines. You can view metrics such as data freshness, data volume, data quality, and data latency. You can also set up alerts and notifications for any anomalies or failures.
|
||||
|
||||
**Data Security**: Secure your data and metadata using various authentication and authorization mechanisms. You can integrate with different identity providers for single sign-on and define roles and policies for access control.
|
||||
|
||||
**Webhooks**: Integrate with external applications and services using webhooks. You can register URLs to receive metadata event notifications and integrate with Slack, Microsoft Teams, and Google Chat.
|
||||
|
||||
**Connectors**: Ingest metadata from various sources and tools using connectors. OpenMetadata supports 84+ connectors for data warehouses, databases, dashboard services, messaging services, pipeline services, and more.
|
||||
- [Why OpenMetadata for AI?](#why-openmetadata-for-ai)
|
||||
- [Context: Give AI the Full Picture of Your Data](#context-give-ai-the-full-picture-of-your-data)
|
||||
- [Semantics: Give AI Business Meaning](#semantics-give-ai-business-meaning)
|
||||
- [Knowledge Graphs and Ontologies](#knowledge-graphs-and-ontologies)
|
||||
- [Automation: Activate Context and Semantics with AI](#automation-activate-context-and-semantics-with-ai)
|
||||
- [What You Can Build](#what-you-can-build)
|
||||
- [How OpenMetadata Works](#how-openmetadata-works)
|
||||
- [MCP: Connect AI Assistants and Agents](#mcp-connect-ai-assistants-and-agents)
|
||||
- [Semantic Search](#semantic-search)
|
||||
- [OpenMetadata Standards](#openmetadata-standards)
|
||||
- [Core Platform Capabilities](#core-platform-capabilities)
|
||||
- [Quickstart](#quickstart)
|
||||
- [Documentation and Community](#documentation-and-community)
|
||||
- [Contributing](#contributing)
|
||||
- [License](#license)
|
||||
|
||||
## Try our Sandbox
|
||||
---
|
||||
|
||||
Take a look and play with sample data at [http://sandbox.open-metadata.org](http://sandbox.open-metadata.org)
|
||||
## Why OpenMetadata for AI?
|
||||
|
||||
## Install and Run OpenMetadata
|
||||
Get up and running in a few minutes. See the OpenMetadata documentation for [installation instructions](https://docs.open-metadata.org/quick-start/local-docker-deployment).
|
||||
AI needs more than data access. It needs context, semantics, trust, lineage, governance, and operational awareness.
|
||||
|
||||
## Documentation and Support
|
||||
Connecting an AI assistant directly to a database, warehouse, dashboard, or pipeline only gives it raw access to data structures. It does not give the AI enough context to understand what the data means, whether it can be trusted, who owns it, how it is governed, or what downstream systems depend on it.
|
||||
|
||||
We're here to help and make OpenMetadata even better! Check out [OpenMetadata documentation](https://docs.open-metadata.org/) for a complete description of OpenMetadata's features. Join our [Slack Community](https://slack.open-metadata.org/) to get in touch with us if you want to chat, need help, or discuss new feature requirements.
|
||||
OpenMetadata gives AI systems the context and semantics they need to safely discover, understand, govern, and use enterprise data.
|
||||
|
||||
OpenMetadata does this by combining four capabilities:
|
||||
|
||||
## Contributors
|
||||
1. **Context** — technical, operational, trust, and lineage metadata from the data ecosystem.
|
||||
2. **Semantics** — business meaning through glossaries, metrics, classifications, domains, policies, and ontologies.
|
||||
3. **Knowledge Graph** — relationships connecting assets, columns, people, teams, policies, lineage, quality, and business concepts.
|
||||
4. **Automation** — MCP, Semantic Search, APIs, SDKs, events, and workflows that let AI assistants and agents act on governed metadata.
|
||||
|
||||
We ❤️ all contributions, big and small! Check out our [CONTRIBUTING](./CONTRIBUTING.md) guide to get started, and let us know how we can help.
|
||||
With OpenMetadata, AI can answer questions such as:
|
||||
|
||||
Don't want to miss anything? Give the project a ⭐ 🚀
|
||||
- What does this metric mean?
|
||||
- Which datasets power this dashboard?
|
||||
- Who owns this data product?
|
||||
- Is this dataset certified, fresh, and high quality?
|
||||
- What downstream dashboards or ML models are affected by this column change?
|
||||
- Which assets are related to customer purchase behavior, even if they use different names?
|
||||
- Which columns contain sensitive customer information?
|
||||
- Which glossary terms and business concepts apply to this dataset?
|
||||
|
||||
A HUGE THANK YOU to all our supporters!
|
||||
---
|
||||
|
||||
## Context: Give AI the Full Picture of Your Data
|
||||
|
||||
Context is the metadata that describes how data exists, behaves, changes, flows, and is used across the organization.
|
||||
|
||||
OpenMetadata collects context from across your data stack and connects it into a unified metadata graph.
|
||||
|
||||
### Technical Metadata
|
||||
|
||||
OpenMetadata gives AI access to technical metadata such as:
|
||||
|
||||
- databases, schemas, tables, columns, topics, dashboards, charts, pipelines, APIs, search indexes, ML models, and storage assets
|
||||
- schemas, column names, data types, constraints, descriptions, sample queries, joins, and service metadata
|
||||
- service configuration, ingestion metadata, and operational metadata
|
||||
- owners, teams, users, personas, domains, data products, and usage patterns
|
||||
|
||||
### Data Quality and Trust Signals
|
||||
|
||||
AI should not treat every dataset as equally trustworthy.
|
||||
|
||||
OpenMetadata gives AI access to trust signals such as:
|
||||
|
||||
- data quality tests
|
||||
- test suites and test results
|
||||
- freshness checks
|
||||
- volume checks
|
||||
- null, uniqueness, distribution, and custom tests
|
||||
- profiling results
|
||||
- observability signals
|
||||
- data quality history
|
||||
- incidents, alerts, and operational health signals
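
To make the idea of trust signals concrete, here is a minimal sketch of how a consumer might combine freshness and test results into a single decision. The `signals` dictionary, its keys, the `"Success"` status string, and the 24-hour threshold are all assumptions made for illustration; they are not OpenMetadata's schema or API.

```python
from datetime import datetime, timedelta, timezone


def is_trusted(signals, now, max_age=timedelta(hours=24)):
    """Treat an asset as trusted only if it is fresh and every test passed."""
    # Hypothetical keys: "last_updated" (datetime) and "test_results" (list of str).
    fresh = now - signals["last_updated"] <= max_age
    tests = signals["test_results"]
    # An asset with no tests at all is not considered tested.
    tested = bool(tests) and all(result == "Success" for result in tests)
    return fresh and tested


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh_and_passing = {
    "last_updated": datetime(2026, 1, 2, 6, 0, tzinfo=timezone.utc),
    "test_results": ["Success", "Success"],
}
stale = {
    "last_updated": datetime(2025, 12, 30, 6, 0, tzinfo=timezone.utc),
    "test_results": ["Success"],
}
```

An assistant could apply a check like this before answering from a dataset, and decline or add a caveat when it returns `False`.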
|
||||
|
||||
### Data Lineage and Impact
|
||||
|
||||
AI needs to understand where data comes from and where it goes.
|
||||
|
||||
OpenMetadata captures:
|
||||
|
||||
- upstream and downstream lineage
|
||||
- table-level lineage
|
||||
- dashboard lineage
|
||||
- pipeline lineage
|
||||
- metric lineage
|
||||
- ML model lineage
|
||||
- API and topic dependencies
|
||||
- impact analysis across the data estate
|
||||
|
||||
### Column-Level Lineage
|
||||
|
||||
For precise AI reasoning, table-level lineage is not enough.
|
||||
|
||||
OpenMetadata helps AI understand:
|
||||
|
||||
- which source columns produce which downstream columns
|
||||
- how columns flow through transformations
|
||||
- which dashboards, reports, metrics, or ML models depend on a specific column
|
||||
- what may break when a column changes
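
As a rough illustration of the impact question in the list above, column-level lineage can be modeled as a directed graph and walked breadth-first. The edge list and column names below are hypothetical, and the sketch does not call the OpenMetadata API.

```python
from collections import deque

# Hypothetical column-level lineage edges: source column -> downstream columns.
COLUMN_LINEAGE = {
    "raw.orders.customer_id": ["staging.orders.customer_id"],
    "staging.orders.customer_id": [
        "marts.customer_360.customer_id",
        "marts.revenue_orders.customer_id",
    ],
    "marts.customer_360.customer_id": ["dashboard.churn_overview.customer"],
    "marts.revenue_orders.customer_id": [],
}


def downstream_columns(start, lineage):
    """Return every column reachable from `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in lineage.get(current, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


impacted = downstream_columns("raw.orders.customer_id", COLUMN_LINEAGE)
```

Tracking visited columns in `seen` keeps the walk finite even if the lineage contains a cycle.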
|
||||
|
||||
### Connected from 120+ Data Services
|
||||
|
||||
OpenMetadata brings this context together from databases, warehouses, lakes, dashboards, pipelines, messaging systems, ML platforms, storage systems, APIs, search systems, and metadata systems.
|
||||
|
||||
Context answers questions like:
|
||||
|
||||
- What data exists?
|
||||
- Where did this data come from?
|
||||
- Who owns it?
|
||||
- Is it fresh?
|
||||
- Is it tested?
|
||||
- Is it trusted?
|
||||
- What systems depend on it?
|
||||
- What happens if it changes?
|
||||
|
||||
---
|
||||
|
||||
## Semantics: Give AI Business Meaning
|
||||
|
||||
Semantics is the business meaning layered on top of technical context.
|
||||
|
||||
Without semantics, AI may see a column named `cust_id`, `acct_id`, or `buyer_key`, but it may not know whether those fields represent a customer, an account, a buyer, a household, or a legal entity.
|
||||
|
||||
OpenMetadata lets teams define, govern, and connect business meaning across the metadata graph.
|
||||
|
||||
### Business Concepts
|
||||
|
||||
Define the concepts that matter to the business, such as:
|
||||
|
||||
- Customer
|
||||
- Account
|
||||
- Order
|
||||
- Revenue
|
||||
- Product
|
||||
- Consent
|
||||
- Churn
|
||||
- Risk
|
||||
- Lifetime Value
|
||||
- Net Retention
|
||||
- Active User
|
||||
- Sensitive Data
|
||||
|
||||
### Glossaries and Glossary Terms
|
||||
|
||||
OpenMetadata lets teams create governed vocabularies with:
|
||||
|
||||
- business definitions
|
||||
- synonyms and abbreviations
|
||||
- owners and reviewers
|
||||
- related terms
|
||||
- hierarchical terms
|
||||
- links to tables, columns, dashboards, metrics, and data products
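
The sketch below shows one way synonyms and abbreviations could connect cryptic column names to glossary terms. The glossary contents and the token-matching rule are assumptions for illustration only. In practice a steward decides which term a column such as `buyer_key` really represents.

```python
# Hypothetical glossary: each term lists its synonyms and abbreviations.
GLOSSARY = {
    "Customer": {"customer", "cust", "buyer"},
    "Account": {"account", "acct"},
}


def match_terms(column_name, glossary):
    """Return glossary terms whose synonyms appear as tokens in the column name."""
    tokens = set(column_name.lower().split("_"))
    return sorted(term for term, synonyms in glossary.items() if tokens & synonyms)


suggestions = {name: match_terms(name, GLOSSARY) for name in ("cust_id", "acct_id", "buyer_key")}
```

The output is a list of suggestions for review, not a final mapping.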
|
||||
|
||||
### Metrics and KPIs
|
||||
|
||||
Metrics are one of the most important semantic objects for AI.
|
||||
|
||||
OpenMetadata helps AI understand:
|
||||
|
||||
- what a metric means
|
||||
- how it is calculated
|
||||
- who owns it
|
||||
- which dashboards use it
|
||||
- which tables power it
|
||||
- which glossary terms define it
|
||||
- which downstream consumers depend on it
|
||||
|
||||
### Classifications and Tags
|
||||
|
||||
OpenMetadata lets teams classify and label data with governed tags such as:
|
||||
|
||||
- PII
|
||||
- Sensitive
|
||||
- Confidential
|
||||
- Certified
|
||||
- Deprecated
|
||||
- Tier 1
|
||||
- Finance
|
||||
- Marketing
|
||||
- GDPR
|
||||
- HIPAA
|
||||
- SOX
|
||||
- ML Feature
|
||||
- Customer Data
|
||||
|
||||
### Domains and Data Products
|
||||
|
||||
OpenMetadata connects assets to business ownership boundaries through:
|
||||
|
||||
- domains
|
||||
- data products
|
||||
- teams
|
||||
- owners
|
||||
- policies
|
||||
- personas
|
||||
- data product consumers
|
||||
|
||||
### Policies and Governance
|
||||
|
||||
OpenMetadata connects semantics to governance so AI systems can reason with policy-aware context, not just metadata.
|
||||
|
||||
This includes:
|
||||
|
||||
- ownership
|
||||
- stewardship
|
||||
- classification
|
||||
- access control context
|
||||
- certification
|
||||
- review workflows
|
||||
- governance policies
|
||||
- lifecycle states
|
||||
|
||||
Semantics answers questions like:
|
||||
|
||||
- What does this data mean?
|
||||
- What business concept does this column represent?
|
||||
- Is this metric officially defined?
|
||||
- Is this asset certified?
|
||||
- Is this data sensitive?
|
||||
- Which glossary terms apply?
|
||||
- Which domain owns this data product?
|
||||
|
||||
---
|
||||
|
||||
## Knowledge Graphs and Ontologies
|
||||
|
||||
OpenMetadata connects context and semantics into a unified metadata knowledge graph.
|
||||
|
||||
The graph does not just store data assets. It stores the relationships between data assets, people, teams, policies, quality tests, lineage, classifications, glossary terms, metrics, domains, and data products.
|
||||
|
||||
This makes OpenMetadata a semantic context layer for AI.
|
||||
|
||||
Example relationships:
|
||||
|
||||
```text
|
||||
Table ──hasColumn────────────> Column
|
||||
Column ──classifiedAs────────> PII
|
||||
Column ──represents──────────> Customer Identifier
|
||||
Table ──ownedBy──────────────> Data Engineering Team
|
||||
Table ──partOf───────────────> Customer 360 Data Product
|
||||
Dashboard ──dependsOn────────> Table
|
||||
Metric ──definedBy───────────> Glossary Term
|
||||
Pipeline ──produces──────────> Table
|
||||
Column ──flowsTo─────────────> Column
|
||||
Test Case ──validates────────> Table
|
||||
Domain ──contains────────────> Data Product
|
||||
Glossary Term ──relatedTo────> Business Concept
|
||||
Policy ──governs─────────────> Classification
|
||||
```
|
||||
|
||||
With this graph, AI can reason across relationships:
|
||||
|
||||
- Which datasets power this dashboard?
|
||||
- What does this metric mean?
|
||||
- Who owns this data product?
|
||||
- Is this table fresh, certified, and high quality?
|
||||
- Which downstream dashboards or ML models are affected by this column change?
|
||||
- Which assets are related to customer purchase behavior, even if they use different names?
|
||||
- Which columns represent sensitive customer information?
|
||||
- Which business concepts are connected to this data product?
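
One simple way to picture reasoning over these relationships is to treat them as (subject, predicate, object) triples and answer questions by pattern matching. The entity names below are hypothetical. The predicate names mirror the example relationships, and nothing here reflects how OpenMetadata stores its graph internally.

```python
# A few example relationships, written as (subject, predicate, object).
TRIPLES = [
    ("orders", "hasColumn", "orders.customer_id"),
    ("orders.customer_id", "classifiedAs", "PII"),
    ("orders", "ownedBy", "Data Engineering Team"),
    ("orders", "partOf", "Customer 360 Data Product"),
    ("revenue_dashboard", "dependsOn", "orders"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def tables_with_pii(triples):
    """Find tables that have at least one column classified as PII."""
    pii_columns = {s for s, _, _ in match(triples, predicate="classifiedAs", obj="PII")}
    return sorted({s for s, _, o in match(triples, predicate="hasColumn") if o in pii_columns})


dashboards_on_orders = [s for s, _, _ in match(TRIPLES, predicate="dependsOn", obj="orders")]
```

Chaining two lookups, as `tables_with_pii` does, is the small-scale version of a multi-hop question such as "which dashboards depend on tables that contain sensitive columns?"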
|
||||
|
||||
### Ontologies and Semantic Interoperability
|
||||
|
||||
OpenMetadata is built on open metadata standards.
|
||||
|
||||
[OpenMetadata Standards](https://openmetadatastandards.org/) provides schemas, ontologies, and semantic specifications for interoperable metadata management, including:
|
||||
|
||||
- JSON Schemas for metadata entities, APIs, configurations, events, and relationships
|
||||
- RDF/OWL ontologies for semantic web, linked data, and knowledge graph use cases
|
||||
- SHACL shapes for validation
|
||||
- JSON-LD contexts for semantic interoperability
|
||||
- standards for governance, lineage, quality, observability, teams, users, policies, and events
|
||||
|
||||
These standards make OpenMetadata more than a catalog. They make it a foundation for interoperable semantic metadata, linked data, and enterprise knowledge graphs.
|
||||
|
||||
---
|
||||
|
||||
## Automation: Activate Context and Semantics with AI
|
||||
|
||||
OpenMetadata makes the metadata graph actionable.
|
||||
|
||||
AI assistants, coding agents, data teams, governance teams, and applications can use OpenMetadata through:
|
||||
|
||||
- MCP
|
||||
- Semantic Search
|
||||
- APIs
|
||||
- SDKs
|
||||
- events
|
||||
- webhooks
|
||||
- ingestion workflows
|
||||
- metadata applications
|
||||
|
||||
### MCP Server
|
||||
|
||||
OpenMetadata includes an MCP server that lets AI assistants and MCP-compatible clients interact with the metadata graph through natural language.
|
||||
|
||||
With OpenMetadata MCP, AI assistants can:
|
||||
|
||||
- search metadata
|
||||
- run semantic search
|
||||
- retrieve entity details
|
||||
- inspect upstream and downstream lineage
|
||||
- create glossaries and glossary terms
|
||||
- create lineage
|
||||
- update descriptions, tags, owners, and other metadata
|
||||
- list data quality test definitions
|
||||
- create data quality test cases
|
||||
- analyze root causes of data quality failures
|
||||
|
||||
Get started with MCP:
|
||||
[OpenMetadata MCP Server Documentation](https://docs.open-metadata.org/how-to-guides/mcp)
|
||||
|
||||
### Semantic Search
|
||||
|
||||
Semantic Search lets users and AI assistants search by meaning, not just by exact keywords.
|
||||
|
||||
For example, a user can ask:
|
||||
|
||||
> Find tables related to customer purchase behavior and transaction history.
|
||||
|
||||
OpenMetadata can return conceptually related assets even when the exact words in the query do not appear in the asset names.
|
||||
|
||||
This helps AI answer questions such as:
|
||||
|
||||
- Which datasets are related to customer behavior?
|
||||
- What dashboards do we have for revenue forecasting?
|
||||
- Show me assets related to user engagement metrics.
|
||||
- Find pipelines that process financial compliance data.
|
||||
|
||||
### AI SDK
|
||||
|
||||
Developers can use OpenMetadata’s AI SDK to build custom AI applications that use OpenMetadata MCP tools programmatically.
|
||||
|
||||
The AI SDK enables AI applications to use OpenMetadata context from Python, TypeScript, and Java.
|
||||
|
||||
### APIs, Events, and Webhooks
|
||||
|
||||
OpenMetadata exposes APIs, events, and webhooks so teams can automate metadata workflows across their data ecosystem.
|
||||
|
||||
Use them to:
|
||||
|
||||
- ingest and update metadata
|
||||
- react to metadata changes
|
||||
- trigger governance workflows
|
||||
- integrate with collaboration tools
|
||||
- build custom metadata applications
|
||||
- synchronize context across systems
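
A consumer that reacts to metadata changes often reduces to a small dispatcher keyed on event type. The sketch below assumes an event is a dictionary with `eventType`, `entityType`, and `entityFullyQualifiedName` keys. Those field names and the `entityUpdated` value are assumptions for illustration and should be checked against the event schema of your OpenMetadata version.

```python
from collections import defaultdict

# Maps an event type to the handlers registered for it.
HANDLERS = defaultdict(list)


def on(event_type):
    """Decorator that registers a handler for one event type."""
    def register(func):
        HANDLERS[event_type].append(func)
        return func
    return register


@on("entityUpdated")
def flag_for_review(event):
    # Hypothetical reaction: produce a review task description.
    return f"review {event['entityType']} {event['entityFullyQualifiedName']}"


def dispatch(event):
    """Run every handler registered for the event's type and collect the results."""
    return [handler(event) for handler in HANDLERS.get(event["eventType"], [])]


results = dispatch({
    "eventType": "entityUpdated",
    "entityType": "table",
    "entityFullyQualifiedName": "warehouse.sales.orders",
})
```

A webhook receiver would parse the incoming request body into such a dictionary and pass it to `dispatch`.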
|
||||
|
||||
### Coding Agents and AI Assistants
|
||||
|
||||
OpenMetadata can connect to MCP-compatible assistants and agents such as:
|
||||
|
||||
- Claude Desktop
|
||||
- Claude Code
|
||||
- Goose
|
||||
- Cursor
|
||||
- VS Code
|
||||
- Codex
|
||||
- custom LLM applications
|
||||
- internal enterprise AI assistants
|
||||
|
||||
This allows coding agents and data assistants to understand schemas, glossary definitions, ownership, lineage, quality requirements, and downstream dependencies before generating SQL, dbt models, documentation, tests, migration plans, or impact analysis.
|
||||
|
||||
---
|
||||
|
||||
## What You Can Build
|
||||
|
||||
### AI Data Discovery
|
||||
|
||||
Ask natural-language questions over your metadata graph and find relevant assets, even when names and keywords do not match exactly.
|
||||
|
||||
Example:
|
||||
|
||||
> Find datasets related to customer purchase behavior and transaction history.
|
||||
|
||||
### Trusted AI Assistants
|
||||
|
||||
Ground AI responses in governed metadata: owners, descriptions, glossary terms, tags, classifications, quality signals, freshness, usage, and lineage.
|
||||
|
||||
Example:
|
||||
|
||||
> Explain what this dashboard measures and whether the underlying data is trusted.
|
||||
|
||||
### Impact Analysis Agents
|
||||
|
||||
Ask what will break if a table, column, pipeline, dashboard, metric, or ML feature changes.
|
||||
|
||||
Example:
|
||||
|
||||
> What downstream dashboards and ML models are affected if `customer_id` changes in this table?
|
||||
|
||||
### Governance Automation
|
||||
|
||||
Use agents to suggest descriptions, assign glossary terms, identify sensitive data, create classifications, propose ownership, and manage stewardship workflows.
|
||||
|
||||
Example:
|
||||
|
||||
> Review this new table, suggest glossary terms, and identify possible PII columns.
|
||||
|
||||
### Data Quality Automation
|
||||
|
||||
Use AI workflows to create tests, summarize failures, identify root causes, and recommend remediation steps.
|
||||
|
||||
Example:
|
||||
|
||||
> Investigate why this data quality test failed and identify upstream changes that may have caused it.
|
||||
|
||||
### Semantic Knowledge Graphs
|
||||
|
||||
Build interoperable metadata knowledge graphs using OpenMetadata Standards, RDF/OWL, JSON-LD, SHACL, and OpenMetadata’s entity relationships.
|
||||
|
||||
Example:
|
||||
|
||||
> Find all assets related to customer risk that contain sensitive data and are used by revenue dashboards.
|
||||
|
||||
### Developer and Coding Agent Workflows
|
||||
|
||||
Connect coding agents to OpenMetadata so they can understand schemas, owners, lineage, business definitions, and quality requirements before generating code, queries, dbt models, tests, or migration plans.
|
||||
|
||||
Example:
|
||||
|
||||
> Generate a dbt model for this customer table and include tests based on OpenMetadata quality expectations.
|
||||
|
||||
---
|
||||
|
||||
## How OpenMetadata Works
|
||||
|
||||
OpenMetadata is built around an open, schema-first metadata graph.
|
||||
|
||||
```text
|
||||
┌──────────────────────────────────────────┐
|
||||
│ Data Ecosystem │
|
||||
│ Warehouses | Lakes | BI | Pipelines | ML │
|
||||
│ APIs | Topics | Storage | Search | SaaS │
|
||||
└─────────────────────┬────────────────────┘
|
||||
│
|
||||
120+ Connectors
|
||||
│
|
||||
▼
|
||||
┌────────────────────────────────────────────────────────────────────┐
|
||||
│ OpenMetadata │
|
||||
│ │
|
||||
│ Context Layer │
|
||||
│ - technical metadata │
|
||||
│ - quality and observability signals │
|
||||
│ - table and column-level lineage │
|
||||
│ - ownership, usage, domains, data products │
|
||||
│ │
|
||||
│ Semantics Layer │
|
||||
│ - business concepts │
|
||||
│ - glossaries and glossary terms │
|
||||
│ - classifications and tags │
|
||||
│ - metrics and KPIs │
|
||||
│ - ontologies and semantic standards │
|
||||
│ │
|
||||
│ Knowledge Graph │
|
||||
│ - assets, people, teams, policies, lineage, quality, semantics │
|
||||
└─────────────────────────────────────┬──────────────────────────────┘
|
||||
│
|
||||
┌─────────────────────────────┼─────────────────────────────┐
|
||||
▼ ▼ ▼
|
||||
Semantic Search APIs MCP Server
|
||||
│ │ │
|
||||
└─────────────────────────────┼─────────────────────────────┘
|
||||
▼
|
||||
AI Assistants and Agents
|
||||
Claude | Claude Code | Cursor | VS Code | Codex
|
||||
Goose | Custom Apps | AI SDK Workflows
|
||||
```
|
||||
|
||||
### Platform Components
|
||||
|
||||
OpenMetadata consists of five core layers:
|
||||
|
||||
1. **Open Metadata Standards**
|
||||
Canonical schemas, APIs, RDF/OWL ontologies, SHACL shapes, JSON-LD contexts, and event models for metadata interoperability.
|
||||
|
||||
2. **Metadata Store and Knowledge Graph**
|
||||
A central repository that stores and connects metadata entities, relationships, quality signals, usage, lineage, ownership, and semantics.
|
||||
|
||||
3. **Ingestion Framework and Connectors**
|
||||
A pluggable framework for collecting metadata from databases, warehouses, dashboards, pipelines, messaging systems, ML platforms, storage systems, APIs, and more.
|
||||
|
||||
4. **APIs, Search, Events, and Webhooks**
|
||||
Interfaces for consuming, updating, searching, subscribing to, and automating metadata.
|
||||
|
||||
5. **MCP and AI SDK**
|
||||
AI-facing tools that expose OpenMetadata context and semantics to assistants, coding agents, and custom LLM applications.
|
||||
|
||||
---
|
||||
|
||||
## MCP: Connect AI Assistants and Agents
|
||||
|
||||
OpenMetadata’s MCP server lets AI assistants and agents interact with your metadata graph through natural language.
|
||||
|
||||
Use MCP to give AI assistants governed access to OpenMetadata context, including descriptions, owners, lineage, glossary terms, tags, classifications, data quality results, and semantic search.
|
||||
|
||||
### MCP Tools
|
||||
|
||||
OpenMetadata MCP tools include:
|
||||
|
||||
| Tool | What it does |
|
||||
| --- | --- |
|
||||
| `search_metadata` | Search across tables, dashboards, pipelines, topics, glossaries, metrics, and more |
|
||||
| `semantic_search` | Search by meaning and context beyond keyword matching |
|
||||
| `get_entity_details` | Retrieve detailed metadata for a specific entity |
|
||||
| `get_entity_lineage` | Retrieve upstream and downstream lineage for an entity |
|
||||
| `create_glossary` | Create a new glossary |
|
||||
| `create_glossary_term` | Create a glossary term |
|
||||
| `create_lineage` | Create a lineage edge between entities |
|
||||
| `patch_entity` | Update metadata such as descriptions, tags, and owners |
|
||||
| `get_test_definitions` | List data quality test definitions |
|
||||
| `create_test_case` | Create a data quality test case |
|
||||
| `root_cause_analysis` | Analyze root causes of data quality failures |
|
||||
|
||||
### Supported MCP Workflows
|
||||
|
||||
OpenMetadata documentation includes setup guides for:
|
||||
|
||||
- Claude Desktop
|
||||
- Claude Code
|
||||
- Goose
|
||||
- Cursor
|
||||
- VS Code
|
||||
- Semantic Search through MCP
|
||||
|
||||
Codex and other MCP-compatible coding agents can use the OpenMetadata MCP endpoint as an external context and tool server.
|
||||
|
||||
Get started:
|
||||
[OpenMetadata MCP Server Documentation](https://docs.open-metadata.org/v1.12.x/how-to-guides/mcp)
|
||||
|
||||
### MCP Endpoint
|
||||
|
||||
```text
|
||||
https://<YOUR-OPENMETADATA-SERVER>/mcp
|
||||
```
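
MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The sketch below only builds the endpoint URL and a request body. It sends nothing and skips the session initialization and authentication that a real client performs. The host name and the `query` argument name are assumptions, so check each tool's published input schema before relying on them.

```python
import json


def mcp_endpoint(server):
    """Build the MCP endpoint URL for a given OpenMetadata host."""
    return f"https://{server.strip('/')}/mcp"


def tool_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 `tools/call` request body as used by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "query" is an assumed argument name, used here only for illustration.
body = tool_call_request(1, "search_metadata", {"query": "customer orders"})
payload = json.dumps(body)
url = mcp_endpoint("metadata.example.com")
```

In practice an MCP client library or an MCP-compatible assistant handles this framing for you.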
|
||||
|
||||
### Example Prompts
|
||||
|
||||
After connecting an MCP client, try prompts such as:
|
||||
|
||||
```text
|
||||
What is the definition of the Revenue metric?
|
||||
|
||||
Show me the lineage of the data feeding the Executive Revenue dashboard.
|
||||
|
||||
Who owns the Customer 360 data product and when was it last updated?
|
||||
|
||||
Find tables related to customer purchase behavior and transaction history.
|
||||
|
||||
Which downstream dashboards are affected if this column changes?
|
||||
|
||||
Create a glossary term for Net Retention and link it to related metrics.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Semantic Search
|
||||
|
||||
Semantic Search lets users and AI assistants find data assets by meaning, not only by exact keyword matches.
|
||||
|
||||
When Semantic Search is enabled, OpenMetadata can convert natural-language queries into embeddings and search conceptually related metadata assets.
|
||||
|
||||
Example:
|
||||
|
||||
```text
|
||||
Find tables related to customer purchase behavior and transaction history.
|
||||
```
|
||||
|
||||
This can surface assets such as:
|
||||
|
||||
```text
|
||||
order_transactions
|
||||
buyer_activity
|
||||
customer_events
|
||||
revenue_orders
|
||||
```
|
||||
|
||||
Semantic Search helps with:
|
||||
|
||||
- natural-language discovery
|
||||
- AI data exploration
|
||||
- concept-based search
|
||||
- cross-domain asset discovery
|
||||
- finding related data even when names differ
|
||||
- grounding LLM responses in relevant metadata context
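
The core of embedding-based search is ranking assets by vector similarity to the query. The toy sketch below uses hand-written three-dimensional vectors in place of real model embeddings. The asset vectors, the query vector, and the `server_logs` asset are hypothetical, and the code is not OpenMetadata's implementation.

```python
import math

# Hypothetical embeddings; real ones come from an embedding model.
ASSET_EMBEDDINGS = {
    "order_transactions": [0.9, 0.1, 0.0],
    "buyer_activity": [0.8, 0.3, 0.1],
    "server_logs": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, embeddings, top_k=2):
    """Return the names of the top_k assets most similar to the query vector."""
    scored = sorted(
        embeddings,
        key=lambda name: cosine(query_vec, embeddings[name]),
        reverse=True,
    )
    return scored[:top_k]


# Stand-in for the embedding of a query about customer purchases.
results = rank([1.0, 0.2, 0.0], ASSET_EMBEDDINGS)
```

Because ranking depends only on vector direction, an asset can score highly without sharing any words with the query.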
|
||||
|
||||
Learn more:
|
||||
[Semantic Search MCP Tool](https://docs.open-metadata.org/v1.12.x/how-to-guides/mcp/semantic-search)
|
||||
|
||||
---
|
||||
|
||||
## OpenMetadata Standards
|
||||
|
||||
OpenMetadata is built on open metadata standards.
|
||||
|
||||
[OpenMetadata Standards](https://openmetadatastandards.org/) is the open-source home for the schemas, ontologies, and specifications behind OpenMetadata.
|
||||
|
||||
It provides:
|
||||
|
||||
- 700+ JSON Schemas for metadata entities, APIs, configurations, events, and relationships
|
||||
- RDF/OWL ontologies for semantic web, linked data, and knowledge graph use cases
|
||||
- SHACL shapes for metadata validation
|
||||
- JSON-LD contexts for semantic interoperability
|
||||
- API and event schemas for search, feeds, webhooks, and bulk operations
|
||||
- standards for governance, lineage, quality, observability, teams, users, roles, policies, and events
|
||||
|
||||
OpenMetadata Standards enables:
|
||||
|
||||
- interoperable metadata management
|
||||
- semantic metadata modeling
|
||||
- enterprise knowledge graph construction
|
||||
- linked data and RDF integrations
|
||||
- metadata validation using SHACL
|
||||
- extensibility through schema-first design
|
||||
|
||||
Learn more:
|
||||
[OpenMetadata Standards](https://openmetadatastandards.org/)
|
||||
|
||||
---
|
||||
|
||||
## Core Platform Capabilities
|
||||
|
||||
### Discovery and Understanding
|
||||
|
||||
- asset search and discovery
|
||||
- semantic search
|
||||
- descriptions and documentation
|
||||
- sample data and usage context
|
||||
- ownership and stewardship
|
||||
- conversations, tasks, and announcements
|
||||
|
||||
### Governance and Semantics
|
||||
|
||||
- glossaries and glossary terms
|
||||
- classifications and tags
|
||||
- metrics and KPIs
|
||||
- domains and data products
|
||||
- policies and roles
|
||||
- certification and lifecycle states
|
||||
|
||||
### Data Quality and Observability
|
||||
|
||||
- test cases and test suites
|
||||
- profiling
|
||||
- freshness, volume, null, uniqueness, and distribution checks
|
||||
- custom tests
|
||||
- data quality dashboards
|
||||
- alerts and incidents
|
||||
- root-cause analysis workflows
|
||||
|
||||
### Lineage and Impact Analysis
|
||||
|
||||
- table lineage
|
||||
- column-level lineage
|
||||
- dashboard lineage
|
||||
- pipeline lineage
|
||||
- metric lineage
- ML model lineage
- upstream and downstream impact analysis

### Collaboration

- conversations
- tasks
- announcements
- notifications
- ownership workflows
- documentation workflows
- shared stewardship between producers and consumers

### Security and Access Control

- authentication
- authorization
- roles and policies
- SSO integration
- bot and user tokens
- MCP authentication
- governed metadata actions

### Extensibility and Automation

- APIs
- SDKs
- webhooks
- events
- applications
- ingestion framework
- custom connectors
- custom properties
- MCP tools
- AI SDK workflows

---

## Quickstart

### 1. Try OpenMetadata

Explore OpenMetadata using the sandbox:

[OpenMetadata Sandbox](https://sandbox.open-metadata.org)

### 2. Install OpenMetadata

Follow the installation guide:

[OpenMetadata Quickstart](https://docs.open-metadata.org/latest/quick-start)

### 3. Ingest Metadata

Connect your data sources and build your metadata graph.

Start with:

- a warehouse or database
- a BI/dashboard tool
- an orchestration or pipeline system
- data quality and profiling
- lineage ingestion

### 4. Build Context

Add the operational and trust metadata AI needs:

- descriptions
- owners
- teams
- domains
- data products
- quality tests
- freshness checks
- usage
- lineage
- column-level lineage

### 5. Add Semantics

Add business meaning:

- glossaries
- glossary terms
- classifications
- tags
- metrics
- KPIs
- policies
- domains
- data products

### 6. Enable Semantic Search

Configure Semantic Search so users and AI assistants can search by meaning.

Learn more:

```text
https://docs.open-metadata.org/v1.12.x/how-to-guides/mcp/semantic-search
```

### 7. Connect an MCP Client

Install or enable the MCP application in OpenMetadata and connect your preferred MCP-compatible client.

MCP endpoint:

```text
https://<YOUR-OPENMETADATA-SERVER>/mcp
```

MCP guide:

```text
https://docs.open-metadata.org/v1.12.x/how-to-guides/mcp
```

### 8. Build Custom AI Applications

Use the AI SDK to connect any LLM to OpenMetadata’s MCP tools.

AI SDK documentation:

```text
https://docs.open-metadata.org/v1.12.x/api-reference/sdk/ai-sdk
```

---

## Documentation and Community

- Documentation: [docs.open-metadata.org](https://docs.open-metadata.org/)
- MCP Server: [OpenMetadata MCP Documentation](https://docs.open-metadata.org/v1.12.x/how-to-guides/mcp)
- OpenMetadata Standards: [openmetadatastandards.org](https://openmetadatastandards.org/)
- Website: [open-metadata.org](https://open-metadata.org/)
- Slack Community: [slack.open-metadata.org](https://slack.open-metadata.org/)
- Blog: [blog.open-metadata.org](https://blog.open-metadata.org/)

---

## Open Source and Enterprise AI

OpenMetadata is the open-source foundation for metadata, context, semantics, governance, quality, lineage, APIs, MCP, and AI SDK workflows.

For managed enterprise capabilities, AI agents, automation, AI Studio, enterprise MCP workflows, commercial support, and managed operations, see Collate:

- [Collate](https://www.getcollate.io/)
- [Collate AI](https://www.getcollate.io/collate-ai)

---

## Contributing

We welcome contributions from the community.

You can contribute by:

- improving metadata schemas and standards
- adding connectors
- improving ingestion workflows
- enhancing MCP tools
- improving semantic search
- adding documentation
- fixing bugs
- improving the UI and user experience
- proposing new governance, lineage, quality, and AI use cases

See the contribution guide in the repository to get started.

---

<a href="https://github.com/open-metadata/OpenMetadata/graphs/contributors">
<img src="https://contrib.rocks/image?repo=open-metadata/OpenMetadata&max=4000&columns=30" />
</a>

## Stargazers

305
adr-incident-manager-governance-workflows.md
Normal file
@ -0,0 +1,305 @@
# Integrate Incident Manager in the Governance Workflows Framework

ADR-#: 1
Authors: Pablo Takara
Reviewers: Teddy Crépineau, Ram Narayan Balaji
Date: February 27, 2026
Status: Proposed

> Migrate incident lifecycle into a governance workflow using a new Task Lifecycle Node. The node uses OpenMetadata tasks as the source of truth (not Flowable UserTask), receives a template with configurable statuses, and exposes each status transition to the main workflow graph via process variables. Users wire hooks on any transition using standard edges. Non-terminal statuses loop back; terminal statuses auto-close the task.

---

## Context

The Incident Manager handles the lifecycle of data quality incidents in OpenMetadata. When a test case fails, an incident is created; it progresses through `New → Ack → Assigned → Resolved` as humans triage it.

Today, this lifecycle is a **switch statement** in `TestCaseResolutionStatusRepository.storeInternal()`. It handles state transitions, task creation, assignment, and resolution. The state machine is simple, correct, and performant, but it has **no extension points**. Adding a behavior like "on Assigned, notify via Slack" or "on New, auto-assign to table owner" requires modifying repository code, testing, and redeploying.

Meanwhile, OpenMetadata ships a **governance workflows framework** built on Flowable BPM. It is fully configurable via REST API and UI. Users configure workflows as abstract **trigger → nodes → edges** graphs (they never see BPMN XML). The backend compiles these to Flowable process definitions automatically via `NodeFactory` and `MainWorkflow`.

The two systems live side by side but do not interact.

Additionally, the **task refactor** promotes tasks to first-class entities with standard `ChangeEvents`. This enables Flowable to be notified of every status transition — not just resolution — unlocking configurable hooks on any transition from day one.

### Specific Gaps

1. **No auto-close when tests pass.** `TestCaseResultRepository.setTestCaseResultIncidentId()` sets `incidentId = null` when a test succeeds but **never resolves the incident or closes its task**.
2. **No auto-assign on incident creation.** Every incident starts in `New` and requires manual acknowledgement.
3. **No extensibility.** Organizations cannot define configurable rules like "on any status change, execute action X" without code changes.
4. **Fixed lifecycle.** The `New → Ack → Assigned → Resolved` states are hardcoded. Organizations with different triage processes have no way to customize.
5. **No incident TTL.** No mechanism to auto-close stale incidents.

### Enterprise scale context

- 5M assets, 10-30% with data quality tests = 500K-1.5M test cases
- At 2-5% failure rate = **10K-75K concurrent open incidents** (typical)
- `getOrCreateIncident()` enforces one unresolved incident per test case
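The sizing above can be reproduced with a few lines of integer arithmetic. This is only a back-of-the-envelope sketch in Python: the constant and function names are illustrative and not part of OpenMetadata, and the percentages are the ones quoted in the list.

```python
# Figures quoted in the list above; all names are illustrative.
TOTAL_ASSETS = 5_000_000
TEST_COVERAGE_PCT = (10, 30)  # share of assets with data quality tests
FAILURE_RATE_PCT = (2, 5)     # share of test cases failing at any one time


def open_incident_range(assets, coverage_pct, failure_pct):
    """Return ((min_tests, max_tests), (min_open, max_open)) using integer math."""
    tests = tuple(assets * pct // 100 for pct in coverage_pct)
    # One unresolved incident per failing test case, so open incidents = failing tests.
    incidents = (tests[0] * failure_pct[0] // 100, tests[1] * failure_pct[1] // 100)
    return tests, incidents


tests, incidents = open_incident_range(TOTAL_ASSETS, TEST_COVERAGE_PCT, FAILURE_RATE_PCT)
print(f"test cases: {tests[0]:,} to {tests[1]:,}")
print(f"open incidents: {incidents[0]:,} to {incidents[1]:,}")
```

The upper bound pairs the highest coverage with the highest failure rate, which is how the 75K figure arises.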

---

## Use Cases

**UC-1 — Auto-close incident when test passes**
The system automatically resolves the open incident (reason: AutoResolved) and closes its task. No human intervention required.

**UC-2 — Auto-assign incident on creation**
When a new incident is created, the system automatically assigns it to a configured user or team.

**UC-3 — Auto-close stale incidents (TTL)**
An incident open longer than a configurable deadline is automatically resolved (reason: Expired).

**UC-4 — User-defined hooks on any status transition**
Users wire follow-up steps (notifications, Jira tickets, etc.) on any status change via workflow edges — no code changes.

---

## Decision

### Task Lifecycle Node

A new governance workflow node that does NOT use Flowable's BPMN UserTask. It creates an OpenMetadata task, waits for status changes via `IntermediateCatchEvent`, and exposes each status to the parent workflow for routing.

**Internal BPMN structure:**
```
┌─ SubProcess ───────────────────────────────────────────────────────┐
│                                                                    │
│  [Start] → [Setup] → [Gateway: created?]                           │
│               │      no → [End: skip]                              │
│               │      yes ↓                                         │
│               │      [IntermediateCatchEvent: wait]                │
│               │      ↓ message with {status}                       │
│               │      [Gateway: terminal?]                          │
│               │      yes → [CloseTask] → [SetResult] → [End]       │
│               │      no → [SetResult] → [End]                      │
│               │                                                    │
│               │ Setup (idempotent):                                │
│               │  • Check for existing open incident                │
│               │      → if exists with active process: skip         │
│               │      → if orphaned process: terminate it           │
│               │  • Create incident record (New)                    │
│               │  • Create OM task                                  │
│               │  • Auto-assign (from template config)              │
│               │  • Set process variable omTaskId = task UUID       │
│                                                                    │
│  + [TTL Boundary Timer: configurable, interrupting]                │
│      → [AutoResolve via repository] → [End]                        │
└────────────────────────────────────────────────────────────────────┘
```

**Node config:**
```json
{
  "type": "taskLifecycleNode",
  "config": {
    "template": "incident",
    "statuses": ["New", "Ack", "Assigned", "Resolved"],
    "terminal": ["Resolved"],
    "responsibles": { "source": "tableOwner" },
    "ttl": "P30D"
  }
}
```

The node:
1. **Setup** — Creates the OM task (idempotent on re-entry). Sets `omTaskId` process variable.
2. **Wait** — `IntermediateCatchEvent` with `messageExpression="${omTaskId}"`. Subscribes to a message named after the task UUID (~2 Flowable DB rows).
3. **On message** — Evaluates whether the received status is terminal.
4. **Terminal** — Closes the OM task (idempotent), sets `{nodeName}_result` at parent scope, subprocess exits.
5. **Non-terminal** — Sets `{nodeName}_result` at parent scope, subprocess exits. Parent-level edges route back to the node.
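The five steps above can be sketched as a small in-memory model. This is a hypothetical Python illustration, not the Flowable-backed implementation: `TaskLifecycleNode`, `on_message`, and the `parent_vars` dict (standing in for parent-scope process variables) are assumed names.

```python
from dataclasses import dataclass, field


@dataclass
class TaskLifecycleNode:
    """Toy model of the node: one message in, one result out."""
    name: str
    statuses: list
    terminal: set
    parent_vars: dict = field(default_factory=dict)  # stand-in for parent-scope variables
    task_open: bool = True                           # stand-in for the OM task state

    def on_message(self, status: str) -> bool:
        """Handle one status message; return True when the status is terminal."""
        if status not in self.statuses:
            raise ValueError(f"unknown status: {status}")
        is_terminal = status in self.terminal
        if is_terminal:
            # Closing is idempotent: closing an already closed task changes nothing.
            self.task_open = False
        # Both branches publish the result so parent-level edges can route on it.
        self.parent_vars[f"{self.name}_result"] = status
        return is_terminal


node = TaskLifecycleNode("incident", ["New", "Ack", "Assigned", "Resolved"], {"Resolved"})
print(node.on_message("Ack"), node.task_open, node.parent_vars)
print(node.on_message("Resolved"), node.task_open, node.parent_vars)
```

In this model the only difference between the terminal and non-terminal branches is the task close; routing back to the node is left to the parent graph.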

### Status exposed via graph edges (with cycles)

Status is set as a Flowable process variable when the subprocess exits. Parent-level edges condition on this variable. Non-terminal edges loop back to the node.

```
           ┌────── "ack" ───────────────────────────┐
           │  ┌─── "assigned" → [NotifySlack] ──────┤
           ▼  ▼                                     │
[Start] → [ManageIncident] ── "resolved" → [End]
```

**Workflow definition example:**
```json
{
  "name": "incident-lifecycle",
  "trigger": {
    "type": "eventBasedEntity",
    "config": {
      "entityTypes": ["TestCase"],
      "events": ["Updated"],
      "filter": { "TestCase": { "==": [{"var": "testCaseStatus"}, "Failed"] } }
    }
  },
  "nodes": [
    { "type": "startEvent", "name": "start" },
    { "type": "taskLifecycleNode", "name": "incident", "config": {
      "template": "incident",
      "statuses": ["New", "Ack", "Assigned", "Resolved"],
      "terminal": ["Resolved"],
      "responsibles": { "source": "tableOwner" },
      "ttl": "P30D"
    }},
    { "type": "automatedTask", "subType": "sinkTask", "name": "notifySlack" },
    { "type": "endEvent", "name": "end" }
  ],
  "edges": [
    { "from": "start", "to": "incident" },
    { "from": "incident", "to": "incident", "condition": { "status": "Ack" } },
    { "from": "incident", "to": "notifySlack", "condition": { "status": "Assigned" } },
    { "from": "notifySlack", "to": "incident" },
    { "from": "incident", "to": "end", "condition": { "status": "Resolved" } }
  ]
}
```
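To see how the edges in this definition would route each status, the following Python sketch walks the `edges` list by hand. It is an illustration under stated assumptions: `next_node` and `route` are hypothetical helpers, and the backend compiles edges into a Flowable process definition rather than interpreting them like this.

```python
# Edges copied from the workflow definition example.
EDGES = [
    {"from": "start", "to": "incident"},
    {"from": "incident", "to": "incident", "condition": {"status": "Ack"}},
    {"from": "incident", "to": "notifySlack", "condition": {"status": "Assigned"}},
    {"from": "notifySlack", "to": "incident"},
    {"from": "incident", "to": "end", "condition": {"status": "Resolved"}},
]


def next_node(edges, current, status=None):
    """Return the target of the first edge leaving `current` that matches `status`."""
    for edge in edges:
        if edge["from"] != current:
            continue
        condition = edge.get("condition")
        # Unconditional edges always match; conditional ones compare the status.
        if condition is None or condition.get("status") == status:
            return edge["to"]
    return None


def route(edges, status, node="incident"):
    """List the nodes visited after `node` reports `status`, stopping at `node` or "end"."""
    path = []
    current = next_node(edges, node, status)
    while current is not None:
        path.append(current)
        if current in (node, "end"):
            break
        current = next_node(edges, current)
    return path


for status in ("Ack", "Assigned", "Resolved"):
    print(status, "->", route(EDGES, status))
```

A status with no matching edge (for example `New` in this definition) yields an empty path in the sketch, which is the situation the cycle-validation open question is concerned with.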

### Message delivery via task ChangeEvents

With the task refactor, tasks emit `ChangeEvents` on status changes. These drive message delivery to Flowable:

1. Task status changes (via REST API / `storeInternal`)
2. `ChangeEvent` emitted
3. Listener correlates message to waiting `IntermediateCatchEvent`

The OM task is already updated before the message fires. If correlation fails, the task state is correct — Flowable catches up on the next status change.
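The ordering described here (commit the task first, notify the workflow second) can be sketched as follows. This is a minimal Python model under stated assumptions: `Subscriptions` is a hypothetical stand-in for Flowable's event subscriptions keyed by task UUID, and `change_status` is an illustrative name, not an OpenMetadata method.

```python
class Subscriptions:
    """Hypothetical stand-in for message subscriptions, keyed by task UUID."""

    def __init__(self):
        self._waiting = {}

    def wait(self, task_id, callback):
        self._waiting[task_id] = callback

    def deliver(self, task_id, status):
        """Deliver to a waiting subscriber; return False when nobody is waiting."""
        callback = self._waiting.pop(task_id, None)
        if callback is None:
            return False
        callback(status)
        return True


def change_status(tasks, subscriptions, task_id, status):
    """Write the task status first, then attempt delivery on a best-effort basis."""
    tasks[task_id] = status  # the synchronous write is the source of truth
    try:
        return subscriptions.deliver(task_id, status)
    except Exception:
        # A failure on the workflow side never undoes the committed transition.
        return False


tasks, subs, received = {}, Subscriptions(), []
subs.wait("task-1", received.append)
print(change_status(tasks, subs, "task-1", "Ack"))       # a subscriber was waiting
print(change_status(tasks, subs, "task-1", "Assigned"))  # nobody waiting: skipped
print(tasks, received)
```

The second call shows the skipped-message case: the task ends up in `Assigned` even though the workflow side only ever saw `Ack`.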

**Mechanism TBD**: Listener on task `ChangeEvents` (clean separation) vs direct hook in task status update code (fewer hops).

### What the workflow controls vs the repository

| Action | Who handles it |
| --- | --- |
| Task creation | Node setup phase (idempotent) |
| Status changes (Ack, Assigned, etc.) | Repository — synchronous, unchanged |
| Resolution | Repository — synchronous, unchanged |
| Task closure | Both — node closes on terminal, repository may also close. Idempotent. |
| Flowable notification | Task ChangeEvent → message to IntermediateCatchEvent |
| Follow-up hooks | Workflow edges — user-configurable |
| TTL auto-resolve | Boundary timer on node |
| Auto-close on test pass | Separate short-lived workflow |

### Why this approach

1. **Hooks on any transition.** Status exposed to parent graph → users wire follow-up steps via edges.
2. **Configurable lifecycle.** Template defines statuses and terminal set. No hardcoded lifecycle.
3. **OM task is source of truth.** No BPMN UserTask. ~2 DB rows per task vs ~5-10.
4. **Repository stays in the critical path.** All transitions are synchronous. Flowable is notified after the fact. If Flowable is down, transitions still succeed.
5. **Unified abstraction.** Same node type for incidents, approvals, certifications — different templates.

---

## Consequences

### Positive

- **Hooks on any status transition** without code changes.
- **Configurable lifecycle from day one** via template config.
- **Lightweight** — ~2 Flowable DB rows per task (IntermediateCatchEvent).
- **Safe** — repository owns all transitions synchronously; Flowable is follow-up only.
- **Default workflow replicates current behavior** and ships enabled.
- **Unified abstraction** — incidents, approvals, certifications share one node type.

### Negative

- **MainWorkflow compiler must support cycles.** Today it assumes a DAG. Biggest technical risk.
- **More Flowable interactions.** Every status change sends a message (vs resolution only). ~225K correlations over lifetime of 75K incidents with ~3 transitions each.
- **Task refactor dependency.** Fallback: direct `reportOutcome()` from `storeInternal()` if not ready.

### Neutral

- REST API surface unchanged.
- `TestCaseResolutionStatus` schema changes minimally (add `AutoResolved`, `Expired` reasons).
- Resolution business logic in the repository is unchanged.

---

## Alternatives Considered

### Bookends only (no intermediate state hooks)

Handle only creation + resolution in the workflow. Intermediate states stay entirely in `storeInternal()`.

**Not chosen:** Users cannot wire hooks on Ack/Assigned. The task refactor makes full lifecycle hooks possible now — deferring them means two migrations.

### Internal loop (cycle hidden inside SubProcess)

The message loop lives inside the node. Status exposed only on terminal exit. Outer graph stays a DAG.

**Not chosen:** Users cannot wire hooks on non-terminal transitions. The point is exposing every status change to the parent graph.

### Resolution through Flowable (not fire-and-forget)

Route resolution through the Flowable process.

**Not chosen:** Puts Flowable in the critical path. If Flowable is slow/down, resolution is blocked.

### Extend state machine with Java hooks

**Rejected:** Parallel automation system, requires code changes for every new behavior.

### CMMN (Case Management)

**Rejected:** Zero existing infrastructure, overkill.

---

## Design Choices

### IntermediateCatchEvent with messageExpression

`messageExpression="${omTaskId}"` gives unique-per-instance subscriptions. `EventSubscriptionQuery.eventName(taskId)` is an indexed lookup. No MessageCorrelationBuilder (doesn't exist in Flowable 7.2.0).

### Idempotent setup on loop re-entry

When non-terminal edges loop back, Setup detects the existing task and reuses it. Safe for any number of loops.

### Terminal auto-close — both sides

`storeInternal(Resolved)` closes the task. The node's `CloseTask` also closes on terminal status. Both are idempotent. This handles TTL (node-initiated) and human resolution (repository-initiated) uniformly.

### Business key = test case FQN

Enables idempotent creation, fire-and-forget termination, auto-close correlation.

### Governance-bot loop prevention

`WorkflowEventConsumer` skips events from `governance-bot`. The workflow runs as `governance-bot`, so its own events don't re-trigger workflows.
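The filter can be pictured as a one-line predicate. The Python below is only a sketch: `should_trigger_workflow` is a hypothetical name, and reading the actor from a `userName` key on the event is an assumption made for illustration.

```python
GOVERNANCE_BOT = "governance-bot"


def should_trigger_workflow(event: dict) -> bool:
    """Skip events authored by the workflow's own bot user to avoid feedback loops."""
    # Assumption: the acting user is available under the "userName" key.
    return event.get("userName") != GOVERNANCE_BOT


events = [
    {"userName": "alice", "entityType": "TestCase"},
    {"userName": "governance-bot", "entityType": "TestCase"},
]
print([e["userName"] for e in events if should_trigger_workflow(e)])
```

Only the human-authored event passes the filter, so a status change written by the workflow itself does not start another workflow run.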

---

## Open Questions

- [ ] **Message delivery mechanism**: Listener on task ChangeEvents vs direct hook in task status update.
- [ ] **TestCaseResult.incidentId linking**: If creation moves to async workflow, test result may store before incident exists. Recommendation: keep `getOrCreateIncident()` synchronous.
- [ ] **Cycle validation**: Should the compiler enforce that every non-terminal edge path routes back to a task node?

---

## Risks

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Cycle support in MainWorkflow | Blocks the design | Spike early. Workaround: invisible gateway node. |
| Task refactor not ready | No ChangeEvents for message delivery | Fall back to direct reportOutcome() from storeInternal() |
| Race condition | Message lost during follow-up execution | EventSubscriptionQuery returns null → skipped. Java-side buffer later. |
| ACT_RU growth | ~2 rows per open incident | 75K incidents = 150K rows. Measure in hardening phase. |
| Process orphaning | Never-resolved incidents linger | TTL handles deadlines. Batch sweep for the rest. |

---

## Follow-up Work

1. **Batch sweep** for orphaned processes.
2. **Migrate UserApprovalTask** (glossary) to same node type with `template: "approval"`.
3. **SLA timer escalation** — optional boundary timer using same infrastructure as TTL.

---

## References

- `TestCaseResolutionStatusRepository.storeInternal()` — Current state machine
- `WorkflowHandler.java` — Flowable ProcessEngine, message delivery
- `MainWorkflow.java` — BPMN compiler (needs cycle support)
- `UserApprovalTask.java` — Current UserTask pattern (being replaced)
- `NodeFactory.java` — Node type registration
- `WorkflowEventConsumer.java` — Event routing, governance-bot loop prevention

@ -8,7 +8,6 @@ PROJECT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"

# Default values
SERVER_URL="http://localhost:8585"
RECREATE_INDEX=false
ENTITY_TYPES=""
BATCH_SIZE=100
PARTITION_SIZE=10000

@ -20,10 +19,6 @@ while [[ $# -gt 0 ]]; do
SERVER_URL="$2"
shift 2
;;
--recreate)
RECREATE_INDEX=true
shift
;;
--entities)
ENTITY_TYPES="$2"
shift 2

@ -41,7 +36,6 @@ while [[ $# -gt 0 ]]; do
echo ""
echo "Options:"
echo " --server URL Target server URL (default: http://localhost:8585)"
echo " --recreate Drop and recreate indices before reindexing"
echo " --entities TYPES Comma-separated entity types to reindex (default: all)"
echo " --batch-size NUM Batch size for indexing (default: 100)"
echo " --partition-size NUM Partition size for distributed indexing (default: 10000, range: 1000-50000)"

@ -51,7 +45,6 @@ while [[ $# -gt 0 ]]; do
echo "Examples:"
echo " $0 # Reindex all on server 1"
echo " $0 --server http://localhost:8587 # Trigger on server 2"
echo " $0 --recreate # Drop and recreate indices"
echo " $0 --entities table,dashboard # Reindex only tables and dashboards"
echo " $0 --partition-size 2000 # Use smaller partitions for better distribution"
exit 0

@ -67,7 +60,7 @@ echo "======================================"
echo "Triggering Search Reindexing"
echo "======================================"
echo "Server: $SERVER_URL"
echo "Recreate indices: $RECREATE_INDEX"
echo "Indexing mode: staged indexes with alias promotion"
echo "Batch size: $BATCH_SIZE"
echo "Partition size: $PARTITION_SIZE"
if [ -n "$ENTITY_TYPES" ]; then

@ -96,13 +89,6 @@ fi
echo "Authenticated successfully."
echo ""

# Build the reindex request body
if [ "$RECREATE_INDEX" == "true" ]; then
RECREATE_FLAG="true"
else
RECREATE_FLAG="false"
fi

# Build entities array
if [ -n "$ENTITY_TYPES" ]; then
# Convert comma-separated to JSON array

@ -113,11 +99,9 @@ fi

REQUEST_BODY=$(cat <<EOF
{
"recreateIndex": $RECREATE_FLAG,
"entities": $ENTITIES_JSON,
"batchSize": $BATCH_SIZE,
"partitionSize": $PARTITION_SIZE,
"useDistributedIndexing": true,
"runMode": "BATCH"
}
EOF

@ -76,4 +76,3 @@ SET json = JSON_REMOVE(JSON_REMOVE(json, '$.inputPorts'), '$.outputPorts')
WHERE jsonSchema = 'dataProduct'
  AND (JSON_CONTAINS_PATH(json, 'one', '$.inputPorts')
  OR JSON_CONTAINS_PATH(json, 'one', '$.outputPorts'));


@ -110,3 +110,4 @@ SET json = json::jsonb - 'inputPorts' - 'outputPorts'
WHERE jsonSchema = 'dataProduct'
  AND (json::jsonb ?? 'inputPorts' OR json::jsonb ?? 'outputPorts');


@ -1 +1,27 @@
-- Placeholder for 1.12.6 MySQL post data migration SQL script
-- Remove pipeline annotation from service-level, domain-level, and dataProduct-level lineage edges.
-- These edges incorrectly inherited the pipeline annotation from entity-level lineage, causing service
-- nodes to appear in entity-level lineage views and the "By Service" view to be empty for pipeline
-- entities. After this migration, run an Elasticsearch/OpenSearch reindex to update search documents.
UPDATE entity_relationship
SET json = JSON_REMOVE(json, '$.pipeline')
WHERE fromEntity IN ('databaseService', 'messagingService', 'pipelineService', 'dashboardService',
  'mlmodelService', 'metadataService', 'storageService', 'searchService', 'apiService',
  'driveService')
  AND toEntity IN ('databaseService', 'messagingService', 'pipelineService', 'dashboardService',
  'mlmodelService', 'metadataService', 'storageService', 'searchService', 'apiService',
  'driveService')
  AND relation = 13
  AND JSON_CONTAINS_PATH(json, 'one', '$.pipeline');

UPDATE entity_relationship
SET json = JSON_REMOVE(json, '$.pipeline')
WHERE fromEntity = 'domain' AND toEntity = 'domain'
  AND relation = 13
  AND JSON_EXTRACT(json, '$.pipeline') IS NOT NULL;

UPDATE entity_relationship
SET json = JSON_REMOVE(json, '$.pipeline')
WHERE fromEntity = 'dataProduct' AND toEntity = 'dataProduct'
  AND relation = 13
  AND JSON_EXTRACT(json, '$.pipeline') IS NOT NULL;

@ -1 +1,27 @@
-- Placeholder for 1.12.6 Postgres post data migration SQL script
-- Remove pipeline annotation from service-level, domain-level, and dataProduct-level lineage edges.
-- These edges incorrectly inherited the pipeline annotation from entity-level lineage, causing service
-- nodes to appear in entity-level lineage views and the "By Service" view to be empty for pipeline
-- entities. After this migration, run an Elasticsearch/OpenSearch reindex to update search documents.
UPDATE entity_relationship
SET json = json - 'pipeline'
WHERE fromentity IN ('databaseService', 'messagingService', 'pipelineService', 'dashboardService',
  'mlmodelService', 'metadataService', 'storageService', 'searchService', 'apiService',
  'driveService')
  AND toentity IN ('databaseService', 'messagingService', 'pipelineService', 'dashboardService',
  'mlmodelService', 'metadataService', 'storageService', 'searchService', 'apiService',
  'driveService')
  AND relation = 13
  AND json ?? 'pipeline';

UPDATE entity_relationship
SET json = json - 'pipeline'
WHERE fromentity = 'domain' AND toentity = 'domain'
  AND relation = 13
  AND json ?? 'pipeline';

UPDATE entity_relationship
SET json = json - 'pipeline'
WHERE fromentity = 'dataProduct' AND toentity = 'dataProduct'
  AND relation = 13
  AND json ?? 'pipeline';


@ -1 +1 @@
-- Placeholder for 1.12.6 Postgres schema changes
-- Placeholder for 1.12.6 PostgreSQL schema changes


@ -0,0 +1 @@
-- Placeholder for 1.12.7 MySQL post-data migration script

@ -0,0 +1,4 @@
-- Placeholder for 1.12.7 MySQL schema changes
-- The Postgres-side fix for #27158 has no MySQL counterpart: MySQL's
-- 1.11.0 indexes were already non-partial (no partial-index syntax in
-- MySQL), so the regression that hit Postgres did not affect MySQL.

@ -0,0 +1 @@
-- Placeholder for 1.12.7 PostgreSQL post-data migration script

@ -0,0 +1,32 @@
-- Issue #27158: tag_usage seq-scan on Postgres. #24063 dropped the
-- `state = 1` predicate that 1.11.0's partial indexes required.
-- Fix: add single-col indexes on the `_lower` columns, and drop the
-- `WHERE state = 1` filter from the partials so changes can't invalidate them.

DROP INDEX CONCURRENTLY IF EXISTS idx_tag_usage_targetfqnhash_lower_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tag_usage_targetfqnhash_lower_pattern
  ON tag_usage (targetfqnhash_lower text_pattern_ops);

DROP INDEX CONCURRENTLY IF EXISTS idx_tag_usage_tagfqn_lower_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tag_usage_tagfqn_lower_pattern
  ON tag_usage (tagfqn_lower text_pattern_ops);

DROP INDEX CONCURRENTLY IF EXISTS idx_tag_usage_target_prefix_covering;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tag_usage_target_prefix_covering
  ON tag_usage (source, targetfqnhash_lower text_pattern_ops)
  INCLUDE (tagFQN, labelType, state);

DROP INDEX CONCURRENTLY IF EXISTS idx_tag_usage_tagfqn_prefix_covering;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tag_usage_tagfqn_prefix_covering
  ON tag_usage (source, tagfqn_lower text_pattern_ops)
  INCLUDE (targetFQNHash, labelType, state);

DROP INDEX CONCURRENTLY IF EXISTS idx_tag_usage_join_source;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_tag_usage_join_source
  ON tag_usage (tagFQNHash, source)
  INCLUDE (targetFQNHash, tagFQN, labelType, state);

CREATE EXTENSION IF NOT EXISTS pg_trgm;
DROP INDEX CONCURRENTLY IF EXISTS gin_tag_usage_targetfqn_trgm;
CREATE INDEX CONCURRENTLY IF NOT EXISTS gin_tag_usage_targetfqn_trgm
  ON tag_usage USING GIN (targetFQNHash gin_trgm_ops);

@ -0,0 +1,19 @@
-- Fix PII classification autoClassificationConfig (issue #27910)
UPDATE classification
SET json = JSON_SET(
  json,
  '$.autoClassificationConfig',
  CAST('{"enabled": true, "conflictResolution": "highest_priority", "minimumConfidence": 0.6, "requireExplicitMatch": true}' AS JSON)
)
WHERE JSON_VALUE(json, '$.name' RETURNING CHAR) = 'PII'
  AND JSON_EXTRACT(json, '$.autoClassificationConfig.enabled') IS NULL;

-- Fix PII tags autoClassificationEnabled (issue #27910)
UPDATE tag
SET json = JSON_SET(json, '$.autoClassificationEnabled', CAST('true' AS JSON))
WHERE JSON_VALUE(json, '$.classification.name' RETURNING CHAR) = 'PII'
  AND JSON_VALUE(json, '$.name' RETURNING CHAR) IN ('NonSensitive', 'Sensitive')
  AND (
    JSON_EXTRACT(json, '$.autoClassificationEnabled') IS NULL
    OR JSON_EXTRACT(json, '$.autoClassificationEnabled') = false
  );

@ -0,0 +1,19 @@
-- Fix PII classification autoClassificationConfig (issue #27910)
UPDATE classification
SET json = jsonb_set(
  json::jsonb,
  '{autoClassificationConfig}',
  '{"enabled": true, "conflictResolution": "highest_priority", "minimumConfidence": 0.6, "requireExplicitMatch": true}'::jsonb
)::json
WHERE json->>'name' = 'PII'
  AND json->'autoClassificationConfig'->>'enabled' IS NULL;

-- Fix PII tags autoClassificationEnabled (issue #27910)
UPDATE tag
SET json = jsonb_set(json::jsonb, '{autoClassificationEnabled}', 'true'::jsonb)::json
WHERE json->'classification'->>'name' = 'PII'
  AND json->>'name' IN ('NonSensitive', 'Sensitive')
  AND (
    json->>'autoClassificationEnabled' IS NULL
    OR (json->>'autoClassificationEnabled')::boolean = false
  );

@ -0,0 +1,5 @@
-- Placeholder for 1.12.9 MySQL schema changes
-- The Postgres-side fix for collate#3488 has no MySQL counterpart: MySQL's
-- 1.1.5 unique constraint on profiler_data_time_series was never dropped
-- (MODIFY COLUMN re-evaluates generated expressions in place), so the
-- regression that hit Postgres did not affect MySQL.

@ -0,0 +1,62 @@
|
|||
-- Boost memory for the dedup + index build. RESET at end.
|
||||
SET work_mem = '256MB';
|
||||
SET maintenance_work_mem = '512MB';
|
||||
|
||||
-- Dedup before the unique index rebuild. NULL filter on operation: Postgres
|
||||
-- UNIQUE treats NULLs as DISTINCT, so the constraint never blocked tableProfile
|
||||
-- / columnProfile rows (operation = NULL). GROUP BY treats NULLs as equal —
|
||||
-- without the filter we'd collapse rows the constraint never rejected.
|
||||
DELETE FROM profiler_data_time_series p
|
||||
USING (
|
||||
SELECT entityFQNHash, extension, operation, "timestamp", MAX(ctid) AS keep_ctid
|
||||
FROM profiler_data_time_series
|
||||
WHERE operation IS NOT NULL
|
||||
AND entityFQNHash IS NOT NULL
|
||||
GROUP BY entityFQNHash, extension, operation, "timestamp"
|
||||
HAVING COUNT(*) > 1
|
||||
) d
|
||||
WHERE p.entityFQNHash = d.entityFQNHash
|
||||
AND p.extension = d.extension
|
||||
AND p.operation = d.operation
|
||||
AND p."timestamp" = d."timestamp"
|
||||
AND p.ctid <> d.keep_ctid;
|
||||
|
||||
-- Recover from a prior failed CREATE UNIQUE INDEX CONCURRENTLY: drop the
|
||||
-- invalid leftover and rebuild inline so ALTER below can promote it.
|
||||
DO $$
|
||||
DECLARE
|
||||
invalid_idx oid;
|
||||
BEGIN
|
||||
SELECT i.indexrelid INTO invalid_idx
|
||||
FROM pg_index i
|
||||
JOIN pg_class idx ON idx.oid = i.indexrelid
|
||||
WHERE idx.relname = 'profiler_data_time_series_unique_hash_extension_ts'
|
||||
AND i.indrelid = 'profiler_data_time_series'::regclass
|
||||
AND NOT i.indisvalid;
|
||||
|
||||
IF invalid_idx IS NOT NULL THEN
|
||||
EXECUTE 'DROP INDEX ' || invalid_idx::regclass;
|
||||
EXECUTE 'CREATE UNIQUE INDEX profiler_data_time_series_unique_hash_extension_ts '
|
||||
|| 'ON profiler_data_time_series '
|
||||
|| '(entityFQNHash, extension, operation, "timestamp")';
|
||||
END IF;
|
||||
END $$;
|
||||
|
||||
-- Restore the unique constraint dropped in 1.9.9. Closes the 1.9.9 regression that caused
|
||||
-- /columns?fields=profile 504s, and brings Postgres back in line with MySQL (which never
|
||||
-- lost it). The leading (entityFQNHash, extension) prefix serves the column-profile batch query.
|
||||
-- Two-phase: CONCURRENTLY build avoids ACCESS EXCLUSIVE lock; ADD CONSTRAINT USING INDEX
|
||||
-- promotes the built index without re-scanning.
|
||||
CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS
|
||||
profiler_data_time_series_unique_hash_extension_ts
|
||||
ON profiler_data_time_series (entityFQNHash, extension, operation, timestamp);
|
||||
|
||||
ALTER TABLE profiler_data_time_series
|
||||
ADD CONSTRAINT profiler_data_time_series_unique_hash_extension_ts
|
||||
UNIQUE USING INDEX profiler_data_time_series_unique_hash_extension_ts;
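-- Illustrative post-check, not part of this migration: one way to confirm the
-- index was promoted to a constraint and is valid. It reads only the standard
-- pg_constraint and pg_index catalogs.
--   SELECT con.conname, con.contype, i.indisvalid
--   FROM pg_constraint con
--   JOIN pg_index i ON i.indexrelid = con.conindid
--   WHERE con.conrelid = 'profiler_data_time_series'::regclass
--     AND con.conname = 'profiler_data_time_series_unique_hash_extension_ts';
-- A row with contype = 'u' and indisvalid = true would indicate success.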
|
||||
|
||||
ANALYZE profiler_data_time_series;
|
||||
|
||||
-- Reset session memory before the connection returns to the pool.
|
||||
RESET work_mem;
|
||||
RESET maintenance_work_mem;
|
||||
|
|
@@ -80,8 +80,46 @@ UPDATE glossary_term_entity
|
|||
SET json = JSON_REMOVE(json, '$.relatedTerms')
|
||||
WHERE JSON_EXTRACT(json, '$.relatedTerms') IS NOT NULL;
|
||||
|
||||
-- entity_extension version snapshots: handled by Java migration
|
||||
-- migrateGlossaryTermVersionRelatedTermsToTermRelation (transforms in place to preserve history).
|
||||
|
||||
-- Backfill conceptMappings for existing glossary terms
|
||||
UPDATE glossary_term_entity
|
||||
SET json = JSON_SET(COALESCE(json, '{}'), '$.conceptMappings', JSON_ARRAY())
|
||||
WHERE JSON_EXTRACT(json, '$.conceptMappings') IS NULL;
|
||||
|
||||
-- Add Container permissions to AutoClassificationBotPolicy for storage auto-classification support
|
||||
UPDATE policy_entity
|
||||
SET json = JSON_ARRAY_INSERT(
|
||||
json,
|
||||
'$.rules[1]',
|
||||
JSON_OBJECT(
|
||||
'name', 'AutoClassificationBotRule-Allow-Container',
|
||||
'description', 'Allow adding tags and sample data to the containers',
|
||||
'resources', JSON_ARRAY('Container'),
|
||||
'operations', JSON_ARRAY('EditAll', 'ViewAll'),
|
||||
'effect', 'allow'
|
||||
)
|
||||
)
|
||||
WHERE JSON_UNQUOTE(JSON_EXTRACT(json, '$.name')) = 'AutoClassificationBotPolicy'
|
||||
AND (JSON_EXTRACT(json, '$.rules[1].name') IS NULL OR JSON_EXTRACT(json, '$.rules[1].name') != 'AutoClassificationBotRule-Allow-Container');
|
||||
|
||||
-- Fix PII classification autoClassificationConfig (issue #27910)
|
||||
UPDATE classification
|
||||
SET json = JSON_SET(
|
||||
json,
|
||||
'$.autoClassificationConfig',
|
||||
CAST('{"enabled": true, "conflictResolution": "highest_priority", "minimumConfidence": 0.6, "requireExplicitMatch": true}' AS JSON)
|
||||
)
|
||||
WHERE JSON_VALUE(json, '$.name' RETURNING CHAR) = 'PII'
|
||||
AND JSON_EXTRACT(json, '$.autoClassificationConfig.enabled') IS NULL;
|
||||
|
||||
-- Fix PII tags autoClassificationEnabled (issue #27910)
|
||||
UPDATE tag
|
||||
SET json = JSON_SET(json, '$.autoClassificationEnabled', CAST('true' AS JSON))
|
||||
WHERE JSON_VALUE(json, '$.classification.name' RETURNING CHAR) = 'PII'
|
||||
AND JSON_VALUE(json, '$.name' RETURNING CHAR) IN ('NonSensitive', 'Sensitive')
|
||||
AND (
|
||||
JSON_EXTRACT(json, '$.autoClassificationEnabled') IS NULL
|
||||
OR JSON_EXTRACT(json, '$.autoClassificationEnabled') = false
|
||||
);
|
||||
|
|
|
|||
|
|
@@ -130,6 +130,128 @@ FROM user_entity ue, role_entity re
|
|||
WHERE ue.name = 'mcpapplicationbot'
|
||||
AND re.name = 'ApplicationBotImpersonationRole';
|
||||
|
||||
-- Update Databricks and Unity Catalog connection schemes from 'databricks+connector' to 'databricks'
|
||||
-- as part of migration from sqlalchemy-databricks to databricks-sqlalchemy package
|
||||
UPDATE dbservice_entity
|
||||
SET json = JSON_SET(json, '$.connection.config.scheme', 'databricks')
|
||||
WHERE serviceType IN ('Databricks', 'UnityCatalog')
|
||||
AND JSON_UNQUOTE(JSON_EXTRACT(json, '$.connection.config.scheme')) = 'databricks+connector';
|
||||
|
||||
UPDATE entity_extension
|
||||
SET json = JSON_SET(
|
||||
json,
|
||||
'$.profileSampleConfig',
|
||||
JSON_OBJECT(
|
||||
'sampleConfigType', 'STATIC',
|
||||
'config', JSON_OBJECT(
|
||||
'profileSample', JSON_EXTRACT(json, '$.profileSample'),
|
||||
'profileSampleType', COALESCE(
|
||||
JSON_EXTRACT(json, '$.profileSampleType'),
|
||||
CAST('"PERCENTAGE"' AS JSON)
|
||||
),
|
||||
'samplingMethodType', JSON_EXTRACT(json, '$.samplingMethodType')
|
||||
)
|
||||
)
|
||||
)
|
||||
WHERE extension IN (
|
||||
'table.tableProfilerConfig',
|
||||
'database.databaseProfilerConfig',
|
||||
'databaseSchema.databaseSchemaProfilerConfig'
|
||||
)
|
||||
AND JSON_EXTRACT(json, '$.profileSample') IS NOT NULL
|
||||
AND JSON_TYPE(JSON_EXTRACT(json, '$.profileSample')) != 'NULL'
|
||||
AND NOT JSON_CONTAINS_PATH(json, 'one', '$.profileSampleConfig');
|
||||
|
||||
-- entity_extension: remove old flat fields
|
||||
UPDATE entity_extension
|
||||
SET json = JSON_REMOVE(
|
||||
JSON_REMOVE(
|
||||
JSON_REMOVE(json, '$.samplingMethodType'),
|
||||
'$.profileSampleType'
|
||||
),
|
||||
'$.profileSample'
|
||||
)
|
||||
WHERE extension IN (
|
||||
'table.tableProfilerConfig',
|
||||
'database.databaseProfilerConfig',
|
||||
'databaseSchema.databaseSchemaProfilerConfig'
|
||||
)
|
||||
AND (JSON_CONTAINS_PATH(json, 'one', '$.profileSample')
|
||||
OR JSON_CONTAINS_PATH(json, 'one', '$.profileSampleType')
|
||||
OR JSON_CONTAINS_PATH(json, 'one', '$.samplingMethodType'));
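-- Illustrative example, not part of this migration: assuming a hypothetical
-- entity_extension row whose json is
--   {"profileSample": 50, "profileSampleType": "PERCENTAGE"}
-- the two UPDATEs above would be expected to leave
--   {"profileSampleConfig": {"sampleConfigType": "STATIC",
--     "config": {"profileSample": 50, "profileSampleType": "PERCENTAGE",
--                "samplingMethodType": null}}}
-- samplingMethodType becomes JSON null because JSON_EXTRACT returns SQL NULL
-- for a missing path and JSON_OBJECT stores that as null.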
|
||||
|
||||
-- ingestion_pipeline_entity (profiler pipelines): build profileSampleConfig (skip if already migrated)
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = JSON_SET(
|
||||
json,
|
||||
'$.sourceConfig.config.profileSampleConfig',
|
||||
JSON_OBJECT(
|
||||
'sampleConfigType', 'STATIC',
|
||||
'config', JSON_OBJECT(
|
||||
'profileSample', JSON_EXTRACT(json, '$.sourceConfig.config.profileSample'),
|
||||
'profileSampleType', COALESCE(
|
||||
JSON_EXTRACT(json, '$.sourceConfig.config.profileSampleType'),
|
||||
CAST('"PERCENTAGE"' AS JSON)
|
||||
),
|
||||
'samplingMethodType', JSON_EXTRACT(json, '$.sourceConfig.config.samplingMethodType')
|
||||
)
|
||||
)
|
||||
)
|
||||
WHERE pipelineType = 'profiler'
|
||||
AND JSON_EXTRACT(json, '$.sourceConfig.config.profileSample') IS NOT NULL
|
||||
AND JSON_TYPE(JSON_EXTRACT(json, '$.sourceConfig.config.profileSample')) != 'NULL'
|
||||
AND NOT JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.profileSampleConfig');
|
||||
|
||||
-- ingestion_pipeline_entity (profiler pipelines): remove old flat fields
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = JSON_REMOVE(
|
||||
JSON_REMOVE(
|
||||
JSON_REMOVE(json, '$.sourceConfig.config.samplingMethodType'),
|
||||
'$.sourceConfig.config.profileSampleType'
|
||||
),
|
||||
'$.sourceConfig.config.profileSample'
|
||||
)
|
||||
WHERE pipelineType = 'profiler'
|
||||
AND (JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.profileSample')
|
||||
OR JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.profileSampleType')
|
||||
OR JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.samplingMethodType'));
|
||||
|
||||
-- ingestion_pipeline_entity (testSuite pipelines): build profileSampleConfig (skip if already migrated)
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = JSON_SET(
|
||||
json,
|
||||
'$.sourceConfig.config.profileSampleConfig',
|
||||
JSON_OBJECT(
|
||||
'sampleConfigType', 'STATIC',
|
||||
'config', JSON_OBJECT(
|
||||
'profileSample', JSON_EXTRACT(json, '$.sourceConfig.config.profileSample'),
|
||||
'profileSampleType', COALESCE(
|
||||
JSON_EXTRACT(json, '$.sourceConfig.config.profileSampleType'),
|
||||
CAST('"PERCENTAGE"' AS JSON)
|
||||
),
|
||||
'samplingMethodType', JSON_EXTRACT(json, '$.sourceConfig.config.samplingMethodType')
|
||||
)
|
||||
)
|
||||
)
|
||||
WHERE pipelineType = 'testSuite'
|
||||
AND JSON_EXTRACT(json, '$.sourceConfig.config.profileSample') IS NOT NULL
|
||||
AND JSON_TYPE(JSON_EXTRACT(json, '$.sourceConfig.config.profileSample')) != 'NULL'
|
||||
AND NOT JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.profileSampleConfig');
|
||||
|
||||
-- ingestion_pipeline_entity (testSuite pipelines): remove old flat fields
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = JSON_REMOVE(
|
||||
JSON_REMOVE(
|
||||
JSON_REMOVE(json, '$.sourceConfig.config.samplingMethodType'),
|
||||
'$.sourceConfig.config.profileSampleType'
|
||||
),
|
||||
'$.sourceConfig.config.profileSample'
|
||||
)
|
||||
WHERE pipelineType = 'testSuite'
|
||||
AND (JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.profileSample')
|
||||
OR JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.profileSampleType')
|
||||
OR JSON_CONTAINS_PATH(json, 'one', '$.sourceConfig.config.samplingMethodType'));
|
||||
|
||||
-- RDF distributed indexing state tables
|
||||
CREATE TABLE IF NOT EXISTS rdf_index_job (
|
||||
id VARCHAR(36) NOT NULL,
|
||||
|
|
@@ -208,3 +330,71 @@ CREATE TABLE IF NOT EXISTS rdf_index_server_stats (
|
|||
UNIQUE INDEX idx_rdf_index_server_stats_job_server_entity (jobId, serverId, entityType),
|
||||
INDEX idx_rdf_index_server_stats_job_id (jobId)
|
||||
);
|
||||
|
||||
-- Speeds up the NOT EXISTS anti-join used by ContainerDAO root-only listings
|
||||
-- (?root=true&service=...). Covers the subquery's filter and projection so the
|
||||
-- planner can answer "does this container have a parent?" with an index-only
|
||||
-- scan instead of materializing the child-edge set.
|
||||
CREATE INDEX idx_er_fromentity_toentity_relation_toid
|
||||
ON entity_relationship (fromEntity, toEntity, relation, toId);
|
||||
|
||||
-- Add per-stage cumulative timing columns to search_index_server_stats so the
|
||||
-- distributed aggregator can surface where reindex latency is being spent
|
||||
-- (DB read in Reader, doc-build in Process, OpenSearch bulk in Sink, embeddings
|
||||
-- in Vector). Stored as BIGINT milliseconds; UI computes avg latency and
|
||||
-- throughput client-side from totalTimeMs / successRecords.
|
||||
ALTER TABLE search_index_server_stats
|
||||
ADD COLUMN readerTimeMs BIGINT NOT NULL DEFAULT 0,
|
||||
ADD COLUMN processTimeMs BIGINT NOT NULL DEFAULT 0,
|
||||
ADD COLUMN sinkTimeMs BIGINT NOT NULL DEFAULT 0,
|
||||
ADD COLUMN vectorTimeMs BIGINT NOT NULL DEFAULT 0;
|
||||
|
||||
-- The Postgres counterpart to this file adds a `text_pattern_ops` index
|
||||
-- on `fqnHash` for every entity table to make `?service=` / `?database=` /
|
||||
-- `?databaseSchema=` / `?parent=` listings (which compile to
|
||||
-- `fqnHash LIKE 'prefix%'`) index-driven instead of seq-scan-driven on RDS.
|
||||
-- MySQL does not need an equivalent: every entity-table `fqnHash` column is
|
||||
-- already declared `CHARACTER SET ascii COLLATE ascii_bin`, a binary
|
||||
-- collation that lets the existing unique B-tree on `fqnHash` answer LIKE
|
||||
-- prefix predicates directly. No change required on the MySQL side.
|
||||
|
||||
-- MCP OAuth: state parameter is opaque per RFC 6749 §4.1.1 and some clients (notably the
|
||||
-- Databricks MCP Proxy) send tokens longer than 255 characters. Widen mcp_state to TEXT to
|
||||
-- avoid INSERT failures on /mcp/authorize redirects.
|
||||
ALTER TABLE mcp_pending_auth_requests
|
||||
MODIFY COLUMN mcp_state TEXT;
|
||||
|
||||
-- Allow multiple typed relations between the same pair of glossary terms.
|
||||
-- The previous PRIMARY KEY (fromId, toId, relation) caused INSERT ... ON DUPLICATE
|
||||
-- KEY UPDATE to overwrite the json discriminator when a second relationType
|
||||
-- ("synonym" + "seeAlso", etc.) was added between the same two terms, silently
|
||||
-- dropping the first relationship. Adding relationType to the PK lets the same
|
||||
-- (fromId, toId, RELATED_TO) pair carry one row per relation type.
|
||||
-- `IF NOT EXISTS` on `ADD COLUMN` only landed in MySQL 8.0.29; supported 8.0.x
|
||||
-- deployments may be older, so use plain ADD COLUMN. SERVER_CHANGE_LOG gates
|
||||
-- re-execution at the framework level — same reasoning as the PK swap below.
|
||||
ALTER TABLE entity_relationship
|
||||
ADD COLUMN `relationType` varchar(64) NOT NULL DEFAULT '' AFTER `relation`;
|
||||
|
||||
-- Backfill relationType for every glossary-term ↔ glossary-term RELATED_TO row.
|
||||
-- Pre-1.13 data has json = NULL (no discriminator existed yet) — those rows MUST
|
||||
-- collapse onto 'relatedTo' so that a subsequent insert of the same logical
|
||||
-- relation matches the existing row instead of creating a duplicate under a
|
||||
-- different PK. relation=15 is the ordinal of Relationship.RELATED_TO (see
|
||||
-- openmetadata-spec entityRelationship.json). 'relatedTo' is the default
|
||||
-- relation type that the application code uses when none is specified.
|
||||
UPDATE entity_relationship
|
||||
SET relationType =
|
||||
COALESCE(NULLIF(JSON_UNQUOTE(JSON_EXTRACT(json, '$.relationType')), ''), 'relatedTo')
|
||||
WHERE fromEntity = 'glossaryTerm'
|
||||
AND toEntity = 'glossaryTerm'
|
||||
AND relation = 15;
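-- Illustrative post-check, not part of this migration: a sketch for verifying
-- the backfill. After the UPDATE above, no glossary-term RELATED_TO row should
-- still carry the empty-string default.
--   SELECT relationType, COUNT(*) AS row_count
--   FROM entity_relationship
--   WHERE fromEntity = 'glossaryTerm'
--     AND toEntity = 'glossaryTerm'
--     AND relation = 15
--   GROUP BY relationType;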
|
||||
|
||||
-- Swap the PK to include relationType. The native migration framework tracks
|
||||
-- completion in SERVER_CHANGE_LOG so this runs once per upgrade; we intentionally
|
||||
-- avoid information_schema gating because least-privilege migration users may
|
||||
-- not have SELECT on it. A manual replay of this step on an already-migrated
|
||||
-- table will rebuild the PK with the same columns — wasteful but not broken.
|
||||
ALTER TABLE entity_relationship
|
||||
DROP PRIMARY KEY,
|
||||
ADD PRIMARY KEY (`fromId`, `toId`, `relation`, `relationType`);
|
||||
|
|
|
|||
|
|
@@ -82,8 +82,47 @@ UPDATE glossary_term_entity
|
|||
SET json = (json::jsonb - 'relatedTerms')::json
|
||||
WHERE jsonb_exists(json::jsonb, 'relatedTerms');
|
||||
|
||||
-- entity_extension version snapshots: handled by Java migration
|
||||
-- migrateGlossaryTermVersionRelatedTermsToTermRelation (transforms in place to preserve history).
|
||||
|
||||
-- Backfill conceptMappings for existing glossary terms
|
||||
UPDATE glossary_term_entity
|
||||
SET json = jsonb_set(COALESCE(json::jsonb, '{}'::jsonb), '{conceptMappings}', '[]'::jsonb)
|
||||
WHERE json IS NULL OR json::jsonb->'conceptMappings' IS NULL;
|
||||
|
||||
-- Add Container permissions to AutoClassificationBotPolicy for storage auto-classification support
|
||||
UPDATE policy_entity
|
||||
SET json = jsonb_insert(
|
||||
json::jsonb,
|
||||
'{rules,1}',
|
||||
jsonb_build_object(
|
||||
'name', 'AutoClassificationBotRule-Allow-Container',
|
||||
'description', 'Allow adding tags and sample data to the containers',
|
||||
'resources', jsonb_build_array('Container'),
|
||||
'operations', jsonb_build_array('EditAll', 'ViewAll'),
|
||||
'effect', 'allow'
|
||||
)
|
||||
)
|
||||
WHERE json->>'name' = 'AutoClassificationBotPolicy'
|
||||
AND (json->'rules'->1->>'name' IS NULL OR json->'rules'->1->>'name' != 'AutoClassificationBotRule-Allow-Container');
|
||||
|
||||
-- Fix PII classification autoClassificationConfig (issue #27910)
|
||||
UPDATE classification
|
||||
SET json = jsonb_set(
|
||||
json::jsonb,
|
||||
'{autoClassificationConfig}',
|
||||
'{"enabled": true, "conflictResolution": "highest_priority", "minimumConfidence": 0.6, "requireExplicitMatch": true}'::jsonb
|
||||
)::json
|
||||
WHERE json->>'name' = 'PII'
|
||||
AND json->'autoClassificationConfig'->>'enabled' IS NULL;
|
||||
|
||||
-- Fix PII tags autoClassificationEnabled (issue #27910)
|
||||
UPDATE tag
|
||||
SET json = jsonb_set(json::jsonb, '{autoClassificationEnabled}', 'true'::jsonb)::json
|
||||
WHERE json->'classification'->>'name' = 'PII'
|
||||
AND json->>'name' IN ('NonSensitive', 'Sensitive')
|
||||
AND (
|
||||
json->>'autoClassificationEnabled' IS NULL
|
||||
OR (json->>'autoClassificationEnabled')::boolean = false
|
||||
);
|
||||
|
||||
|
|
@@ -151,6 +151,121 @@ WHERE ue.name = 'mcpapplicationbot'
|
|||
AND re.name = 'ApplicationBotImpersonationRole'
|
||||
ON CONFLICT DO NOTHING;
|
||||
|
||||
-- Update Databricks and Unity Catalog connection schemes from 'databricks+connector' to 'databricks'
|
||||
-- as part of migration from sqlalchemy-databricks to databricks-sqlalchemy package
|
||||
UPDATE dbservice_entity
|
||||
SET json = jsonb_set(json, '{connection,config,scheme}', '"databricks"')
|
||||
WHERE serviceType IN ('Databricks', 'UnityCatalog')
|
||||
AND json #>> '{connection,config,scheme}' = 'databricks+connector';
|
||||
|
||||
-- Migrate profiler sampling config: move flat profileSample/profileSampleType/samplingMethodType
|
||||
-- into the new profileSampleConfig structure. Default to STATIC since DYNAMIC is new.
|
||||
|
||||
-- Profiler configs are stored in entity_extension table, not in entity json columns.
|
||||
-- Extension keys: table.tableProfilerConfig, database.databaseProfilerConfig, databaseSchema.databaseSchemaProfilerConfig
|
||||
-- The json column in entity_extension contains the config object directly (flat root-level fields).
|
||||
|
||||
-- entity_extension: build profileSampleConfig from existing flat fields (skip if already migrated)
|
||||
UPDATE entity_extension
|
||||
SET json = jsonb_set(
|
||||
json::jsonb,
|
||||
'{profileSampleConfig}',
|
||||
jsonb_build_object(
|
||||
'sampleConfigType', 'STATIC',
|
||||
'config', jsonb_build_object(
|
||||
'profileSample', json::jsonb #> '{profileSample}',
|
||||
'profileSampleType', COALESCE(
|
||||
json::jsonb #> '{profileSampleType}',
|
||||
'"PERCENTAGE"'::jsonb
|
||||
),
|
||||
'samplingMethodType', json::jsonb #> '{samplingMethodType}'
|
||||
)
|
||||
)
|
||||
)::json
|
||||
WHERE extension IN (
|
||||
'table.tableProfilerConfig',
|
||||
'database.databaseProfilerConfig',
|
||||
'databaseSchema.databaseSchemaProfilerConfig'
|
||||
)
|
||||
AND json::jsonb #>> '{profileSample}' IS NOT NULL
|
||||
AND json::jsonb #> '{profileSampleConfig}' IS NULL;
|
||||
|
||||
-- entity_extension: remove old flat fields
|
||||
UPDATE entity_extension
|
||||
SET json = (json::jsonb #- '{profileSample}'
|
||||
#- '{profileSampleType}'
|
||||
#- '{samplingMethodType}')::json
|
||||
WHERE extension IN (
|
||||
'table.tableProfilerConfig',
|
||||
'database.databaseProfilerConfig',
|
||||
'databaseSchema.databaseSchemaProfilerConfig'
|
||||
)
|
||||
AND (json::jsonb #>> '{profileSample}' IS NOT NULL
|
||||
OR json::jsonb #>> '{profileSampleType}' IS NOT NULL
|
||||
OR json::jsonb #>> '{samplingMethodType}' IS NOT NULL);
|
||||
|
||||
-- ingestion_pipeline_entity (profiler pipelines): build profileSampleConfig (skip if already migrated)
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = jsonb_set(
|
||||
json::jsonb,
|
||||
'{sourceConfig,config,profileSampleConfig}',
|
||||
jsonb_build_object(
|
||||
'sampleConfigType', 'STATIC',
|
||||
'config', jsonb_build_object(
|
||||
'profileSample', json::jsonb #> '{sourceConfig,config,profileSample}',
|
||||
'profileSampleType', COALESCE(
|
||||
json::jsonb #> '{sourceConfig,config,profileSampleType}',
|
||||
'"PERCENTAGE"'::jsonb
|
||||
),
|
||||
'samplingMethodType', json::jsonb #> '{sourceConfig,config,samplingMethodType}'
|
||||
)
|
||||
)
|
||||
)::json
|
||||
WHERE json #>> '{pipelineType}' = 'profiler'
|
||||
AND json::jsonb #>> '{sourceConfig,config,profileSample}' IS NOT NULL
|
||||
AND json::jsonb #> '{sourceConfig,config,profileSampleConfig}' IS NULL;
|
||||
|
||||
-- ingestion_pipeline_entity (profiler pipelines): remove old flat fields
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = (json::jsonb #- '{sourceConfig,config,profileSample}'
|
||||
#- '{sourceConfig,config,profileSampleType}'
|
||||
#- '{sourceConfig,config,samplingMethodType}')::json
|
||||
WHERE json #>> '{pipelineType}' = 'profiler'
|
||||
AND (json::jsonb #>> '{sourceConfig,config,profileSample}' IS NOT NULL
|
||||
OR json::jsonb #>> '{sourceConfig,config,profileSampleType}' IS NOT NULL
|
||||
OR json::jsonb #>> '{sourceConfig,config,samplingMethodType}' IS NOT NULL);
|
||||
|
||||
-- ingestion_pipeline_entity (testSuite pipelines): build profileSampleConfig (skip if already migrated)
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = jsonb_set(
|
||||
json::jsonb,
|
||||
'{sourceConfig,config,profileSampleConfig}',
|
||||
jsonb_build_object(
|
||||
'sampleConfigType', 'STATIC',
|
||||
'config', jsonb_build_object(
|
||||
'profileSample', json::jsonb #> '{sourceConfig,config,profileSample}',
|
||||
'profileSampleType', COALESCE(
|
||||
json::jsonb #> '{sourceConfig,config,profileSampleType}',
|
||||
'"PERCENTAGE"'::jsonb
|
||||
),
|
||||
'samplingMethodType', json::jsonb #> '{sourceConfig,config,samplingMethodType}'
|
||||
)
|
||||
)
|
||||
)::json
|
||||
WHERE json #>> '{pipelineType}' = 'testSuite'
|
||||
AND json::jsonb #>> '{sourceConfig,config,profileSample}' IS NOT NULL
|
||||
AND json::jsonb #> '{sourceConfig,config,profileSampleConfig}' IS NULL;
|
||||
|
||||
-- ingestion_pipeline_entity (testSuite pipelines): remove old flat fields
|
||||
UPDATE ingestion_pipeline_entity
|
||||
SET json = (json::jsonb #- '{sourceConfig,config,profileSample}'
|
||||
#- '{sourceConfig,config,profileSampleType}'
|
||||
#- '{sourceConfig,config,samplingMethodType}')::json
|
||||
WHERE json #>> '{pipelineType}' = 'testSuite'
|
||||
AND (json::jsonb #>> '{sourceConfig,config,profileSample}' IS NOT NULL
|
||||
OR json::jsonb #>> '{sourceConfig,config,profileSampleType}' IS NOT NULL
|
||||
OR json::jsonb #>> '{sourceConfig,config,samplingMethodType}' IS NOT NULL);
|
||||
|
||||
-- RDF distributed indexing state tables
|
||||
CREATE TABLE IF NOT EXISTS rdf_index_job (
|
||||
id VARCHAR(36) NOT NULL,
|
||||
|
|
@@ -232,3 +347,198 @@ CREATE TABLE IF NOT EXISTS rdf_index_server_stats (
|
|||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_rdf_index_server_stats_job_id ON rdf_index_server_stats(jobId);
|
||||
|
||||
-- Speeds up the NOT EXISTS anti-join used by ContainerDAO root-only listings
|
||||
-- (?root=true&service=...). Covers the subquery's filter and projection so the
|
||||
-- planner can answer "does this container have a parent?" with an index-only
|
||||
-- scan instead of materializing the child-edge set.
|
||||
CREATE INDEX IF NOT EXISTS idx_er_fromentity_toentity_relation_toid
|
||||
ON entity_relationship (fromEntity, toEntity, relation, toId);
|
||||
|
||||
-- Add per-stage cumulative timing columns to search_index_server_stats so the
|
||||
-- distributed aggregator can surface where reindex latency is being spent
|
||||
-- (DB read in Reader, doc-build in Process, OpenSearch bulk in Sink, embeddings
|
||||
-- in Vector). Stored as BIGINT milliseconds; UI computes avg latency and
|
||||
-- throughput client-side from totalTimeMs / successRecords.
|
||||
ALTER TABLE search_index_server_stats
|
||||
ADD COLUMN IF NOT EXISTS readerTimeMs BIGINT NOT NULL DEFAULT 0,
|
||||
ADD COLUMN IF NOT EXISTS processTimeMs BIGINT NOT NULL DEFAULT 0,
|
||||
ADD COLUMN IF NOT EXISTS sinkTimeMs BIGINT NOT NULL DEFAULT 0,
|
||||
ADD COLUMN IF NOT EXISTS vectorTimeMs BIGINT NOT NULL DEFAULT 0;
|
||||
|
||||
-- Speed up `?service=` / `?database=` / `?databaseSchema=` / `?parent=` /
|
||||
-- `?apiCollection=` / `?spreadsheet=` / `?testSuite=` listings on entity
|
||||
-- tables. ListFilter.getFqnPrefixCondition turns each of these query params
|
||||
-- into a `<table>.fqnHash LIKE :prefix%` predicate. The unique B-tree index
|
||||
-- on `fqnHash` uses the default operator class, and the column inherits the
|
||||
-- database default collation (typically `en_US.UTF-8` on managed Postgres /
|
||||
-- RDS). Neither qualifies the planner to use the index for `LIKE 'prefix%'`,
|
||||
-- so count(*) and the page query degrade to a parallel seq scan over the
|
||||
-- JSONB heap — observed at ~3s on a ~580k-row storage_container_entity table
|
||||
-- even with ANALYZE / VACUUM tuned. A pattern-ops index supports LIKE-prefix
|
||||
-- lookups regardless of column collation, dropping cold count(*) on a
|
||||
-- service-filtered listing from seconds to tens of milliseconds.
|
||||
--
|
||||
-- Why `text_pattern_ops` and not `varchar_pattern_ops`:
|
||||
-- `fqnHash` is declared `VARCHAR(768)` / `VARCHAR(256)`, so on paper
|
||||
-- `varchar_pattern_ops` is the type-matched choice. In practice the planner
|
||||
-- normalizes `varchar LIKE text` (which is what every JDBC `setString` call
|
||||
-- and any `encode(...)`-derived RHS produces) by casting the column to text:
|
||||
-- the resulting filter expression is `(fqnhash)::text ~~ ...`. The
|
||||
-- `varchar_pattern_ops` opclass does NOT match that cast expression — the
|
||||
-- index is silently unused and the table seq-scans. `text_pattern_ops`
|
||||
-- matches `(varchar_col)::text ~~ ...` and gets picked up. Confirmed via
|
||||
-- EXPLAIN ANALYZE on a 580k-row storage_container_entity: the same query
|
||||
-- drops from ~470ms cold (Parallel Seq Scan) to <1ms (Index Scan) after
|
||||
-- recreating the index with `text_pattern_ops`.
|
||||
--
|
||||
-- Built CONCURRENTLY so the migration does not take a write lock on these
|
||||
-- tables (matches the 1.11.0 `idx_tag_usage_*` pattern). Each statement runs
|
||||
-- outside an implicit transaction, which the OpenMetadata native migration
|
||||
-- runner already supports — see 1.11.0/postgres/schemaChanges.sql.
|
||||
--
|
||||
-- Recreate, not "create if missing": the original 1.13.0 ship of these indexes
|
||||
-- used `varchar_pattern_ops` (incorrect — see the "Why text_pattern_ops" block
|
||||
-- above). On already-upgraded environments the old index already exists under
|
||||
-- the same name with the wrong opclass, and a plain `CREATE INDEX CONCURRENTLY
|
||||
-- IF NOT EXISTS` would no-op against that. We DROP first so the new SQL text
|
||||
-- (which `MigrationProcessImpl` keys on by hash, so it re-runs even after the
|
||||
-- old version was applied) actually replaces the existing index. On a fresh
|
||||
-- install the DROP is a no-op via `IF EXISTS`. The CREATE keeps `IF NOT EXISTS`
|
||||
-- only as a defense against an interrupted-then-resumed migration where the
|
||||
-- DROP succeeded but the CREATE was killed before completion.
|
||||
--
|
||||
-- OPERATOR RUNBOOK — interrupted CONCURRENTLY builds.
|
||||
-- If a `CREATE INDEX CONCURRENTLY` is interrupted (deploy timeout, lock
|
||||
-- contention, OOM, connection drop), Postgres leaves an INVALID index
|
||||
-- behind. The `MigrationProcessImpl` runner caches statements by SQL text
|
||||
-- hash, so an embedded cleanup step cannot be made to re-run on retry — this
|
||||
-- is a known pattern-level gap (also present in 1.11.0).
|
||||
--
|
||||
-- Detection (run on the affected tenant):
|
||||
-- SELECT c.relname FROM pg_class c
|
||||
-- JOIN pg_index i ON i.indexrelid = c.oid
|
||||
-- WHERE NOT i.indisvalid
|
||||
-- AND c.relname LIKE 'idx\_%\_fqnhash\_pattern' ESCAPE '\';
|
||||
-- Remediation: `DROP INDEX CONCURRENTLY <relname>;` for each row, then
|
||||
-- delete the corresponding row from server_migration_sql_logs so the
|
||||
-- runner re-attempts the CREATE on the next deploy:
|
||||
-- DELETE FROM server_migration_sql_logs
|
||||
-- WHERE version = '1.13.0'
|
||||
-- AND sqlstatement LIKE '%idx\_<table>\_fqnhash\_pattern%' ESCAPE '\';
|
||||
--
|
||||
-- `pipeline_entity` is intentionally excluded: ListFilter.getServiceCondition
|
||||
-- special-cases `pipeline_entity` to an EXISTS join on
|
||||
-- `pipeline_service_entity` by service name (not `fqnHash LIKE`), and
|
||||
-- PipelineResource.list exposes no other prefix-LIKE filter, so a pattern
|
||||
-- index on `pipeline_entity.fqnHash` would be unused write overhead.
|
||||
--
|
||||
-- MySQL is unaffected: every entity-table `fqnHash` column ships with
|
||||
-- `CHARACTER SET ascii COLLATE ascii_bin`, a binary collation that already
|
||||
-- permits prefix scans on the unique index. This pass is Postgres-only.
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_chart_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_chart_entity_fqnhash_pattern
|
||||
ON chart_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_dashboard_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_dashboard_entity_fqnhash_pattern
|
||||
ON dashboard_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_dashboard_data_model_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_dashboard_data_model_entity_fqnhash_pattern
|
||||
ON dashboard_data_model_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_database_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_database_entity_fqnhash_pattern
|
||||
ON database_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_database_schema_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_database_schema_entity_fqnhash_pattern
|
||||
ON database_schema_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_glossary_term_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_glossary_term_entity_fqnhash_pattern
|
||||
ON glossary_term_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_ingestion_pipeline_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ingestion_pipeline_entity_fqnhash_pattern
|
||||
ON ingestion_pipeline_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_metric_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_metric_entity_fqnhash_pattern
|
||||
ON metric_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_ml_model_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ml_model_entity_fqnhash_pattern
|
||||
ON ml_model_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_policy_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_policy_entity_fqnhash_pattern
|
||||
ON policy_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_query_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_query_entity_fqnhash_pattern
|
||||
ON query_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_report_entity_fqnhash_pattern;
|
||||
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_report_entity_fqnhash_pattern
|
||||
ON report_entity (fqnHash text_pattern_ops);
|
||||
DROP INDEX CONCURRENTLY IF EXISTS idx_search_index_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_search_index_entity_fqnhash_pattern
    ON search_index_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_storage_container_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_storage_container_entity_fqnhash_pattern
    ON storage_container_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_table_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_table_entity_fqnhash_pattern
    ON table_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_test_case_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_test_case_fqnhash_pattern
    ON test_case (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_topic_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_topic_entity_fqnhash_pattern
    ON topic_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_api_collection_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_api_collection_entity_fqnhash_pattern
    ON api_collection_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_api_endpoint_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_api_endpoint_entity_fqnhash_pattern
    ON api_endpoint_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_directory_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_directory_entity_fqnhash_pattern
    ON directory_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_file_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_file_entity_fqnhash_pattern
    ON file_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_spreadsheet_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_spreadsheet_entity_fqnhash_pattern
    ON spreadsheet_entity (fqnHash text_pattern_ops);
DROP INDEX CONCURRENTLY IF EXISTS idx_worksheet_entity_fqnhash_pattern;
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_worksheet_entity_fqnhash_pattern
    ON worksheet_entity (fqnHash text_pattern_ops);

-- MCP OAuth: state parameter is opaque per RFC 6749 §4.1.1 and some clients (notably the
-- Databricks MCP Proxy) send tokens longer than 255 characters. Widen mcp_state to TEXT to
-- avoid INSERT failures on /mcp/authorize redirects.
ALTER TABLE mcp_pending_auth_requests
    ALTER COLUMN mcp_state TYPE TEXT;

-- Allow multiple typed relations between the same pair of glossary terms.
-- The previous PRIMARY KEY (fromId, toId, relation) caused INSERT ... ON CONFLICT
-- DO UPDATE to overwrite the json discriminator when a second relationType
-- ("synonym" + "seeAlso", etc.) was added between the same two terms, silently
-- dropping the first relationship. Adding relationType to the PK lets the same
-- (fromId, toId, RELATED_TO) pair carry one row per relation type.
ALTER TABLE entity_relationship
    ADD COLUMN IF NOT EXISTS relationType character varying(64) DEFAULT ''::character varying NOT NULL;

-- Backfill relationType for every glossary-term ↔ glossary-term RELATED_TO row.
-- Pre-1.13 data has json = NULL (no discriminator existed yet); those rows MUST
-- collapse onto 'relatedTo' so that a subsequent insert of the same logical
-- relation matches the existing row instead of creating a duplicate under a
-- different PK. relation=15 is the ordinal of Relationship.RELATED_TO (see
-- openmetadata-spec entityRelationship.json). 'relatedTo' is the default
-- relation type that the application code uses when none is specified.
UPDATE entity_relationship
SET relationType = COALESCE(NULLIF(json->>'relationType', ''), 'relatedTo')
WHERE fromEntity = 'glossaryTerm'
  AND toEntity = 'glossaryTerm'
  AND relation = 15;

-- Swap the PK to include relationType. The native migration framework tracks
-- completion in SERVER_CHANGE_LOG so this runs once per upgrade; we intentionally
-- avoid information_schema gating because least-privilege migration users may
-- not have SELECT on it. DROP CONSTRAINT IF EXISTS keeps the statement safe to
-- replay against a table that's already been migrated.
ALTER TABLE entity_relationship DROP CONSTRAINT IF EXISTS entity_relationship_pkey;
ALTER TABLE entity_relationship
    ADD CONSTRAINT entity_relationship_pkey PRIMARY KEY (fromId, toId, relation, relationType);

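-- Illustrative check, NOT part of this migration: a minimal sketch, assuming it is
-- run by hand against PostgreSQL right after the relationType backfill and the
-- primary-key swap. It uses only columns named in those statements.

-- 1) Glossary-term RELATED_TO rows that still carry the empty column default.
--    The backfill maps both NULL and '' to 'relatedTo', so this should be zero
--    immediately after the migration.
SELECT COUNT(*) AS unbackfilled_rows
FROM entity_relationship
WHERE fromEntity = 'glossaryTerm'
  AND toEntity = 'glossaryTerm'
  AND relation = 15
  AND relationType = '';

-- 2) Term pairs that hold more than one typed relation, which the previous
--    three-column primary key could not represent.
SELECT fromId, toId, COUNT(*) AS relation_types
FROM entity_relationship
WHERE fromEntity = 'glossaryTerm'
  AND toEntity = 'glossaryTerm'
  AND relation = 15
GROUP BY fromId, toId
HAVING COUNT(*) > 1;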

@@ -0,0 +1,116 @@
-- Post data migration script for Task System Redesign - OpenMetadata 2.0.0
-- This script runs after the data migration completes

-- =====================================================
-- NOTE: Suggestion migration (suggestions → task_entity),
-- thread-based task migration (thread_entity → task_entity),
-- and legacy system activity migration
-- (thread_entity generated feed rows → activity_stream)
-- are handled in Java MigrationUtil because they require
-- entity-link aware transformation logic.
-- =====================================================

-- =====================================================
-- PHASE 2D: Migrate announcements from thread_entity → announcement_entity
-- =====================================================
INSERT INTO announcement_entity (id, json, fqnHash)
SELECT
    a_id AS id,
    a_json AS json,
    a_fqnHash AS fqnHash
FROM (
    SELECT
        JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.id')) AS a_id,
        JSON_OBJECT(
            'id', JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.id')),
            'name', CONCAT('announcement-', JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.id'))),
            'fullyQualifiedName', CONCAT('announcement-', JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.id'))),
            'displayName', NULLIF(JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.message')), ''),
            'description', COALESCE(
                JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.announcement.description')),
                JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.message')),
                ''
            ),
            'entityLink', JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.about')),
            'startTime', CAST(JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.announcement.startTime')) AS UNSIGNED),
            'endTime', CAST(JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.announcement.endTime')) AS UNSIGNED),
            'status', CASE
                WHEN CAST(JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.announcement.endTime')) AS UNSIGNED) < UNIX_TIMESTAMP() * 1000
                    THEN 'Expired'
                WHEN CAST(JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.announcement.startTime')) AS UNSIGNED) > UNIX_TIMESTAMP() * 1000
                    THEN 'Scheduled'
                ELSE 'Active'
            END,
            'createdBy', JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.createdBy')),
            'updatedBy', COALESCE(JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.updatedBy')), JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.createdBy'))),
            'createdAt', CAST(JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.threadTs')) AS UNSIGNED),
            'updatedAt', CAST(
                COALESCE(
                    JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.updatedAt')),
                    JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.threadTs'))
                ) AS UNSIGNED
            ),
            'deleted', false,
            'version', 0.1,
            'reactions', COALESCE(JSON_EXTRACT(t.json, '$.reactions'), JSON_ARRAY())
        ) AS a_json,
        MD5(CONCAT('announcement-', JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.id')))) AS a_fqnHash
    FROM thread_entity t
    WHERE JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.type')) = 'Announcement'
      AND NOT EXISTS (
          SELECT 1 FROM announcement_entity a WHERE a.id = JSON_UNQUOTE(JSON_EXTRACT(t.json, '$.id'))
      )
) migrated;

-- =====================================================
-- PHASE 2E: Rename legacy thread storage to fail stale references
-- =====================================================
SET @thread_entity_exists = (
    SELECT COUNT(*)
    FROM information_schema.tables
    WHERE table_schema = DATABASE()
      AND table_name = 'thread_entity'
);

SET @thread_entity_legacy_exists = (
    SELECT COUNT(*)
    FROM information_schema.tables
    WHERE table_schema = DATABASE()
      AND table_name = 'thread_entity_legacy'
);

SET @rename_thread_entity_sql = IF(
    @thread_entity_exists = 1 AND @thread_entity_legacy_exists = 0,
    'RENAME TABLE thread_entity TO thread_entity_legacy',
    'SELECT 1'
);

PREPARE rename_thread_entity_stmt FROM @rename_thread_entity_sql;
EXECUTE rename_thread_entity_stmt;
DEALLOCATE PREPARE rename_thread_entity_stmt;

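-- Illustrative check, NOT part of this migration: a minimal sketch, assuming it is
-- run by hand against MySQL after PHASE 2D and PHASE 2E, once thread_entity has
-- been renamed to thread_entity_legacy. It compares the number of legacy
-- announcement threads with the rows now in announcement_entity, using only the
-- tables and JSON paths that appear in those two phases. The counts are expected
-- to match only if no announcements were created or removed in between.
SELECT
    (SELECT COUNT(*)
     FROM thread_entity_legacy
     WHERE JSON_UNQUOTE(JSON_EXTRACT(json, '$.type')) = 'Announcement') AS legacy_announcements,
    (SELECT COUNT(*) FROM announcement_entity) AS migrated_announcements;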
-- =====================================================
-- PHASE 2F: Lower workflow trigger polling intervals
-- =====================================================
-- Reduce WorkflowEventConsumer poll interval from 10s to 1s.
-- The legacy 10s default added up to a 10s wait between an entity change and the
-- workflow-triggered approval task being created. On CI under resource pressure this
-- often drifted to >2 minutes when combined with Flowable's 60s async job poll. The
-- new value keeps the trigger pipeline near-real-time.
UPDATE event_subscription_entity
SET json = JSON_SET(json, '$.pollInterval', 1)
WHERE name = 'WorkflowEventConsumer'
  AND CAST(JSON_EXTRACT(json, '$.pollInterval') AS UNSIGNED) > 1;

-- Lower Flowable async/timer job acquisition intervals to keep workflow-driven
-- task creation responsive. The previous 60s default was a Flowable production setting
-- carried over verbatim; for OpenMetadata's interactive task UX we want sub-second pickup.
UPDATE openmetadata_settings
SET json = JSON_SET(
        JSON_SET(json, '$.executorConfiguration.asyncJobAcquisitionInterval', 1000),
        '$.executorConfiguration.timerJobAcquisitionInterval', 5000)
WHERE configType = 'workflowSettings'
  AND JSON_EXTRACT(json, '$.executorConfiguration') IS NOT NULL
  AND (CAST(JSON_EXTRACT(json, '$.executorConfiguration.asyncJobAcquisitionInterval') AS UNSIGNED) > 1000
       OR CAST(JSON_EXTRACT(json, '$.executorConfiguration.timerJobAcquisitionInterval') AS UNSIGNED) > 5000);

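-- Illustrative check, NOT part of this migration: a minimal sketch, assuming it is
-- run by hand against MySQL after PHASE 2F. It reads back the values the two
-- UPDATE statements are meant to cap, using the same tables, filters and JSON paths.
SELECT name,
       JSON_EXTRACT(json, '$.pollInterval') AS poll_interval
FROM event_subscription_entity
WHERE name = 'WorkflowEventConsumer';

SELECT JSON_EXTRACT(json, '$.executorConfiguration.asyncJobAcquisitionInterval') AS async_job_interval,
       JSON_EXTRACT(json, '$.executorConfiguration.timerJobAcquisitionInterval') AS timer_job_interval
FROM openmetadata_settings
WHERE configType = 'workflowSettings';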

@@ -1 +1,357 @@
-- MCP tables are created in 1.13.0 migration. This file is intentionally empty.
-- Task System Redesign - OpenMetadata 2.0.0
-- This migration creates the new Task entity tables and related infrastructure

CREATE TABLE IF NOT EXISTS task_entity (
    id varchar(36) NOT NULL,
    json json NOT NULL,
    fqnHash varchar(768) NOT NULL,
    taskId varchar(20) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.taskId'))) STORED NOT NULL,
    name varchar(256) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.name'))) STORED NOT NULL,
    category varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.category'))) STORED NOT NULL,
    type varchar(64) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.type'))) STORED NOT NULL,
    status varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.status'))) STORED NOT NULL,
    priority varchar(16) GENERATED ALWAYS AS (COALESCE(json_unquote(json_extract(`json`,_utf8mb4'$.priority')), 'Medium')) STORED,
    createdAt bigint GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.createdAt'))) STORED NOT NULL,
    updatedAt bigint GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.updatedAt'))) STORED NOT NULL,
    deleted tinyint(1) GENERATED ALWAYS AS (json_extract(`json`,_utf8mb4'$.deleted')) STORED,
    aboutFqnHash varchar(256) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.aboutFqnHash'))) STORED,
    createdById varchar(36) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.createdById'))) STORED,
    approvedById varchar(36) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.approvedById'))) STORED,
    PRIMARY KEY (id),
    UNIQUE KEY uk_fqn_hash (fqnHash),
    KEY idx_task_id (taskId),
    KEY idx_status (status),
    KEY idx_category (category),
    KEY idx_type (type),
    KEY idx_priority (priority),
    KEY idx_created_at (createdAt),
    KEY idx_updated_at (updatedAt),
    KEY idx_deleted (deleted),
    KEY idx_status_category (status, category),
    KEY idx_about_fqn_hash (aboutFqnHash),
    KEY idx_status_about (status, aboutFqnHash),
    KEY idx_created_by_id (createdById),
    KEY idx_created_by_category (createdById, category),
    KEY idx_approved_by_id (approvedById)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- For 2.0.0 environments that ran the CREATE TABLE above before the
-- approvedById generated column was added inline, attach it now. CREATE TABLE
-- IF NOT EXISTS is a no-op on those environments so the column would never
-- appear otherwise. MySQL 8.0 supports neither `ADD COLUMN IF NOT EXISTS`
-- nor `ADD KEY IF NOT EXISTS`, so guard both via information_schema.
SET @ddl = (
    SELECT IF(
        EXISTS (
            SELECT 1
            FROM information_schema.columns
            WHERE table_schema = DATABASE()
              AND table_name = 'task_entity'
              AND column_name = 'approvedById'
        ),
        'SELECT 1',
        'ALTER TABLE task_entity ADD COLUMN approvedById varchar(36) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4''$.approvedById''))) STORED'
    )
);
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

SET @ddl = (
    SELECT IF(
        EXISTS (
            SELECT 1
            FROM information_schema.statistics
            WHERE table_schema = DATABASE()
              AND table_name = 'task_entity'
              AND index_name = 'idx_approved_by_id'
        ),
        'SELECT 1',
        'ALTER TABLE task_entity ADD KEY idx_approved_by_id (approvedById)'
    )
);
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

CREATE TABLE IF NOT EXISTS new_task_sequence (
    id bigint NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO new_task_sequence (id) SELECT 0 WHERE NOT EXISTS (SELECT 1 FROM new_task_sequence);

-- =====================================================
-- ACTIVITY STREAM TABLE (Partitioned by time)
-- Lightweight, ephemeral activity notifications
-- NOT for audit/compliance - use entity version history
-- Partitions are managed dynamically by ActivityStreamPartitionManager
-- =====================================================
CREATE TABLE IF NOT EXISTS activity_stream (
    id varchar(36) NOT NULL,
    eventType varchar(64) NOT NULL,
    entityType varchar(64) NOT NULL,
    entityId varchar(36) NOT NULL,
    entityFqnHash varchar(768) CHARACTER SET ascii COLLATE ascii_bin,
    about varchar(2048),
    aboutFqnHash varchar(768) CHARACTER SET ascii COLLATE ascii_bin,
    actorId varchar(36) NOT NULL,
    actorName varchar(256),
    timestamp bigint NOT NULL,
    summary varchar(500),
    fieldName varchar(256),
    oldValue text,
    newValue text,
    domains json,
    json json NOT NULL,
    PRIMARY KEY (id, timestamp),
    KEY idx_activity_timestamp (timestamp),
    KEY idx_activity_entity (entityType, entityId, timestamp),
    KEY idx_activity_actor (actorId, timestamp),
    KEY idx_activity_event_type (eventType, timestamp),
    KEY idx_activity_entity_fqn (entityFqnHash, timestamp),
    KEY idx_activity_about (aboutFqnHash, timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
PARTITION BY RANGE (timestamp) (
    -- Catch-all partition - ActivityStreamPartitionManager will reorganize this
    -- by splitting it into monthly partitions as needed
    PARTITION p_max VALUES LESS THAN MAXVALUE
);

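-- Illustrative sketch, NOT part of this migration: one way the catch-all p_max
-- partition of activity_stream could be split into a monthly partition. The real
-- statements are issued by ActivityStreamPartitionManager and are not shown in
-- this diff. The partition name p_2026_06 and the June 2026 window are
-- hypothetical, and the sketch assumes the timestamp column holds epoch
-- milliseconds: 1782864000000 is 2026-07-01 00:00:00 UTC, the exclusive upper bound.
ALTER TABLE activity_stream
    REORGANIZE PARTITION p_max INTO (
        PARTITION p_2026_06 VALUES LESS THAN (1782864000000),
        PARTITION p_max VALUES LESS THAN MAXVALUE
    );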
-- Activity stream configuration per domain
CREATE TABLE IF NOT EXISTS activity_stream_config (
    id varchar(36) NOT NULL,
    json json NOT NULL,
    scope varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.scope'))) STORED NOT NULL,
    domainId varchar(36) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.scopeReference.id'))) STORED,
    enabled tinyint(1) GENERATED ALWAYS AS (json_extract(`json`,_utf8mb4'$.enabled')) STORED,
    retentionDays int GENERATED ALWAYS AS (json_extract(`json`,_utf8mb4'$.retentionDays')) STORED,
    PRIMARY KEY (id),
    UNIQUE KEY uk_domain_config (domainId),
    KEY idx_scope (scope),
    KEY idx_enabled (enabled)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- =====================================================
-- ANNOUNCEMENT ENTITY TABLE
-- Standalone entity for asset announcements (migrated from thread_entity)
-- =====================================================
CREATE TABLE IF NOT EXISTS announcement_entity (
    id varchar(36) NOT NULL,
    json json NOT NULL,
    fqnHash varchar(768) NOT NULL,
    name varchar(256) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.name'))) STORED NOT NULL,
    entityLink varchar(512) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.entityLink'))) STORED,
    status varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.status'))) STORED,
    startTime bigint GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.startTime'))) STORED,
    endTime bigint GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.endTime'))) STORED,
    createdBy varchar(256) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.createdBy'))) STORED,
    createdAt bigint GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.createdAt'))) STORED,
    updatedAt bigint GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.updatedAt'))) STORED,
    deleted tinyint(1) GENERATED ALWAYS AS (json_extract(`json`,_utf8mb4'$.deleted')) STORED,
    PRIMARY KEY (id),
    UNIQUE KEY uk_announcement_fqn_hash (fqnHash),
    KEY idx_announcement_status (status),
    KEY idx_announcement_entity_link (entityLink),
    KEY idx_announcement_start_time (startTime),
    KEY idx_announcement_end_time (endTime),
    KEY idx_announcement_deleted (deleted)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- =====================================================
-- TASK FORM SCHEMA ENTITY TABLE
-- Stores form schemas for different task types
-- =====================================================
CREATE TABLE IF NOT EXISTS task_form_schema_entity (
    id varchar(36) NOT NULL,
    json json NOT NULL,
    fqnHash varchar(768) NOT NULL,
    name varchar(256) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.name'))) STORED NOT NULL,
    taskType varchar(64) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.taskType'))) STORED,
    taskCategory varchar(32) GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.taskCategory'))) STORED,
    updatedAt bigint GENERATED ALWAYS AS (json_unquote(json_extract(`json`,_utf8mb4'$.updatedAt'))) STORED,
    deleted tinyint(1) GENERATED ALWAYS AS (json_extract(`json`,_utf8mb4'$.deleted')) STORED,
    PRIMARY KEY (id),
    UNIQUE KEY uk_task_form_schema_fqn_hash (fqnHash),
    KEY idx_task_form_schema_name (name),
    KEY idx_task_form_schema_task_type (taskType),
    KEY idx_task_form_schema_deleted (deleted)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- =====================================================
-- KNOWLEDGE CENTER + CONTEXT CENTER DRIVE (Collate → OM port)
-- Appended below the Task Redesign tables to preserve main's
-- migration order when merging.
-- =====================================================

-- MCP tables are created in 1.13.0 migration.

-- Knowledge Center: page entity table (Article, QuickLink).
-- Existing Collate customers already have this table from 1.2.0-collate with
-- subsequent shape changes through 1.6.0-collate (nameHash -> fqnHash VARCHAR(756),
-- pageType generated column, composite deleted index). CREATE TABLE IF NOT EXISTS
-- is a no-op for them and creates the final shape for fresh OpenMetadata installs.
CREATE TABLE IF NOT EXISTS knowledge_center (
    id VARCHAR(36) GENERATED ALWAYS AS (json ->> '$.id') STORED NOT NULL,
    fqnHash VARCHAR(756) NOT NULL COLLATE ascii_bin,
    name VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.name') STORED NOT NULL,
    json JSON NOT NULL,
    updatedAt BIGINT UNSIGNED GENERATED ALWAYS AS (json ->> '$.updatedAt') STORED NOT NULL,
    updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.updatedBy') STORED NOT NULL,
    deleted BOOLEAN GENERATED ALWAYS AS (json -> '$.deleted') STORED,
    pageType VARCHAR(16) GENERATED ALWAYS AS (json ->> '$.pageType') STORED NOT NULL,
    PRIMARY KEY (id),
    UNIQUE (fqnHash),
    INDEX knowledge_center_name_index (name),
    INDEX index_knowledge_center_deleted (fqnHash, deleted)
);

-- Context Center Drive: Folder entity table.
CREATE TABLE IF NOT EXISTS drive_folder (
    id VARCHAR(36) GENERATED ALWAYS AS (json ->> '$.id') STORED NOT NULL,
    name VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.name') STORED NOT NULL,
    nameHash VARCHAR(256) NOT NULL COLLATE ascii_bin,
    json JSON NOT NULL,
    updatedAt BIGINT UNSIGNED GENERATED ALWAYS AS (json ->> '$.updatedAt') STORED NOT NULL,
    updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.updatedBy') STORED NOT NULL,
    deleted BOOLEAN GENERATED ALWAYS AS (json -> '$.deleted') STORED,
    PRIMARY KEY (id),
    UNIQUE KEY unique_drive_folder_name (nameHash),
    INDEX idx_drive_folder_updated_at (updatedAt)
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- Context Center Drive: File entity table (uploaded PDF/image/spreadsheet/office docs).
CREATE TABLE IF NOT EXISTS context_file (
    id VARCHAR(36) GENERATED ALWAYS AS (json ->> '$.id') STORED NOT NULL,
    name VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.name') STORED NOT NULL,
    nameHash VARCHAR(256) NOT NULL COLLATE ascii_bin,
    json JSON NOT NULL,
    updatedAt BIGINT UNSIGNED GENERATED ALWAYS AS (json ->> '$.updatedAt') STORED NOT NULL,
    updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.updatedBy') STORED NOT NULL,
    deleted BOOLEAN GENERATED ALWAYS AS (json -> '$.deleted') STORED,
    PRIMARY KEY (id),
    UNIQUE KEY unique_context_file_name (nameHash),
    INDEX idx_context_file_updated_at (updatedAt)
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- Attachments: Asset entity table for uploaded file blobs referenced by ContextFiles, Pages, etc.
-- Existing Collate customers have this from 1.7.0-collate. CREATE TABLE IF NOT EXISTS is a no-op for them.
CREATE TABLE IF NOT EXISTS asset_entity (
    id VARCHAR(36) GENERATED ALWAYS AS (json ->> '$.id') STORED NOT NULL,
    name VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.fileName') STORED NOT NULL,
    url VARCHAR(1024) GENERATED ALWAYS AS (json ->> '$.url') STORED NOT NULL,
    fullyQualifiedName VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.fullyQualifiedName') STORED NOT NULL,
    assetType VARCHAR(100) GENERATED ALWAYS AS (json ->> '$.assetType') STORED NOT NULL,
    json JSON NOT NULL,
    updatedAt BIGINT UNSIGNED GENERATED ALWAYS AS (json ->> '$.updatedAt') STORED NOT NULL,
    updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.updatedBy') STORED NOT NULL,
    fqnHash VARCHAR(768) CHARACTER SET ascii COLLATE ascii_bin DEFAULT NULL,
    deleted BOOLEAN GENERATED ALWAYS AS (json -> '$.deleted') STORED,
    PRIMARY KEY (id),
    INDEX fqnhash_index (fqnHash),
    INDEX asset_type_index (assetType),
    INDEX idx_asset_deleted (deleted)
);

-- Context Center Drive: File content snapshot table (revisions, extracted text).
CREATE TABLE IF NOT EXISTS context_file_content (
    id VARCHAR(36) GENERATED ALWAYS AS (json ->> '$.id') STORED NOT NULL,
    name VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.name') STORED NOT NULL,
    nameHash VARCHAR(256) NOT NULL COLLATE ascii_bin,
    json JSON NOT NULL,
    updatedAt BIGINT UNSIGNED GENERATED ALWAYS AS (json ->> '$.updatedAt') STORED NOT NULL,
    updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.updatedBy') STORED NOT NULL,
    deleted BOOLEAN GENERATED ALWAYS AS (json -> '$.deleted') STORED,
    PRIMARY KEY (id),
    UNIQUE KEY unique_context_file_content_name (nameHash),
    INDEX idx_context_file_content_updated_at (updatedAt)
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- Add tag_usage.metadata column if missing (newer tag usage payloads carry metadata).
SET @ddl = (
    SELECT IF(
        EXISTS (
            SELECT 1
            FROM information_schema.columns
            WHERE table_schema = DATABASE()
              AND table_name = 'tag_usage'
              AND column_name = 'metadata'
        ),
        'SELECT 1',
        'ALTER TABLE tag_usage ADD COLUMN metadata JSON NULL'
    )
);
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

-- Add audit_log_event.search_text column if missing (searchable audit log text).
SET @ddl = (
    SELECT IF(
        EXISTS (
            SELECT 1
            FROM information_schema.columns
            WHERE table_schema = DATABASE()
              AND table_name = 'audit_log_event'
              AND column_name = 'search_text'
        ),
        'SELECT 1',
        'ALTER TABLE audit_log_event ADD COLUMN search_text LONGTEXT NULL'
    )
);
PREPARE stmt FROM @ddl;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

-- Distributed reindex job tracking.
CREATE TABLE IF NOT EXISTS search_index_job (
    id VARCHAR(64) NOT NULL,
    status VARCHAR(64) NOT NULL,
    jobConfiguration JSON NOT NULL,
    targetIndexPrefix VARCHAR(256) NOT NULL,
    stagedIndexMapping JSON DEFAULT NULL,
    totalRecords BIGINT NOT NULL DEFAULT 0,
    processedRecords BIGINT NOT NULL DEFAULT 0,
    successRecords BIGINT NOT NULL DEFAULT 0,
    failedRecords BIGINT NOT NULL DEFAULT 0,
    stats JSON NOT NULL,
    createdBy VARCHAR(256) NOT NULL,
    createdAt BIGINT NOT NULL,
    startedAt BIGINT DEFAULT NULL,
    completedAt BIGINT DEFAULT NULL,
    updatedAt BIGINT NOT NULL,
    errorMessage LONGTEXT DEFAULT NULL,
    registrationDeadline BIGINT DEFAULT NULL,
    registeredServerCount INT DEFAULT NULL,
    PRIMARY KEY (id),
    KEY idx_search_index_job_status_created_at (status, createdAt DESC)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- Retry queue for failed search-index writes.
CREATE TABLE IF NOT EXISTS search_index_retry_queue (
    entityId VARCHAR(64) NOT NULL,
    entityFqn VARCHAR(700) NOT NULL,
    failureReason LONGTEXT DEFAULT NULL,
    status VARCHAR(64) NOT NULL,
    entityType VARCHAR(128) NOT NULL,
    retryCount INT NOT NULL DEFAULT 0,
    claimedAt TIMESTAMP NULL DEFAULT NULL,
    PRIMARY KEY (entityId, entityFqn),
    KEY idx_search_index_retry_queue_status (status),
    KEY idx_search_index_retry_queue_claimed_at (claimedAt)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- ContextMemory entity - reusable Context Center memory.
CREATE TABLE IF NOT EXISTS context_memory (
    id VARCHAR(36) GENERATED ALWAYS AS (json ->> '$.id') STORED NOT NULL,
    name VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.name') STORED NOT NULL,
    nameHash VARCHAR(256) NOT NULL COLLATE ascii_bin,
    json JSON NOT NULL,
    updatedAt BIGINT UNSIGNED GENERATED ALWAYS AS (json ->> '$.updatedAt') STORED NOT NULL,
    updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> '$.updatedBy') STORED NOT NULL,
    deleted BOOLEAN GENERATED ALWAYS AS (json -> '$.deleted') STORED,

    PRIMARY KEY (id),
    UNIQUE KEY unique_context_memory_name (nameHash),
    INDEX idx_context_memory_updated_at (updatedAt)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

@@ -0,0 +1,84 @@
-- Post data migration script for Task System Redesign - OpenMetadata 2.0.0
-- This script runs after the data migration completes

-- =====================================================
-- NOTE: Suggestion migration (suggestions → task_entity),
-- thread-based task migration (thread_entity → task_entity),
-- and legacy system activity migration
-- (thread_entity generated feed rows → activity_stream)
-- are handled in Java MigrationUtil because they require
-- entity-link aware transformation logic.
-- =====================================================

-- =====================================================
-- PHASE 2D: Migrate announcements from thread_entity → announcement_entity
-- =====================================================

INSERT INTO announcement_entity (id, json, fqnhash)
SELECT
    json->>'id' AS id,
    jsonb_build_object(
        'id', json->>'id',
        'name', 'announcement-' || (json->>'id'),
        'fullyQualifiedName', 'announcement-' || (json->>'id'),
        'displayName', NULLIF(json->>'message', ''),
        'description', COALESCE(
            json->'announcement'->>'description',
            json->>'message',
            ''
        ),
        'entityLink', json->>'about',
        'startTime', (json->'announcement'->>'startTime')::bigint,
        'endTime', (json->'announcement'->>'endTime')::bigint,
        'status', CASE
            WHEN (json->'announcement'->>'endTime')::bigint < (extract(epoch from now()) * 1000)::bigint
                THEN 'Expired'
            WHEN (json->'announcement'->>'startTime')::bigint > (extract(epoch from now()) * 1000)::bigint
                THEN 'Scheduled'
            ELSE 'Active'
        END,
        'createdBy', json->>'createdBy',
        'updatedBy', COALESCE(json->>'updatedBy', json->>'createdBy'),
        'createdAt', (json->>'threadTs')::bigint,
        'updatedAt', COALESCE((json->>'updatedAt')::bigint, (json->>'threadTs')::bigint),
        'deleted', false,
        'version', 0.1,
        'reactions', COALESCE(json->'reactions', '[]'::jsonb)
    ) AS json,
    md5('announcement-' || (json->>'id')) AS fqnhash
FROM thread_entity t
WHERE json->>'type' = 'Announcement'
  AND NOT EXISTS (
      SELECT 1 FROM announcement_entity a WHERE a.id = t.json->>'id'
  )
ON CONFLICT (id) DO NOTHING;

-- =====================================================
-- PHASE 2E: Rename legacy thread storage to fail stale references
-- =====================================================
ALTER TABLE IF EXISTS thread_entity RENAME TO thread_entity_legacy;

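-- Illustrative check, NOT part of this migration: a minimal sketch, assuming it is
-- run by hand against PostgreSQL after PHASE 2E, in the schema on the current
-- search_path. to_regclass returns NULL when no relation with that name exists,
-- so both columns are expected to be true once the rename has happened.
SELECT
    to_regclass('thread_entity') IS NULL AS old_name_gone,
    to_regclass('thread_entity_legacy') IS NOT NULL AS legacy_table_present;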
-- =====================================================
-- PHASE 2F: Lower workflow trigger polling intervals
-- =====================================================
-- Reduce WorkflowEventConsumer poll interval from 10s to 1s.
-- The legacy 10s default added up to a 10s wait between an entity change and the
-- workflow-triggered approval task being created. On CI under resource pressure this
-- often drifted to >2 minutes when combined with Flowable's 60s async job poll. The
-- new value keeps the trigger pipeline near-real-time.
UPDATE event_subscription_entity
SET json = jsonb_set(json, '{pollInterval}', '1'::jsonb)
WHERE name = 'WorkflowEventConsumer'
  AND (json->>'pollInterval')::int > 1;

-- Lower Flowable async/timer job acquisition intervals to keep workflow-driven
-- task creation responsive. The previous 60s default was a Flowable production setting
-- carried over verbatim; for OpenMetadata's interactive task UX we want sub-second pickup.
UPDATE openmetadata_settings
SET json = jsonb_set(
        jsonb_set(json, '{executorConfiguration,asyncJobAcquisitionInterval}', '1000'::jsonb),
        '{executorConfiguration,timerJobAcquisitionInterval}', '5000'::jsonb)
WHERE configtype = 'workflowSettings'
  AND json->'executorConfiguration' IS NOT NULL
  AND ((json->'executorConfiguration'->>'asyncJobAcquisitionInterval')::int > 1000
       OR (json->'executorConfiguration'->>'timerJobAcquisitionInterval')::int > 5000);

@@ -1 +1,308 @@
-- MCP tables are created in 1.13.0 migration. This file is intentionally empty.
-- Task System Redesign - OpenMetadata 2.0.0
-- This migration creates the new Task entity tables and related infrastructure

CREATE TABLE IF NOT EXISTS task_entity (
    id character varying(36) NOT NULL,
    json jsonb NOT NULL,
    fqnhash character varying(768) NOT NULL,
    taskid character varying(20) GENERATED ALWAYS AS ((json ->> 'taskId'::text)) STORED NOT NULL,
    name character varying(256) GENERATED ALWAYS AS ((json ->> 'name'::text)) STORED NOT NULL,
    category character varying(32) GENERATED ALWAYS AS ((json ->> 'category'::text)) STORED NOT NULL,
    type character varying(64) GENERATED ALWAYS AS ((json ->> 'type'::text)) STORED NOT NULL,
    status character varying(32) GENERATED ALWAYS AS ((json ->> 'status'::text)) STORED NOT NULL,
    priority character varying(16) GENERATED ALWAYS AS (COALESCE((json ->> 'priority'::text), 'Medium'::text)) STORED,
    createdat bigint GENERATED ALWAYS AS (((json ->> 'createdAt'::text))::bigint) STORED NOT NULL,
    updatedat bigint GENERATED ALWAYS AS (((json ->> 'updatedAt'::text))::bigint) STORED NOT NULL,
    deleted boolean GENERATED ALWAYS AS (((json ->> 'deleted'::text))::boolean) STORED,
    aboutfqnhash character varying(256) GENERATED ALWAYS AS ((json ->> 'aboutFqnHash'::text)) STORED,
    createdbyid character varying(36) GENERATED ALWAYS AS ((json ->> 'createdById'::text)) STORED,
    approvedbyid character varying(36) GENERATED ALWAYS AS ((json ->> 'approvedById'::text)) STORED,
    PRIMARY KEY (id),
    CONSTRAINT uk_task_fqn_hash UNIQUE (fqnhash)
);

CREATE INDEX IF NOT EXISTS idx_task_taskid ON task_entity (taskid);
CREATE INDEX IF NOT EXISTS idx_task_status ON task_entity (status);
CREATE INDEX IF NOT EXISTS idx_task_category ON task_entity (category);
CREATE INDEX IF NOT EXISTS idx_task_type ON task_entity (type);
CREATE INDEX IF NOT EXISTS idx_task_priority ON task_entity (priority);
CREATE INDEX IF NOT EXISTS idx_task_createdat ON task_entity (createdat);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_updatedat ON task_entity (updatedat);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_deleted ON task_entity (deleted);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_status_category ON task_entity (status, category);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_about_fqn_hash ON task_entity (aboutfqnhash);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_status_about ON task_entity (status, aboutfqnhash);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_created_by_id ON task_entity (createdbyid);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_created_by_category ON task_entity (createdbyid, category);
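A minimal sketch of how the generated columns behave, assuming the `task_entity` table above already exists. Every value in the payload is invented for illustration, including the category and type names. The block runs inside a transaction that is rolled back, so it leaves no data behind.

```sql
BEGIN;

-- Only id, json, and fqnhash are written; generated columns cannot be set directly.
INSERT INTO task_entity (id, json, fqnhash)
VALUES (
  '00000000-0000-0000-0000-000000000001',
  '{"taskId": "TASK-00001", "name": "demo-task", "category": "Approval",
    "type": "GlossaryApproval", "status": "Open",
    "createdAt": 1700000000000, "updatedAt": 1700000000000}'::jsonb,
  'demo-fqn-hash'
);

-- taskid and status are derived from json; priority falls back to 'Medium'
-- because the payload has no "priority" key.
SELECT taskid, status, priority, deleted
FROM task_entity
WHERE id = '00000000-0000-0000-0000-000000000001';

ROLLBACK;
```

Leaving out any key that backs a `NOT NULL` generated column (for example `status`) should make the insert fail, which is how the table enforces the minimum payload shape.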
|
||||
|
||||
-- For 2.0.0 environments that ran the CREATE TABLE above before the
|
||||
-- approvedbyid generated column was added inline, attach it now. CREATE TABLE
|
||||
-- IF NOT EXISTS is a no-op on those environments so the column would never
|
||||
-- appear otherwise. Postgres supports `ADD COLUMN IF NOT EXISTS` natively.
|
||||
-- The ALTER must run before idx_task_approved_by_id is created — otherwise
|
||||
-- existing-2.0.0 deployments would fail the CREATE INDEX with "column does
|
||||
-- not exist" before the ADD COLUMN ever runs.
|
||||
ALTER TABLE task_entity
|
||||
ADD COLUMN IF NOT EXISTS approvedbyid character varying(36)
|
||||
GENERATED ALWAYS AS ((json ->> 'approvedById'::text)) STORED;
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_task_approved_by_id ON task_entity (approvedbyid);
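One way to confirm that the column ended up as a generated column on an upgraded environment is to query the catalog. This is a verification sketch, not part of the migration, and it assumes the table lives in a schema on the current search path.

```sql
-- is_generated reports 'ALWAYS' for generated columns (PostgreSQL 12 and later).
SELECT column_name, is_generated, generation_expression
FROM information_schema.columns
WHERE table_name = 'task_entity'
  AND column_name = 'approvedbyid';
```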
|
||||
|
||||
CREATE TABLE IF NOT EXISTS new_task_sequence (
|
||||
id bigint NOT NULL DEFAULT 0
|
||||
);
|
||||
|
||||
INSERT INTO new_task_sequence (id) SELECT 0 WHERE NOT EXISTS (SELECT 1 FROM new_task_sequence);
|
||||
|
||||
-- =====================================================
|
||||
-- ACTIVITY STREAM TABLE (Partitioned by time)
|
||||
-- Lightweight, ephemeral activity notifications
|
||||
-- NOT for audit/compliance - use entity version history
|
||||
-- Partitions are managed dynamically by ActivityStreamPartitionManager
|
||||
-- =====================================================
|
||||
CREATE TABLE IF NOT EXISTS activity_stream (
|
||||
id character varying(36) NOT NULL,
|
||||
eventtype character varying(64) NOT NULL,
|
||||
entitytype character varying(64) NOT NULL,
|
||||
entityid character varying(36) NOT NULL,
|
||||
entityfqnhash character varying(768),
|
||||
about character varying(2048),
|
||||
aboutfqnhash character varying(768),
|
||||
actorid character varying(36) NOT NULL,
|
||||
actorname character varying(256),
|
||||
timestamp bigint NOT NULL,
|
||||
summary character varying(500),
|
||||
fieldname character varying(256),
|
||||
oldvalue text,
|
||||
newvalue text,
|
||||
domains jsonb,
|
||||
json jsonb NOT NULL,
|
||||
PRIMARY KEY (id, timestamp)
|
||||
) PARTITION BY RANGE (timestamp);
|
||||
|
||||
-- Default partition catches all data until monthly partitions are created
|
||||
-- ActivityStreamPartitionManager will create monthly partitions and detach old ones
|
||||
CREATE TABLE IF NOT EXISTS activity_stream_default PARTITION OF activity_stream DEFAULT;
|
||||
|
||||
-- Indexes for activity stream (created on parent, inherited by partitions)
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_timestamp ON activity_stream (timestamp);
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_entity ON activity_stream (entitytype, entityid, timestamp);
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_actor ON activity_stream (actorid, timestamp);
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_event_type ON activity_stream (eventtype, timestamp);
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_entity_fqn ON activity_stream (entityfqnhash, timestamp);
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_about ON activity_stream (aboutfqnhash, timestamp);
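The comments above say monthly partitions are created and detached by ActivityStreamPartitionManager. The statements below are only a guess at the kind of DDL involved: the partition name is hypothetical, and the bounds assume `timestamp` holds epoch milliseconds in UTC.

```sql
-- Hypothetical monthly partition covering 2026-01-01 up to, but not including, 2026-02-01.
CREATE TABLE IF NOT EXISTS activity_stream_2026_01
  PARTITION OF activity_stream
  FOR VALUES FROM (1767225600000) TO (1769904000000);

-- Later, retire the whole month at once instead of deleting row by row.
ALTER TABLE activity_stream DETACH PARTITION activity_stream_2026_01;
```

PostgreSQL rejects the `CREATE TABLE ... PARTITION OF` statement if the default partition already contains rows that fall inside the new range, so a real implementation would need to create partitions ahead of time or move those rows first.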
|
||||
|
||||
-- Activity stream configuration per domain
|
||||
CREATE TABLE IF NOT EXISTS activity_stream_config (
|
||||
id character varying(36) NOT NULL,
|
||||
json jsonb NOT NULL,
|
||||
scope character varying(32) GENERATED ALWAYS AS ((json ->> 'scope'::text)) STORED NOT NULL,
|
||||
domainid character varying(36) GENERATED ALWAYS AS ((json -> 'scopeReference' ->> 'id'::text)) STORED,
|
||||
enabled boolean GENERATED ALWAYS AS (((json ->> 'enabled'::text))::boolean) STORED,
|
||||
retentiondays integer GENERATED ALWAYS AS (((json ->> 'retentionDays'::text))::integer) STORED,
|
||||
PRIMARY KEY (id),
|
||||
CONSTRAINT uk_activity_domain_config UNIQUE (domainid)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_config_scope ON activity_stream_config (scope);
|
||||
CREATE INDEX IF NOT EXISTS idx_activity_config_enabled ON activity_stream_config (enabled);
|
||||
|
||||
-- =====================================================
|
||||
-- ANNOUNCEMENT ENTITY TABLE
|
||||
-- Standalone entity for asset announcements (migrated from thread_entity)
|
||||
-- =====================================================
|
||||
CREATE TABLE IF NOT EXISTS announcement_entity (
|
||||
id character varying(36) NOT NULL,
|
||||
json jsonb NOT NULL,
|
||||
fqnhash character varying(768) NOT NULL,
|
||||
name character varying(256) GENERATED ALWAYS AS ((json ->> 'name'::text)) STORED NOT NULL,
|
||||
entitylink character varying(512) GENERATED ALWAYS AS ((json ->> 'entityLink'::text)) STORED,
|
||||
status character varying(32) GENERATED ALWAYS AS ((json ->> 'status'::text)) STORED,
|
||||
starttime bigint GENERATED ALWAYS AS (((json ->> 'startTime'::text))::bigint) STORED,
|
||||
endtime bigint GENERATED ALWAYS AS (((json ->> 'endTime'::text))::bigint) STORED,
|
||||
createdby character varying(256) GENERATED ALWAYS AS ((json ->> 'createdBy'::text)) STORED,
|
||||
createdat bigint GENERATED ALWAYS AS (((json ->> 'createdAt'::text))::bigint) STORED,
|
||||
updatedat bigint GENERATED ALWAYS AS (((json ->> 'updatedAt'::text))::bigint) STORED,
|
||||
deleted boolean GENERATED ALWAYS AS (((json ->> 'deleted'::text))::boolean) STORED,
|
||||
PRIMARY KEY (id),
|
||||
CONSTRAINT uk_announcement_fqn_hash UNIQUE (fqnhash)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_announcement_status ON announcement_entity (status);
|
||||
CREATE INDEX IF NOT EXISTS idx_announcement_entitylink ON announcement_entity (entitylink);
|
||||
CREATE INDEX IF NOT EXISTS idx_announcement_starttime ON announcement_entity (starttime);
|
||||
CREATE INDEX IF NOT EXISTS idx_announcement_endtime ON announcement_entity (endtime);
|
||||
CREATE INDEX IF NOT EXISTS idx_announcement_deleted ON announcement_entity (deleted);
|
||||
|
||||
-- =====================================================
|
||||
-- TASK FORM SCHEMA ENTITY TABLE
|
||||
-- Stores form schemas for different task types
|
||||
-- =====================================================
|
||||
CREATE TABLE IF NOT EXISTS task_form_schema_entity (
|
||||
id character varying(36) NOT NULL,
|
||||
json jsonb NOT NULL,
|
||||
fqnhash character varying(768) NOT NULL,
|
||||
name character varying(256) GENERATED ALWAYS AS ((json ->> 'name'::text)) STORED NOT NULL,
|
||||
tasktype character varying(64) GENERATED ALWAYS AS ((json ->> 'taskType'::text)) STORED,
|
||||
taskcategory character varying(32) GENERATED ALWAYS AS ((json ->> 'taskCategory'::text)) STORED,
|
||||
updatedat bigint GENERATED ALWAYS AS (((json ->> 'updatedAt'::text))::bigint) STORED,
|
||||
deleted boolean GENERATED ALWAYS AS (((json ->> 'deleted'::text))::boolean) STORED,
|
||||
PRIMARY KEY (id),
|
||||
CONSTRAINT uk_task_form_schema_fqn_hash UNIQUE (fqnhash)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_task_form_schema_name ON task_form_schema_entity (name);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_form_schema_tasktype ON task_form_schema_entity (tasktype);
|
||||
CREATE INDEX IF NOT EXISTS idx_task_form_schema_deleted ON task_form_schema_entity (deleted);
|
||||
|
||||
-- =====================================================
|
||||
-- KNOWLEDGE CENTER + CONTEXT CENTER DRIVE (Collate → OM port)
|
||||
-- Appended below the Task Redesign tables to preserve main's
|
||||
-- migration order when merging.
|
||||
-- =====================================================
|
||||
|
||||
-- MCP tables are created in 1.13.0 migration.
|
||||
|
||||
-- Knowledge Center: page entity table (Article, QuickLink).
|
||||
-- Existing Collate customers already have this table from 1.2.0-collate with
|
||||
-- subsequent shape changes through 1.6.0-collate (nameHash -> fqnHash VARCHAR(756),
|
||||
-- pageType generated column, composite deleted index). CREATE TABLE IF NOT EXISTS
|
||||
-- is a no-op for them and creates the final shape for fresh OpenMetadata installs.
|
||||
CREATE TABLE IF NOT EXISTS knowledge_center (
|
||||
id VARCHAR(36) GENERATED ALWAYS AS (json ->> 'id') STORED NOT NULL,
|
||||
fqnHash VARCHAR(756) NOT NULL,
|
||||
name VARCHAR(256) GENERATED ALWAYS AS (json ->> 'name') STORED NOT NULL,
|
||||
json JSONB NOT NULL,
|
||||
updatedAt BIGINT GENERATED ALWAYS AS ((json ->> 'updatedAt')::bigint) STORED NOT NULL,
|
||||
updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> 'updatedBy') STORED NOT NULL,
|
||||
deleted BOOLEAN GENERATED ALWAYS AS (COALESCE((json ->> 'deleted')::boolean, false)) STORED,
|
||||
pageType VARCHAR(16) GENERATED ALWAYS AS (json ->> 'pageType') STORED NOT NULL,
|
||||
PRIMARY KEY (id),
|
||||
UNIQUE (fqnHash)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS knowledge_center_name_index ON knowledge_center (name);
|
||||
CREATE INDEX IF NOT EXISTS index_knowledge_center_deleted ON knowledge_center (fqnHash, deleted);
|
||||
|
||||
-- Context Center Drive: Folder entity table.
|
||||
CREATE TABLE IF NOT EXISTS drive_folder (
|
||||
id VARCHAR(36) GENERATED ALWAYS AS (json ->> 'id') STORED NOT NULL,
|
||||
name VARCHAR(256) GENERATED ALWAYS AS (json ->> 'name') STORED NOT NULL,
|
||||
nameHash VARCHAR(256) NOT NULL,
|
||||
json JSONB NOT NULL,
|
||||
updatedAt BIGINT GENERATED ALWAYS AS ((json ->> 'updatedAt')::bigint) STORED NOT NULL,
|
||||
updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> 'updatedBy') STORED NOT NULL,
|
||||
deleted BOOLEAN GENERATED ALWAYS AS (COALESCE((json ->> 'deleted')::boolean, false)) STORED,
|
||||
PRIMARY KEY (id),
|
||||
UNIQUE (nameHash)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_drive_folder_updated_at ON drive_folder (updatedAt);
|
||||
|
||||
-- Context Center Drive: File entity table (uploaded PDF/image/spreadsheet/office docs).
|
||||
CREATE TABLE IF NOT EXISTS context_file (
|
||||
id VARCHAR(36) GENERATED ALWAYS AS (json ->> 'id') STORED NOT NULL,
|
||||
name VARCHAR(256) GENERATED ALWAYS AS (json ->> 'name') STORED NOT NULL,
|
||||
nameHash VARCHAR(256) NOT NULL,
|
||||
json JSONB NOT NULL,
|
||||
updatedAt BIGINT GENERATED ALWAYS AS ((json ->> 'updatedAt')::bigint) STORED NOT NULL,
|
||||
updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> 'updatedBy') STORED NOT NULL,
|
||||
deleted BOOLEAN GENERATED ALWAYS AS (COALESCE((json ->> 'deleted')::boolean, false)) STORED,
|
||||
PRIMARY KEY (id),
|
||||
UNIQUE (nameHash)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_context_file_updated_at ON context_file (updatedAt);
|
||||
|
||||
-- Attachments: Asset entity table for uploaded file blobs referenced by ContextFiles, Pages, etc.
|
||||
-- Existing Collate customers have this from 1.7.0-collate. CREATE TABLE IF NOT EXISTS is a no-op for them.
|
||||
CREATE TABLE IF NOT EXISTS asset_entity (
|
||||
id VARCHAR(36) GENERATED ALWAYS AS (json ->> 'id') STORED NOT NULL,
|
||||
name VARCHAR(256) GENERATED ALWAYS AS (json ->> 'fileName') STORED NOT NULL,
|
||||
url VARCHAR(1024) GENERATED ALWAYS AS (json ->> 'url') STORED NOT NULL,
|
||||
fullyQualifiedName VARCHAR(256) GENERATED ALWAYS AS (json ->> 'fullyQualifiedName') STORED NOT NULL,
|
||||
assetType VARCHAR(100) GENERATED ALWAYS AS (json ->> 'assetType') STORED NOT NULL,
|
||||
json JSONB NOT NULL,
|
||||
updatedAt BIGINT GENERATED ALWAYS AS ((json ->> 'updatedAt')::bigint) STORED NOT NULL,
|
||||
updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> 'updatedBy') STORED NOT NULL,
|
||||
fqnHash VARCHAR(768) NOT NULL,
|
||||
deleted BOOLEAN GENERATED ALWAYS AS (COALESCE(CAST(json ->> 'deleted' AS BOOLEAN), false)) STORED,
|
||||
PRIMARY KEY (id)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS fqnhash_index ON asset_entity (fqnHash);
|
||||
CREATE INDEX IF NOT EXISTS asset_type_index ON asset_entity (assetType);
|
||||
CREATE INDEX IF NOT EXISTS idx_asset_deleted ON asset_entity (deleted);
|
||||
|
||||
-- Context Center Drive: File content snapshot table (revisions, extracted text).
|
||||
CREATE TABLE IF NOT EXISTS context_file_content (
|
||||
id VARCHAR(36) GENERATED ALWAYS AS (json ->> 'id') STORED NOT NULL,
|
||||
name VARCHAR(256) GENERATED ALWAYS AS (json ->> 'name') STORED NOT NULL,
|
||||
nameHash VARCHAR(256) NOT NULL,
|
||||
json JSONB NOT NULL,
|
||||
updatedAt BIGINT GENERATED ALWAYS AS ((json ->> 'updatedAt')::bigint) STORED NOT NULL,
|
||||
updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> 'updatedBy') STORED NOT NULL,
|
||||
deleted BOOLEAN GENERATED ALWAYS AS (COALESCE((json ->> 'deleted')::boolean, false)) STORED,
|
||||
PRIMARY KEY (id),
|
||||
UNIQUE (nameHash)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_context_file_content_updated_at ON context_file_content (updatedAt);
|
||||
|
||||
-- Add tag_usage.metadata column if missing (newer tag usage payloads carry metadata).
|
||||
ALTER TABLE IF EXISTS tag_usage
|
||||
ADD COLUMN IF NOT EXISTS metadata JSONB;
|
||||
|
||||
-- Add audit_log_event.search_text column if missing (searchable audit log text).
|
||||
ALTER TABLE IF EXISTS audit_log_event
|
||||
ADD COLUMN IF NOT EXISTS search_text TEXT;
|
||||
|
||||
-- Distributed reindex job tracking.
|
||||
CREATE TABLE IF NOT EXISTS search_index_job (
|
||||
id VARCHAR(64) PRIMARY KEY,
|
||||
status VARCHAR(64) NOT NULL,
|
||||
jobConfiguration JSONB NOT NULL,
|
||||
targetIndexPrefix VARCHAR(256) NOT NULL,
|
||||
stagedIndexMapping JSONB NULL,
|
||||
totalRecords BIGINT NOT NULL DEFAULT 0,
|
||||
processedRecords BIGINT NOT NULL DEFAULT 0,
|
||||
successRecords BIGINT NOT NULL DEFAULT 0,
|
||||
failedRecords BIGINT NOT NULL DEFAULT 0,
|
||||
stats JSONB NOT NULL DEFAULT '{}'::jsonb,
|
||||
createdBy VARCHAR(256) NOT NULL,
|
||||
createdAt BIGINT NOT NULL,
|
||||
startedAt BIGINT NULL,
|
||||
completedAt BIGINT NULL,
|
||||
updatedAt BIGINT NOT NULL,
|
||||
errorMessage TEXT NULL,
|
||||
registrationDeadline BIGINT NULL,
|
||||
registeredServerCount INTEGER NULL
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_search_index_job_status_created_at
|
||||
ON search_index_job (status, createdAt DESC);
|
||||
|
||||
-- Retry queue for failed search-index writes.
|
||||
CREATE TABLE IF NOT EXISTS search_index_retry_queue (
|
||||
entityId VARCHAR(64) NOT NULL,
|
||||
entityFqn VARCHAR(768) NOT NULL,
|
||||
failureReason TEXT NULL,
|
||||
status VARCHAR(64) NOT NULL,
|
||||
entityType VARCHAR(128) NOT NULL,
|
||||
retryCount INTEGER NOT NULL DEFAULT 0,
|
||||
claimedAt TIMESTAMP NULL,
|
||||
PRIMARY KEY (entityId, entityFqn)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_search_index_retry_queue_status
|
||||
ON search_index_retry_queue (status);
|
||||
CREATE INDEX IF NOT EXISTS idx_search_index_retry_queue_claimed_at
|
||||
ON search_index_retry_queue (claimedAt);
|
||||
|
||||
-- ContextMemory entity - reusable Context Center memory.
|
||||
CREATE TABLE IF NOT EXISTS context_memory (
|
||||
id VARCHAR(36) GENERATED ALWAYS AS (json ->> 'id') STORED NOT NULL,
|
||||
name VARCHAR(256) GENERATED ALWAYS AS (json ->> 'name') STORED NOT NULL,
|
||||
nameHash VARCHAR(256) NOT NULL,
|
||||
json JSONB NOT NULL,
|
||||
updatedAt BIGINT GENERATED ALWAYS AS ((json ->> 'updatedAt')::bigint) STORED NOT NULL,
|
||||
updatedBy VARCHAR(256) GENERATED ALWAYS AS (json ->> 'updatedBy') STORED NOT NULL,
|
||||
deleted BOOLEAN GENERATED ALWAYS AS ((json ->> 'deleted')::boolean) STORED,
|
||||
|
||||
PRIMARY KEY (id),
|
||||
UNIQUE (nameHash)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_context_memory_updated_at ON context_memory (updatedAt);
|
||||
|
|
|
|||
|
|
@ -0,0 +1,29 @@
|
|||
-- Post data migration script for Task workflow cutover - OpenMetadata 2.0.1
|
||||
|
||||
-- RdfIndexApp: switch to weekly Saturday cron and full-rebuild every run.
|
||||
-- Previous defaults (daily, incremental) were producing unbounded triple growth
|
||||
-- because relationship-removal paths weren't fully reconciled. With per-run
|
||||
-- CLEAR ALL the dataset always converges to MySQL state; weekly cadence keeps
|
||||
-- per-run cost from saturating Fuseki.
|
||||
--
|
||||
-- Also rewrite `entities` to `["all"]`. Pre-upgrade, an operator could have
|
||||
-- narrowed RDF indexing to a subset of entity types; the new recreateIndex=true
|
||||
-- semantics issues a CLEAR ALL before indexing, which would otherwise wipe
|
||||
-- triples for entity types still in MySQL but missing from the subset list.
|
||||
-- Forcing the subset list back to `["all"]` ensures the post-CLEAR-ALL run
|
||||
-- repopulates the graph fully; operators can re-narrow after the migration if
|
||||
-- they need partial indexing.
|
||||
UPDATE installed_apps
|
||||
SET json = JSON_SET(
|
||||
JSON_SET(
|
||||
json,
|
||||
'$.appConfiguration.recreateIndex', CAST('true' AS JSON),
|
||||
'$.appSchedule.cronExpression', '0 0 * * 6'
|
||||
),
|
||||
'$.appConfiguration.entities', JSON_ARRAY('all')
|
||||
)
|
||||
WHERE name = 'RdfIndexApp';
|
||||
|
||||
UPDATE apps_marketplace
|
||||
SET json = JSON_SET(json, '$.appConfiguration.recreateIndex', CAST('true' AS JSON))
|
||||
WHERE name = 'RdfIndexApp';
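A small MySQL sketch of the two `JSON_SET` details used above: several path/value pairs in one call, and `CAST('true' AS JSON)` to store a JSON boolean instead of a string. The input document is made up for illustration.

```sql
-- Two path/value pairs applied in a single JSON_SET call.
SELECT JSON_SET(
         '{"appConfiguration": {"recreateIndex": false}, "appSchedule": {"cronExpression": "0 0 * * *"}}',
         '$.appConfiguration.recreateIndex', CAST('true' AS JSON),
         '$.appSchedule.cronExpression', '0 0 * * 6'
       ) AS updated_json;

-- Compare the stored type with and without the cast.
SELECT JSON_TYPE(CAST('true' AS JSON)) AS with_cast,
       JSON_TYPE(JSON_EXTRACT(JSON_SET('{}', '$.flag', 'true'), '$.flag')) AS without_cast;
```

Without the cast, MySQL treats `'true'` as an ordinary string and stores `"true"`, which a consumer expecting a boolean may not read as intended.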
|
||||
|
|
@ -0,0 +1,11 @@
|
|||
-- Task workflow cutover support - OpenMetadata 2.0.1
|
||||
-- Maps legacy thread task IDs to new task entity IDs for migration traceability and redirects.
|
||||
|
||||
CREATE TABLE IF NOT EXISTS task_migration_mapping (
|
||||
old_thread_id varchar(36) NOT NULL,
|
||||
new_task_id varchar(36) NOT NULL,
|
||||
migrated_at bigint NOT NULL,
|
||||
source varchar(64) DEFAULT 'thread_task_migration',
|
||||
PRIMARY KEY (old_thread_id),
|
||||
KEY idx_task_migration_mapping_new_task_id (new_task_id)
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
|
||||
|
|
@ -0,0 +1,30 @@
|
|||
-- Post data migration script for Task workflow cutover - OpenMetadata 2.0.1
|
||||
|
||||
-- RdfIndexApp: switch to weekly Saturday cron and full-rebuild every run.
|
||||
-- Previous defaults (daily, incremental) were producing unbounded triple growth
|
||||
-- because relationship-removal paths weren't fully reconciled. With per-run
|
||||
-- CLEAR ALL the dataset always converges to the database state; weekly cadence keeps
|
||||
-- per-run cost from saturating Fuseki.
|
||||
--
|
||||
-- Also rewrite `entities` to `["all"]`. Pre-upgrade, an operator could have
|
||||
-- narrowed RDF indexing to a subset of entity types; the new recreateIndex=true
|
||||
-- semantics issues a CLEAR ALL before indexing, which would otherwise wipe
|
||||
-- triples for entity types still in the database but missing from the subset list.
|
||||
-- Forcing the subset list back to `["all"]` ensures the post-CLEAR-ALL run
|
||||
-- repopulates the graph fully; operators can re-narrow after the migration if
|
||||
-- they need partial indexing.
|
||||
UPDATE installed_apps
|
||||
SET json = jsonb_set(
|
||||
jsonb_set(
|
||||
jsonb_set(json::jsonb, '{appConfiguration,recreateIndex}', 'true'),
|
||||
'{appSchedule,cronExpression}',
|
||||
'"0 0 * * 6"'
|
||||
),
|
||||
'{appConfiguration,entities}',
|
||||
'["all"]'::jsonb
|
||||
)
|
||||
WHERE name = 'RdfIndexApp';
|
||||
|
||||
UPDATE apps_marketplace
|
||||
SET json = jsonb_set(json::jsonb, '{appConfiguration,recreateIndex}', 'true')
|
||||
WHERE name = 'RdfIndexApp';
|
||||
|
|
@ -0,0 +1,13 @@
|
|||
-- Task workflow cutover support - OpenMetadata 2.0.1
|
||||
-- Maps legacy thread task IDs to new task entity IDs for migration traceability and redirects.
|
||||
|
||||
CREATE TABLE IF NOT EXISTS task_migration_mapping (
|
||||
old_thread_id character varying(36) NOT NULL,
|
||||
new_task_id character varying(36) NOT NULL,
|
||||
migrated_at bigint NOT NULL,
|
||||
source character varying(64) DEFAULT 'thread_task_migration',
|
||||
PRIMARY KEY (old_thread_id)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_task_migration_mapping_new_task_id
|
||||
ON task_migration_mapping (new_task_id);
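A sketch of the kind of lookup a redirect from a legacy thread URL might use, assuming the `task_entity` table from the 2.0.0 migration is present. The thread id is a placeholder, and the actual application query may differ.

```sql
-- Resolve a legacy thread id to the migrated task and its current status.
SELECT m.new_task_id, t.taskid, t.status
FROM task_migration_mapping AS m
JOIN task_entity AS t ON t.id = m.new_task_id
WHERE m.old_thread_id = '00000000-0000-0000-0000-000000000002';
```

The reverse direction, finding the legacy thread for a given task, is what the index on `new_task_id` supports.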
|
||||
|
|
@ -349,10 +349,11 @@ CREATE TABLE `entity_relationship` (
|
|||
`fromEntity` varchar(256) NOT NULL,
|
||||
`toEntity` varchar(256) NOT NULL,
|
||||
`relation` tinyint NOT NULL,
|
||||
`relationType` varchar(64) NOT NULL DEFAULT '',
|
||||
`jsonSchema` varchar(256) DEFAULT NULL,
|
||||
`json` json DEFAULT NULL,
|
||||
`deleted` tinyint(1) NOT NULL DEFAULT '0',
|
||||
PRIMARY KEY (`fromId`,`toId`,`relation`),
|
||||
PRIMARY KEY (`fromId`,`toId`,`relation`,`relationType`),
|
||||
KEY `from_index` (`fromId`,`relation`),
|
||||
KEY `to_index` (`toId`,`relation`)
|
||||
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
|
||||
|
|
|
|||
|
|
@ -323,6 +323,7 @@ CREATE TABLE public.entity_relationship (
|
|||
fromentity character varying(256) NOT NULL,
|
||||
toentity character varying(256) NOT NULL,
|
||||
relation smallint NOT NULL,
|
||||
relationtype character varying(64) DEFAULT ''::character varying NOT NULL,
|
||||
jsonschema character varying(256),
|
||||
json jsonb,
|
||||
deleted boolean DEFAULT false NOT NULL
|
||||
|
|
@ -1326,7 +1327,7 @@ ALTER TABLE ONLY public.entity_extension
|
|||
--
|
||||
|
||||
ALTER TABLE ONLY public.entity_relationship
|
||||
ADD CONSTRAINT entity_relationship_pkey PRIMARY KEY (fromid, toid, relation);
|
||||
ADD CONSTRAINT entity_relationship_pkey PRIMARY KEY (fromid, toid, relation, relationtype);
|
||||
|
||||
|
||||
-- Name: event_subscription_entity event_subscription_entity_namehash_key; Type: CONSTRAINT; Schema: public; Owner: openmetadata_user
|
||||
|
|
|
|||
|
|
@ -46,33 +46,15 @@
|
|||
<version>${org.junit.jupiter.version}</version>
|
||||
<scope>test</scope>
|
||||
</dependency>
|
||||
<!-- used for generating custom annotations in jsonschema2pojo-maven-plugin -->
|
||||
<!-- Build-time only (used by jsonschema2pojo-maven-plugin in openmetadata-spec).
|
||||
<scope>provided</scope> keeps it on compile classpath but excludes it (and all
|
||||
transitives — including Jackson 3.x: GHSA-2m67-wjpj-xhg9, CVE-2026-29062,
|
||||
GHSA-72hv-8253-57qq) from runtime / dist packaging. -->
|
||||
<dependency>
|
||||
<groupId>org.jsonschema2pojo</groupId>
|
||||
<artifactId>jsonschema2pojo-core</artifactId>
|
||||
<version>${jsonschema2pojo.version}</version>
|
||||
<exclusions>
|
||||
<exclusion>
|
||||
<groupId>com.fasterxml.jackson.core</groupId>
|
||||
<artifactId>jackson-databind</artifactId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<groupId>com.google.code.gson</groupId>
|
||||
<artifactId>gson</artifactId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<groupId>org.yaml</groupId>
|
||||
<artifactId>snakeyaml</artifactId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<groupId>org.apache.commons</groupId>
|
||||
<artifactId>commons-lang3</artifactId>
|
||||
</exclusion>
|
||||
<exclusion>
|
||||
<groupId>commons-lang</groupId>
|
||||
<artifactId>commons-lang</artifactId>
|
||||
</exclusion>
|
||||
</exclusions>
|
||||
<scope>provided</scope>
|
||||
</dependency>
|
||||
<dependency>
|
||||
<groupId>org.apache.commons</groupId>
|
||||
|
|
|
|||
|
|
@ -121,8 +121,8 @@ server:
|
|||
# jceProvider: (none)
|
||||
# validateCerts: true
|
||||
# validatePeers: true
|
||||
# supportedProtocols: SSLv3
|
||||
# supportedCipherSuites: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
|
||||
# supportedProtocols: [TLSv1.2, TLSv1.3]
|
||||
# supportedCipherSuites: [TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256]
|
||||
# allowRenegotiation: true
|
||||
# endpointIdentificationAlgorithm: (none)
|
||||
|
||||
|
|
@ -149,8 +149,8 @@ server:
|
|||
# jceProvider: (none)
|
||||
# validateCerts: true
|
||||
# validatePeers: true
|
||||
# supportedProtocols: SSLv3
|
||||
# supportedCipherSuites: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
|
||||
# supportedProtocols: [TLSv1.2, TLSv1.3]
|
||||
# supportedCipherSuites: [TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256]
|
||||
# allowRenegotiation: true
|
||||
# endpointIdentificationAlgorithm: (none)
|
||||
|
||||
|
|
@ -162,6 +162,16 @@ qos:
|
|||
maxSuspendedRequestCount: ${QOS_MAX_SUSPENDED_REQUEST_COUNT:-1000}
|
||||
maxSuspendSeconds: ${QOS_MAX_SUSPEND_SECONDS:-30}
|
||||
|
||||
cacheMemory:
|
||||
# Entity JSON caches (CACHE_WITH_ID, CACHE_WITH_NAME) — weight-based eviction.
|
||||
# Entity JSON can range from 1KB to 2MB+. Increase on high-memory deployments for better hit rates.
|
||||
entityCacheMaxSizeBytes: ${ENTITY_CACHE_MAX_SIZE_BYTES:-104857600} # 100 MB
|
||||
entityCacheTTLSeconds: ${ENTITY_CACHE_TTL_SECONDS:-30}
|
||||
# Auth caches (user context + policies) — TTLs hardcoded (2min policies, 15min user context)
|
||||
authCacheMaxEntries: ${AUTH_CACHE_MAX_ENTRIES:-5000}
|
||||
# RBAC query cache (OpenSearch role-based access control query DSL)
|
||||
rbacCacheMaxEntries: ${RBAC_CACHE_MAX_ENTRIES:-5000}
|
||||
|
||||
# Logging settings.
|
||||
# https://logback.qos.ch/manual/layouts.html#conversionWord
|
||||
# Set LOG_FORMAT=json for structured logs. The default text format preserves legacy output.
|
||||
|
|
@ -184,22 +194,6 @@ logging:
|
|||
archivedFileCount: 7
|
||||
timeZone: UTC
|
||||
maxFileSize: 50MB
|
||||
org.openmetadata.slowrequest:
|
||||
level: ${SLOW_REQUEST_LOG_LEVEL:-OFF}
|
||||
additive: false
|
||||
appenders:
|
||||
- type: file
|
||||
layout:
|
||||
type: om-event-layout
|
||||
format: ${LOG_FORMAT:-text}
|
||||
pattern: "%level [%d{ISO8601,UTC}] [%t] %logger{5} - %msg%n"
|
||||
appendLineSeparator: true
|
||||
threshold: WARN
|
||||
currentLogFilename: ./logs/slow-requests.log
|
||||
archivedLogFilenamePattern: ./logs/slow-requests-%d{yyyy-MM-dd}-%i.log.gz
|
||||
archivedFileCount: 7
|
||||
timeZone: UTC
|
||||
maxFileSize: 50MB
|
||||
org.openmetadata.service.util.OpenMetadataSetup:
|
||||
level: INFO
|
||||
appenders:
|
||||
|
|
@ -500,8 +494,8 @@ elasticsearch:
|
|||
naturalLanguageSearch:
|
||||
enabled: ${NATURAL_LANGUAGE_SEARCH_ENABLED:-false}
|
||||
semanticSearchEnabled: ${SEMANTIC_SEARCH_ENABLED:-false}
|
||||
embeddingProvider: ${EMBEDDING_PROVIDER:-bedrock} # Options: "openai", "bedrock", "djl"
|
||||
maxConcurrentEmbeddingRequests: ${MAX_CONCURRENT_EMBEDDING_REQUESTS:-10}
|
||||
embeddingProvider: ${EMBEDDING_PROVIDER:-bedrock} # Options: "openai", "bedrock", "google", "djl"
|
||||
maxConcurrentRequests: ${MAX_CONCURRENT_EMBEDDING_REQUESTS:-10}
|
||||
providerClass: ${NATURAL_LANGUAGE_SEARCH_PROVIDER_CLASS:-org.openmetadata.service.search.nlq.NoOpNLQService}
|
||||
bedrock:
|
||||
awsConfig:
|
||||
|
|
@ -521,6 +515,11 @@ elasticsearch:
|
|||
apiVersion: ${OPENAI_API_VERSION:-"2024-02-01"} # Azure OpenAI API version
|
||||
embeddingModelId: ${OPENAI_EMBEDDING_MODEL_ID:-"text-embedding-3-small"}
|
||||
embeddingDimension: ${OPENAI_EMBEDDING_DIMENSION:-1536}
|
||||
google:
|
||||
apiKey: ${GOOGLE_API_KEY:-""} # API key from Google AI Studio
|
||||
embeddingModelId: ${GOOGLE_EMBEDDING_MODEL_ID:-"gemini-embedding-001"}
|
||||
embeddingDimension: ${GOOGLE_EMBEDDING_DIMENSION:-768} # Sent as outputDimensionality. gemini-embedding-001 supports 768/1536/3072; text-embedding-004 supports 768.
|
||||
endpoint: ${GOOGLE_API_ENDPOINT:-""} # Optional override; full :embedContent URL. Leave empty to use the default Generative Language API endpoint.
|
||||
djl:
|
||||
embeddingModel: ${DJL_EMBEDDING_MODEL:-"ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2"}
|
||||
|
||||
|
|
@ -698,6 +697,15 @@ web:
|
|||
permission-policy:
|
||||
enabled: ${WEB_CONF_PERMISSION_POLICY_ENABLED:-false}
|
||||
option: ${WEB_CONF_PERMISSION_POLICY_OPTION:-""}
|
||||
cross-origin-embedder-policy:
|
||||
enabled: ${WEB_CONF_CROSS_ORIGIN_EMBEDDER_POLICY_ENABLED:-false}
|
||||
option: ${WEB_CONF_CROSS_ORIGIN_EMBEDDER_POLICY_OPTION:-"REQUIRE_CORP"}
|
||||
cross-origin-resource-policy:
|
||||
enabled: ${WEB_CONF_CROSS_ORIGIN_RESOURCE_POLICY_ENABLED:-false}
|
||||
option: ${WEB_CONF_CROSS_ORIGIN_RESOURCE_POLICY_OPTION:-"SAME_ORIGIN"}
|
||||
cross-origin-opener-policy:
|
||||
enabled: ${WEB_CONF_CROSS_ORIGIN_OPENER_POLICY_ENABLED:-false}
|
||||
option: ${WEB_CONF_CROSS_ORIGIN_OPENER_POLICY_OPTION:-"SAME_ORIGIN"}
|
||||
cache-control: ${WEB_CONF_CACHE_CONTROL:-""}
|
||||
pragma: ${WEB_CONF_PRAGMA:-""}
|
||||
|
||||
|
|
@ -746,6 +754,8 @@ cache:
|
|||
# Connection pool settings
|
||||
poolSize: ${CACHE_REDIS_POOL_SIZE:-64}
|
||||
connectTimeoutMs: ${CACHE_REDIS_CONNECT_TIMEOUT:-2000}
|
||||
# Per-command timeout. Bounds request-thread blocking when Redis is slow.
|
||||
commandTimeoutMs: ${CACHE_REDIS_COMMAND_TIMEOUT:-300}
|
||||
|
||||
# AWS ElastiCache IAM Authentication (only if using ElastiCache)
|
||||
aws:
|
||||
|
|
|
|||
|
|
@ -34,7 +34,16 @@ services:
|
|||
condition: service_healthy
|
||||
|
||||
fuseki:
|
||||
image: stain/jena-fuseki:5.0.0
|
||||
# Build from the in-repo Dockerfile (Fuseki 5.6.0) instead of the
|
||||
# unmaintained `stain/jena-fuseki` Docker Hub image, which capped at 5.1.0
|
||||
# and never picked up the 2025 admin-side Fuseki CVE fixes (CVE-2025-49656,
|
||||
# CVE-2025-50151 — both fixed in Jena 5.5.0). The `image:` tag below names
|
||||
# the locally-built image so subsequent `docker compose up` runs reuse the
|
||||
# cached build instead of rebuilding from scratch each time.
|
||||
build:
|
||||
context: ../rdf-store
|
||||
dockerfile: Dockerfile
|
||||
image: openmetadata-fuseki:5.6.0
|
||||
container_name: openmetadata-fuseki
|
||||
hostname: fuseki
|
||||
ports:
|
||||
|
|
@ -42,11 +51,24 @@ services:
|
|||
networks:
|
||||
- local_app_net
|
||||
environment:
|
||||
- ADMIN_PASSWORD=admin
|
||||
# Default for local dev only — production deployments MUST override
|
||||
# via the FUSEKI_ADMIN_PASSWORD env var (and FUSEKI_OPENMETADATA_PASSWORD)
|
||||
# before bringing this stack up. The entrypoint envsubsts these into
|
||||
# shiro.ini at container start so the override actually takes effect.
|
||||
- FUSEKI_ADMIN_PASSWORD=${FUSEKI_ADMIN_PASSWORD:-admin}
|
||||
- FUSEKI_OPENMETADATA_PASSWORD=${FUSEKI_OPENMETADATA_PASSWORD:-openmetadata-secret}
|
||||
- JVM_ARGS=${FUSEKI_JVM_ARGS:--Xmx1500m -Xms256m}
|
||||
- FUSEKI_BASE=/fuseki
|
||||
volumes:
|
||||
- fuseki-data:/fuseki
|
||||
# New volume name (was `fuseki-data` mounted at `/fuseki`). The in-repo
|
||||
# Dockerfile stores TDB2 at `/fuseki-data` and the data layout differs
|
||||
# from the old stain/jena-fuseki image — re-using the previous volume
|
||||
# name would mount stale state at a path Fuseki no longer reads from,
|
||||
# so the database silently looks empty. Using a fresh volume name
|
||||
# forces operators to consciously migrate (or accept a re-index). The
|
||||
# orphaned `fuseki-data` volume can be removed manually with
|
||||
# `docker volume rm fuseki-data` after confirming the new stack is
|
||||
# healthy.
|
||||
- fuseki-tdb2-data:/fuseki-data
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
|
|
@ -60,8 +82,6 @@ services:
|
|||
timeout: 10s
|
||||
retries: 20
|
||||
start_period: 60s
|
||||
# Create the database directory before starting Fuseki
|
||||
entrypoint: /bin/sh -c "mkdir -p /fuseki/databases/openmetadata && exec /docker-entrypoint.sh /jena-fuseki/fuseki-server --update --loc=/fuseki/databases/openmetadata /openmetadata"
|
||||
|
||||
networks:
|
||||
local_app_net:
|
||||
|
|
@ -72,5 +92,5 @@ networks:
|
|||
- subnet: "172.16.239.0/24"
|
||||
|
||||
volumes:
|
||||
fuseki-data:
|
||||
fuseki-tdb2-data:
|
||||
driver: local
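The compose comments above say production deployments must override `FUSEKI_ADMIN_PASSWORD` and `FUSEKI_OPENMETADATA_PASSWORD`. A minimal pre-flight sketch of that rule follows. It is not part of this commit: `check_fuseki_passwords` is a hypothetical helper name, and treating the two compose defaults (`admin`, `openmetadata-secret`) as forbidden values is an assumption drawn from those comments.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not shipped in the repository.
# Fails when either Fuseki password is unset, empty, or still a local-dev default.
check_fuseki_passwords() {
  for name in FUSEKI_ADMIN_PASSWORD FUSEKI_OPENMETADATA_PASSWORD; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    case "$value" in
      ""|admin|openmetadata-secret)
        echo "refusing to start: $name is unset or still a default" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

One way to use it would be to source the function in a deploy wrapper and run `check_fuseki_passwords && docker compose -f docker-compose-fuseki.yml up -d`, so the stack only starts once both variables carry non-default values.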
|
||||
|
|
|
|||
|
|
@ -15,7 +15,9 @@ volumes:
|
|||
ingestion-volume-dags:
|
||||
ingestion-volume-tmp:
|
||||
es-data:
|
||||
fuseki-data:
|
||||
# See docker-compose-fuseki.yml — renamed from `fuseki-data` to avoid
|
||||
# silently mounting stale state under the new Fuseki layout.
|
||||
fuseki-tdb2-data:
|
||||
services:
|
||||
postgresql:
|
||||
build:
|
||||
|
|
@ -565,17 +567,29 @@ services:
|
|||
- /var/run/docker.sock:/var/run/docker.sock:z # Need 600 permissions to run DockerOperator
|
||||
|
||||
fuseki:
|
||||
image: stain/jena-fuseki:5.0.0
|
||||
# See docker-compose-fuseki.yml for the rationale behind building from the
|
||||
# in-repo Dockerfile instead of using `stain/jena-fuseki:*` (unmaintained,
|
||||
# capped at 5.1.0, missing 2025 admin-side CVE fixes).
|
||||
build:
|
||||
context: ../rdf-store
|
||||
dockerfile: Dockerfile
|
||||
image: openmetadata-fuseki:5.6.0
|
||||
container_name: openmetadata-fuseki
|
||||
hostname: fuseki
|
||||
ports:
|
||||
- "3030:3030"
|
||||
environment:
|
||||
- ADMIN_PASSWORD=admin
|
||||
# Local-dev default — production deployments MUST override via
|
||||
# FUSEKI_ADMIN_PASSWORD / FUSEKI_OPENMETADATA_PASSWORD env vars.
|
||||
- FUSEKI_ADMIN_PASSWORD=${FUSEKI_ADMIN_PASSWORD:-admin}
|
||||
- FUSEKI_OPENMETADATA_PASSWORD=${FUSEKI_OPENMETADATA_PASSWORD:-openmetadata-secret}
|
||||
- JVM_ARGS=-Xmx4g -Xms2g
|
||||
- FUSEKI_BASE=/fuseki
|
||||
volumes:
|
||||
- fuseki-data:/fuseki
|
||||
# See docker-compose-fuseki.yml for why the volume was renamed from
|
||||
# `fuseki-data` to `fuseki-tdb2-data` (the data layout differs from the
|
||||
# previous stain/jena-fuseki image and reusing the old name silently
|
||||
# mounts stale state at a path Fuseki no longer reads).
|
||||
- fuseki-tdb2-data:/fuseki-data
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
|
|
@ -584,8 +598,6 @@ services:
|
|||
memory: 2G
|
||||
networks:
|
||||
- local_app_net
|
||||
# Create the database directory before starting Fuseki
|
||||
entrypoint: /bin/sh -c "mkdir -p /fuseki/databases/openmetadata && exec /docker-entrypoint.sh /jena-fuseki/fuseki-server --update --loc=/fuseki/databases/openmetadata /openmetadata"
|
||||
|
||||
|
||||
|
||||
|
|
|
|||
8
docker/development/docker-compose.cache-off.yml
Normal file
|
|
@ -0,0 +1,8 @@
|
|||
# Override that disables the cache while leaving the rest of the stack intact.
|
||||
# Used in the local A/B benchmark to flip cache off without tearing down volumes.
|
||||
# Apply on TOP of base compose (NOT the redis overlay):
|
||||
# docker compose -f docker-compose.yml -f docker-compose.cache-off.yml up -d --no-deps openmetadata-server
|
||||
services:
|
||||
openmetadata-server:
|
||||
environment:
|
||||
CACHE_PROVIDER: none
|
||||
86
docker/development/docker-compose.multiserver.yml
Normal file
|
|
@ -0,0 +1,86 @@
|
|||
# Copyright 2021 Collate
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
|
||||
# Adds a second OM instance that shares MySQL/Elasticsearch/Redis with the
|
||||
# primary one in docker-compose.yml. Used to validate that pub/sub
|
||||
# invalidation keeps per-instance Guava caches coherent.
|
||||
#
|
||||
# Usage:
|
||||
# docker compose -f docker-compose.yml -f docker-compose.redis.yml \
|
||||
# -f docker-compose.multiserver.yml up -d
|
||||
services:
|
||||
openmetadata-server-2:
|
||||
image: development-openmetadata-server
|
||||
build:
|
||||
context: ../../.
|
||||
dockerfile: docker/development/Dockerfile
|
||||
container_name: openmetadata_server_2
|
||||
restart: always
|
||||
networks:
|
||||
- local_app_net
|
||||
depends_on:
|
||||
mysql:
|
||||
condition: service_healthy
|
||||
elasticsearch:
|
||||
condition: service_healthy
|
||||
redis:
|
||||
condition: service_healthy
|
||||
ports:
|
||||
- "8587:8585"
|
||||
- "8588:8586"
|
||||
environment:
|
||||
OPENMETADATA_CLUSTER_NAME: openmetadata
|
||||
SERVER_PORT: 8585
|
||||
SERVER_ADMIN_PORT: 8586
|
||||
LOG_LEVEL: INFO
|
||||
FERNET_KEY: jJ/9sz0g0OHxsfxOoSfdFdmk3ysNmPRnH3TUAbz3IHA=
|
||||
DB_DRIVER_CLASS: com.mysql.cj.jdbc.Driver
|
||||
DB_SCHEME: mysql
|
||||
DB_USE_SSL: "false"
|
||||
DB_USER: openmetadata_user
|
||||
DB_USER_PASSWORD: openmetadata_password
|
||||
DB_HOST: mysql
|
||||
DB_PORT: 3306
|
||||
DB_PARAMS: allowPublicKeyRetrieval=true&useSSL=false&serverTimezone=UTC
|
||||
OM_DATABASE: openmetadata_db
|
||||
ELASTICSEARCH_HOST: elasticsearch
|
||||
ELASTICSEARCH_PORT: 9200
|
||||
ELASTICSEARCH_SCHEME: http
|
||||
SEARCH_TYPE: elasticsearch
|
||||
ELASTICSEARCH_CLUSTER_ALIAS: openmetadata
|
||||
AUTHENTICATION_PROVIDER: basic
|
||||
AUTHENTICATION_ENABLE_SELF_SIGNUP: "true"
|
||||
AUTHORIZER_CLASS_NAME: org.openmetadata.service.security.DefaultAuthorizer
|
||||
AUTHORIZER_REQUEST_FILTER: org.openmetadata.service.security.JwtFilter
|
||||
AUTHORIZER_ADMIN_PRINCIPALS: "[admin]"
|
||||
AUTHORIZER_PRINCIPAL_DOMAIN: open-metadata.org
|
||||
AUTHORIZER_ALLOWED_DOMAINS: "[]"
|
||||
AUTHORIZER_ALLOWED_REGISTRATION_DOMAIN: '["all"]'
|
||||
AUTHORIZER_INGESTION_PRINCIPALS: "[ingestion-bot]"
|
||||
AUTHENTICATION_RESPONSE_TYPE: id_token
|
||||
AUTHENTICATION_CLIENT_TYPE: public
|
||||
AUTHENTICATION_PUBLIC_KEYS: "[http://openmetadata-server-2:8585/api/v1/system/config/jwks]"
|
||||
AUTHENTICATION_AUTHORITY: https://accounts.google.com
|
||||
AUTHENTICATION_JWT_PRINCIPAL_CLAIMS: "[email,preferred_username,sub]"
|
||||
RSA_PUBLIC_KEY_FILE_PATH: ./conf/public_key.der
|
||||
RSA_PRIVATE_KEY_FILE_PATH: ./conf/private_key.der
|
||||
JWT_ISSUER: open-metadata.org
|
||||
JWT_KEY_ID: Gb389a-9f76-gdjs-a92j-0242bk94356
|
||||
PIPELINE_SERVICE_CLIENT_ENDPOINT: http://ingestion:8080
|
||||
PIPELINE_SERVICE_CLIENT_CLASS_NAME: org.openmetadata.service.clients.pipeline.airflow.AirflowRESTClient
|
||||
AIRFLOW_USERNAME: admin
|
||||
AIRFLOW_PASSWORD: admin
|
||||
AIRFLOW_TIMEOUT: 10
|
||||
SECRET_MANAGER: db
|
||||
SERVER_HOST_API_URL: http://openmetadata-server-2:8585/api
|
||||
EVENT_MONITOR: prometheus
|
||||
OPENMETADATA_HEAP_OPTS: "-Xmx1G -Xms1G"
|
||||
CACHE_PROVIDER: redis
|
||||
CACHE_REDIS_URL: redis://redis:6379
|
||||
CACHE_REDIS_AUTH_TYPE: NONE
|
||||
CACHE_REDIS_KEYSPACE: om:dev
|
||||
CACHE_ENTITY_TTL: 3600
|
||||
CACHE_RELATIONSHIP_TTL: 3600
|
||||
CACHE_TAG_TTL: 3600
|
||||
CACHE_REDIS_COMMAND_TIMEOUT: 300
|
||||
36
docker/development/docker-compose.redis.yml
Normal file
|
|
@ -0,0 +1,36 @@
|
|||
# Copyright 2021 Collate
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
|
||||
# Override that adds a Redis cache to the development stack.
|
||||
# Usage:
|
||||
# docker compose -f docker-compose.yml -f docker-compose.redis.yml up -d
|
||||
services:
|
||||
redis:
|
||||
image: redis:7-alpine
|
||||
container_name: openmetadata_redis
|
||||
restart: always
|
||||
command: ["redis-server", "--appendonly", "no", "--save", "", "--maxmemory", "512mb", "--maxmemory-policy", "allkeys-lru"]
|
||||
networks:
|
||||
- local_app_net
|
||||
ports:
|
||||
- "6379:6379"
|
||||
healthcheck:
|
||||
test: ["CMD", "redis-cli", "ping"]
|
||||
interval: 10s
|
||||
timeout: 3s
|
||||
retries: 5
|
||||
|
||||
openmetadata-server:
|
||||
depends_on:
|
||||
redis:
|
||||
condition: service_healthy
|
||||
environment:
|
||||
CACHE_PROVIDER: redis
|
||||
CACHE_REDIS_URL: redis://redis:6379
|
||||
CACHE_REDIS_AUTH_TYPE: NONE
|
||||
CACHE_REDIS_KEYSPACE: om:dev
|
||||
CACHE_ENTITY_TTL: 3600
|
||||
CACHE_RELATIONSHIP_TTL: 3600
|
||||
CACHE_TAG_TTL: 3600
|
||||
CACHE_REDIS_COMMAND_TIMEOUT: 300
|
||||
|
|
@ -143,7 +143,7 @@ SMTP_SERVER_STRATEGY="SMTP_TLS"
|
|||
OM_RESOURCE_PACKAGES="[]"
|
||||
OM_EXTENSIONS="[]"
|
||||
# Heap OPTS Configurations
|
||||
OPENMETADATA_HEAP_OPTS="-Xmx1G -Xms1G"
|
||||
OPENMETADATA_HEAP_OPTS="-Xmx2G -Xms256M -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+HeapDumpOnOutOfMemoryError"
|
||||
# Application Config
|
||||
CUSTOM_LOGO_URL_PATH=""
|
||||
CUSTOM_MONOGRAM_URL_PATH=""
|
||||
|
|
|
|||
|
|
@ -143,7 +143,7 @@ SMTP_SERVER_STRATEGY="SMTP_TLS"
|
|||
OM_RESOURCE_PACKAGES="[]"
|
||||
OM_EXTENSIONS="[]"
|
||||
# Heap OPTS Configurations
|
||||
OPENMETADATA_HEAP_OPTS="-Xmx1G -Xms1G"
|
||||
OPENMETADATA_HEAP_OPTS="-Xmx2G -Xms512M -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+HeapDumpOnOutOfMemoryError"
|
||||
# Application Config
|
||||
CUSTOM_LOGO_URL_PATH=""
|
||||
CUSTOM_MONOGRAM_URL_PATH=""
|
||||
|
|
|
|||
|
|
@ -8,7 +8,7 @@ RUN apt-get update && \
|
|||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Set Fuseki version and paths
|
||||
ENV FUSEKI_VERSION=4.10.0
|
||||
ENV FUSEKI_VERSION=5.6.0
|
||||
ENV FUSEKI_HOME=/fuseki
|
||||
ENV FUSEKI_BASE=/fuseki
|
||||
|
||||
|
|
|
|||
|
|
@ -14,7 +14,7 @@ RUN apt-get update || true && \
|
|||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Set Fuseki version
|
||||
ENV FUSEKI_VERSION=5.0.0
|
||||
ENV FUSEKI_VERSION=5.6.0
|
||||
ENV FUSEKI_HOME=/fuseki
|
||||
|
||||
# Download and install Fuseki
|
||||
|
|
|
|||
|
|
@ -1,5 +1,6 @@
|
|||
# Multi-architecture Fuseki build
|
||||
FROM --platform=$TARGETPLATFORM openjdk:17-jdk-slim
|
||||
# eclipse-temurin replaces the deprecated openjdk Docker Hub images.
|
||||
FROM --platform=$TARGETPLATFORM eclipse-temurin:17-jre-jammy
|
||||
|
||||
# Install required packages
|
||||
RUN apt-get update && \
|
||||
|
|
@ -9,7 +10,7 @@ RUN apt-get update && \
|
|||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Set Fuseki version and paths
|
||||
ENV FUSEKI_VERSION=4.10.0
|
||||
ENV FUSEKI_VERSION=5.6.0
|
||||
ENV FUSEKI_HOME=/fuseki
|
||||
ENV FUSEKI_BASE=/fuseki
|
||||
|
||||
|
|
|
|||
|
|
@ -3,15 +3,15 @@ FROM --platform=$BUILDPLATFORM alpine:latest AS downloader
|
|||
|
||||
RUN apk add --no-cache wget tar
|
||||
|
||||
ENV FUSEKI_VERSION=5.0.0
|
||||
ENV FUSEKI_VERSION=5.6.0
|
||||
WORKDIR /tmp
|
||||
|
||||
RUN wget -q https://archive.apache.org/dist/jena/binaries/apache-jena-fuseki-${FUSEKI_VERSION}.tar.gz && \
|
||||
tar -xzf apache-jena-fuseki-${FUSEKI_VERSION}.tar.gz && \
|
||||
mv apache-jena-fuseki-${FUSEKI_VERSION} /fuseki-dist
|
||||
|
||||
# Use OpenJDK base image for runtime
|
||||
FROM openjdk:17-slim
|
||||
# Runtime: eclipse-temurin replaces the deprecated openjdk Docker Hub images.
|
||||
FROM eclipse-temurin:17-jre-jammy
|
||||
|
||||
ENV FUSEKI_HOME=/fuseki
|
||||
ENV FUSEKI_BASE=/fuseki
|
||||
|
|
|
|||
|
|
@ -1,8 +1,9 @@
|
|||
# Simple Fuseki build that works on ARM64
|
||||
FROM openjdk:17-slim
|
||||
# eclipse-temurin replaces the deprecated openjdk Docker Hub images.
|
||||
FROM eclipse-temurin:17-jre-jammy
|
||||
|
||||
# Set Fuseki version and paths
|
||||
ENV FUSEKI_VERSION=4.10.0
|
||||
ENV FUSEKI_VERSION=5.6.0
|
||||
ENV FUSEKI_HOME=/fuseki
|
||||
ENV FUSEKI_BASE=/fuseki
|
||||
|
||||
|
|
|
|||
|
|
@ -3,24 +3,32 @@ services:
|
|||
fuseki:
|
||||
# Force AMD64 platform with Rosetta 2 emulation
|
||||
platform: linux/amd64
|
||||
image: stain/jena-fuseki:5.0.0
|
||||
# Build from the in-repo Dockerfile (Fuseki 5.6.0). See
|
||||
# docker-compose-fuseki.yml for the full rationale.
|
||||
build:
|
||||
context: ../rdf-store
|
||||
dockerfile: Dockerfile
|
||||
image: openmetadata-fuseki:5.6.0
|
||||
container_name: fuseki-standalone
|
||||
hostname: fuseki
|
||||
ports:
|
||||
- "3030:3030"
|
||||
environment:
|
||||
# Admin credentials
|
||||
- ADMIN_PASSWORD=admin
|
||||
# JVM memory settings
|
||||
# Local-dev default — production deployments MUST override via
|
||||
# FUSEKI_ADMIN_PASSWORD / FUSEKI_OPENMETADATA_PASSWORD env vars.
|
||||
- FUSEKI_ADMIN_PASSWORD=${FUSEKI_ADMIN_PASSWORD:-admin}
|
||||
- FUSEKI_OPENMETADATA_PASSWORD=${FUSEKI_OPENMETADATA_PASSWORD:-openmetadata-secret}
|
||||
- JVM_ARGS=-Xmx8g -Xms4g
|
||||
- FUSEKI_BASE=/fuseki
|
||||
volumes:
|
||||
# Mount directory for persistent storage
|
||||
- ${DOCKER_VOLUMES_PATH:-./docker-volumes}/fuseki:/fuseki
|
||||
# Host bind path renamed from `.../fuseki` (used by the old stain image
|
||||
# layout) to `.../fuseki-tdb2-data` so an existing host directory with
|
||||
# the previous layout isn't silently mounted at /fuseki-data — Fuseki
|
||||
# would see an empty TDB2 store and the old data would appear lost.
|
||||
# Operators upgrading can either delete the new dir to start fresh or
|
||||
# migrate old data manually.
|
||||
- ${DOCKER_VOLUMES_PATH:-./docker-volumes}/fuseki-tdb2-data:/fuseki-data
|
||||
networks:
|
||||
- fuseki-net
|
||||
# Create openmetadata dataset on startup
|
||||
entrypoint: /bin/sh -c "mkdir -p /fuseki/databases/openmetadata && exec /docker-entrypoint.sh /jena-fuseki/fuseki-server --update --loc=/fuseki/databases/openmetadata /openmetadata"
|
||||
healthcheck:
|
||||
test: ["CMD", "wget", "-q", "--spider", "http://localhost:3030/$/ping"]
|
||||
interval: 15s
|
||||
|
|
|
|||
|
|
@ -1,25 +1,29 @@
|
|||
# Standalone Apache Jena Fuseki for RDF/Knowledge Graph storage
|
||||
services:
|
||||
fuseki:
|
||||
image: stain/jena-fuseki:5.0.0
|
||||
# Build from the in-repo Dockerfile (Fuseki 5.6.0). See
|
||||
# ../development/docker-compose-fuseki.yml for the full rationale.
|
||||
build:
|
||||
context: ../rdf-store
|
||||
dockerfile: Dockerfile
|
||||
image: openmetadata-fuseki:5.6.0
|
||||
container_name: fuseki-standalone
|
||||
hostname: fuseki
|
||||
ports:
|
||||
- "3030:3030"
|
||||
environment:
|
||||
# Admin credentials
|
||||
- ADMIN_PASSWORD=admin
|
||||
# JVM memory settings - adjust based on your system
|
||||
# Local-dev default — production deployments MUST override via
|
||||
# FUSEKI_ADMIN_PASSWORD / FUSEKI_OPENMETADATA_PASSWORD env vars.
|
||||
- FUSEKI_ADMIN_PASSWORD=${FUSEKI_ADMIN_PASSWORD:-admin}
|
||||
- FUSEKI_OPENMETADATA_PASSWORD=${FUSEKI_OPENMETADATA_PASSWORD:-openmetadata-secret}
|
||||
- JVM_ARGS=-Xmx4g -Xms2g
|
||||
# Fuseki configuration
|
||||
- FUSEKI_BASE=/fuseki
|
||||
volumes:
|
||||
# Mount directory for persistent storage (configurable via .env)
|
||||
- ${DOCKER_VOLUMES_PATH:-./docker-volumes}/fuseki:/fuseki
|
||||
# See docker-compose-fuseki-rosetta.yml — host bind path renamed so
|
||||
# existing directories with the old stain layout aren't silently
|
||||
# mounted at the new /fuseki-data path.
|
||||
- ${DOCKER_VOLUMES_PATH:-./docker-volumes}/fuseki-tdb2-data:/fuseki-data
|
||||
networks:
|
||||
- fuseki-net
|
||||
# Create openmetadata dataset on startup
|
||||
entrypoint: /bin/sh -c "mkdir -p /fuseki/databases/openmetadata && exec /docker-entrypoint.sh /jena-fuseki/fuseki-server --update --loc=/fuseki/databases/openmetadata /openmetadata"
|
||||
healthcheck:
|
||||
test: ["CMD", "wget", "-q", "--spider", "http://localhost:3030/$/ping"]
|
||||
interval: 15s
|
||||
|
|
|
|||
|
|
@ -18,7 +18,10 @@ volumes:
|
|||
ingestion-volume-dags:
|
||||
ingestion-volume-tmp:
|
||||
es-data:
|
||||
fuseki-data:
|
||||
# See ../development/docker-compose-fuseki.yml — renamed from `fuseki-data`
|
||||
# because the new Dockerfile uses a different on-disk layout and reusing
|
||||
# the old volume name silently mounts stale state.
|
||||
fuseki-tdb2-data:
|
||||
|
||||
services:
|
||||
mysql:
|
||||
|
|
@ -70,20 +73,27 @@ services:
|
|||
reservations:
|
||||
memory: 2G
|
||||
|
||||
# Apache Jena Fuseki for RDF/Knowledge Graph storage
|
||||
# Apache Jena Fuseki for RDF/Knowledge Graph storage. Built from the
|
||||
# in-repo Dockerfile (Fuseki 5.6.0) — see ../development/docker-compose-fuseki.yml
|
||||
# for why we don't use the unmaintained stain/jena-fuseki image.
|
||||
fuseki:
|
||||
container_name: openmetadata_fuseki
|
||||
image: stain/jena-fuseki:4.10.0
|
||||
build:
|
||||
context: ../rdf-store
|
||||
dockerfile: Dockerfile
|
||||
image: openmetadata-fuseki:5.6.0
|
||||
restart: always
|
||||
environment:
|
||||
- ADMIN_PASSWORD=${FUSEKI_ADMIN_PASSWORD:-admin}
|
||||
- FUSEKI_DATASET_1=openmetadata
|
||||
# Local-dev defaults — production deployments MUST override via the
|
||||
# FUSEKI_ADMIN_PASSWORD / FUSEKI_OPENMETADATA_PASSWORD env vars before
|
||||
# bringing this stack up. The entrypoint envsubsts these into shiro.ini.
|
||||
- FUSEKI_ADMIN_PASSWORD=${FUSEKI_ADMIN_PASSWORD:-admin}
|
||||
- FUSEKI_OPENMETADATA_PASSWORD=${FUSEKI_OPENMETADATA_PASSWORD:-openmetadata-secret}
|
||||
- JVM_ARGS=-Xmx4g -Xms2g
|
||||
- FUSEKI_BASE=/fuseki
|
||||
ports:
|
||||
- "3030:3030"
|
||||
volumes:
|
||||
- fuseki-data:/fuseki
|
||||
- fuseki-tdb2-data:/fuseki-data
|
||||
networks:
|
||||
- app_net
|
||||
healthcheck:
|
||||
|
|
|
|||
|
|
@ -12,13 +12,16 @@ volumes:
|
|||
o: bind
|
||||
device: ./docker-volume/elasticsearch-data
|
||||
|
||||
# Increase Fuseki data volume
|
||||
fuseki-data:
|
||||
# Fuseki data volume. Renamed from `fuseki-data` (and host path changed
|
||||
# from `./docker-volume/fuseki-data` to `./docker-volume/fuseki-tdb2-data`)
|
||||
# to match docker-compose-rdf.yml — see ../development/docker-compose-fuseki.yml
|
||||
# for the migration rationale (new on-disk layout vs the old stain image).
|
||||
fuseki-tdb2-data:
|
||||
driver: local
|
||||
driver_opts:
|
||||
type: none
|
||||
o: bind
|
||||
device: ./docker-volume/fuseki-data
|
||||
device: ./docker-volume/fuseki-tdb2-data
|
||||
|
||||
services:
|
||||
# Additional Elasticsearch optimizations
|
||||
|
|
|
|||
|
|
@ -37,7 +37,8 @@ chmod -R 777 "$VOLUMES_PATH"
|
|||
|
||||
# Start Fuseki
|
||||
echo "Starting Apache Jena Fuseki..."
|
||||
# Use Rosetta 2 emulation on ARM64 for stain/jena-fuseki:5.0.0
|
||||
# Use Rosetta 2 emulation on ARM64 because the local Fuseki Dockerfile
|
||||
# is based on a Linux x86_64 image; the Rosetta variant pins platform=linux/amd64.
|
||||
if [[ $(uname -m) == "arm64" ]] || [[ $(uname -m) == "aarch64" ]]; then
|
||||
echo "Detected ARM64 architecture, using Rosetta 2 emulation..."
|
||||
docker compose -f docker-compose-fuseki-rosetta.yml up -d
|
||||
|
|
|
|||
22
docker/local-sso/keycloak-saml/README.md
Normal file
|
|
@ -0,0 +1,22 @@
|
|||
# Keycloak SAML Fixture
|
||||
|
||||
Local SAML IdP fixture for the Playwright SSO login spec.
|
||||
|
||||
```bash
|
||||
docker compose -f docker/local-sso/keycloak-saml/docker-compose.yml up -d
|
||||
```
|
||||
|
||||
It imports one realm for an OpenMetadata server running at `http://localhost:8585`:
|
||||
|
||||
- `om-azure-saml`
|
||||
- User: `azure.saml@openmetadata.local`
|
||||
- Password: `OpenMetadata@123`
|
||||
|
||||
Use the matching Playwright provider type:
|
||||
|
||||
```bash
|
||||
SSO_PROVIDER_TYPE=keycloak-azure-saml \
|
||||
SSO_USERNAME=azure.saml@openmetadata.local \
|
||||
SSO_PASSWORD=OpenMetadata@123 \
|
||||
npx playwright test playwright/e2e/Auth/SSOLogin.spec.ts --project=sso-auth --workers=1
|
||||
```
|
||||
23
docker/local-sso/keycloak-saml/docker-compose.yml
Normal file
|
|
@ -0,0 +1,23 @@
|
|||
name: openmetadata-keycloak-saml
|
||||
|
||||
services:
|
||||
keycloak:
|
||||
image: ${KEYCLOAK_IMAGE:-quay.io/keycloak/keycloak:26.3.3}
|
||||
container_name: openmetadata-keycloak-saml
|
||||
command: ['start-dev', '--import-realm']
|
||||
environment:
|
||||
KC_BOOTSTRAP_ADMIN_USERNAME: ${KEYCLOAK_BOOTSTRAP_ADMIN_USERNAME:-admin}
|
||||
KC_BOOTSTRAP_ADMIN_PASSWORD: ${KEYCLOAK_BOOTSTRAP_ADMIN_PASSWORD:-admin123}
|
||||
KC_HEALTH_ENABLED: 'true'
|
||||
KC_HTTP_ENABLED: 'true'
|
||||
KC_HTTP_PORT: '8080'
|
||||
ports:
|
||||
- '${KEYCLOAK_SAML_PORT:-8080}:8080'
|
||||
volumes:
|
||||
- ./realms:/opt/keycloak/data/import:ro
|
||||
healthcheck:
|
||||
test: ['CMD-SHELL', 'exec 3<>/dev/tcp/localhost/8080']
|
||||
interval: 10s
|
||||
timeout: 5s
|
||||
retries: 18
|
||||
start_period: 20s
|
||||
109
docker/local-sso/keycloak-saml/realms/om-azure-saml-realm.json
Normal file
|
|
@ -0,0 +1,109 @@
|
|||
{
|
||||
"realm": "om-azure-saml",
|
||||
"enabled": true,
|
||||
"displayName": "OpenMetadata Azure SAML",
|
||||
"sslRequired": "none",
|
||||
"registrationAllowed": false,
|
||||
"loginWithEmailAllowed": true,
|
||||
"duplicateEmailsAllowed": false,
|
||||
"resetPasswordAllowed": false,
|
||||
"editUsernameAllowed": false,
|
||||
"clients": [
|
||||
{
|
||||
"clientId": "http://localhost:8585/api/v1/saml/metadata",
|
||||
"name": "OpenMetadata",
|
||||
"enabled": true,
|
||||
"protocol": "saml",
|
||||
"publicClient": true,
|
||||
"frontchannelLogout": true,
|
||||
"redirectUris": ["http://localhost:8585/*"],
|
||||
"baseUrl": "http://localhost:8585",
|
||||
"adminUrl": "http://localhost:8585",
|
||||
"attributes": {
|
||||
"saml.assertion.signature": "true",
|
||||
"saml.authnstatement": "true",
|
||||
"saml.client.signature": "false",
|
||||
"saml.encrypt": "false",
|
||||
"saml.force.name.id.format": "true",
|
||||
"saml.force.post.binding": "true",
|
||||
"saml.multivalued.roles": "false",
|
||||
"saml.server.signature": "true",
|
||||
"saml.signature.algorithm": "RSA_SHA256",
|
||||
"saml_assertion_consumer_url_post": "http://localhost:8585/api/v1/saml/acs",
|
||||
"saml_force_name_id_format": "true",
|
||||
"saml_name_id_format": "email"
|
||||
},
|
||||
"protocolMappers": [
|
||||
{
|
||||
"name": "Email",
|
||||
"protocol": "saml",
|
||||
"protocolMapper": "saml-user-property-mapper",
|
||||
"consentRequired": false,
|
||||
"config": {
|
||||
"attribute.name": "email",
|
||||
"attribute.nameformat": "Basic",
|
||||
"friendly.name": "email",
|
||||
"user.attribute": "email"
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Display Name",
|
||||
"protocol": "saml",
|
||||
"protocolMapper": "saml-user-attribute-mapper",
|
||||
"consentRequired": false,
|
||||
"config": {
|
||||
"attribute.name": "http://schemas.microsoft.com/identity/claims/displayname",
|
||||
"attribute.nameformat": "Basic",
|
||||
"friendly.name": "displayname",
|
||||
"user.attribute": "displayName"
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Given Name",
|
||||
"protocol": "saml",
|
||||
"protocolMapper": "saml-user-property-mapper",
|
||||
"consentRequired": false,
|
||||
"config": {
|
||||
"attribute.name": "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/givenname",
|
||||
"attribute.nameformat": "Basic",
|
||||
"friendly.name": "givenname",
|
||||
"user.attribute": "firstName"
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "Surname",
|
||||
"protocol": "saml",
|
||||
"protocolMapper": "saml-user-property-mapper",
|
||||
"consentRequired": false,
|
||||
"config": {
|
||||
"attribute.name": "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/surname",
|
||||
"attribute.nameformat": "Basic",
|
||||
"friendly.name": "surname",
|
||||
"user.attribute": "lastName"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
],
|
||||
"users": [
|
||||
{
|
||||
"username": "azure.saml@openmetadata.local",
|
||||
"email": "azure.saml@openmetadata.local",
|
||||
"firstName": "Azure",
|
||||
"lastName": "SAML",
|
||||
"enabled": true,
|
||||
"emailVerified": true,
|
||||
"requiredActions": [],
|
||||
"attributes": {
|
||||
"displayName": ["Azure SAML User"]
|
||||
},
|
||||
"credentials": [
|
||||
{
|
||||
"type": "password",
|
||||
"value": "OpenMetadata@123",
|
||||
"temporary": false
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
|
@ -1,11 +1,27 @@
|
|||
# Apache Jena Fuseki Docker Image for OpenMetadata RDF Store
|
||||
FROM openjdk:17-jdk-slim
|
||||
# eclipse-temurin replaces the deprecated `openjdk` Docker Hub images
|
||||
# (`openjdk:17-jdk-slim` was removed from the registry — CI builds against it
|
||||
# fail with "manifest unknown"). JRE is enough since this image only runs the
|
||||
# Fuseki shell launcher; no compilation happens inside the container.
|
||||
FROM eclipse-temurin:17-jre-jammy
|
||||
|
||||
ENV FUSEKI_VERSION=4.10.0
|
||||
ENV FUSEKI_VERSION=5.6.0
|
||||
ENV FUSEKI_HOME=/fuseki
|
||||
# FUSEKI_BASE must point at the directory containing shiro.ini so Fuseki picks
|
||||
# up our auth config on boot. Without this, Fuseki falls back to its built-in
|
||||
# default base (typically the working directory) and the bundled shiro.ini is
|
||||
# never loaded — leaving the admin endpoints (incl. /$/compact and
|
||||
# /$/datasets) reachable without authentication. The Dockerfile copies
|
||||
# config.ttl + shiro.ini into /fuseki below, so we point FUSEKI_BASE there.
|
||||
ENV FUSEKI_BASE=/fuseki
|
||||
|
||||
# gettext-base provides `envsubst`, used by the entrypoint to inject
|
||||
# FUSEKI_ADMIN_PASSWORD / FUSEKI_OPENMETADATA_PASSWORD into shiro.ini at
|
||||
# container start. Without this, operators could not override the default
|
||||
# Fuseki credentials via environment variables.
|
||||
RUN apt-get update && apt-get install -y \
|
||||
wget \
|
||||
gettext-base \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# Download and install Fuseki
|
||||
|
|
@ -19,9 +35,14 @@ WORKDIR ${FUSEKI_HOME}
|
|||
# Create data directory
|
||||
RUN mkdir -p /fuseki-data
|
||||
|
||||
# Copy custom configuration
|
||||
# Custom configuration. shiro.ini ships as a TEMPLATE because Apache Shiro's
|
||||
# INI realm does not interpolate ${VAR} placeholders natively — we have to
|
||||
# render it at container start with the actual passwords. The entrypoint
|
||||
# does that via envsubst.
|
||||
COPY config.ttl /fuseki/config.ttl
|
||||
COPY shiro.ini /fuseki/shiro.ini
|
||||
COPY shiro.ini.template /fuseki/shiro.ini.template
|
||||
COPY entrypoint.sh /entrypoint.sh
|
||||
RUN chmod +x /entrypoint.sh
|
||||
|
||||
# Expose Fuseki port
|
||||
EXPOSE 3030
|
||||
|
|
@ -33,5 +54,6 @@ VOLUME ["/fuseki-data"]
|
|||
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
|
||||
CMD wget -q --spider http://localhost:3030/$/ping || exit 1
|
||||
|
||||
# Run Fuseki with OpenMetadata dataset
|
||||
# Run Fuseki via the entrypoint (which renders shiro.ini then execs fuseki-server)
|
||||
ENTRYPOINT ["/entrypoint.sh"]
|
||||
CMD ["./fuseki-server", "--loc=/fuseki-data", "--update", "/openmetadata"]
|
||||
45
docker/rdf-store/entrypoint.sh
Normal file
|
|
@ -0,0 +1,45 @@
|
|||
#!/bin/sh
|
||||
#
|
||||
# Render shiro.ini from its template, substituting FUSEKI_ADMIN_PASSWORD and
|
||||
# FUSEKI_OPENMETADATA_PASSWORD. Apache Shiro's INI realm does not interpolate
|
||||
# ${VAR} placeholders natively, so we have to expand them before Fuseki reads
|
||||
# the file — otherwise Shiro stores the literal string `${FUSEKI_...}` as the
|
||||
# password and every basic-auth attempt returns 401.
|
||||
#
|
||||
# Defaults: admin / admin and openmetadata / openmetadata-secret. Operators
|
||||
# who want different credentials set the env vars in their compose / k8s
|
||||
# deployment manifest — that override now actually takes effect.
|
||||
#
|
||||
# Operators who need to fully replace shiro.ini (different role layout,
|
||||
# custom realms, …) have two options:
|
||||
#
|
||||
# 1. Bind-mount your file onto /fuseki/shiro.ini AND set
|
||||
# FUSEKI_RENDER_SHIRO=false — the entrypoint then skips the
|
||||
# envsubst render and leaves the mounted file in place.
|
||||
#
|
||||
# 2. Bind-mount onto /fuseki/shiro.ini.template instead of /fuseki/shiro.ini
|
||||
# and the entrypoint will envsubst your template (handy if you want
|
||||
# env-driven password injection in your custom realm too).
|
||||
#
|
||||
# Defaulting FUSEKI_RENDER_SHIRO=true preserves the prior, password-injection
|
||||
# behavior for every dev/quickstart compose deployment that doesn't override
|
||||
# it.
|
||||
|
||||
set -eu
|
||||
|
||||
: "${FUSEKI_ADMIN_PASSWORD:=admin}"
|
||||
: "${FUSEKI_OPENMETADATA_PASSWORD:=openmetadata-secret}"
|
||||
: "${FUSEKI_RENDER_SHIRO:=true}"
|
||||
export FUSEKI_ADMIN_PASSWORD FUSEKI_OPENMETADATA_PASSWORD
|
||||
|
||||
if [ "$FUSEKI_RENDER_SHIRO" = "true" ] && [ -f /fuseki/shiro.ini.template ]; then
|
||||
# Restrict envsubst to the two variables we expect. Without an explicit
|
||||
# list, envsubst would interpret any `${...}` in the template — including
|
||||
# comments — which would silently blank out unrelated placeholders if
|
||||
# they were ever added.
|
||||
envsubst '${FUSEKI_ADMIN_PASSWORD} ${FUSEKI_OPENMETADATA_PASSWORD}' \
|
||||
</fuseki/shiro.ini.template \
|
||||
>/fuseki/shiro.ini
|
||||
fi
|
||||
|
||||
exec "$@"
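The entrypoint above does two things before handing off to Fuseki: it defaults unset variables with `:=`, and it substitutes only a named set of placeholders. Both can be tried outside the container. The sketch below is a stand-in, not the shipped script. It swaps `envsubst` for `sed` so it runs where gettext is not installed, `render_shiro` is a hypothetical name, and it assumes the passwords contain no `|`, `&`, or backslash characters.

```shell
#!/bin/sh
# Sketch only: the real entrypoint uses envsubst; sed is swapped in here so the
# example runs without gettext. render_shiro is a hypothetical helper name.

# Apply the same defaults the entrypoint applies when the variables are unset
# or empty.
: "${FUSEKI_ADMIN_PASSWORD:=admin}"
: "${FUSEKI_OPENMETADATA_PASSWORD:=openmetadata-secret}"

# Read a template on stdin and replace only the two known placeholders.
# Any other ${...} text passes through unchanged.
# Assumes the passwords contain no '|', '&' or '\' (sed metacharacters).
render_shiro() {
  sed -e "s|[$]{FUSEKI_ADMIN_PASSWORD}|${FUSEKI_ADMIN_PASSWORD}|g" \
      -e "s|[$]{FUSEKI_OPENMETADATA_PASSWORD}|${FUSEKI_OPENMETADATA_PASSWORD}|g"
}
```

With real `envsubst`, the same restriction comes from the quoted variable list passed as its argument, as the entrypoint does. Note that a restricted substitution still rewrites every occurrence of the named placeholders, including ones inside comments.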
|
||||
|
|
@ -1,29 +0,0 @@
|
|||
# Apache Shiro configuration for Fuseki security
|
||||
# This integrates with OpenMetadata's authentication
|
||||
|
||||
[main]
|
||||
# Allow anonymous read access, require auth for writes
|
||||
anon = org.apache.shiro.web.filter.authc.AnonymousFilter
|
||||
authcBasic = org.apache.shiro.web.filter.authc.BasicHttpAuthenticationFilter
|
||||
|
||||
# Use environment variables for credentials
|
||||
[users]
|
||||
# Default admin user - should be overridden in production
|
||||
admin = ${FUSEKI_ADMIN_PASSWORD:-admin}, admin
|
||||
|
||||
# OpenMetadata service account for updates
|
||||
openmetadata = ${FUSEKI_OPENMETADATA_PASSWORD:-openmetadata-secret}, writer
|
||||
|
||||
[roles]
|
||||
admin = *
|
||||
writer = update:*, upload:*, data:*
|
||||
|
||||
[urls]
|
||||
/$/ping = anon
|
||||
/$/stats/* = anon
|
||||
/openmetadata/sparql = anon
|
||||
/openmetadata/query = anon
|
||||
/openmetadata/update = authcBasic, roles[writer]
|
||||
/openmetadata/upload = authcBasic, roles[writer]
|
||||
/openmetadata/data = authcBasic, roles[writer]
|
||||
/** = authcBasic, roles[admin]
|
||||
62
docker/rdf-store/shiro.ini.template
Normal file
|
|
@ -0,0 +1,62 @@
|
|||
# Apache Shiro configuration for Fuseki security — TEMPLATE FILE.
|
||||
#
|
||||
# Do NOT mount or copy this file as-is into /fuseki/shiro.ini. It contains
|
||||
# `${FUSEKI_ADMIN_PASSWORD}` and `${FUSEKI_OPENMETADATA_PASSWORD}` placeholders
|
||||
# that Apache Shiro's INI realm does NOT interpolate — if Fuseki loads the
|
||||
# raw template, the literal string `${FUSEKI_ADMIN_PASSWORD}` becomes the
|
||||
# admin password, so anyone who supplies that literal string can log in as admin.
|
||||
#
|
||||
# The entrypoint.sh in this image envsubsts the file into /fuseki/shiro.ini
|
||||
# at container start. If you need a different layout, render the substituted
|
||||
# file yourself and bind-mount it onto /fuseki/shiro.ini WITH
|
||||
# FUSEKI_RENDER_SHIRO=false set on the container.
|
||||
#
|
||||
# This integrates with OpenMetadata's authentication
|
||||
|
||||
[main]
|
||||
# Allow anonymous read access, require auth for writes
|
||||
anon = org.apache.shiro.web.filter.authc.AnonymousFilter
|
||||
authcBasic = org.apache.shiro.web.filter.authc.BasicHttpAuthenticationFilter
|
||||
|
||||
# Fuseki 5.x uses Shiro without a session manager configured by default; if
|
||||
# the INI doesn't override the SubjectDAO it ends up trying to use the
|
||||
# default session storage and throws `IllegalStateException: No SessionManager`
|
||||
# on the first authenticated request. Disable session storage so every
|
||||
# request re-authenticates via Basic auth (stateless, REST-style).
|
||||
sessionStorageEvaluator = org.apache.shiro.web.mgt.DefaultWebSessionStorageEvaluator
|
||||
sessionStorageEvaluator.sessionStorageEnabled = false
|
||||
securityManager.subjectDAO.sessionStorageEvaluator = $sessionStorageEvaluator
|
||||
|
||||
# Credentials.
|
||||
#
|
||||
# This file is a TEMPLATE. The entrypoint envsubsts the FUSEKI_ADMIN_PASSWORD
|
||||
# and FUSEKI_OPENMETADATA_PASSWORD variables into the [users] section below
|
||||
# at container start. Defaults applied in the entrypoint:
|
||||
# admin user password = admin
|
||||
# openmetadata user password = openmetadata-secret
|
||||
# (Variable names intentionally NOT written with `$` here so the rendered
|
||||
# file in /fuseki/shiro.ini does not leak the substituted password back
|
||||
# into this comment block.)
|
||||
#
|
||||
# The admin user carries BOTH the `admin` (server-management) and `writer`
|
||||
# (data-mutation) roles so a single credential covers /$/datasets, /$/compact,
|
||||
# /openmetadata/data and /openmetadata/update — Shiro evaluates roles[] as
|
||||
# membership, not permission, so admin's `*` permission alone does NOT satisfy
|
||||
# roles[writer]; explicit membership in both is required.
|
||||
[users]
|
||||
admin = ${FUSEKI_ADMIN_PASSWORD}, admin, writer
|
||||
openmetadata = ${FUSEKI_OPENMETADATA_PASSWORD}, writer
|
||||
|
||||
[roles]
|
||||
admin = *
|
||||
writer = update:*, upload:*, data:*
|
||||
|
||||
[urls]
|
||||
/$/ping = anon
|
||||
/$/stats/* = anon
|
||||
/openmetadata/sparql = anon
|
||||
/openmetadata/query = anon
|
||||
/openmetadata/update = authcBasic, roles[writer]
|
||||
/openmetadata/upload = authcBasic, roles[writer]
|
||||
/openmetadata/data = authcBasic, roles[writer]
|
||||
/** = authcBasic, roles[admin]
|
||||
1390
docs/auto-classification/add-support-for-another-entity.md
Normal file
File diff suppressed because it is too large
259
docs/streamable-logs.md
Normal file
|
|
@ -0,0 +1,259 @@
|
|||
# Streamable Ingestion Logs
|
||||
|
||||
This document describes the end-to-end design of OpenMetadata's streamable ingestion-pipeline log system: how logs flow from a running connector to durable S3 storage, how the UI reads them while a run is in progress, and how the system handles long idle gaps, restarts, and abandoned runs.
|
||||
|
||||
## Overview
|
||||
|
||||
Ingestion pipelines (metadata, profiler, lineage, usage, dbt, etc.) emit logs as they run. Operators need to:
|
||||
|
||||
- Watch logs **live** while a pipeline is running, including for long-running connectors that can take hours.
|
||||
- Read logs **after the run ends**, with a single canonical artifact per run.
|
||||
- Recover gracefully from server restarts, network blips, and connector idle gaps.
|
||||
|
||||
OpenMetadata addresses this with a server-side log storage abstraction backed by S3 (or any S3-compatible store like MinIO). The connector pushes log batches over HTTP; the server persists them and serves both live and post-run reads.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
┌──────────────────────┐
│  Python ingestion    │   POST /logs/{fqn}/{runId}        (append)
│  connector           │   POST /logs/{fqn}/{runId}/close  (finalize)
│  (logs_mixin.py)     │
└──────────┬───────────┘
           │ HTTP
           ▼
┌──────────────────────┐
│  OpenMetadata server │
│  IngestionPipeline   │
│  Resource            │
└──────────┬───────────┘
           │ LogStorageInterface
           ▼
┌──────────────────────┐         ┌──────────────────────┐
│  S3LogStorage        │────────▶│  S3 / MinIO bucket   │
│  (streaming, in-mem  │         │    partial.txt       │
│   buffers, sweeper)  │         │    logs.txt          │
└──────────┬───────────┘         └──────────────────────┘
           │ SSE / GET (paginated / download)
           ▼
┌──────────────────────┐
│  OpenMetadata UI     │
│ (live tail + history)│
└──────────────────────┘
```
|
||||
|
||||
The `LogStorageInterface` abstraction supports multiple backends:
|
||||
|
||||
| Backend | Purpose |
|---------|---------|
| `S3LogStorage` | Production: stores logs durably in S3 / MinIO. The focus of this document. |
| `DefaultLogStorage` | Backward-compat: delegates to the pipeline service client (Airflow / Argo). No first-class storage. |
|
||||
|
||||
This document covers the `S3LogStorage` implementation.
|
||||
|
||||
## Storage Layout
|
||||
|
||||
Each pipeline run is identified by a `(fqn, runId)` tuple. On S3 the layout is:
|
||||
|
||||
```
|
||||
{bucket}/{prefix}/ # prefix defaults to "pipeline-logs"
|
||||
{sanitizedFQN}/{runId}/
|
||||
partial.txt # readable view during the run
|
||||
logs.txt # final artifact, materialized at /close
|
||||
.active/{sanitizedFQN}/{runId}/{serverId} # heartbeat marker
|
||||
```
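
As a rough illustration of the layout above, the two per-run object keys could be derived as follows. This is a minimal sketch, not the server's code: `sanitize_fqn` and `run_keys` are hypothetical names, and the sanitization rule (replacing every character outside `[A-Za-z0-9._-]` with `_`) is an assumption, since the document does not say how `{sanitizedFQN}` is produced.

```python
import re

DEFAULT_PREFIX = "pipeline-logs"  # default prefix stated in the layout above


def sanitize_fqn(fqn: str) -> str:
    """Hypothetical sanitizer: the real rule is not specified in this document."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", fqn)


def run_keys(fqn: str, run_id: str, prefix: str = DEFAULT_PREFIX) -> dict:
    """Return the object keys for the in-progress and finalized log of one run."""
    base = f"{prefix}/{sanitize_fqn(fqn)}/{run_id}"
    return {
        "partial": f"{base}/partial.txt",  # readable view during the run
        "final": f"{base}/logs.txt",       # materialized at /close
    }
```

Both keys share one base path, which is what lets finalization be a plain server-side copy from one key to the other.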
|
||||
|
||||
**`partial.txt`** is the durable, readable view of an in-progress run. It is updated periodically as the connector appends batches. It carries durable offset state in S3 user-defined metadata:
|
||||
|
||||
| Metadata key | Purpose |
|--------------|---------|
| `x-amz-meta-last-flushed-line` | Logical line counter at the moment of this PUT. Drives retry idempotency and post-restart recovery. |
| `x-amz-meta-total-bytes` | Cross-check on body size; helps detect drift. |
| `x-amz-meta-writer-epoch` | Bumped each time a fresh OM-server instance picks up the stream after a restart. |
| `x-amz-meta-writer-version` | Identifies the writer code version. Useful during migration windows. |
|
||||
|
||||
**`logs.txt`** is the canonical post-run artifact. It is created **only** at `/close` (or by the abandoned-run sweeper), as a server-side S3 copy of the final `partial.txt`. Content matches `partial.txt` exactly at the moment of close.
|
||||
|
||||
**`.active/...`** markers are dropped as a side effect of `appendLogs`. They have no functional role in correctness; they are operational hints for diagnostics ("which OM-server instance most recently saw this run").
|
||||
|
||||
A bucket lifecycle policy ensures cleanup:
|
||||
- `expirationDays` (default 30) on the `pipeline-logs/` prefix expires all logs after the retention window.
|
||||
|
||||
## Run Lifecycle
|
||||
|
||||
### 1. Connector emits a batch
|
||||
|
||||
The Python ingestion runner buffers log lines and POSTs batches to the server:
|
||||
|
||||
```
POST /api/v1/services/ingestionPipelines/logs/{fqn}/{runId}
Content-Type: application/json

"<raw log content>"     OR

{
  "logs": "<base64-gzipped log content>",
  "connectorId": "...",
  "compressed": true
}
```
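
The compressed body shape above can be produced and decoded with the standard library alone. The sketch below assumes that "base64-gzipped" means gzip first, then base64, and that an object with `compressed` unset carries plain text in `logs`; both function names are hypothetical and do not come from `logs_mixin.py` or the server.

```python
import base64
import gzip
import json


def build_compressed_payload(log_text: str, connector_id: str) -> str:
    """Build the JSON object form of the request body (gzip, then base64)."""
    compressed = gzip.compress(log_text.encode("utf-8"))
    return json.dumps(
        {
            "logs": base64.b64encode(compressed).decode("ascii"),
            "connectorId": connector_id,
            "compressed": True,
        }
    )


def decode_payload(body: str) -> str:
    """Decode either body form back to raw log text."""
    payload = json.loads(body)
    if isinstance(payload, str):  # raw form: a bare JSON string
        return payload
    if payload.get("compressed"):
        return gzip.decompress(base64.b64decode(payload["logs"])).decode("utf-8")
    return payload["logs"]  # assumption: uncompressed objects carry plain text
```

A round trip through both functions returns the original text, which is a quick way to check a client against this contract.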
|
||||
|
||||
`IngestionPipelineResource.writePipelineLogs` decodes the body and calls `repository.appendLogs(fqn, runId, content)`, which delegates to `S3LogStorage.appendLogs`.
|
||||
|
||||
### 2. Server-side append
|
||||
|
||||
`S3LogStorage.appendLogs` does five things, all in memory, all under a per-stream `ReentrantLock`:
|
||||
|
||||
1. **Increments `totalLinesAppended`**, the monotonic logical line counter that anchors retry idempotency.
2. **Appends to `SimpleLogBuffer`** (in-memory ring, capacity 1000 lines). This is the source for the SSE/WebSocket live-tail UI experience. It is bounded; oldest lines evict on overflow. It is **not** load-bearing for durability.
3. **Appends to `pendingFlush`** (in-memory queue, no fixed cap, byte-tracked). This is the durable-pending-write queue and survives until the next successful PUT.
4. **Notifies SSE listeners**, fanning out the new lines to any open live-tail HTTP connections.
5. **Schedules an early flush** if `pendingFlush` exceeds `earlyFlushWatermarkBytes` (default 5 MB). This protects against memory bloat under bursty writes.
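
The five steps above can be sketched in a few lines. This is an illustrative model only, assuming a bounded `deque` stands in for `SimpleLogBuffer` and a plain list for `pendingFlush`; the class and attribute names are hypothetical and the real implementation is Java.

```python
import threading
from collections import deque


class StreamState:
    """Hypothetical per-stream state mirroring the five append steps."""

    def __init__(self, ring_capacity=1000, watermark_bytes=5 * 1024 * 1024):
        self.lock = threading.RLock()             # per-stream reentrant lock
        self.total_lines_appended = 0             # step 1 counter
        self.recent = deque(maxlen=ring_capacity)  # step 2: bounded live-tail ring
        self.pending_flush = []                   # step 3: unbounded durable queue
        self.pending_bytes = 0
        self.listeners = []                       # step 4: live-tail callbacks
        self.watermark_bytes = watermark_bytes

    def append_logs(self, content: str) -> bool:
        """Append a batch; return True when an early flush should be scheduled."""
        lines = content.splitlines()
        with self.lock:
            self.total_lines_appended += len(lines)
            self.recent.extend(lines)  # oldest lines evict on overflow
            self.pending_flush.extend(lines)
            self.pending_bytes += sum(len(line.encode("utf-8")) + 1 for line in lines)
            for listener in self.listeners:
                listener(lines)
            return self.pending_bytes > self.watermark_bytes  # step 5
```

The ring and the pending queue receive the same lines, but only the ring evicts, which is why live tail can lose old lines while durability cannot.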
|
||||
|
||||
A single-threaded `cleanupExecutor` schedules the periodic flush, the abandoned-run sweeper, and metrics updates.
|
||||
|
||||
### 3. Periodic flush to `partial.txt`
|
||||
|
||||
Every `partialFlushIntervalMinutes` (default 2) and on demand from the early-flush watermark, `writePartialLogsForStream` runs under the per-stream lock:
|
||||
|
||||
1. Snapshot `pendingFlush` and clear it.
2. If empty, no-op (idle streams cost nothing).
3. `GetObject partial.txt` → reads `Content-Length` and metadata from the response headers. On 404, treat as empty.
4. Build new metadata (`last-flushed-line`, `total-bytes`, `writer-epoch`, `writer-version`).
5. **If existing body < 5 MB**: read the body, build merged body = existing + `\n`-joined snapshot, `PutObject` atomically.
6. **If existing body ≥ 5 MB**: abort the body stream and concatenate server-side via Multipart Upload: `CreateMultipartUpload`, `UploadPartCopy` (existing body as part 1), `UploadPart` (new content as part 2, the last part has no 5 MB minimum), `CompleteMultipartUpload`. The merged body never enters JVM heap and is not re-uploaded.
7. On failure, abort any in-flight multipart upload, re-merge the snapshot to the head of `pendingFlush`, and try again next tick. No data loss.
|
||||
|
||||
Because `pendingFlush` is unbounded by the `SimpleLogBuffer` cap, no line is ever evicted before being flushed.
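
The snapshot-and-restore behaviour of the flush can be illustrated with an in-memory stand-in for S3. The sketch below is a simplification under stated assumptions: `FlakyStore` and `flush` are hypothetical names, only the small-body `PutObject` path is modelled, every line is written with a trailing newline, and the caller is assumed to hold the per-stream lock.

```python
class FlakyStore:
    """In-memory stand-in for the partial.txt object, used only in this sketch."""

    def __init__(self):
        self.body = ""
        self.fail_next = False

    def put(self, body: str) -> None:
        if self.fail_next:
            self.fail_next = False
            raise IOError("simulated PUT failure")
        self.body = body


def flush(pending: list, store: FlakyStore) -> bool:
    """Flush pending lines; on failure put them back at the head of the queue."""
    snapshot = list(pending)
    pending.clear()
    if not snapshot:
        return True  # idle stream: nothing to do
    merged = store.body + "".join(line + "\n" for line in snapshot)
    try:
        store.put(merged)
        return True
    except IOError:
        pending[:0] = snapshot  # restore ahead of any newer lines
        return False
```

Because the snapshot goes back to the head of the queue, a retry on the next tick writes lines in their original order.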
|
||||
|
||||
### 4. Live read while running
|
||||
|
||||
The UI's "live logs" view does two things in parallel:
|
||||
|
||||
- **HTTP GET** `/logs/{fqn}/{runId}?after={cursor}` for paginated history. The server reads `partial.txt` from S3 and concatenates the in-memory `pendingFlush` snapshot for the most-recent-tail bytes that haven't yet been flushed. The cursor is a line offset.
|
||||
- **Server-Sent Events (SSE)** for live tail. The endpoint registers a `LogStreamListener` against the stream key and pushes new lines as `notifyListeners` fires from each `appendLogs`.
|
||||
|
||||
This gives the user "everything written so far" via GET and "everything written in real time from now on" via SSE.
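
A line-offset cursor over "flushed content plus unflushed tail" could look like the sketch below. It is only an illustration: `read_page` and its `limit` parameter are hypothetical, and here `total` is taken to be a line count.

```python
def read_page(flushed_lines, pending_lines, after=0, limit=100):
    """Return (lines, next_cursor, total) for a line-offset cursor."""
    # The flushed body comes first, followed by lines not yet written to S3.
    combined = list(flushed_lines) + list(pending_lines)
    page = combined[after:after + limit]
    next_cursor = after + len(page)
    return page, next_cursor, len(combined)
```

A client keeps calling with the returned cursor until `next_cursor` equals `total`, then relies on SSE for anything newer.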
|
||||
|
||||
### 5. `/close` finalization
|
||||
|
||||
When the connector terminates (success, graceful failure, or graceful abort), it calls:
|
||||
|
||||
```
|
||||
POST /api/v1/services/ingestionPipelines/logs/{fqn}/{runId}/close
|
||||
```
|
||||
|
||||
`S3LogStorage.closeStream` runs under the per-stream lock:
|
||||
|
||||
1. **Final flush**: drain remaining `pendingFlush` to `partial.txt` (same path as the periodic flush).
2. **Server-side copy** `partial.txt` → `logs.txt`. Bytes do not transit through OM. Cheap and constant-time regardless of log size.
3. **Delete `partial.txt`**.
4. **Best-effort delete** the `.active/{fqn}/{runId}/{serverId}` marker.
5. Drop in-memory state for the stream (`activeStreams`, `pendingFlush`, `totalLinesAppended`, `recentLogsCache`, the per-stream lock).
|
||||
|
||||
`/close` is idempotent. A second call finds no `partial.txt` and no in-memory state; it is a graceful no-op. A `/close` that arrives after the abandoned-run sweeper already finalized the stream behaves the same way.
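
The idempotency argument can be checked with a small model in which a dictionary plays the role of the bucket. This is a hedged sketch: `close_stream` is a hypothetical function, the `.active` marker is left out, and the copy is modelled as a plain assignment rather than a server-side S3 copy.

```python
def close_stream(objects: dict, streams: dict, base: str) -> bool:
    """Finalize one run; return True if this call did the work, False on a no-op."""
    partial_key = f"{base}/partial.txt"
    final_key = f"{base}/logs.txt"
    pending = streams.pop(base, [])  # drop in-memory state for the stream
    if pending:  # final flush of whatever is still queued
        tail = "".join(line + "\n" for line in pending)
        objects[partial_key] = objects.get(partial_key, "") + tail
    if partial_key not in objects:
        return False  # already finalized, or nothing was ever written
    objects[final_key] = objects[partial_key]  # copy partial.txt to logs.txt
    del objects[partial_key]
    return True
```

A second call finds neither in-memory state nor `partial.txt` and returns without touching `logs.txt`.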
|
||||
|
||||
### 6. Post-`/close` reads
|
||||
|
||||
Once `/close` completes, `logs.txt` is the canonical artifact. `getLogs(fqn, runId)` reads it directly. Pagination is by line offset; the response includes `after` (next cursor) and `total` (total bytes / lines).
|
||||
|
||||
There is also a download endpoint that streams the full file (or composes from segments / partial in legacy fallbacks).
|
||||
|
||||
## Read Paths
|
||||
|
||||
| Endpoint | Pre-`/close` | Post-`/close` |
|----------|-------------|---------------|
| `GET /logs/{fqn}/{runId}` | Reads `partial.txt` + appends `pendingFlush` snapshot. Apply cursor pagination. | Reads `logs.txt`. |
| `GET /logs/{fqn}/{runId}/download` | Streams `partial.txt`. | Streams `logs.txt`. |
| `GET /logs/{fqn}/stream/{runId}` (SSE) | Registers a listener; replays last 100 buffered lines, then live-streams new lines. | (Not used post-close; the run is over.) |
|
||||
|
||||
Legacy `partial.txt` files written by older code (without S3 metadata) read normally; the new flush logic treats them as "no prior offset" and merges any new content correctly.
|
||||
|
||||
## Abandoned-Run Recovery
|
||||
|
||||
Connectors can die without calling `/close` — process killed, OOM, network partition, infrastructure failure. To bound resource use and still produce a final `logs.txt`, a sweeper runs periodically:
|
||||
|
||||
- **Schedule**: every `cleanupIntervalMinutes` (default 60).
|
||||
- **Threshold**: `streamTimeoutMinutes` since last `appendLogs` (default 1440 = 24h).
|
||||
|
||||
For each expired stream, the sweeper does the same finalization steps as `/close` (final flush, copy to `logs.txt`, delete `partial.txt`, drop in-memory state). The end result is identical: an abandoned run produces a finalized `logs.txt` artifact that the UI can read, just delayed.
|
||||
|
||||
The 24h default is intentionally lenient: typical idle gaps in slow connectors (waiting on source queries, batch boundaries, queues) are minutes-to-hours, not days. Operators can tune the threshold downward in deployments where memory pressure from many parallel runs requires more aggressive reclamation.
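
The sweeper's selection step amounts to comparing each stream's last-append time with an idle cutoff. The sketch below assumes timestamps in epoch seconds and a strict "older than" comparison; `find_abandoned` is a hypothetical name and the boundary behaviour of the real sweeper is not stated in this document.

```python
def find_abandoned(last_append_at: dict, now: float, stream_timeout_minutes: int = 1440) -> list:
    """Return stream keys whose last append is older than the idle threshold."""
    cutoff = now - stream_timeout_minutes * 60
    return sorted(key for key, ts in last_append_at.items() if ts < cutoff)
```

Each returned key would then go through the same finalization steps as `/close`.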
|
||||
|
||||
## Failure Modes & Recovery
|
||||
|
||||
| Failure | Recovery |
|---------|----------|
| S3 PUT fails during periodic flush | The `pendingFlush` snapshot is restored under the lock and the next tick retries. No data loss. |
| OM-server restart mid-run | All in-memory state is lost. `partial.txt` on S3 retains all previously flushed content. The next `appendLogs` re-creates state; the first flush after restart reads `partial.txt` (with metadata) and resumes from `last-flushed-line`. Worst-case loss: lines that were in `pendingFlush` at restart time, at most one `partialFlushIntervalMinutes` window of logs. |
| Connector dies without `/close` | The abandoned-run sweeper finalizes the run after `streamTimeoutMinutes` of inactivity. `logs.txt` is materialized from the most recent `partial.txt`. |
| `/close` retries after partial success | All steps are idempotent. A second call finds no `partial.txt` and no in-memory state; no-op. |
| Concurrent `appendLogs` and cleanup | The per-stream lock serializes them. Cleanup finds the stream "fresh" again and skips it next tick. |
| Bucket lifecycle expires `partial.txt` mid-run | Should not happen at default `expirationDays = 30`. If misconfigured (very low retention), the next flush would treat it as a fresh `partial.txt` and start over. Recommended floor: 7 days. |
|
||||
|
||||
## Configuration
|
||||
|
||||
All settings live under `LogStorageConfiguration` in `openmetadata.yaml`:
|
||||
|
||||
| Field | Default | Description |
|-------|---------|-------------|
| `bucketName` | (required) | S3 bucket for log storage. |
| `prefix` | `pipeline-logs` | Key prefix within the bucket. |
| `enableServerSideEncryption` | `true` | Apply SSE on every PUT. |
| `sseAlgorithm` | `AES_256` | Or `AWS_KMS` (requires `kmsKeyId`). |
| `storageClass` | `STANDARD_IA` | S3 storage class for log objects. |
| `expirationDays` | 30 | Bucket lifecycle: expire all logs after this many days. |
| `streamTimeoutMinutes` | 1440 | Idle threshold (in minutes) before the abandoned-run sweeper finalizes a stream. |
| `cleanupIntervalMinutes` | 60 | How often the sweeper wakes up to check for abandoned streams. |
| `partialFlushIntervalMinutes` | 2 | Periodic `pendingFlush` → `partial.txt` cadence. |
| `earlyFlushWatermarkBytes` | 5242880 (5 MB) | Triggers an out-of-band flush when `pendingFlush` exceeds this size. |
| `pendingFlushAlertAfterFailures` | 10 | Emit an alerting metric after this many consecutive failed flushes for a stream. |
| `maxConcurrentStreams` | 100 | Bound on in-flight pipeline runs per OM-server instance. |
| `awsConfig.*` | - | AWS credentials / region / endpoint (also supports IAM role + custom endpoints for MinIO). |
|
||||
|
||||
## Concurrency Model
|
||||
|
||||
Coordination is a per-stream lock keyed by `streamKey = fqn + "/" + runId`. The lock is held for the duration of `appendLogs`, periodic flush, abandoned-run cleanup, and `/close`. Locks are backed by a Guava `Striped<Lock>` with a fixed stripe count, so memory does not grow as completed runs accumulate; the same key always maps to the same lock instance, which eliminates the acquire-vs-remove race that a per-key map would have. Two different streams can hash to the same stripe and briefly contend, but this stays rare as long as `maxConcurrentStreams` is much smaller than the stripe count.
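
The striping idea can be shown with a fixed array of locks indexed by a hash of the stream key. This is an analogy rather than the Guava implementation: `StripedLocks` is a hypothetical class, the stripe count of 256 is an arbitrary illustration, and CRC32 is used here only because it is a stable hash available in the standard library.

```python
import threading
import zlib


class StripedLocks:
    """Fixed pool of reentrant locks; a key always maps to the same lock."""

    def __init__(self, stripes: int = 256):
        self._locks = [threading.RLock() for _ in range(stripes)]

    def get(self, fqn: str, run_id: str):
        stream_key = f"{fqn}/{run_id}"
        index = zlib.crc32(stream_key.encode("utf-8")) % len(self._locks)
        return self._locks[index]
```

Since no lock is ever created or removed after construction, there is no window in which one thread acquires a lock that another thread is about to discard.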
|
||||
|
||||
A single-threaded `ScheduledExecutorService` (`cleanupExecutor`) drives:
|
||||
- Periodic flushes (`writePartialLogs`)
|
||||
- Abandoned-run sweeper (`cleanupAbandonedStreams`)
|
||||
- Metrics updates (`updateStreamMetrics`)
|
||||
- One-shot early flushes scheduled by the watermark trigger
|
||||
|
||||
Under sustained burst load, scheduled tasks queue on this single thread. This is intentional: it bounds resource use and avoids unbounded thread creation under spikes. If a deployment regularly sees queue backlog, the watermark or flush interval can be tuned.
|
||||
|
||||
## Observability
|
||||
|
||||
Key metrics exposed by `StreamableLogsMetrics`:
|
||||
|
||||
- `om_streamable_logs_log_shipment_*` — distribution of append latencies.
|
||||
- `om_streamable_logs_logs_sent` / `logs_failed` — counter of successful and failed appends.
|
||||
- `om_streamable_logs_batch_size` — distribution of lines per batch.
|
||||
- `om_streamable_logs_s3_*` — distribution of S3 read/write latencies and counters of S3 errors.
|
||||
- `om_streamable_logs_pending_part_uploads` — gauge for monitoring queue backlog (legacy, will be retired with multipart removal).
|
||||
- `om_streamable_logs_multipart_uploads` — gauge for active multipart uploads (legacy, will be retired).
|
||||
- `om_streamable_logs_pending_flush_bytes` — gauge for in-memory `pendingFlush` size per stream (new).
|
||||
- `om_streamable_logs_consecutive_flush_failures` — gauge per stream (new).
|
||||
|
||||
Recommended alerts:
|
||||
- `pending_flush_bytes` > 50 MB sustained → memory pressure or persistent S3 failures.
|
||||
- `consecutive_flush_failures` ≥ 10 → S3 connectivity or auth issue.
|
||||
- `s3_errors` rate > 1/min → S3 health degradation.
|
||||
|
||||
## Multi-Server Topology
|
||||
|
||||
The design assumes single-writer-per-run: an ALB / load balancer enforces sticky sessions for `(fqn, runId)` via the `PIPELINE_SESSION` cookie set on the first `appendLogs` response. All subsequent requests for the same run land on the same OM-server instance for the lifetime of the run.
|
||||
|
||||
If stickiness is broken (cookie stripped by a proxy, multi-cluster routing without coordination), two OM-server instances could write to the same `partial.txt` and clobber each other. This is **out of scope** for the current design. A future iteration could move offset state to the database for cross-server coordination.
|
||||
|
||||
## References
|
||||
|
||||
- Source files:
|
||||
- `openmetadata-service/src/main/java/org/openmetadata/service/logstorage/S3LogStorage.java`
|
||||
- `openmetadata-service/src/main/java/org/openmetadata/service/logstorage/LogStorageFactory.java`
|
||||
- `openmetadata-spec/src/main/java/org/openmetadata/service/logstorage/LogStorageInterface.java`
|
||||
- `openmetadata-service/src/main/java/org/openmetadata/service/resources/services/ingestionpipelines/IngestionPipelineResource.java`
|
||||
- `ingestion/src/metadata/utils/streamable_logger.py`
|
||||
- `ingestion/src/metadata/ingestion/ometa/mixins/logs_mixin.py`
|
||||
- Related PRs: #23590, #24198, #24287, #24410
|
||||
152934
ingestion/.basedpyright/baseline.json
Normal file
File diff suppressed because it is too large
|
|
@ -1,6 +1,6 @@
|
|||
FROM mysql:8.3 AS mysql
|
||||
|
||||
FROM apache/airflow:3.1.7-python3.10
|
||||
FROM apache/airflow:3.2.1-python3.10
|
||||
USER root
|
||||
RUN curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg \
|
||||
&& echo "deb [arch=amd64,arm64,armhf signed-by=/usr/share/keyrings/microsoft-prod.gpg] https://packages.microsoft.com/debian/12/prod bookworm main" > /etc/apt/sources.list.d/mssql-release.list
|
||||
|
|
@ -43,6 +43,9 @@ RUN apt-get -qq update \
|
|||
wget --no-install-recommends \
|
||||
# Accept MSSQL ODBC License
|
||||
&& ACCEPT_EULA=Y apt-get install -y msodbcsql18 \
|
||||
&& apt-get -qq purge -y \
|
||||
'imagemagick*' 'libmagick*' 'graphicsmagick*' \
|
||||
&& apt-get -qq autoremove -y --purge \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
COPY --from=mysql /usr/bin/mysqldump /usr/bin/mysqldump
|
||||
|
||||
|
|
@ -58,11 +61,16 @@ RUN if [ $(uname -m) = "arm64" ] | [ $(uname -m) = "aarch64" ]; \
|
|||
ENV LD_LIBRARY_PATH=/instantclient
|
||||
|
||||
# Install DB2 iAccess Driver
|
||||
RUN if [ $(uname -m) = "x86_64" ]; \
|
||||
then \
|
||||
curl https://public.dhe.ibm.com/software/ibmi/products/odbc/debs/dists/1.1.0/ibmi-acs-1.1.0.list | tee /etc/apt/sources.list.d/ibmi-acs-1.1.0.list \
|
||||
&& apt update \
|
||||
&& apt install ibm-iaccess; \
|
||||
# Mirrored on cdn.getcollate.io to decouple builds from IBM's CDN availability.
|
||||
# Use dpkg --force-depends because the .deb declares old Debian package names
|
||||
# (libodbc1, odbcinst1debian2) that don't exist in Debian 12; the actual
|
||||
# libraries (unixodbc, odbcinst) are installed earlier. SHA256 pinned to v29.
|
||||
RUN if [ $(uname -m) = "x86_64" ]; then \
|
||||
wget -q https://cdn.getcollate.io/deps/ingestion/ibm/ibm-iaccess-1.1.0.29-1.0.amd64.deb -O /tmp/ibm-iaccess.deb \
|
||||
&& echo "e60e968d2cee96b2851964456f5b31ab990b1aa47d8f2399607809f7d4514f58 /tmp/ibm-iaccess.deb" | sha256sum -c - \
|
||||
&& dpkg -i --force-depends /tmp/ibm-iaccess.deb \
|
||||
&& apt-get install -f -y --no-install-recommends \
|
||||
&& rm -f /tmp/ibm-iaccess.deb; \
|
||||
fi
|
||||
|
||||
# Required for Starting Ingestion Container in Docker Compose
|
||||
|
|
@ -86,7 +94,7 @@ ARG RI_VERSION="1.12.0.0.dev0"
|
|||
RUN pip install --upgrade pip "setuptools<81"
|
||||
# Pre-install cx-Oracle without build isolation to use the pinned setuptools
|
||||
RUN pip install --no-build-isolation "cx_Oracle>=8.3.0,<9"
|
||||
RUN pip install "openmetadata-managed-apis~=${RI_VERSION}" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.1.7/constraints-3.10.txt"
|
||||
RUN pip install "openmetadata-managed-apis~=${RI_VERSION}" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.2.1/constraints-3.10.txt"
|
||||
RUN pip install "openmetadata-ingestion[${INGESTION_DEPENDENCY}]~=${RI_VERSION}"
|
||||
|
||||
# Temporary workaround for https://github.com/open-metadata/OpenMetadata/issues/9593
|
||||
|
|
@ -94,6 +102,12 @@ RUN [ $(uname -m) = "x86_64" ] \
|
|||
&& pip install "openmetadata-ingestion[db2]~=${RI_VERSION}" \
|
||||
|| echo "DB2 not supported on ARM architectures."
|
||||
|
||||
# Ship py-spy so a hung worker can be sampled in place
|
||||
# (`py-spy dump --pid <pid>`) without first installing anything in the pod.
|
||||
# Container-only — kept out of setup.py to avoid forcing a native binary on
|
||||
# dev laptops / CI / non-container installs.
|
||||
RUN pip install "py-spy>=0.3.14"
|
||||
|
||||
# bump python-daemon for https://github.com/apache/airflow/pull/29916
|
||||
RUN pip install "python-daemon>=3.0.0"
|
||||
# remove all airflow providers except for docker, cncf kubernetes, and standard (required in Airflow 3.x)
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
FROM mysql:8.3 AS mysql
|
||||
|
||||
FROM apache/airflow:3.1.7-python3.10
|
||||
FROM apache/airflow:3.2.1-python3.10
|
||||
USER root
|
||||
RUN curl -fsSL https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor -o /usr/share/keyrings/microsoft-prod.gpg \
|
||||
&& echo "deb [arch=amd64,arm64,armhf signed-by=/usr/share/keyrings/microsoft-prod.gpg] https://packages.microsoft.com/debian/12/prod bookworm main" > /etc/apt/sources.list.d/mssql-release.list
|
||||
|
|
@ -43,6 +43,9 @@ RUN dpkg --configure -a \
|
|||
wget --no-install-recommends \
|
||||
# Accept MSSQL ODBC License
|
||||
&& ACCEPT_EULA=Y apt-get -qq install -y msodbcsql18 \
|
||||
&& apt-get -qq purge -y \
|
||||
'imagemagick*' 'libmagick*' 'graphicsmagick*' \
|
||||
&& apt-get -qq autoremove -y --purge \
|
||||
&& rm -rf /var/lib/apt/lists/*
|
||||
COPY --from=mysql /usr/bin/mysqldump /usr/bin/mysqldump
|
||||
|
||||
|
|
@ -58,11 +61,16 @@ RUN if [ $(uname -m) = "arm64" ] | [ $(uname -m) = "aarch64" ]; \
|
|||
ENV LD_LIBRARY_PATH=/instantclient
|
||||
|
||||
# Install DB2 iAccess Driver
|
||||
RUN if [ $(uname -m) = "x86_64" ]; \
|
||||
then \
|
||||
curl https://public.dhe.ibm.com/software/ibmi/products/odbc/debs/dists/1.1.0/ibmi-acs-1.1.0.list | tee /etc/apt/sources.list.d/ibmi-acs-1.1.0.list \
|
||||
&& apt update \
|
||||
&& apt install ibm-iaccess; \
|
||||
# Mirrored on cdn.getcollate.io to decouple builds from IBM's CDN availability.
|
||||
# Use dpkg --force-depends because the .deb declares old Debian package names
|
||||
# (libodbc1, odbcinst1debian2) that don't exist in Debian 12; the actual
|
||||
# libraries (unixodbc, odbcinst) are installed earlier. SHA256 pinned to v29.
|
||||
RUN if [ $(uname -m) = "x86_64" ]; then \
|
||||
wget -q https://cdn.getcollate.io/deps/ingestion/ibm/ibm-iaccess-1.1.0.29-1.0.amd64.deb -O /tmp/ibm-iaccess.deb \
|
||||
&& echo "e60e968d2cee96b2851964456f5b31ab990b1aa47d8f2399607809f7d4514f58 /tmp/ibm-iaccess.deb" | sha256sum -c - \
|
||||
&& dpkg -i --force-depends /tmp/ibm-iaccess.deb \
|
||||
&& apt-get install -f -y --no-install-recommends \
|
||||
&& rm -f /tmp/ibm-iaccess.deb; \
|
||||
fi
|
||||
|
||||
# Required for Starting Ingestion Container in Docker Compose
|
||||
|
|
@ -77,7 +85,7 @@ COPY --chown=airflow:0 openmetadata-airflow-apis /home/airflow/openmetadata-airf
|
|||
# Required for Airflow DAGs of Sample Data
|
||||
COPY --chown=airflow:0 ingestion/examples/airflow/dags /opt/airflow/dags
|
||||
COPY --chown=airflow:0 ingestion/examples/airflow/test_dags /opt/airflow/dags
|
||||
COPY --chown=airflow:0 ingestion/airflow-constraints-3.1.7.txt /home/airflow/airflow-constraints-3.1.7.txt
|
||||
COPY --chown=airflow:0 ingestion/airflow-constraints-3.2.1.txt /home/airflow/airflow-constraints-3.2.1.txt
|
||||
|
||||
USER airflow
|
||||
|
||||
|
|
@ -95,17 +103,17 @@ RUN pip install --upgrade pip "setuptools<81"
|
|||
RUN pip install --no-build-isolation "cx_Oracle>=8.3.0,<9"
|
||||
|
||||
# Install FAB provider for Airflow 3.x Flask Blueprint compatibility
|
||||
RUN pip install "apache-airflow-providers-fab>=1.0.0" --constraint "/home/airflow/airflow-constraints-3.1.7.txt" || true
|
||||
RUN pip install "apache-airflow-providers-fab>=1.0.0" --constraint "/home/airflow/airflow-constraints-3.2.1.txt" || true
|
||||
|
||||
|
||||
WORKDIR /home/airflow/openmetadata-airflow-apis
|
||||
RUN pip install "." --constraint "/home/airflow/airflow-constraints-3.1.7.txt"
|
||||
RUN pip install "." --constraint "/home/airflow/airflow-constraints-3.2.1.txt"
|
||||
|
||||
WORKDIR /home/airflow/ingestion
|
||||
|
||||
# Pre-install dialect packages that declare SQLAlchemy<2 in their metadata
|
||||
# but work fine at runtime with SQLAlchemy 2.0 (unmaintained packages).
|
||||
RUN pip install --no-deps "sqlalchemy-redshift==0.8.14" "sqlalchemy-databricks==0.2.0" "sqlalchemy-ibmi==0.9.3" "pydoris-custom==1.1.0"
|
||||
RUN pip install --no-deps "sqlalchemy-redshift==0.8.14" "sqlalchemy-ibmi==0.9.3" "pydoris-custom==1.1.0"
|
||||
RUN pip install "datamodel-code-generator==0.25.6"
|
||||
RUN mkdir -p /home/airflow/ingestion/src/metadata/generated
|
||||
RUN python /home/airflow/scripts/datamodel_generation.py
|
||||
|
|
@ -134,5 +142,10 @@ RUN pip install psycopg2 mysqlclient==2.1.1
|
|||
RUN mkdir -p /opt/airflow/dag_generated_configs
|
||||
|
||||
EXPOSE 8080
|
||||
# Airflow 3.2.1 requires universal-pathlib>=0.3.8, but prior installs in this image
|
||||
# can leave stale 0.2.6 `upath` module files in site-packages that cause import
|
||||
# errors at runtime. Force-remove the stale registration then pin to the required version.
|
||||
RUN pip uninstall upath -y && pip install "universal-pathlib==0.3.10"
|
||||
|
||||
# This is required as it's responsible to create airflow.cfg file
|
||||
RUN airflow db migrate && rm -f /opt/airflow/airflow.db
|
||||
|
|
|
|||
|
|
@ -40,20 +40,9 @@ install_all: ## Install the ingestion module with all dependencies
|
|||
install_apis: ## Install the REST APIs module to the current environment
|
||||
python -m pip install $(ROOT_DIR)/openmetadata-airflow-apis/ setuptools~=70.3.0
|
||||
|
||||
.PHONY: lint
|
||||
lint: ## Run pylint on the Python sources to analyze the codebase
|
||||
PYTHONPATH="${PYTHONPATH}:$(INGESTION_DIR)/plugins" find $(PY_SOURCE) -path $(PY_SOURCE)/metadata/generated -prune -false -o -type f -name "*.py" | xargs pylint --rcfile=$(INGESTION_DIR)/pyproject.toml
|
||||
|
||||
.PHONY: static-checks
|
||||
static-checks:
|
||||
# For Python 3.9, optionally skip SDK type checks if OM_SKIP_SDK_PY39=1
|
||||
PY_VER=$$(python -c 'import sys; print(f"{sys.version_info[0]}.{sys.version_info[1]}")'); \
|
||||
if [ "$$PY_VER" = "3.9" ] && [ "$$OM_SKIP_SDK_PY39" = "1" ]; then \
|
||||
echo "[static-checks] Python $$PY_VER detected with OM_SKIP_SDK_PY39=1 — excluding sdk/ from type checks"; \
|
||||
basedpyright $$(find $(INGESTION_DIR)/src/metadata -maxdepth 1 -mindepth 1 -type d \( -not -name sdk -a -not -name __pycache__ \) | tr '\n' ' ') $$(find $(INGESTION_DIR)/src/metadata -maxdepth 1 -type f -name '*.py' | tr '\n' ' '); \
|
||||
else \
|
||||
basedpyright -p $(INGESTION_DIR)/pyproject.toml; \
|
||||
fi
|
||||
static-checks: ## Run basedpyright type checks (delegates to nox so local matches CI)
|
||||
cd $(INGESTION_DIR) && nox --no-venv -s static-checks
|
||||
|
||||
.PHONY: precommit_install
|
||||
precommit_install: ## Install the project's precommit hooks from .pre-commit-config.yaml
|
||||
@@ -62,17 +51,14 @@ precommit_install: ## Install the project's precommit hooks from .pre-commit-co
	pre-commit install

.PHONY: py_format
py_format: ## Run black and isort to format the Python codebase
	pycln $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --config $(INGESTION_DIR)/pyproject.toml
	isort $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --settings-file $(INGESTION_DIR)/pyproject.toml
	black $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --config $(INGESTION_DIR)/pyproject.toml
py_format: ## Run ruff to lint-fix and format the Python codebase
	ruff check --fix $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --config $(INGESTION_DIR)/pyproject.toml
	ruff format $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --config $(INGESTION_DIR)/pyproject.toml

.PHONY: py_format_check
py_format_check: ## Check if Python sources are correctly formatted
	pycln $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --diff --config $(INGESTION_DIR)/pyproject.toml
	isort --check-only $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --settings-file $(INGESTION_DIR)/pyproject.toml
	black --check --diff $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --config $(INGESTION_DIR)/pyproject.toml
	PYTHONPATH="${PYTHONPATH}:$(INGESTION_DIR)/plugins" pylint --rcfile=$(INGESTION_DIR)/pyproject.toml --fail-under=10 $(PY_SOURCE)/metadata || (echo "PyLint error code $$?"; exit 1)
	ruff check $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --config $(INGESTION_DIR)/pyproject.toml
	ruff format --check $(INGESTION_DIR)/ $(ROOT_DIR)/openmetadata-airflow-apis/ --config $(INGESTION_DIR)/pyproject.toml

.PHONY: unit_ingestion
unit_ingestion: ## Run Python unit tests
@@ -116,7 +102,7 @@ sonar_ingestion: ## Run the Sonar analysis based on the tests results and push
.PHONY: run_apis_tests
run_apis_tests: ## Run the openmetadata airflow apis tests
	coverage erase
	coverage run --rcfile $(ROOT_DIR)/openmetadata-airflow-apis/pyproject.toml -a --branch -m pytest -c $(INGESTION_DIR)/pyproject.toml --junitxml=$(ROOT_DIR)/openmetadata-airflow-apis/junit/test-results.xml $(ROOT_DIR)/openmetadata-airflow-apis/tests
	coverage run --rcfile $(ROOT_DIR)/openmetadata-airflow-apis/pyproject.toml --branch -m pytest -c $(INGESTION_DIR)/pyproject.toml --junitxml=$(ROOT_DIR)/openmetadata-airflow-apis/junit/test-results.xml $(ROOT_DIR)/openmetadata-airflow-apis/tests
	coverage report --rcfile $(ROOT_DIR)/openmetadata-airflow-apis/pyproject.toml

.PHONY: coverage_apis
Some files were not shown because too many files have changed in this diff