mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

IceS2 03334b574f fix(docker): mirror IBM iAccess driver on collate CDN (#28097 ) IBM's public CDN (public.dhe.ibm.com) has been unreliable, causing CI build failures with "Failed to connect to ... port 443". Switch all ingestion Dockerfiles to wget the .deb from cdn.getcollate.io with SHA256 verification. Changes: - ingestion/Dockerfile + Dockerfile.ci: replace apt-list+apt-install pattern with direct wget+dpkg, matching the operators' existing shape. - ingestion/operators/docker/Dockerfile + Dockerfile.ci: bump pinned version 1.1.0.13 (2022) -> 1.1.0.29 (matches production ingestion-slim image), add SHA256 verification. The CDN-mirrored .deb is byte-identical to IBM's upstream (verified by SHA256). Production ingestion-slim:1.13.0-n103 already runs 1.1.0.29 (confirmed via dpkg -l inside the image). Decouples Docker builds from IBM's CDN availability — the recent CI failure mode (curl timeout to public.dhe.ibm.com) can no longer occur.		2026-05-14 10:47:02 +02:00
Dockerfile	fix(docker): mirror IBM iAccess driver on collate CDN (#28097 )	2026-05-14 10:47:02 +02:00
Dockerfile.ci	fix(docker): mirror IBM iAccess driver on collate CDN (#28097 )	2026-05-14 10:47:02 +02:00
exit_handler.py	chore(ingestion): drop pylint, expand ruff (#27774 )	2026-04-28 07:21:59 +02:00
main.py	chore(ingestion): drop pylint, expand ruff (#27774 )	2026-04-28 07:21:59 +02:00
README.md	Prepare Ingestion Base Docker image (#8065 )	2022-10-11 07:50:49 +02:00
run_automation.py	chore(ingestion): drop pylint, expand ruff (#27774 )	2026-04-28 07:21:59 +02:00

README.md

OpenMetadata Ingestion Docker Operator

Utilities required to handle metadata ingestion in Airflow using DockerOperator.

The idea behind this approach is to avoid installing packages directly on any Airflow host. Installing the openmetadata-ingestion package in a virtualenv on the host forces the host's Python environment to stay aligned with that package's dependencies, which adds many unnecessary constraints.

The proposed solution - or alternative approach - is to use the DockerOperator and run the ingestion workflows dynamically.

This requires the following:

A Docker image with the bare openmetadata-ingestion requirements,
A main.py file to execute the workflows,
Handling of environment variables as input parameters for the operator.

Note that Airflow's Docker Operator works as follows (example from here):

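from airflow.providers.docker.operators.docker import DockerOperator

# `dag` is assumed to be an existing DAG object defined elsewhere.
DockerOperator(
    docker_url='unix://var/run/docker.sock',  # Set your Docker URL
    command='/bin/sleep 30',  # Command to run inside the container
    image='centos:latest',  # Image the container is started from
    network_mode='bridge',
    task_id='docker_op_tester',
    dag=dag,
)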

We need to provide two ingredients:

The Docker image to execute,
And the command to run.

This is not a Python-first approach, so it does not allow us to set a base image and pass a Python function as a parameter (which would have been the preferred approach). Instead, we will leverage the environment input parameter of the DockerOperator and pass all the necessary information through it.
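As a rough illustration of passing the workflow information through the environment parameter, the sketch below serializes a configuration into string-valued environment variables and assembles keyword arguments for a DockerOperator. The variable names WORKFLOW_CONFIG and WORKFLOW_TYPE, the helper names, and the "python main.py" command are assumptions made for this example; they are not taken from the repository.

```python
import json
from typing import Any, Dict

# Hypothetical environment variable names, chosen for this sketch only.
CONFIG_ENV_VAR = "WORKFLOW_CONFIG"
TYPE_ENV_VAR = "WORKFLOW_TYPE"


def build_environment(workflow_config: Dict[str, Any], workflow_type: str) -> Dict[str, str]:
    """Serialize the workflow inputs into string-valued environment variables."""
    return {
        CONFIG_ENV_VAR: json.dumps(workflow_config),
        TYPE_ENV_VAR: workflow_type,
    }


def build_operator_kwargs(
    task_id: str,
    image: str,
    workflow_config: Dict[str, Any],
    workflow_type: str,
) -> Dict[str, Any]:
    """Assemble keyword arguments that could be passed to DockerOperator."""
    return {
        "task_id": task_id,
        "image": image,
        # Assumed entrypoint: the image is expected to ship a main.py.
        "command": "python main.py",
        "environment": build_environment(workflow_config, workflow_type),
    }
```

With Airflow and its Docker provider installed, these arguments could be used as `DockerOperator(**build_operator_kwargs(...), dag=dag)`. Keeping the serialization in a plain function means it can be checked without an Airflow installation.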

Our main.py Python file will then be in charge of:

Loading the workflow configuration from the environment variables,
Getting the required workflow class to run, and finally,
Executing the workflow.
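The three steps above can be sketched as follows. This is not the repository's main.py: the environment variable names, the EchoWorkflow stand-in class, and the registry are hypothetical, and the configuration is assumed to be JSON so that the example only needs the standard library.

```python
import json
import os
from typing import Any, Dict, Mapping, Type

# Hypothetical environment variable names, chosen for this sketch only.
CONFIG_ENV_VAR = "WORKFLOW_CONFIG"
TYPE_ENV_VAR = "WORKFLOW_TYPE"


class EchoWorkflow:
    """Stand-in for a real workflow class; it only reports what it was given."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    def execute(self) -> str:
        return f"ran {self.config['name']}"


# Hypothetical registry mapping a workflow type to the class that runs it.
WORKFLOW_REGISTRY: Dict[str, Type[EchoWorkflow]] = {"metadata": EchoWorkflow}


def run_from_env(env: Mapping[str, str] = os.environ) -> str:
    # Step 1: load the workflow configuration from the environment.
    raw_config = env.get(CONFIG_ENV_VAR)
    if raw_config is None:
        raise RuntimeError(f"Missing environment variable {CONFIG_ENV_VAR}")
    config = json.loads(raw_config)

    # Step 2: look up the workflow class to run.
    workflow_type = env.get(TYPE_ENV_VAR, "metadata")
    if workflow_type not in WORKFLOW_REGISTRY:
        raise ValueError(f"Unknown workflow type: {workflow_type}")
    workflow_cls = WORKFLOW_REGISTRY[workflow_type]

    # Step 3: execute the workflow.
    return workflow_cls(config).execute()
```

Accepting the environment as a mapping argument, rather than reading os.environ inside the function body, makes the entrypoint easy to exercise with a plain dictionary.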

To try this locally, you can build the DEV image with make build-ingestion-base-local from the project root.

Further improvements

We have two operators to leverage if we don't want to run the ingestion from Airflow's host environment:

from airflow.providers.docker.operators.docker import DockerOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

These can be installed with the apache-airflow[docker] and apache-airflow[kubernetes] extras, respectively.

If we want to handle both of these directly on the openmetadata-managed-apis we need to consider a couple of things:

DockerOperator will only work with Docker and KubernetesPodOperator will only work with a k8s cluster. This means that we'll need to dynamically handle the internal logic to use either of them depending on the deployment. Docs.
For GKE deployments things get a bit more complicated, as we'll need to use and test yet another operator custom-built for GKE: GKEStartPodOperator. Docs
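One possible shape for that dynamic handling is a small lookup from deployment type to operator, with the import deferred until the operator is actually needed so that only the relevant provider package has to be installed. The deployment labels ("docker", "kubernetes", "gke") and the function names are assumptions for this sketch; the module paths for the first two operators are the ones shown above, and the GKE path is the one used by Airflow's Google provider.

```python
import importlib
from typing import Dict, Tuple

# Deployment label -> (module path, class name). The labels are hypothetical.
OPERATOR_PATHS: Dict[str, Tuple[str, str]] = {
    "docker": (
        "airflow.providers.docker.operators.docker",
        "DockerOperator",
    ),
    "kubernetes": (
        "airflow.providers.cncf.kubernetes.operators.kubernetes_pod",
        "KubernetesPodOperator",
    ),
    "gke": (
        "airflow.providers.google.cloud.operators.kubernetes_engine",
        "GKEStartPodOperator",
    ),
}


def operator_path(deployment: str) -> Tuple[str, str]:
    """Return the module path and class name for a deployment label."""
    try:
        return OPERATOR_PATHS[deployment]
    except KeyError:
        raise ValueError(f"Unsupported deployment type: {deployment}") from None


def load_operator(deployment: str) -> type:
    """Import the operator class lazily; requires the matching provider package."""
    module_path, class_name = operator_path(deployment)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

Each operator takes different arguments, so a real implementation would also need a per-deployment translation of the workflow inputs; this sketch covers only the selection step.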