OpenMetadata/ingestion/docs/design
Sriharsha Chintalapani e41544764b
ingestion: runtime diagnostics subsystem (#28161)
* docs(ingestion): design for runtime diagnostics subsystem

Proposal for an always-available, opt-in (loggerLevel=DEBUG) diagnostics
layer inside the ingestion framework so connector runs that hang, OOM, or
slow down produce enough live evidence to identify the root cause in
`kubectl logs` — without `py-spy`, `kubectl debug`, or ptrace.

Grounded in three concrete production cases:
- The Snowflake "hang" that was actually a logging recursion bug in
  StreamableLogHandler (fixed by PR #28160) but took ~6 hours and one
  wrong-theory fix to identify.
- Recurring OOMKills with no last-state evidence and no way to attribute
  growth to a specific object type or stage.
- "Is it stuck or just slow?" with no way to answer from outside the pod.

The design is gated entirely on the existing `workflowConfig.loggerLevel`
(no new env vars, no new config fields). When off, the module is dead
code. When on (~250 KB / <0.01% CPU), it provides:
- An operation registry of "what each thread is doing right now"
- SIGUSR1 / SIGUSR2 handlers for on-demand dumps to stderr
- A watchdog thread that auto-logs hangs at 60s and auto-dumps at 300s
- A heartbeat thread emitting one structured progress line every 30s
- A memory tracker (RSS / cgroup / GC top-types on dump)
- Stage-backpressure visibility (queue depths between source/processor/sink)
- HTTP introspection of OMetaClient and DB cursor execute()
2026-05-20 16:38:09 -07:00
ingestion-diagnostics.md ingestion: runtime diagnostics subsystem (#28161) 2026-05-20 16:38:09 -07:00