OpenMetadata

mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

Author	SHA1	Message	Date
Eugenio	782a87a706	refactor(sampler): collapse SamplerInterface to a single typed config object (#28147 ) * refactor(sampler): collapse SamplerInterface to a single config object Replace the 9-parameter constructor/create() signature with a typed SamplerConfig hierarchy (SamplerConfig / DatabaseSamplerConfig / StorageSamplerConfig). Config resolution — partition_details, sample_query, include/exclude columns, sample_config, sample_data_count — now happens in callers (entity_adapters, profiler_source, base_test_suite_source) before construction, so the interface only receives already-resolved values. - Add sampler_config.py with SamplerConfig dataclass hierarchy - Remove database-specific imports from SamplerInterface base class - Move SSL connection setup and column include/exclude filtering to database-family subclasses (SQASampler, PandasSampler, NoSQLSampler) - Simplify BigQuery/Postgres/Snowflake samplers to args/kwargs init - Remove StorageSampler.create() override; base create() is sufficient - Update profiler_source and base_test_suite_source to build DatabaseSamplerConfig before calling sampler_class.create() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> docs(sampler): fix build_sampler_kwargs example to use SamplerConfig The non-database adapter example was showing the old flat kwargs pattern (sample_config, sample_data_count) that SamplerInterface now silently ignores via *__. Replace with the correct "config": SamplerConfig(...) pattern that matches the actual ContainerAdapter implementation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> fix(sampler): guard BigQuerySampler.tableType access with isinstance check ClassifiableEntityType includes Container which has no tableType. The args/kwargs init simplified the constructor but lost the explicit Table type annotation, triggering a basedpyright error. Guard with isinstance(self.entity, Table) so the type checker knows tableType is only accessed on Table entities. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Fix tests * Gitar bot feedback --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-19 14:56:02 +02:00
Pere Miquel Brull	4217e6db8d	fix(log-storage): plug clobber bugs in streamable S3 logs (partial.txt + logs.txt) (#27926 ) * fix(api): make closeStream idempotent when log storage is not configured closeStream used to throw IllegalStateException("Log storage is not configured") which the resource layer translates to a 500 response. That made the contract surprising for callers: any defensive cleanup path (exit handlers, retry logic, generic teardown) had to know in advance whether streaming was configured before calling close, or eat spurious server errors. Closing a stream is naturally idempotent — same shape as DELETE on a non-existent resource. When log storage is not configured, return silently with a debug log so callers can call close() defensively without checking state first. Adds a unit test covering the no-op path. * Add design spec for streamable logs stability fix Captures the design discussion for fixing partial.txt and logs.txt clobber bugs in S3LogStorage when ingestion runs hit idle gaps longer than the 5-minute stream timeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add full design flow doc for streamable ingestion logs End-to-end documentation of the streamable logs feature: architecture, storage layout, run lifecycle, read paths, abandoned-run recovery, configuration, concurrency model, and observability. Reflects the post-fix design captured in the streamable-logs-stability spec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add implementation plan for streamable-logs stability fix Step-by-step TDD plan grouped into 8 PR-sized tasks: config schema additions, per-stream lock, pendingFlush + merge-always flush, multipart removal, sweeper rewrite, /close rewrite, read-path correction, and integration tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(log-storage): add config fields for streamable-logs stability fix Adds streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes, earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures. Deprecates streamTimeoutMinutes in favor of streamTimeoutHours. Pure schema-only change; no Java code consumes these fields yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(log-storage): add deprecated:true keyword and clarify watermark unit Addresses code review on Task 1: project convention uses the JSON Schema deprecated keyword alongside description annotation. Also clarifies that earlyFlushWatermarkBytes default (5242880) equals 5 MB. * feat(log-storage): wire new stability-fix config fields into S3LogStorage Reads streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes, earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures from LogStorageConfiguration with sane defaults. No behavioral change yet — values are stored but not consumed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(log-storage): broaden streamTimeoutMinutes deprecation warning + drop FQN Addresses code review on Task 2: warning now fires whenever streamTimeoutMinutes is set (not only for values < 30 min), since the field is deprecated for all deployments. Also imports java.lang.reflect.Field in the test helper instead of using a fully-qualified name (CLAUDE.md no-FQN rule). * refactor(log-storage): add per-stream ReentrantLock for S3LogStorage Introduces streamLocks map and acquire/release helpers. appendLogs, writePartialLogsForStream, closeStream, and cleanupExpiredStreams all serialize on the per-stream lock. No behavior change; locking is pure mutual-exclusion at this point. * fix(log-storage): close iterator.remove race in cleanupExpiredStreams Move iterator.remove() inside the per-stream lock to prevent a window where a concurrent appendLogs sees the still-present closed StreamContext and writes to a closed stream. Also clarifies the comment on flush(fqn,runId) ordering and documents that streamLocks accumulates monotonically until Tasks 7 and 8 add cleanup. * feat(log-storage): track pendingFlush queue and totalLinesAppended counter Each appendLogs now also populates per-stream pendingFlush (lines awaiting flush) and totalLinesAppended (monotonic logical line counter). State is written but not yet consumed; the new flush logic in the next commit reads it. * fix(log-storage): document thread-safety + lifecycle on Task 4 maps, add test Addresses review on Task 4: documents that pendingFlush ArrayList values may only be accessed under the per-stream lock; clarifies that consecutiveFlushFailures is written and consumed in Task 5 (not just consumed); aligns its type with AtomicInteger for consistency with the other counters; adds a test for the trailing-newline trim path. * fix(log-storage): merge-always partial.txt PUT and persist offset in S3 metadata Replaces the old writePartialLogsForStream that skipped the read-merge step when partialLogOffsets[streamKey] was 0 (the canonical 80MB->KB clobber bug). The new flush always reads existing partial.txt, appends a snapshot of pendingFlush, and PUTs with offset state in S3 user-defined metadata. Also adds an early-flush watermark trigger so high-burst writes don't pile up unbounded in pendingFlush. Closes the partial.txt-clobber half of the streamable-logs-stability spec. * fix(log-storage): replace task-number comments with intent-describing language Addresses code review on Task 5: production code comments should describe invariants, not the planning-doc task that filled the gap. Also clarifies the parse-before-lock and the byte-counter atomicity assumption. * refactor(log-storage): remove MultipartS3OutputStream, rewrite closeStream as server-side copy appendLogs no longer initiates a multipart upload; bytes flow only through pendingFlush -> partial.txt PUTs. closeStream now: (1) drains pendingFlush via final partial.txt PUT, (2) issues CopyObjectRequest from partial.txt to logs.txt server-side, (3) deletes partial.txt and the .active marker, (4) drops in-memory state. Idempotent: a second /close sees no partial.txt (NoSuchKeyException) and returns gracefully. Closes the logs.txt-clobber half of the streamable-logs-stability spec and finalizes the canonical /close flow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(log-storage): plug listener/lock leaks, propagate SSE on copy, recover counter from metadata Addresses code review on combined Tasks 6+8: - dropStreamState now removes activeListeners entries (SSE listener leak fix). - cleanupExpiredStreams now removes streamLocks entries on expire (lock leak fix). - copyPartialToLogs applies SSE configuration to CopyObjectRequest (was unencrypted on copy). - writePartialLogsForStreamLocked reads last-flushed-line metadata from existing partial.txt and uses it to keep totalLinesAppended monotonic across restarts. - consecutiveFlushFailures reset uses computeIfAbsent + set(0) instead of allocating a new AtomicInteger every successful flush. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(log-storage): rewrite sweeper as cleanupAbandonedStreams (24h/1h) Bumps the idle threshold from 5 min to streamTimeoutHours (default 24h) and the poll interval from 1 min to cleanupIntervalMinutes (default 1h). On expire, finalizes the abandoned run by copying partial.txt -> logs.txt server-side, deleting partial.txt, and dropping in-memory state — same end-state as closeStream. Also wires partialFlushIntervalMinutes into the periodic flush schedule and removes the legacy streamTimeoutMs field that no longer drives behavior. * fix(log-storage): preserve streamLocks entry on cleanup retry path Addresses code review on Task 7: streamLocks.remove was unconditionally in the finally block of finalizeAbandonedStream, so it ran even when the sweeper returned early to retry next tick on a copy failure. That meant the next sweep tick would create a fresh ReentrantLock, and any concurrent appendLogs in the meantime would contend on a different lock object than the retry, defeating mutual exclusion. Now we only remove the lock entry once finalization has succeeded (after dropStreamState). The retry path leaves the lock in place so the next tick and any concurrent appendLogs see the same lock identity. * fix(log-storage): include pendingFlush snapshot in mid-run reads getCombinedLogsForActiveStream now appends the in-memory pendingFlush snapshot to the partial.txt body when reading mid-run, so the UI's paginated GET surfaces the most recent tail even before the next scheduled flush has happened. Only appends pendingFlush when a partial.txt file exists, avoiding duplication in the fallback path where recentLogsCache already includes those lines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(log-storage): tighten Task 9 read path safety + invariant comment Addresses review on Task 9: the unsafe null-lock fallback in the pendingFlush append path is removed (it was structurally unreachable but a latent hazard for future lifecycle changes). The pendingFlush read now happens entirely under the per-stream lock, with a conservative skip if no lock entry exists. Also documents the recentLogsCache-vs-pendingFlush invariant in the fallback path and adds a total-count assertion to the new test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(log-storage): add bug-reproducer ITs for streamable-logs stability - testIdleGapDoesNotClobberPartial: two log bursts within an open run; asserts both are present in the read response. - testCloseProducesLogsTxtMatchingPartial: write, close, read; asserts content survives the close. - testCloseIsIdempotent: a second /close is a graceful no-op. Tests are tolerant of the storage backend in the test environment (DefaultLogStorage in CI may not persist; S3LogStorage in S3-configured environments). Deep behavioral coverage is in S3LogStorageTest unit tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(log-storage): address final-review critical bugs - closeStream and finalizeAbandonedStream now propagate PUT failures from writePartialLogsForStreamLocked (which returns boolean). closeStream throws IOException; the sweeper retains state for retry. Fixes silent data loss when the final flush PUT fails. - streamLocks entries are no longer removed; this prevents an acquire-vs-remove race that would break mutual exclusion. Memory growth is bounded by maxConcurrentStreams in practice. - cleanupAbandonedStreams re-checks expiration inside the per-stream lock so a stream that was bumped by appendLogs between the scan and the lock acquisition is not finalized. - deleteLogs now acquires the per-stream lock before mutating state. - getCombinedLogsForActiveStream appends pendingFlush in BOTH the S3-found and memory-fallback branches, so reads aren't truncated when recentLogsCache evicts oldest at its 1000-line cap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(log-storage): use pendingFlush as canonical mid-run read source (no duplicates) The previous Issue 5 fix appended pendingFlush unconditionally, which caused duplicate lines in the read response when the fallback branch used recentLogsCache (since both are populated by the same appendLogs). Now: in the foundPartialFile branch, append pendingFlush AFTER the S3 body (non-overlapping by construction). In the fallback branch (no partial.txt yet), use pendingFlush directly as the canonical source — this is more complete than recentLogsCache (1000-line cap) and avoids the duplicate issue. recentLogsCache remains a defensive fallback for the rare case where pendingFlush is empty in the fallback path. * Update generated TypeScript types * chore(log-storage): drop dead abortIncompleteMultipartUpload lifecycle rule The multipart upload write path was removed; the bucket lifecycle's abortIncompleteMultipartUpload(7 days) rule served only as migration cleanup for in-flight uploads from the old code at deploy time. After the migration window it does nothing. Drops the rule from configureLifecyclePolicy, the AWS SDK import, the "7 days multipart cleanup" string in the startup log, and the corresponding bullet in docs/streamable-logs.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: ignore docs/superpowers/ Local-only working notes (specs, plans) live there and shouldn't be tracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(log-storage): tolerate DefaultLogStorage in CI for streamable-logs ITs CI runs the integration tests against the bootstrap config which uses DefaultLogStorage (delegates to k8s/Airflow which isn't running). The storage returns: - "No pods found for this pipeline" sentinel for getLogs - non-2xx status (the SDK wraps it as statusCode -1) for /close Adjustments: - testIdleGapDoesNotClobberPartial: parse JSON, only assert when total>0. When storage actually persists (S3 deployments), assert BOTH bursts are present — that's the real "no clobber" check. - postClose helper: tolerate any exception from the close call (idempotency is the contract; transient errors are acceptable). The deep behavioural coverage continues to live in S3LogStorageTest unit tests where mock S3 is the storage backend. * test * fix * Update generated TypeScript types * fix * Update generated TypeScript types * fix(log-storage): record UTF-8 byte length in partial.txt total-bytes metadata String.length() returns UTF-16 code units; for non-ASCII content this diverged from the actual S3 object size, breaking the drift cross-check documented in docs/streamable-logs.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(log-storage): address PR review findings on S3LogStorage Plumbs the documented timing knobs (cleanupIntervalMinutes, partialFlushIntervalMinutes, earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures) through LogStorageConfiguration so operators can actually tune them. Replaces the unbounded streamLocks ConcurrentHashMap with a Guava Striped<Lock> capped at 256 stripes, eliminating the per-(fqn, runId) memory leak and the acquire-vs-remove race that a per-key map would have. Adds a Multipart Upload + UploadPartCopy concatenation path for partial.txt >= 5 MB, avoiding the O(n^2) total transfer and full in-JVM body merge that the prior GET+PUT-everything strategy hit on long-running pipelines. Realigns docs/streamable-logs.md with the actual schema and implementation, drops the broken superpowers/* spec link, and renames the misleading testIdleGapDoesNotClobberPartial IT (which posted bursts back-to-back without simulating any gap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2026-05-15 11:02:19 +02:00
Eugenio	56bda498b5	Refactor(ingestion): introduce ClassifiableEntityAdapter to eliminate scattered isinstance checks (#27716 ) * refactor(sampler): introduce EntityAdapter to centralise per-entity classification logic Replace scattered isinstance(entity, Table/Container) branches across processor.py, pii/base_processor.py, patch_mixin.py, and metadata_rest.py with a single EntityAdapter strategy pattern in sampler/entity_adapters.py. Each adapter encodes get_columns, set_columns, patch_fields, build_sampler_kwargs, pipeline_config_class, and service_type for one entity type. _BY_ENTITY and _BY_PIPELINE registries make lookup O(1). Adding a new classifiable entity now requires changes to entity_adapters.py only — no other ingestion files need to change. Also extracts build_database_service_conn_config into sampler/config_utils.py and updates the developer guide accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Apply PR feedback * Typing and text fixes * Apply gitar bot feedback * Fix tests * Apply Gitar bot suggestions --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-13 09:37:58 +02:00
Eugenio	88c44502ae	feat: Add auto-classification support for storage service containers (#26495 ) Some checks failed Java Checkstyle / java-checkstyle (push) Waiting to run Details Maven Collate Tests / maven-collate-ci (push) Waiting to run Details OpenMetadata Service Unit Tests / Detect Changes (push) Waiting to run Details OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (mysql) (push) Blocked by required conditions Details OpenMetadata Service Unit Tests / openmetadata-service-unit-tests (postgresql) (push) Blocked by required conditions Details OpenMetadata Service Unit Tests / k8s_operator-unit-tests (push) Blocked by required conditions Details OpenMetadata Service Unit Tests / openmetadata-service-unit-tests-status (push) Blocked by required conditions Details Integration Tests - MySQL + Elasticsearch / Detect Changes (push) Has been cancelled Details Integration Tests - PostgreSQL + OpenSearch / Detect Changes (push) Has been cancelled Details Publish Package to Maven Central Repository / publish-maven-packages (push) Has been cancelled Details Integration Tests - PostgreSQL + OpenSearch / integration-tests-postgres-opensearch (push) Has been cancelled Details Integration Tests - MySQL + Elasticsearch / integration-tests-mysql-elasticsearch (push) Has been cancelled Details * Add schema support for container auto-classification Extend container entity schema to support sample data storage, enabling PII detection and classification workflows on storage service containers. Changes: - Add sampleData field to container.json for storing sample data - Create storageServiceAutoClassificationPipeline.json schema defining configuration for storage service auto-classification pipelines - Update workflow.json to include StorageServiceAutoClassificationPipeline as a supported pipeline type This provides the schema foundation for running auto-classification workflows on S3, GCS, and other storage service containers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add backend support for container sample data and classification Implement Java backend functionality to handle sample data ingestion, storage, and PII masking for container entities. Changes: - ContainerRepository: Add sample data retrieval and storage operations - EntityRepository: Extend sample data support to container entities - ContainerResource: Add REST endpoint for container sample data ingestion - PIIMasker: Extend PII masking to support container entities This enables the backend to process and store sample data from storage service containers and apply PII masking rules during data retrieval. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Extend classifiable entity types to include containers Add Container to the ClassifiableEntityType union, enabling PII detection and auto-classification workflows to process storage service containers alongside database tables. Changes: - Update ClassifiableEntityType from Table-only to Union[Table, Container] - Import Container entity type - Update module docstring to reflect current support This type extension allows the PII processor to handle both database tables and storage containers uniformly. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add container sample data ingestion to OpenMetadata API Implement container-specific API mixin for sample data operations and integrate it into the main OpenMetadata client. Changes: - Add OMetaContainerMixin with ingest_container_sample_data method - Handle binary data encoding (base64) and serialization errors - Register mixin in OpenMetadata class hierarchy - Mirror table sample data ingestion patterns for consistency This provides the Python API layer for ingesting sample data from storage service containers into OpenMetadata. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Implement storage service samplers for S3 and GCS Add sampler implementations for storage services to extract sample data from structured containers (Parquet, CSV) for auto-classification. Changes: - Create base StorageSamplerInterface for storage service sampling - Implement S3Sampler for AWS S3 containers with structured file support - Implement GCSSampler for Google Cloud Storage containers - Support column extraction and data sampling for structured formats - Handle dataModel-based column definitions from containers Storage samplers read container metadata, fetch file contents, and generate sample datasets for downstream PII detection. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Update PII processor to support container entities Extend the base PII processor to handle both Table and Container entities with unified column extraction logic. Changes: - Add _get_entity_columns helper to extract columns from Table or Container - Handle Container entities with optional dataModel.columns structure - Improve column matching with safe fallback for missing columns - Use generic entity reference in error reporting - Add early return when entity has no columns to process This enables PII detection to run on storage containers the same way it processes database tables. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add storage service support to sampler processor Extend the sampler processor to handle both database and storage service entities with appropriate sampler class selection. Changes: - Detect service type from source config (Database vs Storage) - Import StorageServiceAutoClassificationPipeline - Handle both Table and Container entity types in _run method - Add column validation for Container entities (via dataModel.columns) - Create storage-specific sampler interfaces for S3 and GCS - Update sampler_interface to support Container entities - Improve error messages with entity type context The processor now dynamically selects database or storage samplers based on the pipeline configuration type. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add storage fetcher strategy for container classification Implement fetcher strategy pattern for storage services to retrieve containers for auto-classification workflows. Changes: - Add StorageFetcherStrategy to handle storage service entity fetching - Update EntityFetcher to select appropriate strategy based on service type - Support both DatabaseService and StorageService in strategy selection - Import StorageService type for service detection - Improve error messages with specific service type information The fetcher now dynamically creates database or storage-specific strategies to retrieve entities based on pipeline configuration. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Register auto-classification pipeline in storage service specs Add AutoClassification pipeline support to S3 and GCS storage service specifications, enabling UI and workflow registration. Changes: - Add AutoClassification to S3ServiceSpec supported pipelines - Add AutoClassification to GCSServiceSpec supported pipelines - Import StorageServiceAutoClassificationPipeline in both specs This registers the auto-classification workflow type for storage services in the ingestion framework's service registry. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add container support to metadata sink and patch operations Extend metadata sink and patch mixin to handle container entities, enabling sample data ingestion and tag updates for containers. Changes: - Add Container to MetadataRestSink entity type handling - Implement container sample data ingestion in sink._run - Add Container to PatchMixin tag operations - Import Container entity type in both modules This completes the metadata ingestion pipeline by allowing the sink to persist sample data and classification tags for container entities. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Update classification workflow for storage service support Extend the auto-classification workflow to handle both database and storage service pipelines with unified step orchestration. Changes: - Import StorageServiceAutoClassificationPipeline - Add type checking for both Database and Storage pipeline configs - Remove unnecessary cast, use direct type checks - Add validation warning for unsupported config types - Preserve enableAutoClassification flag behavior for both types The workflow now supports running PII detection and classification on both database tables and storage containers based on config type. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add unit tests for container classification components Add test coverage for container-specific fetcher and sampler components. Changes: - Add test_container_fetcher.py for StorageFetcherStrategy tests - Add test_container_sampler_processor.py for container sampler tests Tests validate: - Storage service fetcher strategy selection and instantiation - Container sampler processor initialization and execution - Proper handling of Container entities vs Table entities 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Reorganize integration tests by entity type Restructure auto-classification integration tests into separate directories for databases and containers to improve organization. Changes: - Move database classification tests to databases/ subdirectory - Move conftest.py, init.sql, and test_tag_processor.py into databases/ - Container tests already organized in containers/ subdirectory - Remove old flat test structure This organization makes it clearer which tests target database entities vs storage container entities in classification workflows. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Properly retrieve sample data * Update generated TypeScript types * Apply Gitar bot * Fix tests * feat: Add supportsProfiler to storage connection schemas Add supportsProfiler field to storage connection schemas (S3, GCS, ADLS, Custom Storage) to enable auto-classification pipeline support for storage services. This aligns with the backend changes in PR #26495 that added container auto-classification functionality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * feat: Add UI support for storage service auto-classification - Update IngestionWorkflowUtils to route storage services to storage-specific auto-classification schema - Modify getSupportedPipelineTypes to filter pipeline types based on service category (storage services only show AutoClassification, not Profiler) - Update AddIngestionButton to pass serviceCategory parameter - Add unit test to verify storage services only get AutoClassification option This enables users to configure and run auto-classification agents on storage services (S3, GCS, ADLS) for PII detection on containers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * fix: Add BucketArn field to S3BucketResponse model AWS S3 API now returns a BucketArn field in list_buckets() responses. Add this optional field to prevent Pydantic extra_forbidden validation errors. Error: BucketArn Extra inputs are not permitted [type=extra_forbidden] 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * fix: Add Container permissions to AutoClassificationBotPolicy Add Container entity permissions to AutoClassificationBotPolicy to allow the autoClassification-bot to apply tags and sample data to storage containers. Previously, the bot only had permissions for Table entities, causing permission denied errors when running auto-classification on storage services. Changes: - Add Container rule with EditAll and ViewAll operations to policy seed data - Create migrations for MySQL and PostgreSQL to update existing installations Error fixed: Principal: CatalogPrincipal{name='autoclassification-bot'} operations [EditTags] not allowed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Update generated TypeScript types * fix: Add fallback for storage service type detection in sampler Add fallback logic to detect storage services by source type name when the pipeline config type check fails. This handles cases where the Airflow environment might not have the updated schema/package with StorageServiceAutoClassificationPipeline. Changes: - Add fallback detection for s3, gcs, azuredatalake, customstorage - Add debug logging for service type detection - Preserve primary instanceof check for proper type detection This fixes the "No module named 'metadata.ingestion.source.database.gcs'" error when running storage auto-classification pipelines. * Guide to support new entities in classification agent * docs: Update auto-classification guide with debugging learnings Add critical troubleshooting information discovered during container classification debugging: 1. storeSampleData defaults to false - Sample data NOT ingested unless explicitly enabled - Document why this is by design (avoid large datasets) - Add troubleshooting steps to verify flag is set 2. Service type detection fallback pattern - Explain why fallback is needed (Airflow package caching) - Show complete implementation with source type lists - Add debug logging pattern 3. Troubleshooting section - Sample data not appearing: check storeSampleData, database, logs - Module import errors: service type detection issues - PII tags not applied: config and data issues 4. Common pitfalls additions - Emphasize storeSampleData default value - Service type detection in cached environments These updates reflect real debugging scenarios and will help future developers avoid the same issues. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Apply gitar bot suggestions * Fix suggestions, linting, and SonarCloud issues * More gitar bot suggestions * Fix compile error * Fix linting * Fix broken tests * Fix unorganized import * Improve config parsing This is so that we rightly discover polymorphic properties of `source` when the config does not provide enough fields for Pydantic to correctly discriminate between models (e.g: confusing database source config with storage source config) * Gitar bot comment * Fix s3 source test * Apply comments from reviews * Extract cantidate column logic in samplers * Fix tests * Fix container customization test --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2026-04-24 06:29:16 -07:00
Sriharsha Chintalapani	bb0daa180e	RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex (#26902 ) * RDF, cleanup relations and remove unnecessary bindings, add distributed mode for RDF reindex * Update generated TypeScript types * Address comments from copilot * Update generated TypeScript types * fix test issues * Fix minor UI bugs * Add the missing filters * Fix RDF export API error * Add export functionality * Fix ui-checkstyle * Fix java checkstyle * Fix unit tests * Fix and increase the coverage for KnowledgeGraph.spec.ts * Fix tests * Remove rdf as default in playwright and local docker * fix ui-checkstyle * Address comments * Potential fix for pull request finding 'CodeQL / Artifact poisoning' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * Address copilot comments * Address copilot comments * FIx tests * FIx docker * Update openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/rdf/distributed/DistributedRdfIndexCoordinator.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Address copilot review comments: license headers, JSON escaping, type safety, border-color, stop semantics Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/c026e52e-162b-4c9a-9874-43791d4aaac1 Co-authored-by: harshach <38649+harshach@users.noreply.github.com> * Show error toast for unsupported export format in KnowledgeGraph Agent-Logs-Url: https://github.com/open-metadata/OpenMetadata/sessions/c026e52e-162b-4c9a-9874-43791d4aaac1 Co-authored-by: harshach <38649+harshach@users.noreply.github.com> * Fix docker * Fix docker for playwright * Fix docker for playwright * Fix tests * Fix tests * Fix docker * Fix docker * Fix glossary and pagination spec flakiness * update the missing translations * Fix docker * Fix docker * Fix integration test * Fix fuseki not starting * Fixed the run local docker script * worked on comments * Fix flakiness in knowledge graph tests * Fix checkstyle --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Aniket Katkar <aniketkatkar97@gmail.com> Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: harshach <38649+harshach@users.noreply.github.com>	2026-04-14 13:24:41 -07:00
Sriharsha Chintalapani	6d99ba2dc0	Glossary relations (#25886 ) * Glossary Term Relations * Add GlossaryTerm Relations * Add GlossaryTerm Relations, Add custom relations, onotolgoy explorer * Add Translations * Update generated TypeScript types * Address comments * Address comments * Address comments * Update generated TypeScript types * Update yarn.lock after merging cytoscape dependencies from glossary_relations * fix zoom in and out functionality and added missing translate keys * fix test * Remove unwanted changes * nit * nit * nit * Remove conflict test * nit * fix test * Add test for ontology explorer * New yarn lock and 2.0.0 schema changes missed during merge conflicts * Revamped glossary term relation settings * Refactor code * Addressed comments * nit * Update generated TypeScript types * Java Checkstyle and Yarn lock * Update generated TypeScript types * fix unit test * Remove 2.0.0 migration folders placed at wrong loc * Merge main * fix navigation to relation graph in glossary * fix ontology explorer spec * Added filter support in the data mode * Fix glossary term relation CI failures ### Canonical Relation Storage (GlossaryTermRepository) * Introduced `computeCanonicalRelationType()` to normalize relation direction using UUID ordering (lower UUID is always treated as "from") * Prevents duplicate and inconsistent relation rows when created from either side * Updated `setTermRelations()` and `addRelation()` to store canonical relation types * Fixed `setFields()` read logic: * Invert relation type for `fromRecords` (entity is the TO side) * Keep `toRecords` unchanged * Updated `deleteBidirectionalRelatedTo()` to match canonical storage format * Added `RequestEntityCache.invalidate()` after relation mutations to ensure consistency ### Lazy RDF Resource Initialization * Added `RdfRepository.getInstanceOrNull()` for null-safe access without throwing * Refactored `RdfResource` constructor to avoid eager `RdfRepository.getInstance()` call * Enabled resource registration even when Fuseki is not initialized * Introduced lazy getters: * `getRdfRepository()` * `getSemanticSearchEngine()` * Updated all endpoints to guard with null checks before `isEnabled()` * Return `503 Service Unavailable` when RDF is not ready ### Graceful Test Degradation (Fuseki-dependent tests) * Added `TestSuiteBootstrap.isFusekiEnabled()` to detect Fuseki availability * `GlossaryOntologyExportIT`: * Falls back to Testcontainers-based local Fuseki when bootstrap Fuseki is unavailable * `GlossaryTermRelationIT`: * Skipped via `assumeTrue` when Fuseki is unavailable * `MetricResourceIT`: * Skips RDF-specific tests when Fuseki is unavailable * fix package conflicts * nit * Fix merge conflicts, Python test, RDF reliability, and VectorDocBuilder tests - Fix Python test_patch_glossary_term_related_terms to use TermRelation instead of EntityReferenceList (schema changed relatedTerms type) - Rewrite VectorDocBuilder tests for current buildEmbeddingFields API - Improve JenaFusekiStorage retry logic to retry on all HTTP errors - Increase Fuseki tmpfs size to prevent disk space exhaustion in tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix pycheck * Address all 8 PR review findings 1. Add authorization check on getTermRelationGraph endpoint 2. Add null guard on getBaseUri() to prevent NPE 3. Add React key prop on RelatedTermTagButton in map renders 4. Mark RdfResource lazy-init fields as volatile for thread safety 5. Replace exception messages with generic errors in API responses 6. Unify DEFAULT_RELATION_TYPES between CSV and repository (10 types) 7. Add jitter backoff to deadlock retry in CollectionDAO 8. Replace N+1 queries in prefetchGraphTerms with batch fetch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix Fuseki tmpfs exhaustion and GlossaryTermRelationIT double init - Remove tmpfs size limit on Fuseki container to prevent disk exhaustion - Guard RdfUpdater.initialize() in GlossaryTermRelationIT to skip if already initialized by bootstrap Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix duplicate edges, null term NPE, and silent exception in graph builder - Deduplicate edges in buildGraph() using edgesSeen set - Skip TermRelation entries with null term references to prevent NPE - Add warning log when glossary term relation settings fail to load Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix cardinality count after canonical swap and double-checked locking - getRelationCount now matches inverse relation type for fromRecords where the term is the target, fixing cardinality bypass after bidirectional UUID canonicalization - Use double-checked locking in RdfResource.getSemanticSearchEngine() to prevent duplicate instance creation under concurrency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: anuj-kumary <anujf0510@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Ram Narayan Balaji <ramnarayanb3005@gmail.com> Co-authored-by: Ram Narayan Balaji <81347100+yan-3005@users.noreply.github.com>	2026-03-18 10:51:03 +05:30
Mohit Yadav	21750aaa90	Feature/search indexing issues (#25594 ) * Add design doc for search indexing stats redesign Covers: - Simplified 4-stage pipeline model (Reader, Process, Sink, Vector) - Per-entity index promotion instead of batch promotion - Alias management from indexMapping.json - Payload-aware vector bulk processor Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * Add Support for Per Entity Index Promotion * Add UI Bit * Add Lang * Add AppLog View Test coverage * Add Bathced Vector index querying * Add Improvements for Vector to be async and also stats to be better handled * Use Virtual Thread * Use Virtual Thread * Fix Tests * Make reading stats easier * Fixed Stats to be accurate * Fix Stats getting null * Fix partition worker stats * Fix Reader Stats - final * Update generated TypeScript types * Make updates in 1.12.0 * Revert "Use Virtual Thread" This reverts commit `4eb23374d1`. * Revert "Use Virtual Thread" This reverts commit `efe8d03b5d`. * Reapply "Use Virtual Thread" This reverts commit `d59cde18b2`. * Reapply "Use Virtual Thread" This reverts commit `769e5710c3`. * Fix Final Update on stat * - Add atomic alias swap - remove unnecessary migration * Fix Sonar test jest * Fix Final Update on stat --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2026-01-29 18:50:39 +05:30
Sriharsha Chintalapani	43f85a8969	Add RDF local dev (#24825 ) * Add RDF local dev * remove doc --------- Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>	2025-12-15 10:49:13 +01:00
Pere Miquel Brull	c9cffa00db	Update roadmap (#6440 ) * remove docs dir * Update roadmap	2022-07-30 09:40:05 -07:00
Ayush Shah	fc2bd386a6	Clean gitbook from main (#5007 )	2022-05-17 23:29:47 +05:30
Shannon Bradshaw	a2151e473d	GitBook: [#182 ] Correct advanced search text	2022-04-10 21:12:00 -07:00
OpenMetadata	dbe6b641ac	GitBook: [#179 ] No subject	2022-04-10 21:12:00 -07:00
OpenMetadata	694eba2799	GitBook: [#178 ] No subject	2022-04-10 21:12:00 -07:00
Shannon Bradshaw	383bca1315	GitBook: [#177 ] Fix image for advanced search	2022-04-10 21:12:00 -07:00
OpenMetadata	7a65d27010	GitBook: [#174 ] Update Kubernetes Docs	2022-04-10 21:11:58 -07:00
OpenMetadata	2a4c894f14	GitBook: [#175 ] No subject	2022-04-10 21:11:34 -07:00
Shannon Bradshaw	9b7bc505d7	GitBook: [#173 ] Fix TOC links for snowflake metadata ingestion	2022-04-10 21:11:34 -07:00
Shannon Bradshaw	7b9c48e674	GitBook: [#172 ] Separate Snowflake UI docs	2022-04-10 21:11:34 -07:00
Shannon Bradshaw	b219e20e3d	GitBook: [#168 ] General cleanup for snowflake metadata ingestion docs	2022-04-10 21:11:34 -07:00
Shilpa V	8618f9c669	GitBook: [#171 ] Deleting service_type	2022-04-10 21:11:33 -07:00
pmbrull	1baf1bc310	GitBook: [#170 ] SQLAlchemy constraint	2022-04-10 21:11:33 -07:00
pmbrull	cc794b780b	GitBook: [#169 ] Lineage Airflow 1.10.15	2022-04-10 21:11:33 -07:00
Shilpa V	1df5a4e52e	GitBook: [#167 ] MySQL Updates	2022-04-10 21:11:33 -07:00
Shannon Bradshaw	06ee54e5ca	GitBook: [#166 ] No subject	2022-04-10 21:11:33 -07:00
Shilpa V	971c9aad90	GitBook: [#162 ] MSSQL updates	2022-04-10 21:11:33 -07:00
Shannon Bradshaw	090ded1fd7	GitBook: [#163 ] Update Try OpenMetadata in Docker with latest success output messaging	2022-04-10 21:11:33 -07:00
Shilpa V	d9b5197e24	GitBook: [#161 ] MLflow Updates	2022-04-10 21:11:32 -07:00
Shilpa V	6b2d406439	GitBook: [#160 ] Glue Updates	2022-04-10 21:11:32 -07:00
Shilpa V	5f2a2ef49b	GitBook: [#159 ] Glue	2022-04-10 21:11:32 -07:00
Shilpa V	d4d291008a	GitBook: [#157 ] Glue Changes	2022-04-10 21:11:32 -07:00
Shannon Bradshaw	2412a436ea	GitBook: [#156 ] Add procedure TOC to BigQuery UI page	2022-04-10 21:11:32 -07:00
Shannon Bradshaw	ef8bae6708	GitBook: [#155 ] Add BigQuery UI config page	2022-04-10 21:11:32 -07:00
Shannon Bradshaw	0214668096	GitBook: [#154 ] Fix broken link to Try OpenMetadata in Docker	2022-04-10 21:11:32 -07:00
Shilpa V	cd65905b54	GitBook: [#153 ] 3 Tab Connector Steps - Changes	2022-04-10 21:11:32 -07:00
Shilpa V	3f5aa6391b	GitBook: [#152 ] Usage - Edits	2022-04-10 21:11:31 -07:00
Shilpa V	ed3def7de2	GitBook: [#151 ] MSSQL Usage Edits	2022-04-10 21:11:31 -07:00
Shilpa V	123149655a	GitBook: [#149 ] Delta Lake changes	2022-04-10 21:11:31 -07:00
Shilpa V	c605819368	GitBook: [#148 ] Delta Lake Changes	2022-04-10 21:11:31 -07:00
Shilpa V	346b72b569	GitBook: [#129 ] New Connectors	2022-04-10 21:11:31 -07:00
OpenMetadata	82bad2cc1f	GitBook: [#147 ] No subject	2022-04-10 21:11:31 -07:00
OpenMetadata	f01e837658	GitBook: [#146 ] No subject	2022-04-10 21:11:31 -07:00
OpenMetadata	b157766a0f	GitBook: [#145 ] No subject	2022-04-10 21:11:31 -07:00
OpenMetadata	61e0c453d3	GitBook: [#144 ] No subject	2022-04-10 21:11:30 -07:00
OpenMetadata	64ca190d25	GitBook: [#143 ] No subject	2022-04-10 21:11:30 -07:00
OpenMetadata	20765f145a	GitBook: [#142 ] No subject	2022-04-10 21:11:30 -07:00
OpenMetadata	837a5a7a04	GitBook: [#141 ] No subject	2022-04-10 21:11:30 -07:00
OpenMetadata	6287a7435e	GitBook: [#140 ] No subject	2022-04-10 21:11:30 -07:00
OpenMetadata	58d2572ee7	GitBook: [#139 ] No subject	2022-04-10 21:11:30 -07:00
OpenMetadata	7f770179cf	GitBook: [#138 ] Refactor BigQuery Ingestion Workflow	2022-04-10 21:11:30 -07:00
OpenMetadata	3ff97e6dde	GitBook: [#135 ] Remove tabs for metadata ingestion	2022-04-10 21:11:30 -07:00

1 2 3 4 5 ...

416 commits