* refactor(sampler): collapse SamplerInterface to a single config object
Replace the 9-parameter constructor/create() signature with a typed
SamplerConfig hierarchy (SamplerConfig / DatabaseSamplerConfig /
StorageSamplerConfig). Config resolution — partition_details,
sample_query, include/exclude columns, sample_config, sample_data_count
— now happens in callers (entity_adapters, profiler_source,
base_test_suite_source) before construction, so the interface only
receives already-resolved values.
- Add sampler_config.py with SamplerConfig dataclass hierarchy
- Remove database-specific imports from SamplerInterface base class
- Move SSL connection setup and column include/exclude filtering to
database-family subclasses (SQASampler, PandasSampler, NoSQLSampler)
- Simplify BigQuery/Postgres/Snowflake samplers to *args/**kwargs init
- Remove StorageSampler.create() override; base create() is sufficient
- Update profiler_source and base_test_suite_source to build
DatabaseSamplerConfig before calling sampler_class.create()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(sampler): fix build_sampler_kwargs example to use SamplerConfig
The non-database adapter example was showing the old flat kwargs pattern
(sample_config, sample_data_count) that SamplerInterface now silently
ignores via **__. Replace with the correct "config": SamplerConfig(...)
pattern that matches the actual ContainerAdapter implementation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(sampler): guard BigQuerySampler.tableType access with isinstance check
ClassifiableEntityType includes Container which has no tableType.
The *args/**kwargs init simplified the constructor but lost the
explicit Table type annotation, triggering a basedpyright error.
Guard with isinstance(self.entity, Table) so the type checker
knows tableType is only accessed on Table entities.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix tests
* Gitar bot feedback
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(api): make closeStream idempotent when log storage is not configured
closeStream used to throw IllegalStateException("Log storage is not
configured") which the resource layer translates to a 500 response.
That made the contract surprising for callers: any defensive cleanup
path (exit handlers, retry logic, generic teardown) had to know in
advance whether streaming was configured before calling close, or eat
spurious server errors.
Closing a stream is naturally idempotent — same shape as DELETE on a
non-existent resource. When log storage is not configured, return
silently with a debug log so callers can call close() defensively
without checking state first.
Adds a unit test covering the no-op path.
* Add design spec for streamable logs stability fix
Captures the design discussion for fixing partial.txt and logs.txt
clobber bugs in S3LogStorage when ingestion runs hit idle gaps longer
than the 5-minute stream timeout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add full design flow doc for streamable ingestion logs
End-to-end documentation of the streamable logs feature: architecture,
storage layout, run lifecycle, read paths, abandoned-run recovery,
configuration, concurrency model, and observability. Reflects the
post-fix design captured in the streamable-logs-stability spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add implementation plan for streamable-logs stability fix
Step-by-step TDD plan grouped into 8 PR-sized tasks: config schema
additions, per-stream lock, pendingFlush + merge-always flush, multipart
removal, sweeper rewrite, /close rewrite, read-path correction, and
integration tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(log-storage): add config fields for streamable-logs stability fix
Adds streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures. Deprecates
streamTimeoutMinutes in favor of streamTimeoutHours. Pure schema-only
change; no Java code consumes these fields yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): add deprecated:true keyword and clarify watermark unit
Addresses code review on Task 1: project convention uses the JSON Schema
deprecated keyword alongside description annotation. Also clarifies that
earlyFlushWatermarkBytes default (5242880) equals 5 MB.
* feat(log-storage): wire new stability-fix config fields into S3LogStorage
Reads streamTimeoutHours, cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures from
LogStorageConfiguration with sane defaults. No behavioral change yet —
values are stored but not consumed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): broaden streamTimeoutMinutes deprecation warning + drop FQN
Addresses code review on Task 2: warning now fires whenever
streamTimeoutMinutes is set (not only for values < 30 min), since the
field is deprecated for all deployments. Also imports java.lang.reflect.Field
in the test helper instead of using a fully-qualified name (CLAUDE.md
no-FQN rule).
* refactor(log-storage): add per-stream ReentrantLock for S3LogStorage
Introduces streamLocks map and acquire/release helpers. appendLogs,
writePartialLogsForStream, closeStream, and cleanupExpiredStreams all
serialize on the per-stream lock. No behavior change; locking is
pure mutual-exclusion at this point.
* fix(log-storage): close iterator.remove race in cleanupExpiredStreams
Move iterator.remove() inside the per-stream lock to prevent a window
where a concurrent appendLogs sees the still-present closed StreamContext
and writes to a closed stream. Also clarifies the comment on flush(fqn,runId)
ordering and documents that streamLocks accumulates monotonically until
Tasks 7 and 8 add cleanup.
* feat(log-storage): track pendingFlush queue and totalLinesAppended counter
Each appendLogs now also populates per-stream pendingFlush (lines awaiting
flush) and totalLinesAppended (monotonic logical line counter). State is
written but not yet consumed; the new flush logic in the next commit reads it.
* fix(log-storage): document thread-safety + lifecycle on Task 4 maps, add test
Addresses review on Task 4: documents that pendingFlush ArrayList values
may only be accessed under the per-stream lock; clarifies that
consecutiveFlushFailures is written and consumed in Task 5 (not just
consumed); aligns its type with AtomicInteger for consistency with
the other counters; adds a test for the trailing-newline trim path.
* fix(log-storage): merge-always partial.txt PUT and persist offset in S3 metadata
Replaces the old writePartialLogsForStream that skipped the read-merge step
when partialLogOffsets[streamKey] was 0 (the canonical 80MB->KB clobber bug).
The new flush always reads existing partial.txt, appends a snapshot of
pendingFlush, and PUTs with offset state in S3 user-defined metadata.
Also adds an early-flush watermark trigger so high-burst writes don't
pile up unbounded in pendingFlush.
Closes the partial.txt-clobber half of the streamable-logs-stability spec.
* fix(log-storage): replace task-number comments with intent-describing language
Addresses code review on Task 5: production code comments should describe
invariants, not the planning-doc task that filled the gap. Also clarifies
the parse-before-lock and the byte-counter atomicity assumption.
* refactor(log-storage): remove MultipartS3OutputStream, rewrite closeStream as server-side copy
appendLogs no longer initiates a multipart upload; bytes flow only through
pendingFlush -> partial.txt PUTs.
closeStream now: (1) drains pendingFlush via final partial.txt PUT,
(2) issues CopyObjectRequest from partial.txt to logs.txt server-side,
(3) deletes partial.txt and the .active marker, (4) drops in-memory state.
Idempotent: a second /close sees no partial.txt (NoSuchKeyException) and
returns gracefully.
Closes the logs.txt-clobber half of the streamable-logs-stability spec
and finalizes the canonical /close flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): plug listener/lock leaks, propagate SSE on copy, recover counter from metadata
Addresses code review on combined Tasks 6+8:
- dropStreamState now removes activeListeners entries (SSE listener leak fix).
- cleanupExpiredStreams now removes streamLocks entries on expire (lock leak fix).
- copyPartialToLogs applies SSE configuration to CopyObjectRequest (was unencrypted on copy).
- writePartialLogsForStreamLocked reads last-flushed-line metadata from existing
partial.txt and uses it to keep totalLinesAppended monotonic across restarts.
- consecutiveFlushFailures reset uses computeIfAbsent + set(0) instead of allocating
a new AtomicInteger every successful flush.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(log-storage): rewrite sweeper as cleanupAbandonedStreams (24h/1h)
Bumps the idle threshold from 5 min to streamTimeoutHours (default 24h)
and the poll interval from 1 min to cleanupIntervalMinutes (default 1h).
On expire, finalizes the abandoned run by copying partial.txt -> logs.txt
server-side, deleting partial.txt, and dropping in-memory state — same
end-state as closeStream.
Also wires partialFlushIntervalMinutes into the periodic flush schedule
and removes the legacy streamTimeoutMs field that no longer drives behavior.
* fix(log-storage): preserve streamLocks entry on cleanup retry path
Addresses code review on Task 7: streamLocks.remove was unconditionally
in the finally block of finalizeAbandonedStream, so it ran even when the
sweeper returned early to retry next tick on a copy failure. That meant
the next sweep tick would create a fresh ReentrantLock, and any
concurrent appendLogs in the meantime would contend on a different lock
object than the retry, defeating mutual exclusion.
Now we only remove the lock entry once finalization has succeeded
(after dropStreamState). The retry path leaves the lock in place so
the next tick and any concurrent appendLogs see the same lock identity.
* fix(log-storage): include pendingFlush snapshot in mid-run reads
getCombinedLogsForActiveStream now appends the in-memory pendingFlush
snapshot to the partial.txt body when reading mid-run, so the UI's
paginated GET surfaces the most recent tail even before the next
scheduled flush has happened.
Only appends pendingFlush when a partial.txt file exists, avoiding
duplication in the fallback path where recentLogsCache already
includes those lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): tighten Task 9 read path safety + invariant comment
Addresses review on Task 9: the unsafe null-lock fallback in the
pendingFlush append path is removed (it was structurally unreachable
but a latent hazard for future lifecycle changes). The pendingFlush
read now happens entirely under the per-stream lock, with a
conservative skip if no lock entry exists.
Also documents the recentLogsCache-vs-pendingFlush invariant in the
fallback path and adds a total-count assertion to the new test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(log-storage): add bug-reproducer ITs for streamable-logs stability
- testIdleGapDoesNotClobberPartial: two log bursts within an open run;
asserts both are present in the read response.
- testCloseProducesLogsTxtMatchingPartial: write, close, read; asserts
content survives the close.
- testCloseIsIdempotent: a second /close is a graceful no-op.
Tests are tolerant of the storage backend in the test environment
(DefaultLogStorage in CI may not persist; S3LogStorage in S3-configured
environments). Deep behavioral coverage is in S3LogStorageTest unit tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): address final-review critical bugs
- closeStream and finalizeAbandonedStream now propagate PUT failures
from writePartialLogsForStreamLocked (which returns boolean).
closeStream throws IOException; the sweeper retains state for retry.
Fixes silent data loss when the final flush PUT fails.
- streamLocks entries are no longer removed; this prevents an
acquire-vs-remove race that would break mutual exclusion. Memory
growth is bounded by maxConcurrentStreams in practice.
- cleanupAbandonedStreams re-checks expiration inside the per-stream
lock so a stream that was bumped by appendLogs between the scan
and the lock acquisition is not finalized.
- deleteLogs now acquires the per-stream lock before mutating state.
- getCombinedLogsForActiveStream appends pendingFlush in BOTH the
S3-found and memory-fallback branches, so reads aren't truncated
when recentLogsCache evicts oldest at its 1000-line cap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): use pendingFlush as canonical mid-run read source (no duplicates)
The previous Issue 5 fix appended pendingFlush unconditionally, which
caused duplicate lines in the read response when the fallback branch
used recentLogsCache (since both are populated by the same appendLogs).
Now: in the foundPartialFile branch, append pendingFlush AFTER the S3
body (non-overlapping by construction). In the fallback branch
(no partial.txt yet), use pendingFlush directly as the canonical
source — this is more complete than recentLogsCache (1000-line cap)
and avoids the duplicate issue. recentLogsCache remains a defensive
fallback for the rare case where pendingFlush is empty in the fallback
path.
* Update generated TypeScript types
* chore(log-storage): drop dead abortIncompleteMultipartUpload lifecycle rule
The multipart upload write path was removed; the bucket lifecycle's
abortIncompleteMultipartUpload(7 days) rule served only as migration
cleanup for in-flight uploads from the old code at deploy time. After
the migration window it does nothing.
Drops the rule from configureLifecyclePolicy, the AWS SDK import, the
"7 days multipart cleanup" string in the startup log, and the
corresponding bullet in docs/streamable-logs.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: ignore docs/superpowers/
Local-only working notes (specs, plans) live there and shouldn't be tracked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(log-storage): tolerate DefaultLogStorage in CI for streamable-logs ITs
CI runs the integration tests against the bootstrap config which uses
DefaultLogStorage (delegates to k8s/Airflow which isn't running). The
storage returns:
- "No pods found for this pipeline" sentinel for getLogs
- non-2xx status (the SDK wraps it as statusCode -1) for /close
Adjustments:
- testIdleGapDoesNotClobberPartial: parse JSON, only assert when total>0.
When storage actually persists (S3 deployments), assert BOTH bursts
are present — that's the real "no clobber" check.
- postClose helper: tolerate any exception from the close call
(idempotency is the contract; transient errors are acceptable).
The deep behavioural coverage continues to live in S3LogStorageTest unit
tests where mock S3 is the storage backend.
* test
* fix
* Update generated TypeScript types
* fix
* Update generated TypeScript types
* fix(log-storage): record UTF-8 byte length in partial.txt total-bytes metadata
String.length() returns UTF-16 code units; for non-ASCII content this
diverged from the actual S3 object size, breaking the drift cross-check
documented in docs/streamable-logs.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(log-storage): address PR review findings on S3LogStorage
Plumbs the documented timing knobs (cleanupIntervalMinutes, partialFlushIntervalMinutes,
earlyFlushWatermarkBytes, pendingFlushAlertAfterFailures) through LogStorageConfiguration
so operators can actually tune them. Replaces the unbounded streamLocks ConcurrentHashMap
with a Guava Striped<Lock> capped at 256 stripes, eliminating the per-(fqn, runId) memory
leak and the acquire-vs-remove race that a per-key map would have. Adds a Multipart
Upload + UploadPartCopy concatenation path for partial.txt >= 5 MB, avoiding the O(n^2)
total transfer and full in-JVM body merge that the prior GET+PUT-everything strategy hit
on long-running pipelines. Realigns docs/streamable-logs.md with the actual schema and
implementation, drops the broken superpowers/* spec link, and renames the misleading
testIdleGapDoesNotClobberPartial IT (which posted bursts back-to-back without simulating
any gap).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* refactor(sampler): introduce EntityAdapter to centralise per-entity classification logic
Replace scattered isinstance(entity, Table/Container) branches across
processor.py, pii/base_processor.py, patch_mixin.py, and metadata_rest.py
with a single EntityAdapter strategy pattern in sampler/entity_adapters.py.
Each adapter encodes get_columns, set_columns, patch_fields,
build_sampler_kwargs, pipeline_config_class, and service_type for one entity
type. _BY_ENTITY and _BY_PIPELINE registries make lookup O(1). Adding a new
classifiable entity now requires changes to entity_adapters.py only — no
other ingestion files need to change.
Also extracts build_database_service_conn_config into sampler/config_utils.py
and updates the developer guide accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Apply PR feedback
* Typing and text fixes
* Apply gitar bot feedback
* Fix tests
* Apply Gitar bot suggestions
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Add schema support for container auto-classification
Extend container entity schema to support sample data storage, enabling
PII detection and classification workflows on storage service containers.
Changes:
- Add sampleData field to container.json for storing sample data
- Create storageServiceAutoClassificationPipeline.json schema defining
configuration for storage service auto-classification pipelines
- Update workflow.json to include StorageServiceAutoClassificationPipeline
as a supported pipeline type
This provides the schema foundation for running auto-classification
workflows on S3, GCS, and other storage service containers.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add backend support for container sample data and classification
Implement Java backend functionality to handle sample data ingestion,
storage, and PII masking for container entities.
Changes:
- ContainerRepository: Add sample data retrieval and storage operations
- EntityRepository: Extend sample data support to container entities
- ContainerResource: Add REST endpoint for container sample data ingestion
- PIIMasker: Extend PII masking to support container entities
This enables the backend to process and store sample data from storage
service containers and apply PII masking rules during data retrieval.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Extend classifiable entity types to include containers
Add Container to the ClassifiableEntityType union, enabling PII detection
and auto-classification workflows to process storage service containers
alongside database tables.
Changes:
- Update ClassifiableEntityType from Table-only to Union[Table, Container]
- Import Container entity type
- Update module docstring to reflect current support
This type extension allows the PII processor to handle both database
tables and storage containers uniformly.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add container sample data ingestion to OpenMetadata API
Implement container-specific API mixin for sample data operations and
integrate it into the main OpenMetadata client.
Changes:
- Add OMetaContainerMixin with ingest_container_sample_data method
- Handle binary data encoding (base64) and serialization errors
- Register mixin in OpenMetadata class hierarchy
- Mirror table sample data ingestion patterns for consistency
This provides the Python API layer for ingesting sample data from
storage service containers into OpenMetadata.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Implement storage service samplers for S3 and GCS
Add sampler implementations for storage services to extract sample data
from structured containers (Parquet, CSV) for auto-classification.
Changes:
- Create base StorageSamplerInterface for storage service sampling
- Implement S3Sampler for AWS S3 containers with structured file support
- Implement GCSSampler for Google Cloud Storage containers
- Support column extraction and data sampling for structured formats
- Handle dataModel-based column definitions from containers
Storage samplers read container metadata, fetch file contents, and
generate sample datasets for downstream PII detection.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update PII processor to support container entities
Extend the base PII processor to handle both Table and Container
entities with unified column extraction logic.
Changes:
- Add _get_entity_columns helper to extract columns from Table or Container
- Handle Container entities with optional dataModel.columns structure
- Improve column matching with safe fallback for missing columns
- Use generic entity reference in error reporting
- Add early return when entity has no columns to process
This enables PII detection to run on storage containers the same way
it processes database tables.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add storage service support to sampler processor
Extend the sampler processor to handle both database and storage service
entities with appropriate sampler class selection.
Changes:
- Detect service type from source config (Database vs Storage)
- Import StorageServiceAutoClassificationPipeline
- Handle both Table and Container entity types in _run method
- Add column validation for Container entities (via dataModel.columns)
- Create storage-specific sampler interfaces for S3 and GCS
- Update sampler_interface to support Container entities
- Improve error messages with entity type context
The processor now dynamically selects database or storage samplers based
on the pipeline configuration type.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add storage fetcher strategy for container classification
Implement fetcher strategy pattern for storage services to retrieve
containers for auto-classification workflows.
Changes:
- Add StorageFetcherStrategy to handle storage service entity fetching
- Update EntityFetcher to select appropriate strategy based on service type
- Support both DatabaseService and StorageService in strategy selection
- Import StorageService type for service detection
- Improve error messages with specific service type information
The fetcher now dynamically creates database or storage-specific
strategies to retrieve entities based on pipeline configuration.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Register auto-classification pipeline in storage service specs
Add AutoClassification pipeline support to S3 and GCS storage service
specifications, enabling UI and workflow registration.
Changes:
- Add AutoClassification to S3ServiceSpec supported pipelines
- Add AutoClassification to GCSServiceSpec supported pipelines
- Import StorageServiceAutoClassificationPipeline in both specs
This registers the auto-classification workflow type for storage
services in the ingestion framework's service registry.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add container support to metadata sink and patch operations
Extend metadata sink and patch mixin to handle container entities,
enabling sample data ingestion and tag updates for containers.
Changes:
- Add Container to MetadataRestSink entity type handling
- Implement container sample data ingestion in sink._run
- Add Container to PatchMixin tag operations
- Import Container entity type in both modules
This completes the metadata ingestion pipeline by allowing the sink
to persist sample data and classification tags for container entities.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update classification workflow for storage service support
Extend the auto-classification workflow to handle both database and
storage service pipelines with unified step orchestration.
Changes:
- Import StorageServiceAutoClassificationPipeline
- Add type checking for both Database and Storage pipeline configs
- Remove unnecessary cast, use direct type checks
- Add validation warning for unsupported config types
- Preserve enableAutoClassification flag behavior for both types
The workflow now supports running PII detection and classification
on both database tables and storage containers based on config type.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add unit tests for container classification components
Add test coverage for container-specific fetcher and sampler components.
Changes:
- Add test_container_fetcher.py for StorageFetcherStrategy tests
- Add test_container_sampler_processor.py for container sampler tests
Tests validate:
- Storage service fetcher strategy selection and instantiation
- Container sampler processor initialization and execution
- Proper handling of Container entities vs Table entities
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Reorganize integration tests by entity type
Restructure auto-classification integration tests into separate
directories for databases and containers to improve organization.
Changes:
- Move database classification tests to databases/ subdirectory
- Move conftest.py, init.sql, and test_tag_processor.py into databases/
- Container tests already organized in containers/ subdirectory
- Remove old flat test structure
This organization makes it clearer which tests target database entities
vs storage container entities in classification workflows.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Properly retrieve sample data
* Update generated TypeScript types
* Apply Gitar bot
* Fix tests
* feat: Add supportsProfiler to storage connection schemas
Add supportsProfiler field to storage connection schemas (S3, GCS, ADLS,
Custom Storage) to enable auto-classification pipeline support for storage
services. This aligns with the backend changes in PR #26495 that added
container auto-classification functionality.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* feat: Add UI support for storage service auto-classification
- Update IngestionWorkflowUtils to route storage services to storage-specific
auto-classification schema
- Modify getSupportedPipelineTypes to filter pipeline types based on service
category (storage services only show AutoClassification, not Profiler)
- Update AddIngestionButton to pass serviceCategory parameter
- Add unit test to verify storage services only get AutoClassification option
This enables users to configure and run auto-classification agents on storage
services (S3, GCS, ADLS) for PII detection on containers.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: Add BucketArn field to S3BucketResponse model
AWS S3 API now returns a BucketArn field in list_buckets() responses.
Add this optional field to prevent Pydantic extra_forbidden validation errors.
Error: BucketArn Extra inputs are not permitted [type=extra_forbidden]
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: Add Container permissions to AutoClassificationBotPolicy
Add Container entity permissions to AutoClassificationBotPolicy to allow the
autoClassification-bot to apply tags and sample data to storage containers.
Previously, the bot only had permissions for Table entities, causing
permission denied errors when running auto-classification on storage services.
Changes:
- Add Container rule with EditAll and ViewAll operations to policy seed data
- Create migrations for MySQL and PostgreSQL to update existing installations
Error fixed: Principal: CatalogPrincipal{name='autoclassification-bot'}
operations [EditTags] not allowed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update generated TypeScript types
* fix: Add fallback for storage service type detection in sampler
Add fallback logic to detect storage services by source type name when
the pipeline config type check fails. This handles cases where the Airflow
environment might not have the updated schema/package with
StorageServiceAutoClassificationPipeline.
Changes:
- Add fallback detection for s3, gcs, azuredatalake, customstorage
- Add debug logging for service type detection
- Preserve primary instanceof check for proper type detection
This fixes the "No module named 'metadata.ingestion.source.database.gcs'"
error when running storage auto-classification pipelines.
* Guide to support new entities in classification agent
* docs: Update auto-classification guide with debugging learnings
Add critical troubleshooting information discovered during container
classification debugging:
1. storeSampleData defaults to false
- Sample data NOT ingested unless explicitly enabled
- Document why this is by design (avoid large datasets)
- Add troubleshooting steps to verify flag is set
2. Service type detection fallback pattern
- Explain why fallback is needed (Airflow package caching)
- Show complete implementation with source type lists
- Add debug logging pattern
3. Troubleshooting section
- Sample data not appearing: check storeSampleData, database, logs
- Module import errors: service type detection issues
- PII tags not applied: config and data issues
4. Common pitfalls additions
- Emphasize storeSampleData default value
- Service type detection in cached environments
These updates reflect real debugging scenarios and will help future
developers avoid the same issues.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Apply gitar bot suggestions
* Fix suggestions, linting, and SonarCloud issues
* More gitar bot suggestions
* Fix compile error
* Fix linting
* Fix broken tests
* Fix unorganized import
* Improve config parsing
This is so that we rightly discover polymorphic properties of `source` when the config does not provide enough fields for Pydantic to correctly discriminate between models (e.g: confusing database source config with storage source config)
* Gitar bot comment
* Fix s3 source test
* Apply comments from reviews
* Extract cantidate column logic in samplers
* Fix tests
* Fix container customization test
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* Glossary Term Relations
* Add GlossaryTerm Relations
* Add GlossaryTerm Relations, Add custom relations, onotolgoy explorer
* Add Translations
* Update generated TypeScript types
* Address comments
* Address comments
* Address comments
* Update generated TypeScript types
* Update yarn.lock after merging cytoscape dependencies from glossary_relations
* fix zoom in and out functionality and added missing translate keys
* fix test
* Remove unwanted changes
* nit
* nit
* nit
* Remove conflict test
* nit
* fix test
* Add test for ontology explorer
* New yarn lock and 2.0.0 schema changes missed during merge conflicts
* Revamped glossary term relation settings
* Refactor code
* Addressed comments
* nit
* Update generated TypeScript types
* Java Checkstyle and Yarn lock
* Update generated TypeScript types
* fix unit test
* Remove 2.0.0 migration folders placed at wrong loc
* Merge main
* fix navigation to relation graph in glossary
* fix ontology explorer spec
* Added filter support in the data mode
* Fix glossary term relation CI failures
### Canonical Relation Storage (GlossaryTermRepository)
* Introduced `computeCanonicalRelationType()` to normalize relation direction
using UUID ordering (lower UUID is always treated as "from")
* Prevents duplicate and inconsistent relation rows when created from either side
* Updated `setTermRelations()` and `addRelation()` to store canonical relation types
* Fixed `setFields()` read logic:
* Invert relation type for `fromRecords` (entity is the TO side)
* Keep `toRecords` unchanged
* Updated `deleteBidirectionalRelatedTo()` to match canonical storage format
* Added `RequestEntityCache.invalidate()` after relation mutations to ensure consistency
### Lazy RDF Resource Initialization
* Added `RdfRepository.getInstanceOrNull()` for null-safe access without throwing
* Refactored `RdfResource` constructor to avoid eager `RdfRepository.getInstance()` call
* Enabled resource registration even when Fuseki is not initialized
* Introduced lazy getters:
* `getRdfRepository()`
* `getSemanticSearchEngine()`
* Updated all endpoints to guard with null checks before `isEnabled()`
* Return `503 Service Unavailable` when RDF is not ready
### Graceful Test Degradation (Fuseki-dependent tests)
* Added `TestSuiteBootstrap.isFusekiEnabled()` to detect Fuseki availability
* `GlossaryOntologyExportIT`:
* Falls back to Testcontainers-based local Fuseki when bootstrap Fuseki is unavailable
* `GlossaryTermRelationIT`:
* Skipped via `assumeTrue` when Fuseki is unavailable
* `MetricResourceIT`:
* Skips RDF-specific tests when Fuseki is unavailable
* fix package conflicts
* nit
* Fix merge conflicts, Python test, RDF reliability, and VectorDocBuilder tests
- Fix Python test_patch_glossary_term_related_terms to use TermRelation
instead of EntityReferenceList (schema changed relatedTerms type)
- Rewrite VectorDocBuilder tests for current buildEmbeddingFields API
- Improve JenaFusekiStorage retry logic to retry on all HTTP errors
- Increase Fuseki tmpfs size to prevent disk space exhaustion in tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix pycheck
* Address all 8 PR review findings
1. Add authorization check on getTermRelationGraph endpoint
2. Add null guard on getBaseUri() to prevent NPE
3. Add React key prop on RelatedTermTagButton in map renders
4. Mark RdfResource lazy-init fields as volatile for thread safety
5. Replace exception messages with generic errors in API responses
6. Unify DEFAULT_RELATION_TYPES between CSV and repository (10 types)
7. Add jitter backoff to deadlock retry in CollectionDAO
8. Replace N+1 queries in prefetchGraphTerms with batch fetch
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix Fuseki tmpfs exhaustion and GlossaryTermRelationIT double init
- Remove tmpfs size limit on Fuseki container to prevent disk exhaustion
- Guard RdfUpdater.initialize() in GlossaryTermRelationIT to skip if
already initialized by bootstrap
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix duplicate edges, null term NPE, and silent exception in graph builder
- Deduplicate edges in buildGraph() using edgesSeen set
- Skip TermRelation entries with null term references to prevent NPE
- Add warning log when glossary term relation settings fail to load
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix cardinality count after canonical swap and double-checked locking
- getRelationCount now matches inverse relation type for fromRecords
where the term is the target, fixing cardinality bypass after
bidirectional UUID canonicalization
- Use double-checked locking in RdfResource.getSemanticSearchEngine()
to prevent duplicate instance creation under concurrency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: anuj-kumary <anujf0510@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Ram Narayan Balaji <ramnarayanb3005@gmail.com>
Co-authored-by: Ram Narayan Balaji <81347100+yan-3005@users.noreply.github.com>
* Add design doc for search indexing stats redesign
Covers:
- Simplified 4-stage pipeline model (Reader, Process, Sink, Vector)
- Per-entity index promotion instead of batch promotion
- Alias management from indexMapping.json
- Payload-aware vector bulk processor
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Add Support for Per Entity Index Promotion
* Add UI Bit
* Add Lang
* Add AppLog View Test coverage
* Add Bathced Vector index querying
* Add Improvements for Vector to be async and also stats to be better handled
* Use Virtual Thread
* Use Virtual Thread
* Fix Tests
* Make reading stats easier
* Fixed Stats to be accurate
* Fix Stats getting null
* Fix partition worker stats
* Fix Reader Stats - final
* Update generated TypeScript types
* Make updates in 1.12.0
* Revert "Use Virtual Thread"
This reverts commit 4eb23374d1.
* Revert "Use Virtual Thread"
This reverts commit efe8d03b5d.
* Reapply "Use Virtual Thread"
This reverts commit d59cde18b2.
* Reapply "Use Virtual Thread"
This reverts commit 769e5710c3.
* Fix Final Update on stat
* - Add atomic alias swap
- remove unnecessary migration
* Fix Sonar test jest
* Fix Final Update on stat
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>