OpenMetadata/docs/csv-relation-types-plan.md

# CSV Import/Export Enhancement for Glossary Term Relations
## Problem Statement
Currently, the glossary CSV import/export only captures related term FQNs without the relation type:
- **Export**: Only exports FQNs like `Glossary.Term1;Glossary.Term2`
- **Import**: Hardcodes all relations to `"relatedTo"`
This causes data loss when:
1. A term has `synonym`, `broader`, `narrower`, or custom relation types
2. A CSV is exported and re-imported: every relation type becomes `"relatedTo"`
## Proposed Solution
### New CSV Format
**Format**: `relationType:termFQN` pairs separated by semicolons
**Examples**:
```csv
# New format with relation types
relatedTerms
synonym:Finance.Revenue;broader:Finance.Income;narrower:Finance.Net Revenue
# Backward compatible - no prefix defaults to "relatedTo"
relatedTerms
Finance.Revenue;Finance.Income
# Mixed format (new and legacy)
relatedTerms
synonym:Finance.Revenue;Finance.Income;broader:Finance.Gross Income
```
### Parsing Rules
1. If a value contains `:` and the text before the first `:` is a valid relation type → use that relation type; the text after the `:` is the term FQN
2. If there is no `:`, or the prefix is not a valid relation type → treat the whole value as the FQN and default to `"relatedTo"`
3. Valid relation types come from `glossaryTermRelationSettings` when available, otherwise from the default list below
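The rules above can be sketched as a small standalone parser for a single entry. This is only an illustration under stated assumptions, not the OpenMetadata implementation: `RelationEntryParser`, `ParsedRelation`, and the `validTypes` parameter are hypothetical names, and the set of valid types is passed in by the caller rather than read from `glossaryTermRelationSettings`.

```java
import java.util.Set;

// Hypothetical helper illustrating the parsing rules; not part of OpenMetadata.
final class RelationEntryParser {
  static final String DEFAULT_RELATION_TYPE = "relatedTo";

  // Simple value holder for one parsed entry.
  static final class ParsedRelation {
    final String relationType;
    final String termFqn;

    ParsedRelation(String relationType, String termFqn) {
      this.relationType = relationType;
      this.termFqn = termFqn;
    }
  }

  // Splits at the first ':' only. The prefix counts as a relation type only
  // when it appears in validTypes (assumed to come from settings or defaults).
  static ParsedRelation parse(String entry, Set<String> validTypes) {
    String trimmed = entry.trim();
    int colonIndex = trimmed.indexOf(':');
    if (colonIndex > 0) {
      String prefix = trimmed.substring(0, colonIndex).trim();
      if (validTypes.contains(prefix)) {
        return new ParsedRelation(prefix, trimmed.substring(colonIndex + 1).trim());
      }
    }
    // No colon, or an unrecognized prefix: the whole entry is the FQN.
    return new ParsedRelation(DEFAULT_RELATION_TYPE, trimmed);
  }
}
```

With this sketch, `parse("synonym:Finance.Revenue", types)` yields the `synonym` type with FQN `Finance.Revenue`, while `parse("Database:Schema.Term", types)` keeps the whole string as the FQN and falls back to `relatedTo`, provided `Database` is not in `types`.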
### Default Relation Types
| Relation Type | Description |
|---------------|-------------|
| `relatedTo` | Generic related term (default) |
| `synonym` | Equivalent term |
| `broader` | More general term |
| `narrower` | More specific term |
| `antonym` | Opposite meaning |
| `partOf` | Component of |
| `hasPart` | Contains |
## Implementation Plan
### Phase 1: Backend Changes
#### 1.1 CsvUtil.java - Export Enhancement
**File**: `openmetadata-service/src/main/java/org/openmetadata/csv/CsvUtil.java`
**Current** (lines 253-263):
```java
public static List<String> addTermRelations(
    List<String> csvRecord, List<TermRelation> termRelations) {
  csvRecord.add(
      nullOrEmpty(termRelations)
          ? null
          : termRelations.stream()
              .map(tr -> tr.getTerm().getFullyQualifiedName())
              .sorted()
              .collect(Collectors.joining(FIELD_SEPARATOR)));
  return csvRecord;
}
```
**New**:
```java
public static List<String> addTermRelations(
    List<String> csvRecord, List<TermRelation> termRelations) {
  csvRecord.add(
      nullOrEmpty(termRelations)
          ? null
          : termRelations.stream()
              .map(tr -> {
                String relationType = tr.getRelationType();
                String fqn = tr.getTerm().getFullyQualifiedName();
                // Only include relation type prefix if not the default "relatedTo"
                if (relationType != null && !relationType.equals("relatedTo")) {
                  return relationType + ":" + fqn;
                }
                return fqn;
              })
              .sorted()
              .collect(Collectors.joining(FIELD_SEPARATOR)));
  return csvRecord;
}
```
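To see what the new export would produce, here is a standalone sketch of the same mapping over plain string pairs. `RelationExportSketch` and the `String[]` pair representation are hypothetical stand-ins for `TermRelation`, and the separator is assumed to be `;` as in the CSV examples.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the export logic; uses {relationType, termFqn} pairs.
final class RelationExportSketch {
  static final String FIELD_SEPARATOR = ";"; // assumed, matches the CSV examples

  // relationType may be null; null and "relatedTo" are both written without a prefix.
  static String format(List<String[]> relations) {
    return relations.stream()
        .map(r -> (r[0] == null || r[0].equals("relatedTo")) ? r[1] : r[0] + ":" + r[1])
        .sorted()
        .collect(Collectors.joining(FIELD_SEPARATOR));
  }
}
```

For the pairs (`synonym`, `Finance.Revenue`), (`relatedTo`, `Finance.Income`), and (`broader`, `Finance.Gross Income`), this returns `Finance.Income;broader:Finance.Gross Income;synonym:Finance.Revenue`. Because `sorted()` runs after the prefix is attached, entries are ordered by the full prefixed string rather than by FQN alone, which is worth keeping in mind when comparing old and new exports.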
#### 1.2 GlossaryRepository.java - Import Enhancement
**File**: `openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/GlossaryRepository.java`
**Current** (lines 315-327):
```java
private List<TermRelation> getTermRelationsFromCsv(
    CSVPrinter printer, CSVRecord csvRecord, int fieldNumber) throws IOException {
  List<EntityReference> entityRefs =
      getEntityReferences(printer, csvRecord, fieldNumber, GLOSSARY_TERM);
  if (entityRefs == null) {
    return null;
  }
  List<TermRelation> termRelations = new ArrayList<>();
  for (EntityReference ref : entityRefs) {
    termRelations.add(new TermRelation().withTerm(ref).withRelationType("relatedTo"));
  }
  return termRelations;
}
```
**New**:
```java
private static final Set<String> VALID_RELATION_TYPES =
    Set.of("relatedTo", "synonym", "broader", "narrower", "antonym", "partOf", "hasPart");

private List<TermRelation> getTermRelationsFromCsv(
    CSVPrinter printer, CSVRecord csvRecord, int fieldNumber) throws IOException {
  String fieldValue = csvRecord.get(fieldNumber);
  if (nullOrEmpty(fieldValue)) {
    return null;
  }
  List<TermRelation> termRelations = new ArrayList<>();
  for (String rawEntry : fieldValue.split(FIELD_SEPARATOR)) {
    String entry = rawEntry.trim();
    if (entry.isEmpty()) {
      continue; // Skip empty entries such as those left by a doubled separator
    }
    String relationType = "relatedTo"; // Default
    String termFqn = entry;
    // Check for relationType:fqn format, splitting at the first colon only
    int colonIndex = entry.indexOf(':');
    if (colonIndex > 0) {
      String prefix = entry.substring(0, colonIndex).trim();
      String suffix = entry.substring(colonIndex + 1).trim();
      // Validate if prefix is a known relation type
      if (VALID_RELATION_TYPES.contains(prefix) || isCustomRelationType(prefix)) {
        relationType = prefix;
        termFqn = suffix;
      }
      // If prefix is not a valid relation type, treat entire string as FQN
      // (handles FQNs that contain colons like "Database:Schema.Table")
    } else if (colonIndex == 0) {
      // Empty relation type (":Glossary.Term"): drop the colon, keep the default
      termFqn = entry.substring(1).trim();
    }
    EntityReference termRef = getEntityReference(printer, csvRecord, GLOSSARY_TERM, termFqn);
    if (termRef != null) {
      termRelations.add(new TermRelation().withTerm(termRef).withRelationType(relationType));
    }
  }
  return termRelations.isEmpty() ? null : termRelations;
}

private boolean isCustomRelationType(String relationType) {
  // Placeholder: check against glossaryTermRelationSettings for custom relation types.
  // Until the settings lookup is implemented, no custom type is recognized.
  try {
    // Fetch from settings cache or use default list
    return false; // Implement based on settings lookup
  } catch (Exception e) {
    return false;
  }
}
```
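The `isCustomRelationType` placeholder above always returns `false`, so until it is implemented a custom prefix is treated as part of the FQN. One possible shape for the lookup is sketched below. Everything in it is an assumption: `RelationTypeResolver` is a hypothetical class, and the `Supplier` stands in for whatever code ends up reading `glossaryTermRelationSettings`; no real OpenMetadata settings API is used.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch: the supplier stands in for a settings lookup.
final class RelationTypeResolver {
  private static final Set<String> DEFAULT_TYPES =
      Set.of("relatedTo", "synonym", "broader", "narrower", "antonym", "partOf", "hasPart");

  private final Supplier<Set<String>> customTypes;

  RelationTypeResolver(Supplier<Set<String>> customTypes) {
    this.customTypes = customTypes;
  }

  // True for a default type, or for a custom type reported by the supplier.
  boolean isValid(String relationType) {
    if (relationType == null || relationType.isEmpty()) {
      return false;
    }
    if (DEFAULT_TYPES.contains(relationType)) {
      return true;
    }
    try {
      Set<String> custom = customTypes.get();
      return custom != null && custom.contains(relationType);
    } catch (RuntimeException e) {
      // A failed lookup falls back to the defaults only.
      return false;
    }
  }
}
```

Injecting the lookup keeps the parser testable without a running settings store, and a lookup failure degrades to the default list instead of failing the whole import.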
#### 1.3 Documentation Update
**File**: `openmetadata-service/src/main/resources/json/data/glossary/glossaryCsvDocumentation.json`
Update the `relatedTerms` field documentation:
```json
{
  "name": "relatedTerms",
  "required": false,
  "description": "Related glossary terms with optional relation types. Format: 'relationType:FQN' or just 'FQN'. Multiple values separated by ';'. Valid relation types: relatedTo (default), synonym, broader, narrower, antonym, partOf, hasPart. Example: 'synonym:Glossary.Term1;broader:Glossary.Term2;Glossary.Term3'",
  "examples": [
    "Glossary.Term1;Glossary.Term2",
    "synonym:Glossary.Term1;broader:Glossary.Term2",
    "synonym:Glossary.Revenue;Glossary.Income;narrower:Glossary.Net Revenue"
  ]
}
```
### Phase 2: Testing
#### 2.1 Unit Tests
**File**: `openmetadata-service/src/test/java/org/openmetadata/csv/CsvUtilTest.java`
```java
@Test
void testAddTermRelationsWithRelationType() {
  // Test that relation types are included in export
}

@Test
void testAddTermRelationsDefaultRelationType() {
  // Test that "relatedTo" terms don't include prefix
}
```
#### 2.2 Integration Tests
**File**: `openmetadata-service/src/test/java/org/openmetadata/service/resources/glossary/GlossaryTermResourceTest.java`
```java
@Test
void testGlossaryTermCsvImportWithRelationTypes() {
  // Test importing CSV with relation type prefixes
}

@Test
void testGlossaryTermCsvExportWithRelationTypes() {
  // Test exporting terms with various relation types
}

@Test
void testGlossaryTermCsvBackwardCompatibility() {
  // Test importing old format CSV (no relation types)
}

@Test
void testGlossaryTermCsvRoundTripWithRelationTypes() {
  // Test that export -> import preserves relation types
}
```
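The round-trip property these tests target can be expressed without any OpenMetadata classes. The sketch below is a simplified model under stated assumptions: `RelationRoundTripSketch` is a hypothetical class, relations are plain `{relationType, termFqn}` string pairs, only the seven default types are recognized, and `;` is assumed as the field separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model of export followed by import; not the real repository code.
final class RelationRoundTripSketch {
  static final Set<String> TYPES =
      Set.of("relatedTo", "synonym", "broader", "narrower", "antonym", "partOf", "hasPart");

  // Export side: {relationType, termFqn} pairs to one CSV field.
  static String export(List<String[]> relations) {
    return relations.stream()
        .map(r -> "relatedTo".equals(r[0]) ? r[1] : r[0] + ":" + r[1])
        .sorted()
        .collect(Collectors.joining(";"));
  }

  // Import side: one CSV field back to {relationType, termFqn} pairs.
  static List<String[]> parse(String field) {
    List<String[]> result = new ArrayList<>();
    for (String entry : field.split(";")) {
      String trimmed = entry.trim();
      if (trimmed.isEmpty()) {
        continue;
      }
      int colon = trimmed.indexOf(':');
      String prefix = colon > 0 ? trimmed.substring(0, colon).trim() : "";
      if (TYPES.contains(prefix)) {
        result.add(new String[] {prefix, trimmed.substring(colon + 1).trim()});
      } else {
        result.add(new String[] {"relatedTo", trimmed});
      }
    }
    return result;
  }

  // True when export followed by parse yields the same (type, FQN) pairs, ignoring order.
  static boolean survivesRoundTrip(List<String[]> relations) {
    Set<String> before =
        relations.stream().map(r -> r[0] + "|" + r[1]).collect(Collectors.toSet());
    Set<String> after =
        parse(export(relations)).stream().map(r -> r[0] + "|" + r[1]).collect(Collectors.toSet());
    return before.equals(after);
  }
}
```

In this model, typed relations such as (`synonym`, `Finance.Revenue`) survive the round trip. A `relatedTo` relation whose FQN itself begins with a valid type followed by a colon (for example `synonym:Odd.Term`) does not, because the unprefixed export is indistinguishable from a typed entry. A real round-trip test may want to cover that case explicitly.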
### Phase 3: Edge Cases
1. **FQN contains colon**: Handle cases like `Database:Schema.Term` by validating the prefix against known relation types
2. **Invalid relation type**: If prefix is not a valid relation type, treat entire string as FQN with default `relatedTo`
3. **Empty relation type**: `":Glossary.Term"` should default to `relatedTo`
4. **Custom relation types**: Check against `glossaryTermRelationSettings` for user-defined relation types
### Backward Compatibility
| CSV Format | Import Behavior |
|------------|----------------|
| `Glossary.Term1;Glossary.Term2` | All relations → `relatedTo` |
| `synonym:Glossary.Term1;Glossary.Term2` | First → `synonym`, Second → `relatedTo` |
| `synonym:Glossary.Term1;broader:Glossary.Term2` | Preserves both relation types |
### Files to Modify
| File | Change |
|------|--------|
| `CsvUtil.java` | Update `addTermRelations()` to include relation type prefix |
| `GlossaryRepository.java` | Update `getTermRelationsFromCsv()` to parse relation types |
| `glossaryCsvDocumentation.json` | Update field documentation and examples |
| `GlossaryTermResourceTest.java` | Add tests for new format |
| `CsvUtilTest.java` | Add unit tests for export formatting in `addTermRelations()` |
### Migration Notes
- **No database migration needed**: The database already stores relation types correctly
- **Existing CSVs**: Will continue to work (all imported as `relatedTo`)
- **New exports**: Will include relation type prefixes for non-default relations
## Summary
This enhancement:
1. ✅ Preserves relation types during CSV export/import
2. ✅ Maintains backward compatibility with existing CSVs
3. ✅ Defaults to `relatedTo` when no relation type specified
4. ✅ Follows existing OpenMetadata CSV patterns (`type:value`)
5. ✅ Supports custom relation types via settings