OpenMetadata/AGENTS.md
IceS2 e9c87c6adb
chore(ingestion): drop pylint, expand ruff (#27774)
* chore(ingestion): drop pylint, expand ruff to Stage 2c

Replace pylint with a coherent ruff-only stack (Stage 2c of the modernize
roadmap). Pylint is dropped from dev deps and CI workflows; ruff selected
ruleset expanded to ~22 families covering style, bug catchers, hygiene,
and the pylint port (PLE/PLC/PLW/PLR with the noisy "too-many-X"
complexity caps + magic-value disabled).

What's selected (with rationale in pyproject.toml):
  E, W, F, I, N         — style + correctness baseline + naming
  UP                    — pyupgrade (py>=3.10 modernizations)
  B, C4, C90, RET, SIM, TRY  — bug catchers
  PIE, ICN, T20, TC, TID, PTH, PERF  — hygiene
  PLE, PLC, PLW, PLR    — pylint port (PLR complexity caps ignored)
  RUF                   — ruff-native (incl. RUF100 unused-noqa)

What's removed:
  - .pylintrc (root) — duplicate of the ingestion pylint config
  - [tool.pylint.*] block in ingestion/pyproject.toml (~140 lines)
  - ingestion/plugins/{print_checker,import_checker}.py + tests + README
    (replaced by built-in T20 + TID251 banned-api respectively)
  - pylint dep from ingestion/setup.py and openmetadata-airflow-apis/pyproject.toml
  - `make lint` Makefile target + the pylint invocation in py_format_check
  - dead pylint TODO comment + ignored test entry in noxfile.py

Cwd-stable config: ruff is invoked both from the repo root (pre-commit,
CI) and from ingestion/ (`make py_format_check`). The `src`,
`extend-exclude`, and per-file-ignores entries are listed twice — once
relative to ingestion/ and once with the `ingestion/` prefix — so
first-party isort detection and exclusions match in both invocations.

Grandfathering: ran `ruff check --add-noqa` once + format-stable
iteration. ~12,130 noqa directives across ~1,400 files. Cleanup is
deferred to follow-up PRs that drop noqas one rule at a time.

Documentation sweep: replaced `make lint` references in CLAUDE.md,
AGENTS.md, DEVELOPER.md, copilot-instructions, and 6 SKILL files with
the apply+verify shape `make py_format && make py_format_check`.
`make py_format` is NOT a strict superset of pylint: it only applies
fixes for auto-fixable violations; `make py_format_check` catches the rest.

Basedpyright baseline regenerated: ruff format reflowed multi-line
signatures in ~70 files, shifting type-error column positions. The
basedpyright baseline matches by (file path, error code, range), so
column shifts caused 19 entries to mis-align. Net diff is small
(154 lines in/out of the 13MB baseline.json) — purely positional.

Verified locally:
  - make py_format_check         → All checks passed
  - nox --no-venv -s static-checks → 0 errors, 0 warnings, 0 notes

* chore(ingestion): finish ruff swap — nox lint session + skill docs

Three remaining stale-tooling references after Stage 2c:

  - `ingestion/noxfile.py` `lint` session was still calling `black --check`,
    `isort --check-only`, `pycln --diff`. Those tools aren't installed
    anywhere (we dropped them from dev deps). Replace with the ruff
    equivalents that mirror `make py_format_check`.
  - `skills/standards/code_style.md`: stack listed as `black + isort +
    pycln`; line length claimed 88 (black default). Both wrong: stack is
    ruff, line length is 120.
  - `skills/connector-building/SKILL.md`: `make py_format` comment said
    `# black + isort + pycln`. Same swap.

* chore(ingestion): keep main's baseline + globally ignore TRY400

Per gitar-bot's review on PR #27774:

1. Main's PR #27728 promoted ~60 `logger.warning()` → `logger.error()`
   inside `except` blocks. Those changes landed on main with their own
   baseline updates. Our PR doesn't promote anything — the merge from
   origin/main brought those `error` calls along with their baseline
   entries.

   The bot interpreted the `# noqa: TRY400` we added next to those lines
   as us silencing the rule case-by-case. Cleaner: globally ignore
   TRY400 in pyproject.toml, with a comment explaining why the codebase's
   `logger.error(...)` + separate `logger.debug(traceback.format_exc())`
   pattern is intentional. Strip ~430 per-line `# noqa: TRY400` markers
   from source.

2. Document that `S101` in `per-file-ignores` is a forward-looking
   entry — flake8-bandit (`S`) is not yet selected, so the rule is
   no-op today; the entry stays so when `S` lands later, tests don't
   immediately error.
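
A minimal sketch of the logging pattern item 1 refers to, where the
message goes to `logger.error(...)` and the traceback goes to a separate
`logger.debug(traceback.format_exc())` call. The `parse_port` function and
the logger name are hypothetical, chosen only to illustrate the shape that
TRY400 would otherwise flag:

```python
import logging
import traceback
from typing import Optional

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("example.ingestion")


def parse_port(raw: str) -> Optional[int]:
    """Return the port as an int, or None when the value is not numeric."""
    try:
        return int(raw)
    except ValueError as exc:
        # Full traceback is kept at debug level so normal logs stay short.
        logger.debug(traceback.format_exc())
        # TRY400 would suggest logger.exception() here; the codebase
        # deliberately uses logger.error() for the concise message.
        logger.error("Could not parse port [%s]: %s", raw, exc)
        return None
```

With the rule ignored globally, lines like the `logger.error` call above
need no per-line `# noqa: TRY400` marker.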

Reverts the platform pin and Linux Docker–generated baseline. Keep
main's baseline intact and let CI surface the exact column-shifted
entries; the team will decide whether to fix in-place (revert format
on affected files) or add per-line `# pyright: ignore` markers.

* chore(ingestion): regen baseline for new connector type debt

Main's baseline was stale relative to recently-added connectors
(McpConnection, CustomDriveConnection) that lack common attributes
like `hostPort`, `database`, `catalog` etc. — all sites that access
those attributes via the union-typed `serviceConnection.root.config`
fire `reportAttributeAccessIssue` errors that aren't baselined.

71 errors + 58 warnings absorbed. Local macOS regen; pushing to see
CI's drift count. Per the basedpyright-baseline-and-ci PR experience,
macOS↔Linux column drift on this size of regen has historically been
1-7 residuals.
2026-04-28 07:21:59 +02:00

AGENTS.md

This file provides guidance to Codex (Codex.ai/code) when working with code in this repository.

About OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance. This is a multi-module project with Java backend services, React frontend, Python ingestion framework, and comprehensive Docker infrastructure.

Architecture Overview

  • Backend: Java 21 + Dropwizard REST API framework, multi-module Maven project
  • Frontend: React + TypeScript + Ant Design, built with Webpack and Yarn
  • Ingestion: Python 3.10-3.12 with Pydantic 2.x, 75+ data source connectors
  • Database: MySQL (default) or PostgreSQL with Flyway migrations
  • Search: Elasticsearch 7.17+ or OpenSearch 2.6+ for metadata discovery
  • Infrastructure: Apache Airflow for workflow orchestration

Essential Development Commands

Prerequisites and Setup

make prerequisites              # Check system requirements
make install_dev_env           # Install all development dependencies
make yarn_install_cache        # Install UI dependencies

Frontend Development

cd openmetadata-ui/src/main/resources/ui
yarn start                     # Start development server on localhost:3000
yarn test                      # Run Jest unit tests
yarn test path/to/test.spec.ts # Run a specific test file
yarn test:watch               # Run tests in watch mode
yarn playwright:run            # Run E2E tests
yarn lint                      # ESLint check
yarn lint:fix                  # ESLint with auto-fix
yarn build                     # Production build

Backend Development

mvn clean package -DskipTests  # Build without tests
mvn clean package -DonlyBackend -pl '!openmetadata-ui'  # Backend only (quoted so the shell does not expand "!")
mvn test                       # Run unit tests
mvn verify                     # Run integration tests
mvn spotless:apply             # Format Java code

Python Ingestion Development

cd ingestion
make install_dev_env           # Install in development mode
make generate                  # Generate Pydantic models from JSON schemas
make unit_ingestion_dev_env    # Run unit tests
make py_format                 # Apply ruff lint-fix + format
make py_format_check           # Verify lint + format (matches CI; catches non-auto-fixable issues)
make static-checks             # Run type checking with basedpyright

Full Local Environment

./docker/run_local_docker.sh -m ui -d mysql        # Complete local setup with UI
./docker/run_local_docker.sh -m no-ui -d postgresql # Backend only with PostgreSQL
./docker/run_local_docker.sh -s true               # Skip Maven build step

Testing

make run_e2e_tests             # Full E2E test suite
make unit_ingestion            # Python unit tests with coverage
yarn test:coverage             # Frontend test coverage

Code Generation and Schemas

OpenMetadata uses a schema-first approach with JSON Schema definitions driving code generation:

make generate                  # Generate all models from schemas
make py_antlr                  # Generate Python ANTLR parsers
make js_antlr                  # Generate JavaScript ANTLR parsers
yarn parse-schema              # Parse JSON schemas for frontend (connection and ingestion schemas)

Schema Architecture

  • Source schemas in openmetadata-spec/ define the canonical data models
  • Connection schemas are pre-processed at build time via parseSchemas.js to resolve all $ref references
  • Application schemas in openmetadata-ui/.../ApplicationSchemas/ are resolved at runtime using schemaResolver.ts
  • JSON schemas with $ref references to external files require resolution before use in forms

Key Directories

  • openmetadata-service/ - Core Java backend services and REST APIs
  • openmetadata-ui/src/main/resources/ui/ - React frontend application
  • ingestion/ - Python ingestion framework with connectors
  • openmetadata-spec/ - JSON Schema specifications for all entities
  • bootstrap/sql/ - Database schema migrations and sample data
  • conf/ - Configuration files for different environments
  • docker/ - Docker configurations for local and production deployment

Development Workflow

  1. Schema Changes: Modify JSON schemas in openmetadata-spec/, then run mvn clean install on openmetadata-spec to update models
  2. Backend: Develop in Java using Dropwizard patterns, test with mvn test, format with mvn spotless:apply
  3. Frontend: Use React/TypeScript with Ant Design components, test with Jest/Playwright
  4. Ingestion: Python connectors follow plugin pattern, use make install_dev_env for development
  5. Full Testing: Use make run_e2e_tests before major changes

Frontend Architecture Patterns

React Component Patterns

  • File Naming: Components use ComponentName.component.tsx, interfaces use ComponentName.interface.ts
  • State Management: Use useState with proper typing, avoid any
  • Side Effects: Use useEffect with proper dependency arrays
  • Performance: Use useCallback for event handlers, useMemo for expensive computations
  • Custom Hooks: Prefix with use, place in src/hooks/, return typed objects
  • Internationalization: Use useTranslation hook from react-i18next, access with t('key')
  • Component Structure: Functional components only, no class components
  • Props: Define interfaces for all component props, place in .interface.ts files
  • Loading States: Use object state for multiple loading states: useState<Record<string, boolean>>({})
  • Error Handling: Use showErrorToast and showSuccessToast utilities from ToastUtils
  • Navigation: Use useNavigate from react-router-dom, not direct history manipulation
  • Data Fetching: Async functions with try-catch blocks, update loading states appropriately

State Management

  • Use Zustand stores for global state (e.g., useLimitStore, useWelcomeStore)
  • Keep component state local when possible with useState
  • Use context providers for feature-specific shared state (e.g., ApplicationsProvider)

Styling

  • MUI Migration: The project is gradually migrating from Ant Design to Material-UI (MUI) v7.3.1
  • Preferred Approach: Use MUI components v7.3.1 and styles wherever possible for new features
  • Theme and Styles: MUI theme data and styles are defined in openmetadata-ui-core-components
  • Colors and Design Tokens: Always reference theme colors and design tokens from the MUI theme, not hardcoded values
  • Legacy Components: Ant Design components remain in existing code but should be replaced with MUI equivalents when refactoring
  • Do not add unnecessary spacing between logs and code.
  • In Java, avoid wildcard imports (e.g., use import java.util.List; instead of import java.util.*;)
  • Custom styles in .less files with component-specific naming (legacy pattern)
  • Follow BEM naming convention for custom CSS classes
  • Use CSS modules where appropriate

UI considerations

  • Do not hardcode user-facing strings anywhere in the UI. Use the useTranslation hook (const { t } = useTranslation()) and render labels through it. For example, to display "Run", use {t('label.run')}; the label.run key is defined in the locale files.

Application Configuration

  • Applications use ApplicationsClassBase for schema loading and configuration
  • Dynamic imports handle application-specific schemas and assets
  • Form schemas use React JSON Schema Form (RJSF) with custom UI widgets

Service Utilities

  • Each service type has dedicated utility files (e.g., DatabaseServiceUtils.tsx)
  • Connection schemas are imported statically and pre-resolved
  • Service configurations use switch statements to map types to schemas

Type Safety

  • All API responses have generated TypeScript interfaces in generated/
  • Custom types extend base interfaces when needed
  • Avoid type assertions unless absolutely necessary
  • Use discriminated unions for action types and state variants

Database and Migrations

  • Flyway handles schema migrations in bootstrap/sql/migrations/
  • Use Docker containers for local database setup
  • MySQL is the default; PostgreSQL is supported as an alternative
  • Sample data loaded automatically in development environment

Security and Authentication

  • JWT-based authentication with OAuth2/SAML support
  • Role-based access control defined in Java entities
  • Security configurations in conf/openmetadata.yaml
  • Never commit secrets - use environment variables or secure vaults

Code Generation Standards

Comments Policy

  • Do NOT add unnecessary comments - write self-documenting code
  • NEVER add single-line comments that describe what the code obviously does
  • Only include comments for:
    • Complex business logic that isn't obvious
    • Non-obvious algorithms or workarounds
    • Public API JavaDoc documentation
    • TODO/FIXME with ticket references
  • Bad examples (NEVER do this):
    • // Create user before createUser()
    • // Get client before SdkClients.adminClient()
    • // Verify domain is set before assertNotNull(entity.getDomain())
    • // User names are lowercased when the code toLowerCase() makes it obvious
  • If the code needs a comment to be understood, refactor the code to be clearer instead

Java Code Requirements

  • Always mention running mvn spotless:apply when generating/modifying .java files
  • Use clear, descriptive variable and method names instead of comments
  • Follow existing project patterns and conventions
  • Generate production-ready code, not tutorial code
  • Create integration tests in openmetadata-integration-tests
  • Do not use fully qualified names in code (e.g., org.openmetadata.schema.type.Status); import the class and use its simple name
  • Do not use wildcard imports; import exactly the classes you need

TypeScript/Frontend Code Requirements

  • NEVER use any type in TypeScript code - always use proper types
  • Use unknown when the type is truly unknown and add type guards
  • Import types from existing type definitions (e.g., RJSFSchema from @rjsf/utils)
  • Follow ESLint rules strictly - the project enforces no-console, proper formatting
  • Add // eslint-disable-next-line comments only when absolutely necessary
  • Import Organization (in order):
    1. External libraries (React, Ant Design, etc.)
    2. Internal absolute imports from generated/, constants/, hooks/, etc.
    3. Relative imports for utilities and components
    4. Asset imports (SVGs, styles)
    5. Type imports grouped separately when needed

Python Code Requirements

  • Use pytest, not unittest - write tests using pytest style with plain assert statements
  • Use pytest fixtures for test setup instead of setUp/tearDown methods
  • Use unittest.mock for mocking (MagicMock, patch) - this is compatible with pytest
  • Test classes should not inherit from TestCase - use plain classes prefixed with Test
  • Use assert x == y instead of self.assertEqual(x, y)
  • Use assert x is None instead of self.assertIsNone(x)
  • Use assert "text" in string instead of self.assertIn("text", string)
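
A minimal sketch of the pytest style described above: a fixture for setup, unittest.mock for the stubbed dependency, a plain Test-prefixed class, and bare assert statements. UserLookup and its get_user client method are hypothetical names invented for this illustration, not part of the ingestion framework.

```python
from unittest.mock import MagicMock

import pytest


class UserLookup:
    """Hypothetical class under test: normalizes a user's display name."""

    def __init__(self, client) -> None:
        self.client = client

    def display_name(self, user_id: str):
        user = self.client.get_user(user_id)
        if user is None:
            return None
        return user["name"].strip().lower()


@pytest.fixture
def client():
    # Fixture replaces setUp/tearDown; the mock stands in for a remote client.
    mock = MagicMock()
    mock.get_user.return_value = {"name": "  Alice "}
    return mock


class TestUserLookup:  # plain class, does not inherit from TestCase
    def test_display_name_is_normalized(self, client):
        assert UserLookup(client).display_name("42") == "alice"

    def test_missing_user_returns_none(self, client):
        client.get_user.return_value = None
        assert UserLookup(client).display_name("42") is None
```

Saved as a test_*.py file, this would be collected and run by pytest without any unittest.TestCase machinery.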

Python Ingestion Connector Guidelines

  • Keep connector-specific logic in connector-specific files, not in generic/shared files like builders.py
  • Example: Redshift IAM auth should be in ingestion/src/metadata/ingestion/source/database/redshift/connection.py, not in ingestion/src/metadata/ingestion/connections/builders.py
  • This keeps the codebase modular and prevents generic utilities from becoming cluttered with connector-specific edge cases
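
One way to picture the split described above, as a sketch under stated assumptions: build_url stands in for a generic shared helper, while ExampleIamConfig, fetch_temporary_password, and get_connection_url stand in for code that would live in a connector's own connection.py. All of these names are hypothetical and do not mirror the actual Redshift connector or builders.py.

```python
from dataclasses import dataclass
from urllib.parse import quote_plus


# --- generic/shared module (hypothetical): knows nothing about any connector ---
def build_url(scheme: str, username: str, password: str, host_port: str, database: str) -> str:
    """Assemble a connection URL from already-resolved credentials."""
    return f"{scheme}://{quote_plus(username)}:{quote_plus(password)}@{host_port}/{database}"


# --- connector-specific module (hypothetical connection.py) ---
@dataclass
class ExampleIamConfig:
    username: str
    host_port: str
    database: str
    use_iam: bool = False


def fetch_temporary_password(config: ExampleIamConfig) -> str:
    """Placeholder for a connector-specific credential exchange."""
    return f"temp-token-for-{config.username}"


def get_connection_url(config: ExampleIamConfig, password: str) -> str:
    # The connector-specific branch lives here, not in the shared helper.
    if config.use_iam:
        password = fetch_temporary_password(config)
    return build_url("exampledb", config.username, password, config.host_port, config.database)
```

The shared helper stays free of connector-specific branches; each connector resolves its own edge cases before calling it.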

Testing Philosophy

  • Test real behavior, not mock wiring - if a test requires mocking 3+ classes just to verify a method call, it's testing the wrong thing
  • Prefer integration tests over heavily-mocked unit tests. This project has full integration test infrastructure (OpenMetadataApplicationTest, Docker containers, real OpenSearch). Use it.
  • Mocks are for boundaries, not internals - mock external services (HTTP clients, third-party APIs), not your own classes. If you're mocking static methods left and right to test internal plumbing, write an integration test instead.
  • A test that mocks everything proves nothing - it only verifies that your mocks are wired correctly, not that the system works
  • Ask "what breaks if this test passes but the code is wrong?" - if the answer is "nothing, because everything real is mocked out", delete the test and write a better one
  • Test the outcome, not the implementation - assert on observable results (API responses, database state, stats values) rather than verifying internal method calls with verify()
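
A small Python sketch of "mock the boundary, assert on the outcome". CatalogReport, SchemaCounter, and the /tables endpoint are hypothetical names used only for illustration. Only the HTTP client (the external boundary) is mocked; the internal counting logic runs for real, and the test asserts on the returned result rather than on which methods were called.

```python
from collections import Counter
from unittest.mock import MagicMock


class SchemaCounter:
    """Internal logic: exercised for real, never mocked."""

    def count(self, tables: list) -> dict:
        return dict(Counter(table["schema"] for table in tables))


class CatalogReport:
    def __init__(self, http_client) -> None:
        self.http_client = http_client  # external boundary
        self.counter = SchemaCounter()  # our own class

    def build(self) -> dict:
        tables = self.http_client.get("/tables")
        return self.counter.count(tables)


def test_report_counts_tables_per_schema():
    # Mock only the external service call.
    http_client = MagicMock()
    http_client.get.return_value = [
        {"schema": "sales", "name": "orders"},
        {"schema": "sales", "name": "customers"},
        {"schema": "hr", "name": "employees"},
    ]

    report = CatalogReport(http_client).build()

    # Assert on the observable outcome, not on internal method calls.
    assert report == {"sales": 2, "hr": 1}
```

If SchemaCounter were broken, this test would fail, which is the property the "what breaks if this test passes but the code is wrong?" question is probing for.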

Response Format

  • Provide clean code blocks without unnecessary explanations
  • Assume readers are experienced developers
  • Focus on functionality over education