mirror of https://github.com/open-metadata/OpenMetadata synced 2026-05-24 09:39:11 +00:00

chore(ingestion): drop pylint, expand ruff (#27774)

* chore(ingestion): drop pylint, expand ruff to Stage 2c

Replace pylint with a coherent ruff-only stack (Stage 2c of the modernize
roadmap). Pylint is dropped from dev deps and CI workflows; ruff selected
ruleset expanded to ~22 families covering style, bug catchers, hygiene,
and the pylint port (PLE/PLC/PLW/PLR with the noisy "too-many-X"
complexity caps + magic-value disabled).

What's selected (with rationale in pyproject.toml):
  E, W, F, I, N         — style + correctness baseline + naming
  UP                    — pyupgrade (py>=3.10 modernizations)
  B, C4, C90, RET, SIM, TRY  — bug catchers
  PIE, ICN, T20, TC, TID, PTH, PERF  — hygiene
  PLE, PLC, PLW, PLR    — pylint port (PLR complexity caps ignored)
  RUF                   — ruff-native (incl. RUF100 unused-noqa)

What's removed:
  - .pylintrc (root) — duplicate of the ingestion pylint config
  - [tool.pylint.*] block in ingestion/pyproject.toml (~140 lines)
  - ingestion/plugins/{print_checker,import_checker}.py + tests + README
    (replaced by built-in T20 + TID251 banned-api respectively)
  - pylint dep from ingestion/setup.py and openmetadata-airflow-apis/pyproject.toml
  - `make lint` Makefile target + the pylint invocation in py_format_check
  - dead pylint TODO comment + ignored test entry in noxfile.py

Cwd-stable config: ruff is invoked both from the repo root (pre-commit,
CI) and from ingestion/ (`make py_format_check`). The `src`,
`extend-exclude`, and per-file-ignores entries are listed twice — once
relative to ingestion/ and once with the `ingestion/` prefix — so
first-party isort detection and exclusions match in both invocations.

Grandfathering: ran `ruff check --add-noqa` once + format-stable
iteration. ~12,130 noqa directives across ~1,400 files. Cleanup is
deferred to follow-up PRs that drop noqas one rule at a time.

Documentation sweep: replaced `make lint` references in CLAUDE.md,
AGENTS.md, DEVELOPER.md, copilot-instructions, and 6 SKILL files with
the apply+verify shape `make py_format && make py_format_check`.
`make py_format` is NOT a strict superset of pylint — it only applies
auto-fixable violations; `make py_format_check` catches the rest.

Basedpyright baseline regenerated: ruff format reflowed multi-line
signatures in ~70 files, shifting type-error column positions. The
basedpyright baseline matches by (file path, error code, range), so
column shifts caused 19 entries to mis-align. Net diff is small
(154 lines in/out of the 13MB baseline.json) — purely positional.
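
A toy matcher illustrates why a pure reformat can invalidate baseline entries. This is a simplified sketch, not basedpyright's actual implementation: the `BaselineKey` type, the `is_baselined` helper, and the file path and column values are hypothetical. The only assumption taken from the text above is that entries match on (file path, error code, range).

```python
from typing import NamedTuple


class BaselineKey(NamedTuple):
    """Hypothetical baseline entry key: (file path, error code, column range)."""

    path: str
    code: str
    start_col: int
    end_col: int


def is_baselined(error: BaselineKey, baseline: set) -> bool:
    # An error is suppressed only if every field of the key matches exactly.
    return error in baseline


baseline = {BaselineKey("source/foo.py", "reportAttributeAccessIssue", 12, 20)}

# The error as recorded before the formatter touched the file: matched.
before = BaselineKey("source/foo.py", "reportAttributeAccessIssue", 12, 20)
# The same error after a signature is reflowed: columns shift, so no match.
after = BaselineKey("source/foo.py", "reportAttributeAccessIssue", 8, 16)

print(is_baselined(before, baseline))
print(is_baselined(after, baseline))
```

Under this model the second lookup misses even though the underlying type error is unchanged, which is consistent with the diff being purely positional.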

Verified locally:
  - make py_format_check         → All checks passed
  - nox --no-venv -s static-checks → 0 errors, 0 warnings, 0 notes

* chore(ingestion): finish ruff swap — nox lint session + skill docs

Three remaining stale-tooling references after Stage 2c:

  - `ingestion/noxfile.py` `lint` session was still calling `black --check`,
    `isort --check-only`, `pycln --diff`. Those tools aren't installed
    anywhere (we dropped them from dev deps). Replace with the ruff
    equivalents that mirror `make py_format_check`.
  - `skills/standards/code_style.md`: stack listed as `black + isort +
    pycln`; line length claimed 88 (black default). Both wrong: stack is
    ruff, line length is 120.
  - `skills/connector-building/SKILL.md`: `make py_format` comment said
    `# black + isort + pycln`. Same swap.

* chore(ingestion): keep main's baseline + globally ignore TRY400

Per gitar-bot's review on PR #27774:

1. Main's PR #27728 promoted ~60 `logger.warning()` → `logger.error()`
   inside `except` blocks. Those changes landed on main with their own
   baseline updates. Our PR doesn't promote anything — the merge from
   origin/main brought those `error` calls along with their baseline
   entries.

   The bot interpreted the `# noqa: TRY400` we added next to those lines
   as us silencing the rule case-by-case. Cleaner: globally ignore
   TRY400 in pyproject.toml, with a comment explaining why the codebase's
   `logger.error(...)` + separate `logger.debug(traceback.format_exc())`
   pattern is intentional. Strip ~430 per-line `# noqa: TRY400` markers
   from source.

2. Document that `S101` in `per-file-ignores` is a forward-looking
   entry: flake8-bandit (`S`) is not yet selected, so the rule is a
   no-op today; the entry stays so that when `S` lands later, tests
   don't immediately error.
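
The logging pattern referred to in item 1 can be sketched as follows. The `parse_port` function and the logger name are hypothetical and exist only to show the shape: a short message at ERROR level plus the full traceback at DEBUG level, which is the combination TRY400 would otherwise ask to replace with `logger.exception`.

```python
from __future__ import annotations

import logging
import traceback

logger = logging.getLogger("example")


def parse_port(raw: str) -> int | None:
    """Return the port as an int, or None if it cannot be parsed."""
    try:
        return int(raw)
    except ValueError as exc:
        # Short, readable message at ERROR level. TRY400 flags this call
        # because it would prefer logger.exception inside an except block.
        logger.error(f"Could not parse port {raw!r}: {exc}")
        # Full stack trace goes to DEBUG so that normal logs stay compact.
        logger.debug(traceback.format_exc())
        return None
```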

Reverts the platform pin and Linux Docker–generated baseline. Keep
main's baseline intact and let CI surface the exact column-shifted
entries; the team will decide whether to fix in-place (revert format
on affected files) or add per-line `# pyright: ignore` markers.

* chore(ingestion): regen baseline for new connector type debt

Main's baseline was stale relative to recently-added connectors
(McpConnection, CustomDriveConnection) that lack common attributes
like `hostPort`, `database`, `catalog` etc. — all sites that access
those attributes via the union-typed `serviceConnection.root.config`
fire `reportAttributeAccessIssue` errors that aren't baselined.

71 errors + 58 warnings absorbed. Local macOS regen; pushing to see
CI's drift count. Per the basedpyright-baseline-and-ci PR experience,
macOS↔Linux column drift on this size of regen has historically been
1-7 residuals.

2026-04-28 07:21:59 +02:00

13 KiB


AGENTS.md

This file provides guidance to Codex (Codex.ai/code) when working with code in this repository.

About OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance. This is a multi-module project with Java backend services, React frontend, Python ingestion framework, and comprehensive Docker infrastructure.

Architecture Overview

Backend: Java 21 + Dropwizard REST API framework, multi-module Maven project
Frontend: React + TypeScript + Ant Design, built with Webpack and Yarn
Ingestion: Python 3.10-3.12 with Pydantic 2.x, 75+ data source connectors
Database: MySQL (default) or PostgreSQL with Flyway migrations
Search: Elasticsearch 7.17+ or OpenSearch 2.6+ for metadata discovery
Infrastructure: Apache Airflow for workflow orchestration

Essential Development Commands

Prerequisites and Setup

make prerequisites              # Check system requirements
make install_dev_env           # Install all development dependencies
make yarn_install_cache        # Install UI dependencies

Frontend Development

cd openmetadata-ui/src/main/resources/ui
yarn start                     # Start development server on localhost:3000
yarn test                      # Run Jest unit tests
yarn test path/to/test.spec.ts # Run a specific test file
yarn test:watch               # Run tests in watch mode
yarn playwright:run            # Run E2E tests
yarn lint                      # ESLint check
yarn lint:fix                  # ESLint with auto-fix
yarn build                     # Production build

Backend Development

mvn clean package -DskipTests  # Build without tests
mvn clean package -DonlyBackend -pl !openmetadata-ui  # Backend only
mvn test                       # Run unit tests
mvn verify                     # Run integration tests
mvn spotless:apply             # Format Java code

Python Ingestion Development

cd ingestion
make install_dev_env           # Install in development mode
make generate                  # Generate Pydantic models from JSON schemas
make unit_ingestion_dev_env    # Run unit tests
make py_format                 # Apply ruff lint-fix + format
make py_format_check           # Verify lint + format (matches CI; catches non-auto-fixable issues)
make static-checks             # Run type checking with basedpyright

Full Local Environment

./docker/run_local_docker.sh -m ui -d mysql        # Complete local setup with UI
./docker/run_local_docker.sh -m no-ui -d postgresql # Backend only with PostgreSQL
./docker/run_local_docker.sh -s true               # Skip Maven build step

Testing

make run_e2e_tests             # Full E2E test suite
make unit_ingestion            # Python unit tests with coverage
yarn test:coverage             # Frontend test coverage

Code Generation and Schemas

OpenMetadata uses a schema-first approach with JSON Schema definitions driving code generation:

make generate                  # Generate all models from schemas
make py_antlr                  # Generate Python ANTLR parsers
make js_antlr                  # Generate JavaScript ANTLR parsers
yarn parse-schema              # Parse JSON schemas for frontend (connection and ingestion schemas)

Schema Architecture

Source schemas in openmetadata-spec/ define the canonical data models
Connection schemas are pre-processed at build time via parseSchemas.js to resolve all $ref references
Application schemas in openmetadata-ui/.../ApplicationSchemas/ are resolved at runtime using schemaResolver.ts
JSON schemas with $ref references to external files require resolution before use in forms
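
The idea of resolving $ref references before a schema is handed to a form can be sketched in a few lines. This is an illustration only, written in Python rather than the JavaScript/TypeScript used by parseSchemas.js and schemaResolver.ts, and it does not reproduce their logic. The `resolve_refs` function, the `files` mapping (an in-memory stand-in for schema files on disk), and the file names are all hypothetical.

```python
from typing import Any


def resolve_refs(node: Any, files: dict) -> Any:
    """Recursively replace {"$ref": "<file>"} nodes with the referenced schema.

    Cyclic references and JSON-pointer fragments are not handled in this sketch.
    """
    if isinstance(node, dict):
        if "$ref" in node:
            return resolve_refs(files[node["$ref"]], files)
        return {key: resolve_refs(value, files) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, files) for item in node]
    return node


# Hypothetical external schema file, keyed by its file name.
files = {
    "basicAuth.json": {
        "type": "object",
        "properties": {"username": {"type": "string"}},
    }
}
# Hypothetical connection schema that points at the external file.
connection = {
    "type": "object",
    "properties": {"authType": {"$ref": "basicAuth.json"}},
}

resolved = resolve_refs(connection, files)
print(resolved["properties"]["authType"]["properties"]["username"])
```

After resolution the schema contains no $ref nodes, so a form renderer can walk it without loading further files.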

Key Directories

openmetadata-service/ - Core Java backend services and REST APIs
openmetadata-ui/src/main/resources/ui/ - React frontend application
ingestion/ - Python ingestion framework with connectors
openmetadata-spec/ - JSON Schema specifications for all entities
bootstrap/sql/ - Database schema migrations and sample data
conf/ - Configuration files for different environments
docker/ - Docker configurations for local and production deployment

Development Workflow

Schema Changes: Modify JSON schemas in openmetadata-spec/, then run mvn clean install on openmetadata-spec to update models
Backend: Develop in Java using Dropwizard patterns, test with mvn test, format with mvn spotless:apply
Frontend: Use React/TypeScript with Ant Design components, test with Jest/Playwright
Ingestion: Python connectors follow plugin pattern, use make install_dev_env for development
Full Testing: Use make run_e2e_tests before major changes

Frontend Architecture Patterns

React Component Patterns

File Naming: Components use ComponentName.component.tsx, interfaces use ComponentName.interface.ts
State Management: Use useState with proper typing, avoid any
Side Effects: Use useEffect with proper dependency arrays
Performance: Use useCallback for event handlers, useMemo for expensive computations
Custom Hooks: Prefix with use, place in src/hooks/, return typed objects
Internationalization: Use useTranslation hook from react-i18next, access with t('key')
Component Structure: Functional components only, no class components
Props: Define interfaces for all component props, place in .interface.ts files
Loading States: Use object state for multiple loading states: useState<Record<string, boolean>>({})
Error Handling: Use showErrorToast and showSuccessToast utilities from ToastUtils
Navigation: Use useNavigate from react-router-dom, not direct history manipulation
Data Fetching: Async functions with try-catch blocks, update loading states appropriately

State Management

Use Zustand stores for global state (e.g., useLimitStore, useWelcomeStore)
Keep component state local when possible with useState
Use context providers for feature-specific shared state (e.g., ApplicationsProvider)

Styling

MUI Migration: The project is gradually migrating from Ant Design to Material-UI (MUI) v7.3.1
Preferred Approach: Use MUI components v7.3.1 and styles wherever possible for new features
Theme and Styles: MUI theme data and styles are defined in openmetadata-ui-core-components
Colors and Design Tokens: Always reference theme colors and design tokens from the MUI theme, not hardcoded values
Legacy Components: Ant Design components remain in existing code but should be replaced with MUI equivalents when refactoring
Do not add unnecessary spacing between logs and code.
In Java, avoid wildcard imports (e.g., use import java.util.List; instead of import java.util.*;)
Custom styles in .less files with component-specific naming (legacy pattern)
Follow BEM naming convention for custom CSS classes
Use CSS modules where appropriate

UI considerations

Do not use string literals for user-facing text anywhere in the UI. Use the useTranslation hook instead, as in const { t } = useTranslation(). For example, to display the string "Run", use { t('label.run') }; this label is defined in the locales.

Application Configuration

Applications use ApplicationsClassBase for schema loading and configuration
Dynamic imports handle application-specific schemas and assets
Form schemas use React JSON Schema Form (RJSF) with custom UI widgets

Service Utilities

Each service type has dedicated utility files (e.g., DatabaseServiceUtils.tsx)
Connection schemas are imported statically and pre-resolved
Service configurations use switch statements to map types to schemas

Type Safety

All API responses have generated TypeScript interfaces in generated/
Custom types extend base interfaces when needed
Avoid type assertions unless absolutely necessary
Use discriminated unions for action types and state variants

Database and Migrations

Flyway handles schema migrations in bootstrap/sql/migrations/
Use Docker containers for local database setup
MySQL is the default; PostgreSQL is supported as an alternative
Sample data loaded automatically in development environment

Security and Authentication

JWT-based authentication with OAuth2/SAML support
Role-based access control defined in Java entities
Security configurations in conf/openmetadata.yaml
Never commit secrets - use environment variables or secure vaults
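
A minimal sketch of reading a secret from the environment instead of committing it. The variable name `OM_JWT_SECRET` and the `load_jwt_secret` helper are hypothetical and are not names used by OpenMetadata.

```python
import os


def load_jwt_secret(env_var: str = "OM_JWT_SECRET") -> str:
    """Read a secret from the environment rather than hardcoding it.

    `OM_JWT_SECRET` is a hypothetical variable name used for illustration.
    """
    secret = os.environ.get(env_var)
    if not secret:
        # Fail loudly instead of silently falling back to a default value.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return secret
```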

Code Generation Standards

Comments Policy

Do NOT add unnecessary comments - write self-documenting code
NEVER add single-line comments that describe what the code obviously does
Only include comments for:
- Complex business logic that isn't obvious
- Non-obvious algorithms or workarounds
- Public API JavaDoc documentation
- TODO/FIXME with ticket references
Bad examples (NEVER do this):
- // Create user before createUser()
- // Get client before SdkClients.adminClient()
- // Verify domain is set before assertNotNull(entity.getDomain())
- // User names are lowercased when the code toLowerCase() makes it obvious
If the code needs a comment to be understood, refactor the code to be clearer instead
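
The policy above can be illustrated with a short Python sketch. Both functions are hypothetical. The first needs no comment because its name says what it does; the second carries a comment only because the rule it encodes is not obvious from the code alone.

```python
def normalize_user_name(name: str) -> str:
    # No comment needed here: the function name and the calls are self-explanatory.
    return name.strip().lower()


def is_retryable(status_code: int) -> bool:
    # 429 (Too Many Requests) is the one 4xx code treated as transient;
    # every other 4xx means the request itself is wrong and retrying won't help.
    return status_code == 429 or 500 <= status_code < 600
```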

Java Code Requirements

Always mention running mvn spotless:apply when generating/modifying .java files
Use clear, descriptive variable and method names instead of comments
Follow existing project patterns and conventions
Generate production-ready code, not tutorial code
Create integration tests in openmetadata-integration-tests
Do not use fully qualified names in code (e.g., org.openmetadata.schema.type.Status); import the class and refer to it by its simple name
Do not use wildcard imports; import exactly the classes that are required

TypeScript/Frontend Code Requirements

NEVER use any type in TypeScript code - always use proper types
Use unknown when the type is truly unknown and add type guards
Import types from existing type definitions (e.g., RJSFSchema from @rjsf/utils)
Follow ESLint rules strictly - the project enforces no-console, proper formatting
Add // eslint-disable-next-line comments only when absolutely necessary
Import Organization (in order):
1. External libraries (React, Ant Design, etc.)
2. Internal absolute imports from generated/, constants/, hooks/, etc.
3. Relative imports for utilities and components
4. Asset imports (SVGs, styles)
5. Type imports grouped separately when needed

Python Code Requirements

Use pytest, not unittest - write tests using pytest style with plain assert statements
Use pytest fixtures for test setup instead of setUp/tearDown methods
Use unittest.mock for mocking (MagicMock, patch) - this is compatible with pytest
Test classes should not inherit from TestCase - use plain classes prefixed with Test
Use assert x == y instead of self.assertEqual(x, y)
Use assert x is None instead of self.assertIsNone(x)
Use assert "text" in string instead of self.assertIn("text", string)
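
A minimal sketch of the test style described above. The `fetch_table_names` helper and the client it receives are hypothetical. Fixtures are left out so the sketch has no third-party dependency; in a real test module the repeated client setup would typically move into a `@pytest.fixture`.

```python
from unittest.mock import MagicMock


def fetch_table_names(client) -> list:
    """Hypothetical helper under test: returns table names in sorted order."""
    return sorted(table["name"] for table in client.list_tables())


class TestFetchTableNames:
    # Plain class prefixed with Test: no TestCase inheritance, no setUp/tearDown.

    def test_returns_sorted_names(self):
        client = MagicMock()
        client.list_tables.return_value = [{"name": "orders"}, {"name": "customers"}]

        assert fetch_table_names(client) == ["customers", "orders"]

    def test_empty_source_returns_empty_list(self):
        client = MagicMock()
        client.list_tables.return_value = []

        assert fetch_table_names(client) == []
```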

Python Ingestion Connector Guidelines

Keep connector-specific logic in connector-specific files, not in generic/shared files like builders.py
Example: Redshift IAM auth should be in ingestion/src/metadata/ingestion/source/database/redshift/connection.py, not in ingestion/src/metadata/ingestion/connections/builders.py
This keeps the codebase modular and prevents generic utilities from becoming cluttered with connector-specific edge cases
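
The split can be sketched as below, with both "modules" shown in one block for brevity. All names here (`build_connection_url`, `RedshiftConfig`, `get_redshift_url`, and the `iam=true` query parameter) are hypothetical and do not reflect the real OpenMetadata code; the point is only where the connector-specific branch lives.

```python
from dataclasses import dataclass


# Generic helper, as it might live in a shared builders module.
# It knows nothing about any particular connector.
def build_connection_url(scheme: str, host_port: str, database: str) -> str:
    return f"{scheme}://{host_port}/{database}"


# Connector-specific code, as it might live in the Redshift connection module.
@dataclass
class RedshiftConfig:
    host_port: str
    database: str
    use_iam: bool = False


def get_redshift_url(config: RedshiftConfig) -> str:
    url = build_connection_url("redshift+psycopg2", config.host_port, config.database)
    if config.use_iam:
        # The IAM branch stays in the connector module; the query parameter
        # is a placeholder for whatever the real IAM handling requires.
        url += "?iam=true"
    return url
```

With this layout, adding another connector's edge case never requires touching the shared helper.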

Testing Philosophy

Test real behavior, not mock wiring - if a test requires mocking 3+ classes just to verify a method call, it's testing the wrong thing
Prefer integration tests over heavily-mocked unit tests. This project has full integration test infrastructure (OpenMetadataApplicationTest, Docker containers, real OpenSearch). Use it.
Mocks are for boundaries, not internals - mock external services (HTTP clients, third-party APIs), not your own classes. If you're mocking static methods left and right to test internal plumbing, write an integration test instead.
A test that mocks everything proves nothing - it only verifies that your mocks are wired correctly, not that the system works
Ask "what breaks if this test passes but the code is wrong?" - if the answer is "nothing, because everything real is mocked out", delete the test and write a better one
Test the outcome, not the implementation - assert on observable results (API responses, database state, stats values) rather than verifying internal method calls with verify()
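
A small Python sketch of "mock the boundary, assert the outcome". The `TableCounter` class, its `/tables` endpoint, and the response shape are hypothetical. Only the external HTTP client is mocked, and the assertion checks the computed result rather than which internal calls were made.

```python
from unittest.mock import MagicMock


class TableCounter:
    """Hypothetical class under test; `http_client` is the external boundary."""

    def __init__(self, http_client):
        self._http = http_client

    def count_by_schema(self) -> dict:
        counts = {}
        for table in self._http.get("/tables")["data"]:
            counts[table["schema"]] = counts.get(table["schema"], 0) + 1
        return counts


def test_counts_tables_per_schema():
    # Only the external service is replaced; TableCounter itself runs for real.
    http_client = MagicMock()
    http_client.get.return_value = {
        "data": [{"schema": "sales"}, {"schema": "sales"}, {"schema": "hr"}]
    }

    result = TableCounter(http_client).count_by_schema()

    # Assert on the observable outcome, not on internal method calls.
    assert result == {"sales": 2, "hr": 1}
```

If the counting logic were wrong, this test would fail, which is the property the checklist above asks for.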

Response Format

Provide clean code blocks without unnecessary explanations
Assume readers are experienced developers
Focus on functionality over education
