ring/platforms/opencode/standards/sre.md
Jefferson Rodrigues 635b15cb19
fix(standards): enforce zero panic policy across all standards and agents (#277)
Replace all 13 panic() positive examples with return error patterns, close the bootstrap exception loophole, resolve os.Exit() contradiction in domain.md, fix InitServers() caller to handle (*Service, error), update License Manager guidance to return error instead of panic, make grep commands recursive with regexp.MustCompile exclusion, and add post-generation self-checks to prevent agents from generating panic/log.Fatal/Must* code.

X-Lerian-Ref: 0x1
2026-03-23 13:47:31 -03:00


SRE Standards

⚠️ MAINTENANCE: This file is indexed in dev-team/skills/shared-patterns/standards-coverage-table.md. When adding/removing ## sections, follow FOUR-FILE UPDATE RULE in CLAUDE.md: (1) edit standards file, (2) update TOC, (3) update standards-coverage-table.md, (4) update agent file.

This file defines the specific standards for Site Reliability Engineering and observability.

Reference: Always consult docs/PROJECT_RULES.md for common project standards.


Table of Contents

| # | Section | Description |
|---|---------|-------------|
| 1 | Observability | Logs, traces, APM tools |
| 2 | Logging | Structured JSON format, log levels |
| 3 | Tracing | OpenTelemetry configuration |
| 4 | OpenTelemetry with lib-commons | Go service integration |
| 5 | Structured Logging with lib-common-js | TypeScript service integration |
| 6 | Health Checks | Liveness and readiness probes |

Meta-sections (not checked by agents):

  • Checklist - Self-verification before deploying

Observability

| Component | Primary | Alternatives |
|-----------|---------|--------------|
| Logs | Loki | ELK Stack, Splunk, CloudWatch Logs |
| Traces | Jaeger/Tempo | Zipkin, X-Ray, Honeycomb |
| APM | OpenTelemetry | DataDog APM, New Relic APM |

Logging

Structured Log Format

{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "error",
  "logger": "api.handler",
  "message": "Failed to process request",
  "service": "api",
  "version": "1.2.3",
  "environment": "production",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "request_id": "req-001",
  "user_id": "usr_456",
  "error": {
    "type": "ConnectionError",
    "message": "connection timeout after 30s",
    "stack": "..."
  },
  "context": {
    "method": "POST",
    "path": "/api/v1/users",
    "status": 500,
    "duration_ms": 30045
  }
}
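
As a rough illustration of how a record in this shape can be produced, the sketch below uses only the Go standard library (`log/slog`, Go 1.21+). It is not the lib-commons API; Go services in this codebase are expected to obtain their logger through lib-commons, and this only shows the field mapping. The helper names `newJSONLogger` and `sampleRecord` are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
	"strings"
)

// newJSONLogger returns a logger that writes one JSON object per line to w.
// ReplaceAttr renames slog's default top-level keys ("time", "msg") to the
// field names used in the sample record and lower-cases the level.
func newJSONLogger(w io.Writer, service, version, environment string) *slog.Logger {
	opts := &slog.HandlerOptions{
		Level: slog.LevelInfo,
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) > 0 {
				return a // leave nested fields untouched
			}
			switch a.Key {
			case slog.TimeKey:
				a.Key = "timestamp"
			case slog.MessageKey:
				a.Key = "message"
			case slog.LevelKey:
				if lvl, ok := a.Value.Any().(slog.Level); ok {
					a.Value = slog.StringValue(strings.ToLower(lvl.String()))
				}
			}
			return a
		},
	}
	// Static fields are attached once so every record carries them.
	return slog.New(slog.NewJSONHandler(w, opts)).With(
		"service", service,
		"version", version,
		"environment", environment,
	)
}

// sampleRecord logs one error to an in-memory buffer and decodes it again,
// so the resulting field names can be inspected.
func sampleRecord() (map[string]any, error) {
	var buf bytes.Buffer
	logger := newJSONLogger(&buf, "api", "1.2.3", "production")
	logger.Error("Failed to process request",
		slog.String("trace_id", "abc123def456"),
		slog.String("request_id", "req-001"),
		slog.Group("context",
			slog.String("method", "POST"),
			slog.String("path", "/api/v1/users"),
			slog.Int("status", 500),
		),
	)
	var rec map[string]any
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		return nil, fmt.Errorf("decode log line: %w", err)
	}
	return rec, nil
}

func main() {
	rec, err := sampleRecord()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rec["level"], rec["message"], rec["service"])
}
```

Attaching `service`, `version`, and `environment` once at construction keeps call sites short and makes it harder to emit a record without them.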

Log Levels

| Level | Usage | Examples |
|-------|-------|----------|
| ERROR | Failures requiring attention | Database connection failed, API error |
| WARN | Potential issues | Retry attempt, connection pool low |
| INFO | Normal operations | Request completed, user logged in |
| DEBUG | Detailed debugging | Query parameters, internal state |
| TRACE | Very detailed (rarely used) | Full request/response bodies |

What to Log

# DO log
- Request start/end with duration
- Error details with stack traces
- Authentication events (login, logout, failed attempts)
- Authorization failures
- External service calls (start, end, duration)
- Business events (order placed, payment processed)
- Configuration changes
- Deployment events

# DO NOT log
- Passwords or API keys
- Credit card numbers (full)
- Personally identifiable information (PII)
- Session tokens
- Internal security mechanisms
- Health check requests (too noisy)
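
One way to back the "do not log" list with code is to mask known sensitive keys before fields reach the logger. The sketch below is a minimal illustration using only the standard library. The key list, the `[REDACTED]` marker, and the function name `redactFields` are assumptions, not part of lib-commons.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveKeys lists field names whose values are masked.
// The list is illustrative and would need to match the service's own fields.
var sensitiveKeys = map[string]bool{
	"password":      true,
	"api_key":       true,
	"token":         true,
	"session_token": true,
	"card_number":   true,
}

const redacted = "[REDACTED]"

// redactFields returns a copy of fields with sensitive values masked.
// Key matching is case-insensitive, nested maps are handled recursively,
// and the input map is not modified.
func redactFields(fields map[string]any) map[string]any {
	out := make(map[string]any, len(fields))
	for k, v := range fields {
		if sensitiveKeys[strings.ToLower(k)] {
			out[k] = redacted
			continue
		}
		if nested, ok := v.(map[string]any); ok {
			out[k] = redactFields(nested)
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	fields := map[string]any{
		"user_id":  "usr_456",
		"password": "hunter2",
		"context":  map[string]any{"token": "abc", "path": "/api/v1/users"},
	}
	fmt.Println(redactFields(fields))
}
```

A key-based filter only catches fields that are named predictably; it does not detect secrets embedded in free-text messages, so it complements rather than replaces review of log statements.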

Log Aggregation (Loki)

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

Tracing

OpenTelemetry Configuration

// Go - OpenTelemetry setup
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute" // used by span.SetAttributes below
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName("api"),
            semconv.ServiceVersion("1.0.0"),
            semconv.DeploymentEnvironment("production"),
        ),
    )
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(res),
        trace.WithSampler(trace.TraceIDRatioBased(0.1)), // Sample 10%
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

// Usage
tracer := otel.Tracer("api")
ctx, span := tracer.Start(ctx, "processOrder")
defer span.End()

span.SetAttributes(
    attribute.String("order.id", orderID),
    attribute.Int("order.items", len(items)),
)

Span Naming Conventions

# Format: <component>.<operation> (lower-case, dot-separated)

# HTTP handlers
GET /api/users         -> http.request
POST /api/orders       -> http.request

# Database
SELECT users           -> db.query
INSERT orders          -> db.query

# External calls
Payment API call       -> http.client.payment
Email service call     -> http.client.email

# Internal operations
Process order          -> order.process
Validate input         -> input.validate
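
The names above are lower-case and dot-separated. A small helper can keep them consistent across a codebase. This is a sketch under that assumption, and `spanName` is a hypothetical function, not part of lib-commons or OpenTelemetry.

```go
package main

import (
	"fmt"
	"strings"
)

// spanName builds a lower-case, dot-separated span name from its parts.
// Empty parts are skipped and inner whitespace becomes an underscore.
func spanName(parts ...string) string {
	cleaned := make([]string, 0, len(parts))
	for _, p := range parts {
		p = strings.ToLower(strings.TrimSpace(p))
		if p == "" {
			continue
		}
		cleaned = append(cleaned, strings.Join(strings.Fields(p), "_"))
	}
	return strings.Join(cleaned, ".")
}

func main() {
	fmt.Println(spanName("http", "client", "payment"))
	fmt.Println(spanName("Order", "Process"))
}
```

Values that vary per request, such as order IDs, belong in span attributes rather than in the name, so that span names stay low-cardinality and can be grouped by tracing backends.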

Trace Context Propagation

// Propagate trace context in HTTP headers
import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

// Client - inject context
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
    return err
}
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

// Server - extract context
ctx := otel.GetTextMapPropagator().Extract(
    r.Context(),
    propagation.HeaderCarrier(r.Header),
)
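
Note that in the OpenTelemetry Go SDK the global propagator is a no-op until one is registered with `otel.SetTextMapPropagator` (for example `propagation.TraceContext{}`); whether lib-commons registers one is not stated here. With the W3C Trace Context propagator, the value travels in a `traceparent` header. The sketch below parses that header with the standard library only, to show what is carried on the wire. It handles version `00` only, and the type and function names are hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the fields of a version-00 W3C traceparent header.
type traceParent struct {
	TraceID string // 32 lower-case hex characters
	SpanID  string // 16 lower-case hex characters
	Sampled bool   // bit 0 of the trace-flags byte
}

var errInvalidTraceParent = errors.New("invalid traceparent header")

// isLowerHex reports whether s has exactly n characters, all in [0-9a-f].
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceParent parses "00-<trace-id>-<parent-id>-<flags>".
func parseTraceParent(header string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(header), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errInvalidTraceParent
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || !isLowerHex(spanID, 16) || !isLowerHex(flags, 2) {
		return traceParent{}, errInvalidTraceParent
	}
	// All-zero IDs are defined as invalid by the specification.
	if strings.Trim(traceID, "0") == "" || strings.Trim(spanID, "0") == "" {
		return traceParent{}, errInvalidTraceParent
	}
	flagByte, err := hex.DecodeString(flags)
	if err != nil {
		return traceParent{}, errInvalidTraceParent
	}
	return traceParent{
		TraceID: traceID,
		SpanID:  spanID,
		Sampled: flagByte[0]&0x01 == 0x01,
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(tp.TraceID, tp.SpanID, tp.Sampled)
}
```

In services, the propagator does this parsing; the sketch is only meant to make the header contents concrete when debugging missing or broken trace links.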

OpenTelemetry with lib-commons (MANDATORY for Go)

All Go services MUST integrate OpenTelemetry using lib-commons/v2. This ensures consistent observability patterns across all Lerian Studio services.

Reference: See dev-team/docs/standards/golang.md for complete lib-commons integration patterns.

Required Imports

import (
    libCommons "github.com/LerianStudio/lib-commons/v2/commons"
    libZap "github.com/LerianStudio/lib-commons/v2/commons/zap"           // Logger initialization (bootstrap only)
    libLog "github.com/LerianStudio/lib-commons/v2/commons/log"           // Logger interface (services, routes, consumers)
    libOpentelemetry "github.com/LerianStudio/lib-commons/v2/commons/opentelemetry"
    libHTTP "github.com/LerianStudio/lib-commons/v2/commons/net/http"
    libServer "github.com/LerianStudio/lib-commons/v2/commons/server"
)

Telemetry Flow (MANDATORY)

┌─────────────────────────────────────────────────────────────────┐
│ 1. BOOTSTRAP (config.go)                                        │
│    telemetry := libOpentelemetry.InitializeTelemetry(&config)   │
│    → Creates OpenTelemetry provider once at startup             │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. ROUTER (routes.go)                                           │
│    tlMid := libHTTP.NewTelemetryMiddleware(tl)                  │
│    f.Use(tlMid.WithTelemetry(tl))      ← Injects into context   │
│    ...routes...                                                  │
│    f.Use(tlMid.EndTracingSpans)        ← Closes root spans      │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. ANY LAYER (handlers, services, repositories)                 │
│    logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)│
│    ctx, span := tracer.Start(ctx, "operation_name")             │
│    defer span.End()                                              │
│    logger.Infof("Processing...")   ← Logger from same context   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. SERVER LIFECYCLE (fiber.server.go)                           │
│    libServer.NewServerManager(nil, &s.telemetry, s.logger)      │
│        .WithHTTPServer(s.app, s.serverAddress)                  │
│        .StartWithGracefulShutdown()                             │
│    → Handles signal trapping + telemetry flush + clean shutdown │
└─────────────────────────────────────────────────────────────────┘

1. Bootstrap Initialization (MANDATORY)

// bootstrap/config.go
func InitServers() (*Service, error) {
    cfg := &Config{}
    if err := libCommons.SetConfigFromEnvVars(cfg); err != nil {
        return nil, fmt.Errorf("failed to load config: %w", err)
    }

    // Initialize logger FIRST (zap package for initialization in bootstrap)
    logger := libZap.InitializeLogger()

    // Initialize telemetry with config
    telemetry := libOpentelemetry.InitializeTelemetry(&libOpentelemetry.TelemetryConfig{
        LibraryName:               cfg.OtelLibraryName,
        ServiceName:               cfg.OtelServiceName,
        ServiceVersion:            cfg.OtelServiceVersion,
        DeploymentEnv:             cfg.OtelDeploymentEnv,
        CollectorExporterEndpoint: cfg.OtelColExporterEndpoint,
        EnableTelemetry:           cfg.EnableTelemetry,
        Logger:                    logger,
    })

    // Pass telemetry to router...
}

2. Router Middleware Setup (MANDATORY)

// adapters/http/in/routes.go
func NewRouter(lg libLog.Logger, tl *libOpentelemetry.Telemetry, ...) *fiber.App {
    f := fiber.New(fiber.Config{
        DisableStartupMessage: true,
        ErrorHandler: func(ctx *fiber.Ctx, err error) error {
            return libHTTP.HandleFiberError(ctx, err)
        },
    })

    // Create telemetry middleware
    tlMid := libHTTP.NewTelemetryMiddleware(tl)

    // MUST be first middleware - injects tracer+logger into context
    f.Use(tlMid.WithTelemetry(tl))
    f.Use(libHTTP.WithHTTPLogging(libHTTP.WithCustomLogger(lg)))

    // ... define routes ...

    // Version endpoint
    f.Get("/version", libHTTP.Version)

    // MUST be last middleware - closes root spans
    f.Use(tlMid.EndTracingSpans)

    return f
}

3. Recovering Logger & Tracer (MANDATORY)

// any file in any layer (handler, service, repository)
func (s *Service) ProcessEntity(ctx context.Context, id string) error {
    // Single call recovers BOTH logger and tracer from context
    logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)

    // Create child span for this operation
    ctx, span := tracer.Start(ctx, "service.process_entity")
    defer span.End()

    // Logger is automatically correlated with trace
    logger.Infof("Processing entity: %s", id)

    // Pass ctx to downstream calls - trace propagates automatically
    return s.repo.Update(ctx, id)
}

4. Error Handling with Spans (MANDATORY)

// For technical errors (unexpected failures)
if err != nil {
    libOpentelemetry.HandleSpanError(&span, "Failed to connect database", err)
    logger.Errorf("Database error: %v", err)
    return nil, err
}

// For business errors (expected validation failures)
if err != nil {
    libOpentelemetry.HandleSpanBusinessErrorEvent(&span, "Validation failed", err)
    logger.Warnf("Validation error: %v", err)
    return nil, err
}
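
Choosing between the two helpers requires telling business errors apart from technical ones. A minimal sketch of one way to do that with `errors.As` follows, using only the standard library. `ValidationError`, `classify`, `wrap`, and the `errorKind` constants are hypothetical names for illustration; the actual error types depend on the service.

```go
package main

import (
	"errors"
	"fmt"
)

// ValidationError is a hypothetical business error type.
type ValidationError struct {
	Field  string
	Reason string
}

func (e *ValidationError) Error() string { return e.Field + ": " + e.Reason }

// errTimeout stands in for an unexpected technical failure.
var errTimeout = errors.New("connection timeout")

type errorKind int

const (
	kindNone      errorKind = iota // no error
	kindBusiness                   // expected failure: record as a span event, log at WARN
	kindTechnical                  // unexpected failure: mark the span as errored, log at ERROR
)

// classify inspects the whole error chain, so wrapped errors are
// classified the same way as the original.
func classify(err error) errorKind {
	if err == nil {
		return kindNone
	}
	var ve *ValidationError
	if errors.As(err, &ve) {
		return kindBusiness
	}
	return kindTechnical
}

// wrap adds operation context while keeping the cause in the chain.
func wrap(op string, err error) error {
	return fmt.Errorf("%s: %w", op, err)
}

func main() {
	business := wrap("create user", &ValidationError{Field: "email", Reason: "malformed"})
	technical := wrap("create user", errTimeout)

	fmt.Println(classify(business) == kindBusiness)
	fmt.Println(classify(technical) == kindTechnical)
}
```

Wrapping with `%w` matters here: an error flattened with `%v` loses its type, and every failure would then be classified as technical.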

5. Server Lifecycle with Graceful Shutdown (MANDATORY)

// bootstrap/fiber.server.go
type Server struct {
    app           *fiber.App
    serverAddress string
    logger        libLog.Logger
    telemetry     libOpentelemetry.Telemetry
}

func (s *Server) Run(l *libCommons.Launcher) error {
    libServer.NewServerManager(nil, &s.telemetry, s.logger).
        WithHTTPServer(s.app, s.serverAddress).
        StartWithGracefulShutdown()  // Handles: SIGINT/SIGTERM, telemetry flush, connections close
    return nil
}

Required Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| OTEL_RESOURCE_SERVICE_NAME | Service name in traces | service-name |
| OTEL_LIBRARY_NAME | Library identifier | service-name |
| OTEL_RESOURCE_SERVICE_VERSION | Service version | 1.0.0 |
| OTEL_RESOURCE_DEPLOYMENT_ENVIRONMENT | Environment | production |
| OTEL_EXPORTER_OTLP_ENDPOINT | Collector endpoint | http://otel-collector:4317 |
| ENABLE_TELEMETRY | Enable/disable | true |
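
A startup check that all of these variables are present can surface misconfiguration early and, in line with the zero-panic policy, report it as a returned error. The sketch below uses only the standard library. How lib-commons maps these variables onto its config struct is not shown in this document, so `telemetryEnv` and `loadTelemetryEnv` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// telemetryEnv mirrors the variables in the table above.
type telemetryEnv struct {
	ServiceName       string
	LibraryName       string
	ServiceVersion    string
	DeploymentEnv     string
	CollectorEndpoint string
	Enabled           bool
}

// loadTelemetryEnv reads the variables through lookup (os.LookupEnv in
// production) and returns an error naming every missing variable.
func loadTelemetryEnv(lookup func(string) (string, bool)) (telemetryEnv, error) {
	var missing []string
	get := func(key string) string {
		v, ok := lookup(key)
		if !ok || v == "" {
			missing = append(missing, key)
		}
		return v
	}

	cfg := telemetryEnv{
		ServiceName:       get("OTEL_RESOURCE_SERVICE_NAME"),
		LibraryName:       get("OTEL_LIBRARY_NAME"),
		ServiceVersion:    get("OTEL_RESOURCE_SERVICE_VERSION"),
		DeploymentEnv:     get("OTEL_RESOURCE_DEPLOYMENT_ENVIRONMENT"),
		CollectorEndpoint: get("OTEL_EXPORTER_OTLP_ENDPOINT"),
	}
	rawEnabled := get("ENABLE_TELEMETRY")

	if len(missing) > 0 {
		return telemetryEnv{}, fmt.Errorf("missing environment variables: %s", strings.Join(missing, ", "))
	}
	enabled, err := strconv.ParseBool(rawEnabled)
	if err != nil {
		return telemetryEnv{}, fmt.Errorf("ENABLE_TELEMETRY: %w", err)
	}
	cfg.Enabled = enabled
	return cfg, nil
}

func main() {
	cfg, err := loadTelemetryEnv(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("telemetry enabled:", cfg.Enabled, "service:", cfg.ServiceName)
}
```

Passing the lookup function as a parameter keeps the check testable without touching the process environment.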

lib-commons Telemetry Checklist

| Check | What to Verify | Status |
|-------|----------------|--------|
| Bootstrap Init | libOpentelemetry.InitializeTelemetry() called in bootstrap | Required |
| Middleware Order | WithTelemetry() is FIRST, EndTracingSpans is LAST | Required |
| Context Recovery | All layers use libCommons.NewTrackingFromContext(ctx) | Required |
| Span Creation | Operations create spans via tracer.Start(ctx, "name") | Required |
| Error Handling | Uses HandleSpanError or HandleSpanBusinessErrorEvent | Required |
| Graceful Shutdown | libServer.NewServerManager().StartWithGracefulShutdown() | Required |
| Env Variables | All OTEL_* variables configured | Required |

What Not to Do

// FORBIDDEN: Manual OpenTelemetry setup without lib-commons
import "go.opentelemetry.io/otel"
tp := trace.NewTracerProvider(...)  // DON'T do this manually

// FORBIDDEN: Creating loggers without context
logger := zap.NewLogger()  // DON'T do this in services

// FORBIDDEN: Not passing context to downstream calls
s.repo.Update(id)  // DON'T forget context

// CORRECT: Always use lib-commons patterns
telemetry := libOpentelemetry.InitializeTelemetry(&config)
logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)
s.repo.Update(ctx, id)  // Context propagates trace

Standards Compliance Categories

When evaluating a codebase for lib-commons telemetry compliance, check these categories:

| Category | Expected Pattern | Evidence Location |
|----------|------------------|-------------------|
| Telemetry Init | libOpentelemetry.InitializeTelemetry() | internal/bootstrap/config.go |
| Logger Init | libZap.InitializeLogger() (bootstrap only) | internal/bootstrap/config.go |
| Middleware Setup | NewTelemetryMiddleware() + WithTelemetry() | internal/adapters/http/in/routes.go |
| Middleware Order | WithTelemetry first, EndTracingSpans last | internal/adapters/http/in/routes.go |
| Context Recovery | libCommons.NewTrackingFromContext(ctx) | All handlers, services, repositories |
| Span Creation | tracer.Start(ctx, "operation") | All significant operations |
| Error Spans | HandleSpanError / HandleSpanBusinessErrorEvent | Error handling paths |
| Graceful Shutdown | libServer.NewServerManager().StartWithGracefulShutdown() | internal/bootstrap/fiber.server.go |

Structured Logging with lib-common-js (MANDATORY for TypeScript)

All TypeScript services MUST integrate structured logging using @LerianStudio/lib-common-js. This ensures consistent observability patterns across all Lerian Studio services.

Note: lib-common-js currently provides logging infrastructure. Telemetry will be added in future versions.

Required Dependencies

{
  "dependencies": {
    "@LerianStudio/lib-common-js": "^1.0.0"
  }
}

Required Imports

import { initializeLogger, Logger } from '@LerianStudio/lib-common-js/logger';
import { loadConfigFromEnv } from '@LerianStudio/lib-common-js/config';
import { createLoggingMiddleware } from '@LerianStudio/lib-common-js/http';

Logging Flow (MANDATORY)

┌─────────────────────────────────────────────────────────────────┐
│ 1. BOOTSTRAP (config.ts)                                        │
│    const logger = initializeLogger()                            │
│    → Creates structured logger once at startup                  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. ROUTER (routes.ts)                                           │
│    const logMid = createLoggingMiddleware(logger)               │
│    app.use(logMid)            ← Injects logger into request     │
│    ...routes...                                                  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. ANY LAYER (handlers, services, repositories)                 │
│    const logger = req.logger || parentLogger                    │
│    logger.info('Processing...', { entityId, requestId })        │
│    → Structured JSON logs with correlation IDs                  │
└─────────────────────────────────────────────────────────────────┘

1. Bootstrap Initialization (MANDATORY)

// bootstrap/config.ts
import { initializeLogger } from '@LerianStudio/lib-common-js/logger';
import { loadConfigFromEnv } from '@LerianStudio/lib-common-js/config';

export async function initServers(): Promise<Service> {
    // Load configuration from environment
    const config = loadConfigFromEnv<Config>();

    // Initialize logger
    const logger = initializeLogger({
        level: config.logLevel,
        serviceName: config.serviceName,
        serviceVersion: config.serviceVersion,
    });

    logger.info('Service starting', {
        service: config.serviceName,
        version: config.serviceVersion,
        environment: config.envName,
    });

    // Pass logger to router...
}

2. Router Middleware Setup (MANDATORY)

// adapters/http/routes.ts
import { createLoggingMiddleware } from '@LerianStudio/lib-common-js/http';
import express from 'express';

export function createRouter(
    logger: Logger,
    handlers: Handlers
): express.Application {
    const app = express();

    // Create logging middleware - injects logger into request
    const logMid = createLoggingMiddleware(logger);
    app.use(logMid);
    app.use(express.json());

    // ... define routes ...

    return app;
}

3. Using Logger in Handlers/Services (MANDATORY)

// handlers/user-handler.ts
async function createUser(req: Request, res: Response): Promise<void> {
    const logger = req.logger;
    const requestId = req.headers['x-request-id'] as string;

    // Log identifiers only; email addresses are PII and must not be logged
    logger.info('Creating user', {
        requestId,
    });

    try {
        const user = await userService.create(req.body, logger);
        logger.info('User created successfully', {
            requestId,
            userId: user.id,
        });
        res.status(201).json(user);
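    } catch (error) {
        // `error` is `unknown` under strict TypeScript; narrow it before reading fields
        const err = error instanceof Error ? error : new Error(String(error));
        logger.error('Failed to create user', {
            requestId,
            error: err.message,
            stack: err.stack,
        });
        throw error;
    }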
}

Required Structured Log Format

All logs MUST be JSON formatted with these fields:

{
  "timestamp": "2024-01-15T10:30:00.000Z",
  "level": "info",
  "message": "Processing request",
  "service": "api-service",
  "version": "1.2.3",
  "environment": "production",
  "requestId": "req-001",
  "context": {
    "method": "POST",
    "path": "/api/v1/users",
    "userId": "usr_456"
  }
}

Required Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| LOG_LEVEL | Logging level | info |
| SERVICE_NAME | Service identifier | api-service |
| SERVICE_VERSION | Service version | 1.0.0 |
| ENV_NAME | Environment name | production |

lib-common-js Logging Checklist

| Check | What to Verify | Status |
|-------|----------------|--------|
| Logger Init | initializeLogger() called in bootstrap | Required |
| Middleware | createLoggingMiddleware(logger) configured | Required |
| Request Correlation | Logs include requestId from headers | Required |
| Structured Format | All logs are JSON formatted | Required |
| Error Logging | Errors include message, stack, and context | Required |
| No Sensitive Data | Passwords, tokens, PII not logged | Required |
| Log Levels | Appropriate levels used (info, warn, error) | Required |

What Not to Do

// FORBIDDEN: Using console.log
console.log('Processing user'); // DON'T do this

// FORBIDDEN: Logging sensitive data
logger.info('User login', { password: user.password }); // never

// FORBIDDEN: Unstructured log messages
logger.info(`Processing user ${userId}`); // DON'T use string interpolation

// CORRECT: Always use lib-common-js structured logging
const logger = initializeLogger(config);
logger.info('Processing user', { userId, requestId }); // Structured fields

Standards Compliance Categories (TypeScript Logging)

When evaluating a codebase for lib-common-js logging compliance, check these categories:

| Category | Expected Pattern | Evidence Location |
|----------|------------------|-------------------|
| Logger Init | initializeLogger() | src/bootstrap/config.ts |
| Middleware Setup | createLoggingMiddleware(logger) | src/adapters/http/routes.ts |
| Request Correlation | requestId in all logs | Handlers, services |
| JSON Format | Structured JSON output | All log statements |
| Error Logging | Error object with stack trace | Error handlers |
| No console.log | No direct console usage | Entire codebase |
| No Sensitive Data | Passwords, tokens excluded | All log statements |

Health Checks

Required Endpoints

Implementation

// Go implementation of liveness and readiness probes
type HealthChecker struct {
    db    *sql.DB
    redis *redis.Client
}

// Liveness - is the process alive?
func (h *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

// Readiness - can we serve traffic?
func (h *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    checks := []struct {
        name string
        fn   func(context.Context) error
    }{
        {"database", func(ctx context.Context) error { return h.db.PingContext(ctx) }},
        {"redis", func(ctx context.Context) error { return h.redis.Ping(ctx).Err() }},
    }

    var failures []string
    for _, check := range checks {
        if err := check.fn(ctx); err != nil {
            failures = append(failures, fmt.Sprintf("%s: %v", check.name, err))
        }
    }

    if len(failures) > 0 {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status":  "unhealthy",
            "checks":  failures,
        })
        return
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]interface{}{
        "status": "healthy",
    })
}

Kubernetes Configuration

# Observability configuration
# JSON structured logging required
# OpenTelemetry tracing recommended for distributed systems

Checklist

Before deploying to production:

  • Logging: Structured JSON logs with trace correlation
  • Tracing: OpenTelemetry instrumentation (Go with lib-commons)
  • Structured Logging: lib-common-js integration (TypeScript)