- Add Table of Contents to devops.md, frontend.md, golang.md, sre.md, typescript.md - Update CLAUDE.md with Four-File Update Rule and TOC maintenance guidelines - Add checklist for adding/removing sections in standards files - Improve dev-refactor skill to use Skill tool for dev-cycle handoff - Add anti-rationalization patterns for gate execution shortcuts - Add Execution Report sections to dev-cycle, dev-feedback-loop, dev-refactor - Extract gap tracking rationalizations to shared-anti-rationalization.md Generated-by: Claude AI-Model: claude-opus-4-5-20251101
25 KiB
SRE Standards
⚠️ MAINTENANCE: This file is indexed in
dev-team/skills/shared-patterns/standards-coverage-table.md. When adding/removing##sections, update the coverage table AND agent files per THREE-FILE UPDATE RULE in CLAUDE.md.
This file defines the specific standards for Site Reliability Engineering and observability.
Reference: Always consult
docs/PROJECT_RULES.mdfor common project standards.
Table of Contents
| # | Section | Description |
|---|---|---|
| 1 | Observability Stack | Logs, traces, APM tools |
| 2 | Logging Standards | Structured JSON format, log levels |
| 3 | Tracing Standards | OpenTelemetry configuration |
| 4 | OpenTelemetry with lib-commons | Go service integration |
| 5 | Structured Logging with lib-common-js | TypeScript service integration |
| 6 | Health Checks | Liveness and readiness probes |
Meta-sections (not checked by agents):
- Checklist - Self-verification before deploying
Observability Stack
| Component | Primary | Alternatives |
|---|---|---|
| Logs | Loki | ELK Stack, Splunk, CloudWatch Logs |
| Traces | Jaeger/Tempo | Zipkin, X-Ray, Honeycomb |
| APM | OpenTelemetry | DataDog APM, New Relic APM |
Logging Standards
Structured Log Format
{
"timestamp": "2024-01-15T10:30:00.000Z",
"level": "error",
"logger": "api.handler",
"message": "Failed to process request",
"service": "api",
"version": "1.2.3",
"environment": "production",
"trace_id": "abc123def456",
"span_id": "789xyz",
"request_id": "req-001",
"user_id": "usr_456",
"error": {
"type": "ConnectionError",
"message": "connection timeout after 30s",
"stack": "..."
},
"context": {
"method": "POST",
"path": "/api/v1/users",
"status": 500,
"duration_ms": 30045
}
}
Log Levels
| Level | Usage | Examples |
|---|---|---|
| ERROR | Failures requiring attention | Database connection failed, API error |
| WARN | Potential issues | Retry attempt, rate limit approaching |
| INFO | Normal operations | Request completed, user logged in |
| DEBUG | Detailed debugging | Query parameters, internal state |
| TRACE | Very detailed (rarely used) | Full request/response bodies |
What to Log
# DO log
- Request start/end with duration
- Error details with stack traces
- Authentication events (login, logout, failed attempts)
- Authorization failures
- External service calls (start, end, duration)
- Business events (order placed, payment processed)
- Configuration changes
- Deployment events
# DO NOT log
- Passwords or API keys
- Credit card numbers (full)
- Personal identifiable information (PII)
- Session tokens
- Internal security mechanisms
- Health check requests (too noisy)
Log Aggregation (Loki)
# loki-config.yaml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
chunk_idle_period: 5m
chunk_retain_period: 30s
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/cache
shared_store: filesystem
filesystem:
directory: /loki/chunks
limits_config:
enforce_metric_name: false
reject_old_samples: true
reject_old_samples_max_age: 168h
Tracing Standards
OpenTelemetry Configuration
// Go - OpenTelemetry setup
import (
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/sdk/resource"
"go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)
func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
return nil, err
}
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceName("api"),
semconv.ServiceVersion("1.0.0"),
semconv.DeploymentEnvironment("production"),
),
)
if err != nil {
return nil, err
}
tp := trace.NewTracerProvider(
trace.WithBatcher(exporter),
trace.WithResource(res),
trace.WithSampler(trace.TraceIDRatioBased(0.1)), // Sample 10%
)
otel.SetTracerProvider(tp)
return tp, nil
}
// Usage
tracer := otel.Tracer("api")
ctx, span := tracer.Start(ctx, "processOrder")
defer span.End()
span.SetAttributes(
attribute.String("order.id", orderID),
attribute.Int("order.items", len(items)),
)
Span Naming Conventions
# Format: <operation>.<entity>
# HTTP handlers
GET /api/users -> http.request
POST /api/orders -> http.request
# Database
SELECT users -> db.query
INSERT orders -> db.query
# External calls
Payment API call -> http.client.payment
Email service call -> http.client.email
# Internal operations
Process order -> order.process
Validate input -> input.validate
Trace Context Propagation
// Propagate trace context in HTTP headers
import (
"go.opentelemetry.io/otel/propagation"
)
// Client - inject context
req, _ := http.NewRequestWithContext(ctx, "GET", url, nil)
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// Server - extract context
ctx := otel.GetTextMapPropagator().Extract(
r.Context(),
propagation.HeaderCarrier(r.Header),
)
OpenTelemetry with lib-commons (MANDATORY for Go)
All Go services MUST integrate OpenTelemetry using lib-commons/v2. This ensures consistent observability patterns across all Lerian Studio services.
Reference: See
dev-team/docs/standards/golang.mdfor complete lib-commons integration patterns.
Required Imports
import (
libCommons "github.com/LerianStudio/lib-commons/v2/commons"
libZap "github.com/LerianStudio/lib-commons/v2/commons/zap" // Logger initialization (bootstrap only)
libLog "github.com/LerianStudio/lib-commons/v2/commons/log" // Logger interface (services, routes, consumers)
libOpentelemetry "github.com/LerianStudio/lib-commons/v2/commons/opentelemetry"
libHTTP "github.com/LerianStudio/lib-commons/v2/commons/net/http"
libServer "github.com/LerianStudio/lib-commons/v2/commons/server"
)
Telemetry Flow (MANDATORY)
┌─────────────────────────────────────────────────────────────────┐
│ 1. BOOTSTRAP (config.go) │
│ telemetry := libOpentelemetry.InitializeTelemetry(&config) │
│ → Creates OpenTelemetry provider once at startup │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. ROUTER (routes.go) │
│ tlMid := libHTTP.NewTelemetryMiddleware(tl) │
│ f.Use(tlMid.WithTelemetry(tl)) ← Injects into context │
│ ...routes... │
│ f.Use(tlMid.EndTracingSpans) ← Closes root spans │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. ANY LAYER (handlers, services, repositories) │
│ logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)│
│ ctx, span := tracer.Start(ctx, "operation_name") │
│ defer span.End() │
│ logger.Infof("Processing...") ← Logger from same context │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 4. SERVER LIFECYCLE (fiber.server.go) │
│ libServer.NewServerManager(nil, &s.telemetry, s.logger) │
│ .WithHTTPServer(s.app, s.serverAddress) │
│ .StartWithGracefulShutdown() │
│ → Handles signal trapping + telemetry flush + clean shutdown │
└─────────────────────────────────────────────────────────────────┘
1. Bootstrap Initialization (MANDATORY)
// bootstrap/config.go
func InitServers() *Service {
cfg := &Config{}
if err := libCommons.SetConfigFromEnvVars(cfg); err != nil {
panic(err)
}
// Initialize logger FIRST (zap package for initialization in bootstrap)
logger := libZap.InitializeLogger()
// Initialize telemetry with config
telemetry := libOpentelemetry.InitializeTelemetry(&libOpentelemetry.TelemetryConfig{
LibraryName: cfg.OtelLibraryName,
ServiceName: cfg.OtelServiceName,
ServiceVersion: cfg.OtelServiceVersion,
DeploymentEnv: cfg.OtelDeploymentEnv,
CollectorExporterEndpoint: cfg.OtelColExporterEndpoint,
EnableTelemetry: cfg.EnableTelemetry,
Logger: logger,
})
// Pass telemetry to router...
}
2. Router Middleware Setup (MANDATORY)
// adapters/http/in/routes.go
func NewRouter(lg libLog.Logger, tl *libOpentelemetry.Telemetry, ...) *fiber.App {
f := fiber.New(fiber.Config{
DisableStartupMessage: true,
ErrorHandler: func(ctx *fiber.Ctx, err error) error {
return libHTTP.HandleFiberError(ctx, err)
},
})
// Create telemetry middleware
tlMid := libHTTP.NewTelemetryMiddleware(tl)
// MUST be first middleware - injects tracer+logger into context
f.Use(tlMid.WithTelemetry(tl))
f.Use(cors.New())
f.Use(libHTTP.WithHTTPLogging(libHTTP.WithCustomLogger(lg)))
// ... define routes ...
// Version endpoint
f.Get("/version", libHTTP.Version)
// MUST be last middleware - closes root spans
f.Use(tlMid.EndTracingSpans)
return f
}
3. Recovering Logger & Tracer (MANDATORY)
// ANY file in ANY layer (handler, service, repository)
func (s *Service) ProcessEntity(ctx context.Context, id string) error {
// Single call recovers BOTH logger AND tracer from context
logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)
// Create child span for this operation
ctx, span := tracer.Start(ctx, "service.process_entity")
defer span.End()
// Logger is automatically correlated with trace
logger.Infof("Processing entity: %s", id)
// Pass ctx to downstream calls - trace propagates automatically
return s.repo.Update(ctx, id)
}
4. Error Handling with Spans (MANDATORY)
// For technical errors (unexpected failures)
if err != nil {
libOpentelemetry.HandleSpanError(&span, "Failed to connect database", err)
logger.Errorf("Database error: %v", err)
return nil, err
}
// For business errors (expected validation failures)
if err != nil {
libOpentelemetry.HandleSpanBusinessErrorEvent(&span, "Validation failed", err)
logger.Warnf("Validation error: %v", err)
return nil, err
}
5. Server Lifecycle with Graceful Shutdown (MANDATORY)
// bootstrap/fiber.server.go
type Server struct {
app *fiber.App
serverAddress string
logger libLog.Logger
telemetry libOpentelemetry.Telemetry
}
func (s *Server) Run(l *libCommons.Launcher) error {
libServer.NewServerManager(nil, &s.telemetry, s.logger).
WithHTTPServer(s.app, s.serverAddress).
StartWithGracefulShutdown() // Handles: SIGINT/SIGTERM, telemetry flush, connections close
return nil
}
Required Environment Variables
| Variable | Description | Example |
|---|---|---|
OTEL_RESOURCE_SERVICE_NAME |
Service name in traces | service-name |
OTEL_LIBRARY_NAME |
Library identifier | service-name |
OTEL_RESOURCE_SERVICE_VERSION |
Service version | 1.0.0 |
OTEL_RESOURCE_DEPLOYMENT_ENVIRONMENT |
Environment | production |
OTEL_EXPORTER_OTLP_ENDPOINT |
Collector endpoint | http://otel-collector:4317 |
ENABLE_TELEMETRY |
Enable/disable | true |
lib-commons Telemetry Checklist
| Check | What to Verify | Status |
|---|---|---|
| Bootstrap Init | libOpentelemetry.InitializeTelemetry() called in bootstrap |
Required |
| Middleware Order | WithTelemetry() is FIRST, EndTracingSpans is LAST |
Required |
| Context Recovery | All layers use libCommons.NewTrackingFromContext(ctx) |
Required |
| Span Creation | Operations create spans via tracer.Start(ctx, "name") |
Required |
| Error Handling | Uses HandleSpanError or HandleSpanBusinessErrorEvent |
Required |
| Graceful Shutdown | libServer.NewServerManager().StartWithGracefulShutdown() |
Required |
| Env Variables | All OTEL_* variables configured | Required |
What NOT to Do
// FORBIDDEN: Manual OpenTelemetry setup without lib-commons
import "go.opentelemetry.io/otel"
tp := trace.NewTracerProvider(...) // DON'T do this manually
// FORBIDDEN: Creating loggers without context
logger := zap.NewLogger() // DON'T do this in services
// FORBIDDEN: Not passing context to downstream calls
s.repo.Update(id) // DON'T forget context
// CORRECT: Always use lib-commons patterns
telemetry := libOpentelemetry.InitializeTelemetry(&config)
logger, tracer, _, _ := libCommons.NewTrackingFromContext(ctx)
s.repo.Update(ctx, id) // Context propagates trace
Standards Compliance Categories
When evaluating a codebase for lib-commons telemetry compliance, check these categories:
| Category | Expected Pattern | Evidence Location |
|---|---|---|
| Telemetry Init | libOpentelemetry.InitializeTelemetry() |
internal/bootstrap/config.go |
| Logger Init | libZap.InitializeLogger() (bootstrap only) |
internal/bootstrap/config.go |
| Middleware Setup | NewTelemetryMiddleware() + WithTelemetry() |
internal/adapters/http/in/routes.go |
| Middleware Order | WithTelemetry first, EndTracingSpans last |
internal/adapters/http/in/routes.go |
| Context Recovery | libCommons.NewTrackingFromContext(ctx) |
All handlers, services, repositories |
| Span Creation | tracer.Start(ctx, "operation") |
All significant operations |
| Error Spans | HandleSpanError / HandleSpanBusinessErrorEvent |
Error handling paths |
| Graceful Shutdown | libServer.NewServerManager().StartWithGracefulShutdown() |
internal/bootstrap/fiber.server.go |
Structured Logging with lib-common-js (MANDATORY for TypeScript)
All TypeScript services MUST integrate structured logging using @LerianStudio/lib-common-js. This ensures consistent observability patterns across all Lerian Studio services.
Note
: lib-common-js currently provides logging infrastructure. Telemetry will be added in future versions.
Required Dependencies
{
"dependencies": {
"@LerianStudio/lib-common-js": "^1.0.0"
}
}
Required Imports
import { initializeLogger, Logger } from '@LerianStudio/lib-common-js/logger';
import { loadConfigFromEnv } from '@LerianStudio/lib-common-js/config';
import { createLoggingMiddleware } from '@LerianStudio/lib-common-js/http';
Logging Flow (MANDATORY)
┌─────────────────────────────────────────────────────────────────┐
│ 1. BOOTSTRAP (config.ts) │
│ const logger = initializeLogger() │
│ → Creates structured logger once at startup │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 2. ROUTER (routes.ts) │
│ const logMid = createLoggingMiddleware(logger) │
│ app.use(logMid) ← Injects logger into request │
│ ...routes... │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 3. ANY LAYER (handlers, services, repositories) │
│ const logger = req.logger || parentLogger │
│ logger.info('Processing...', { entityId, requestId }) │
│ → Structured JSON logs with correlation IDs │
└─────────────────────────────────────────────────────────────────┘
1. Bootstrap Initialization (MANDATORY)
// bootstrap/config.ts
import { initializeLogger } from '@LerianStudio/lib-common-js/logger';
import { loadConfigFromEnv } from '@LerianStudio/lib-common-js/config';
export async function initServers(): Promise<Service> {
// Load configuration from environment
const config = loadConfigFromEnv<Config>();
// Initialize logger
const logger = initializeLogger({
level: config.logLevel,
serviceName: config.serviceName,
serviceVersion: config.serviceVersion,
});
logger.info('Service starting', {
service: config.serviceName,
version: config.serviceVersion,
environment: config.envName,
});
// Pass logger to router...
}
2. Router Middleware Setup (MANDATORY)
// adapters/http/routes.ts
import { createLoggingMiddleware } from '@LerianStudio/lib-common-js/http';
import express from 'express';
export function createRouter(
logger: Logger,
handlers: Handlers
): express.Application {
const app = express();
// Create logging middleware - injects logger into request
const logMid = createLoggingMiddleware(logger);
app.use(logMid);
app.use(cors());
app.use(express.json());
// ... define routes ...
return app;
}
3. Using Logger in Handlers/Services (MANDATORY)
// handlers/user-handler.ts
async function createUser(req: Request, res: Response): Promise<void> {
const logger = req.logger;
const requestId = req.headers['x-request-id'] as string;
logger.info('Creating user', {
requestId,
email: req.body.email,
});
try {
const user = await userService.create(req.body, logger);
logger.info('User created successfully', {
requestId,
userId: user.id,
});
res.status(201).json(user);
} catch (error) {
logger.error('Failed to create user', {
requestId,
error: error.message,
stack: error.stack,
});
throw error;
}
}
Required Structured Log Format
All logs MUST be JSON formatted with these fields:
{
"timestamp": "2024-01-15T10:30:00.000Z",
"level": "info",
"message": "Processing request",
"service": "api-service",
"version": "1.2.3",
"environment": "production",
"requestId": "req-001",
"context": {
"method": "POST",
"path": "/api/v1/users",
"userId": "usr_456"
}
}
Required Environment Variables
| Variable | Description | Example |
|---|---|---|
LOG_LEVEL |
Logging level | info |
SERVICE_NAME |
Service identifier | api-service |
SERVICE_VERSION |
Service version | 1.0.0 |
ENV_NAME |
Environment name | production |
lib-common-js Logging Checklist
| Check | What to Verify | Status |
|---|---|---|
| Logger Init | initializeLogger() called in bootstrap |
Required |
| Middleware | createLoggingMiddleware(logger) configured |
Required |
| Request Correlation | Logs include requestId from headers |
Required |
| Structured Format | All logs are JSON formatted | Required |
| Error Logging | Errors include message, stack, and context | Required |
| No Sensitive Data | Passwords, tokens, PII NOT logged | Required |
| Log Levels | Appropriate levels used (info, warn, error) | Required |
What NOT to Do
// FORBIDDEN: Using console.log
console.log('Processing user'); // DON'T do this
// FORBIDDEN: Logging sensitive data
logger.info('User login', { password: user.password }); // NEVER
// FORBIDDEN: Unstructured log messages
logger.info(`Processing user ${userId}`); // DON'T use string interpolation
// CORRECT: Always use lib-common-js structured logging
const logger = initializeLogger(config);
logger.info('Processing user', { userId, requestId }); // Structured fields
Standards Compliance Categories (TypeScript Logging)
When evaluating a codebase for lib-common-js logging compliance, check these categories:
| Category | Expected Pattern | Evidence Location |
|---|---|---|
| Logger Init | initializeLogger() |
src/bootstrap/config.ts |
| Middleware Setup | createLoggingMiddleware(logger) |
src/adapters/http/routes.ts |
| Request Correlation | requestId in all logs |
Handlers, services |
| JSON Format | Structured JSON output | All log statements |
| Error Logging | Error object with stack trace | Error handlers |
| No console.log | No direct console usage | Entire codebase |
| No Sensitive Data | Passwords, tokens excluded | All log statements |
Health Checks
Required Endpoints
Implementation
// Go implementation for observability
type ObservabilityChecker struct {
db *sql.DB
redis *redis.Client
}
// Liveness - is the process alive?
func (h *HealthChecker) LivenessHandler(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}
// Readiness - can we serve traffic?
func (h *HealthChecker) ReadinessHandler(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()
checks := []struct {
name string
fn func(context.Context) error
}{
{"database", func(ctx context.Context) error { return h.db.PingContext(ctx) }},
{"redis", func(ctx context.Context) error { return h.redis.Ping(ctx).Err() }},
}
var failures []string
for _, check := range checks {
if err := check.fn(ctx); err != nil {
failures = append(failures, fmt.Sprintf("%s: %v", check.name, err))
}
}
if len(failures) > 0 {
w.WriteHeader(http.StatusServiceUnavailable)
json.NewEncoder(w).Encode(map[string]interface{}{
"status": "unhealthy",
"checks": failures,
})
return
}
w.WriteHeader(http.StatusOK)
json.NewEncoder(w).Encode(map[string]interface{}{
"status": "healthy",
})
}
Kubernetes Configuration
# Observability configuration
# JSON structured logging required
# OpenTelemetry tracing recommended for distributed systems
Checklist
Before deploying to production:
- Logging: Structured JSON logs with trace correlation
- Tracing: OpenTelemetry instrumentation (Go with lib-commons)
- Structured Logging: lib-common-js integration (TypeScript)