LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00


LocalAI [bot] 8bbe89a537 fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)

* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC, available_vram DESC) out of the SQL ORDER BY into a pure Go function in the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity and remains the production path used by SmartRouter; PickBestReplica is the canonical implementation that the future per-frontend rotating replica cache (TODO referenced from pkg/model) will call against an in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts both layers pick the same replica, so any future tweak to either side fails the test until the other side is updated. No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local Model cache when modelRouter is set. The cached Model wraps an InFlightTrackingClient bound to a single (nodeID, replicaIndex) — reusing it pinned every subsequent request to whichever node won the very first pick, so FindAndLockNodeWithModel's round-robin never got a chance to run even after the reconciler scaled the model out to a second node. In distributed mode SmartRouter.Route now runs per request, and PickBestReplica picks the least-loaded replica each time. SmartRouter has its own coalescing (advisory DB lock for first-time loads + singleflight on backend.install RPC) so concurrent first requests for a not-yet-loaded model still produce a single worker side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results in a new probeCache (probe_cache.go) with a 30s TTL. With per-request routing every inference call hits probeHealth, and llama.cpp-style backends serialize HealthCheck behind active Predict — so a burst of incoming requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. singleflight collapses N concurrent first-time probes for the same (node, addr) into one round-trip, failed probes invalidate the entry so the staleness-recovery path still triggers, and the TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget. The background HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per inference. A TODO in pkg/model/loader.go documents the future per modelID rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

2026-05-24 08:15:27 +00:00
backend_log_store.go	feat: react chat redesign (#9616)	2026-04-29 22:33:26 +02:00
backend_log_store_test.go	feat: react chat redesign (#9616)	2026-04-29 22:33:26 +02:00
connection_errors.go	fix(nodes): better detection if nodes goes down or model is not available (#9274)	2026-04-08 12:11:02 +02:00
connection_evicting_client.go	feat: wire transcription for llama.cpp, add streaming support (#9353)	2026-04-14 16:13:40 +02:00
filters.go	fix: make sure to close on errors (#7521)	2025-12-11 14:03:20 +01:00
initializers.go	fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)	2026-05-24 08:15:27 +00:00
initializers_retry_test.go	feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)	2026-05-05 08:42:50 +02:00
loader.go	fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)	2026-05-24 08:15:27 +00:00
loader_options.go	feat: refactor build process, drop embedded backends (#5875)	2025-07-22 16:31:04 +02:00
loader_test.go	fix(nodes): better detection if nodes goes down or model is not available (#9274)	2026-04-08 12:11:02 +02:00
model.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
model_suite_test.go	tests: add template tests (#2063)	2024-04-18 10:57:24 +02:00
process.go	feat: Log backend exit code (#9581)	2026-04-27 14:19:18 +02:00
store.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
store_test.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
watchdog.go	feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)	2026-05-05 08:42:50 +02:00
watchdog_options.go	feat: disable force eviction (#7725)	2025-12-25 14:26:18 +01:00
watchdog_options_test.go	chore: fixup tests with defaults from constants	2025-12-16 21:26:55 +00:00
watchdog_test.go	feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)	2026-05-05 08:42:50 +02:00