LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00

LocalAI [bot] 8bbe89a537 fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)	2026-05-24 08:15:27 +00:00

* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC, available_vram DESC) out of the SQL ORDER BY into a pure Go function in the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity and remains the production path used by SmartRouter; PickBestReplica is the canonical implementation that the future per-frontend rotating replica cache (TODO referenced from pkg/model) will call against an in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts both layers pick the same replica, so any future tweak to either side fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local Model cache when modelRouter is set. The cached Model wraps an InFlightTrackingClient bound to a single (nodeID, replicaIndex): reusing it pinned every subsequent request to whichever node won the very first pick, so FindAndLockNodeWithModel's round-robin never got a chance to run even after the reconciler scaled the model out to a second node. In distributed mode SmartRouter.Route now runs per request, and PickBestReplica picks the least-loaded replica each time. SmartRouter has its own coalescing (advisory DB lock for first-time loads + singleflight on backend.install RPC) so concurrent first requests for a not-yet-loaded model still produce a single worker side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results in a new probeCache (probe_cache.go) with a 30s TTL. With per-request routing every inference call hits probeHealth, and llama.cpp-style backends serialize HealthCheck behind active Predict, so a burst of incoming requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. singleflight collapses N concurrent first-time probes for the same (node, addr) into one round-trip, failed probes invalidate the entry so the staleness-recovery path still triggers, and the TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget. The background HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per inference. A TODO in pkg/model/loader.go documents the future per modelID rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
..
advisorylock	feat(distributed): sync state with frontends, better backend management reporting (#9426)	2026-04-19 17:55:53 +02:00
agentpool	fix(agentpool): close truncate-then-read race in agent_jobs.json persistence (#9811)	2026-05-13 23:58:43 +02:00
agents	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
dbutil	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
distributed	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
facerecognition	feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)	2026-04-22 21:55:41 +02:00
finetune	chore: Security hardening (#9719)	2026-05-08 16:25:45 +02:00
galleryop	fix(distributed): make admin backend installs resilient and observable (#9958)	2026-05-23 12:35:44 +02:00
jobs	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
mcp	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
messaging	fix(distributed): make admin backend installs resilient and observable (#9958)	2026-05-23 12:35:44 +02:00
modeladmin	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
monitoring	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
nodes	fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)	2026-05-24 08:15:27 +00:00
quantization	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
skills	refactor(agents): bump skillserver, drop redundant Name from list_skills output (#9916)	2026-05-21 14:45:53 +02:00
storage	feat: track files being staged (#9275)	2026-04-08 14:33:58 +02:00
testutil	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
voicerecognition	feat: voice recognition (#9500)	2026-04-23 12:07:14 +02:00
worker	fix(distributed): make admin backend installs resilient and observable (#9958)	2026-05-23 12:35:44 +02:00