LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00

LocalAI [bot] 8bbe89a537 fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)

* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC, available_vram DESC) out of the SQL ORDER BY into a pure Go function in the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity and remains the production path used by SmartRouter; PickBestReplica is the canonical implementation that the future per-frontend rotating replica cache (TODO referenced from pkg/model) will call against an in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts both layers pick the same replica, so any future tweak to either side fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local Model cache when modelRouter is set. The cached Model wraps an InFlightTrackingClient bound to a single (nodeID, replicaIndex); reusing it pinned every subsequent request to whichever node won the very first pick, so FindAndLockNodeWithModel's round-robin never got a chance to run even after the reconciler scaled the model out to a second node. In distributed mode SmartRouter.Route now runs per request, and PickBestReplica picks the least-loaded replica each time. SmartRouter has its own coalescing (advisory DB lock for first-time loads + singleflight on backend.install RPC) so concurrent first requests for a not-yet-loaded model still produce a single worker side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results in a new probeCache (probe_cache.go) with a 30s TTL. With per-request routing every inference call hits probeHealth, and llama.cpp-style backends serialize HealthCheck behind active Predict, so a burst of incoming requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. singleflight collapses N concurrent first-time probes for the same (node, addr) into one round-trip, failed probes invalidate the entry so the staleness-recovery path still triggers, and the TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget. The background HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per inference. A TODO in pkg/model/loader.go documents the future per modelID rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

2026-05-24 08:15:27 +00:00
..
audio	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
concurrency	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
downloader	feat(gallery): verify backend OCI images with keyless cosign (#9823)	2026-05-18 08:02:20 +02:00
functions	feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)	2026-05-13 21:57:27 +02:00
grpc	feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)	2026-05-13 21:57:27 +02:00
huggingface-api	fix(importer): emit all shards for multi-part GGUF models (#9513)	2026-04-23 15:00:02 +02:00
mcp/localaitools	feat(branding): admin-configurable instance name, tagline, and assets (#9635)	2026-05-02 15:51:36 +02:00
model	fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)	2026-05-24 08:15:27 +00:00
oci	feat(gallery): verify backend OCI images with keyless cosign (#9823)	2026-05-18 08:02:20 +02:00
reasoning	fix(reasoning): suppress partial tag tokens during autoparser warm-up	2026-04-04 20:45:57 +00:00
sanitize	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
signals	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
sound	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
store	chore: fix go.mod module (#2635)	2024-06-23 08:24:36 +00:00
system	feat(importer): expand importer flow to almost all backends (#9466)	2026-04-22 22:42:37 +02:00
utils	[utils] Fail immediately on extraction errors (#9926)	2026-05-21 19:00:33 +02:00
vram	feat(gallery): Speed up load times and clean gallery entries (#9211)	2026-05-06 14:51:38 +02:00
xio	feat(ui): allow to cancel ops (#7264)	2025-11-13 18:41:47 +01:00
xsync	chore: fix go.mod module (#2635)	2024-06-23 08:24:36 +00:00
xsysinfo	feat: also parse VRAM budget/usage from vulkaninfo (#9800)	2026-05-13 21:43:12 +02:00