LocalAI/pkg
LocalAI [bot] 8bbe89a537
fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.

A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded
replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when
   modelRouter is set. The cached *Model wraps an InFlightTrackingClient
   bound to a single (nodeID, replicaIndex) — reusing it pinned every
   subsequent request to whichever node won the very first pick, so
   FindAndLockNodeWithModel's round-robin never got a chance to run
   even after the reconciler scaled the model out to a second node. In
   distributed mode SmartRouter.Route now runs per request, and
   PickBestReplica picks the least-loaded replica each time.

   SmartRouter has its own coalescing (advisory DB lock for first-time
   loads + singleflight on backend.install RPC) so concurrent first
   requests for a not-yet-loaded model still produce a single worker
   side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
   in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
   routing every inference call hits probeHealth, and llama.cpp-style
   backends serialize HealthCheck behind active Predict — so a burst of
   incoming requests stalled on the probe to a node already mid-stream,
   tripping the 2s timeout and falling through to the install path.
   singleflight collapses N concurrent first-time probes for the same
   (node, addr) into one round-trip, failed probes invalidate the entry
   so the staleness-recovery path still triggers, and the TTL matches
   pkg/model/model.go's healthCheckTTL so the single-process and
   distributed paths share a staleness budget. The background
   HealthMonitor still reaps actually-dead backends within ~45s.

The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-24 08:15:27 +00:00
..
audio feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
concurrency feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
downloader feat(gallery): verify backend OCI images with keyless cosign (#9823) 2026-05-18 08:02:20 +02:00
functions feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801) 2026-05-13 21:57:27 +02:00
grpc feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801) 2026-05-13 21:57:27 +02:00
huggingface-api fix(importer): emit all shards for multi-part GGUF models (#9513) 2026-04-23 15:00:02 +02:00
mcp/localaitools feat(branding): admin-configurable instance name, tagline, and assets (#9635) 2026-05-02 15:51:36 +02:00
model fix(distributed): route per request across loaded replicas + cache probeHealth (#9968) 2026-05-24 08:15:27 +00:00
oci feat(gallery): verify backend OCI images with keyless cosign (#9823) 2026-05-18 08:02:20 +02:00
reasoning fix(reasoning): suppress partial tag tokens during autoparser warm-up 2026-04-04 20:45:57 +00:00
sanitize feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
signals feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
sound feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
store chore: fix go.mod module (#2635) 2024-06-23 08:24:36 +00:00
system feat(importer): expand importer flow to almost all backends (#9466) 2026-04-22 22:42:37 +02:00
utils [utils] Fail immediately on extraction errors (#9926) 2026-05-21 19:00:33 +02:00
vram feat(gallery): Speed up load times and clean gallery entries (#9211) 2026-05-06 14:51:38 +02:00
xio feat(ui): allow to cancel ops (#7264) 2025-11-13 18:41:47 +01:00
xsync chore: fix go.mod module (#2635) 2024-06-23 08:24:36 +00:00
xsysinfo feat: also parse VRAM budget/usage from vulkaninfo (#9800) 2026-05-13 21:43:12 +02:00