LocalAI/pkg/model
LocalAI [bot] 8bbe89a537
fix(distributed): route per request across loaded replicas + cache probeHealth (#9968)
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel

Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.
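
As an illustration of the ordering described above, here is a minimal sketch of
the same three-key policy in plain Go. The Replica struct, its field names, and
pickBestReplicaSketch are hypothetical stand-ins, not the actual code in
replicapicker.go; last_used is modeled as a Unix timestamp for simplicity.

```go
package main

import (
	"fmt"
	"sort"
)

// Replica is a hypothetical in-memory view of one loaded model replica.
// Field names are assumptions made for this sketch.
type Replica struct {
	NodeID        string
	ReplicaIndex  int
	InFlight      int    // requests currently being served
	LastUsedUnix  int64  // last use as Unix seconds (assumed representation)
	AvailableVRAM uint64 // free VRAM in bytes
}

// pickBestReplicaSketch applies the ordering in_flight ASC, last_used ASC,
// available_vram DESC and returns the first candidate. The second return
// value is false when there are no candidates.
func pickBestReplicaSketch(candidates []Replica) (Replica, bool) {
	if len(candidates) == 0 {
		return Replica{}, false
	}
	// Sort a copy so the caller's snapshot is left untouched.
	sorted := append([]Replica(nil), candidates...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.InFlight != b.InFlight {
			return a.InFlight < b.InFlight
		}
		if a.LastUsedUnix != b.LastUsedUnix {
			return a.LastUsedUnix < b.LastUsedUnix
		}
		return a.AvailableVRAM > b.AvailableVRAM
	})
	return sorted[0], true
}

func main() {
	snapshot := []Replica{
		{NodeID: "node-a", InFlight: 2, LastUsedUnix: 100, AvailableVRAM: 8 << 30},
		{NodeID: "node-b", InFlight: 0, LastUsedUnix: 200, AvailableVRAM: 8 << 30},
		{NodeID: "node-c", InFlight: 0, LastUsedUnix: 100, AvailableVRAM: 16 << 30},
	}
	best, ok := pickBestReplicaSketch(snapshot)
	fmt.Println(best.NodeID, ok)
}
```

Keeping the policy in a pure function like this is what lets a mirror test
compare it against the SQL ORDER BY on the same seeded rows.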

A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.

No behavior change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(distributed): route per inference request and cache probeHealth

Two related fixes that together restore load balancing across loaded
replicas of the same model.

1. ModelLoader.Load and LoadModel bypass the local *Model cache when
   modelRouter is set. The cached *Model wraps an InFlightTrackingClient
   bound to a single (nodeID, replicaIndex) — reusing it pinned every
   subsequent request to whichever node won the very first pick, so
   FindAndLockNodeWithModel's round-robin never got a chance to run
   even after the reconciler scaled the model out to a second node. In
   distributed mode SmartRouter.Route now runs per request, and
   PickBestReplica picks the least-loaded replica each time.

   SmartRouter has its own coalescing (advisory DB lock for first-time
   loads + singleflight on backend.install RPC) so concurrent first
   requests for a not-yet-loaded model still produce a single
   worker-side install.

2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
   in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
   routing every inference call hits probeHealth, and llama.cpp-style
   backends serialize HealthCheck behind active Predict — so a burst of
   incoming requests stalled on the probe to a node already mid-stream,
   tripping the 2s timeout and falling through to the install path.
   singleflight collapses N concurrent first-time probes for the same
   (node, addr) into one round-trip, failed probes invalidate the entry
   so the staleness-recovery path still triggers, and the TTL matches
   pkg/model/model.go's healthCheckTTL so the single-process and
   distributed paths share a staleness budget. The background
   HealthMonitor still reaps actually-dead backends within ~45s.
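
The caching behaviour described in item 2 can be sketched roughly as follows.
This is a stdlib-only approximation, not the contents of probe_cache.go: the
type and function names are hypothetical, and the hand-rolled in-flight map
stands in for the singleflight coalescing mentioned above. Only successful
probes are cached, and a failure drops any existing entry.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// defaultProbeTTL mirrors the 30s TTL mentioned in the commit message.
const defaultProbeTTL = 30 * time.Second

// errProbeFailed is a placeholder error for demonstrations.
var errProbeFailed = errors.New("health probe failed")

type probeKey struct{ nodeID, addr string }

// probeCall tracks one probe that other callers can wait on.
type probeCall struct {
	done chan struct{}
	err  error
}

// probeCacheSketch is a hypothetical TTL cache for successful health probes.
type probeCacheSketch struct {
	mu       sync.Mutex
	ttl      time.Duration
	healthy  map[probeKey]time.Time // expiry of the last successful probe
	inflight map[probeKey]*probeCall
}

func newProbeCacheSketch(ttl time.Duration) *probeCacheSketch {
	return &probeCacheSketch{
		ttl:      ttl,
		healthy:  make(map[probeKey]time.Time),
		inflight: make(map[probeKey]*probeCall),
	}
}

// check returns nil from the cache while a successful result is fresh.
// Otherwise it runs probe once per key, letting concurrent callers share
// the result. A sketch only: a panicking probe would strand waiters.
func (c *probeCacheSketch) check(nodeID, addr string, probe func() error) error {
	k := probeKey{nodeID, addr}
	c.mu.Lock()
	if exp, ok := c.healthy[k]; ok && time.Now().Before(exp) {
		c.mu.Unlock()
		return nil
	}
	if call, ok := c.inflight[k]; ok {
		c.mu.Unlock()
		<-call.done // wait for the probe already in progress
		return call.err
	}
	call := &probeCall{done: make(chan struct{})}
	c.inflight[k] = call
	c.mu.Unlock()

	call.err = probe()

	c.mu.Lock()
	delete(c.inflight, k)
	if call.err == nil {
		c.healthy[k] = time.Now().Add(c.ttl)
	} else {
		delete(c.healthy, k) // failures invalidate the entry
	}
	c.mu.Unlock()
	close(call.done)
	return call.err
}

func main() {
	c := newProbeCacheSketch(defaultProbeTTL)
	calls := 0
	probe := func() error { calls++; return nil }
	for i := 0; i < 3; i++ {
		if err := c.check("node-1", "10.0.0.1:50051", probe); err != nil {
			fmt.Println("probe failed:", err)
		}
	}
	fmt.Println("backend probes issued:", calls)
}
```

Because failures are never cached, a node that stops answering is re-probed on
the next request rather than being treated as healthy until the TTL expires.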

The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per-modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.
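
To make the cache bypass from item 1 concrete, the following sketch shows a
loader that skips its local cache whenever a router is configured. All names
here (loaderSketch, router, roundRobin) are hypothetical and addresses stand in
for the real client handles; the actual ModelLoader in pkg/model/loader.go is
more involved.

```go
package main

import "fmt"

// router is a hypothetical stand-in for the distributed router. route
// returns the address of the replica chosen for this request.
type router interface {
	route(modelID string) (string, error)
}

// loaderSketch caches one backend address per model in single-process
// mode and defers to the router on every call in distributed mode.
type loaderSketch struct {
	router    router            // nil in single-process mode
	cache     map[string]string // modelID -> backend address
	loadLocal func(modelID string) (string, error)
}

func (l *loaderSketch) load(modelID string) (string, error) {
	if l.router != nil {
		// Distributed mode: a cached handle would pin every request to
		// the first replica picked, so route each request afresh.
		return l.router.route(modelID)
	}
	if addr, ok := l.cache[modelID]; ok {
		return addr, nil
	}
	addr, err := l.loadLocal(modelID)
	if err != nil {
		return "", err
	}
	l.cache[modelID] = addr
	return addr, nil
}

// roundRobin is a toy router that alternates between fixed addresses.
type roundRobin struct {
	addrs []string
	next  int
}

func (r *roundRobin) route(string) (string, error) {
	addr := r.addrs[r.next%len(r.addrs)]
	r.next++
	return addr, nil
}

func main() {
	l := &loaderSketch{
		router: &roundRobin{addrs: []string{"node-a:50051", "node-b:50051"}},
		cache:  map[string]string{},
	}
	for i := 0; i < 4; i++ {
		addr, _ := l.load("my-model")
		fmt.Println(addr)
	}
}
```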

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-24 08:15:27 +00:00
backend_log_store.go feat: react chat redesign (#9616) 2026-04-29 22:33:26 +02:00
backend_log_store_test.go feat: react chat redesign (#9616) 2026-04-29 22:33:26 +02:00
connection_errors.go fix(nodes): better detection if nodes goes down or model is not available (#9274) 2026-04-08 12:11:02 +02:00
connection_evicting_client.go feat: wire transcription for llama.cpp, add streaming support (#9353) 2026-04-14 16:13:40 +02:00
filters.go fix: make sure to close on errors (#7521) 2025-12-11 14:03:20 +01:00
initializers.go fix(distributed): route per request across loaded replicas + cache probeHealth (#9968) 2026-05-24 08:15:27 +00:00
initializers_retry_test.go feat(concurrency-groups): per-model exclusive groups for backend loading (#9662) 2026-05-05 08:42:50 +02:00
loader.go fix(distributed): route per request across loaded replicas + cache probeHealth (#9968) 2026-05-24 08:15:27 +00:00
loader_options.go feat: refactor build process, drop embedded backends (#5875) 2025-07-22 16:31:04 +02:00
loader_test.go fix(nodes): better detection if nodes goes down or model is not available (#9274) 2026-04-08 12:11:02 +02:00
model.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
model_suite_test.go tests: add template tests (#2063) 2024-04-18 10:57:24 +02:00
process.go feat: Log backend exit code (#9581) 2026-04-27 14:19:18 +02:00
store.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
store_test.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
watchdog.go feat(concurrency-groups): per-model exclusive groups for backend loading (#9662) 2026-05-05 08:42:50 +02:00
watchdog_options.go feat: disable force eviction (#7725) 2025-12-25 14:26:18 +01:00
watchdog_options_test.go chore: fixup tests with defaults from constants 2025-12-16 21:26:55 +00:00
watchdog_test.go feat(concurrency-groups): per-model exclusive groups for backend loading (#9662) 2026-05-05 08:42:50 +02:00