LocalAI/core/cli
LocalAI [bot] 447c186089
fix(distributed): make backend upgrade actually re-install on workers (#9708)
* fix(distributed): make backend upgrade actually re-install on workers

UpgradeBackend dispatched a vanilla backend.install NATS event to every
node hosting the backend. The worker's installBackend short-circuits on
"already running for this (model, replica) slot" and returns the
existing address — so the gallery install path was skipped, no artifact
was re-downloaded, no metadata was written. The frontend's drift
detection then re-flagged the same backends every cycle (installedDigest
stays empty → mismatch → "Backend upgrade available (new build)") while
"Backend upgraded successfully" landed in the logs at the same time.
The user-visible symptom: clicking "Upgrade All" silently does nothing
and the same N backends sit on the upgrade list forever.
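
A minimal sketch of why the flag never cleared, assuming the drift check boils down to comparing a recorded digest with the gallery's digest. The function name needsUpgrade and the digest strings are hypothetical; only the "installedDigest stays empty → mismatch" behavior comes from the description above.

```go
package main

import "fmt"

// needsUpgrade is a hypothetical stand-in for the frontend's drift check:
// a backend is flagged when the digest recorded at install time differs
// from the digest currently published in the gallery.
func needsUpgrade(installedDigest, galleryDigest string) bool {
	return installedDigest != galleryDigest
}

func main() {
	gallery := "sha256:abc" // illustrative value only

	// Skipped install: no metadata was written, so the installed digest
	// stays empty and the backend is re-flagged on every cycle.
	fmt.Println("after skipped install:", needsUpgrade("", gallery))

	// Real re-install: metadata records the gallery digest and the flag clears.
	fmt.Println("after real install:", needsUpgrade(gallery, gallery))
}
```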

Two coupled fixes, one PR:

1. Force flag on backend.install. Add `Force bool` to
   BackendInstallRequest and thread it through NodeCommandSender ->
   RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op
   drain when retrying an upgrade) sets force=true; routine load events
   and admin install endpoints keep force=false. On the worker, force=true
   stops every live process that uses this backend (resolveProcessKeys
   for peer replicas, plus the exact request processKey), skips the
   findBackend short-circuit, and passes force=true into
   gallery.InstallBackendFromGallery so the on-disk artifact is
   overwritten. After the gallery install completes, startBackend brings
   up a fresh process at the same processKey on a new port.

2. Liveness check on the fast path. installBackend's "already running"
   branch read getAddr without verifying the process was alive, so a
   gRPC backend that died without the supervisor noticing left a stale
   (key, addr) entry. The reconciler then dialed that address, got
   ECONNREFUSED, marked the replica failed, and retried the install, at
   which point the supervisor said "already running addr=…" again. The
   loop never ended, which is exactly what we observed on a node whose
   llama-cpp process had died but whose supervisor record persisted. The
   fix verifies s.isRunning(processKey) before trusting getAddr; if the
   entry is stale, stopBackendExact cleans up and we fall through to a
   real install.
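
The two fixes above can be sketched together as a single decision path. This is an in-memory toy, not the worker's actual code: the supervisor struct, its fields, the "llama-cpp#0" key format and the port numbers are all assumptions, and the installs counter merely stands in for gallery.InstallBackendFromGallery. Only the method names isRunning, stopBackendExact, startBackend and installBackend are taken from the description.

```go
package main

import "fmt"

// supervisor is a hypothetical in-memory stand-in for the worker's process table.
type supervisor struct {
	addrs    map[string]string // processKey -> last known address
	alive    map[string]bool   // processKey -> process is still running
	nextPort int
	installs int // counts simulated gallery installs
}

func newSupervisor() *supervisor {
	return &supervisor{addrs: map[string]string{}, alive: map[string]bool{}, nextPort: 50050}
}

func (s *supervisor) isRunning(key string) bool { return s.alive[key] }

// stopBackendExact drops the record for exactly this process key.
func (s *supervisor) stopBackendExact(key string) {
	delete(s.addrs, key)
	delete(s.alive, key)
}

// startBackend brings up a "process" at the same key on a new port.
func (s *supervisor) startBackend(key string) string {
	s.nextPort++
	addr := fmt.Sprintf("127.0.0.1:%d", s.nextPort)
	s.addrs[key] = addr
	s.alive[key] = true
	return addr
}

// installBackend sketches the patched flow. Simplification: every path that
// misses the fast path performs an install.
func (s *supervisor) installBackend(key string, force bool) string {
	if force {
		// Upgrade: stop the live process and skip the short-circuit.
		// (The description says peer replicas are stopped too; omitted here.)
		s.stopBackendExact(key)
	} else if addr, ok := s.addrs[key]; ok {
		if s.isRunning(key) {
			return addr // fast path: the recorded process is really alive
		}
		s.stopBackendExact(key) // stale entry: clean up and fall through
	}
	s.installs++ // stands in for the gallery install
	return s.startBackend(key)
}

func main() {
	s := newSupervisor()
	key := "llama-cpp#0" // hypothetical process key

	first := s.installBackend(key, false)
	again := s.installBackend(key, false)
	fmt.Println("fast path reused address:", first == again, "installs:", s.installs)

	s.alive[key] = false // process died without the supervisor noticing
	fresh := s.installBackend(key, false)
	fmt.Println("stale entry replaced:", fresh != first, "installs:", s.installs)

	upgraded := s.installBackend(key, true)
	fmt.Println("forced re-install on new port:", upgraded != fresh, "installs:", s.installs)
}
```

In this sketch force is checked before the address lookup, so an upgrade can never be answered from the fast path, while the liveness check protects routine installs from stale records.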

Backwards-compatible: the new Force field is omitempty, older workers
ignore it (their default behavior matches force=false). The signature
change on NodeCommandSender.InstallBackend is internal-only.
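
A small sketch of the compatibility argument, assuming the request is JSON-encoded (the omitempty tag suggests this, but the wire format is not stated here). The Backend field, the tag names and the oldBackendInstallRequest type are hypothetical; only the Force field and its omitempty behavior come from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackendInstallRequest mirrors the shape described above. Only Force and
// its omitempty tag come from the commit; Backend is illustrative.
type BackendInstallRequest struct {
	Backend string `json:"backend"`
	Force   bool   `json:"force,omitempty"`
}

// oldBackendInstallRequest is a hypothetical pre-change worker's view of
// the same payload, with no Force field.
type oldBackendInstallRequest struct {
	Backend string `json:"backend"`
}

func encode(req BackendInstallRequest) string {
	b, err := json.Marshal(req)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func decodeOld(payload string) oldBackendInstallRequest {
	var r oldBackendInstallRequest
	// encoding/json ignores unknown fields by default, so "force" is dropped.
	if err := json.Unmarshal([]byte(payload), &r); err != nil {
		panic(err)
	}
	return r
}

func main() {
	// force=false is omitted entirely, so routine events look unchanged on the wire.
	fmt.Println(encode(BackendInstallRequest{Backend: "llama-cpp"}))

	// force=true adds the field; an older worker still decodes the payload.
	upgrade := encode(BackendInstallRequest{Backend: "llama-cpp", Force: true})
	fmt.Println(upgrade)
	fmt.Printf("%+v\n", decodeOld(upgrade))
}
```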

Verified: unit tests in core/services/nodes pass (108s suite). The
pre-existing core/backend build break (proto regen pending for
word-level timestamps) blocks core/cli and core/http/endpoints/localai
package tests but is unrelated to this change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* test(e2e/distributed): pass force=false to adapter.InstallBackend

NodeCommandSender.InstallBackend gained a final force bool in the
upgrade-force commit; the e2e distributed lifecycle tests still called
the old 8-arg signature and broke compilation. These tests exercise the
routine install path (single replica, default behavior), so force=false
preserves their existing semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-07 17:28:14 +02:00
context feat: Merge repeated log lines in the terminal (#9141) 2026-03-26 22:16:13 +01:00
worker feat(vllm, distributed): tensor parallel distributed workers (#9612) 2026-05-06 00:22:50 +02:00
workerregistry feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
agent.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
agent_test.go feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186) 2026-03-31 08:28:56 +02:00
agent_worker.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
backends.go feat: backend versioning, upgrade detection and auto-upgrade (#9315) 2026-04-11 22:31:15 +02:00
cli.go feat: localai assistant chat modality (#9602) 2026-04-28 19:29:27 +02:00
cli_suite_test.go feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186) 2026-03-31 08:28:56 +02:00
completion.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
completion_test.go feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186) 2026-03-31 08:28:56 +02:00
deprecations.go chore: Standardize CLI flag naming to kebab-case (M12) (#8912) 2026-03-09 22:15:39 +01:00
explorer.go chore(refactor): move logging to common package based on slog (#7668) 2025-12-21 19:33:13 +01:00
federated.go chore: Standardize CLI flag naming to kebab-case (M12) (#8912) 2026-03-09 22:15:39 +01:00
mcp_server.go feat: localai assistant chat modality (#9602) 2026-04-28 19:29:27 +02:00
models.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
run.go feat: localai assistant chat modality (#9602) 2026-04-28 19:29:27 +02:00
soundgeneration.go fix: Remove debug print statement from soundgeneration.go (C2) (#8843) 2026-03-08 08:49:29 +01:00
transcript.go feat: support word-level timestamps for faster-whisper (#9621) 2026-05-06 00:32:52 +02:00
tts.go chore(refactor): move logging to common package based on slog (#7668) 2025-12-21 19:33:13 +01:00
util.go feat: improve CLI error messages with actionable guidance (#8880) 2026-04-21 11:53:26 +02:00
worker.go fix(distributed): make backend upgrade actually re-install on workers (#9708) 2026-05-07 17:28:14 +02:00
worker_addr_test.go feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186) 2026-03-31 08:28:56 +02:00
worker_replica_test.go fix(distributed): worker stopBackend/isRunning resolve bare modelID to replica keys 2026-04-27 21:43:15 +00:00