LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00


LocalAI [bot] 447c186089 fix(distributed): make backend upgrade actually re-install on workers (#9708)

* fix(distributed): make backend upgrade actually re-install on workers

UpgradeBackend dispatched a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend short-circuits on "already running for this (model, replica) slot" and returns the existing address, so the gallery install path was skipped, no artifact was re-downloaded, no metadata was written. The frontend's drift detection then re-flagged the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" landed in the logs at the same time.

The user-visible symptom: clicking "Upgrade All" silently does nothing and the same N backends sit on the upgrade list forever.

Two coupled fixes, one PR:

1. Force flag on backend.install. Add `Force bool` to BackendInstallRequest and thread it through NodeCommandSender -> RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey), skips the findBackend short-circuit, and passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten. After the gallery install completes, startBackend brings up a fresh process at the same processKey on a new port.

2. Liveness check on the fast path. installBackend's "already running" branch read getAddr without verifying the process was alive, so a gRPC backend that died without the supervisor noticing left a stale (key, addr) entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install, and the supervisor said "already running addr=…" again. Loop forever, exactly what we observed on a node whose llama-cpp process had died but whose supervisor record persisted. Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install.

Backwards-compatible: the new Force field is omitempty, older workers ignore it (their default behavior matches force=false). The signature change on NodeCommandSender.InstallBackend is internal-only.

Verified: unit tests in core/services/nodes pass (108s suite). The pre-existing core/backend build break (proto regen pending for word-level timestamps) blocks core/cli and core/http/endpoints/localai package tests but is unrelated to this change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* test(e2e/distributed): pass force=false to adapter.InstallBackend

NodeCommandSender.InstallBackend gained a final force bool in the upgrade-force commit; the e2e distributed lifecycle tests still called the old 8-arg signature and broke compilation. These tests exercise the routine install path (single replica, default behavior), so force=false preserves their existing semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

2026-05-07 17:28:14 +02:00
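
The two fixes described in the commit message above can be sketched in a few lines of Go. This is a simplified illustration, not the code in worker.go: the `supervisor` type, its fields, and the single-argument signatures are hypothetical stand-ins, the gallery install is reduced to a counter, and force=true here stops only the exact process key, whereas the commit says the real worker also stops peer replicas. Only the names isRunning, stopBackendExact, startBackend, and installBackend are taken from the commit message.

```go
package main

import "fmt"

// supervisor is a hypothetical stand-in for the worker's process supervisor.
type supervisor struct {
	addrs    map[string]string // processKey -> recorded address
	alive    map[string]bool   // processKey -> whether the process is really up
	installs int               // counts real installs (stands in for the gallery path)
	nextPort int
}

func newSupervisor() *supervisor {
	return &supervisor{
		addrs:    map[string]string{},
		alive:    map[string]bool{},
		nextPort: 50050,
	}
}

// isRunning reports whether the process behind key is alive.
func (s *supervisor) isRunning(key string) bool { return s.alive[key] }

// stopBackendExact drops the record for exactly this key.
func (s *supervisor) stopBackendExact(key string) {
	delete(s.addrs, key)
	delete(s.alive, key)
}

// startBackend records a fresh process for key on a new port.
func (s *supervisor) startBackend(key string) string {
	s.nextPort++
	addr := fmt.Sprintf("127.0.0.1:%d", s.nextPort)
	s.addrs[key] = addr
	s.alive[key] = true
	return addr
}

// installBackend takes the fast path only when the caller did not force a
// re-install AND the recorded process is verifiably alive. A stale or forced
// entry is cleaned up and falls through to a real install.
func (s *supervisor) installBackend(key string, force bool) string {
	if addr, ok := s.addrs[key]; ok {
		if !force && s.isRunning(key) {
			return addr // already running: reuse the existing address
		}
		s.stopBackendExact(key)
	}
	s.installs++ // placeholder for the gallery install
	return s.startBackend(key)
}

func main() {
	s := newSupervisor()
	key := "llama-cpp#0" // hypothetical process key

	first := s.installBackend(key, false)
	again := s.installBackend(key, false)
	fmt.Println("fast path reused address:", first == again, "installs:", s.installs)

	s.alive[key] = false // simulate a process that died unnoticed
	fmt.Println("after stale entry:", s.installBackend(key, false), "installs:", s.installs)

	fmt.Println("after forced upgrade:", s.installBackend(key, true), "installs:", s.installs)
}
```

Checking liveness before the address lookup is trusted is what breaks the retry loop in this sketch: a dead entry can no longer be returned as "already running", so the next install attempt does real work.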
context	feat: Merge repeated log lines in the terminal (#9141)	2026-03-26 22:16:13 +01:00
worker	feat(vllm, distributed): tensor parallel distributed workers (#9612)	2026-05-06 00:22:50 +02:00
workerregistry	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
agent.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
agent_test.go	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)	2026-03-31 08:28:56 +02:00
agent_worker.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
backends.go	feat: backend versioning, upgrade detection and auto-upgrade (#9315)	2026-04-11 22:31:15 +02:00
cli.go	feat: localai assistant chat modality (#9602)	2026-04-28 19:29:27 +02:00
cli_suite_test.go	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)	2026-03-31 08:28:56 +02:00
completion.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
completion_test.go	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)	2026-03-31 08:28:56 +02:00
deprecations.go	chore: Standardize CLI flag naming to kebab-case (M12) (#8912)	2026-03-09 22:15:39 +01:00
explorer.go	chore(refactor): move logging to common package based on slog (#7668)	2025-12-21 19:33:13 +01:00
federated.go	chore: Standardize CLI flag naming to kebab-case (M12) (#8912)	2026-03-09 22:15:39 +01:00
mcp_server.go	feat: localai assistant chat modality (#9602)	2026-04-28 19:29:27 +02:00
models.go	feat: add distributed mode (#9124)	2026-03-30 00:47:27 +02:00
run.go	feat: localai assistant chat modality (#9602)	2026-04-28 19:29:27 +02:00
soundgeneration.go	fix: Remove debug print statement from soundgeneration.go (C2) (#8843)	2026-03-08 08:49:29 +01:00
transcript.go	feat: support word-level timestamps for faster-whisper (#9621)	2026-05-06 00:32:52 +02:00
tts.go	chore(refactor): move logging to common package based on slog (#7668)	2025-12-21 19:33:13 +01:00
util.go	feat: improve CLI error messages with actionable guidance (#8880)	2026-04-21 11:53:26 +02:00
worker.go	fix(distributed): make backend upgrade actually re-install on workers (#9708)	2026-05-07 17:28:14 +02:00
worker_addr_test.go	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)	2026-03-31 08:28:56 +02:00
worker_replica_test.go	fix(distributed): worker stopBackend/isRunning resolve bare modelID to replica keys	2026-04-27 21:43:15 +00:00