LocalAI/core/config
LocalAI [bot] b4fdb41dcc
fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status (#9754)
* fix(distributed): cascade-clean stale node_models on drain and filter routing by healthy status

Stale node_models rows (state="loaded") were surviving past the healthy
state of their owning node, causing /embeddings (and other inference
paths) to dispatch to a backend whose process was gone or drained. The
downstream symptom in a live cluster was pgvector rejecting inserts
with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)"
because the misbehaving backend silently returned a malformed
(oversized) tensor. The Models page also showed the model as "running"
with no associated node, i.e. a stale entry, even though the node was
no longer visible in the Nodes view.

Three changes here, with a follow-up commit refining the per-model probe:

- MarkDraining now cascade-deletes node_models rows for the affected
  node, mirroring MarkOffline. Drains are explicit operator actions —
  the box has been intentionally taken out of rotation — so clearing
  the rows stops the Models UI from misreporting and prevents the
  routing layer from picking those rows if scheduling logic is ever
  relaxed. In-flight requests already hold their gRPC client through
  Route() and finish normally; the only observable effect is a
  non-fatal IncrementInFlight warning, acceptable for a drain.

  MarkUnhealthy is deliberately left status-only: it fires from
  managers_distributed / reconciler on a single nats.ErrNoResponders
  with no retry, so a transient NATS hiccup must not nuke every loaded
  model and force a full reload on recovery.

- FindAndLockNodeWithModel's inner JOIN now filters on
  backend_nodes.status = healthy in addition to node_models.state =
  loaded. The previous version relied on the second node-fetch step to
  reject non-healthy nodes, but a concurrent reader could still pick
  the same stale row in the same window. Belt-and-braces.

- DistributedConfig.PerModelHealthCheck renamed to
  DisablePerModelHealthCheck and inverted at the call site so
  per-model gRPC probing is on by default. The probe (now made
  consecutive-miss aware in a follow-up commit) independently health-
  checks each model's gRPC address and removes stale node_models rows
  when the backend has crashed even though the worker's node-level
  heartbeat is still arriving.

  Migration: the field had no CLI flag, env var binding, or YAML key
  in tree (only the bare struct field), so there is no user-facing
  migration. Anything constructing DistributedConfig in code needs to
  drop the assignment (default now does the right thing) or invert it.
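
A minimal sketch of why the inverted flag makes probing on by default, assuming a trimmed-down stand-in for the config struct. Only the field name DisablePerModelHealthCheck comes from the message above; the struct shape and the perModelProbeEnabled helper are hypothetical and are not the actual call site.

```go
package main

import "fmt"

// DistributedConfig is a hypothetical, trimmed-down stand-in for the real
// struct. Only the DisablePerModelHealthCheck field name is taken from the
// commit message.
type DistributedConfig struct {
	DisablePerModelHealthCheck bool
}

// perModelProbeEnabled is a hypothetical helper illustrating the inverted
// call site: the zero value of the bool (false) now means "probe enabled".
func perModelProbeEnabled(cfg DistributedConfig) bool {
	return !cfg.DisablePerModelHealthCheck
}

func main() {
	// Code that never sets the field gets per-model probing by default.
	var defaults DistributedConfig
	fmt.Println(perModelProbeEnabled(defaults)) // true

	// Code that previously set PerModelHealthCheck: false must now opt out
	// explicitly with the inverted field.
	optOut := DistributedConfig{DisablePerModelHealthCheck: true}
	fmt.Println(perModelProbeEnabled(optOut)) // false
}
```

Naming the field after the opt-out lets Go's zero value carry the desired default, so callers that construct the struct without touching the field pick up the new behavior.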

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): require consecutive misses before per-model probe removes a row

The per-model gRPC probe used to remove a node_models row on a single
failed health check. With the per-model probe now on by default, that
made any 5-second gRPC blip (network jitter, a long-running request
hogging the worker's gRPC server thread, brief GC pause) trigger a
full reload of the affected model — too eager for production.

Require perModelMissThreshold (3) consecutive failed probes before
removal. At the default 15s tick (3 x 15s) a model must be unreachable
for ~45s before it is reaped; a single successful probe in between
resets the streak.
Per-(node, model, replica) state tracked under a mutex on the monitor.

If the removal call itself fails, the miss counter is left in place
so the next tick retries rather than starting the streak over.

Tests:
- removes stale model via per-model health check after consecutive
  failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a
  success (covers the reset-on-success path and verifies the counter
  reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct
  test helpers don't nil-map-panic in the probe path

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:57:50 +02:00
gen_inference_defaults feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092) 2026-03-22 00:57:15 +01:00
meta feat(ui): Interactive model config editor with autocomplete (#9149) 2026-04-07 14:42:23 +02:00
application_config.go feat(branding): admin-configurable instance name, tagline, and assets (#9635) 2026-05-02 15:51:36 +02:00
application_config_test.go feat: backend versioning, upgrade detection and auto-upgrade (#9315) 2026-04-11 22:31:15 +02:00
backend_capabilities.go feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801) 2026-05-13 21:57:27 +02:00
backend_capabilities_test.go feat(gallery): Speed up load times and clean gallery entries (#9211) 2026-05-06 14:51:38 +02:00
backend_hooks.go feat(vllm): parity with llama.cpp backend (#9328) 2026-04-13 11:00:29 +02:00
config_suite_test.go dependencies(grpcio): bump to fix CI issues (#2362) 2024-05-21 14:33:47 +02:00
distributed_config.go fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status (#9754) 2026-05-13 21:57:50 +02:00
gallery.go refactor: gallery inconsistencies (#2647) 2024-06-24 17:32:12 +02:00
gguf.go Respect explicit reasoning config during GGUF thinking probe (#9463) 2026-04-21 21:53:10 +02:00
gguf_reasoning_test.go Respect explicit reasoning config during GGUF thinking probe (#9463) 2026-04-21 21:53:10 +02:00
hooks_llamacpp.go feat(vllm): parity with llama.cpp backend (#9328) 2026-04-13 11:00:29 +02:00
hooks_test.go feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563) 2026-04-29 00:49:28 +02:00
hooks_vllm.go feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563) 2026-04-29 00:49:28 +02:00
inference_defaults.go feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092) 2026-03-22 00:57:15 +01:00
inference_defaults.json chore: bump inference defaults from unsloth (#9396) 2026-04-17 09:05:55 +02:00
inference_defaults_test.go feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092) 2026-03-22 00:57:15 +01:00
model_config.go feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801) 2026-05-13 21:57:27 +02:00
model_config_filter.go feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
model_config_loader.go feat(concurrency-groups): per-model exclusive groups for backend loading (#9662) 2026-05-05 08:42:50 +02:00
model_config_loader_test.go feat(concurrency-groups): per-model exclusive groups for backend loading (#9662) 2026-05-05 08:42:50 +02:00
model_config_test.go feat(concurrency-groups): per-model exclusive groups for backend loading (#9662) 2026-05-05 08:42:50 +02:00
model_test.go fix(tests): inline model_test fixtures after tests/models_fixtures removal 2026-04-28 12:58:49 +00:00
parser_defaults.json feat(vllm): parity with llama.cpp backend (#9328) 2026-04-13 11:00:29 +02:00
runtime_settings.go feat(branding): admin-configurable instance name, tagline, and assets (#9635) 2026-05-02 15:51:36 +02:00
runtime_settings_persist.go feat(branding): admin-configurable instance name, tagline, and assets (#9635) 2026-05-02 15:51:36 +02:00
runtime_settings_persist_test.go feat(branding): admin-configurable instance name, tagline, and assets (#9635) 2026-05-02 15:51:36 +02:00