LocalAI/backend/cpp
LocalAI [bot] 959de86761
feat(llama-cpp): make server-side prompt cache work by default (#9925)
Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt
cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible
CLIs, coding assistants) skip prefill on subsequent calls without any
YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4)
when slot count is auto, which unlocks `cache_idle_slots`. LocalAI
hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`,
which silently force-disables idle-slot saving at server init. The host
prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from
  the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache,
  checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description
  (it's the host-side prompt cache, not the KV cache), document the
  kv_unified + cache_ram + cache_idle_slots interaction, add rows for
  the two newly-exposed options, and add a worked example for the
  agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path`
  / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the
  llama-cpp gRPC backend (they target upstream's CLI completion tool
  and are not consumed by grpc-server.cpp) and point readers at the
  new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:31:48 +02:00
..
ds4 chore: ⬆️ Update antirez/ds4 to 2606543be7a8c125a32cee37f5d1d85dc78f2fcf (#9909) 2026-05-20 21:22:26 +00:00
grpc fix: speedup git submodule update with --single-branch (#2847) 2024-07-13 22:32:25 +02:00
ik-llama-cpp chore: ⬆️ Update ikawrakow/ik_llama.cpp to 11a1fea9e291f12ce2c803a9d7812c30ca806bcf (#9914) 2026-05-20 22:04:06 +00:00
llama-cpp feat(llama-cpp): make server-side prompt cache work by default (#9925) 2026-05-21 16:31:48 +02:00
turboquant chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 (#9740) 2026-05-13 21:58:33 +02:00