LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00

LocalAI [bot] 959de86761 feat(llama-cpp): make server-side prompt cache work by default (#9925)		2026-05-21 16:31:48 +02:00

Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) skip prefill on subsequent calls without any YAML changes. Reported in #9921.

Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4) when slot count is auto, which unlocks `cache_idle_slots`. LocalAI hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`, which silently force-disables idle-slot saving at server init. The host prompt cache was allocated but never written across requests.

Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache, checkpoint_every_nt / checkpoint_every_n_tokens

Docs:
- features/text-generation.md: fix misleading `cache_ram` description (it's the host-side prompt cache, not the KV cache), document the kv_unified + cache_ram + cache_idle_slots interaction, add rows for the two newly-exposed options, and add a worked example for the agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path` / `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the llama-cpp gRPC backend (they target upstream's CLI completion tool and are not consumed by grpc-server.cpp) and point readers at the new prompt-cache explainer.

Closes #9921

Assisted-by: claude:opus-4.7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
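
The defaults and option aliases listed in the commit message can be illustrated with a small, self-contained C++ sketch. This is not the code from grpc-server.cpp: the `PromptCacheParams` struct, the `parse_bool`, `apply_option`, and `parse_options` helpers, and the accepted boolean spellings are all hypothetical names and choices made for illustration. Only the field names, the default values, the `key:value` form (as in `kv_unified:false`), and the alias pairs come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>
#include <vector>

// Hypothetical stand-in for the subset of server params named in the commit.
// Defaults mirror the values the commit message says are now set.
struct PromptCacheParams {
    bool         kv_unified          = true;  // was false before the change
    std::int32_t n_ctx_checkpoints   = 32;    // was 8 before the change
    bool         cache_idle_slots    = true;  // set explicitly
    std::int32_t checkpoint_every_nt = 8192;  // set explicitly
};

// Assumed boolean spellings; unrecognized values keep the current setting.
bool parse_bool(const std::string & value, bool fallback) {
    if (value == "true" || value == "1" || value == "yes" || value == "on") {
        return true;
    }
    if (value == "false" || value == "0" || value == "no" || value == "off") {
        return false;
    }
    return fallback;
}

// Apply one "key:value" option string; unknown keys and malformed entries
// are ignored so that the defaults stay in effect.
void apply_option(PromptCacheParams & params, const std::string & option) {
    const std::size_t sep = option.find(':');
    if (sep == std::string::npos) {
        return;
    }
    const std::string key   = option.substr(0, sep);
    const std::string value = option.substr(sep + 1);

    if (key == "kv_unified") {
        params.kv_unified = parse_bool(value, params.kv_unified);
    } else if (key == "cache_idle_slots" || key == "idle_slots_cache") {
        params.cache_idle_slots = parse_bool(value, params.cache_idle_slots);
    } else if (key == "checkpoint_every_nt" || key == "checkpoint_every_n_tokens") {
        try {
            params.checkpoint_every_nt = std::stoi(value);
        } catch (const std::exception &) {
            // Non-numeric or out-of-range value: keep the previous setting.
        }
    }
}

// Start from the defaults and apply each option in order.
PromptCacheParams parse_options(const std::vector<std::string> & options) {
    PromptCacheParams params;
    for (const std::string & option : options) {
        apply_option(params, option);
    }
    return params;
}
```

With an empty option list the sketch returns the new defaults, which corresponds to the "without any YAML changes" case; passing `kv_unified:false` models the opt-out described in the commit message.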
CMakeLists.txt	fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413)	2026-04-18 20:30:28 +02:00
grpc-server.cpp	feat(llama-cpp): make server-side prompt cache work by default (#9925)	2026-05-21 16:31:48 +02:00
Makefile	chore: ⬆️ Update ggml-org/llama.cpp to `ad277572619fcfb6ddd38f4c6437283a4b2b8636` (#9915)	2026-05-21 09:07:31 +02:00
package.sh	fix(llama.cpp): bundle libdl, librt, libpthread in llama-cpp backend (#9099)	2026-03-22 00:58:14 +01:00
prepare.sh	chore: ⬆️ Update ggml-org/llama.cpp to `7f8ef50cce40e3e7e4526a3696cb45658190e69a` (#7402)	2025-12-01 07:50:40 +01:00
run.sh	feat(rocm): bump to 7.x (#9323)	2026-04-12 08:51:30 +02:00