LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00


Ettore Di Giacinto cd6079b2f3 (2026-04-24 12:52:53 +00:00)

feat(backend): add buun-llama-cpp fork (DFlash + TCQ KV-cache)

spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds two independent features on top of it: DFlash block-diffusion speculative decoding (via a dedicated DFlashDraftModel GGUF architecture) and two extra TCQ KV-cache variants (turbo2_tcq, turbo3_tcq) in addition to TurboQuant's turbo2/turbo3/turbo4.

The backend follows the turboquant thin-wrapper pattern: it reuses the backend/cpp/llama-cpp grpc-server sources verbatim and patches only the build copy, to extend the KV allow-list and to wire up the buun-exclusive tree_budget / draft_topk options.

DraftModel is already wired end-to-end (proto field 39 → params.speculative), so activating DFlash only needs the existing options passthrough (spec_type:dflash) plus the drafter path in draft_model.

CacheTypeOptions now surfaces the five turbo* values, so the React UI dropdown shows them. This benefits turboquant too: previously users had to type these values into the YAML manually.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
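A minimal Python sketch of the two mechanisms this commit relies on, assuming backend options arrive as "key:value" strings: pulling spec_type, tree_budget and draft_topk out of the option list, and checking a KV-cache type against an allow-list extended with the five turbo* values. The helper names (parse_options, validate_cache_type, speculative_settings), the treatment of a bare key as "true", and parsing tree_budget/draft_topk as integers are assumptions for illustration only; the real backend does this inside the C++ grpc-server, not in Python.

```python
# Hypothetical sketch, not LocalAI's actual code: option passthrough plus an
# extended KV-cache allow-list, mirroring what the commit message describes.

# Cache types accepted by upstream llama.cpp for the K/V cache.
UPSTREAM_KV_TYPES = {"f32", "f16", "bf16", "q8_0", "q4_0", "q4_1", "iq4_nl", "q5_0", "q5_1"}
# TurboQuant's three types plus buun-llama-cpp's two TCQ variants.
TURBO_KV_TYPES = {"turbo2", "turbo3", "turbo4", "turbo2_tcq", "turbo3_tcq"}
ALLOWED_KV_TYPES = UPSTREAM_KV_TYPES | TURBO_KV_TYPES


def parse_options(options):
    """Split "key:value" strings into a dict; a bare "key" maps to "true" (assumption)."""
    parsed = {}
    for opt in options:
        key, sep, value = opt.partition(":")
        parsed[key.strip()] = value.strip() if sep else "true"
    return parsed


def validate_cache_type(cache_type):
    """Reject any KV-cache type that is not on the extended allow-list."""
    if cache_type not in ALLOWED_KV_TYPES:
        raise ValueError(f"unsupported KV cache type: {cache_type!r}")
    return cache_type


def speculative_settings(options, draft_model):
    """Collect DFlash-related settings; tree_budget/draft_topk assumed to be integers."""
    opts = parse_options(options)
    settings = {"spec_type": opts.get("spec_type"), "draft_model": draft_model}
    if settings["spec_type"] == "dflash" and not draft_model:
        # DFlash needs a drafter GGUF supplied through draft_model.
        raise ValueError("spec_type:dflash requires a drafter path in draft_model")
    for key in ("tree_budget", "draft_topk"):
        if key in opts:
            settings[key] = int(opts[key])
    return settings
```

For example, with a hypothetical drafter file, speculative_settings(["spec_type:dflash", "tree_budget:64"], "drafter.gguf") returns {'spec_type': 'dflash', 'draft_model': 'drafter.gguf', 'tree_budget': 64}. Keeping the turbo* values in one set that is merged with the upstream list mirrors the "extend the allow-list" approach: the upstream types stay untouched and the fork-specific ones are added alongside.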
buun-llama-cpp	feat(backend): add buun-llama-cpp fork (DFlash + TCQ KV-cache)	2026-04-24 12:52:53 +00:00
grpc	fix: speedup `git submodule update` with `--single-branch` (#2847)	2024-07-13 22:32:25 +02:00
ik-llama-cpp	fix(ik-llama-cpp): patch clip.cpp for new ggml_quantize_chunk signature (#9531)	2026-04-24 13:07:26 +02:00
llama-cpp	chore: ⬆️ Update ggml-org/llama.cpp to `187a45637054881ecacf17f8e2f6f8f2ba7df1c7` (#9520)	2026-04-24 09:17:06 +02:00
turboquant	fix(turboquant): drop ignore-eos patch, bump fork to b8967-627ebbc (#9423)	2026-04-19 21:05:21 +02:00