Mirror of https://github.com/mudler/LocalAI, synced 2026-05-24 09:28:23 +00:00

spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds two independent features on top: DFlash block-diffusion speculative decoding (via a dedicated DFlashDraftModel GGUF arch) and two extra TCQ KV-cache variants (turbo2_tcq, turbo3_tcq) on top of TurboQuant's turbo2/turbo3/turbo4.

Follows the turboquant thin-wrapper pattern: reuses backend/cpp/llama-cpp grpc-server sources verbatim, patches only the build copy to extend the KV allow-list and wire up buun-exclusive tree_budget / draft_topk options.

DraftModel is already wired end-to-end (proto field 39 → params.speculative), so DFlash activation only needs the existing options passthrough (spec_type:dflash) plus the drafter path in draft_model.

CacheTypeOptions now surfaces the five turbo* values so the React UI dropdown shows them. This benefits turboquant too (previously users had to type them in YAML manually).

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

| Name | | |
|---|---|---|
| gen_inference_defaults | ||
| meta | ||
| application_config.go | ||
| application_config_test.go | ||
| backend_hooks.go | ||
| config_suite_test.go | ||
| distributed_config.go | ||
| gallery.go | ||
| gguf.go | ||
| gguf_reasoning_test.go | ||
| hooks_llamacpp.go | ||
| hooks_test.go | ||
| hooks_vllm.go | ||
| inference_defaults.go | ||
| inference_defaults.json | ||
| inference_defaults_test.go | ||
| model_config.go | ||
| model_config_filter.go | ||
| model_config_loader.go | ||
| model_config_test.go | ||
| model_test.go | ||
| parser_defaults.json | ||
| runtime_settings.go | ||