LocalAI/.github/workflows
Ettore Di Giacinto 95efb8a562
feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend

turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
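
For illustration, each variant target reduces to roughly the following
(the cuda12 flavor name and GitHub URL are illustrative, and
LLAMA_VERSION may pin a commit on the branch rather than the branch
name itself):

    # sketch of one variant recipe in the wrapper Makefile
    rm -rf turboquant-cuda12-build
    cp -rf ../llama-cpp turboquant-cuda12-build
    make -C turboquant-cuda12-build build-llama-cpp-grpc-server \
        LLAMA_REPO=https://github.com/TheTom/llama-cpp-turboquant \
        LLAMA_VERSION=feature/turboquant-kv-cache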

Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.
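
Concretely, the job just exports the two variables before running the
suite; a minimal sketch (the env var names are the real ones, the test
invocation itself is illustrative):

    # values land in ModelOptions.CacheTypeKey/Value via backend_test.go
    BACKEND_TEST_CACHE_TYPE_K=q8_0 \
    BACKEND_TEST_CACHE_TYPE_V=q8_0 \
        go test -v -run GRPC ./backend   # hypothetical package/selector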

scripts/changed-backends.js gets a special case so that edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.

* feat(turboquant): carry upstream patches against fork API drift

turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.

Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. This also simplifies the build flow: instead of hopping
through llama-cpp's build-llama-cpp-grpc-server indirection, the
wrapper now drives the copied Makefile directly (clone -> patch -> build).

Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
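
A minimal sketch of the hook's shape (the real script may differ in
naming and error reporting):

    #!/bin/sh
    # apply-patches.sh: apply every carried patch to the cloned fork
    # sources; set -e aborts the build on the first patch that fails
    set -e
    LLAMA_DIR=${1:?usage: apply-patches.sh <fork-checkout>}
    for p in "$(dirname "$0")"/patches/*.patch; do
        echo "applying $(basename "$p")"
        patch -d "$LLAMA_DIR" -p1 --forward < "$p"
    done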

* docs: add turboquant backend section + clarify cache_type_k/v

Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.
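
The documented example boils down to a model file along these lines
(all names are placeholders, and the top-level placement of the cache
fields is an assumption based on the model-configuration docs):

    # sketch: install a model config that selects the new backend
    cat > models/turboquant-example.yaml <<'EOF'
    name: turboquant-example
    backend: turboquant
    parameters:
      model: my-model.Q4_K_M.gguf
    cache_type_k: q8_0
    cache_type_v: q8_0
    EOF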

Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.

* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion

The fork adds new GGML ops, bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.
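
The patch is a single hunk along these lines (hunk header elided, and
the assertion's message text is paraphrased, not the exact upstream
line):

    --- a/ggml/include/ggml-rpc.h
    +++ b/ggml/include/ggml-rpc.h
    -static_assert(GGML_OP_COUNT == 96, "update ggml-rpc after adding ops");
    +static_assert(GGML_OP_COUNT == 97, "update ggml-rpc after adding ops");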

* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e

The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types, so turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time; in other words,
the fork's own cache types could never be selected through LocalAI at all.

Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.

Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.

Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.

* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3

The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in the Ubuntu base image (as mawk) and is already
available everywhere the rest of the wrapper Makefile runs.
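
The awk version amounts to splicing new entries in right after the
array opener; a sketch, assuming the opener sits on a single line (the
real script and the exact shape of the declaration may differ):

    # patch-grpc-server.sh sketch: $1 is the copied grpc-server.cpp
    f=$1
    awk '/kv_cache_types/ && /[{]/ {
        print                           # keep the opener line
        print "    GGML_TYPE_TURBO2_0," # fork-only enum values from
        print "    GGML_TYPE_TURBO3_0," # the TurboQuant branch
        print "    GGML_TYPE_TURBO4_0,"
        next
    }
    { print }' "$f" > "$f.tmp" && mv "$f.tmp" "$f"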

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-15 01:25:04 +02:00
disabled chore(ci): disable CI actions 2026-03-02 14:48:00 +01:00
backend.yml feat(backend): add turboquant llama.cpp-fork backend (#9355) 2026-04-15 01:25:04 +02:00
backend_build.yml chore(deps): bump docker/login-action from 3 to 4 (#8918) 2026-03-09 22:30:11 +01:00
backend_build_darwin.yml chore(deps): bump docker/metadata-action from 5 to 6 (#8917) 2026-03-09 22:27:02 +01:00
backend_pr.yml Change runner from macOS-14 to macos-latest 2025-12-13 10:11:27 +01:00
build-test.yaml chore(deps): bump actions/upload-artifact from 6 to 7 (#8730) 2026-03-02 21:43:39 +01:00
bump-inference-defaults.yml chore(deps): bump peter-evans/create-pull-request from 7 to 8 (#9114) 2026-03-24 08:50:50 +01:00
bump_deps.yaml feat(backend): add turboquant llama.cpp-fork backend (#9355) 2026-04-15 01:25:04 +02:00
bump_docs.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
checksum_checker.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
deploy-explorer.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
gallery-agent.yaml fix(ci): small fixups 2026-04-14 09:27:27 +00:00
generate_grpc_cache.yaml chore(deps): bump docker/build-push-action from 6 to 7 (#8919) 2026-03-09 22:29:51 +01:00
generate_intel_image.yaml chore(deps): bump docker/login-action from 3 to 4 (#8918) 2026-03-09 22:30:11 +01:00
gh-pages.yml chore(deps): bump actions/upload-pages-artifact from 4 to 5 (#9337) 2026-04-13 21:53:47 +02:00
image-pr.yml feat(rocm): bump to 7.x (#9323) 2026-04-12 08:51:30 +02:00
image.yml feat(rocm): bump to 7.x (#9323) 2026-04-12 08:51:30 +02:00
image_build.yml chore: drop AIO images (#9004) 2026-03-14 17:49:36 +01:00
notify-releases.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
release.yaml chore(deps): bump softprops/action-gh-release from 2 to 3 (#9336) 2026-04-13 21:53:28 +02:00
secscan.yaml Revert "chore(deps): bump securego/gosec from 2.22.9 to 2.22.11" (#7789) 2025-12-30 09:58:13 +01:00
stalebot.yml chore(deps): bump actions/stale from 10.1.1 to 10.2.0 (#8633) 2026-02-23 23:27:20 +01:00
test-extra.yml feat(backend): add turboquant llama.cpp-fork backend (#9355) 2026-04-15 01:25:04 +02:00
test.yml feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
tests-e2e.yml feat(realtime): WebRTC support (#8790) 2026-03-13 21:37:15 +01:00
tests-ui-e2e.yml chore(deps): bump actions/upload-artifact from 4 to 7 (#9030) 2026-03-17 11:42:49 +01:00
update_swagger.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
yaml-check.yml chore(backend gallery): add description for remaining backends (#5679) 2025-06-17 22:21:44 +02:00