LocalAI/.github/workflows
Ettore Di Giacinto 95efb8a562
feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend

turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
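
For illustration, each variant target reduces to roughly the following
(the cuda12 flavor name and GitHub URL are illustrative, and
LLAMA_VERSION may pin a commit on the branch rather than the branch
name itself):

    # sketch of one variant recipe in the wrapper Makefile
    rm -rf turboquant-cuda12-build
    cp -rf ../llama-cpp turboquant-cuda12-build
    make -C turboquant-cuda12-build build-llama-cpp-grpc-server \
        LLAMA_REPO=https://github.com/TheTom/llama-cpp-turboquant \
        LLAMA_VERSION=feature/turboquant-kv-cache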

Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.
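
Concretely, the job just exports the two variables before running the
suite; a minimal sketch (the env var names are the real ones, the test
invocation itself is illustrative):

    # values land in ModelOptions.CacheTypeKey/Value via backend_test.go
    BACKEND_TEST_CACHE_TYPE_K=q8_0 \
    BACKEND_TEST_CACHE_TYPE_V=q8_0 \
        go test -v -run GRPC ./backend   # hypothetical package/selector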

scripts/changed-backends.js gets a special case so that edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.

* feat(turboquant): carry upstream patches against fork API drift

turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.

Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. This also simplifies the build flow: instead of hopping
through llama-cpp's build-llama-cpp-grpc-server indirection, the
wrapper now drives the copied Makefile directly (clone -> patch -> build).

Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
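
A minimal sketch of the hook's shape (the real script may differ in
naming and error reporting):

    #!/bin/sh
    # apply-patches.sh: apply every carried patch to the cloned fork
    # sources; set -e aborts the build on the first patch that fails
    set -e
    LLAMA_DIR=${1:?usage: apply-patches.sh <fork-checkout>}
    for p in "$(dirname "$0")"/patches/*.patch; do
        echo "applying $(basename "$p")"
        patch -d "$LLAMA_DIR" -p1 --forward < "$p"
    done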

* docs: add turboquant backend section + clarify cache_type_k/v

Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.
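
The documented example boils down to a model file along these lines
(all names are placeholders, and the top-level placement of the cache
fields is an assumption based on the model-configuration docs):

    # sketch: install a model config that selects the new backend
    cat > models/turboquant-example.yaml <<'EOF'
    name: turboquant-example
    backend: turboquant
    parameters:
      model: my-model.Q4_K_M.gguf
    cache_type_k: q8_0
    cache_type_v: q8_0
    EOF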

Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.

* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion

The fork adds new GGML ops, bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.
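
The patch is a single hunk along these lines (hunk header elided, and
the assertion's message text is paraphrased, not the exact upstream
line):

    --- a/ggml/include/ggml-rpc.h
    +++ b/ggml/include/ggml-rpc.h
    -static_assert(GGML_OP_COUNT == 96, "update ggml-rpc after adding ops");
    +static_assert(GGML_OP_COUNT == 97, "update ggml-rpc after adding ops");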

* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e

The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types, so turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time; in other words,
the fork's own cache types could never be selected through LocalAI at all.

Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.

Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.

Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.

* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3

The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in the Ubuntu base image (as mawk) and is already
available everywhere the rest of the wrapper Makefile runs.
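
The awk version amounts to splicing new entries in right after the
array opener; a sketch, assuming the opener sits on a single line (the
real script and the exact shape of the declaration may differ):

    # patch-grpc-server.sh sketch: $1 is the copied grpc-server.cpp
    f=$1
    awk '/kv_cache_types/ && /[{]/ {
        print                           # keep the opener line
        print "    GGML_TYPE_TURBO2_0," # fork-only enum values from
        print "    GGML_TYPE_TURBO3_0," # the TurboQuant branch
        print "    GGML_TYPE_TURBO4_0,"
        next
    }
    { print }' "$f" > "$f.tmp" && mv "$f.tmp" "$f"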

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-15 01:25:04 +02:00
disabled chore(ci): disable CI actions 2026-03-02 14:48:00 +01:00
backend.yml feat(backend): add turboquant llama.cpp-fork backend (#9355) 2026-04-15 01:25:04 +02:00
backend_build.yml chore(deps): bump docker/login-action from 3 to 4 (#8918) 2026-03-09 22:30:11 +01:00
backend_build_darwin.yml chore(deps): bump docker/metadata-action from 5 to 6 (#8917) 2026-03-09 22:27:02 +01:00
backend_pr.yml Change runner from macOS-14 to macos-latest 2025-12-13 10:11:27 +01:00
build-test.yaml chore(deps): bump actions/upload-artifact from 6 to 7 (#8730) 2026-03-02 21:43:39 +01:00
bump-inference-defaults.yml chore(deps): bump peter-evans/create-pull-request from 7 to 8 (#9114) 2026-03-24 08:50:50 +01:00
bump_deps.yaml feat(backend): add turboquant llama.cpp-fork backend (#9355) 2026-04-15 01:25:04 +02:00
bump_docs.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
checksum_checker.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
deploy-explorer.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
gallery-agent.yaml fix(ci): small fixups 2026-04-14 09:27:27 +00:00
generate_grpc_cache.yaml chore(deps): bump docker/build-push-action from 6 to 7 (#8919) 2026-03-09 22:29:51 +01:00
generate_intel_image.yaml chore(deps): bump docker/login-action from 3 to 4 (#8918) 2026-03-09 22:30:11 +01:00
gh-pages.yml chore(deps): bump actions/upload-pages-artifact from 4 to 5 (#9337) 2026-04-13 21:53:47 +02:00
image-pr.yml feat(rocm): bump to 7.x (#9323) 2026-04-12 08:51:30 +02:00
image.yml feat(rocm): bump to 7.x (#9323) 2026-04-12 08:51:30 +02:00
image_build.yml chore: drop AIO images (#9004) 2026-03-14 17:49:36 +01:00
notify-releases.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
release.yaml chore(deps): bump softprops/action-gh-release from 2 to 3 (#9336) 2026-04-13 21:53:28 +02:00
secscan.yaml Revert "chore(deps): bump securego/gosec from 2.22.9 to 2.22.11" (#7789) 2025-12-30 09:58:13 +01:00
stalebot.yml chore(deps): bump actions/stale from 10.1.1 to 10.2.0 (#8633) 2026-02-23 23:27:20 +01:00
test-extra.yml feat(backend): add turboquant llama.cpp-fork backend (#9355) 2026-04-15 01:25:04 +02:00
test.yml feat: add distributed mode (#9124) 2026-03-30 00:47:27 +02:00
tests-e2e.yml feat(realtime): WebRTC support (#8790) 2026-03-13 21:37:15 +01:00
tests-ui-e2e.yml chore(deps): bump actions/upload-artifact from 4 to 7 (#9030) 2026-03-17 11:42:49 +01:00
update_swagger.yaml fix(api)!: Stop model prior to deletion (#8422) 2026-02-06 09:22:10 +01:00
yaml-check.yml chore(backend gallery): add description for remaining backends (#5679) 2025-06-17 22:21:44 +02:00