LocalAI/backend/cpp/turboquant/apply-patches.sh

#!/bin/bash
# Apply the turboquant patch series to a cloned llama-cpp-turboquant checkout.
#
# The turboquant fork branched from upstream llama.cpp before a few API changes
# that the shared backend/cpp/llama-cpp/grpc-server.cpp depends on. We carry
# those upstream commits as patch files under backend/cpp/turboquant/patches/
# and apply them here so the reused grpc-server source compiles against the
# fork unmodified.
#
# Drop the corresponding patch from patches/ whenever the fork catches up with
# upstream — the build will fail fast if a patch stops applying, which is the
# signal to retire it.

set -euo pipefail

if [[ $# -ne 2 ]]; then
    echo "usage: $0 <llama.cpp-src-dir> <patches-dir>" >&2
    exit 2
fi

SRC_DIR=$1
PATCHES_DIR=$2

if [[ ! -d "$SRC_DIR" ]]; then
    echo "source dir does not exist: $SRC_DIR" >&2
    exit 2
fi

if [[ ! -d "$PATCHES_DIR" ]]; then
    echo "no patches dir at $PATCHES_DIR, nothing to apply"
    exit 0
fi

shopt -s nullglob
patches=("$PATCHES_DIR"/*.patch)
shopt -u nullglob

if [[ ${#patches[@]} -eq 0 ]]; then
    echo "no .patch files in $PATCHES_DIR, nothing to apply"
    exit 0
fi

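# Resolve the patches dir to an absolute path before changing directory, so a
# relative <patches-dir> argument still points at the right files after the cd.
patches_dir_abs=$(cd "$PATCHES_DIR" && pwd)

cd "$SRC_DIR"

for patch in "${patches[@]}"; do
    patch_abs="$patches_dir_abs/$(basename "$patch")"
    echo "==> applying $patch_abs"
    git apply --verbose "$patch_abs"
done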

echo "all turboquant patches applied successfully"
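
The loop above stops at the first patch that fails, which can leave the checkout partially patched. A minimal sketch of a stricter variant follows, assuming you want each patch verified with `git apply --check` before it is applied. `check_then_apply` and the throwaway demo repository are hypothetical and not part of apply-patches.sh. Patch paths must be absolute, because `git -C` changes the working directory.

```shell
#!/bin/bash
# Sketch only: check_then_apply is a hypothetical helper, not part of apply-patches.sh.
set -euo pipefail

# Apply patches in order, verifying each one with `git apply --check` first.
# $1 is the source checkout; the remaining arguments are absolute patch paths.
check_then_apply() {
    local src_dir=$1
    shift
    local patch
    for patch in "$@"; do
        if ! git -C "$src_dir" apply --check "$patch"; then
            echo "patch no longer applies, consider retiring it: $patch" >&2
            return 1
        fi
        git -C "$src_dir" apply "$patch"
    done
}

# Demo in a throwaway repository (all paths below are illustrative).
demo=$(mktemp -d)
git init -q "$demo/src"
mkdir "$demo/patches"

# Stage an original file, then record a change to it as a patch.
printf 'old\n' > "$demo/src/file.txt"
git -C "$demo/src" add file.txt
printf 'new\n' > "$demo/src/file.txt"
git -C "$demo/src" diff > "$demo/patches/0001-demo.patch"

# Restore the staged version so the patch has something to apply to.
git -C "$demo/src" checkout -q -- file.txt

check_then_apply "$demo/src" "$demo/patches/0001-demo.patch"
```

Running the helper a second time on the same checkout fails at the `--check` step, which is the same "patch stopped applying" signal the script's header comment describes.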

feat(backend): add turboquant llama.cpp-fork backend (#9355)

* feat(backend): add turboquant llama.cpp-fork backend

  turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
  feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme. It
  ships as a first-class backend reusing backend/cpp/llama-cpp sources via a
  thin wrapper Makefile: each variant target copies ../llama-cpp into a
  sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server with
  LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. grpc-server.cpp
  is not duplicated, so upstream fixes flow through automatically.

  Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm,
  SYCL f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
  adds a tests-turboquant-grpc e2e job driven by
  BACKEND_TEST_CACHE_TYPE_K/V=q8_0 to exercise the KV-cache config path
  (backend_test.go gains dedicated env vars wired into
  ModelOptions.CacheTypeKey/Value, a generic improvement usable by any
  llama.cpp-family backend), and registers a nightly auto-bump PR in
  bump_deps.yaml tracking feature/turboquant-kv-cache.

  scripts/changed-backends.js gets a special case so edits to
  backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since the
  wrapper reuses those sources.

* feat(turboquant): carry upstream patches against fork API drift

  turboquant branched from llama.cpp before upstream commit 66060008
  ("server: respect the ignore eos flag", #21203), which added the
  `logit_bias_eog` field to `server_context_meta` and a matching parameter to
  `server_task::params_from_json_cmpl`. The shared
  backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so building it
  against the fork unmodified fails.

  Cherry-pick that commit as a patch file under
  backend/cpp/turboquant/patches/ and apply it to the cloned fork sources via
  a new apply-patches.sh hook called from the wrapper Makefile. This also
  simplifies the build flow: instead of hopping through llama-cpp's
  build-llama-cpp-grpc-server indirection, the wrapper now drives the copied
  Makefile directly (clone -> patch -> build).

  Drop the corresponding patch whenever the fork catches up with upstream.
  The build fails fast if a patch stops applying, which is the signal to
  retire it.

* docs: add turboquant backend section + clarify cache_type_k/v

  Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
  backend alongside the existing llama-cpp / ik-llama-cpp sections in
  features/text-generation.md: when to pick it, how to install it from the
  gallery, and a YAML example showing backend: turboquant together with
  cache_type_k / cache_type_v.

  Also expand the cache_type_k / cache_type_v table rows in
  advanced/model-configuration.md to spell out the accepted llama.cpp
  quantization values and note that these fields apply to all
  llama.cpp-family backends, not just vLLM.

* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion

  The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
  ggml/include/ggml-rpc.h static-asserts it equals 96, breaking the
  GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).

  Carry a one-line patch that updates the expected count so the assertion
  holds. Drop this patch whenever the fork fixes it upstream.

* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e

  The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own allow-list
  of accepted KV-cache types (kv_cache_types[]) and rejects anything outside
  it before the value reaches llama.cpp's parser. That list only contains the
  standard llama.cpp types, so turbo2/turbo3/turbo4 would throw
  "Unsupported cache type" at LoadModel time, meaning nothing the LocalAI
  gRPC layer accepted was actually fork-specific.

  Add a build-time augmentation step (patch-grpc-server.sh, called from the
  turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0 into
  the allow-list of the copied grpc-server.cpp under
  turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
  is never touched, so the stock llama-cpp build keeps compiling against
  vanilla upstream, which has no notion of those enum values.

  Switch test-extra-backend-turboquant to set BACKEND_TEST_CACHE_TYPE_K=turbo3
  / _V=turbo3 so the e2e gRPC suite actually runs the fork's TurboQuant
  KV-cache code paths (turbo3 also auto-enables flash_attention in the fork).
  Picking q8_0 here would only re-test the standard llama.cpp path that the
  upstream llama-cpp backend already covers.

  Refresh the docs (text-generation.md + model-configuration.md) to list
  turbo2/turbo3/turbo4 explicitly and call out that you only get the
  TurboQuant code path with this backend + a turbo* cache type.

* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3

  The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant) does not
  install python3, so the python-based augmentation step errored with
  `python3: command not found` at make time. Switch to awk, which is already
  available everywhere the rest of the wrapper Makefile runs.

* Apply suggestion from @mudler

  Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

2026-04-14 23:25:04 +00:00