Working on the ds4 Backend
antirez/ds4 is a single-model inference engine for DeepSeek V4 Flash.
LocalAI wraps the engine's C API (ds4/ds4.h) with a fresh C++ gRPC server at
backend/cpp/ds4/ - NOT a fork of llama-cpp's grpc-server.cpp.
Pin
backend/cpp/ds4/Makefile pins DS4_VERSION?=<sha> at the top. The ds4
target in the Makefile clones antirez/ds4 at that commit (mirroring the
llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
(.github/workflows/bump_deps.yaml) finds this pin via grep and opens a
daily PR to update it. To bump manually, edit the DS4_VERSION?= line,
then run make purge && make (or rely on CI's clean build).
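The pin lookup can be sketched as follows. This is a minimal illustration in Go, assuming the bot matches a line that starts with `DS4_VERSION?=`; the real bot is a shell script using grep, and `findPin` plus the sample Makefile text are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// findPin returns the value of a `NAME?=value` line that starts at column 0.
// Hypothetical helper: it only mirrors the anchored match described above.
func findPin(makefile, name string) (string, bool) {
	re := regexp.MustCompile(`(?m)^` + regexp.QuoteMeta(name) + `\?=(.*)$`)
	m := re.FindStringSubmatch(makefile)
	if m == nil {
		return "", false
	}
	return strings.TrimSpace(m[1]), true
}

func main() {
	// Placeholder Makefile content; the SHA and URL are not real pins.
	mk := "DS4_REPO?=https://example.invalid/ds4\nDS4_VERSION?=0123abc\n"
	if v, ok := findPin(mk, "DS4_VERSION"); ok {
		fmt.Println("pinned at", v)
	}
}
```

A pin written as `DS4_VERSION=...` (no `?`) or kept in a shell script would not match, which is why the pin has to live in the Makefile in exactly this form.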
Wire shape
| RPC | Implementation |
|---|---|
| Health, Free, Status | Trivial; no engine dependency for Health |
| LoadModel | ds4_engine_open + ds4_session_create; backend is compile-time (DS4_NO_GPU → CPU, APPLE → Metal, otherwise CUDA) |
| TokenizeString | ds4_tokenize_text |
| Predict | ds4_engine_generate_argmax + DsmlParser → one ChatDelta with content / reasoning_content / tool_calls[] |
| PredictStream | Same, per-token ChatDelta writes |
DSML
ds4 emits tool calls as literal text markers (<|DSML|tool_calls> etc.) -
NOT special tokens. dsml_parser.{h,cpp} is our streaming state machine that
classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
events. dsml_renderer.{h,cpp} does the prompt direction: turns
OpenAI tool_calls + role=tool messages back into DSML for the next turn.
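Because the markers are plain text, a marker can arrive split across several token chunks. The sketch below shows only that hold-back problem, in Go rather than the backend's C++; it is not the real DsmlParser. The `Splitter` type is hypothetical, only the `<|DSML|tool_calls>` marker is taken from above, and everything after the marker is dumped into one raw buffer instead of being split into TOOL_ARGS / TOOL_END events.

```go
package main

import (
	"fmt"
	"strings"
)

const toolStart = "<|DSML|tool_calls>"

// Splitter routes streamed text to Content until toolStart is seen,
// then to Tool. Hypothetical type for illustration only.
type Splitter struct {
	pending string // tail that may be the beginning of a marker
	inTool  bool
	Content strings.Builder
	Tool    strings.Builder
}

// heldBack returns the length of the longest suffix of s that is a
// proper prefix of marker; that suffix must not be emitted yet.
func heldBack(s, marker string) int {
	limit := len(marker) - 1
	if limit > len(s) {
		limit = len(s)
	}
	for n := limit; n > 0; n-- {
		if strings.HasPrefix(marker, s[len(s)-n:]) {
			return n
		}
	}
	return 0
}

// Feed consumes one streamed chunk.
func (sp *Splitter) Feed(chunk string) {
	if sp.inTool {
		sp.Tool.WriteString(chunk)
		return
	}
	buf := sp.pending + chunk
	sp.pending = ""
	if i := strings.Index(buf, toolStart); i >= 0 {
		sp.Content.WriteString(buf[:i])
		sp.inTool = true
		sp.Tool.WriteString(buf[i+len(toolStart):])
		return
	}
	n := heldBack(buf, toolStart)
	sp.Content.WriteString(buf[:len(buf)-n])
	sp.pending = buf[len(buf)-n:]
}

// Flush emits held-back text as content once the stream ends.
func (sp *Splitter) Flush() {
	if !sp.inTool {
		sp.Content.WriteString(sp.pending)
	}
	sp.pending = ""
}

func main() {
	sp := &Splitter{}
	for _, c := range []string{"Hello ", "<|DS", "ML|tool_", "calls>{\"name\""} {
		sp.Feed(c)
	}
	sp.Flush()
	fmt.Printf("content=%q tool=%q\n", sp.Content.String(), sp.Tool.String())
}
```

Holding back only the longest possible marker prefix keeps per-token streaming latency low: ordinary text such as `a < b` is released as soon as the following chunk rules the marker out.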
Thinking modes
PredictOptions.Metadata["enable_thinking"] gates thinking on/off (default ON).
PredictOptions.Metadata["reasoning_effort"] set to "max" or "xhigh" selects
DS4_THINK_MAX; any other value maps to DS4_THINK_HIGH. We pass the chosen mode
to ds4_chat_append_assistant_prefix.
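The selection rule reads as a small pure function. The Go sketch below restates it under two assumptions that the text above does not confirm: metadata values are strings, and thinking is disabled by the literal value "false". `ThinkOff` and `selectThinkMode` are hypothetical names; the real logic lives in the C++ backend.

```go
package main

import "fmt"

// ThinkMode names the mode handed to the engine.
type ThinkMode string

const (
	ThinkOff  ThinkMode = "off" // hypothetical label for "thinking disabled"
	ThinkHigh ThinkMode = "DS4_THINK_HIGH"
	ThinkMax  ThinkMode = "DS4_THINK_MAX"
)

// selectThinkMode applies the rule: thinking is on unless explicitly
// disabled; "max" and "xhigh" pick the max mode, anything else high.
func selectThinkMode(md map[string]string) ThinkMode {
	// Assumption: the flag is disabled by the string "false".
	if v, ok := md["enable_thinking"]; ok && v == "false" {
		return ThinkOff
	}
	switch md["reasoning_effort"] {
	case "max", "xhigh":
		return ThinkMax
	default:
		return ThinkHigh
	}
}

func main() {
	fmt.Println(selectThinkMode(map[string]string{"reasoning_effort": "xhigh"}))
	fmt.Println(selectThinkMode(nil))
}
```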
Disk KV cache
kv_cache.{h,cpp} implements an SHA1-keyed file cache using ds4's public
ds4_session_save_payload / ds4_session_load_payload API. Enable per request
via ModelOptions.Options[] = "kv_cache_dir:/some/path". Format is our own -
NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
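A rough Go sketch of the option parsing and key derivation follows. What exactly gets hashed and how cache files are named are not stated above, so the hashed byte slice and the `.kv` extension are assumptions, and `cacheDirFromOptions`, `cacheKey`, and `cachePath` are hypothetical helpers rather than the functions in kv_cache.{h,cpp}.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"path/filepath"
	"strings"
)

// cacheDirFromOptions scans "key:value" options for kv_cache_dir.
// Only the first colon splits, so paths containing ':' survive.
func cacheDirFromOptions(opts []string) (string, bool) {
	for _, o := range opts {
		key, val, found := strings.Cut(o, ":")
		if found && key == "kv_cache_dir" && val != "" {
			return val, true
		}
	}
	return "", false
}

// cacheKey is the hex SHA1 of whatever identifies the cached state.
// Assumption: the input is the rendered prompt prefix.
func cacheKey(prefix []byte) string {
	sum := sha1.Sum(prefix)
	return hex.EncodeToString(sum[:])
}

// cachePath joins the cache directory and key; ".kv" is a made-up extension.
func cachePath(dir string, prefix []byte) string {
	return filepath.Join(dir, cacheKey(prefix)+".kv")
}

func main() {
	opts := []string{"kv_cache_dir:/some/path"}
	if dir, ok := cacheDirFromOptions(opts); ok {
		fmt.Println(cachePath(dir, []byte("system prompt + first turn")))
	}
}
```

Keying by content hash means two requests that share the same prefix resolve to the same file without any index to maintain.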
Build matrix
| Build | Where | Notes |
|---|---|---|
| cpu-ds4 (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
| cuda13-ds4 (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
| ds4-darwin (arm64) | macOS GHA runners | Metal; uses scripts/build/ds4-darwin.sh like llama-cpp-darwin |
cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.
Hardware-gated validation
Run tests/e2e-backends/backend_test.go in BACKEND_BINARY mode:
BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
CI does not load the model; the suite is opt-in via env vars.
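The opt-in gating can be pictured roughly like this. The sketch is not the code in backend_test.go: `parseCaps` and `shouldRun` are hypothetical, and the rule that a capability runs only when BACKEND_BINARY is set and the capability is listed in BACKEND_TEST_CAPS is an assumption about how such a suite would behave.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseCaps turns "health,load,predict" into a set of capability names.
func parseCaps(raw string) map[string]bool {
	caps := make(map[string]bool)
	for _, c := range strings.Split(raw, ",") {
		if c = strings.TrimSpace(c); c != "" {
			caps[c] = true
		}
	}
	return caps
}

// shouldRun reports whether a capability's tests are enabled.
// Assumed rule: a backend binary must be configured and the
// capability must be listed explicitly.
func shouldRun(binary, rawCaps, capName string) bool {
	return binary != "" && parseCaps(rawCaps)[capName]
}

func main() {
	bin := os.Getenv("BACKEND_BINARY")
	caps := os.Getenv("BACKEND_TEST_CAPS")
	for _, c := range []string{"health", "load", "predict", "stream", "tools"} {
		fmt.Printf("%-8s enabled=%v\n", c, shouldRun(bin, caps, c))
	}
}
```

With no variables set, every capability reports disabled, which matches the opt-in behaviour: nothing runs in CI unless someone supplies the binary and the model.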
Importer
core/gallery/importers/ds4.go (DS4Importer) auto-detects ds4 weights by
matching the antirez/deepseek-v4-gguf repo URI or the
DeepSeek-V4-Flash-*.gguf filename pattern. Registered BEFORE
LlamaCPPImporter in defaultImporters - both match .gguf but ds4 is more
specific, and first-match-wins. The importer emits backend: ds4, uses
ds4flash.gguf as the local filename (matches ds4's own CLI default), and
disables the Go-side automatic tool-parsing fallback (the C++ backend emits
ChatDelta.tool_calls natively via DsmlParser).
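The ordering requirement is easiest to see in a toy version. The Go sketch below is not the code in core/gallery/importers: the `importer` struct, the matcher functions, and `pickBackend` are hypothetical, and the generic matcher is reduced to a bare `.gguf` suffix check. Only the repo name, the filename pattern, and the first-match-wins rule come from the description above.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// importer pairs a backend name with a URI matcher (illustrative type).
type importer struct {
	backend string
	match   func(uri string) bool
}

// matchDS4 accepts the known weights repo or the Flash filename pattern.
func matchDS4(uri string) bool {
	if strings.Contains(uri, "antirez/deepseek-v4-gguf") {
		return true
	}
	ok, _ := path.Match("DeepSeek-V4-Flash-*.gguf", path.Base(uri))
	return ok
}

// matchGGUF is a simplified stand-in for the generic llama-cpp matcher.
func matchGGUF(uri string) bool {
	return strings.HasSuffix(uri, ".gguf")
}

// Order matters: the more specific ds4 matcher comes first.
var importers = []importer{
	{"ds4", matchDS4},
	{"llama-cpp", matchGGUF},
}

// pickBackend returns the backend of the first importer that matches.
func pickBackend(uri string) string {
	for _, im := range importers {
		if im.match(uri) {
			return im.backend
		}
	}
	return ""
}

func main() {
	fmt.Println(pickBackend("models/DeepSeek-V4-Flash-Q4.gguf"))
	fmt.Println(pickBackend("models/other.gguf"))
}
```

Swapping the two entries would make every DeepSeek-V4-Flash file resolve to the generic backend, since it also ends in `.gguf`.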
ds4 is also listed in core/http/endpoints/localai/backend.go's pref-only
slice so the /import-model UI surfaces it as a manual choice for users who
want to force the backend on a non-canonical URI.