LocalAI/.agents/ds4-backend.md
LocalAI [bot] 621c612b2d
ci(bump-deps): register ds4 + move version pin into the Makefile (#9761)
* ci(bump-deps): register ds4 + move version pin into the Makefile

The initial ds4 PR (#9758) put the upstream commit pin in
backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at
.github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION
was invisible to it - other backends (llama-cpp, ik-llama-cpp,
turboquant, voxtral, etc.) all pin in their Makefile.

This change:

- Moves DS4_VERSION?= and DS4_REPO?= to the top of
  backend/cpp/ds4/Makefile.
- Inlines the git init/fetch/checkout recipe into the 'ds4:' target
  (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts
  as the target so make only re-clones when missing.
- Deletes the now-redundant prepare.sh.
- Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to
  the .github/workflows/bump_deps.yaml matrix so the daily bot opens
  PRs against this pin.
- Updates .agents/ds4-backend.md to point at the Makefile.

Verified:
  $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile
  DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
  $ make -C backend/cpp/ds4 ds4   # clones into ds4/ at the pin
  $ make -C backend/cpp/ds4 ds4   # no-op on second invocation
  make: 'ds4' is up to date.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: route backend/cpp/ds4/ changes through changed-backends.js

scripts/changed-backends.js:inferBackendPath has an explicit branch per
cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a
matching branch the function returns null, the backend never lands in
the path map, and PR change-detection cannot map "backend/cpp/ds4/X
changed" -> "rebuild ds4 image".

This is why PR #9761 produced zero ds4 jobs even though it directly
edits backend/cpp/ds4/Makefile.

Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed
before the llama-cpp branch (since both share the .cpp ancestry but
ds4 is more specific - same ordering rule documented in
.agents/adding-backends.md).

Verified with a local Node simulation of the script against this PR's
diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a
'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend
in the rebuild set.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(adding-backends): harden the two gotchas that bit ds4

Both omissions are silent at the time you ADD a backend - the failure
mode only appears later (the bump bot stays silent forever, or the path
filter shows up on the next PR that touches your backend with zero CI
jobs and looks broken for unrelated reasons). Expanding the
`scripts/changed-backends.js` paragraph from a one-liner to a fully
worked example, and adding a new sibling paragraph for the
`bump_deps.yaml` + Makefile-pin contract.

Both call out the specific mistakes from the ds4 timeline (#9758, #9761) so future contributors can pattern-match on the cause.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:46:02 +02:00

Working on the ds4 Backend

antirez/ds4 is a single-model inference engine for DeepSeek V4 Flash. LocalAI wraps the engine's C API (ds4/ds4.h) with a fresh C++ gRPC server at backend/cpp/ds4/ - NOT a fork of llama-cpp's grpc-server.cpp.

Pin

backend/cpp/ds4/Makefile pins DS4_VERSION?=<sha> at the top. The ds4 target in the Makefile clones antirez/ds4 at that commit (mirroring the llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot (.github/workflows/bump_deps.yaml) finds this pin via grep and opens a daily PR to update it. To bump manually: edit the DS4_VERSION?= line, then make purge && make (or rely on CI's clean build).
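
The pin contract described above can be sketched as follows. This is a minimal illustration, not the bot's actual code (the real bot is a shell script that uses grep): `find_pin` and `bump_pin` are hypothetical helper names, and the only behaviour taken from the repository is the `^VAR?=` line shape.

```python
import re


def find_pin(makefile_text, var="DS4_VERSION"):
    """Return the value of the first `VAR?=` line, or None if there is none."""
    pattern = re.compile(rf"^{re.escape(var)}\?=(.*)$", re.MULTILINE)
    match = pattern.search(makefile_text)
    return match.group(1).strip() if match else None


def bump_pin(makefile_text, new_sha, var="DS4_VERSION"):
    """Rewrite the first `VAR?=` line so that it points at new_sha."""
    pattern = re.compile(rf"^{re.escape(var)}\?=.*$", re.MULTILINE)
    return pattern.sub(lambda _m: f"{var}?={new_sha}", makefile_text, count=1)


makefile = "DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\n"
print(find_pin(makefile))
```

A pin written any other way (a shell variable in a script, or `DS4_VERSION=` without the `?`) does not match the anchored pattern, which is why the pin has to sit in the Makefile in exactly this form.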

Wire shape

| RPC | Implementation |
| --- | --- |
| Health, Free, Status | Trivial; no engine dependency for Health |
| LoadModel | ds4_engine_open + ds4_session_create; backend is compile-time (DS4_NO_GPU → CPU, APPLE → Metal, otherwise CUDA) |
| TokenizeString | ds4_tokenize_text |
| Predict | ds4_engine_generate_argmax + DsmlParser → one ChatDelta with content / reasoning_content / tool_calls[] |
| PredictStream | Same, per-token ChatDelta writes |

DSML

ds4 emits tool calls as literal text markers (<DSMLtool_calls> etc.) - NOT special tokens. dsml_parser.{h,cpp} is our streaming state machine that classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END events. dsml_renderer.{h,cpp} does the prompt direction: turns OpenAI tool_calls + role=tool messages back into DSML for the next turn.
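
Because the markers are plain text, a marker can arrive split across several tokens, so a streaming parser has to hold back any trailing bytes that might be the start of one. The sketch below shows that idea only. It is not a port of dsml_parser: the class name `MarkerStream` and the `<tool>` / `</tool>` markers are hypothetical stand-ins, it works on strings rather than bytes, and it leaves out the REASONING state.

```python
class MarkerStream:
    """Classify streamed text into CONTENT / TOOL_START / TOOL_ARGS / TOOL_END."""

    def __init__(self, start_marker, end_marker):
        self.start = start_marker
        self.end = end_marker
        self.buf = ""
        self.in_tool = False

    def _held_back(self, marker):
        # Longest buffer suffix that is a proper prefix of the marker.
        for n in range(min(len(marker) - 1, len(self.buf)), 0, -1):
            if self.buf.endswith(marker[:n]):
                return n
        return 0

    def feed(self, chunk):
        """Consume one chunk and return the events that are now certain."""
        events = []
        self.buf += chunk
        while True:
            marker = self.end if self.in_tool else self.start
            text_kind = "TOOL_ARGS" if self.in_tool else "CONTENT"
            idx = self.buf.find(marker)
            if idx < 0:
                # No full marker yet: emit everything except a possible partial.
                keep = self._held_back(marker)
                cut = len(self.buf) - keep
                if cut > 0:
                    events.append((text_kind, self.buf[:cut]))
                self.buf = self.buf[cut:]
                return events
            if idx > 0:
                events.append((text_kind, self.buf[:idx]))
            events.append(("TOOL_END" if self.in_tool else "TOOL_START", ""))
            self.buf = self.buf[idx + len(marker):]
            self.in_tool = not self.in_tool

    def finish(self):
        """Flush held-back text once the stream has ended."""
        events = []
        if self.buf:
            events.append(("TOOL_ARGS" if self.in_tool else "CONTENT", self.buf))
            self.buf = ""
        return events
```

Feeding the chunks `"Hi <to"` and `"ol>..."` emits `"Hi "` as CONTENT right away and holds `"<to"` back until the next chunk shows whether it completes a marker.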

Thinking modes

PredictOptions.Metadata["enable_thinking"] gates thinking on/off (default ON). PredictOptions.Metadata["reasoning_effort"] set to "max" or "xhigh" selects DS4_THINK_MAX; any other value (including an unset key) maps to DS4_THINK_HIGH. We pass the chosen mode to ds4_chat_append_assistant_prefix.
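
The selection rule can be written out as a small function. This is a sketch rather than the backend's C++ code: `select_think_mode` is a hypothetical name, and it assumes metadata values are strings and that only the literal value "false" turns thinking off.

```python
def select_think_mode(metadata):
    """Map request metadata to a thinking mode name, or None when disabled."""
    # Assumption: values are strings and only "false" disables thinking.
    enabled = metadata.get("enable_thinking", "true").strip().lower() != "false"
    if not enabled:
        return None
    effort = metadata.get("reasoning_effort", "")
    return "DS4_THINK_MAX" if effort in ("max", "xhigh") else "DS4_THINK_HIGH"


print(select_think_mode({"reasoning_effort": "xhigh"}))
```

An unrecognised effort value such as "low" falls through to DS4_THINK_HIGH rather than raising an error, matching the "anything else" rule above.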

Disk KV cache

kv_cache.{h,cpp} implements an SHA1-keyed file cache using ds4's public ds4_session_save_payload / ds4_session_load_payload API. Enable per request via ModelOptions.Options[] = "kv_cache_dir:/some/path". Format is our own - NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
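
A rough sketch of the two moving parts, option parsing and SHA1-derived file names, is shown below. It rests on assumptions: the text above does not say what is hashed, so hashing the prompt token IDs is a guess, and `parse_options`, `cache_path`, and the `.kv` suffix are hypothetical names rather than what kv_cache.cpp uses.

```python
import hashlib
import os


def parse_options(options):
    """Split "key:value" option strings on the first colon only."""
    parsed = {}
    for opt in options:
        key, sep, value = opt.partition(":")
        if sep:
            parsed[key] = value
    return parsed


def cache_path(cache_dir, token_ids):
    """Build a cache file path from the SHA1 of the token IDs (assumed key)."""
    payload = ",".join(str(t) for t in token_ids).encode("utf-8")
    digest = hashlib.sha1(payload).hexdigest()
    return os.path.join(cache_dir, digest + ".kv")


opts = parse_options(["kv_cache_dir:/some/path"])
print(cache_path(opts["kv_cache_dir"], [1, 2, 3]))
```

Splitting on the first colon only keeps values that themselves contain colons intact.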

Build matrix

| Build | Where | Notes |
| --- | --- | --- |
| cpu-ds4 (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
| cuda13-ds4 (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
| ds4-darwin (arm64) | macOS GHA runners | Metal; uses scripts/build/ds4-darwin.sh like llama-cpp-darwin |

cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.

Hardware-gated validation

Run tests/e2e-backends/backend_test.go in BACKEND_BINARY mode on a machine that has the model weights:

BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
go test -count=1 -timeout=30m -v ./tests/e2e-backends/...

CI does not load the model; the suite is opt-in via env vars.

Importer

core/gallery/importers/ds4.go (DS4Importer) auto-detects ds4 weights by matching the antirez/deepseek-v4-gguf repo URI or the DeepSeek-V4-Flash-*.gguf filename pattern. Registered BEFORE LlamaCPPImporter in defaultImporters - both match .gguf but ds4 is more specific, and first-match-wins. The importer emits backend: ds4, uses ds4flash.gguf as the local filename (matches ds4's own CLI default), and disables the Go-side automatic tool-parsing fallback (the C++ backend emits ChatDelta.tool_calls natively via DsmlParser).
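
The ordering rule is easy to get wrong, so here is a small sketch of first-match-wins dispatch. It is illustrative only and does not mirror the Go code: the function names and the example URIs are hypothetical, and only the two ds4 match rules and the ".gguf" overlap come from the description above.

```python
from fnmatch import fnmatchcase


def matches_ds4(uri):
    """Match the ds4 weights repo or the DeepSeek-V4-Flash filename pattern."""
    filename = uri.rsplit("/", 1)[-1]
    return ("antirez/deepseek-v4-gguf" in uri
            or fnmatchcase(filename, "DeepSeek-V4-Flash-*.gguf"))


def matches_llama_cpp(uri):
    """Generic matcher: any GGUF file."""
    return uri.endswith(".gguf")


# The more specific importer must come first.
IMPORTERS = [("ds4", matches_ds4), ("llama-cpp", matches_llama_cpp)]


def pick_backend(uri, importers=IMPORTERS):
    """Return the backend of the first importer that matches, else None."""
    for backend, matches in importers:
        if matches(uri):
            return backend
    return None


print(pick_backend("https://example.invalid/models/DeepSeek-V4-Flash-Q4.gguf"))
```

With the list reversed, the generic ".gguf" matcher would claim the ds4 weights first and the ds4 importer would never run.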

ds4 is also listed in core/http/endpoints/localai/backend.go's pref-only slice so the /import-model UI surfaces it as a manual choice for users who want to force the backend on a non-canonical URI.