LocalAI/.agents/vllm-backend.md
Ettore Di Giacinto daa0272f2e
docs(agents): capture vllm backend lessons + runtime lib packaging (#9333)
New .agents/vllm-backend.md with everything that's easy to get wrong
on the vllm/vllm-omni backends:

- Use vLLM's native ToolParserManager / ReasoningParserManager — do
  not write regex-based parsers. Selection is explicit via Options[],
  defaults live in core/config/parser_defaults.json.
- Concrete parsers don't always accept the tools= kwarg the abstract
  base declares; try/except TypeError is mandatory.
- ChatDelta.tool_calls is the contract — Reply.message text alone
  won't surface tool calls in /v1/chat/completions.
- vllm version pin trap: 0.14.1+cpu pairs with torch 2.9.1+cpu.
  Newer wheels declare torch==2.10.0+cpu which only exists on the
  PyTorch test channel and pulls an incompatible torchvision.
- SIMD baseline: prebuilt wheel needs AVX-512 VNNI/BF16. SIGILL
  symptom + FROM_SOURCE=true escape hatch are documented.
- libnuma.so.1 + libgomp.so.1 must be bundled because vllm._C
  silently fails to register torch ops if they're missing.
- backend_hooks system: hooks_llamacpp / hooks_vllm split + the
  '*' / '' / named-backend keys.
- ToProto() must serialize ToolCallID and Reasoning — easy to miss
  when adding fields to schema.Message.

Also extended .agents/adding-backends.md with a generic 'Bundling
runtime shared libraries' section: Dockerfile.python is FROM scratch,
package.sh is the mechanism, libbackend.sh adds ${EDIR}/lib to
LD_LIBRARY_PATH, and how to verify packaging without trusting the
host (extract image, boot in fresh ubuntu container).

Index in AGENTS.md updated.
2026-04-13 11:09:57 +02:00


Working on the vLLM Backend

The vLLM backend lives at backend/python/vllm/backend.py (async gRPC) and the multimodal variant at backend/python/vllm-omni/backend.py (sync gRPC). Both wrap vLLM's AsyncLLMEngine / Omni. They translate the LocalAI gRPC PredictOptions into vLLM SamplingParams, and translate vLLM outputs into Reply.chat_deltas.

This file captures the non-obvious bits — most of the bring-up was a single PR (feat/vllm-parity) and the things below are easy to get wrong.

Tool calling and reasoning use vLLM's native parsers

Do not write regex-based tool-call extractors for vLLM. vLLM ships:

  • vllm.tool_parsers.ToolParserManager — 50+ registered parsers (hermes, llama3_json, llama4_pythonic, mistral, qwen3_xml, deepseek_v3, granite4, openai, kimi_k2, glm45, …)
  • vllm.reasoning.ReasoningParserManager — 25+ registered parsers (deepseek_r1, qwen3, mistral, gemma4, …)

Both can be used standalone: instantiate with a tokenizer, call extract_tool_calls(text, request=None) / extract_reasoning(text, request=None). The backend stores the parser classes on self.tool_parser_cls / self.reasoning_parser_cls at LoadModel time and instantiates them per request.

Selection: vLLM does not auto-detect parsers from model name — neither does the LocalAI backend. The user (or core/config/hooks_vllm.go) must pick one and pass it via Options[]:

options:
  - tool_parser:hermes
  - reasoning_parser:qwen3

Auto-defaults for known model families live in core/config/parser_defaults.json and are applied:

  • at gallery import time by core/gallery/importers/vllm.go
  • at model load time by the vllm / vllm-omni backend hook in core/config/hooks_vllm.go

User-supplied tool_parser:/reasoning_parser: in the config wins over defaults — the hook checks for existing entries before appending.
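
The precedence rule above can be sketched in a few lines. This is only an illustration of the "check for existing entries before appending" behaviour, not the Go hook itself; the function name merge_parser_options and the shape of the defaults mapping are assumptions made for the example.

```python
def merge_parser_options(options, defaults):
    """Append default parser options unless the user already set that key.

    options:  list of "key:value" strings, as in the model config Options[].
    defaults: mapping such as {"tool_parser": "...", "reasoning_parser": "..."}
              (hypothetical shape, for illustration only).
    """
    # Keys the user already supplied, e.g. {"tool_parser"}.
    present = {opt.split(":", 1)[0] for opt in options}
    merged = list(options)
    for key in ("tool_parser", "reasoning_parser"):
        if key in defaults and key not in present:
            merged.append(f"{key}:{defaults[key]}")
    return merged
```

With a user-supplied tool_parser:hermes and defaults for both keys, only the reasoning parser default is appended; the user's tool parser is left untouched.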

When to update parser_defaults.json: any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by family pattern matched against normalizeModelID(cfg.Model) (lowercase, org-prefix stripped, underscores replaced with hyphens). Patterns are checked longest-first: keep qwen3.5 before qwen3, llama-3.3 before llama-3, etc., or the wrong family wins. Add a covering test in core/config/hooks_test.go.
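
A minimal Python sketch of why the ordering matters. It assumes that a pattern matches when it is a substring of the normalized model ID and that candidates are sorted by length; the real Go matching rule may differ. The function names and the placeholder parser values are hypothetical.

```python
def normalize_model_id(model: str) -> str:
    """Lowercase, strip the org prefix, and turn underscores into hyphens."""
    model_id = model.lower()
    # "Org/Name" -> "name"; IDs without a slash are left as they are.
    model_id = model_id.split("/", 1)[-1]
    return model_id.replace("_", "-")


def match_family(model: str, defaults: dict):
    """Return the entry for the longest pattern contained in the model ID.

    Substring matching is an assumption made for this sketch.
    """
    model_id = normalize_model_id(model)
    for pattern in sorted(defaults, key=len, reverse=True):
        if pattern in model_id:
            return defaults[pattern]
    return None
```

Because "qwen3" is a substring of "qwen3.5-7b", checking the shorter pattern first would pick the wrong family; trying the longer pattern first avoids that.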

Sister file — core/config/inference_defaults.json: same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by core/config/inference_defaults.go and applied by ApplyInferenceDefaults(). The schema is map[string]float64 only — strings don't fit, which is why parser defaults needed their own JSON file. The inference file is auto-generated from unsloth via go generate ./core/config/ (see core/config/gen_inference_defaults/) — don't hand-edit it; instead update the upstream source or regenerate. Both files share normalizeModelID() and the longest-first pattern ordering.

Constructor compatibility gotcha: the abstract ToolParser.__init__ accepts tools=, but several concrete parsers (Hermes2ProToolParser, etc.) override __init__ and only accept tokenizer. Always:

try:
    tp = self.tool_parser_cls(self.tokenizer, tools=tools)
except TypeError:
    tp = self.tool_parser_cls(self.tokenizer)
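
The same fallback can be exercised without vLLM installed. The sketch below uses two stand-in classes (hypothetical, not vLLM types) that mimic the two constructor shapes, and wraps the pattern in a helper named make_tool_parser, which is also an assumption for illustration.

```python
class BaseStyleParser:
    """Stand-in for a parser whose __init__ accepts tools= like the abstract base."""

    def __init__(self, tokenizer, tools=None):
        self.tokenizer = tokenizer
        self.tools = tools


class TokenizerOnlyParser:
    """Stand-in for a concrete parser that overrides __init__ without tools=."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.tools = None


def make_tool_parser(parser_cls, tokenizer, tools=None):
    """Try the richer constructor first, then fall back to tokenizer-only."""
    try:
        return parser_cls(tokenizer, tools=tools)
    except TypeError:
        # Raised when the constructor does not know the tools= keyword.
        return parser_cls(tokenizer)
```

One caveat of this design: a TypeError raised for an unrelated reason inside a constructor also triggers the fallback, so a genuinely broken parser may only fail on the second attempt.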

ChatDelta is the streaming contract

The Go side (core/backend/llm.go, pkg/functions/chat_deltas.go) consumes Reply.chat_deltas to assemble the OpenAI response. For tool calls to surface in chat/completions, the Python backend must populate Reply.chat_deltas[].tool_calls with ToolCallDelta{index, id, name, arguments}. Returning the raw <tool_call>...</tool_call> text in Reply.message is not enough — the Go regex fallback exists for llama.cpp, not for vllm.

Same story for reasoning_content — emit it on ChatDelta.reasoning_content, not as part of content.
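
To make the contract concrete, here is a sketch that builds a delta from already-parsed output. The dataclasses are stand-ins for the generated protobuf messages, and the input shape of parsed_calls plus the "call_<index>" fallback ID are assumptions for this example, not the backend's actual behaviour.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolCallDelta:
    # Field names follow ToolCallDelta{index, id, name, arguments} from the text.
    index: int
    id: str
    name: str
    arguments: str


@dataclass
class ChatDelta:
    content: str = ""
    reasoning_content: str = ""
    tool_calls: list = field(default_factory=list)


def build_chat_delta(content, reasoning, parsed_calls):
    """Put content, reasoning, and tool calls in their own fields."""
    delta = ChatDelta(content=content or "", reasoning_content=reasoning or "")
    for i, call in enumerate(parsed_calls):
        args = call["arguments"]
        if not isinstance(args, str):
            args = json.dumps(args)  # arguments travel as a JSON string
        delta.tool_calls.append(
            ToolCallDelta(
                index=i,
                id=call.get("id") or f"call_{i}",  # hypothetical fallback ID
                name=call["name"],
                arguments=args,
            )
        )
    return delta
```

The point of the sketch is the separation: nothing tool-related or reasoning-related is left inside content for the Go side to re-parse.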

Message conversion to chat templates

tokenizer.apply_chat_template() expects a list of dicts, not proto Messages. The shared helper in backend/python/common/vllm_utils.py (messages_to_dicts) handles the mapping including:

  • tool_call_id and name for role="tool" messages
  • tool_calls JSON-string field → parsed Python list for role="assistant"
  • reasoning_content for thinking models

Pass tools=json.loads(request.Tools) and (when request.Metadata.get("enable_thinking") == "true") enable_thinking=True to apply_chat_template. Wrap in try/except TypeError because not every tokenizer template accepts those kwargs.
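
A rough sketch of the conversion and the kwarg fallback described above. It is not the code in vllm_utils.py: the message attribute names (role, content, tool_call_id, name, tool_calls, reasoning_content) and the helper render_prompt are assumptions, and SimpleNamespace stands in for the proto Message.

```python
import json
from types import SimpleNamespace


def make_msg(role, content="", tool_call_id="", name="", tool_calls="", reasoning_content=""):
    """Stand-in for a proto Message; unset string fields default to ""."""
    return SimpleNamespace(role=role, content=content, tool_call_id=tool_call_id,
                           name=name, tool_calls=tool_calls,
                           reasoning_content=reasoning_content)


def messages_to_dicts(messages):
    """Map message objects to the list of dicts chat templates expect."""
    result = []
    for msg in messages:
        d = {"role": msg.role, "content": msg.content}
        if msg.role == "tool":
            if msg.tool_call_id:
                d["tool_call_id"] = msg.tool_call_id
            if msg.name:
                d["name"] = msg.name
        if msg.role == "assistant" and msg.tool_calls:
            d["tool_calls"] = json.loads(msg.tool_calls)  # JSON string -> list
        if msg.reasoning_content:
            d["reasoning_content"] = msg.reasoning_content
        result.append(d)
    return result


def render_prompt(tokenizer, messages, tools_json="", enable_thinking=False):
    """Apply the chat template, retrying without optional kwargs on TypeError."""
    base = {"tokenize": False, "add_generation_prompt": True}
    extra = {}
    if tools_json:
        extra["tools"] = json.loads(tools_json)
    if enable_thinking:
        extra["enable_thinking"] = True
    dicts = messages_to_dicts(messages)
    try:
        return tokenizer.apply_chat_template(dicts, **base, **extra)
    except TypeError:
        return tokenizer.apply_chat_template(dicts, **base)
```

Retrying without the optional kwargs trades tool awareness in the prompt for a prompt that renders at all, which is usually the better failure mode for templates that do not know about tools.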

CPU support and the SIMD/library minefield

vLLM publishes prebuilt CPU wheels at https://github.com/vllm-project/vllm/releases/.... The pin lives in backend/python/vllm/requirements-cpu-after.txt.

Version compatibility — important: newer vllm CPU wheels (≥ 0.15) declare torch==2.10.0+cpu as a hard dep, but torch==2.10.0 only exists on the PyTorch test channel and pulls in an incompatible torchvision. Stay on vllm 0.14.1+cpu + torch 2.9.1+cpu until both upstream catch up. Bumping requires verifying torchvision/torchaudio match.

requirements-cpu.txt uses --extra-index-url https://download.pytorch.org/whl/cpu. install.sh adds --index-strategy=unsafe-best-match for the cpu profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.

SIMD baseline: the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing vllm.model_executor.models.registry SIGILLs at _run_in_subprocess time during model inspection. There is no runtime flag to disable it. Workarounds:

  1. Run on a host with the right SIMD baseline (default — fast)
  2. Build from source with FROM_SOURCE=true env var. Plumbing exists end-to-end:
    • install.sh hides requirements-cpu-after.txt, runs installRequirements for the base deps, then clones vllm and VLLM_TARGET_DEVICE=cpu uv pip install --no-deps .
    • backend/Dockerfile.python declares ARG FROM_SOURCE + ENV FROM_SOURCE
    • Makefile docker-build-backend macro forwards --build-arg FROM_SOURCE=$(FROM_SOURCE) when set
    • Source build takes 30 to 50 minutes: too slow for per-PR CI but fine for local.
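
Before choosing between the two workarounds, it can help to check the host's CPU flags. The sketch below parses /proc/cpuinfo-style text on Linux. The flag names avx512_vnni and avx512_bf16 are the kernel's names for those extensions; treating exactly these two as the wheel's full requirement is an assumption, and the function name is made up for the example.

```python
REQUIRED_FLAGS = ("avx512_vnni", "avx512_bf16")


def missing_simd_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required CPU flags that are absent from cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like "flags\t\t: fpu vme ... avx512_vnni ..."
            _, _, rest = line.partition(":")
            flags.update(rest.split())
            break  # all cores normally report the same flags
    return [flag for flag in required if flag not in flags]
```

On a Linux host, passing the contents of /proc/cpuinfo and getting back a non-empty list suggests the prebuilt wheel is likely to SIGILL there, so FROM_SOURCE=true is the safer path.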

Runtime shared libraries: vLLM's vllm._C extension dlopens libnuma.so.1 at import time. If missing, the C extension silently fails and torch.ops._C_utils.init_cpu_threads_env is never registered → EngineCore crashes on init_device with:

AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'

backend/python/vllm/package.sh bundles libnuma.so.1 and libgomp.so.1 into ${BACKEND}/lib/, which libbackend.sh adds to LD_LIBRARY_PATH at run time. The builder stage in backend/Dockerfile.python installs libnuma1/libgomp1 so package.sh has something to copy. Do not assume the production host has these — backend images are FROM scratch.
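
A quick way to confirm the bundled libraries are actually resolvable inside the packaged environment is to try loading them by soname. This is a sketch using ctypes from the standard library; the helper name and the injectable loader parameter are assumptions added so the check can be tested without the real libraries.

```python
import ctypes


def missing_shared_libs(names=("libnuma.so.1", "libgomp.so.1"), loader=ctypes.CDLL):
    """Return the sonames the dynamic loader cannot resolve.

    ctypes.CDLL raises OSError when a library is not found on the loader's
    search path (which includes LD_LIBRARY_PATH on Linux).
    """
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            missing.append(name)
    return missing
```

Running this with the same LD_LIBRARY_PATH that libbackend.sh sets up, inside a clean container rather than on the build host, avoids a false pass caused by libraries that happen to be installed system-wide.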

Backend hook system (core/config/backend_hooks.go)

Per-backend defaults that used to be hardcoded in ModelConfig.Prepare() now live in core/config/hooks_*.go files and self-register via init():

  • hooks_llamacpp.go → GGUF metadata parsing, context size, GPU layers, jinja template
  • hooks_vllm.go → tool/reasoning parser auto-selection from parser_defaults.json

Hook keys:

  • "llama-cpp", "vllm", "vllm-omni", … — backend-specific
  • "" — runs only when cfg.Backend is empty (auto-detect case)
  • "*" — global catch-all, runs for every backend before specific hooks

Multiple hooks per key are supported and run in registration order. Adding a new backend default:

// core/config/hooks_<backend>.go
func init() {
    RegisterBackendHook("<backend>", myDefaults)
}
func myDefaults(cfg *ModelConfig, modelPath string) {
    // only fill in fields the user didn't set
}
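
The key semantics can be modelled in a few lines of Python. This is a simplified model of the ordering rules stated above ("*" first, then the key matching cfg.Backend, with "" matching an empty backend), not a port of backend_hooks.go; the class name and the dict-based config are hypothetical.

```python
class HookRegistry:
    """Toy model of per-backend hook registration and dispatch order."""

    def __init__(self):
        self._hooks = {}

    def register(self, key, hook):
        # Multiple hooks per key are kept in registration order.
        self._hooks.setdefault(key, []).append(hook)

    def run(self, cfg, model_path):
        backend = cfg.get("backend", "")
        # Global hooks run for every backend, before the specific ones.
        for hook in self._hooks.get("*", []):
            hook(cfg, model_path)
        # An empty backend selects the "" key (the auto-detect case).
        for hook in self._hooks.get(backend, []):
            hook(cfg, model_path)
```

In this model, a config with backend "vllm" runs the "*" hooks and then the "vllm" hooks in the order they were registered, while a config with an empty backend runs the "*" hooks followed by the "" hooks.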

The Messages.ToProto() fields you need to set

core/schema/message.go:ToProto() must serialize:

  • ToolCallID → proto.Message.ToolCallId (for role="tool" messages; links the result back to the call)
  • Reasoning → proto.Message.ReasoningContent
  • ToolCalls → proto.Message.ToolCalls (JSON-encoded string)

These were originally not serialized and tool-calling conversations broke silently — the C++ llama.cpp backend reads them but always got empty strings. Any new field added to schema.Message and proto.Message needs a matching line in ToProto().