New .agents/vllm-backend.md with everything that's easy to get wrong
on the vllm/vllm-omni backends:
- Use vLLM's native ToolParserManager / ReasoningParserManager — do
not write regex-based parsers. Selection is explicit via Options[],
defaults live in core/config/parser_defaults.json.
- Concrete parsers don't always accept the tools= kwarg the abstract
base declares; try/except TypeError is mandatory.
- ChatDelta.tool_calls is the contract — Reply.message text alone
won't surface tool calls in /v1/chat/completions.
- vllm version pin trap: 0.14.1+cpu pairs with torch 2.9.1+cpu.
Newer wheels declare torch==2.10.0+cpu which only exists on the
PyTorch test channel and pulls an incompatible torchvision.
- SIMD baseline: prebuilt wheel needs AVX-512 VNNI/BF16. SIGILL
symptom + FROM_SOURCE=true escape hatch are documented.
- libnuma.so.1 + libgomp.so.1 must be bundled because vllm._C
silently fails to register torch ops if they're missing.
- backend_hooks system: hooks_llamacpp / hooks_vllm split + the
'*' / '' / named-backend keys.
- ToProto() must serialize ToolCallID and Reasoning — easy to miss
when adding fields to schema.Message.
Also extended .agents/adding-backends.md with a generic 'Bundling
runtime shared libraries' section: Dockerfile.python is FROM scratch,
package.sh is the mechanism, libbackend.sh adds ${EDIR}/lib to
LD_LIBRARY_PATH, and how to verify packaging without trusting the
host (extract image, boot in fresh ubuntu container).
Index in AGENTS.md updated.
8.4 KiB
Working on the vLLM Backend
The vLLM backend lives at backend/python/vllm/backend.py (async gRPC) and the multimodal variant at backend/python/vllm-omni/backend.py (sync gRPC). Both wrap vLLM's AsyncLLMEngine / Omni and translate the LocalAI gRPC PredictOptions into vLLM SamplingParams + outputs into Reply.chat_deltas.
This file captures the non-obvious bits — most of the bring-up was a single PR (feat/vllm-parity) and the things below are easy to get wrong.
Tool calling and reasoning use vLLM's native parsers
Do not write regex-based tool-call extractors for vLLM. vLLM ships:
vllm.tool_parsers.ToolParserManager— 50+ registered parsers (hermes,llama3_json,llama4_pythonic,mistral,qwen3_xml,deepseek_v3,granite4,openai,kimi_k2,glm45, …)vllm.reasoning.ReasoningParserManager— 25+ registered parsers (deepseek_r1,qwen3,mistral,gemma4, …)
Both can be used standalone: instantiate with a tokenizer, call extract_tool_calls(text, request=None) / extract_reasoning(text, request=None). The backend stores the parser classes on self.tool_parser_cls / self.reasoning_parser_cls at LoadModel time and instantiates them per request.
Selection: vLLM does not auto-detect parsers from model name — neither does the LocalAI backend. The user (or core/config/hooks_vllm.go) must pick one and pass it via Options[]:
options:
- tool_parser:hermes
- reasoning_parser:qwen3
Auto-defaults for known model families live in core/config/parser_defaults.json and are applied:
- at gallery import time by
core/gallery/importers/vllm.go - at model load time by the
vllm/vllm-omnibackend hook incore/config/hooks_vllm.go
User-supplied tool_parser:/reasoning_parser: in the config wins over defaults — the hook checks for existing entries before appending.
When to update parser_defaults.json: any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by family pattern matched against normalizeModelID(cfg.Model) (lowercase, org-prefix stripped, _→-). Patterns are checked longest-first — keep qwen3.5 before qwen3, llama-3.3 before llama-3, etc., or the wrong family wins. Add a covering test in core/config/hooks_test.go.
Sister file — core/config/inference_defaults.json: same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by core/config/inference_defaults.go and applied by ApplyInferenceDefaults(). The schema is map[string]float64 only — strings don't fit, which is why parser defaults needed their own JSON file. The inference file is auto-generated from unsloth via go generate ./core/config/ (see core/config/gen_inference_defaults/) — don't hand-edit it; instead update the upstream source or regenerate. Both files share normalizeModelID() and the longest-first pattern ordering.
Constructor compatibility gotcha: the abstract ToolParser.__init__ accepts tools=, but several concrete parsers (Hermes2ProToolParser, etc.) override __init__ and only accept tokenizer. Always:
try:
tp = self.tool_parser_cls(self.tokenizer, tools=tools)
except TypeError:
tp = self.tool_parser_cls(self.tokenizer)
ChatDelta is the streaming contract
The Go side (core/backend/llm.go, pkg/functions/chat_deltas.go) consumes Reply.chat_deltas to assemble the OpenAI response. For tool calls to surface in chat/completions, the Python backend must populate Reply.chat_deltas[].tool_calls with ToolCallDelta{index, id, name, arguments}. Returning the raw <tool_call>...</tool_call> text in Reply.message is not enough — the Go regex fallback exists for llama.cpp, not for vllm.
Same story for reasoning_content — emit it on ChatDelta.reasoning_content, not as part of content.
Message conversion to chat templates
tokenizer.apply_chat_template() expects a list of dicts, not proto Messages. The shared helper in backend/python/common/vllm_utils.py (messages_to_dicts) handles the mapping including:
tool_call_idandnameforrole="tool"messagestool_callsJSON-string field → parsed Python list forrole="assistant"reasoning_contentfor thinking models
Pass tools=json.loads(request.Tools) and (when request.Metadata.get("enable_thinking") == "true") enable_thinking=True to apply_chat_template. Wrap in try/except TypeError because not every tokenizer template accepts those kwargs.
CPU support and the SIMD/library minefield
vLLM publishes prebuilt CPU wheels at https://github.com/vllm-project/vllm/releases/.... The pin lives in backend/python/vllm/requirements-cpu-after.txt.
Version compatibility — important: newer vllm CPU wheels (≥ 0.15) declare torch==2.10.0+cpu as a hard dep, but torch==2.10.0 only exists on the PyTorch test channel and pulls in an incompatible torchvision. Stay on vllm 0.14.1+cpu + torch 2.9.1+cpu until both upstream catch up. Bumping requires verifying torchvision/torchaudio match.
requirements-cpu.txt uses --extra-index-url https://download.pytorch.org/whl/cpu. install.sh adds --index-strategy=unsafe-best-match for the cpu profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.
SIMD baseline: the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing vllm.model_executor.models.registry SIGILLs at _run_in_subprocess time during model inspection. There is no runtime flag to disable it. Workarounds:
- Run on a host with the right SIMD baseline (default — fast)
- Build from source with
FROM_SOURCE=trueenv var. Plumbing exists end-to-end:install.shhidesrequirements-cpu-after.txt, runsinstallRequirementsfor the base deps, then clones vllm andVLLM_TARGET_DEVICE=cpu uv pip install --no-deps .backend/Dockerfile.pythondeclaresARG FROM_SOURCE+ENV FROM_SOURCEMakefiledocker-build-backendmacro forwards--build-arg FROM_SOURCE=$(FROM_SOURCE)when set- Source build takes 30–50 minutes — too slow for per-PR CI but fine for local.
Runtime shared libraries: vLLM's vllm._C extension dlopens libnuma.so.1 at import time. If missing, the C extension silently fails and torch.ops._C_utils.init_cpu_threads_env is never registered → EngineCore crashes on init_device with:
AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'
backend/python/vllm/package.sh bundles libnuma.so.1 and libgomp.so.1 into ${BACKEND}/lib/, which libbackend.sh adds to LD_LIBRARY_PATH at run time. The builder stage in backend/Dockerfile.python installs libnuma1/libgomp1 so package.sh has something to copy. Do not assume the production host has these — backend images are FROM scratch.
Backend hook system (core/config/backend_hooks.go)
Per-backend defaults that used to be hardcoded in ModelConfig.Prepare() now live in core/config/hooks_*.go files and self-register via init():
hooks_llamacpp.go→ GGUF metadata parsing, context size, GPU layers, jinja templatehooks_vllm.go→ tool/reasoning parser auto-selection fromparser_defaults.json
Hook keys:
"llama-cpp","vllm","vllm-omni", … — backend-specific""— runs only whencfg.Backendis empty (auto-detect case)"*"— global catch-all, runs for every backend before specific hooks
Multiple hooks per key are supported and run in registration order. Adding a new backend default:
// core/config/hooks_<backend>.go
func init() {
RegisterBackendHook("<backend>", myDefaults)
}
func myDefaults(cfg *ModelConfig, modelPath string) {
// only fill in fields the user didn't set
}
The Messages.ToProto() fields you need to set
core/schema/message.go:ToProto() must serialize:
ToolCallID→proto.Message.ToolCallId(forrole="tool"messages — links result back to the call)Reasoning→proto.Message.ReasoningContentToolCalls→proto.Message.ToolCalls(JSON-encoded string)
These were originally not serialized and tool-calling conversations broke silently — the C++ llama.cpp backend reads them but always got empty strings. Any new field added to schema.Message and proto.Message needs a matching line in ToProto().