LocalAI/backend/python/vllm/package.sh

154 lines
6.8 KiB
Bash
Raw Normal View History

feat(vllm): parity with llama.cpp backend (#9328) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.
2026-04-13 09:00:29 +00:00
#!/bin/bash
# Script to package runtime shared libraries for the vllm backend.
#
# The final Dockerfile.python stage is FROM scratch, so system libraries
# must be explicitly copied into ${BACKEND}/lib so the backend can run on
# any host without installing them. libbackend.sh automatically adds that
# directory to LD_LIBRARY_PATH at run time.
#
# vllm's CPU C++ extension (vllm._C) dlopens libnuma.so.1 at import time;
# if it's missing, the _C_utils torch ops are never registered and the
# engine crashes with AttributeError on init_cpu_threads_env. libgomp is
# used by torch's CPU kernels; on some stripped-down hosts it's also
# absent, so we bundle it too.
set -e
CURDIR=$(dirname "$(realpath "$0")")
LIB_DIR="${CURDIR}/lib"
mkdir -p "${LIB_DIR}"
copy_with_symlinks() {
local soname="$1"
local hit=""
for dir in /usr/lib/x86_64-linux-gnu /usr/lib/aarch64-linux-gnu /lib/x86_64-linux-gnu /lib/aarch64-linux-gnu /usr/lib /lib; do
if [ -e "${dir}/${soname}" ]; then
hit="${dir}/${soname}"
break
fi
done
if [ -z "${hit}" ]; then
echo "warning: ${soname} not found in standard lib paths" >&2
return 0
fi
# Follow the symlink to the real file, copy it, then recreate the symlink.
local real
real=$(readlink -f "${hit}")
cp -v "${real}" "${LIB_DIR}/"
local real_base
real_base=$(basename "${real}")
if [ "${real_base}" != "${soname}" ]; then
ln -sf "${real_base}" "${LIB_DIR}/${soname}"
fi
}
copy_with_symlinks libnuma.so.1
copy_with_symlinks libgomp.so.1
feat(vllm, distributed): tensor parallel distributed workers (#9612) * feat(vllm): build vllm from source for Intel XPU Upstream publishes no XPU wheels for vllm. The Intel profile was silently picking up a non-XPU wheel that imported but errored at engine init, and several runtime deps (pillow, charset-normalizer, chardet) were missing on Intel -- backend.py crashed at import time before the gRPC server came up. Switch the Intel profile to upstream's documented from-source procedure (docs/getting_started/installation/gpu.xpu.inc.md in vllm-project/vllm): - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a cp312 wheel. - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees the dpcpp/sycl compiler from the oneapi-basekit base image. - Hide requirements-intel-after.txt during installRequirements (it used to 'pip install vllm'); install vllm's deps from a fresh git clone of vllm via 'uv pip install -r requirements/xpu.txt', swap stock triton for triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .'. - requirements-intel.txt trimmed to LocalAI's direct deps (accelerate / transformers / bitsandbytes); torch-xpu, vllm, vllm_xpu_kernels and the rest come from upstream's xpu.txt during the source build. - requirements.txt: add pillow + charset-normalizer + chardet -- used by backend.py and missing on the Intel install profile. - run.sh: 'set -x' so backend startup is visible in container logs (the gRPC startup error path was previously opaque). Also adds a one-line docs example for engine_args.attention_backend under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770) need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels. Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct via LocalAI's /v1/chat/completions. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): add multi-node data-parallel follower worker vLLM v1's multi-node story is one process per node sharing a DP coordinator over ZMQ -- the head runs the API server with data_parallel_size > 1 and followers run `vllm serve --headless ...` with matching topology. Today LocalAI can already configure DP on the head via the engine_args YAML map, but there's no way to bring up the follower nodes -- so the head sits waiting for ranks that never handshake. Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural precedent (operator-launched, static config, no NATS placement). The worker: - Optionally self-registers with the frontend as an agent-type node tagged `node.role=vllm-follower` so it's visible in the admin UI and operators can scope ordinary models away via inverse selectors. - Resolves the platform-specific vllm backend via the gallery's "vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...). - Runs vLLM as a child process so the heartbeat goroutine survives until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its ZMQ sockets before we tear down. - Validates --headless + --start-rank 0 is rejected (rank 0 is the head and must serve the API). Backend run.sh dispatches `serve` as the first arg to vllm's own CLI instead of LocalAI's backend.py gRPC server -- the follower speaks ZMQ directly to the head, there is no LocalAI gRPC on the follower side. Single-node usage is unchanged. Generalises the gallery resolution helper into findBackendPath() shared by MLX and vLLM workers; extracts ParseNodeLabels for the comma-separated label parsing both use. Ships with two compose recipes (`docker-compose.vllm-multinode.yaml` for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is not -- PyTorch's process group requires every rank to use the same collective backend, and NCCL/xccl/gloo don't interoperate. Out of scope (deferred): SmartRouter-driven placement of follower ranks via NATS backend.install events, follower log streaming through /api/backend-logs, tensor-parallel across nodes, disaggregated prefill via KVTransferConfig. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(vllm): CPU-only end-to-end test for multi-node DP Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite that brings up a head + headless follower from the locally-built local-ai:tests image, bind-mounts the cpu-vllm backend extracted by make extract-backend-vllm so it's seen as a system backend (no gallery fetch, no registry server), and asserts a chat completion across both DP ranks. New `make test-e2e-vllm-multinode` target wires the docker build, backend extract, and ginkgo run together; BuildKit caches both images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode") so the existing distributed suite isn't pulled along. Two pre-existing bugs surfaced by the test: 1. extract-backend-% (Makefile) failed for every backend, because all backend images end with `FROM scratch` and `docker create` rejects an image with no CMD/ENTRYPOINT. Fixed by passing --entrypoint=/run.sh -- the container is never started, only docker-cp'd, so the path doesn't have to exist; we just need anything that satisfies the daemon's create-time validation. 2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no longer resolves once the backend is relocated to BackendsPath. _makeVenvPortable's shebang rewriter only matches paths that already point at ${EDIR}, so the original shebang slips through unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script as an argument -- Python ignores the script's shebang in that case. The test fixture caps memory aggressively (max_model_len=512, VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for cpu-vllm: torch._inductor's CPU-ISA probe runs even with enforce_eager=True and needs g++ on PATH, which the LocalAI runtime image doesn't ship -- to be addressed in a follow-up that bundles a toolchain in the cpu-vllm backend. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up model for the compilation"`) shells out to `g++` at vllm engine startup, regardless of `enforce_eager=True` -- the eager flag only disables CUDA graphs, not inductor's first-batch warmup. The LocalAI CPU runtime image (Dockerfile, unconditional apt list) does not ship build-essential, and the cpu-vllm backend image is `FROM scratch`, so any non-trivial inference on cpu-vllm crashes with: torch._inductor.exc.InductorError: InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++') Bundling the toolchain in the CPU runtime image would bloat every non-vllm-CPU deployment and force a single GCC version on backends that may want clang or a different version. So this lives in the backend, gated to BUILD_TYPE=='' (the CPU profile). `package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6 (runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/ libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The unversioned binaries on Debian/Ubuntu are symlink chains pointing into multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`, the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves both the version and the arch-triplet variant. Symlinks /lib -> usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain root because Ubuntu's UsrMerge keeps them at /, and ld scripts (`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot re-roots into the toolchain. The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper shell scripts that resolve their own location at runtime and pass `--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/` to the underlying versioned binary. That's how torch's bare `g++ foo.cpp -o foo` invocation finds cc1plus (-B), system headers (--sysroot), and the bundled libstdc++ (--sysroot, --sysroot is recursive into linker). `run.sh` adds the toolchain bin dir to PATH and the toolchain's shared-lib dir to LD_LIBRARY_PATH -- everything else (header search, linker search, executable search) is encapsulated in the wrappers. No-op for non-CPU builds, the dir doesn't exist there. The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm is already a niche profile (few users compared to GPU vllm) and the alternative is a backend that crashes at first inference unless the operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables all torch.compile optimizations. Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the smoke now exercises the real compile path through the bundled toolchain. Test runtime is +20s for the warmup compile, still <90s end to end. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io, which hosts the L4T-specific torch / vllm / flash-attn wheels but also transparently proxies the rest of PyPI through `/+f/<sha>/<filename>` URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match` uv would pick those proxy URLs for ordinary PyPI packages — anthropic/openai/propcache/annotated-types — and fail when the proxy 503s. Master is hitting the same bug on its own l4t-vllm matrix entry. Switch the l4t13 install path to a pyproject.toml that marks the jetson-ai-lab index `explicit = true` and pins only torch, torchvision, torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't consult the L4T mirror for anything else, so transitive deps fall back to PyPI as the default index — no exposure to the proxy 503s. `uv pip install -r requirements.txt` ignores [tool.uv.sources], so the l4t13 branch in install.sh now invokes `uv pip install --requirement pyproject.toml` directly, replacing the old requirements-l4t13*.txt files. Other BUILD_PROFILEs continue using libbackend.sh's installRequirements and never read pyproject.toml. Local resolution test (x86_64, dry-run) confirms uv hits the L4T index for torch and falls through to PyPI for everything else. Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-05 22:22:50 +00:00
# CPU profile only: bundle a g++ toolchain so torch._inductor's
# ISA probe (always run at vllm engine startup, regardless of
# enforce_eager) finds a C++ compiler. The LocalAI runtime image
# is FROM ubuntu:24.04 with a minimal apt list that does not
# include build-essential, and the backend image itself is FROM
# scratch -- so without this, cpu-vllm crashes with
# torch._inductor.exc.InvalidCxxCompiler at first inference
# unless the operator manually sets TORCH_COMPILE_DISABLE=1.
#
# We snapshot every file owned by the toolchain packages, mirroring
# the /usr/... layout into ${BACKEND}/toolchain/ so g++ can find
# cc1plus, headers, libs etc. via GCC_EXEC_PREFIX / CPATH /
# LIBRARY_PATH at runtime (libbackend.sh wires those up). Adds
# ~400 MB to the cpu-vllm image, which is tolerable -- cpu-vllm is
# already a niche profile.
if [ "${BUILD_TYPE:-}" = "" ] && command -v dpkg-query >/dev/null 2>&1; then
TOOLCHAIN_DIR="${CURDIR}/toolchain"
mkdir -p "${TOOLCHAIN_DIR}"
# The unversioned g++/gcc packages on Debian/Ubuntu only ship
# symlinks; the actual binaries live in g++-${VER}/gcc-${VER}.
# Discover the active version so the symlink targets get bundled
# along with their owners.
GCC_VER=$(gcc -dumpversion 2>/dev/null | cut -d. -f1 || true)
# `g++-${VER}` itself is just another symlink layer on Debian/
# Ubuntu — the real binary `x86_64-linux-gnu-g++-${VER}` lives
# in `g++-${VER}-x86-64-linux-gnu` (a separate package pulled in
# as a dependency). Same story for gcc/cpp. Compute the dpkg
# arch-triplet to find the right package name for both amd64 and
# arm64 hosts.
case "$(dpkg --print-architecture 2>/dev/null)" in
amd64) HOST_TRIPLET="x86-64-linux-gnu" ;;
arm64) HOST_TRIPLET="aarch64-linux-gnu" ;;
*) HOST_TRIPLET="" ;;
esac
PKGS=(g++ gcc cpp libstdc++-${GCC_VER}-dev libgcc-${GCC_VER}-dev libc6 libc6-dev binutils binutils-common libbinutils libc-dev-bin linux-libc-dev libcrypt-dev libgomp1 libstdc++6 libgcc-s1 libisl23 libmpc3 libmpfr6 libjansson4 libctf0 libctf-nobfd0 libsframe1)
if [ -n "${GCC_VER}" ]; then
PKGS+=("g++-${GCC_VER}" "gcc-${GCC_VER}" "cpp-${GCC_VER}" "gcc-${GCC_VER}-base")
if [ -n "${HOST_TRIPLET}" ]; then
PKGS+=(
"g++-${GCC_VER}-${HOST_TRIPLET}"
"gcc-${GCC_VER}-${HOST_TRIPLET}"
"cpp-${GCC_VER}-${HOST_TRIPLET}"
"binutils-${HOST_TRIPLET}"
)
fi
fi
for pkg in "${PKGS[@]}"; do
if ! dpkg-query -W "${pkg}" >/dev/null 2>&1; then
continue
fi
# Copy each owned path, preserving symlinks and mode. We
# tolerate dpkg listing directories alongside files.
dpkg -L "${pkg}" | while IFS= read -r path; do
if [ -L "${path}" ] || [ -f "${path}" ]; then
mkdir -p "${TOOLCHAIN_DIR}$(dirname "${path}")"
cp -aP "${path}" "${TOOLCHAIN_DIR}${path}" 2>/dev/null || true
fi
done
done
# Ubuntu's filesystem layout has /lib -> /usr/lib (UsrMerge) and
# /lib64 -> /usr/lib64. ld scripts (e.g. libm.so) hardcode
# `/lib/x86_64-linux-gnu/libm.so.6`; with --sysroot the linker
# looks for that path under the sysroot, which means we need
# the same symlinks under TOOLCHAIN_DIR.
[ -e "${TOOLCHAIN_DIR}/lib" ] || ln -s usr/lib "${TOOLCHAIN_DIR}/lib"
[ -e "${TOOLCHAIN_DIR}/lib64" ] || ln -s usr/lib64 "${TOOLCHAIN_DIR}/lib64"
# Replace the unversioned g++/gcc/cpp symlinks with wrapper
# scripts that pass --sysroot=<toolchain> and -B <gcc-exec-prefix>.
# Without these flags gcc would fall back to its compiled-in
# /usr search and fail to find headers (the runtime image has no
# libc6-dev) or fail to invoke `as`/`ld` (binutils not on PATH at
# /usr/bin). Wrappers self-resolve their location at runtime so
# they work from any BackendsPath.
BIN_DIR="${TOOLCHAIN_DIR}/usr/bin"
if [ -n "${GCC_VER}" ] && [ -n "${HOST_TRIPLET}" ]; then
# HOST_TRIPLET in package names uses dashes ("x86-64-linux-gnu");
# the binary suffix uses underscores in the arch part
# ("x86_64-linux-gnu-g++-13"). Translate.
BIN_TRIPLET=${HOST_TRIPLET//x86-64/x86_64}
for tool in g++ gcc cpp; do
real="${BIN_DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}"
if [ -x "${real}" ]; then
rm -f "${BIN_DIR}/${tool}" "${BIN_DIR}/${tool}-${GCC_VER}"
cat > "${BIN_DIR}/${tool}" <<EOF
#!/bin/bash
# Auto-generated by package.sh. Passes --sysroot and -B so the
# bundled toolchain works from any BackendsPath without depending
# on libc6-dev / binutils being installed at /usr in the runtime
# image. See backend/python/vllm/package.sh.
DIR="\$(dirname "\$(readlink -f "\$0")")" # …/toolchain/usr/bin
SYSROOT="\$(dirname "\$(dirname "\${DIR}")")" # …/toolchain
exec "\${DIR}/${BIN_TRIPLET}-${tool}-${GCC_VER}" \\
-B "\${SYSROOT}/usr/lib/gcc/${BIN_TRIPLET}/${GCC_VER}/" \\
--sysroot="\${SYSROOT}" \\
"\$@"
EOF
chmod +x "${BIN_DIR}/${tool}"
fi
done
fi
echo "Bundled g++ toolchain (gcc-${GCC_VER}) into ${TOOLCHAIN_DIR} ($(du -sh "${TOOLCHAIN_DIR}" | cut -f1))"
fi
feat(vllm): parity with llama.cpp backend (#9328) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.
2026-04-13 09:00:29 +00:00
echo "vllm packaging completed successfully"
ls -liah "${LIB_DIR}/"