LocalAI/Makefile

# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
GOCMD=go
GOTEST=$(GOCMD) test
GOVET=$(GOCMD) vet
BINARY_NAME=local-ai
LAUNCHER_BINARY_NAME=local-ai-launcher
UBUNTU_VERSION?=2404
UBUNTU_CODENAME?=noble
GORELEASER?=
export BUILD_TYPE?=
export CUDA_MAJOR_VERSION?=13
export CUDA_MINOR_VERSION?=0
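# Illustrative invocations (a sketch, not an exhaustive list): BUILD_TYPE selects the
# acceleration backend and the CUDA_*_VERSION variables are typically only relevant for
# CUDA builds. The values below are examples and assume the matching toolchain is installed:
#   make build                                                  # default CPU build
#   make BUILD_TYPE=cublas CUDA_MAJOR_VERSION=12 CUDA_MINOR_VERSION=9 build
#   make BUILD_TYPE=hipblas build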
GO_TAGS?=
BUILD_ID?=
NATIVE?=false
TEST_DIR=/tmp/test
TEST_FLAKES?=5
RANDOM := $(shell bash -c 'echo $$RANDOM')
VERSION?=$(shell git describe --always --tags || echo "dev" )
# go tool nm ./local-ai | grep Commit
LD_FLAGS?=-s -w
override LD_FLAGS += -X "github.com/mudler/LocalAI/internal.Version=$(VERSION)"
override LD_FLAGS += -X "github.com/mudler/LocalAI/internal.Commit=$(shell git rev-parse HEAD)"
OPTIONAL_TARGETS?=
export OS := $(shell uname -s)
ARCH := $(shell uname -m)
GREEN := $(shell tput -Txterm setaf 2)
YELLOW := $(shell tput -Txterm setaf 3)
WHITE := $(shell tput -Txterm setaf 7)
CYAN := $(shell tput -Txterm setaf 6)
RESET := $(shell tput -Txterm sgr0)
# Default Docker bridge IP
E2E_BRIDGE_IP?=172.17.0.1
ifndef UNAME_S
UNAME_S := $(shell uname -s)
endif
ifeq ($(OS),Darwin)
ifeq ($(OSX_SIGNING_IDENTITY),)
OSX_SIGNING_IDENTITY := $(shell security find-identity -v -p codesigning | grep '"' | head -n 1 | sed -E 's/.*"(.*)"/\1/')
endif
endif
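# On Darwin the first available codesigning identity is auto-detected above; it can also
# be set explicitly when invoking the osx-signed target below (the identity string here
# is a placeholder):
#   make OSX_SIGNING_IDENTITY="Developer ID Application: Example Corp" osx-signed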
# check if goreleaser exists
ifeq (, $(shell which goreleaser))
GORELEASER=curl -sfL https://goreleaser.com/static/run | bash -s --
else
GORELEASER=$(shell which goreleaser)
endif
TEST_PATHS?=./api/... ./pkg/... ./core/...
.PHONY: all test build vendor
all: help
## GENERIC
rebuild: ## Rebuilds the project
	$(GOCMD) clean -cache
	$(MAKE) build
clean: ## Remove build related file
	$(GOCMD) clean -cache
	rm -f prepare
	rm -rf $(BINARY_NAME)
	rm -rf release/
	$(MAKE) protogen-clean
	rmdir pkg/grpc/proto || true
clean-tests:
	rm -rf test-models
	rm -rf test-dir
## Install Go tools
install-go-tools:
	go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
	go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
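# These protoc plugins are prerequisites for the protogen-go step used by the build and
# test targets; they install into GOBIN (or $GOPATH/bin). For example:
#   make install-go-tools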
## React UI:
react-ui:
ifneq ($(wildcard core/http/react-ui/dist),)
	@echo "react-ui dist already exists, skipping build"
else
	cd core/http/react-ui && npm install && npm run build
endif
react-ui-docker:
	docker run --entrypoint /bin/bash -v $(CURDIR):/app:z oven/bun:1 \
		-c "cd /app/core/http/react-ui && bun install && bun run build"
core/http/react-ui/dist: react-ui
## Build:
build: protogen-go generate install-go-tools core/http/react-ui/dist ## Build the project
	$(info ${GREEN}I local-ai build info:${RESET})
	$(info ${GREEN}I BUILD_TYPE: ${YELLOW}$(BUILD_TYPE)${RESET})
	$(info ${GREEN}I GO_TAGS: ${YELLOW}$(GO_TAGS)${RESET})
	$(info ${GREEN}I LD_FLAGS: ${YELLOW}$(LD_FLAGS)${RESET})
	$(info ${GREEN}I UPX: ${YELLOW}$(UPX)${RESET})
	rm -rf $(BINARY_NAME) || true
	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o $(BINARY_NAME) ./cmd/local-ai
build-launcher: ## Build the launcher application
	$(info ${GREEN}I local-ai launcher build info:${RESET})
	$(info ${GREEN}I BUILD_TYPE: ${YELLOW}$(BUILD_TYPE)${RESET})
	$(info ${GREEN}I GO_TAGS: ${YELLOW}$(GO_TAGS)${RESET})
	$(info ${GREEN}I LD_FLAGS: ${YELLOW}$(LD_FLAGS)${RESET})
	rm -rf $(LAUNCHER_BINARY_NAME) || true
	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o $(LAUNCHER_BINARY_NAME) ./cmd/launcher
build-all: build build-launcher ## Build both server and launcher
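# Example build invocations (illustrative; the tag value below is just an example):
#   make build                    # server binary only
#   make GO_TAGS="p2p" build      # pass extra Go build tags through GO_TAGS
#   make build-all                # server plus the launcher app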
build-dev: ## Run LocalAI in dev mode with live reload
	@command -v air >/dev/null 2>&1 || go install github.com/air-verse/air@latest
	air -c .air.toml
dev-dist:
	$(GORELEASER) build --snapshot --clean
dist:
	$(GORELEASER) build --clean
osx-signed: build
	codesign --deep --force --sign "$(OSX_SIGNING_IDENTITY)" --entitlements "./Entitlements.plist" "./$(BINARY_NAME)"
## Run
run: ## run local-ai
	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./
test-models/testmodel.ggml:
	mkdir -p test-models
	mkdir -p test-dir
	wget -q https://huggingface.co/mradermacher/gpt2-alpaca-gpt4-GGUF/resolve/main/gpt2-alpaca-gpt4.Q4_K_M.gguf -O test-models/testmodel.ggml
	wget -q https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O test-models/whisper-en
	wget -q https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav -O test-dir/audio.wav
	cp tests/models_fixtures/* test-models
prepare-test: protogen-go
	cp tests/models_fixtures/* test-models
########################################################
## Tests
########################################################
## Test targets
test: test-models/testmodel.ggml protogen-go
	@echo 'Running tests'
	export GO_TAGS="debug"
	$(MAKE) prepare-test
	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
	HUGGINGFACE_GRPC=$(abspath ./)/backend/python/transformers/run.sh TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="!llama-gguf" --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
	$(MAKE) test-llama-gguf
	$(MAKE) test-tts
	$(MAKE) test-stablediffusion
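# Typical test invocations (illustrative; TEST_FLAKES and TEST_PATHS are the variables
# defined at the top of this Makefile):
#   make test                                  # full suite, fetches test models first
#   make TEST_FLAKES=1 TEST_PATHS=./core/... test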
########################################################
## E2E AIO tests (uses standard image with pre-configured models)
########################################################
docker-build-e2e:
	docker build \
		--build-arg MAKEFLAGS="--jobs=5 --output-sync=target" \
		--build-arg BASE_IMAGE=$(BASE_IMAGE) \
		--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
		--build-arg BUILD_TYPE=$(BUILD_TYPE) \
		--build-arg CUDA_MAJOR_VERSION=$(CUDA_MAJOR_VERSION) \
		--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
		--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
		--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
		--build-arg GO_TAGS="$(GO_TAGS)" \
		-t local-ai:tests -f Dockerfile .
e2e-aio:
LOCALAI_BACKEND_DIR=$(abspath ./backends) \
LOCALAI_MODELS_DIR=$(abspath ./tests/e2e-aio/models) \
LOCALAI_IMAGE_TAG=tests \
LOCALAI_IMAGE=local-ai \
$(MAKE) run-e2e-aio
run-e2e-aio: protogen-go
@echo 'Running e2e AIO tests'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio
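# The AIO flow is two-step: `e2e-aio` exports the image, model and backend paths
# above and then re-invokes `run-e2e-aio`, which only needs the Go protobuf stubs.
# Example invocation (the TEST_FLAKES value is only illustrative):
#   make e2e-aio TEST_FLAKES=3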
########################################################
## E2E tests
########################################################
prepare-e2e:
docker build \
--build-arg IMAGE_TYPE=core \
--build-arg BUILD_TYPE=$(BUILD_TYPE) \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--build-arg CUDA_MAJOR_VERSION=$(CUDA_MAJOR_VERSION) \
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
-t localai-tests .
run-e2e-image:
docker run -p 5390:8080 -e MODELS_PATH=/models -e THREADS=1 -e DEBUG=true -d --rm -v $(TEST_DIR):/models --name e2e-tests-$(RANDOM) localai-tests
test-e2e: build-mock-backend prepare-e2e run-e2e-image
@echo 'Running e2e tests'
BUILD_TYPE=$(BUILD_TYPE) \
LOCALAI_API=http://$(E2E_BRIDGE_IP):5390 \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
$(MAKE) clean-mock-backend
$(MAKE) teardown-e2e
docker rmi localai-tests
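# Sketch of a full local run. E2E_BRIDGE_IP must resolve to the address on which
# the container started by run-e2e-image publishes port 5390; 172.17.0.1 below is
# just the usual Docker bridge address, not a requirement:
#   make test-e2e E2E_BRIDGE_IP=172.17.0.1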
teardown-e2e:
rm -rf $(TEST_DIR) || true
docker stop $$(docker ps -q --filter ancestor=localai-tests)
########################################################
## Integration and unit tests
########################################################
test-llama-gguf: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="llama-gguf" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-tts: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="tts" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-stablediffusion: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stablediffusion" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-stores:
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stores" --flake-attempts $(TEST_FLAKES) -v -r tests/integration
test-opus:
@echo 'Running opus backend tests'
$(MAKE) -C backend/go/opus libopusshim.so
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./backend/go/opus/...
test-opus-docker:
@echo 'Running opus backend tests in Docker'
docker build --target builder \
--build-arg BUILD_TYPE=$(or $(BUILD_TYPE),) \
--build-arg BASE_IMAGE=$(or $(BASE_IMAGE),ubuntu:24.04) \
--build-arg BACKEND=opus \
-t localai-opus-test -f backend/Dockerfile.golang .
docker run --rm localai-opus-test \
bash -c 'cd /LocalAI && go run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./backend/go/opus/...'
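# test-opus builds the libopusshim.so shim and runs the suite with the host Go
# toolchain, while test-opus-docker runs the same Ginkgo suite inside the
# backend/Dockerfile.golang builder stage. Example (the image shown is just the
# default used above):
#   make test-opus-docker BASE_IMAGE=ubuntu:24.04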
test-realtime: build-mock-backend
@echo 'Running realtime e2e tests (mock backend)'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime && !real-models" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
# Real-model realtime tests. Set REALTIME_TEST_MODEL to use your own pipeline,
# or leave unset to auto-build one from the component env vars below.
REALTIME_VAD?=silero-vad-ggml
REALTIME_STT?=whisper-1
REALTIME_LLM?=qwen3-0.6b
REALTIME_TTS?=tts-1
REALTIME_BACKENDS_PATH?=$(abspath ./)/backends
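# Example: point the suite at an existing pipeline instead of auto-building one
# from the variables above (the model name here is hypothetical):
#   REALTIME_TEST_MODEL=my-realtime-pipeline make test-realtime-models
# When REALTIME_TEST_MODEL is unset, the suite falls back to "realtime-test-pipeline"
# built from REALTIME_VAD/REALTIME_STT/REALTIME_LLM/REALTIME_TTS.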
test-realtime-models: build-mock-backend
@echo 'Running realtime e2e tests (real models)'
REALTIME_TEST_MODEL=$${REALTIME_TEST_MODEL:-realtime-test-pipeline} \
REALTIME_VAD=$(REALTIME_VAD) \
REALTIME_STT=$(REALTIME_STT) \
REALTIME_LLM=$(REALTIME_LLM) \
REALTIME_TTS=$(REALTIME_TTS) \
REALTIME_BACKENDS_PATH=$(REALTIME_BACKENDS_PATH) \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
# --- Container-based real-model testing ---
REALTIME_BACKEND_NAMES ?= silero-vad whisper llama-cpp kokoro
REALTIME_MODELS_DIR ?= $(abspath ./models)
REALTIME_BACKENDS_DIR ?= $(abspath ./local-backends)
REALTIME_DOCKER_FLAGS ?= --gpus all
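# Typical container-based flow; overriding REALTIME_DOCKER_FLAGS with an empty
# value is one way to run on a host without GPUs (the default requests all GPUs):
#   make extract-realtime-backends
#   make test-realtime-models-docker REALTIME_DOCKER_FLAGS=""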
local-backends:
mkdir -p local-backends
extract-backend-%: docker-build-% local-backends
@echo "Extracting backend $*..."
@CID=$$(docker create local-ai-backend:$*) && \
rm -rf local-backends/$* && mkdir -p local-backends/$* && \
docker cp $$CID:/ - | tar -xf - -C local-backends/$* && \
docker rm $$CID > /dev/null
extract-realtime-backends: $(addprefix extract-backend-,$(REALTIME_BACKEND_NAMES))
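# extract-backend-% builds the backend image via its docker-build-% prerequisite
# and unpacks the container filesystem into local-backends/<name>. For example,
# with "whisper" from REALTIME_BACKEND_NAMES above:
#   make extract-backend-whisper   # populates local-backends/whisper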
test-realtime-models-docker: build-mock-backend
docker build --target build-requirements \
--build-arg BUILD_TYPE=$(or $(BUILD_TYPE),cublas) \
--build-arg CUDA_MAJOR_VERSION=$(or $(CUDA_MAJOR_VERSION),13) \
--build-arg CUDA_MINOR_VERSION=$(or $(CUDA_MINOR_VERSION),0) \
-t localai-test-runner .
docker run --rm \
$(REALTIME_DOCKER_FLAGS) \
-v $(abspath ./):/build \
-v $(REALTIME_MODELS_DIR):/models:ro \
-v $(REALTIME_BACKENDS_DIR):/backends \
-v localai-go-cache:/root/go/pkg/mod \
-v localai-go-build-cache:/root/.cache/go-build \
-e REALTIME_TEST_MODEL=$${REALTIME_TEST_MODEL:-realtime-test-pipeline} \
-e REALTIME_VAD=$(REALTIME_VAD) \
-e REALTIME_STT=$(REALTIME_STT) \
-e REALTIME_LLM=$(REALTIME_LLM) \
-e REALTIME_TTS=$(REALTIME_TTS) \
-e REALTIME_BACKENDS_PATH=/backends \
-e REALTIME_MODELS_PATH=/models \
-w /build \
localai-test-runner \
bash -c 'git config --global --add safe.directory /build && \
make protogen-go && make build-mock-backend && \
go run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e'
test-container:
docker build --target requirements -t local-ai-test-container .
docker run -ti --rm --entrypoint /bin/bash -v $(abspath ./):/build local-ai-test-container
########################################################
## Help
########################################################
## Help:
help: ## Show this help.
@echo ''
@echo 'Usage:'
@echo ' ${YELLOW}make${RESET} ${GREEN}<target>${RESET}'
@echo ''
@echo 'Targets:'
@awk 'BEGIN {FS = ":.*?## "} { \
if (/^[a-zA-Z_-]+:.*?##.*$$/) {printf " ${YELLOW}%-20s${GREEN}%s${RESET}\n", $$1, $$2} \
else if (/^## .*$$/) {printf " ${CYAN}%s${RESET}\n", substr($$1,4)} \
}' $(MAKEFILE_LIST)
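# Only rules annotated with a trailing "## description" on the rule line are
# listed, and "## Section" comment lines become headers. A (hypothetical) entry
# would be documented as:
#   my-target: deps ## One-line summary shown by `make help`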
########################################################
## Backends
########################################################
.PHONY: protogen
protogen: protogen-go
protoc:
@OS_NAME=$$(uname -s | tr '[:upper:]' '[:lower:]'); \
ARCH_NAME=$$(uname -m); \
if [ "$$OS_NAME" = "darwin" ]; then \
if [ "$$ARCH_NAME" = "arm64" ]; then \
FILE=protoc-31.1-osx-aarch_64.zip; \
elif [ "$$ARCH_NAME" = "x86_64" ]; then \
FILE=protoc-31.1-osx-x86_64.zip; \
else \
echo "Unsupported macOS architecture: $$ARCH_NAME"; exit 1; \
fi; \
elif [ "$$OS_NAME" = "linux" ]; then \
if [ "$$ARCH_NAME" = "x86_64" ]; then \
FILE=protoc-31.1-linux-x86_64.zip; \
elif [ "$$ARCH_NAME" = "aarch64" ] || [ "$$ARCH_NAME" = "arm64" ]; then \
FILE=protoc-31.1-linux-aarch_64.zip; \
elif [ "$$ARCH_NAME" = "ppc64le" ]; then \
FILE=protoc-31.1-linux-ppcle_64.zip; \
elif [ "$$ARCH_NAME" = "s390x" ]; then \
FILE=protoc-31.1-linux-s390_64.zip; \
elif [ "$$ARCH_NAME" = "i386" ] || [ "$$ARCH_NAME" = "x86" ]; then \
FILE=protoc-31.1-linux-x86_32.zip; \
else \
echo "Unsupported Linux architecture: $$ARCH_NAME"; exit 1; \
fi; \
else \
echo "Unsupported OS: $$OS_NAME"; exit 1; \
fi; \
URL=https://github.com/protocolbuffers/protobuf/releases/download/v31.1/$$FILE; \
curl -L $$URL -o protoc.zip && \
unzip -j -d $(CURDIR) protoc.zip bin/protoc && rm protoc.zip
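# The protoc target drops a pinned protoc 31.1 binary at the repository root
# (./protoc), which protogen-go invokes directly instead of relying on a
# system-wide protoc; re-run `make protoc` to refresh it.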
.PHONY: protogen-go
protogen-go: protoc install-go-tools
mkdir -p pkg/grpc/proto
./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
backend/backend.proto
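# Generated Go stubs land in pkg/grpc/proto/ (removed again by protogen-go-clean
# below); backend/backend.proto is the single source .proto definition.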
core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
$(GOCMD) generate ./core/config/...
.PHONY: generate
generate: core/config/inference_defaults.json ## Ensure inference defaults exist
.PHONY: generate-force
generate-force: ## Re-fetch inference defaults from unsloth (always)
$(GOCMD) generate ./core/config/...
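# core/config/inference_defaults.json is a file target, so `make generate` only
# fetches it when the file is missing; use the always-run variant to refresh it:
#   make generate-force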
.PHONY: protogen-go-clean
protogen-go-clean:
$(RM) pkg/grpc/proto/backend.pb.go pkg/grpc/proto/backend_grpc.pb.go
$(RM) bin/*
prepare-test-extra: protogen-python
$(MAKE) -C backend/python/transformers
$(MAKE) -C backend/python/outetts
$(MAKE) -C backend/python/diffusers
$(MAKE) -C backend/python/chatterbox
$(MAKE) -C backend/python/vllm
$(MAKE) -C backend/python/vllm-omni
$(MAKE) -C backend/python/sglang
$(MAKE) -C backend/python/vibevoice
$(MAKE) -C backend/python/moonshine
$(MAKE) -C backend/python/pocket-tts
$(MAKE) -C backend/python/qwen-tts
$(MAKE) -C backend/python/fish-speech
$(MAKE) -C backend/python/faster-qwen3-tts
$(MAKE) -C backend/python/qwen-asr
$(MAKE) -C backend/python/nemo
$(MAKE) -C backend/python/voxcpm
$(MAKE) -C backend/python/faster-whisper
$(MAKE) -C backend/python/whisperx
$(MAKE) -C backend/python/ace-step
$(MAKE) -C backend/python/trl
$(MAKE) -C backend/python/tinygrad
$(MAKE) -C backend/rust/kokoros kokoros-grpc
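# test-extra runs each backend's own `make test` after prepare-test-extra has
# prepared the per-backend builds in the directories above.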
test-extra: prepare-test-extra
$(MAKE) -C backend/python/transformers test
$(MAKE) -C backend/python/outetts test
$(MAKE) -C backend/python/diffusers test
$(MAKE) -C backend/python/chatterbox test
$(MAKE) -C backend/python/vllm test
$(MAKE) -C backend/python/vllm-omni test
$(MAKE) -C backend/python/vibevoice test
$(MAKE) -C backend/python/moonshine test
$(MAKE) -C backend/python/pocket-tts test
$(MAKE) -C backend/python/qwen-tts test
$(MAKE) -C backend/python/fish-speech test
$(MAKE) -C backend/python/faster-qwen3-tts test
$(MAKE) -C backend/python/qwen-asr test
$(MAKE) -C backend/python/nemo test
$(MAKE) -C backend/python/voxcpm test
$(MAKE) -C backend/python/faster-whisper test
$(MAKE) -C backend/python/whisperx test
$(MAKE) -C backend/python/ace-step test
$(MAKE) -C backend/python/trl test
$(MAKE) -C backend/python/tinygrad test
$(MAKE) -C backend/rust/kokoros test
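## The test-extra recipe above simply fans out to each backend's own `test` target,
## so a single backend can be exercised in isolation. An illustrative pick (any of
## the directories listed above works the same way):
##
##   make -C backend/python/transformers test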
##
## End-to-end gRPC tests that exercise a built backend container image.
##
## The test suite in tests/e2e-backends is backend-agnostic. You drive it via env
## vars (see tests/e2e-backends/backend_test.go for the full list) and the
## capability-driven harness picks which gRPC RPCs to exercise:
##
## BACKEND_IMAGE Required. Docker image to test, e.g. local-ai-backend:llama-cpp.
## BACKEND_TEST_MODEL_URL URL of a model file to download and load.
## BACKEND_TEST_MODEL_FILE Path to an already-downloaded model (skips download).
## BACKEND_TEST_MODEL_NAME HuggingFace repo id (e.g. Qwen/Qwen2.5-0.5B-Instruct).
## Use this instead of MODEL_URL for backends that
## resolve HF model ids natively (vllm, vllm-omni).
## BACKEND_TEST_CAPS Comma-separated capabilities, default "health,load,predict,stream".
## Adds "tools" to exercise ChatDelta tool call extraction.
## BACKEND_TEST_PROMPT Override the prompt used in predict/stream specs.
## BACKEND_TEST_OPTIONS Comma-separated Options[] entries forwarded to LoadModel,
## e.g. "tool_parser:hermes,reasoning_parser:qwen3".
##
## Direct usage (image already built, no docker-build-* dependency):
##
## make test-extra-backend BACKEND_IMAGE=local-ai-backend:llama-cpp \
## BACKEND_TEST_MODEL_URL=https://.../model.gguf
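##
## Example with an HF model id and backend-specific Options[] (the image tag is
## illustrative; the model, caps and options mirror the vllm e2e wiring):
##
## make test-extra-backend BACKEND_IMAGE=local-ai-backend:vllm \
## BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
## BACKEND_TEST_CAPS=health,load,predict,stream,tools \
## BACKEND_TEST_OPTIONS="tool_parser:hermes,reasoning_parser:qwen3"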
##
## Convenience wrappers below build a specific backend image first, then run the
## suite against it.
##
BACKEND_TEST_MODEL_URL?=https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf
## Generic target — runs the suite against whatever BACKEND_IMAGE points at.
## Depends on protogen-go so pkg/grpc/proto is generated before `go test`.
test-extra-backend: protogen-go
@test -n "$$BACKEND_IMAGE" || { echo "BACKEND_IMAGE must be set" >&2; exit 1; }
BACKEND_IMAGE="$$BACKEND_IMAGE" \
BACKEND_TEST_MODEL_URL="$${BACKEND_TEST_MODEL_URL:-$(BACKEND_TEST_MODEL_URL)}" \
BACKEND_TEST_MODEL_FILE="$$BACKEND_TEST_MODEL_FILE" \
BACKEND_TEST_MODEL_NAME="$$BACKEND_TEST_MODEL_NAME" \
BACKEND_TEST_MMPROJ_URL="$$BACKEND_TEST_MMPROJ_URL" \
BACKEND_TEST_MMPROJ_FILE="$$BACKEND_TEST_MMPROJ_FILE" \
BACKEND_TEST_AUDIO_URL="$$BACKEND_TEST_AUDIO_URL" \
BACKEND_TEST_AUDIO_FILE="$$BACKEND_TEST_AUDIO_FILE" \
BACKEND_TEST_CAPS="$$BACKEND_TEST_CAPS" \
BACKEND_TEST_PROMPT="$$BACKEND_TEST_PROMPT" \
BACKEND_TEST_OPTIONS="$$BACKEND_TEST_OPTIONS" \
BACKEND_TEST_TOOL_PROMPT="$$BACKEND_TEST_TOOL_PROMPT" \
BACKEND_TEST_TOOL_NAME="$$BACKEND_TEST_TOOL_NAME" \
BACKEND_TEST_CACHE_TYPE_K="$$BACKEND_TEST_CACHE_TYPE_K" \
BACKEND_TEST_CACHE_TYPE_V="$$BACKEND_TEST_CACHE_TYPE_V" \
go test -v -timeout 30m ./tests/e2e-backends/...
## Convenience wrappers: build the image, then exercise it.
test-extra-backend-llama-cpp: docker-build-llama-cpp
BACKEND_IMAGE=local-ai-backend:llama-cpp $(MAKE) test-extra-backend
test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend
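## The default Qwen3-0.6B test model can be swapped per invocation; a make
## command-line assignment should flow through the recursive $(MAKE)
## (URL elided, as in the direct-usage example above):
## make test-extra-backend-llama-cpp BACKEND_TEST_MODEL_URL=https://.../other.gguf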
## turboquant: exercises the llama.cpp-fork backend with the fork's
## *TurboQuant-specific* KV-cache types (turbo3 for both K and V). turbo3
## is what makes this backend distinct from stock llama-cpp — picking q8_0
## here would only test the standard llama.cpp code path that the upstream
## llama-cpp backend already covers. The fork auto-enables flash_attention
## when turbo3/turbo4 are active, so we don't need to set it explicitly.
test-extra-backend-turboquant: docker-build-turboquant
BACKEND_IMAGE=local-ai-backend:turboquant \
BACKEND_TEST_CACHE_TYPE_K=turbo3 \
BACKEND_TEST_CACHE_TYPE_V=turbo3 \
$(MAKE) test-extra-backend
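## The fork also accepts turbo2/turbo4; to try another level without editing
## this wrapper, drive the generic target directly (illustrative values):
## make test-extra-backend BACKEND_IMAGE=local-ai-backend:turboquant \
## BACKEND_TEST_CACHE_TYPE_K=turbo4 BACKEND_TEST_CACHE_TYPE_V=turbo4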
## Audio transcription wrapper for the llama-cpp backend.
## Drives the new AudioTranscription / AudioTranscriptionStream RPCs against
## ggml-org/Qwen3-ASR-0.6B-GGUF (a small ASR model that requires its mmproj
## audio encoder companion). The audio fixture is a short public-domain
## "jfk.wav" clip ggml-org bundles with whisper.cpp's CI assets.
test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
BACKEND_IMAGE=local-ai-backend:llama-cpp \
BACKEND_TEST_MODEL_URL=https://huggingface.co/ggml-org/Qwen3-ASR-0.6B-GGUF/resolve/main/Qwen3-ASR-0.6B-Q8_0.gguf \
BACKEND_TEST_MMPROJ_URL=https://huggingface.co/ggml-org/Qwen3-ASR-0.6B-GGUF/resolve/main/mmproj-Qwen3-ASR-0.6B-Q8_0.gguf \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
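## Any other ASR GGUF + mmproj pair (or audio clip) can be exercised by driving
## the generic target directly with the forwarded variables (URLs elided, as in
## the direct-usage example above):
## make test-extra-backend BACKEND_IMAGE=local-ai-backend:llama-cpp \
## BACKEND_TEST_MODEL_URL=https://.../asr-model.gguf \
## BACKEND_TEST_MMPROJ_URL=https://.../mmproj.gguf \
## BACKEND_TEST_AUDIO_URL=https://.../clip.wav \
## BACKEND_TEST_CAPS=health,load,transcription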
2026-04-13 09:00:29 +00:00
## vllm resolves its test model from a HuggingFace model id (no file download) and
## exercises Predict + streaming + tool-call extraction via the hermes parser.
## Requires a host CPU with the SIMD instructions the prebuilt vllm CPU
## wheel was compiled against (AVX-512 VNNI/BF16); older CPUs will SIGILL
## on import — on CI this means using the bigger-runner label.
test-extra-backend-vllm: docker-build-vllm
BACKEND_IMAGE=local-ai-backend:vllm \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_OPTIONS=tool_parser:hermes \
$(MAKE) test-extra-backend
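## The target above is a thin wrapper around the generic `test-extra-backend`
## runner: it only exports the BACKEND_TEST_* knobs and recurses. As an
## illustrative sketch (assuming the backend image was already built via
## `make docker-build-vllm`), an ad-hoc run with a narrower capability set
## can reuse the same pattern directly:
##   BACKEND_IMAGE=local-ai-backend:vllm \
##   BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
##   BACKEND_TEST_CAPS=health,load,predict \
##   make test-extra-backend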
## tinygrad mirrors the vllm target (same model, same caps, same parser) so
## the two backends are directly comparable. The LLM path covers Predict,
## streaming and native tool-call extraction. Companion targets below cover
## embeddings, Stable Diffusion and Whisper — run them individually or via
## the `test-extra-backend-tinygrad-all` aggregate.
test-extra-backend-tinygrad: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen3-0.6B \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_OPTIONS=tool_parser:hermes \
$(MAKE) test-extra-backend
## tinygrad — embeddings via LLM last-hidden-state pooling. Reuses the same
## Qwen3-0.6B as the chat target so we don't need a separate BERT vendor;
## the Embedding RPC mean-pools and L2-normalizes the last-layer hidden
## state.
test-extra-backend-tinygrad-embeddings: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen3-0.6B \
BACKEND_TEST_CAPS=health,load,embeddings \
$(MAKE) test-extra-backend
## tinygrad — Stable Diffusion 1.5. The original CompVis/runwayml repos have
## been gated, so we use the community-maintained mirror at
## stable-diffusion-v1-5/stable-diffusion-v1-5 with the EMA-only pruned
## checkpoint (~4.3GB). Step count is kept low (4) so a CPU-only run finishes
## in a few minutes; bump BACKEND_TEST_IMAGE_STEPS for higher quality.
test-extra-backend-tinygrad-sd: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=stable-diffusion-v1-5/stable-diffusion-v1-5 \
BACKEND_TEST_CAPS=health,load,image \
$(MAKE) test-extra-backend
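## BACKEND_TEST_IMAGE_STEPS is deliberately not pinned in the recipe above.
## Assuming the e2e harness picks it up from the environment like the other
## BACKEND_TEST_* knobs, an illustrative higher-quality (and much slower on
## CPU) run would be:
##   BACKEND_TEST_IMAGE_STEPS=20 make test-extra-backend-tinygrad-sd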
## tinygrad — Whisper. Loads OpenAI's tiny.en checkpoint (smallest at ~75MB)
## from the original Azure CDN through tinygrad's `fetch` helper, and
## transcribes the canonical jfk.wav fixture from whisper.cpp's CI samples.
## Exercises both AudioTranscription and AudioTranscriptionStream.
test-extra-backend-tinygrad-whisper: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=openai/whisper-tiny.en \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
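## The audio fixture is pinned inside the recipe above, so swapping it means
## invoking the generic runner directly (same pattern as the vllm example
## earlier in this file); the URL below is purely illustrative:
##   BACKEND_IMAGE=local-ai-backend:tinygrad \
##   BACKEND_TEST_MODEL_NAME=openai/whisper-tiny.en \
##   BACKEND_TEST_AUDIO_URL=https://example.com/short-clip.wav \
##   BACKEND_TEST_CAPS=health,load,transcription \
##   make test-extra-backend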
test-extra-backend-tinygrad-all: \
test-extra-backend-tinygrad \
test-extra-backend-tinygrad-embeddings \
test-extra-backend-tinygrad-sd \
test-extra-backend-tinygrad-whisper
## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
test-extra-backend-sglang: docker-build-sglang
BACKEND_IMAGE=local-ai-backend:sglang \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_OPTIONS=tool_parser:qwen \
$(MAKE) test-extra-backend
## mlx is Apple-Silicon-first — the MLX backend auto-detects the right tool
## parser from the chat template, so no tool_parser: option is needed (it
## would be ignored at runtime). Run this on macOS / arm64 with Metal; the
## Linux/CPU mlx variant is untested in CI.
test-extra-backend-mlx: docker-build-mlx
BACKEND_IMAGE=local-ai-backend:mlx \
BACKEND_TEST_MODEL_NAME=mlx-community/Qwen2.5-0.5B-Instruct-4bit \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
$(MAKE) test-extra-backend
test-extra-backend-mlx-vlm: docker-build-mlx-vlm
BACKEND_IMAGE=local-ai-backend:mlx-vlm \
BACKEND_TEST_MODEL_NAME=mlx-community/Qwen2.5-0.5B-Instruct-4bit \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
$(MAKE) test-extra-backend
DOCKER_IMAGE?=local-ai
IMAGE_TYPE?=core
BASE_IMAGE?=ubuntu:24.04
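## DOCKER_IMAGE, IMAGE_TYPE and BASE_IMAGE are `?=` defaults, so they can be
## overridden per invocation without editing the Makefile. Illustrative
## example (the tag is arbitrary, not a published image name):
##   make docker DOCKER_IMAGE=local-ai:dev IMAGE_TYPE=core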
docker:
docker build \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
--build-arg BUILD_TYPE=$(BUILD_TYPE) \
--build-arg CUDA_MAJOR_VERSION=$(CUDA_MAJOR_VERSION) \
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-t $(DOCKER_IMAGE) .
docker-cuda12:
docker build \
--build-arg CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
--build-arg CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
--build-arg BUILD_TYPE=$(BUILD_TYPE) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-t $(DOCKER_IMAGE)-cuda-12 .
docker-image-intel:
docker build \
--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04 \
--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
--build-arg BUILD_TYPE=intel \
--build-arg CUDA_MAJOR_VERSION=$(CUDA_MAJOR_VERSION) \
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
-t $(DOCKER_IMAGE) .
########################################################
## Backends
########################################################
# Pattern rule for standard backends (docker-based)
# This matches all backends that use docker-build-* and docker-save-*
backends/%: docker-build-% docker-save-% build
./local-ai backends install "ocifile://$(abspath ./backend-images/$*.tar)"
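# Example (illustrative): "make backends/whisper" first runs the docker-build-whisper and
# docker-save-whisper prerequisites, then installs the saved OCI archive from
# backend-images/whisper.tar through "local-ai backends install ocifile://...".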
# Darwin-specific backends (keep as explicit targets since they have special build logic)
backends/llama-cpp-darwin: build
bash ./scripts/build/llama-cpp-darwin.sh
./local-ai backends install "ocifile://$(abspath ./backend-images/llama-cpp.tar)"
build-darwin-python-backend: build
bash ./scripts/build/python-darwin.sh
build-darwin-go-backend: build
bash ./scripts/build/golang-darwin.sh
backends/mlx:
BACKEND=mlx $(MAKE) build-darwin-python-backend
./local-ai backends install "ocifile://$(abspath ./backend-images/mlx.tar)"
backends/diffuser-darwin:
BACKEND=diffusers $(MAKE) build-darwin-python-backend
./local-ai backends install "ocifile://$(abspath ./backend-images/diffusers.tar)"
backends/mlx-vlm:
BACKEND=mlx-vlm $(MAKE) build-darwin-python-backend
./local-ai backends install "ocifile://$(abspath ./backend-images/mlx-vlm.tar)"
backends/mlx-audio:
BACKEND=mlx-audio $(MAKE) build-darwin-python-backend
./local-ai backends install "ocifile://$(abspath ./backend-images/mlx-audio.tar)"
backends/mlx-distributed:
BACKEND=mlx-distributed $(MAKE) build-darwin-python-backend
./local-ai backends install "ocifile://$(abspath ./backend-images/mlx-distributed.tar)"
backends/stablediffusion-ggml-darwin:
BACKEND=stablediffusion-ggml BUILD_TYPE=metal $(MAKE) build-darwin-go-backend
./local-ai backends install "ocifile://$(abspath ./backend-images/stablediffusion-ggml.tar)"
backend-images:
mkdir -p backend-images
# Backend metadata: BACKEND_NAME | DOCKERFILE_TYPE | BUILD_CONTEXT | PROGRESS_FLAG | NEEDS_BACKEND_ARG
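# Example (illustrative): an entry such as "piper|golang|.|false|true" means the backend is
# named piper, is built with the golang Dockerfile variant from the repository-root context,
# uses no extra progress flag, and needs BACKEND=piper passed as a docker build argument.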
# llama-cpp is special - uses llama-cpp Dockerfile and doesn't need BACKEND arg
BACKEND_LLAMA_CPP = llama-cpp|llama-cpp|.|false|false
# ik-llama-cpp is a fork of llama.cpp with superior CPU performance
BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
BACKEND_LOCAL_STORE = local-store|golang|.|false|true
BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
BACKEND_OPUS = opus|golang|.|false|true
# Python backends with root context
BACKEND_RERANKERS = rerankers|python|.|false|true
BACKEND_TRANSFORMERS = transformers|python|.|false|true
BACKEND_OUTETTS = outetts|python|.|false|true
BACKEND_FASTER_WHISPER = faster-whisper|python|.|false|true
BACKEND_COQUI = coqui|python|.|false|true
BACKEND_RFDETR = rfdetr|python|.|false|true
BACKEND_KITTEN_TTS = kitten-tts|python|.|false|true
BACKEND_NEUTTS = neutts|python|.|false|true
BACKEND_KOKORO = kokoro|python|.|false|true
BACKEND_VLLM = vllm|python|.|false|true
BACKEND_VLLM_OMNI = vllm-omni|python|.|false|true
BACKEND_SGLANG = sglang|python|.|false|true
BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
BACKEND_CHATTERBOX = chatterbox|python|.|false|true
BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
BACKEND_MOONSHINE = moonshine|python|.|false|true
BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
BACKEND_FISH_SPEECH = fish-speech|python|.|false|true
BACKEND_FASTER_QWEN3_TTS = faster-qwen3-tts|python|.|false|true
BACKEND_QWEN_ASR = qwen-asr|python|.|false|true
BACKEND_NEMO = nemo|python|.|false|true
BACKEND_VOXCPM = voxcpm|python|.|false|true
BACKEND_WHISPERX = whisperx|python|.|false|true
BACKEND_ACE_STEP = ace-step|python|.|false|true
BACKEND_MLX = mlx|python|.|false|true
BACKEND_MLX_VLM = mlx-vlm|python|.|false|true
BACKEND_MLX_DISTRIBUTED = mlx-distributed|python|./|false|true
BACKEND_TRL = trl|python|.|false|true
BACKEND_LLAMA_CPP_QUANTIZATION = llama-cpp-quantization|python|.|false|true
BACKEND_TINYGRAD = tinygrad|python|.|false|true
# Rust backends
BACKEND_KOKOROS = kokoros|rust|.|false|true
# C++ backends (Go wrapper with purego)
BACKEND_SAM3_CPP = sam3-cpp|golang|.|false|true
# Helper function to build docker image for a backend
# Usage: $(call docker-build-backend,BACKEND_NAME,DOCKERFILE_TYPE,BUILD_CONTEXT,PROGRESS_FLAG,NEEDS_BACKEND_ARG)
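# Example (illustrative): $(call docker-build-backend,piper,golang,.,false,true)
# would build the piper backend image from the golang Dockerfile variant, forwarding the
# shared BUILD_TYPE/CUDA/UBUNTU build arguments below and the BACKEND=piper build argument.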
define docker-build-backend
docker build $(if $(filter-out false,$(4)),$(4)) \
--build-arg BUILD_TYPE=$(BUILD_TYPE) \
--build-arg BASE_IMAGE=$(BASE_IMAGE) \
--build-arg CUDA_MAJOR_VERSION=$(CUDA_MAJOR_VERSION) \
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
$(if $(FROM_SOURCE),--build-arg FROM_SOURCE=$(FROM_SOURCE)) \
$(if $(filter true,$(5)),--build-arg BACKEND=$(1)) \
-t local-ai-backend:$(1) -f backend/Dockerfile.$(2) $(3)
endef
# Generate docker-build targets from backend definitions
define generate-docker-build-target
docker-build-$(word 1,$(subst |, ,$(1))):
$$(call docker-build-backend,$(word 1,$(subst |, ,$(1))),$(word 2,$(subst |, ,$(1))),$(word 3,$(subst |, ,$(1))),$(word 4,$(subst |, ,$(1))),$(word 5,$(subst |, ,$(1))))
endef
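# Illustrative expansion (hypothetical values, for readability only — the real field
# values live in the BACKEND_* definitions above): a pipe-delimited definition such as
#   BACKEND_EXAMPLE = example|python|backend/python/example|false|true
# (fields: name | Dockerfile suffix | build context | extra docker args or "false" | pass BACKEND build-arg)
# is turned by generate-docker-build-target into roughly:
#   docker-build-example:
#       $(call docker-build-backend,example,python,backend/python/example,false,true)
# which runs `docker build` with -f backend/Dockerfile.python against backend/python/example
# and tags the result local-ai-backend:example.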
# Generate all docker-build targets
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
$(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
$(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
$(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
$(eval $(call generate-docker-build-target,$(BACKEND_TRANSFORMERS)))
$(eval $(call generate-docker-build-target,$(BACKEND_OUTETTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_FASTER_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_COQUI)))
$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR)))
$(eval $(call generate-docker-build-target,$(BACKEND_KITTEN_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
$(eval $(call generate-docker-build-target,$(BACKEND_VLLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_VLLM_OMNI)))
$(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
$(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
$(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
$(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
$(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_FISH_SPEECH)))
$(eval $(call generate-docker-build-target,$(BACKEND_FASTER_QWEN3_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN_ASR)))
$(eval $(call generate-docker-build-target,$(BACKEND_NEMO)))
$(eval $(call generate-docker-build-target,$(BACKEND_VOXCPM)))
$(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_VLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED)))
$(eval $(call generate-docker-build-target,$(BACKEND_TRL)))
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
$(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
$(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
# Pattern rule for docker-save targets
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
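# Example: `make docker-save-whisper` exports the image built by docker-build-whisper
# (local-ai-backend:whisper) to backend-images/whisper.tar.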
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
########################################################
### Mock Backend for E2E Tests
########################################################
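# build-mock-backend compiles the mock backend under tests/e2e/mock-backend (protobufs are
# generated first via protogen-go); the resulting binary is also a prerequisite of
# build-ui-test-server below.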
build-mock-backend: protogen-go
$(GOCMD) build -o tests/e2e/mock-backend/mock-backend ./tests/e2e/mock-backend
clean-mock-backend:
rm -f tests/e2e/mock-backend/mock-backend
########################################################
### UI E2E Test Server
########################################################
build-ui-test-server: build-mock-backend react-ui protogen-go
$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
test-ui-e2e: build-ui-test-server
cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test
test-ui-e2e-docker:
docker build -t localai-ui-e2e -f tests/e2e-ui/Dockerfile .
docker run --rm localai-ui-e2e
clean-ui-test-server:
rm -f tests/e2e-ui/ui-test-server
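# Typical local run: `make test-ui-e2e` builds the mock backend, the React UI and the
# test server, then drives the UI with Playwright (Chromium). `make test-ui-e2e-docker`
# builds tests/e2e-ui/Dockerfile and runs the suite inside a container instead.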
########################################################
### END Backends
########################################################
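# Regenerate the OpenAPI/Swagger spec. Assumes the swaggo `swag` CLI is on PATH
# (installable with e.g. `go install github.com/swaggo/swag/cmd/swag@latest`).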
.PHONY: swagger
swagger:
swag init -g core/http/app.go --output swagger
# DEPRECATED: gen-assets is for the legacy Alpine.js UI. Remove when legacy UI is removed.
.PHONY: gen-assets
gen-assets:
$(GOCMD) run core/dependencies_manager/manager.go webui_static.yaml core/http/static/assets
## Documentation
docs/layouts/_default:
mkdir -p docs/layouts/_default
docs/static/gallery.html: docs/layouts/_default
$(GOCMD) run ./.github/ci/modelslist.go ./gallery/index.yaml > docs/static/gallery.html
docs/public: docs/layouts/_default docs/static/gallery.html
cd docs && hugo --minify
docs-clean:
rm -rf docs/public
rm -rf docs/static/gallery.html
.PHONY: docs
docs: docs/static/gallery.html
cd docs && hugo serve
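# Typical docs workflow: `make docs` regenerates docs/static/gallery.html from
# gallery/index.yaml and serves the Hugo site locally; `make docs-clean` removes
# docs/public and the generated gallery page.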
########################################################
## Platform-specific builds
########################################################
## fyne cross-platform build
build-launcher-darwin: build-launcher
go run github.com/tiagomelo/macos-dmg-creator/cmd/createdmg@latest \
--appName "LocalAI" \
--appBinaryPath "$(LAUNCHER_BINARY_NAME)" \
--bundleIdentifier "com.localai.launcher" \
--iconPath "core/http/static/logo.png" \
--outputDir "dist/"
build-launcher-linux:
cd cmd/launcher && go run fyne.io/tools/cmd/fyne@latest package -os linux -icon ../../core/http/static/logo.png --executable $(LAUNCHER_BINARY_NAME)-linux && mv launcher.tar.xz ../../$(LAUNCHER_BINARY_NAME)-linux.tar.xz