LocalAI/backend/index.yaml

3724 lines
144 KiB
YAML
Raw Permalink Normal View History

---
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
## metas
- &llamacpp
name: "llama-cpp"
alias: "llama-cpp"
license: mit
icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
description: |
LLM inference in C/C++
urls:
- https://github.com/ggerganov/llama.cpp
tags:
- text-to-text
- LLM
- CPU
- GPU
- Metal
- CUDA
- HIP
capabilities:
default: "cpu-llama-cpp"
nvidia: "cuda12-llama-cpp"
intel: "intel-sycl-f16-llama-cpp"
amd: "rocm-llama-cpp"
metal: "metal-llama-cpp"
vulkan: "vulkan-llama-cpp"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
nvidia-l4t: "nvidia-l4t-arm64-llama-cpp"
nvidia-cuda-13: "cuda13-llama-cpp"
nvidia-cuda-12: "cuda12-llama-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp"
- &ikllamacpp
name: "ik-llama-cpp"
alias: "ik-llama-cpp"
license: mit
description: |
Fork of llama.cpp optimized for CPU performance by ikawrakow
urls:
- https://github.com/ikawrakow/ik_llama.cpp
tags:
- text-to-text
- LLM
- CPU
capabilities:
default: "cpu-ik-llama-cpp"
feat(backend): add turboquant llama.cpp-fork backend (#9355) * feat(backend): add turboquant llama.cpp-fork backend turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme. It ships as a first-class backend reusing backend/cpp/llama-cpp sources via a thin wrapper Makefile: each variant target copies ../llama-cpp into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No duplication of grpc-server.cpp — upstream fixes flow through automatically. Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml, adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0 to exercise the KV-cache config path (backend_test.go gains dedicated env vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement usable by any llama.cpp-family backend), and registers a nightly auto-bump PR in bump_deps.yaml tracking feature/turboquant-kv-cache. scripts/changed-backends.js gets a special-case so edits to backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since the wrapper reuses those sources. * feat(turboquant): carry upstream patches against fork API drift turboquant branched from llama.cpp before upstream commit 66060008 ("server: respect the ignore eos flag", #21203) which added the `logit_bias_eog` field to `server_context_meta` and a matching parameter to `server_task::params_from_json_cmpl`. The shared backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so building it against the fork unmodified fails. Cherry-pick that commit as a patch file under backend/cpp/turboquant/patches/ and apply it to the cloned fork sources via a new apply-patches.sh hook called from the wrapper Makefile. Simplifies the build flow too: instead of hopping through llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now drives the copied Makefile directly (clone -> patch -> build). Drop the corresponding patch whenever the fork catches up with upstream — the build fails fast if a patch stops applying, which is the signal to retire it. * docs: add turboquant backend section + clarify cache_type_k/v Document the new turboquant (llama.cpp fork with TurboQuant KV-cache) backend alongside the existing llama-cpp / ik-llama-cpp sections in features/text-generation.md: when to pick it, how to install it from the gallery, and a YAML example showing backend: turboquant together with cache_type_k / cache_type_v. Also expand the cache_type_k / cache_type_v table rows in advanced/model-configuration.md to spell out the accepted llama.cpp quantization values and note that these fields apply to all llama.cpp-family backends, not just vLLM. * feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but ggml/include/ggml-rpc.h static-asserts it equals 96, breaking the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server). Carry a one-line patch that updates the expected count so the assertion holds. Drop this patch whenever the fork fixes it upstream. * feat(turboquant): allow turbo* KV-cache types and exercise them in e2e The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own allow-list of accepted KV-cache types (kv_cache_types[]) and rejects anything outside it before the value reaches llama.cpp's parser. That list only contains the standard llama.cpp types — turbo2/turbo3/turbo4 would throw "Unsupported cache type" at LoadModel time, meaning nothing the LocalAI gRPC layer accepted was actually fork-specific. Add a build-time augmentation step (patch-grpc-server.sh, called from the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0 into the allow-list of the *copied* grpc-server.cpp under turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/ is never touched, so the stock llama-cpp build keeps compiling against vanilla upstream which has no notion of those enum values. Switch test-extra-backend-turboquant to set BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite actually runs the fork's TurboQuant KV-cache code paths (turbo3 also auto-enables flash_attention in the fork). Picking q8_0 here would only re-test the standard llama.cpp path that the upstream llama-cpp backend already covers. Refresh the docs (text-generation.md + model-configuration.md) to list turbo2/turbo3/turbo4 explicitly and call out that you only get the TurboQuant code path with this backend + a turbo* cache type. * fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3 The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant) does not install python3, so the python-based augmentation step errored with `python3: command not found` at make time. Switch to awk, which ships in coreutils and is already available everywhere the rest of the wrapper Makefile runs. * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-14 23:25:04 +00:00
- &turboquant
name: "turboquant"
alias: "turboquant"
license: mit
description: |
Fork of llama.cpp adding the TurboQuant KV-cache quantization scheme.
Reuses the LocalAI llama.cpp gRPC server sources against the fork's libllama.
urls:
- https://github.com/TheTom/llama-cpp-turboquant
tags:
- text-to-text
- LLM
- CPU
- GPU
- CUDA
- HIP
- turboquant
- kv-cache
capabilities:
default: "cpu-turboquant"
nvidia: "cuda12-turboquant"
intel: "intel-sycl-f16-turboquant"
amd: "rocm-turboquant"
vulkan: "vulkan-turboquant"
nvidia-l4t: "nvidia-l4t-arm64-turboquant"
nvidia-cuda-13: "cuda13-turboquant"
nvidia-cuda-12: "cuda12-turboquant"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant"
- &whispercpp
name: "whisper"
alias: "whisper"
license: mit
icon: https://user-images.githubusercontent.com/1991296/235238348-05d0f6a4-da44-4900-a1de-d0707e75b763.jpeg
description: |
Port of OpenAI's Whisper model in C/C++
urls:
- https://github.com/ggml-org/whisper.cpp
tags:
- audio-transcription
- CPU
- GPU
- CUDA
- HIP
capabilities:
default: "cpu-whisper"
nvidia: "cuda12-whisper"
intel: "intel-sycl-f16-whisper"
metal: "metal-whisper"
amd: "rocm-whisper"
vulkan: "vulkan-whisper"
nvidia-l4t: "nvidia-l4t-arm64-whisper"
nvidia-cuda-13: "cuda13-whisper"
nvidia-cuda-12: "cuda12-whisper"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisper"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-whisper"
- &voxtral
name: "voxtral"
alias: "voxtral"
license: mit
description: |
Voxtral Realtime 4B Pure C speech-to-text inference engine
urls:
- https://github.com/mudler/voxtral.c
tags:
- audio-transcription
- CPU
- Metal
capabilities:
default: "cpu-voxtral"
metal-darwin-arm64: "metal-voxtral"
- &stablediffusionggml
name: "stablediffusion-ggml"
alias: "stablediffusion-ggml"
license: mit
icon: https://github.com/leejet/stable-diffusion.cpp/raw/master/assets/cat_with_sd_cpp_42.png
description: |
Stable Diffusion and Flux in pure C/C++
urls:
- https://github.com/leejet/stable-diffusion.cpp
tags:
- image-generation
- CPU
- GPU
- Metal
- CUDA
- HIP
capabilities:
default: "cpu-stablediffusion-ggml"
nvidia: "cuda12-stablediffusion-ggml"
intel: "intel-sycl-f16-stablediffusion-ggml"
# amd: "rocm-stablediffusion-ggml"
vulkan: "vulkan-stablediffusion-ggml"
nvidia-l4t: "nvidia-l4t-arm64-stablediffusion-ggml"
metal: "metal-stablediffusion-ggml"
nvidia-cuda-13: "cuda13-stablediffusion-ggml"
nvidia-cuda-12: "cuda12-stablediffusion-ggml"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-stablediffusion-ggml"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-stablediffusion-ggml"
- &rfdetr
name: "rfdetr"
alias: "rfdetr"
license: apache-2.0
icon: https://avatars.githubusercontent.com/u/53104118?s=200&v=4
description: |
RF-DETR is a real-time, transformer-based object detection model architecture developed by Roboflow and released under the Apache 2.0 license.
RF-DETR is the first real-time model to exceed 60 AP on the Microsoft COCO benchmark alongside competitive performance at base sizes. It also achieves state-of-the-art performance on RF100-VL, an object detection benchmark that measures model domain adaptability to real world problems. RF-DETR is fastest and most accurate for its size when compared current real-time objection models.
RF-DETR is small enough to run on the edge using Inference, making it an ideal model for deployments that need both strong accuracy and real-time performance.
urls:
- https://github.com/roboflow/rf-detr
tags:
- object-detection
- rfdetr
- gpu
- cpu
capabilities:
nvidia: "cuda12-rfdetr"
intel: "intel-rfdetr"
#amd: "rocm-rfdetr"
nvidia-l4t: "nvidia-l4t-arm64-rfdetr"
metal: "metal-rfdetr"
default: "cpu-rfdetr"
nvidia-cuda-13: "cuda13-rfdetr"
nvidia-cuda-12: "cuda12-rfdetr"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-rfdetr"
- &sam3cpp
name: "sam3-cpp"
alias: "sam3-cpp"
license: mit
description: |
Segment Anything Model (SAM 3/2/EdgeTAM) in C/C++ using GGML.
Supports text-prompted and point/box-prompted image segmentation.
urls:
- https://github.com/PABannier/sam3.cpp
tags:
- image-segmentation
- object-detection
- sam3
- gpu
- cpu
capabilities:
default: "cpu-sam3-cpp"
nvidia: "cuda12-sam3-cpp"
nvidia-cuda-12: "cuda12-sam3-cpp"
nvidia-cuda-13: "cuda13-sam3-cpp"
nvidia-l4t: "nvidia-l4t-arm64-sam3-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-sam3-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sam3-cpp"
intel: "intel-sycl-f32-sam3-cpp"
vulkan: "vulkan-sam3-cpp"
- &vllm
name: "vllm"
license: apache-2.0
urls:
- https://github.com/vllm-project/vllm
tags:
- text-to-text
- multimodal
- GPTQ
- AWQ
- AutoRound
- INT4
- INT8
- FP8
icon: https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png
description: |
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
Speculative decoding
Chunked prefill
alias: "vllm"
capabilities:
nvidia: "cuda12-vllm"
amd: "rocm-vllm"
intel: "intel-vllm"
nvidia-cuda-12: "cuda12-vllm"
feat(vllm): parity with llama.cpp backend (#9328) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.
2026-04-13 09:00:29 +00:00
cpu: "cpu-vllm"
- &sglang
name: "sglang"
license: apache-2.0
urls:
- https://github.com/sgl-project/sglang
tags:
- text-to-text
- multimodal
icon: https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png
description: |
SGLang is a fast serving framework for large language models and vision language models.
It co-designs the backend runtime (RadixAttention, continuous batching, structured
decoding) and the frontend language to make interaction with models faster and more
controllable. Features include fast backend runtime, flexible frontend language,
extensive model support, and an active community.
alias: "sglang"
capabilities:
nvidia: "cuda12-sglang"
amd: "rocm-sglang"
intel: "intel-sglang"
nvidia-cuda-12: "cuda12-sglang"
cpu: "cpu-sglang"
- &vllm-omni
name: "vllm-omni"
license: apache-2.0
urls:
- https://github.com/vllm-project/vllm-omni
tags:
- text-to-image
- image-generation
- text-to-video
- video-generation
- text-to-speech
- TTS
- multimodal
- LLM
icon: https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png
description: |
vLLM-Omni is a unified interface for multimodal generation with vLLM.
It supports image generation (text-to-image, image editing), video generation
(text-to-video, image-to-video), text generation with multimodal inputs, and
text-to-speech generation. Only supports NVIDIA (CUDA) and ROCm platforms.
alias: "vllm-omni"
capabilities:
nvidia: "cuda12-vllm-omni"
amd: "rocm-vllm-omni"
nvidia-cuda-12: "cuda12-vllm-omni"
- &mlx
name: "mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/ml-explore/mlx-lm
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx
license: MIT
description: |
Run LLMs with MLX
tags:
- text-to-text
- LLM
- MLX
capabilities:
default: "cpu-mlx"
nvidia: "cuda12-mlx"
metal: "metal-mlx"
nvidia-cuda-12: "cuda12-mlx"
nvidia-cuda-13: "cuda13-mlx"
nvidia-l4t: "nvidia-l4t-mlx"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx"
- &mlx-vlm
name: "mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-vlm"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/Blaizzy/mlx-vlm
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-vlm
license: MIT
description: |
Run Vision-Language Models with MLX
tags:
- text-to-text
- multimodal
- vision-language
- LLM
- MLX
capabilities:
default: "cpu-mlx-vlm"
nvidia: "cuda12-mlx-vlm"
metal: "metal-mlx-vlm"
nvidia-cuda-12: "cuda12-mlx-vlm"
nvidia-cuda-13: "cuda13-mlx-vlm"
nvidia-l4t: "nvidia-l4t-mlx-vlm"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx-vlm"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-vlm"
- &mlx-audio
name: "mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-audio"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/Blaizzy/mlx-audio
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-audio
license: MIT
description: |
Run Audio Models with MLX
tags:
- audio-to-text
- audio-generation
- text-to-audio
- LLM
- MLX
capabilities:
default: "cpu-mlx-audio"
nvidia: "cuda12-mlx-audio"
metal: "metal-mlx-audio"
nvidia-cuda-12: "cuda12-mlx-audio"
nvidia-cuda-13: "cuda13-mlx-audio"
nvidia-l4t: "nvidia-l4t-mlx-audio"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx-audio"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-audio"
- &mlx-distributed
name: "mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-mlx-distributed"
icon: https://avatars.githubusercontent.com/u/102832242?s=200&v=4
urls:
- https://github.com/ml-explore/mlx-lm
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-mlx-distributed
license: MIT
description: |
Run distributed LLM inference with MLX across multiple Apple Silicon Macs
tags:
- text-to-text
- LLM
- MLX
- distributed
capabilities:
default: "cpu-mlx-distributed"
nvidia: "cuda12-mlx-distributed"
metal: "metal-mlx-distributed"
nvidia-cuda-12: "cuda12-mlx-distributed"
nvidia-cuda-13: "cuda13-mlx-distributed"
nvidia-l4t: "nvidia-l4t-mlx-distributed"
nvidia-l4t-cuda-12: "nvidia-l4t-mlx-distributed"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-mlx-distributed"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- &rerankers
name: "rerankers"
alias: "rerankers"
capabilities:
nvidia: "cuda12-rerankers"
intel: "intel-rerankers"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
amd: "rocm-rerankers"
metal: "metal-rerankers"
feat(backend): add tinygrad multimodal backend (experimental) (#9364) * feat(backend): add tinygrad multimodal backend Wire tinygrad as a new Python backend covering LLM text generation with native tool-call extraction, embeddings, Stable Diffusion 1.x image generation, and Whisper speech-to-text from a single self-contained container. Backend (`backend/python/tinygrad/`): - `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects Llama / Qwen2 / Mistral architecture from `config.json`, supports safetensors and GGUF), Embedding via mean-pooled last hidden state, GenerateImage via the vendored SD1.x pipeline, AudioTranscription + AudioTranscriptionStream via the vendored Whisper inference loop, plus Tokenize / ModelMetadata / Status / Free. - Vendored upstream model code under `vendor/` (MIT, headers preserved): llama.py with an added `qkv_bias` flag for Qwen2-family bias support and an `embed()` method that returns the last hidden state, plus clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf training branch that pulls `mlperf.initializers`), audio_helpers.py and whisper.py (trimmed to drop the pyaudio listener). - Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 / Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral (Mistral / Mixtral). Auto-selected from model architecture or `Options`. - `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the default portable python is 3.10). - `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so tinygrad's CPU device uses the in-process libLLVM JIT instead of shelling out to the missing `clang` binary. - Local unit tests for Health and the four parsers in `test.py`. Build wiring: - Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, `BACKEND_TINYGRAD = tinygrad|python|.|false|true`, docker-build-target eval, and `docker-build-backends` aggregator. - `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix entries (mirrors the transformers backend placement). - `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image entries (latest + development). E2E test wiring: - `tests/e2e-backends/backend_test.go` gains an `image` capability that exercises GenerateImage and asserts a non-empty PNG is written to `dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS` knobs. - Five new make targets next to `test-extra-backend-vllm`: - `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes, mirrors the vllm target 1:1 (5/9 specs in ~57s). - `test-extra-backend-tinygrad-embeddings` — same model, embeddings via LLM hidden state (3/9 in ~10s). - `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror, health/load/image (3/9 in ~10min, 4 diffusion steps on CPU). - `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en against jfk.wav from whisper.cpp samples (4/9 in ~49s). - `test-extra-backend-tinygrad-all` aggregate. All four targets land green on the first MVP pass: 15 specs total, 0 failures across LLM+tools, embeddings, image generation, and speech transcription. * refactor(tinygrad): collapse to a single backend image tinygrad generates its own GPU kernels (PTX renderer for CUDA, the autogen ctypes wrappers for HIP / Metal / WebGPU) and never links against cuDNN, cuBLAS, or any toolkit-version-tied library. The only runtime dependency that varies across hosts is the driver's libcuda.so.1 / libamdhip64.so, which are injected into the container at run time by the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based backends, there is no reason to ship per-CUDA-version images. - Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries from .github/workflows/backend.yml. The sole remaining entry is renamed to -tinygrad (from -cpu-tinygrad) since it is no longer CPU-only. - Collapse backend/index.yaml to a single meta + development pair. The meta anchor carries the latest uri directly; the development entry points at the master tag. - run.sh picks the tinygrad device at launch time by probing /usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX renderer (avoids any nvrtc/toolkit dependency); otherwise we fall back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process libLLVM JIT for the CLANG path. - backend.py's _select_tinygrad_device() is trimmed to a CLANG-only fallback since production device selection happens in run.sh. Re-ran test-extra-backend-tinygrad after the change: Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed
2026-04-15 17:48:23 +00:00
- &tinygrad
name: "tinygrad"
alias: "tinygrad"
license: MIT
description: |
tinygrad is a minimalist deep-learning framework with zero runtime
dependencies that targets CUDA, ROCm, Metal, WebGPU and CPU (CLANG).
The LocalAI tinygrad backend exposes a single multimodal runtime that
covers LLM text generation (Llama / Qwen / Mistral via safetensors or
GGUF) with native tool-call extraction, BERT-family embeddings,
Stable Diffusion 1.x / 2 / XL image generation, and Whisper speech-to-text.
Single image: tinygrad generates its own GPU kernels and dlopens the
host driver libraries at runtime, so there is no per-toolkit build
split. The same image runs CPU-only or accelerates against
CUDA / ROCm / Metal when the host driver is visible.
urls:
- https://github.com/tinygrad/tinygrad
uri: "quay.io/go-skynet/local-ai-backends:latest-tinygrad"
mirrors:
- localai/localai-backends:latest-tinygrad
tags:
- text-to-text
- LLM
- embeddings
- image-generation
- transcription
- multimodal
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- &transformers
name: "transformers"
icon: https://avatars.githubusercontent.com/u/25720743?s=200&v=4
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
alias: "transformers"
license: apache-2.0
description: |
Transformers acts as the model-definition framework for state-of-the-art machine learning models in text, computer vision, audio, video, and multimodal model, for both inference and training.
It centralizes the model definition so that this definition is agreed upon across the ecosystem. transformers is the pivot across frameworks: if a model definition is supported, it will be compatible with the majority of training frameworks (Axolotl, Unsloth, DeepSpeed, FSDP, PyTorch-Lightning, ...), inference engines (vLLM, SGLang, TGI, ...), and adjacent modeling libraries (llama.cpp, mlx, ...) which leverage the model definition from transformers.
urls:
- https://github.com/huggingface/transformers
tags:
- text-to-text
- multimodal
capabilities:
nvidia: "cuda12-transformers"
intel: "intel-transformers"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
amd: "rocm-transformers"
metal: "metal-transformers"
nvidia-cuda-13: "cuda13-transformers"
nvidia-cuda-12: "cuda12-transformers"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- &diffusers
name: "diffusers"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
icon: https://raw.githubusercontent.com/huggingface/diffusers/main/docs/source/en/imgs/diffusers_library.jpg
description: |
🤗 Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Whether you're looking for a simple inference solution or training your own diffusion models, 🤗 Diffusers is a modular toolbox that supports both.
urls:
- https://github.com/huggingface/diffusers
tags:
- image-generation
- video-generation
- diffusion-models
license: apache-2.0
alias: "diffusers"
capabilities:
nvidia: "cuda12-diffusers"
intel: "intel-diffusers"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
amd: "rocm-diffusers"
nvidia-l4t: "nvidia-l4t-diffusers"
metal: "metal-diffusers"
default: "cpu-diffusers"
nvidia-cuda-13: "cuda13-diffusers"
nvidia-cuda-12: "cuda12-diffusers"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-diffusers"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-diffusers"
- &ace-step
name: "ace-step"
description: |
ACE-Step 1.5 is an open-source music generation model. It supports simple mode (natural language description) and advanced mode (caption, lyrics, think, bpm, keyscale, etc.). Uses in-process acestep (LLMHandler for metadata, DiT for audio).
urls:
- https://github.com/ace-step/ACE-Step-1.5
tags:
- music-generation
- sound-generation
alias: "ace-step"
capabilities:
nvidia: "cuda12-ace-step"
intel: "intel-ace-step"
amd: "rocm-ace-step"
metal: "metal-ace-step"
default: "cpu-ace-step"
nvidia-cuda-13: "cuda13-ace-step"
nvidia-cuda-12: "cuda12-ace-step"
- !!merge <<: *ace-step
name: "ace-step-development"
capabilities:
nvidia: "cuda12-ace-step-development"
intel: "intel-ace-step-development"
amd: "rocm-ace-step-development"
metal: "metal-ace-step-development"
default: "cpu-ace-step-development"
nvidia-cuda-13: "cuda13-ace-step-development"
nvidia-cuda-12: "cuda12-ace-step-development"
- &acestepcpp
name: "acestep-cpp"
description: |
ACE-Step 1.5 C++ backend using GGML. Native C++ implementation of ACE-Step music generation with GPU support through GGML backends.
Generates stereo 48kHz audio from text descriptions and optional lyrics via a two-stage pipeline: text-to-code (ace-qwen3 LLM) + code-to-audio (DiT-VAE).
urls:
- https://github.com/ace-step/acestep.cpp
tags:
- music-generation
- sound-generation
alias: "acestep-cpp"
capabilities:
default: "cpu-acestep-cpp"
nvidia: "cuda12-acestep-cpp"
nvidia-cuda-13: "cuda13-acestep-cpp"
nvidia-cuda-12: "cuda12-acestep-cpp"
intel: "intel-sycl-f16-acestep-cpp"
metal: "metal-acestep-cpp"
amd: "rocm-acestep-cpp"
vulkan: "vulkan-acestep-cpp"
nvidia-l4t: "nvidia-l4t-arm64-acestep-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-acestep-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-acestep-cpp"
- &qwen3ttscpp
name: "qwen3-tts-cpp"
description: |
Qwen3-TTS C++ backend using GGML. Native C++ text-to-speech with voice cloning support.
Generates 24kHz mono audio from text with optional reference audio for voice cloning via ECAPA-TDNN speaker embeddings.
urls:
- https://github.com/predict-woo/qwen3-tts.cpp
tags:
- text-to-speech
- tts
- voice-cloning
alias: "qwen3-tts-cpp"
capabilities:
default: "cpu-qwen3-tts-cpp"
nvidia: "cuda12-qwen3-tts-cpp"
nvidia-cuda-13: "cuda13-qwen3-tts-cpp"
nvidia-cuda-12: "cuda12-qwen3-tts-cpp"
intel: "intel-sycl-f16-qwen3-tts-cpp"
metal: "metal-qwen3-tts-cpp"
amd: "rocm-qwen3-tts-cpp"
vulkan: "vulkan-qwen3-tts-cpp"
nvidia-l4t: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-qwen3-tts-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- &faster-whisper
icon: https://avatars.githubusercontent.com/u/1520500?s=200&v=4
description: |
faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
urls:
- https://github.com/SYSTRAN/faster-whisper
tags:
- speech-to-text
- Whisper
license: MIT
name: "faster-whisper"
capabilities:
default: "cpu-faster-whisper"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
nvidia: "cuda12-faster-whisper"
intel: "intel-faster-whisper"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
amd: "rocm-faster-whisper"
metal: "metal-faster-whisper"
nvidia-cuda-13: "cuda13-faster-whisper"
nvidia-cuda-12: "cuda12-faster-whisper"
nvidia-l4t: "nvidia-l4t-arm64-faster-whisper"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-faster-whisper"
- &moonshine
description: |
Moonshine is a fast, accurate, and efficient speech-to-text transcription model using ONNX Runtime.
It provides real-time transcription capabilities with support for multiple model sizes and GPU acceleration.
urls:
- https://github.com/moonshine-ai/moonshine
tags:
- speech-to-text
- transcription
- ONNX
license: MIT
name: "moonshine"
alias: "moonshine"
capabilities:
nvidia: "cuda12-moonshine"
metal: "metal-moonshine"
default: "cpu-moonshine"
nvidia-cuda-13: "cuda13-moonshine"
nvidia-cuda-12: "cuda12-moonshine"
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299) * feat(proto): add speaker field to TranscriptSegment for diarization Add speaker field to the gRPC TranscriptSegment message and map it through the Go schema, enabling backends to return speaker labels. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx backend for transcription with diarization Add Python gRPC backend using WhisperX for speech-to-text with word-level timestamps, forced alignment, and speaker diarization via pyannote-audio when HF_TOKEN is provided. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): register whisperx backend in Makefile Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx meta and image entries to index.yaml Signed-off-by: eureka928 <meobius123@gmail.com> * ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): unpin torch versions and use CPU index for cpu requirements Address review feedback: - Use --extra-index-url for CPU torch wheels to reduce size - Remove torch version pins, let uv resolve compatible versions Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch ROCm variant to fix CI build failure Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch CPU variant to fix uv resolution failure Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra index instead of picking torch==2.8.0+cu128 from PyPI, which pulls unresolvable CUDA dependencies. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure uv's default first-match strategy finds torch on PyPI before checking the extra index, causing it to pick torch==2.8.0+cu128 instead of the CPU variant. This makes whisperx's transitive torch dependency unresolvable. Using unsafe-best-match lets uv consider all indexes. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): drop +cpu local version suffix to fix uv resolution failure PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the issue where uv cannot locate an explicit +cpu local version specifier. This aligns with the pattern used by all other CPU backends. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4, +rocm6.3) in pinned requirements. The --extra-index-url already points to the correct ROCm wheel index and --index-strategy unsafe-best-match (set in libbackend.sh) ensures the ROCm variant is preferred. Applies the same fix as 7f5d72e8 (which resolved this for +cpu) across all 14 hipblas requirements files. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> * revert: scope hipblas suffix fix to whisperx only Reverts changes to non-whisperx hipblas requirements files per maintainer review — other backends are building fine with the +rocm local version suffix. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> --------- Signed-off-by: eureka928 <meobius123@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:33:12 +00:00
- &whisperx
description: |
WhisperX provides fast automatic speech recognition with word-level timestamps, speaker diarization,
and forced alignment. Built on faster-whisper and pyannote-audio for high-accuracy transcription
with speaker identification.
urls:
- https://github.com/m-bain/whisperX
tags:
- speech-to-text
- diarization
- whisperx
license: BSD-4-Clause
name: "whisperx"
alias: "whisperx"
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299) * feat(proto): add speaker field to TranscriptSegment for diarization Add speaker field to the gRPC TranscriptSegment message and map it through the Go schema, enabling backends to return speaker labels. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx backend for transcription with diarization Add Python gRPC backend using WhisperX for speech-to-text with word-level timestamps, forced alignment, and speaker diarization via pyannote-audio when HF_TOKEN is provided. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): register whisperx backend in Makefile Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx meta and image entries to index.yaml Signed-off-by: eureka928 <meobius123@gmail.com> * ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): unpin torch versions and use CPU index for cpu requirements Address review feedback: - Use --extra-index-url for CPU torch wheels to reduce size - Remove torch version pins, let uv resolve compatible versions Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch ROCm variant to fix CI build failure Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch CPU variant to fix uv resolution failure Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra index instead of picking torch==2.8.0+cu128 from PyPI, which pulls unresolvable CUDA dependencies. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure uv's default first-match strategy finds torch on PyPI before checking the extra index, causing it to pick torch==2.8.0+cu128 instead of the CPU variant. This makes whisperx's transitive torch dependency unresolvable. Using unsafe-best-match lets uv consider all indexes. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): drop +cpu local version suffix to fix uv resolution failure PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the issue where uv cannot locate an explicit +cpu local version specifier. This aligns with the pattern used by all other CPU backends. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4, +rocm6.3) in pinned requirements. The --extra-index-url already points to the correct ROCm wheel index and --index-strategy unsafe-best-match (set in libbackend.sh) ensures the ROCm variant is preferred. Applies the same fix as 7f5d72e8 (which resolved this for +cpu) across all 14 hipblas requirements files. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> * revert: scope hipblas suffix fix to whisperx only Reverts changes to non-whisperx hipblas requirements files per maintainer review — other backends are building fine with the +rocm local version suffix. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> --------- Signed-off-by: eureka928 <meobius123@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:33:12 +00:00
capabilities:
nvidia: "cuda12-whisperx"
amd: "rocm-whisperx"
metal: "metal-whisperx"
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299) * feat(proto): add speaker field to TranscriptSegment for diarization Add speaker field to the gRPC TranscriptSegment message and map it through the Go schema, enabling backends to return speaker labels. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx backend for transcription with diarization Add Python gRPC backend using WhisperX for speech-to-text with word-level timestamps, forced alignment, and speaker diarization via pyannote-audio when HF_TOKEN is provided. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): register whisperx backend in Makefile Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx meta and image entries to index.yaml Signed-off-by: eureka928 <meobius123@gmail.com> * ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): unpin torch versions and use CPU index for cpu requirements Address review feedback: - Use --extra-index-url for CPU torch wheels to reduce size - Remove torch version pins, let uv resolve compatible versions Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch ROCm variant to fix CI build failure Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch CPU variant to fix uv resolution failure Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra index instead of picking torch==2.8.0+cu128 from PyPI, which pulls unresolvable CUDA dependencies. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure uv's default first-match strategy finds torch on PyPI before checking the extra index, causing it to pick torch==2.8.0+cu128 instead of the CPU variant. This makes whisperx's transitive torch dependency unresolvable. Using unsafe-best-match lets uv consider all indexes. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): drop +cpu local version suffix to fix uv resolution failure PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the issue where uv cannot locate an explicit +cpu local version specifier. This aligns with the pattern used by all other CPU backends. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4, +rocm6.3) in pinned requirements. The --extra-index-url already points to the correct ROCm wheel index and --index-strategy unsafe-best-match (set in libbackend.sh) ensures the ROCm variant is preferred. Applies the same fix as 7f5d72e8 (which resolved this for +cpu) across all 14 hipblas requirements files. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> * revert: scope hipblas suffix fix to whisperx only Reverts changes to non-whisperx hipblas requirements files per maintainer review — other backends are building fine with the +rocm local version suffix. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> --------- Signed-off-by: eureka928 <meobius123@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:33:12 +00:00
default: "cpu-whisperx"
nvidia-cuda-13: "cuda13-whisperx"
nvidia-cuda-12: "cuda12-whisperx"
nvidia-l4t: "nvidia-l4t-arm64-whisperx"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-whisperx"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- &kokoro
icon: https://avatars.githubusercontent.com/u/166769057?v=4
description: |
Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
urls:
- https://huggingface.co/hexgrad/Kokoro-82M
- https://github.com/hexgrad/kokoro
tags:
- text-to-speech
- TTS
- LLM
license: apache-2.0
alias: "kokoro"
name: "kokoro"
capabilities:
nvidia: "cuda12-kokoro"
intel: "intel-kokoro"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
amd: "rocm-kokoro"
nvidia-l4t: "nvidia-l4t-kokoro"
metal: "metal-kokoro"
nvidia-cuda-13: "cuda13-kokoro"
nvidia-cuda-12: "cuda12-kokoro"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-kokoro"
- &kokoros
icon: https://avatars.githubusercontent.com/u/166769057?v=4
description: |
Kokoros is a pure Rust TTS backend using the Kokoro ONNX model (82M parameters).
It provides fast, high-quality text-to-speech with streaming support, built on
ONNX Runtime for efficient CPU inference. Supports English, Japanese, Mandarin
Chinese, and German.
urls:
- https://huggingface.co/hexgrad/Kokoro-82M
- https://github.com/lucasjinreal/Kokoros
tags:
- text-to-speech
- TTS
- Rust
- ONNX
license: apache-2.0
alias: "kokoros"
name: "kokoros"
capabilities:
default: "cpu-kokoros"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- &coqui
urls:
- https://github.com/idiap/coqui-ai-TTS
description: |
🐸 Coqui TTS is a library for advanced Text-to-Speech generation.
🚀 Pretrained models in +1100 languages.
🛠️ Tools for training new models and fine-tuning existing models in any language.
📚 Utilities for dataset analysis and curation.
tags:
- text-to-speech
- TTS
license: mpl-2.0
name: "coqui"
alias: "coqui"
capabilities:
nvidia: "cuda12-coqui"
intel: "intel-coqui"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
amd: "rocm-coqui"
metal: "metal-coqui"
nvidia-cuda-13: "cuda13-coqui"
nvidia-cuda-12: "cuda12-coqui"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
icon: https://avatars.githubusercontent.com/u/1338804?s=200&v=4
- &outetts
urls:
- https://github.com/OuteAI/outetts
description: |
OuteTTS is an open-weight text-to-speech model from OuteAI (OuteAI/OuteTTS-0.3-1B).
Supports custom speaker voices via audio path or default speakers.
tags:
- text-to-speech
- TTS
license: apache-2.0
name: "outetts"
alias: "outetts"
capabilities:
default: "cpu-outetts"
nvidia-cuda-12: "cuda12-outetts"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- &chatterbox
urls:
- https://github.com/resemble-ai/chatterbox
description: |
Resemble AI's first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.
Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out.
tags:
- text-to-speech
- TTS
license: MIT
icon: https://avatars.githubusercontent.com/u/49844015?s=200&v=4
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
name: "chatterbox"
alias: "chatterbox"
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
capabilities:
nvidia: "cuda12-chatterbox"
metal: "metal-chatterbox"
default: "cpu-chatterbox"
nvidia-l4t: "nvidia-l4t-arm64-chatterbox"
nvidia-cuda-13: "cuda13-chatterbox"
nvidia-cuda-12: "cuda12-chatterbox"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-chatterbox"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-chatterbox"
- &vibevoice
urls:
- https://github.com/microsoft/VibeVoice
description: |
VibeVoice-Realtime is a real-time text-to-speech model that generates natural-sounding speech.
tags:
- text-to-speech
- TTS
license: mit
name: "vibevoice"
alias: "vibevoice"
capabilities:
nvidia: "cuda12-vibevoice"
intel: "intel-vibevoice"
amd: "rocm-vibevoice"
nvidia-l4t: "nvidia-l4t-vibevoice"
metal: "metal-vibevoice"
default: "cpu-vibevoice"
nvidia-cuda-13: "cuda13-vibevoice"
nvidia-cuda-12: "cuda12-vibevoice"
nvidia-l4t-cuda-12: "nvidia-l4t-vibevoice"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice"
icon: https://avatars.githubusercontent.com/u/6154722?s=200&v=4
- &qwen-tts
urls:
- https://github.com/QwenLM/Qwen3-TTS
description: |
Qwen3-TTS is a high-quality text-to-speech model supporting custom voice, voice design, and voice cloning.
tags:
- text-to-speech
- TTS
license: apache-2.0
name: "qwen-tts"
alias: "qwen-tts"
capabilities:
nvidia: "cuda12-qwen-tts"
intel: "intel-qwen-tts"
amd: "rocm-qwen-tts"
nvidia-l4t: "nvidia-l4t-qwen-tts"
metal: "metal-qwen-tts"
default: "cpu-qwen-tts"
nvidia-cuda-13: "cuda13-qwen-tts"
nvidia-cuda-12: "cuda12-qwen-tts"
nvidia-l4t-cuda-12: "nvidia-l4t-qwen-tts"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen-tts"
icon: https://cdn-avatars.huggingface.co/v1/production/uploads/620760a26e3b7210c2ff1943/-s1gyJfvbE1RgO5iBeNOi.png
- &fish-speech
urls:
- https://github.com/fishaudio/fish-speech
description: |
Fish Speech is a high-quality text-to-speech model supporting voice cloning via reference audio.
tags:
- text-to-speech
- TTS
- voice-cloning
license: apache-2.0
name: "fish-speech"
alias: "fish-speech"
capabilities:
nvidia: "cuda12-fish-speech"
intel: "intel-fish-speech"
amd: "rocm-fish-speech"
nvidia-l4t: "nvidia-l4t-fish-speech"
metal: "metal-fish-speech"
default: "cpu-fish-speech"
nvidia-cuda-13: "cuda13-fish-speech"
nvidia-cuda-12: "cuda12-fish-speech"
nvidia-l4t-cuda-12: "nvidia-l4t-fish-speech"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-fish-speech"
icon: https://avatars.githubusercontent.com/u/148526220?s=200&v=4
- &faster-qwen3-tts
urls:
- https://github.com/andimarafioti/faster-qwen3-tts
- https://pypi.org/project/faster-qwen3-tts/
description: |
Real-time Qwen3-TTS inference using CUDA graph capture. Voice clone only; requires NVIDIA GPU with CUDA.
tags:
- text-to-speech
- TTS
- voice-clone
license: apache-2.0
name: "faster-qwen3-tts"
alias: "faster-qwen3-tts"
capabilities:
nvidia: "cuda12-faster-qwen3-tts"
default: "cuda12-faster-qwen3-tts"
nvidia-cuda-13: "cuda13-faster-qwen3-tts"
nvidia-cuda-12: "cuda12-faster-qwen3-tts"
nvidia-l4t: "nvidia-l4t-faster-qwen3-tts"
nvidia-l4t-cuda-12: "nvidia-l4t-faster-qwen3-tts"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-faster-qwen3-tts"
icon: https://cdn-avatars.huggingface.co/v1/production/uploads/620760a26e3b7210c2ff1943/-s1gyJfvbE1RgO5iBeNOi.png
- &qwen-asr
urls:
- https://github.com/QwenLM/Qwen3-ASR
description: |
Qwen3-ASR is an automatic speech recognition model supporting multiple languages and batch inference.
tags:
- speech-recognition
- ASR
license: apache-2.0
name: "qwen-asr"
alias: "qwen-asr"
capabilities:
nvidia: "cuda12-qwen-asr"
intel: "intel-qwen-asr"
amd: "rocm-qwen-asr"
nvidia-l4t: "nvidia-l4t-qwen-asr"
metal: "metal-qwen-asr"
default: "cpu-qwen-asr"
nvidia-cuda-13: "cuda13-qwen-asr"
nvidia-cuda-12: "cuda12-qwen-asr"
nvidia-l4t-cuda-12: "nvidia-l4t-qwen-asr"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen-asr"
icon: https://cdn-avatars.huggingface.co/v1/production/uploads/620760a26e3b7210c2ff1943/-s1gyJfvbE1RgO5iBeNOi.png
- &nemo
urls:
- https://github.com/NVIDIA/NeMo
description: |
NVIDIA NEMO Toolkit for ASR provides state-of-the-art automatic speech recognition models including Parakeet models for various languages and use cases.
tags:
- speech-recognition
- ASR
- NVIDIA
license: apache-2.0
name: "nemo"
alias: "nemo"
capabilities:
nvidia: "cuda12-nemo"
intel: "intel-nemo"
amd: "rocm-nemo"
metal: "metal-nemo"
default: "cpu-nemo"
nvidia-cuda-13: "cuda13-nemo"
nvidia-cuda-12: "cuda12-nemo"
icon: https://www.nvidia.com/favicon.ico
- &voxcpm
urls:
- https://github.com/ModelBest/VoxCPM
description: |
VoxCPM is an innovative end-to-end TTS model from ModelBest, designed to generate highly expressive speech.
tags:
- text-to-speech
- TTS
license: mit
name: "voxcpm"
alias: "voxcpm"
capabilities:
nvidia: "cuda12-voxcpm"
intel: "intel-voxcpm"
amd: "rocm-voxcpm"
metal: "metal-voxcpm"
default: "cpu-voxcpm"
nvidia-cuda-13: "cuda13-voxcpm"
nvidia-cuda-12: "cuda12-voxcpm"
icon: https://avatars.githubusercontent.com/u/6154722?s=200&v=4
- &pocket-tts
urls:
- https://github.com/kyutai-labs/pocket-tts
description: |
Pocket TTS is a lightweight text-to-speech model designed to run efficiently on CPUs.
tags:
- text-to-speech
- TTS
license: mit
name: "pocket-tts"
alias: "pocket-tts"
capabilities:
nvidia: "cuda12-pocket-tts"
intel: "intel-pocket-tts"
amd: "rocm-pocket-tts"
nvidia-l4t: "nvidia-l4t-pocket-tts"
metal: "metal-pocket-tts"
default: "cpu-pocket-tts"
nvidia-cuda-13: "cuda13-pocket-tts"
nvidia-cuda-12: "cuda12-pocket-tts"
nvidia-l4t-cuda-12: "nvidia-l4t-pocket-tts"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-pocket-tts"
icon: https://avatars.githubusercontent.com/u/151010778?s=200&v=4
- &piper
name: "piper"
uri: "quay.io/go-skynet/local-ai-backends:latest-piper"
icon: https://github.com/OHF-Voice/piper1-gpl/raw/main/etc/logo.png
urls:
- https://github.com/rhasspy/piper
- https://github.com/mudler/go-piper
mirrors:
- localai/localai-backends:latest-piper
license: MIT
description: |
A fast, local neural text to speech system
tags:
- text-to-speech
- TTS
- &opus
name: "opus"
alias: "opus"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-opus"
urls:
- https://opus-codec.org/
mirrors:
- localai/localai-backends:latest-cpu-opus
license: BSD-3-Clause
description: |
Opus audio codec backend for encoding and decoding audio.
Required for WebRTC transport in the Realtime API.
tags:
- audio-codec
- opus
- WebRTC
- realtime
- CPU
- &silero-vad
name: "silero-vad"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-silero-vad"
icon: https://user-images.githubusercontent.com/12515440/89997349-b3523080-dc94-11ea-9906-ca2e8bc50535.png
urls:
- https://github.com/snakers4/silero-vad
mirrors:
- localai/localai-backends:latest-cpu-silero-vad
description: |
Silero VAD: pre-trained enterprise-grade Voice Activity Detector.
Silero VAD is a voice activity detection model that can be used to detect whether a given audio contains speech or not.
tags:
- voice-activity-detection
- VAD
- silero-vad
- CPU
- &local-store
name: "local-store"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-local-store"
mirrors:
- localai/localai-backends:latest-cpu-local-store
urls:
- https://github.com/mudler/LocalAI
description: |
Local Store is a local-first, self-hosted, and open-source vector database.
tags:
- vector-database
- local-first
- open-source
- CPU
license: MIT
- &kitten-tts
name: "kitten-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-kitten-tts"
mirrors:
- localai/localai-backends:latest-kitten-tts
urls:
- https://github.com/KittenML/KittenTTS
description: |
Kitten TTS is a text-to-speech model that can generate speech from text.
tags:
- text-to-speech
- TTS
license: apache-2.0
- &neutts
name: "neutts"
urls:
- https://github.com/neuphonic/neutts-air
description: |
NeuTTS Air is the worlds first super-realistic, on-device, TTS speech language model with instant voice cloning. Built off a 0.5B LLM backbone, NeuTTS Air brings natural-sounding speech, real-time performance, built-in security and speaker cloning to your local device - unlocking a new category of embedded voice agents, assistants, toys, and compliance-safe apps.
tags:
- text-to-speech
- TTS
license: apache-2.0
capabilities:
default: "cpu-neutts"
nvidia: "cuda12-neutts"
amd: "rocm-neutts"
nvidia-cuda-12: "cuda12-neutts"
- !!merge <<: *neutts
name: "neutts-development"
capabilities:
default: "cpu-neutts-development"
nvidia: "cuda12-neutts-development"
amd: "rocm-neutts-development"
nvidia-cuda-12: "cuda12-neutts-development"
- !!merge <<: *llamacpp
name: "llama-cpp-development"
capabilities:
default: "cpu-llama-cpp-development"
nvidia: "cuda12-llama-cpp-development"
intel: "intel-sycl-f16-llama-cpp-development"
amd: "rocm-llama-cpp-development"
metal: "metal-llama-cpp-development"
vulkan: "vulkan-llama-cpp-development"
nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-development"
nvidia-cuda-13: "cuda13-llama-cpp-development"
nvidia-cuda-12: "cuda12-llama-cpp-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-development"
- !!merge <<: *ikllamacpp
name: "ik-llama-cpp-development"
capabilities:
default: "cpu-ik-llama-cpp-development"
feat(backend): add turboquant llama.cpp-fork backend (#9355) * feat(backend): add turboquant llama.cpp-fork backend turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme. It ships as a first-class backend reusing backend/cpp/llama-cpp sources via a thin wrapper Makefile: each variant target copies ../llama-cpp into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No duplication of grpc-server.cpp — upstream fixes flow through automatically. Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml, adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0 to exercise the KV-cache config path (backend_test.go gains dedicated env vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement usable by any llama.cpp-family backend), and registers a nightly auto-bump PR in bump_deps.yaml tracking feature/turboquant-kv-cache. scripts/changed-backends.js gets a special-case so edits to backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since the wrapper reuses those sources. * feat(turboquant): carry upstream patches against fork API drift turboquant branched from llama.cpp before upstream commit 66060008 ("server: respect the ignore eos flag", #21203) which added the `logit_bias_eog` field to `server_context_meta` and a matching parameter to `server_task::params_from_json_cmpl`. The shared backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so building it against the fork unmodified fails. Cherry-pick that commit as a patch file under backend/cpp/turboquant/patches/ and apply it to the cloned fork sources via a new apply-patches.sh hook called from the wrapper Makefile. Simplifies the build flow too: instead of hopping through llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now drives the copied Makefile directly (clone -> patch -> build). Drop the corresponding patch whenever the fork catches up with upstream — the build fails fast if a patch stops applying, which is the signal to retire it. * docs: add turboquant backend section + clarify cache_type_k/v Document the new turboquant (llama.cpp fork with TurboQuant KV-cache) backend alongside the existing llama-cpp / ik-llama-cpp sections in features/text-generation.md: when to pick it, how to install it from the gallery, and a YAML example showing backend: turboquant together with cache_type_k / cache_type_v. Also expand the cache_type_k / cache_type_v table rows in advanced/model-configuration.md to spell out the accepted llama.cpp quantization values and note that these fields apply to all llama.cpp-family backends, not just vLLM. * feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but ggml/include/ggml-rpc.h static-asserts it equals 96, breaking the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server). Carry a one-line patch that updates the expected count so the assertion holds. Drop this patch whenever the fork fixes it upstream. * feat(turboquant): allow turbo* KV-cache types and exercise them in e2e The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own allow-list of accepted KV-cache types (kv_cache_types[]) and rejects anything outside it before the value reaches llama.cpp's parser. That list only contains the standard llama.cpp types — turbo2/turbo3/turbo4 would throw "Unsupported cache type" at LoadModel time, meaning nothing the LocalAI gRPC layer accepted was actually fork-specific. Add a build-time augmentation step (patch-grpc-server.sh, called from the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0 into the allow-list of the *copied* grpc-server.cpp under turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/ is never touched, so the stock llama-cpp build keeps compiling against vanilla upstream which has no notion of those enum values. Switch test-extra-backend-turboquant to set BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite actually runs the fork's TurboQuant KV-cache code paths (turbo3 also auto-enables flash_attention in the fork). Picking q8_0 here would only re-test the standard llama.cpp path that the upstream llama-cpp backend already covers. Refresh the docs (text-generation.md + model-configuration.md) to list turbo2/turbo3/turbo4 explicitly and call out that you only get the TurboQuant code path with this backend + a turbo* cache type. * fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3 The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant) does not install python3, so the python-based augmentation step errored with `python3: command not found` at make time. Switch to awk, which ships in coreutils and is already available everywhere the rest of the wrapper Makefile runs. * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-14 23:25:04 +00:00
- !!merge <<: *turboquant
name: "turboquant-development"
capabilities:
default: "cpu-turboquant-development"
nvidia: "cuda12-turboquant-development"
intel: "intel-sycl-f16-turboquant-development"
amd: "rocm-turboquant-development"
vulkan: "vulkan-turboquant-development"
nvidia-l4t: "nvidia-l4t-arm64-turboquant-development"
nvidia-cuda-13: "cuda13-turboquant-development"
nvidia-cuda-12: "cuda12-turboquant-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
- !!merge <<: *stablediffusionggml
name: "stablediffusion-ggml-development"
capabilities:
default: "cpu-stablediffusion-ggml-development"
nvidia: "cuda12-stablediffusion-ggml-development"
intel: "intel-sycl-f16-stablediffusion-ggml-development"
# amd: "rocm-stablediffusion-ggml-development"
vulkan: "vulkan-stablediffusion-ggml-development"
nvidia-l4t: "nvidia-l4t-arm64-stablediffusion-ggml-development"
metal: "metal-stablediffusion-ggml-development"
nvidia-cuda-13: "cuda13-stablediffusion-ggml-development"
nvidia-cuda-12: "cuda12-stablediffusion-ggml-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-stablediffusion-ggml-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-stablediffusion-ggml-development"
- !!merge <<: *neutts
name: "cpu-neutts"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-neutts"
mirrors:
- localai/localai-backends:latest-cpu-neutts
- !!merge <<: *neutts
name: "cuda12-neutts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-neutts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-neutts
- !!merge <<: *neutts
name: "rocm-neutts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-neutts"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-neutts
- !!merge <<: *neutts
name: "cpu-neutts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-neutts"
mirrors:
- localai/localai-backends:master-cpu-neutts
- !!merge <<: *neutts
name: "cuda12-neutts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-neutts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-neutts
- !!merge <<: *neutts
name: "rocm-neutts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-neutts"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-neutts
- !!merge <<: *mlx
name: "mlx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx
- !!merge <<: *mlx-vlm
name: "mlx-vlm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx-vlm"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx-vlm
- !!merge <<: *mlx-audio
name: "mlx-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx-audio"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx-audio
- !!merge <<: *mlx-distributed
name: "mlx-distributed-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-mlx-distributed"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-mlx-distributed
## mlx
- !!merge <<: *mlx
name: "cpu-mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-mlx"
mirrors:
- localai/localai-backends:latest-cpu-mlx
- !!merge <<: *mlx
name: "cpu-mlx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-mlx"
mirrors:
- localai/localai-backends:master-cpu-mlx
- !!merge <<: *mlx
name: "cuda12-mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-mlx"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-mlx
- !!merge <<: *mlx
name: "cuda12-mlx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-mlx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-mlx
- !!merge <<: *mlx
name: "cuda13-mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-mlx"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-mlx
- !!merge <<: *mlx
name: "cuda13-mlx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-mlx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-mlx
- !!merge <<: *mlx
name: "nvidia-l4t-mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-mlx"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-mlx
- !!merge <<: *mlx
name: "nvidia-l4t-mlx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-mlx"
mirrors:
- localai/localai-backends:master-nvidia-l4t-mlx
- !!merge <<: *mlx
name: "cuda13-nvidia-l4t-arm64-mlx"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx
- !!merge <<: *mlx
name: "cuda13-nvidia-l4t-arm64-mlx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-mlx"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-mlx
## mlx-vlm
- !!merge <<: *mlx-vlm
name: "cpu-mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-mlx-vlm"
mirrors:
- localai/localai-backends:latest-cpu-mlx-vlm
- !!merge <<: *mlx-vlm
name: "cpu-mlx-vlm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-mlx-vlm"
mirrors:
- localai/localai-backends:master-cpu-mlx-vlm
- !!merge <<: *mlx-vlm
name: "cuda12-mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-mlx-vlm"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-mlx-vlm
- !!merge <<: *mlx-vlm
name: "cuda12-mlx-vlm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-mlx-vlm"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-mlx-vlm
- !!merge <<: *mlx-vlm
name: "cuda13-mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-mlx-vlm"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-mlx-vlm
- !!merge <<: *mlx-vlm
name: "cuda13-mlx-vlm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-mlx-vlm"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-mlx-vlm
- !!merge <<: *mlx-vlm
name: "nvidia-l4t-mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-mlx-vlm"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-mlx-vlm
- !!merge <<: *mlx-vlm
name: "nvidia-l4t-mlx-vlm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-mlx-vlm"
mirrors:
- localai/localai-backends:master-nvidia-l4t-mlx-vlm
- !!merge <<: *mlx-vlm
name: "cuda13-nvidia-l4t-arm64-mlx-vlm"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx-vlm"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx-vlm
- !!merge <<: *mlx-vlm
name: "cuda13-nvidia-l4t-arm64-mlx-vlm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-mlx-vlm"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-mlx-vlm
## mlx-audio
- !!merge <<: *mlx-audio
name: "cpu-mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-mlx-audio"
mirrors:
- localai/localai-backends:latest-cpu-mlx-audio
- !!merge <<: *mlx-audio
name: "cpu-mlx-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-mlx-audio"
mirrors:
- localai/localai-backends:master-cpu-mlx-audio
- !!merge <<: *mlx-audio
name: "cuda12-mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-mlx-audio"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-mlx-audio
- !!merge <<: *mlx-audio
name: "cuda12-mlx-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-mlx-audio"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-mlx-audio
- !!merge <<: *mlx-audio
name: "cuda13-mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-mlx-audio"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-mlx-audio
- !!merge <<: *mlx-audio
name: "cuda13-mlx-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-mlx-audio"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-mlx-audio
- !!merge <<: *mlx-audio
name: "nvidia-l4t-mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-mlx-audio"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-mlx-audio
- !!merge <<: *mlx-audio
name: "nvidia-l4t-mlx-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-mlx-audio"
mirrors:
- localai/localai-backends:master-nvidia-l4t-mlx-audio
- !!merge <<: *mlx-audio
name: "cuda13-nvidia-l4t-arm64-mlx-audio"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx-audio"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx-audio
- !!merge <<: *mlx-audio
name: "cuda13-nvidia-l4t-arm64-mlx-audio-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-mlx-audio"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-mlx-audio
## mlx-distributed
- !!merge <<: *mlx-distributed
name: "cpu-mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-mlx-distributed"
mirrors:
- localai/localai-backends:latest-cpu-mlx-distributed
- !!merge <<: *mlx-distributed
name: "cpu-mlx-distributed-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-mlx-distributed"
mirrors:
- localai/localai-backends:master-cpu-mlx-distributed
- !!merge <<: *mlx-distributed
name: "cuda12-mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-mlx-distributed"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-mlx-distributed
- !!merge <<: *mlx-distributed
name: "cuda12-mlx-distributed-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-mlx-distributed"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-mlx-distributed
- !!merge <<: *mlx-distributed
name: "cuda13-mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-mlx-distributed"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-mlx-distributed
- !!merge <<: *mlx-distributed
name: "cuda13-mlx-distributed-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-mlx-distributed"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-mlx-distributed
- !!merge <<: *mlx-distributed
name: "nvidia-l4t-mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-mlx-distributed"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-mlx-distributed
- !!merge <<: *mlx-distributed
name: "nvidia-l4t-mlx-distributed-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-mlx-distributed"
mirrors:
- localai/localai-backends:master-nvidia-l4t-mlx-distributed
- !!merge <<: *mlx-distributed
name: "cuda13-nvidia-l4t-arm64-mlx-distributed"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx-distributed"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-mlx-distributed
- !!merge <<: *mlx-distributed
name: "cuda13-nvidia-l4t-arm64-mlx-distributed-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-mlx-distributed"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-mlx-distributed
- !!merge <<: *kitten-tts
name: "kitten-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-kitten-tts"
mirrors:
- localai/localai-backends:master-kitten-tts
- !!merge <<: *kitten-tts
name: "metal-kitten-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-kitten-tts"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-kitten-tts
- !!merge <<: *kitten-tts
name: "metal-kitten-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-kitten-tts"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-kitten-tts
- !!merge <<: *local-store
name: "local-store-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-local-store"
mirrors:
- localai/localai-backends:master-cpu-local-store
- !!merge <<: *local-store
name: "metal-local-store"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-local-store"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-local-store
- !!merge <<: *local-store
name: "metal-local-store-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-local-store"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-local-store
- !!merge <<: *opus
name: "opus-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-opus"
mirrors:
- localai/localai-backends:master-cpu-opus
- !!merge <<: *opus
name: "metal-opus"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-opus"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-opus
- !!merge <<: *opus
name: "metal-opus-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-opus"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-opus
- !!merge <<: *silero-vad
name: "silero-vad-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-silero-vad"
mirrors:
- localai/localai-backends:master-cpu-silero-vad
- !!merge <<: *silero-vad
name: "metal-silero-vad"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-silero-vad"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-silero-vad
- !!merge <<: *silero-vad
name: "metal-silero-vad-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-silero-vad"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-silero-vad
- !!merge <<: *piper
name: "piper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-piper"
mirrors:
- localai/localai-backends:master-piper
- !!merge <<: *piper
name: "metal-piper"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-piper"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-piper
- !!merge <<: *piper
name: "metal-piper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-piper"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-piper
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
## llama-cpp
- !!merge <<: *llamacpp
name: "nvidia-l4t-arm64-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-llama-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "nvidia-l4t-arm64-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-llama-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-llama-cpp
- !!merge <<: *llamacpp
name: "cuda13-nvidia-l4t-arm64-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-llama-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-llama-cpp
- !!merge <<: *llamacpp
name: "cuda13-nvidia-l4t-arm64-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-llama-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "cpu-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp"
mirrors:
- localai/localai-backends:latest-cpu-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "cpu-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp"
mirrors:
- localai/localai-backends:master-cpu-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "cuda12-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-llama-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "rocm-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-llama-cpp"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "intel-sycl-f32-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-llama-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "intel-sycl-f16-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-llama-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-llama-cpp
- !!merge <<: *llamacpp
name: "vulkan-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-llama-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-llama-cpp
- !!merge <<: *llamacpp
name: "vulkan-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-llama-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "metal-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-llama-cpp"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "metal-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-llama-cpp"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "cuda12-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-llama-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "rocm-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-llama-cpp"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "intel-sycl-f32-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-llama-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-llama-cpp
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
- !!merge <<: *llamacpp
name: "intel-sycl-f16-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-llama-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-llama-cpp
- !!merge <<: *llamacpp
name: "cuda13-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-llama-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-llama-cpp
- !!merge <<: *llamacpp
name: "cuda13-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-llama-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-llama-cpp
## ik-llama-cpp
- !!merge <<: *ikllamacpp
name: "cpu-ik-llama-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ik-llama-cpp"
mirrors:
- localai/localai-backends:latest-cpu-ik-llama-cpp
- !!merge <<: *ikllamacpp
name: "cpu-ik-llama-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ik-llama-cpp"
mirrors:
- localai/localai-backends:master-cpu-ik-llama-cpp
feat(backend): add turboquant llama.cpp-fork backend (#9355) * feat(backend): add turboquant llama.cpp-fork backend turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme. It ships as a first-class backend reusing backend/cpp/llama-cpp sources via a thin wrapper Makefile: each variant target copies ../llama-cpp into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No duplication of grpc-server.cpp — upstream fixes flow through automatically. Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml, adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0 to exercise the KV-cache config path (backend_test.go gains dedicated env vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement usable by any llama.cpp-family backend), and registers a nightly auto-bump PR in bump_deps.yaml tracking feature/turboquant-kv-cache. scripts/changed-backends.js gets a special-case so edits to backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since the wrapper reuses those sources. * feat(turboquant): carry upstream patches against fork API drift turboquant branched from llama.cpp before upstream commit 66060008 ("server: respect the ignore eos flag", #21203) which added the `logit_bias_eog` field to `server_context_meta` and a matching parameter to `server_task::params_from_json_cmpl`. The shared backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so building it against the fork unmodified fails. Cherry-pick that commit as a patch file under backend/cpp/turboquant/patches/ and apply it to the cloned fork sources via a new apply-patches.sh hook called from the wrapper Makefile. Simplifies the build flow too: instead of hopping through llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now drives the copied Makefile directly (clone -> patch -> build). Drop the corresponding patch whenever the fork catches up with upstream — the build fails fast if a patch stops applying, which is the signal to retire it. * docs: add turboquant backend section + clarify cache_type_k/v Document the new turboquant (llama.cpp fork with TurboQuant KV-cache) backend alongside the existing llama-cpp / ik-llama-cpp sections in features/text-generation.md: when to pick it, how to install it from the gallery, and a YAML example showing backend: turboquant together with cache_type_k / cache_type_v. Also expand the cache_type_k / cache_type_v table rows in advanced/model-configuration.md to spell out the accepted llama.cpp quantization values and note that these fields apply to all llama.cpp-family backends, not just vLLM. * feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but ggml/include/ggml-rpc.h static-asserts it equals 96, breaking the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server). Carry a one-line patch that updates the expected count so the assertion holds. Drop this patch whenever the fork fixes it upstream. * feat(turboquant): allow turbo* KV-cache types and exercise them in e2e The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own allow-list of accepted KV-cache types (kv_cache_types[]) and rejects anything outside it before the value reaches llama.cpp's parser. That list only contains the standard llama.cpp types — turbo2/turbo3/turbo4 would throw "Unsupported cache type" at LoadModel time, meaning nothing the LocalAI gRPC layer accepted was actually fork-specific. Add a build-time augmentation step (patch-grpc-server.sh, called from the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0 into the allow-list of the *copied* grpc-server.cpp under turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/ is never touched, so the stock llama-cpp build keeps compiling against vanilla upstream which has no notion of those enum values. Switch test-extra-backend-turboquant to set BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite actually runs the fork's TurboQuant KV-cache code paths (turbo3 also auto-enables flash_attention in the fork). Picking q8_0 here would only re-test the standard llama.cpp path that the upstream llama-cpp backend already covers. Refresh the docs (text-generation.md + model-configuration.md) to list turbo2/turbo3/turbo4 explicitly and call out that you only get the TurboQuant code path with this backend + a turbo* cache type. * fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3 The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant) does not install python3, so the python-based augmentation step errored with `python3: command not found` at make time. Switch to awk, which ships in coreutils and is already available everywhere the rest of the wrapper Makefile runs. * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-14 23:25:04 +00:00
## turboquant
- !!merge <<: *turboquant
name: "cpu-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-turboquant"
mirrors:
- localai/localai-backends:latest-cpu-turboquant
- !!merge <<: *turboquant
name: "cpu-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-turboquant"
mirrors:
- localai/localai-backends:master-cpu-turboquant
- !!merge <<: *turboquant
name: "cuda12-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-turboquant
- !!merge <<: *turboquant
name: "cuda12-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-turboquant"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-turboquant
- !!merge <<: *turboquant
name: "cuda13-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-turboquant
- !!merge <<: *turboquant
name: "cuda13-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-turboquant"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-turboquant
- !!merge <<: *turboquant
name: "rocm-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-turboquant
- !!merge <<: *turboquant
name: "rocm-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-turboquant"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f32-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f32-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-turboquant"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f16-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f16-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-turboquant"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-turboquant
- !!merge <<: *turboquant
name: "vulkan-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-turboquant
- !!merge <<: *turboquant
name: "vulkan-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-turboquant"
mirrors:
- localai/localai-backends:master-gpu-vulkan-turboquant
- !!merge <<: *turboquant
name: "nvidia-l4t-arm64-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-turboquant"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-turboquant
- !!merge <<: *turboquant
name: "nvidia-l4t-arm64-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-turboquant"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-turboquant
- !!merge <<: *turboquant
name: "cuda13-nvidia-l4t-arm64-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-turboquant"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-turboquant
- !!merge <<: *turboquant
name: "cuda13-nvidia-l4t-arm64-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
## whisper
- !!merge <<: *whispercpp
name: "nvidia-l4t-arm64-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-whisper"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-whisper
- !!merge <<: *whispercpp
name: "nvidia-l4t-arm64-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-whisper"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-whisper
- !!merge <<: *whispercpp
name: "cuda13-nvidia-l4t-arm64-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-whisper"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-whisper
- !!merge <<: *whispercpp
name: "cuda13-nvidia-l4t-arm64-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-whisper"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-whisper
- !!merge <<: *whispercpp
name: "cpu-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-whisper"
mirrors:
- localai/localai-backends:latest-cpu-whisper
- !!merge <<: *whispercpp
name: "metal-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-whisper"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-whisper
- !!merge <<: *whispercpp
name: "metal-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-whisper"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-whisper
- !!merge <<: *whispercpp
name: "cpu-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-whisper"
mirrors:
- localai/localai-backends:master-cpu-whisper
- !!merge <<: *whispercpp
name: "cuda12-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-whisper"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-whisper
- !!merge <<: *whispercpp
name: "rocm-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-whisper"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-whisper
- !!merge <<: *whispercpp
name: "intel-sycl-f32-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-whisper"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-whisper
- !!merge <<: *whispercpp
name: "intel-sycl-f16-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-whisper"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-whisper
- !!merge <<: *whispercpp
name: "vulkan-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-whisper"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-whisper
- !!merge <<: *whispercpp
name: "vulkan-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-whisper"
mirrors:
- localai/localai-backends:master-gpu-vulkan-whisper
- !!merge <<: *whispercpp
name: "metal-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-whisper"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-whisper
- !!merge <<: *whispercpp
name: "metal-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-whisper"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-whisper
- !!merge <<: *whispercpp
name: "cuda12-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-whisper"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-whisper
- !!merge <<: *whispercpp
name: "rocm-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-whisper"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-whisper
- !!merge <<: *whispercpp
name: "intel-sycl-f32-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-whisper"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-whisper
- !!merge <<: *whispercpp
name: "intel-sycl-f16-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-whisper"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-whisper
- !!merge <<: *whispercpp
name: "cuda13-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-whisper"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-whisper
- !!merge <<: *whispercpp
name: "cuda13-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-whisper"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-whisper
## stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cpu-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-cpu-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cpu-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-cpu-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "metal-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "metal-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "vulkan-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "vulkan-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-gpu-vulkan-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cuda12-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "intel-sycl-f32-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-stablediffusion-ggml"
- !!merge <<: *stablediffusionggml
name: "intel-sycl-f16-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cuda12-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "intel-sycl-f32-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "intel-sycl-f16-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "nvidia-l4t-arm64-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "nvidia-l4t-arm64-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cuda13-nvidia-l4t-arm64-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cuda13-nvidia-l4t-arm64-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cuda13-stablediffusion-ggml"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-stablediffusion-ggml"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cuda13-stablediffusion-ggml-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-stablediffusion-ggml"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-stablediffusion-ggml
feat: do not bundle llama-cpp anymore (#5790) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-07-18 11:24:12 +00:00
# vllm
- !!merge <<: *vllm
name: "vllm-development"
capabilities:
nvidia: "cuda12-vllm-development"
amd: "rocm-vllm-development"
intel: "intel-vllm-development"
feat(vllm): parity with llama.cpp backend (#9328) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.
2026-04-13 09:00:29 +00:00
cpu: "cpu-vllm-development"
- !!merge <<: *vllm
name: "cuda12-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm
- !!merge <<: *vllm
name: "rocm-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-vllm
- !!merge <<: *vllm
name: "intel-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-vllm"
mirrors:
- localai/localai-backends:latest-gpu-intel-vllm
feat(vllm): parity with llama.cpp backend (#9328) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.
2026-04-13 09:00:29 +00:00
- !!merge <<: *vllm
name: "cpu-vllm"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-vllm"
mirrors:
- localai/localai-backends:latest-cpu-vllm
- !!merge <<: *vllm
name: "cuda12-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-vllm
- !!merge <<: *vllm
name: "rocm-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-vllm
- !!merge <<: *vllm
name: "intel-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-vllm"
mirrors:
- localai/localai-backends:master-gpu-intel-vllm
feat(vllm): parity with llama.cpp backend (#9328) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.
2026-04-13 09:00:29 +00:00
- !!merge <<: *vllm
name: "cpu-vllm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vllm"
mirrors:
- localai/localai-backends:master-cpu-vllm
# sglang
- !!merge <<: *sglang
name: "sglang-development"
capabilities:
nvidia: "cuda12-sglang-development"
amd: "rocm-sglang-development"
intel: "intel-sglang-development"
cpu: "cpu-sglang-development"
- !!merge <<: *sglang
name: "cuda12-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "rocm-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-sglang"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-sglang
- !!merge <<: *sglang
name: "intel-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sglang"
mirrors:
- localai/localai-backends:latest-gpu-intel-sglang
- !!merge <<: *sglang
name: "cpu-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sglang"
mirrors:
- localai/localai-backends:latest-cpu-sglang
- !!merge <<: *sglang
name: "cuda12-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "rocm-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-sglang"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-sglang
- !!merge <<: *sglang
name: "intel-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sglang"
mirrors:
- localai/localai-backends:master-gpu-intel-sglang
- !!merge <<: *sglang
name: "cpu-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sglang"
mirrors:
- localai/localai-backends:master-cpu-sglang
# vllm-omni
- !!merge <<: *vllm-omni
name: "vllm-omni-development"
capabilities:
nvidia: "cuda12-vllm-omni-development"
amd: "rocm-vllm-omni-development"
nvidia-cuda-12: "cuda12-vllm-omni-development"
- !!merge <<: *vllm-omni
name: "cuda12-vllm-omni"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vllm-omni"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-vllm-omni
- !!merge <<: *vllm-omni
name: "rocm-vllm-omni"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vllm-omni"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-vllm-omni
- !!merge <<: *vllm-omni
name: "cuda12-vllm-omni-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vllm-omni"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-vllm-omni
- !!merge <<: *vllm-omni
name: "rocm-vllm-omni-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vllm-omni"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-vllm-omni
# rfdetr
- !!merge <<: *rfdetr
name: "rfdetr-development"
capabilities:
nvidia: "cuda12-rfdetr-development"
intel: "intel-rfdetr-development"
#amd: "rocm-rfdetr-development"
nvidia-l4t: "nvidia-l4t-arm64-rfdetr-development"
metal: "metal-rfdetr-development"
default: "cpu-rfdetr-development"
nvidia-cuda-13: "cuda13-rfdetr-development"
- !!merge <<: *rfdetr
name: "cuda12-rfdetr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-rfdetr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-rfdetr
- !!merge <<: *rfdetr
name: "intel-rfdetr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-rfdetr"
mirrors:
- localai/localai-backends:latest-gpu-intel-rfdetr
# - !!merge <<: *rfdetr
# name: "rocm-rfdetr"
# uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-hipblas-rfdetr"
# mirrors:
# - localai/localai-backends:latest-gpu-hipblas-rfdetr
- !!merge <<: *rfdetr
name: "nvidia-l4t-arm64-rfdetr"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-rfdetr"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-rfdetr
- !!merge <<: *rfdetr
name: "nvidia-l4t-arm64-rfdetr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-rfdetr"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-rfdetr
- !!merge <<: *rfdetr
name: "cpu-rfdetr"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-rfdetr"
mirrors:
- localai/localai-backends:latest-cpu-rfdetr
- !!merge <<: *rfdetr
name: "cuda12-rfdetr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-rfdetr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-rfdetr
- !!merge <<: *rfdetr
name: "intel-rfdetr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-rfdetr"
mirrors:
- localai/localai-backends:master-gpu-intel-rfdetr
# - !!merge <<: *rfdetr
# name: "rocm-rfdetr-development"
# uri: "quay.io/go-skynet/local-ai-backends:master-gpu-hipblas-rfdetr"
# mirrors:
# - localai/localai-backends:master-gpu-hipblas-rfdetr
- !!merge <<: *rfdetr
name: "cpu-rfdetr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-rfdetr"
mirrors:
- localai/localai-backends:master-cpu-rfdetr
- !!merge <<: *rfdetr
name: "intel-rfdetr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-rfdetr"
mirrors:
- localai/localai-backends:latest-gpu-intel-rfdetr
- !!merge <<: *rfdetr
name: "cuda13-rfdetr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-rfdetr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-rfdetr
- !!merge <<: *rfdetr
name: "cuda13-rfdetr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-rfdetr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-rfdetr
- !!merge <<: *rfdetr
name: "metal-rfdetr"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-rfdetr"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-rfdetr
- !!merge <<: *rfdetr
name: "metal-rfdetr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-rfdetr"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-rfdetr
## sam3-cpp
- !!merge <<: *sam3cpp
name: "sam3-cpp-development"
capabilities:
default: "cpu-sam3-cpp-development"
nvidia: "cuda12-sam3-cpp-development"
nvidia-cuda-12: "cuda12-sam3-cpp-development"
nvidia-cuda-13: "cuda13-sam3-cpp-development"
nvidia-l4t: "nvidia-l4t-arm64-sam3-cpp-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-sam3-cpp-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-sam3-cpp-development"
intel: "intel-sycl-f32-sam3-cpp-development"
vulkan: "vulkan-sam3-cpp-development"
- !!merge <<: *sam3cpp
name: "cpu-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sam3-cpp"
mirrors:
- localai/localai-backends:latest-cpu-sam3-cpp
- !!merge <<: *sam3cpp
name: "cpu-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sam3-cpp"
mirrors:
- localai/localai-backends:master-cpu-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda12-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda12-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-sam3-cpp
- !!merge <<: *sam3cpp
name: "nvidia-l4t-arm64-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "nvidia-l4t-arm64-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-nvidia-l4t-arm64-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "cuda13-nvidia-l4t-arm64-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-sam3-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-sam3-cpp
- !!merge <<: *sam3cpp
name: "intel-sycl-f32-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-sam3-cpp
- !!merge <<: *sam3cpp
name: "intel-sycl-f32-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-sam3-cpp
- !!merge <<: *sam3cpp
name: "vulkan-sam3-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-sam3-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-sam3-cpp
- !!merge <<: *sam3cpp
name: "vulkan-sam3-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-sam3-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-sam3-cpp
## Rerankers
- !!merge <<: *rerankers
name: "rerankers-development"
capabilities:
nvidia: "cuda12-rerankers-development"
intel: "intel-rerankers-development"
amd: "rocm-rerankers-development"
metal: "metal-rerankers-development"
nvidia-cuda-13: "cuda13-rerankers-development"
- !!merge <<: *rerankers
name: "cuda12-rerankers"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-rerankers"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-rerankers
- !!merge <<: *rerankers
name: "intel-rerankers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-rerankers"
mirrors:
- localai/localai-backends:latest-gpu-intel-rerankers
- !!merge <<: *rerankers
name: "rocm-rerankers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-rerankers"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-rerankers
- !!merge <<: *rerankers
name: "cuda12-rerankers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-rerankers"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-rerankers
- !!merge <<: *rerankers
name: "rocm-rerankers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-rerankers"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-rerankers
- !!merge <<: *rerankers
name: "intel-rerankers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-rerankers"
mirrors:
- localai/localai-backends:master-gpu-intel-rerankers
- !!merge <<: *rerankers
name: "cuda13-rerankers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-rerankers"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-rerankers
- !!merge <<: *rerankers
name: "cuda13-rerankers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-rerankers"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-rerankers
- !!merge <<: *rerankers
name: "metal-rerankers"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-rerankers"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-rerankers
- !!merge <<: *rerankers
name: "metal-rerankers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-rerankers"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-rerankers
feat(backend): add tinygrad multimodal backend (experimental) (#9364) * feat(backend): add tinygrad multimodal backend Wire tinygrad as a new Python backend covering LLM text generation with native tool-call extraction, embeddings, Stable Diffusion 1.x image generation, and Whisper speech-to-text from a single self-contained container. Backend (`backend/python/tinygrad/`): - `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects Llama / Qwen2 / Mistral architecture from `config.json`, supports safetensors and GGUF), Embedding via mean-pooled last hidden state, GenerateImage via the vendored SD1.x pipeline, AudioTranscription + AudioTranscriptionStream via the vendored Whisper inference loop, plus Tokenize / ModelMetadata / Status / Free. - Vendored upstream model code under `vendor/` (MIT, headers preserved): llama.py with an added `qkv_bias` flag for Qwen2-family bias support and an `embed()` method that returns the last hidden state, plus clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf training branch that pulls `mlperf.initializers`), audio_helpers.py and whisper.py (trimmed to drop the pyaudio listener). - Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 / Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral (Mistral / Mixtral). Auto-selected from model architecture or `Options`. - `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the default portable python is 3.10). - `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so tinygrad's CPU device uses the in-process libLLVM JIT instead of shelling out to the missing `clang` binary. - Local unit tests for Health and the four parsers in `test.py`. Build wiring: - Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, `BACKEND_TINYGRAD = tinygrad|python|.|false|true`, docker-build-target eval, and `docker-build-backends` aggregator. - `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix entries (mirrors the transformers backend placement). - `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image entries (latest + development). E2E test wiring: - `tests/e2e-backends/backend_test.go` gains an `image` capability that exercises GenerateImage and asserts a non-empty PNG is written to `dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS` knobs. - Five new make targets next to `test-extra-backend-vllm`: - `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes, mirrors the vllm target 1:1 (5/9 specs in ~57s). - `test-extra-backend-tinygrad-embeddings` — same model, embeddings via LLM hidden state (3/9 in ~10s). - `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror, health/load/image (3/9 in ~10min, 4 diffusion steps on CPU). - `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en against jfk.wav from whisper.cpp samples (4/9 in ~49s). - `test-extra-backend-tinygrad-all` aggregate. All four targets land green on the first MVP pass: 15 specs total, 0 failures across LLM+tools, embeddings, image generation, and speech transcription. * refactor(tinygrad): collapse to a single backend image tinygrad generates its own GPU kernels (PTX renderer for CUDA, the autogen ctypes wrappers for HIP / Metal / WebGPU) and never links against cuDNN, cuBLAS, or any toolkit-version-tied library. The only runtime dependency that varies across hosts is the driver's libcuda.so.1 / libamdhip64.so, which are injected into the container at run time by the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based backends, there is no reason to ship per-CUDA-version images. - Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries from .github/workflows/backend.yml. The sole remaining entry is renamed to -tinygrad (from -cpu-tinygrad) since it is no longer CPU-only. - Collapse backend/index.yaml to a single meta + development pair. The meta anchor carries the latest uri directly; the development entry points at the master tag. - run.sh picks the tinygrad device at launch time by probing /usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX renderer (avoids any nvrtc/toolkit dependency); otherwise we fall back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process libLLVM JIT for the CLANG path. - backend.py's _select_tinygrad_device() is trimmed to a CLANG-only fallback since production device selection happens in run.sh. Re-ran test-extra-backend-tinygrad after the change: Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed
2026-04-15 17:48:23 +00:00
## tinygrad
## Single image — the meta anchor above carries the latest uri directly
## since there is only one variant. The development entry below points at
## the master tag.
- !!merge <<: *tinygrad
name: "tinygrad-development"
uri: "quay.io/go-skynet/local-ai-backends:master-tinygrad"
mirrors:
- localai/localai-backends:master-tinygrad
## Transformers
- !!merge <<: *transformers
name: "transformers-development"
capabilities:
nvidia: "cuda12-transformers-development"
intel: "intel-transformers-development"
amd: "rocm-transformers-development"
metal: "metal-transformers-development"
nvidia-cuda-13: "cuda13-transformers-development"
- !!merge <<: *transformers
name: "cuda12-transformers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-transformers"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-transformers
- !!merge <<: *transformers
name: "rocm-transformers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-transformers"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-transformers
- !!merge <<: *transformers
name: "intel-transformers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-transformers"
mirrors:
- localai/localai-backends:latest-gpu-intel-transformers
- !!merge <<: *transformers
name: "cuda12-transformers-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-transformers"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-transformers
- !!merge <<: *transformers
name: "rocm-transformers-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-transformers"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-transformers
- !!merge <<: *transformers
name: "intel-transformers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-transformers"
mirrors:
- localai/localai-backends:master-gpu-intel-transformers
- !!merge <<: *transformers
name: "cuda13-transformers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-transformers"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-transformers
- !!merge <<: *transformers
name: "cuda13-transformers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-transformers"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-transformers
- !!merge <<: *transformers
name: "metal-transformers"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-transformers"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-transformers
- !!merge <<: *transformers
name: "metal-transformers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-transformers"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-transformers
## Diffusers
- !!merge <<: *diffusers
name: "diffusers-development"
capabilities:
nvidia: "cuda12-diffusers-development"
intel: "intel-diffusers-development"
amd: "rocm-diffusers-development"
nvidia-l4t: "nvidia-l4t-diffusers-development"
metal: "metal-diffusers-development"
default: "cpu-diffusers-development"
nvidia-cuda-13: "cuda13-diffusers-development"
- !!merge <<: *diffusers
name: "cpu-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-diffusers"
mirrors:
- localai/localai-backends:latest-cpu-diffusers
- !!merge <<: *diffusers
name: "cpu-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-diffusers"
mirrors:
- localai/localai-backends:master-cpu-diffusers
- !!merge <<: *diffusers
name: "nvidia-l4t-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-diffusers"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-diffusers
- !!merge <<: *diffusers
name: "nvidia-l4t-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-diffusers"
mirrors:
- localai/localai-backends:master-nvidia-l4t-diffusers
- !!merge <<: *diffusers
name: "cuda13-nvidia-l4t-arm64-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-diffusers"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-diffusers
- !!merge <<: *diffusers
name: "cuda13-nvidia-l4t-arm64-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-diffusers"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-diffusers
- !!merge <<: *diffusers
name: "cuda12-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-diffusers"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-diffusers
- !!merge <<: *diffusers
name: "rocm-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-diffusers"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-diffusers
- !!merge <<: *diffusers
name: "intel-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-diffusers"
mirrors:
- localai/localai-backends:latest-gpu-intel-diffusers
- !!merge <<: *diffusers
name: "cuda12-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-diffusers"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-diffusers
- !!merge <<: *diffusers
name: "rocm-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-diffusers"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-diffusers
- !!merge <<: *diffusers
name: "intel-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-diffusers"
mirrors:
- localai/localai-backends:master-gpu-intel-diffusers
- !!merge <<: *diffusers
name: "cuda13-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-diffusers"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-diffusers
- !!merge <<: *diffusers
name: "cuda13-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-diffusers"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-diffusers
- !!merge <<: *diffusers
name: "metal-diffusers"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-diffusers"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-diffusers
- !!merge <<: *diffusers
name: "metal-diffusers-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-diffusers"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-diffusers
## ace-step
- !!merge <<: *ace-step
name: "cpu-ace-step"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ace-step"
mirrors:
- localai/localai-backends:latest-cpu-ace-step
- !!merge <<: *ace-step
name: "cpu-ace-step-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ace-step"
mirrors:
- localai/localai-backends:master-cpu-ace-step
- !!merge <<: *ace-step
name: "cuda12-ace-step"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-ace-step"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-ace-step
- !!merge <<: *ace-step
name: "cuda12-ace-step-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-ace-step"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-ace-step
- !!merge <<: *ace-step
name: "cuda13-ace-step"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ace-step"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-ace-step
- !!merge <<: *ace-step
name: "cuda13-ace-step-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ace-step"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-ace-step
- !!merge <<: *ace-step
name: "rocm-ace-step"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-ace-step"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-ace-step
- !!merge <<: *ace-step
name: "rocm-ace-step-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-ace-step"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-ace-step
- !!merge <<: *ace-step
name: "intel-ace-step"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-ace-step"
mirrors:
- localai/localai-backends:latest-gpu-intel-ace-step
- !!merge <<: *ace-step
name: "intel-ace-step-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-ace-step"
mirrors:
- localai/localai-backends:master-gpu-intel-ace-step
- !!merge <<: *ace-step
name: "metal-ace-step"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ace-step"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-ace-step
- !!merge <<: *ace-step
name: "metal-ace-step-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ace-step"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-ace-step
## acestep-cpp
- !!merge <<: *acestepcpp
name: "nvidia-l4t-arm64-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-acestep-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-acestep-cpp
- !!merge <<: *acestepcpp
name: "nvidia-l4t-arm64-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-acestep-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-acestep-cpp
- !!merge <<: *acestepcpp
name: "cuda13-nvidia-l4t-arm64-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-acestep-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-acestep-cpp
- !!merge <<: *acestepcpp
name: "cuda13-nvidia-l4t-arm64-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-acestep-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-acestep-cpp
- !!merge <<: *acestepcpp
name: "cpu-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-acestep-cpp"
mirrors:
- localai/localai-backends:latest-cpu-acestep-cpp
- !!merge <<: *acestepcpp
name: "metal-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-acestep-cpp"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-acestep-cpp
- !!merge <<: *acestepcpp
name: "metal-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-acestep-cpp"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-acestep-cpp
- !!merge <<: *acestepcpp
name: "cpu-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-acestep-cpp"
mirrors:
- localai/localai-backends:master-cpu-acestep-cpp
- !!merge <<: *acestepcpp
name: "cuda12-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-acestep-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-acestep-cpp
- !!merge <<: *acestepcpp
name: "rocm-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-acestep-cpp"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-acestep-cpp
- !!merge <<: *acestepcpp
name: "intel-sycl-f32-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-acestep-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-acestep-cpp
- !!merge <<: *acestepcpp
name: "intel-sycl-f16-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-acestep-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-acestep-cpp
- !!merge <<: *acestepcpp
name: "vulkan-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-acestep-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-acestep-cpp
- !!merge <<: *acestepcpp
name: "vulkan-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-acestep-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-acestep-cpp
- !!merge <<: *acestepcpp
name: "cuda12-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-acestep-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-acestep-cpp
- !!merge <<: *acestepcpp
name: "rocm-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-acestep-cpp"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-acestep-cpp
- !!merge <<: *acestepcpp
name: "intel-sycl-f32-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-acestep-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-acestep-cpp
- !!merge <<: *acestepcpp
name: "intel-sycl-f16-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-acestep-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-acestep-cpp
- !!merge <<: *acestepcpp
name: "cuda13-acestep-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-acestep-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-acestep-cpp
- !!merge <<: *acestepcpp
name: "cuda13-acestep-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-acestep-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-acestep-cpp
## qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "nvidia-l4t-arm64-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "nvidia-l4t-arm64-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-nvidia-l4t-arm64-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cpu-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-cpu-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "metal-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "metal-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cpu-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-cpu-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda12-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "rocm-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f32-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f16-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "vulkan-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "vulkan-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-vulkan-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda12-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "rocm-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f32-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "intel-sycl-f16-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-qwen3-tts-cpp"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-qwen3-tts-cpp
- !!merge <<: *qwen3ttscpp
name: "cuda13-qwen3-tts-cpp-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-qwen3-tts-cpp
## kokoro
- !!merge <<: *kokoro
name: "kokoro-development"
capabilities:
nvidia: "cuda12-kokoro-development"
intel: "intel-kokoro-development"
amd: "rocm-kokoro-development"
nvidia-l4t: "nvidia-l4t-kokoro-development"
metal: "metal-kokoro-development"
- !!merge <<: *kokoro
name: "cuda12-kokoro-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-kokoro"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-kokoro
- !!merge <<: *kokoro
name: "rocm-kokoro-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-kokoro"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-kokoro
- !!merge <<: *kokoro
name: "intel-kokoro"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-kokoro"
mirrors:
- localai/localai-backends:latest-gpu-intel-kokoro
- !!merge <<: *kokoro
name: "intel-kokoro-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-kokoro"
mirrors:
- localai/localai-backends:master-gpu-intel-kokoro
- !!merge <<: *kokoro
name: "nvidia-l4t-kokoro"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-kokoro"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-kokoro
- !!merge <<: *kokoro
name: "nvidia-l4t-kokoro-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-kokoro"
mirrors:
- localai/localai-backends:master-nvidia-l4t-kokoro
- !!merge <<: *kokoro
name: "cuda12-kokoro"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-kokoro"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-kokoro
- !!merge <<: *kokoro
name: "rocm-kokoro"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-kokoro"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-kokoro
- !!merge <<: *kokoro
name: "cuda13-kokoro"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-kokoro"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-kokoro
- !!merge <<: *kokoro
name: "cuda13-kokoro-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-kokoro"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-kokoro
- !!merge <<: *kokoro
name: "metal-kokoro"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-kokoro"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-kokoro
- !!merge <<: *kokoro
name: "metal-kokoro-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-kokoro"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-kokoro
## kokoros (Rust)
- !!merge <<: *kokoros
name: "kokoros-development"
capabilities:
default: "cpu-kokoros-development"
- !!merge <<: *kokoros
name: "cpu-kokoros"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-kokoros"
mirrors:
- localai/localai-backends:latest-cpu-kokoros
- !!merge <<: *kokoros
name: "cpu-kokoros-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-kokoros"
mirrors:
- localai/localai-backends:master-cpu-kokoros
## faster-whisper
- !!merge <<: *faster-whisper
name: "faster-whisper-development"
capabilities:
default: "cpu-faster-whisper-development"
nvidia: "cuda12-faster-whisper-development"
intel: "intel-faster-whisper-development"
amd: "rocm-faster-whisper-development"
metal: "metal-faster-whisper-development"
nvidia-cuda-13: "cuda13-faster-whisper-development"
nvidia-l4t: "nvidia-l4t-arm64-faster-whisper-development"
- !!merge <<: *faster-whisper
name: "cuda12-faster-whisper-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-faster-whisper"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-faster-whisper
- !!merge <<: *faster-whisper
name: "rocm-faster-whisper-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-faster-whisper"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-faster-whisper
- !!merge <<: *faster-whisper
name: "intel-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-faster-whisper"
mirrors:
- localai/localai-backends:latest-gpu-intel-faster-whisper
- !!merge <<: *faster-whisper
name: "intel-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-faster-whisper"
mirrors:
- localai/localai-backends:master-gpu-intel-faster-whisper
- !!merge <<: *faster-whisper
name: "cuda13-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-faster-whisper"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-faster-whisper
- !!merge <<: *faster-whisper
name: "cuda13-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-faster-whisper"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-faster-whisper
- !!merge <<: *faster-whisper
name: "metal-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-faster-whisper"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-faster-whisper
- !!merge <<: *faster-whisper
name: "metal-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-faster-whisper"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-faster-whisper
- !!merge <<: *faster-whisper
name: "cuda12-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-faster-whisper"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-faster-whisper
- !!merge <<: *faster-whisper
name: "rocm-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-faster-whisper"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-faster-whisper
- !!merge <<: *faster-whisper
name: "cpu-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-faster-whisper"
mirrors:
- localai/localai-backends:latest-cpu-faster-whisper
- !!merge <<: *faster-whisper
name: "cpu-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-faster-whisper"
mirrors:
- localai/localai-backends:master-cpu-faster-whisper
- !!merge <<: *faster-whisper
name: "nvidia-l4t-arm64-faster-whisper"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-faster-whisper"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-faster-whisper
- !!merge <<: *faster-whisper
name: "nvidia-l4t-arm64-faster-whisper-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-faster-whisper"
mirrors:
- localai/localai-backends:master-nvidia-l4t-faster-whisper
## moonshine
- !!merge <<: *moonshine
name: "moonshine-development"
capabilities:
nvidia: "cuda12-moonshine-development"
default: "cpu-moonshine-development"
nvidia-cuda-13: "cuda13-moonshine-development"
nvidia-cuda-12: "cuda12-moonshine-development"
- !!merge <<: *moonshine
name: "cpu-moonshine"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-moonshine"
mirrors:
- localai/localai-backends:latest-cpu-moonshine
- !!merge <<: *moonshine
name: "cpu-moonshine-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-moonshine"
mirrors:
- localai/localai-backends:master-cpu-moonshine
- !!merge <<: *moonshine
name: "cuda12-moonshine"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-moonshine"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-moonshine
- !!merge <<: *moonshine
name: "cuda12-moonshine-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-moonshine"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-moonshine
- !!merge <<: *moonshine
name: "cuda13-moonshine"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-moonshine"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-moonshine
- !!merge <<: *moonshine
name: "cuda13-moonshine-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-moonshine"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-moonshine
- !!merge <<: *moonshine
name: "metal-moonshine"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-moonshine"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-moonshine
- !!merge <<: *moonshine
name: "metal-moonshine-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-moonshine"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-moonshine
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299) * feat(proto): add speaker field to TranscriptSegment for diarization Add speaker field to the gRPC TranscriptSegment message and map it through the Go schema, enabling backends to return speaker labels. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx backend for transcription with diarization Add Python gRPC backend using WhisperX for speech-to-text with word-level timestamps, forced alignment, and speaker diarization via pyannote-audio when HF_TOKEN is provided. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): register whisperx backend in Makefile Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx meta and image entries to index.yaml Signed-off-by: eureka928 <meobius123@gmail.com> * ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): unpin torch versions and use CPU index for cpu requirements Address review feedback: - Use --extra-index-url for CPU torch wheels to reduce size - Remove torch version pins, let uv resolve compatible versions Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch ROCm variant to fix CI build failure Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch CPU variant to fix uv resolution failure Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra index instead of picking torch==2.8.0+cu128 from PyPI, which pulls unresolvable CUDA dependencies. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure uv's default first-match strategy finds torch on PyPI before checking the extra index, causing it to pick torch==2.8.0+cu128 instead of the CPU variant. This makes whisperx's transitive torch dependency unresolvable. Using unsafe-best-match lets uv consider all indexes. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): drop +cpu local version suffix to fix uv resolution failure PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the issue where uv cannot locate an explicit +cpu local version specifier. This aligns with the pattern used by all other CPU backends. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4, +rocm6.3) in pinned requirements. The --extra-index-url already points to the correct ROCm wheel index and --index-strategy unsafe-best-match (set in libbackend.sh) ensures the ROCm variant is preferred. Applies the same fix as 7f5d72e8 (which resolved this for +cpu) across all 14 hipblas requirements files. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> * revert: scope hipblas suffix fix to whisperx only Reverts changes to non-whisperx hipblas requirements files per maintainer review — other backends are building fine with the +rocm local version suffix. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> --------- Signed-off-by: eureka928 <meobius123@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:33:12 +00:00
## whisperx
- !!merge <<: *whisperx
name: "whisperx-development"
capabilities:
nvidia: "cuda12-whisperx-development"
amd: "rocm-whisperx-development"
metal: "metal-whisperx-development"
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299) * feat(proto): add speaker field to TranscriptSegment for diarization Add speaker field to the gRPC TranscriptSegment message and map it through the Go schema, enabling backends to return speaker labels. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx backend for transcription with diarization Add Python gRPC backend using WhisperX for speech-to-text with word-level timestamps, forced alignment, and speaker diarization via pyannote-audio when HF_TOKEN is provided. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): register whisperx backend in Makefile Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx meta and image entries to index.yaml Signed-off-by: eureka928 <meobius123@gmail.com> * ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): unpin torch versions and use CPU index for cpu requirements Address review feedback: - Use --extra-index-url for CPU torch wheels to reduce size - Remove torch version pins, let uv resolve compatible versions Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch ROCm variant to fix CI build failure Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch CPU variant to fix uv resolution failure Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra index instead of picking torch==2.8.0+cu128 from PyPI, which pulls unresolvable CUDA dependencies. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure uv's default first-match strategy finds torch on PyPI before checking the extra index, causing it to pick torch==2.8.0+cu128 instead of the CPU variant. This makes whisperx's transitive torch dependency unresolvable. Using unsafe-best-match lets uv consider all indexes. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): drop +cpu local version suffix to fix uv resolution failure PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the issue where uv cannot locate an explicit +cpu local version specifier. This aligns with the pattern used by all other CPU backends. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4, +rocm6.3) in pinned requirements. The --extra-index-url already points to the correct ROCm wheel index and --index-strategy unsafe-best-match (set in libbackend.sh) ensures the ROCm variant is preferred. Applies the same fix as 7f5d72e8 (which resolved this for +cpu) across all 14 hipblas requirements files. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> * revert: scope hipblas suffix fix to whisperx only Reverts changes to non-whisperx hipblas requirements files per maintainer review — other backends are building fine with the +rocm local version suffix. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> --------- Signed-off-by: eureka928 <meobius123@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:33:12 +00:00
default: "cpu-whisperx-development"
nvidia-cuda-13: "cuda13-whisperx-development"
nvidia-cuda-12: "cuda12-whisperx-development"
nvidia-l4t: "nvidia-l4t-arm64-whisperx-development"
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299) * feat(proto): add speaker field to TranscriptSegment for diarization Add speaker field to the gRPC TranscriptSegment message and map it through the Go schema, enabling backends to return speaker labels. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx backend for transcription with diarization Add Python gRPC backend using WhisperX for speech-to-text with word-level timestamps, forced alignment, and speaker diarization via pyannote-audio when HF_TOKEN is provided. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): register whisperx backend in Makefile Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx meta and image entries to index.yaml Signed-off-by: eureka928 <meobius123@gmail.com> * ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): unpin torch versions and use CPU index for cpu requirements Address review feedback: - Use --extra-index-url for CPU torch wheels to reduce size - Remove torch version pins, let uv resolve compatible versions Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch ROCm variant to fix CI build failure Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch CPU variant to fix uv resolution failure Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra index instead of picking torch==2.8.0+cu128 from PyPI, which pulls unresolvable CUDA dependencies. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure uv's default first-match strategy finds torch on PyPI before checking the extra index, causing it to pick torch==2.8.0+cu128 instead of the CPU variant. This makes whisperx's transitive torch dependency unresolvable. Using unsafe-best-match lets uv consider all indexes. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): drop +cpu local version suffix to fix uv resolution failure PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the issue where uv cannot locate an explicit +cpu local version specifier. This aligns with the pattern used by all other CPU backends. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4, +rocm6.3) in pinned requirements. The --extra-index-url already points to the correct ROCm wheel index and --index-strategy unsafe-best-match (set in libbackend.sh) ensures the ROCm variant is preferred. Applies the same fix as 7f5d72e8 (which resolved this for +cpu) across all 14 hipblas requirements files. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> * revert: scope hipblas suffix fix to whisperx only Reverts changes to non-whisperx hipblas requirements files per maintainer review — other backends are building fine with the +rocm local version suffix. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> --------- Signed-off-by: eureka928 <meobius123@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:33:12 +00:00
- !!merge <<: *whisperx
name: "cpu-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-whisperx"
mirrors:
- localai/localai-backends:latest-cpu-whisperx
- !!merge <<: *whisperx
name: "cpu-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-whisperx"
mirrors:
- localai/localai-backends:master-cpu-whisperx
- !!merge <<: *whisperx
name: "cuda12-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-whisperx"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-whisperx
- !!merge <<: *whisperx
name: "cuda12-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-whisperx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-whisperx
- !!merge <<: *whisperx
name: "rocm-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-whisperx"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-whisperx
- !!merge <<: *whisperx
name: "rocm-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-whisperx"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-whisperx
- !!merge <<: *whisperx
name: "cuda13-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-whisperx"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-whisperx
- !!merge <<: *whisperx
name: "cuda13-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-whisperx"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-whisperx
- !!merge <<: *whisperx
name: "metal-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-whisperx"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-whisperx
- !!merge <<: *whisperx
name: "metal-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-whisperx"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-whisperx
- !!merge <<: *whisperx
name: "nvidia-l4t-arm64-whisperx"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-whisperx"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-whisperx
- !!merge <<: *whisperx
name: "nvidia-l4t-arm64-whisperx-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-whisperx"
mirrors:
- localai/localai-backends:master-nvidia-l4t-whisperx
## coqui
- !!merge <<: *coqui
name: "coqui-development"
capabilities:
nvidia: "cuda12-coqui-development"
intel: "intel-coqui-development"
amd: "rocm-coqui-development"
metal: "metal-coqui-development"
- !!merge <<: *coqui
name: "cuda12-coqui"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-coqui"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-coqui
- !!merge <<: *coqui
name: "cuda12-coqui-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-coqui"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-coqui
- !!merge <<: *coqui
name: "rocm-coqui-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-coqui"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-coqui
- !!merge <<: *coqui
name: "intel-coqui"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-coqui"
mirrors:
- localai/localai-backends:latest-gpu-intel-coqui
- !!merge <<: *coqui
name: "intel-coqui-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-coqui"
mirrors:
- localai/localai-backends:master-gpu-intel-coqui
- !!merge <<: *coqui
name: "rocm-coqui"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-coqui"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-coqui
- !!merge <<: *coqui
name: "metal-coqui"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-coqui"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-coqui
- !!merge <<: *coqui
name: "metal-coqui-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-coqui"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-coqui
## outetts
- !!merge <<: *outetts
name: "outetts-development"
capabilities:
default: "cpu-outetts-development"
nvidia-cuda-12: "cuda12-outetts-development"
- !!merge <<: *outetts
name: "cpu-outetts"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-outetts"
mirrors:
- localai/localai-backends:latest-cpu-outetts
- !!merge <<: *outetts
name: "cpu-outetts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-outetts"
mirrors:
- localai/localai-backends:master-cpu-outetts
- !!merge <<: *outetts
name: "cuda12-outetts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-outetts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-outetts
- !!merge <<: *outetts
name: "cuda12-outetts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-outetts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-outetts
## chatterbox
- !!merge <<: *chatterbox
name: "chatterbox-development"
capabilities:
nvidia: "cuda12-chatterbox-development"
metal: "metal-chatterbox-development"
default: "cpu-chatterbox-development"
nvidia-l4t: "nvidia-l4t-arm64-chatterbox"
nvidia-cuda-13: "cuda13-chatterbox-development"
nvidia-cuda-12: "cuda12-chatterbox-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-chatterbox"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-chatterbox-development"
- !!merge <<: *chatterbox
name: "cpu-chatterbox"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-chatterbox"
mirrors:
- localai/localai-backends:latest-cpu-chatterbox
- !!merge <<: *chatterbox
name: "cpu-chatterbox-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-chatterbox"
mirrors:
- localai/localai-backends:master-cpu-chatterbox
- !!merge <<: *chatterbox
name: "nvidia-l4t-arm64-chatterbox"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-chatterbox"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-chatterbox
- !!merge <<: *chatterbox
name: "nvidia-l4t-arm64-chatterbox-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-chatterbox"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-chatterbox
- !!merge <<: *chatterbox
name: "metal-chatterbox"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-chatterbox"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-chatterbox
- !!merge <<: *chatterbox
name: "metal-chatterbox-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-chatterbox"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-chatterbox
- !!merge <<: *chatterbox
name: "cuda12-chatterbox-development"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-chatterbox"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-chatterbox
- !!merge <<: *chatterbox
name: "cuda12-chatterbox"
feat: Add backend gallery (#5607) * feat: Add backend gallery This PR add support to manage backends as similar to models. There is now available a backend gallery which can be used to install and remove extra backends. The backend gallery can be configured similarly as a model gallery, and API calls allows to install and remove new backends in runtime, and as well during the startup phase of LocalAI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add backends docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * wip: Backend Dockerfile for python backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: drop extras images, build python backends separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup on all backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Tweaks Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop old backends leftovers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Move dockerfile upper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix proto Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Feature dropped for consistency - we prefer model galleries Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing packages in the build image Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * exllama is ponly available on cublas Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pin torch on chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups to index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Debug CI * Install accellerators deps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add target arch * Add cuda minor version Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use self-hosted runners Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: use quay for test images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups for vllm and chatterbox Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Small fixups on CI Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chatterbox is only available for nvidia Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Simplify CI builds Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt test, use qwen3 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(model gallery): add jina-reranker-v1-tiny-en-gguf Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gguf-parser): recover from potential panics that can happen while reading ggufs with gguf-parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use reranker from llama.cpp in AIO images Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Limit concurrent jobs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2025-06-15 12:56:52 +00:00
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-chatterbox"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-chatterbox
- !!merge <<: *chatterbox
name: "cuda13-chatterbox"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-chatterbox"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-chatterbox
- !!merge <<: *chatterbox
name: "cuda13-chatterbox-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-chatterbox"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-chatterbox
- !!merge <<: *chatterbox
name: "cuda13-nvidia-l4t-arm64-chatterbox"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-chatterbox"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-chatterbox
- !!merge <<: *chatterbox
name: "cuda13-nvidia-l4t-arm64-chatterbox-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-chatterbox"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-chatterbox
## vibevoice
- !!merge <<: *vibevoice
name: "vibevoice-development"
capabilities:
nvidia: "cuda12-vibevoice-development"
intel: "intel-vibevoice-development"
amd: "rocm-vibevoice-development"
nvidia-l4t: "nvidia-l4t-vibevoice-development"
metal: "metal-vibevoice-development"
default: "cpu-vibevoice-development"
nvidia-cuda-13: "cuda13-vibevoice-development"
nvidia-cuda-12: "cuda12-vibevoice-development"
nvidia-l4t-cuda-12: "nvidia-l4t-vibevoice-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice-development"
- !!merge <<: *vibevoice
name: "cpu-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-vibevoice"
mirrors:
- localai/localai-backends:latest-cpu-vibevoice
- !!merge <<: *vibevoice
name: "cpu-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vibevoice"
mirrors:
- localai/localai-backends:master-cpu-vibevoice
- !!merge <<: *vibevoice
name: "cuda12-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-vibevoice"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-vibevoice
- !!merge <<: *vibevoice
name: "cuda12-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-vibevoice"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-vibevoice
- !!merge <<: *vibevoice
name: "cuda13-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-vibevoice"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-vibevoice
- !!merge <<: *vibevoice
name: "cuda13-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-vibevoice"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-vibevoice
- !!merge <<: *vibevoice
name: "intel-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-vibevoice"
mirrors:
- localai/localai-backends:latest-gpu-intel-vibevoice
- !!merge <<: *vibevoice
name: "intel-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-vibevoice"
mirrors:
- localai/localai-backends:master-gpu-intel-vibevoice
- !!merge <<: *vibevoice
name: "rocm-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-vibevoice"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-vibevoice
- !!merge <<: *vibevoice
name: "rocm-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-vibevoice"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-vibevoice
- !!merge <<: *vibevoice
name: "nvidia-l4t-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-vibevoice"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-vibevoice
- !!merge <<: *vibevoice
name: "nvidia-l4t-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-vibevoice"
mirrors:
- localai/localai-backends:master-nvidia-l4t-vibevoice
- !!merge <<: *vibevoice
name: "cuda13-nvidia-l4t-arm64-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-vibevoice
- !!merge <<: *vibevoice
name: "cuda13-nvidia-l4t-arm64-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-vibevoice
- !!merge <<: *vibevoice
name: "metal-vibevoice"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-vibevoice"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-vibevoice
- !!merge <<: *vibevoice
name: "metal-vibevoice-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-vibevoice
## qwen-tts
- !!merge <<: *qwen-tts
name: "qwen-tts-development"
capabilities:
nvidia: "cuda12-qwen-tts-development"
intel: "intel-qwen-tts-development"
amd: "rocm-qwen-tts-development"
nvidia-l4t: "nvidia-l4t-qwen-tts-development"
metal: "metal-qwen-tts-development"
default: "cpu-qwen-tts-development"
nvidia-cuda-13: "cuda13-qwen-tts-development"
nvidia-cuda-12: "cuda12-qwen-tts-development"
nvidia-l4t-cuda-12: "nvidia-l4t-qwen-tts-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen-tts-development"
- !!merge <<: *qwen-tts
name: "cpu-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-qwen-tts"
mirrors:
- localai/localai-backends:latest-cpu-qwen-tts
- !!merge <<: *qwen-tts
name: "cpu-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-qwen-tts"
mirrors:
- localai/localai-backends:master-cpu-qwen-tts
- !!merge <<: *qwen-tts
name: "cuda12-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-qwen-tts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-qwen-tts
- !!merge <<: *qwen-tts
name: "cuda12-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-qwen-tts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-qwen-tts
- !!merge <<: *qwen-tts
name: "cuda13-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-qwen-tts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-qwen-tts
- !!merge <<: *qwen-tts
name: "cuda13-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen-tts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-qwen-tts
- !!merge <<: *qwen-tts
name: "intel-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-qwen-tts"
mirrors:
- localai/localai-backends:latest-gpu-intel-qwen-tts
- !!merge <<: *qwen-tts
name: "intel-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-qwen-tts"
mirrors:
- localai/localai-backends:master-gpu-intel-qwen-tts
- !!merge <<: *qwen-tts
name: "rocm-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-qwen-tts"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-qwen-tts
- !!merge <<: *qwen-tts
name: "rocm-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-qwen-tts"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-qwen-tts
- !!merge <<: *qwen-tts
name: "nvidia-l4t-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-qwen-tts"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-qwen-tts
- !!merge <<: *qwen-tts
name: "nvidia-l4t-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-qwen-tts"
mirrors:
- localai/localai-backends:master-nvidia-l4t-qwen-tts
- !!merge <<: *qwen-tts
name: "cuda13-nvidia-l4t-arm64-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen-tts"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen-tts
- !!merge <<: *qwen-tts
name: "cuda13-nvidia-l4t-arm64-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-qwen-tts"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-qwen-tts
- !!merge <<: *qwen-tts
name: "metal-qwen-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-qwen-tts"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-qwen-tts
- !!merge <<: *qwen-tts
name: "metal-qwen-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-qwen-tts"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-qwen-tts
## fish-speech
- !!merge <<: *fish-speech
name: "fish-speech-development"
capabilities:
nvidia: "cuda12-fish-speech-development"
intel: "intel-fish-speech-development"
amd: "rocm-fish-speech-development"
nvidia-l4t: "nvidia-l4t-fish-speech-development"
metal: "metal-fish-speech-development"
default: "cpu-fish-speech-development"
nvidia-cuda-13: "cuda13-fish-speech-development"
nvidia-cuda-12: "cuda12-fish-speech-development"
nvidia-l4t-cuda-12: "nvidia-l4t-fish-speech-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-fish-speech-development"
- !!merge <<: *fish-speech
name: "cpu-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-fish-speech"
mirrors:
- localai/localai-backends:latest-cpu-fish-speech
- !!merge <<: *fish-speech
name: "cpu-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-fish-speech"
mirrors:
- localai/localai-backends:master-cpu-fish-speech
- !!merge <<: *fish-speech
name: "cuda12-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-fish-speech"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-fish-speech
- !!merge <<: *fish-speech
name: "cuda12-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-fish-speech"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-fish-speech
- !!merge <<: *fish-speech
name: "cuda13-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-fish-speech"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-fish-speech
- !!merge <<: *fish-speech
name: "cuda13-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-fish-speech"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-fish-speech
- !!merge <<: *fish-speech
name: "intel-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-fish-speech"
mirrors:
- localai/localai-backends:latest-gpu-intel-fish-speech
- !!merge <<: *fish-speech
name: "intel-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-fish-speech"
mirrors:
- localai/localai-backends:master-gpu-intel-fish-speech
- !!merge <<: *fish-speech
name: "rocm-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-fish-speech"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-fish-speech
- !!merge <<: *fish-speech
name: "rocm-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-fish-speech"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-fish-speech
- !!merge <<: *fish-speech
name: "nvidia-l4t-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-fish-speech"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-fish-speech
- !!merge <<: *fish-speech
name: "nvidia-l4t-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-fish-speech"
mirrors:
- localai/localai-backends:master-nvidia-l4t-fish-speech
- !!merge <<: *fish-speech
name: "cuda13-nvidia-l4t-arm64-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-fish-speech"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-fish-speech
- !!merge <<: *fish-speech
name: "cuda13-nvidia-l4t-arm64-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-fish-speech"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-fish-speech
- !!merge <<: *fish-speech
name: "metal-fish-speech"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-fish-speech"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-fish-speech
- !!merge <<: *fish-speech
name: "metal-fish-speech-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-fish-speech"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-fish-speech
## faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "faster-qwen3-tts-development"
capabilities:
nvidia: "cuda12-faster-qwen3-tts-development"
default: "cuda12-faster-qwen3-tts-development"
nvidia-cuda-13: "cuda13-faster-qwen3-tts-development"
nvidia-cuda-12: "cuda12-faster-qwen3-tts-development"
nvidia-l4t: "nvidia-l4t-faster-qwen3-tts-development"
nvidia-l4t-cuda-12: "nvidia-l4t-faster-qwen3-tts-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-faster-qwen3-tts-development"
- !!merge <<: *faster-qwen3-tts
name: "cuda12-faster-qwen3-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-faster-qwen3-tts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "cuda12-faster-qwen3-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-faster-qwen3-tts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "cuda13-faster-qwen3-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-faster-qwen3-tts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "cuda13-faster-qwen3-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-faster-qwen3-tts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "nvidia-l4t-faster-qwen3-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-faster-qwen3-tts"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "nvidia-l4t-faster-qwen3-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-faster-qwen3-tts"
mirrors:
- localai/localai-backends:master-nvidia-l4t-faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "cuda13-nvidia-l4t-arm64-faster-qwen3-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-faster-qwen3-tts"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-faster-qwen3-tts
- !!merge <<: *faster-qwen3-tts
name: "cuda13-nvidia-l4t-arm64-faster-qwen3-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-faster-qwen3-tts"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-faster-qwen3-tts
## qwen-asr
- !!merge <<: *qwen-asr
name: "qwen-asr-development"
capabilities:
nvidia: "cuda12-qwen-asr-development"
intel: "intel-qwen-asr-development"
amd: "rocm-qwen-asr-development"
nvidia-l4t: "nvidia-l4t-qwen-asr-development"
metal: "metal-qwen-asr-development"
default: "cpu-qwen-asr-development"
nvidia-cuda-13: "cuda13-qwen-asr-development"
nvidia-cuda-12: "cuda12-qwen-asr-development"
nvidia-l4t-cuda-12: "nvidia-l4t-qwen-asr-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-qwen-asr-development"
- !!merge <<: *qwen-asr
name: "cpu-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-qwen-asr"
mirrors:
- localai/localai-backends:latest-cpu-qwen-asr
- !!merge <<: *qwen-asr
name: "cpu-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-qwen-asr"
mirrors:
- localai/localai-backends:master-cpu-qwen-asr
- !!merge <<: *qwen-asr
name: "cuda12-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-qwen-asr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-qwen-asr
- !!merge <<: *qwen-asr
name: "cuda12-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-qwen-asr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-qwen-asr
- !!merge <<: *qwen-asr
name: "cuda13-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-qwen-asr"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-qwen-asr
- !!merge <<: *qwen-asr
name: "cuda13-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-qwen-asr"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-qwen-asr
- !!merge <<: *qwen-asr
name: "intel-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-qwen-asr"
mirrors:
- localai/localai-backends:latest-gpu-intel-qwen-asr
- !!merge <<: *qwen-asr
name: "intel-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-qwen-asr"
mirrors:
- localai/localai-backends:master-gpu-intel-qwen-asr
- !!merge <<: *qwen-asr
name: "rocm-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-qwen-asr"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-qwen-asr
- !!merge <<: *qwen-asr
name: "rocm-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-qwen-asr"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-qwen-asr
- !!merge <<: *qwen-asr
name: "nvidia-l4t-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-qwen-asr"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-qwen-asr
- !!merge <<: *qwen-asr
name: "nvidia-l4t-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-qwen-asr"
mirrors:
- localai/localai-backends:master-nvidia-l4t-qwen-asr
- !!merge <<: *qwen-asr
name: "cuda13-nvidia-l4t-arm64-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen-asr"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-qwen-asr
- !!merge <<: *qwen-asr
name: "cuda13-nvidia-l4t-arm64-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-qwen-asr"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-qwen-asr
- !!merge <<: *qwen-asr
name: "metal-qwen-asr"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-qwen-asr"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-qwen-asr
- !!merge <<: *qwen-asr
name: "metal-qwen-asr-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-qwen-asr"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-qwen-asr
## nemo
- !!merge <<: *nemo
name: "nemo-development"
capabilities:
nvidia: "cuda12-nemo-development"
intel: "intel-nemo-development"
amd: "rocm-nemo-development"
metal: "metal-nemo-development"
default: "cpu-nemo-development"
nvidia-cuda-13: "cuda13-nemo-development"
nvidia-cuda-12: "cuda12-nemo-development"
- !!merge <<: *nemo
name: "cpu-nemo"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-nemo"
mirrors:
- localai/localai-backends:latest-cpu-nemo
- !!merge <<: *nemo
name: "cpu-nemo-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-nemo"
mirrors:
- localai/localai-backends:master-cpu-nemo
- !!merge <<: *nemo
name: "cuda12-nemo"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-nemo"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-nemo
- !!merge <<: *nemo
name: "cuda12-nemo-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-nemo"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-nemo
- !!merge <<: *nemo
name: "cuda13-nemo"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-nemo"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-nemo
- !!merge <<: *nemo
name: "cuda13-nemo-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-nemo"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-nemo
- !!merge <<: *nemo
name: "intel-nemo"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-nemo"
mirrors:
- localai/localai-backends:latest-gpu-intel-nemo
- !!merge <<: *nemo
name: "intel-nemo-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-nemo"
mirrors:
- localai/localai-backends:master-gpu-intel-nemo
- !!merge <<: *nemo
name: "rocm-nemo"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-nemo"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-nemo
- !!merge <<: *nemo
name: "rocm-nemo-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-nemo"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-nemo
- !!merge <<: *nemo
name: "metal-nemo"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-nemo"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-nemo
- !!merge <<: *nemo
name: "metal-nemo-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-nemo"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-nemo
## voxcpm
- !!merge <<: *voxcpm
name: "voxcpm-development"
capabilities:
nvidia: "cuda12-voxcpm-development"
intel: "intel-voxcpm-development"
amd: "rocm-voxcpm-development"
metal: "metal-voxcpm-development"
default: "cpu-voxcpm-development"
nvidia-cuda-13: "cuda13-voxcpm-development"
nvidia-cuda-12: "cuda12-voxcpm-development"
- !!merge <<: *voxcpm
name: "cpu-voxcpm"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-voxcpm"
mirrors:
- localai/localai-backends:latest-cpu-voxcpm
- !!merge <<: *voxcpm
name: "cpu-voxcpm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-voxcpm"
mirrors:
- localai/localai-backends:master-cpu-voxcpm
- !!merge <<: *voxcpm
name: "cuda12-voxcpm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-voxcpm"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-voxcpm
- !!merge <<: *voxcpm
name: "cuda12-voxcpm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-voxcpm"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-voxcpm
- !!merge <<: *voxcpm
name: "cuda13-voxcpm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-voxcpm"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-voxcpm
- !!merge <<: *voxcpm
name: "cuda13-voxcpm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-voxcpm"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-voxcpm
- !!merge <<: *voxcpm
name: "intel-voxcpm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-voxcpm"
mirrors:
- localai/localai-backends:latest-gpu-intel-voxcpm
- !!merge <<: *voxcpm
name: "intel-voxcpm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-voxcpm"
mirrors:
- localai/localai-backends:master-gpu-intel-voxcpm
- !!merge <<: *voxcpm
name: "rocm-voxcpm"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-voxcpm"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-voxcpm
- !!merge <<: *voxcpm
name: "rocm-voxcpm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-voxcpm"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-voxcpm
- !!merge <<: *voxcpm
name: "metal-voxcpm"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-voxcpm"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-voxcpm
- !!merge <<: *voxcpm
name: "metal-voxcpm-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-voxcpm"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-voxcpm
## pocket-tts
- !!merge <<: *pocket-tts
name: "pocket-tts-development"
capabilities:
nvidia: "cuda12-pocket-tts-development"
intel: "intel-pocket-tts-development"
amd: "rocm-pocket-tts-development"
nvidia-l4t: "nvidia-l4t-pocket-tts-development"
metal: "metal-pocket-tts-development"
default: "cpu-pocket-tts-development"
nvidia-cuda-13: "cuda13-pocket-tts-development"
nvidia-cuda-12: "cuda12-pocket-tts-development"
nvidia-l4t-cuda-12: "nvidia-l4t-pocket-tts-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-pocket-tts-development"
- !!merge <<: *pocket-tts
name: "cpu-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-pocket-tts"
mirrors:
- localai/localai-backends:latest-cpu-pocket-tts
- !!merge <<: *pocket-tts
name: "cpu-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-pocket-tts"
mirrors:
- localai/localai-backends:master-cpu-pocket-tts
- !!merge <<: *pocket-tts
name: "cuda12-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-pocket-tts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-pocket-tts
- !!merge <<: *pocket-tts
name: "cuda12-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-pocket-tts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-pocket-tts
- !!merge <<: *pocket-tts
name: "cuda13-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-pocket-tts"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-pocket-tts
- !!merge <<: *pocket-tts
name: "cuda13-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-pocket-tts"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-pocket-tts
- !!merge <<: *pocket-tts
name: "intel-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-pocket-tts"
mirrors:
- localai/localai-backends:latest-gpu-intel-pocket-tts
- !!merge <<: *pocket-tts
name: "intel-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-pocket-tts"
mirrors:
- localai/localai-backends:master-gpu-intel-pocket-tts
- !!merge <<: *pocket-tts
name: "rocm-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-pocket-tts"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-pocket-tts
- !!merge <<: *pocket-tts
name: "rocm-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-pocket-tts"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-pocket-tts
- !!merge <<: *pocket-tts
name: "nvidia-l4t-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-pocket-tts"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-pocket-tts
- !!merge <<: *pocket-tts
name: "nvidia-l4t-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-pocket-tts"
mirrors:
- localai/localai-backends:master-nvidia-l4t-pocket-tts
- !!merge <<: *pocket-tts
name: "cuda13-nvidia-l4t-arm64-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-pocket-tts"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-pocket-tts
- !!merge <<: *pocket-tts
name: "cuda13-nvidia-l4t-arm64-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-pocket-tts"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-pocket-tts
- !!merge <<: *pocket-tts
name: "metal-pocket-tts"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-pocket-tts"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-pocket-tts
- !!merge <<: *pocket-tts
name: "metal-pocket-tts-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-pocket-tts"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-pocket-tts
## voxtral
- !!merge <<: *voxtral
name: "cpu-voxtral"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-voxtral"
mirrors:
- localai/localai-backends:latest-cpu-voxtral
- !!merge <<: *voxtral
name: "cpu-voxtral-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-voxtral"
mirrors:
- localai/localai-backends:master-cpu-voxtral
- !!merge <<: *voxtral
name: "metal-voxtral"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-voxtral"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-voxtral
- !!merge <<: *voxtral
name: "metal-voxtral-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-voxtral"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-voxtral
- &trl
name: "trl"
alias: "trl"
license: apache-2.0
description: |
HuggingFace TRL fine-tuning backend. Supports SFT, DPO, GRPO, RLOO, Reward, KTO, ORPO training methods.
Works on CPU and GPU.
urls:
- https://github.com/huggingface/trl
tags:
- fine-tuning
- LLM
- CPU
- GPU
- CUDA
capabilities:
default: "cpu-trl"
nvidia: "cuda12-trl"
nvidia-cuda-12: "cuda12-trl"
nvidia-cuda-13: "cuda13-trl"
## TRL backend images
- !!merge <<: *trl
name: "cpu-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-trl"
mirrors:
- localai/localai-backends:latest-cpu-trl
- !!merge <<: *trl
name: "cpu-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-trl"
mirrors:
- localai/localai-backends:master-cpu-trl
- !!merge <<: *trl
name: "cuda12-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cublas-cuda12-trl"
mirrors:
- localai/localai-backends:latest-cublas-cuda12-trl
- !!merge <<: *trl
name: "cuda12-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cublas-cuda12-trl"
mirrors:
- localai/localai-backends:master-cublas-cuda12-trl
- !!merge <<: *trl
name: "cuda13-trl"
uri: "quay.io/go-skynet/local-ai-backends:latest-cublas-cuda13-trl"
mirrors:
- localai/localai-backends:latest-cublas-cuda13-trl
- !!merge <<: *trl
name: "cuda13-trl-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cublas-cuda13-trl"
mirrors:
- localai/localai-backends:master-cublas-cuda13-trl
## llama.cpp quantization backend
- &llama-cpp-quantization
name: "llama-cpp-quantization"
alias: "llama-cpp-quantization"
license: mit
icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
description: |
Model quantization backend using llama.cpp. Downloads HuggingFace models, converts them to GGUF format,
and quantizes them to various formats (q4_k_m, q5_k_m, q8_0, f16, etc.).
urls:
- https://github.com/ggml-org/llama.cpp
tags:
- quantization
- GGUF
- CPU
capabilities:
default: "cpu-llama-cpp-quantization"
metal: "metal-darwin-arm64-llama-cpp-quantization"
- !!merge <<: *llama-cpp-quantization
name: "cpu-llama-cpp-quantization"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-quantization"
mirrors:
- localai/localai-backends:latest-cpu-llama-cpp-quantization
- !!merge <<: *llama-cpp-quantization
name: "metal-darwin-arm64-llama-cpp-quantization"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-llama-cpp-quantization"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-llama-cpp-quantization