LocalAI

mirror of https://github.com/mudler/LocalAI synced 2026-05-24 09:28:23 +00:00

LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

Find a file

LocalAI [bot] bc4cd3dd85 Some checks are pending build backend container images / generate-matrix (push) Waiting to run Details build backend container images / backend-jobs-multiarch (push) Blocked by required conditions Details build backend container images / backend-jobs-singlearch (push) Blocked by required conditions Details build backend container images / backend-merge-jobs-multiarch (push) Blocked by required conditions Details build backend container images / backend-merge-jobs-singlearch (push) Blocked by required conditions Details build backend container images / backend-jobs-darwin (push) Blocked by required conditions Details Build test / build-test (push) Waiting to run Details Build test / launcher-build-darwin (push) Waiting to run Details Build test / launcher-build-linux (push) Waiting to run Details Explorer deployment / build-linux (push) Waiting to run Details GPU tests / ubuntu-latest (1.21.x) (push) Waiting to run Details generate and publish intel docker caches / generate_caches (intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04, linux/amd64, arc-runner-set) (push) Waiting to run Details Deploy docs to GitHub Pages / build (push) Waiting to run Details Deploy docs to GitHub Pages / deploy (push) Blocked by required conditions Details build container images / hipblas-jobs (rocm/dev-ubuntu-24.04:7.2.1, hipblas, --jobs=3 --output-sync=target, linux/amd64, ubuntu-latest, auto, -gpu-hipblas, noble, 2404) (push) Waiting to run Details build container images / core-image-build (intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04, intel, --jobs=3 --output-sync=target, linux/amd64, ubuntu-latest, auto, -gpu-intel, noble, 2404) (push) Waiting to run Details build container images / core-image-build (ubuntu:22.04, cublas, 13, 0, --jobs=4 --output-sync=target, linux/amd64, ubuntu-latest, false, auto, -gpu-nvidia-cuda-13, noble, 2404) (push) Waiting to run Details build container images / core-image-build (ubuntu:24.04, , --jobs=4 --output-sync=target, amd64, linux/amd64, ubuntu-latest, false, auto, , noble, 2404) (push) Waiting to run Details build container images / core-image-build (ubuntu:24.04, , --jobs=4 --output-sync=target, arm64, linux/arm64, ubuntu-24.04-arm, false, auto, , noble, 2404) (push) Waiting to run Details build container images / core-image-build (ubuntu:24.04, cublas, 12, 8, --jobs=4 --output-sync=target, linux/amd64, ubuntu-latest, false, auto, -gpu-nvidia-cuda-12, noble, 2404) (push) Waiting to run Details build container images / core-image-build (ubuntu:24.04, vulkan, --jobs=4 --output-sync=target, amd64, linux/amd64, ubuntu-latest, false, auto, -gpu-vulkan, noble, 2404) (push) Waiting to run Details build container images / core-image-build (ubuntu:24.04, vulkan, --jobs=4 --output-sync=target, arm64, linux/arm64, ubuntu-24.04-arm, false, auto, -gpu-vulkan, noble, 2404) (push) Waiting to run Details build container images / core-image-merge (push) Blocked by required conditions Details build container images / gpu-vulkan-image-merge (push) Blocked by required conditions Details build container images / gh-runner (nvcr.io/nvidia/l4t-jetpack:r36.4.0, cublas, 12, 0, --jobs=4 --output-sync=target, linux/arm64, ubuntu-24.04-arm, true, auto, -nvidia-l4t-arm64, jammy, 2204) (push) Waiting to run Details build container images / gh-runner (ubuntu:24.04, cublas, 13, 0, --jobs=4 --output-sync=target, linux/arm64, ubuntu-24.04-arm, false, auto, -nvidia-l4t-arm64-cuda-13, noble, 2404) (push) Waiting to run Details lint / golangci-lint (push) Waiting to run Details Security Scan / tests (push) Waiting to run Details Tests extras backends / tests-qwen-tts (push) Blocked by required conditions Details Tests extras backends / tests-qwen-asr (push) Blocked by required conditions Details Tests extras backends / tests-nemo (push) Blocked by required conditions Details Tests extras backends / tests-voxcpm (push) Blocked by required conditions Details Tests extras backends / tests-ik-llama-cpp-grpc (push) Blocked by required conditions Details Tests extras backends / tests-turboquant-grpc (push) Blocked by required conditions Details Tests extras backends / tests-diffusers (push) Blocked by required conditions Details Tests extras backends / tests-coqui (push) Blocked by required conditions Details Tests extras backends / tests-moonshine (push) Blocked by required conditions Details Tests extras backends / tests-pocket-tts (push) Blocked by required conditions Details Tests extras backends / tests-whisper-grpc-transcription (push) Blocked by required conditions Details Tests extras backends / detect-changes (push) Waiting to run Details Tests extras backends / tests-transformers (push) Blocked by required conditions Details Tests extras backends / tests-rerankers (push) Blocked by required conditions Details Tests extras backends / tests-sherpa-onnx-grpc-tts (push) Blocked by required conditions Details Tests extras backends / tests-llama-cpp-quantization (push) Blocked by required conditions Details Tests extras backends / tests-llama-cpp-grpc (push) Blocked by required conditions Details Tests extras backends / tests-llama-cpp-grpc-transcription (push) Blocked by required conditions Details Tests extras backends / tests-llama-cpp-smoke (push) Waiting to run Details Tests extras backends / tests-sherpa-onnx-realtime (push) Blocked by required conditions Details Tests extras backends / tests-sherpa-onnx-grpc-transcription (push) Blocked by required conditions Details Tests extras backends / tests-acestep-cpp (push) Blocked by required conditions Details Tests extras backends / tests-qwen3-tts-cpp (push) Blocked by required conditions Details Tests extras backends / tests-vibevoice-cpp (push) Blocked by required conditions Details Tests extras backends / tests-vibevoice-cpp-grpc-tts (push) Blocked by required conditions Details Tests extras backends / tests-vibevoice-cpp-grpc-transcription (push) Blocked by required conditions Details Tests extras backends / tests-localvqe-grpc-transform (push) Blocked by required conditions Details Tests extras backends / tests-voxtral (push) Blocked by required conditions Details E2E Backend Tests / tests-e2e-backend (1.25.x) (push) Waiting to run Details Tests extras backends / tests-kokoros (push) Blocked by required conditions Details Tests extras backends / tests-insightface-grpc (push) Blocked by required conditions Details Tests extras backends / tests-speaker-recognition-grpc (push) Blocked by required conditions Details tests / tests-linux (1.26.x) (push) Waiting to run Details tests / tests-apple (1.26.x) (push) Waiting to run Details tests-aio / tests-aio (push) Waiting to run Details UI E2E Tests / tests-ui-e2e (1.26.x) (push) Waiting to run Details feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server, expose new spec-decoding options (#9765 ) * chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994 Picks up the upstream `spec : parallel drafting support` change (ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API and `server_context_impl`. Adapt the grpc-server wrapper accordingly: * `common_params_speculative::type` (single enum) became `types` (`std::vector<common_speculative_type>`). Update both the "default to draft when a draft model is set" branch and the `spec_type`/`speculative_type` option parser. The parser now also tolerates comma-separated lists, mirroring the upstream `common_speculative_types_from_names` semantics. * `common_params_speculative_draft::n_ctx` is gone (draft now shares the target context size). Keep the `draft_ctx_size` option name for backward compatibility and ignore the value rather than failing. * `server_context_impl::model` was renamed to `model_tgt`; update the two reranker / model-metadata call sites. Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp target locally. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(llama-cpp): expose new speculative-decoding option keys Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and beefs up the draft-model knobs. The previous bump only adapted the API; this exposes the new fields through the grpc-server options dictionary so model configs can drive them. New `options:` keys (all under `backend: llama-cpp`): ngram_mod (`ngram_mod` type): spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match ngram_map_k (`ngram_map_k` type): spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits ngram_map_k4v (`ngram_map_k4v` type): spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits ngram lookup caches (`ngram_cache` type): spec_lookup_cache_static / lookup_cache_static spec_lookup_cache_dynamic / lookup_cache_dynamic Draft-model tuning (active when `spec_type` is `draft`): draft_cache_type_k / spec_draft_cache_type_k draft_cache_type_v / spec_draft_cache_type_v draft_threads / spec_draft_threads draft_threads_batch / spec_draft_threads_batch draft_cpu_moe / spec_draft_cpu_moe (bool flag) draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU) draft_override_tensor / spec_draft_override_tensor (comma-separated <tensor regex>=<buffer type>; re-implements upstream's static parse_tensor_buffer_overrides since it isn't exported) `spec_type` already accepted comma-separated lists after the previous commit, matching upstream's `common_speculative_types_from_names`. Docs: refresh `docs/content/advanced/model-configuration.md` with per-family tables and a note about multi-type chaining. Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp AVX variant). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile, which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh on the copy. The fork branched before the API refactor, so it errors out on: * `ctx_server.impl->model_tgt` (fork still has `model`) * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.` (none of these sub-structs exist in the fork) `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads, tensor_buft_overrides}` (fork uses the pre-#22397 flat layout) * `params.speculative.types` vector / `common_speculative_types_from_names` (fork has a scalar `type` and only the singular helper) Approach: 1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]` discriminations (the "default to draft when a draft model is set" branch and the `spec_type` / `speculative_type` option parser) fall back to the singular scalar form, and the entire new-option block (ngram_mod / map_k / map_k4v / ngram_cache / draft.{cache_type_, cpuparams, tensor_buft_overrides}) is preprocessed out. The macro is not defined in the source tree — stock llama-cpp builds get the full new API. 2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp: - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model` - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first `#include`, so the guarded blocks above drop out for the fork build. Both patches are idempotent and follow the existing sed/awk pattern in this script (KV cache types, `get_media_marker`, flat speculative renames). Stock llama-cpp's `grpc-server.cpp` is never touched. Drop both legacy patches once the turboquant fork rebases past ggml-org/llama.cpp#22397 / #22838. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(turboquant): close draft_ctx_size brace inside legacy guard The previous turboquant fix wrapped the new option-handler blocks in `#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard in the middle of an `else if` chain — the `} else if` openings of the new blocks were responsible for closing the previous block's brace. With the macro defined the new blocks vanish, draft_ctx_size's `{` loses its closer, the for-loop's `}` is consumed instead, and the file ends with a stray opening brace — clang reports it as `function-definition is not allowed here before '{'` on the next top-level `int main(...)` and `expected '}' at end of input`. Move the chain split inside the draft_ctx_size branch: } else if (... "draft_ctx_size") { // ... #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC } // legacy: chain ends here #else } else if (... "spec_ngram_mod_n_min") { // modern: chain continues ... } else if (... "draft_override_tensor") { ... } // closes last branch #endif } // closes for-loop Brace count is now balanced under both preprocessor branches (verified with `tr -cd '{' \| wc -c` against the patched and unpatched outputs). Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp `turboquant-avx` variant cleanly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt Dockerfile.turboquant's `builder-prebuilt` stage was missing the `ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that `builder-fromsource` already has (and that `Dockerfile.llama-cpp` mirrors across both stages). When CI uses the prebuilt base image (quay.io/go-skynet/ci-cache:base-grpc-, the common path) the build-arg passed by the workflow never reaches the env inside the compile stage. backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on hipblas builds when AMDGPU_TARGETS is empty, and the turboquant Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the same check fires from turboquant-fallback under BUILD_TYPE=hipblas: Makefile:38: AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101. Stop. make: * [Makefile:66: turboquant-fallback] Error 2 The bug is latent on master because the docker layer cache stays warm across builds — the compile step rarely re-runs from scratch. The llama.cpp bump in this PR invalidates the cache, so the missing env var becomes load-bearing and the hipblas turboquant CI job fails. Mirror the existing pattern from Dockerfile.llama-cpp. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>		2026-05-12 17:22:37 +02:00
.agents	ci(bump-deps): register ds4 + move version pin into the Makefile (#9761 )	2026-05-11 22:46:02 +02:00
.devcontainer	fix: Add named volumes for Windows Docker compatibility (#8661 )	2026-02-26 23:18:53 +01:00
.devcontainer-scripts	feat: refactor build process, drop embedded backends (#5875 )	2025-07-22 16:31:04 +02:00
.docker	ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738 )	2026-05-10 00:03:52 +02:00
.github	ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781 )	2026-05-12 17:22:09 +02:00
.vscode	feat: refactor build process, drop embedded backends (#5875 )	2025-07-22 16:31:04 +02:00
backend	feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server, expose new spec-decoding options (#9765 )	2026-05-12 17:22:37 +02:00
cmd	feat: Merge repeated log lines in the terminal (#9141 )	2026-03-26 22:16:13 +01:00
configuration	refactor: move remaining api packages to core (#1731 )	2024-03-01 16:19:53 +01:00
core	fix(ollama): accept `prompt` alias on /api/embed for Ollama parity (#9780 )	2026-05-12 17:21:20 +02:00
custom-ca-certs	feat(certificates): add support for custom CA certificates (#880 )	2023-11-01 20:10:14 +01:00
docs	feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server, expose new spec-decoding options (#9765 )	2026-05-12 17:22:37 +02:00
examples	docs: make examples repository link more prominent (#8895 )	2026-03-09 09:26:16 +01:00
gallery	feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758 )	2026-05-11 22:15:47 +02:00
internal	feat: cleanups, small enhancements	2023-07-04 18:58:19 +02:00
pkg	fix: parse vulkan VRAM from text (#9669 )	2026-05-12 09:53:48 +02:00
prompt-templates	Requested Changes from GPT4ALL to Luna-AI-Llama2 (#1092 )	2023-09-22 11:22:17 +02:00
scripts	ci(bump-deps): register ds4 + move version pin into the Makefile (#9761 )	2026-05-11 22:46:02 +02:00
swagger	feat(swagger): update swagger (#9723 )	2026-05-08 23:44:55 +02:00
tests	feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758 )	2026-05-11 22:15:47 +02:00
.air.toml	feat(ui): chat stats, small visual enhancements (#7223 )	2025-11-10 18:12:07 +01:00
.dockerignore	feat(whisper-cpp): Convert to Purego and add VAD (#6087 )	2025-08-28 17:25:18 +02:00
.editorconfig	feat(stores): Vector store backend (#1795 )	2024-03-22 21:14:04 +01:00
.env	feat(diffusers): add experimental support for sd_embed-style prompt embedding (#8504 )	2026-02-11 22:58:19 +01:00
.gitattributes	chore(linguist): add *.hpp files to linguist-vendored (#4154 )	2024-11-14 14:12:16 +01:00
.gitignore	fix(ui): Add tracing inline settings back and create UI tests (#9027 )	2026-03-16 17:51:06 +01:00
.gitmodules	feat: Add Kokoros backend (#9212 )	2026-04-08 19:23:16 +02:00
.golangci.yml	chore: add golangci-lint with new-from-merge-base baseline (#9603 )	2026-04-28 22:07:44 +02:00
.goreleaser.yaml	feat(ui): move to React for frontend (#8772 )	2026-03-05 21:47:12 +01:00
.yamllint	fix: yamlint warnings and errors (#2131 )	2024-04-25 17:25:56 +00:00
AGENTS.md	feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758 )	2026-05-11 22:15:47 +02:00
CLAUDE.md	fix(realtime): Add functions to conversation history (#8616 )	2026-02-21 19:03:49 +01:00
CONTRIBUTING.md	docs(agents): adopt kernel's AI coding assistants policy	2026-04-19 22:50:54 +00:00
docker-compose.distributed.yaml	fix(distributed): worker container healthcheck always unhealthy	2026-04-27 13:51:57 +00:00
docker-compose.yaml	fix(distributed): correct VRAM/RAM reporting on NVIDIA unified-memory hosts (#9545 )	2026-04-24 22:02:23 +02:00
Dockerfile	chore(deps): bump node from 25-slim to 26-slim (#9769 )	2026-05-12 09:19:51 +02:00
Entitlements.plist	Feat: OSX Local Codesigning (#1319 )	2023-11-23 15:22:54 +01:00
entrypoint.sh	feat: ⚠️ reduce images size and stop bundling sources (#5721 )	2025-06-26 18:41:38 +02:00
go.mod	chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 (#9773 )	2026-05-12 09:51:39 +02:00
go.sum	chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 (#9773 )	2026-05-12 09:51:39 +02:00
LICENSE	chore(docs): update license year	2025-02-15 18:17:15 +01:00
Makefile	feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758 )	2026-05-11 22:15:47 +02:00
README.md	docs: credit the LocalAI maintainers team	2026-05-02 23:37:04 +00:00
renovate.json	ci: manually update deps	2023-05-04 15:01:29 +02:00
SECURITY.md	docs: clarify SECURITY.md version support table with specific ranges and EOL dates (#8861 )	2026-03-08 17:58:19 +01:00
webui_static.yaml	feat(ui): move to React for frontend (#8772 )	2026-03-05 21:47:12 +01:00

README.md

LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

Drop-in API compatibility — OpenAI, Anthropic, ElevenLabs APIs
36+ backends — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
Any hardware — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
Multi-user ready — API key auth, user quotas, role-based access
Built-in AI agents — autonomous agents with tool use, RAG, MCP, and skills
Privacy-first — your data never leaves your infrastructure

Created by Ettore Di Giacinto and maintained by the LocalAI team.

📖 Documentation | 💬 Discord | 💻 Quickstart | 🖼️ Models | ❓FAQ

Guided tour

https://github.com/user-attachments/assets/08cbb692-57da-48f7-963d-2e7b43883c18

Click to see more!

Quickstart

macOS

Note: The DMG is not signed by Apple. After installing, run: sudo xattr -d com.apple.quarantine /Applications/LocalAI.app. See #6268 for details.

Containers (Docker, podman, ...)

Already ran LocalAI before? Use docker start -i local-ai to restart an existing container.

CPU only:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

NVIDIA GPU:

# CUDA 13
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13

# CUDA 12
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12

# NVIDIA Jetson ARM64 (CUDA 12, for AGX Orin and similar)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64

# NVIDIA Jetson ARM64 (CUDA 13, for DGX Spark)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64-cuda-13

AMD GPU (ROCm):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas

Intel GPU (oneAPI):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel

Vulkan GPU:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan

Loading models

# From the model gallery (see available models with `local-ai models list` or at https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# From Huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# From the Ollama OCI registry
local-ai run ollama://gemma:2b
# From a YAML config
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# From a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest

Automatic Backend Detection: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see GPU Acceleration.

For more details, see the Getting Started guide.

Latest News

April 2026: Voice recognition, Face recognition, identification & liveness detection, Ollama API compatibility, Video generation in stable-diffusion.ggml, Backend versioning with auto-upgrade, Pin models & load-on-demand toggle, Universal model importer, new backends: sglang, ik-llama-cpp, TurboQuant, sam.cpp, Kokoros, qwen3tts.cpp, tinygrad multimodal
March 2026: Agent management, New React UI, WebRTC, MLX-distributed via P2P and RDMA, MCP Apps, MCP Client-side
February 2026: Realtime API for audio-to-audio with tool calling, ACE-Step 1.5 support
January 2026: LocalAI 3.10.0 — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. Release notes
December 2025: Dynamic Memory Resource reclaimer, Automatic multi-GPU model fitting (llama.cpp), Vibevoice backend
November 2025: Import models via URL, Multiple chats and history
October 2025: Model Context Protocol (MCP) support for agentic capabilities
September 2025: New Launcher for macOS and Linux, extended backend support for Mac and Nvidia L4T, MLX-Audio, WAN 2.2
August 2025: MLX, MLX-VLM, Diffusers, llama.cpp now supported on Apple Silicon
July 2025: All backends migrated outside the main binary — lightweight, modular architecture

For older news and full release notes, see GitHub Releases and the News page.

Features

Text generation (llama.cpp, transformers, vllm ... and more)
Text to Audio
Audio to Text
Image generation
OpenAI-compatible tools API
Realtime API (Speech-to-speech)
Embeddings generation
Constrained grammars
Download models from Huggingface
Vision API
Object Detection
Reranker API
P2P Inferencing
Distributed Mode — Horizontal scaling with PostgreSQL + NATS
Model Context Protocol (MCP)
Built-in Agents — Autonomous AI agents with tool use, RAG, skills, SSE streaming, and Agent Hub
Backend Gallery — Install/remove backends on the fly via OCI images
Voice Activity Detection (Silero-VAD)
Integrated WebUI

Supported Backends & Acceleration

LocalAI supports 36+ backends including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for NVIDIA (CUDA 12/13), AMD (ROCm), Intel (oneAPI/SYCL), Apple Silicon (Metal), Vulkan, and NVIDIA Jetson (L4T). All backends can be installed on-the-fly from the Backend Gallery.

See the full Backend & Model Compatibility Table and GPU Acceleration guide.

Resources

Team

LocalAI is maintained by a small team of humans, together with the wider community of contributors.

Ettore Di Giacinto — original author and project lead
Richard Palethorpe — maintainer

A huge thank you to everyone who contributes code, reviews PRs, files issues, and helps users in Discord — LocalAI is a community-driven project and wouldn't exist without you. See the full contributors list.

Citation

If you utilize this repository, data in a downstream project, please consider citing it with:

@misc{localai,
  author = {Ettore Di Giacinto},
  title = {LocalAI: The free, Open source OpenAI alternative},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/go-skynet/LocalAI}},

Star history

License

LocalAI is a community-driven project created by Ettore Di Giacinto and maintained by the LocalAI team.

MIT - Author Ettore Di Giacinto mudler@localai.io

Acknowledgements

LocalAI couldn't have been built without the help of great software already available from the community. Thank you!

llama.cpp
https://github.com/tatsu-lab/stanford_alpaca
https://github.com/cornelk/llama-go for the initial ideas
https://github.com/antimatter15/alpaca.cpp
https://github.com/EdVince/Stable-Diffusion-NCNN
https://github.com/ggerganov/whisper.cpp
https://github.com/rhasspy/piper
exo for the MLX distributed auto-parallel sharding implementation

Contributors

This is a community project, a special thanks to our contributors!