2023-07-14 23:19:43 +00:00
syntax = "proto3" ;
option go_package = "github.com/go-skynet/LocalAI/pkg/grpc/proto" ;
option java_multiple_files = true ;
2023-07-14 23:19:43 +00:00
option java_package = "io.skynet.localai.backend" ;
option java_outer_classname = "LocalAIBackend" ;
2023-07-14 23:19:43 +00:00
2023-07-14 23:19:43 +00:00
package backend ;
2023-07-14 23:19:43 +00:00
2023-07-14 23:19:43 +00:00
service Backend {
2023-07-14 23:19:43 +00:00
rpc Health ( HealthMessage ) returns ( Reply ) { }
2026-03-03 11:39:06 +00:00
rpc Free ( HealthMessage ) returns ( Result ) { }
2023-07-14 23:19:43 +00:00
rpc Predict ( PredictOptions ) returns ( Reply ) { }
rpc LoadModel ( ModelOptions ) returns ( Result ) { }
rpc PredictStream ( PredictOptions ) returns ( stream Reply ) { }
2023-07-14 23:19:43 +00:00
rpc Embedding ( PredictOptions ) returns ( EmbeddingResult ) { }
2023-07-14 23:19:43 +00:00
rpc GenerateImage ( GenerateImageRequest ) returns ( Result ) { }
2025-04-26 16:05:01 +00:00
rpc GenerateVideo ( GenerateVideoRequest ) returns ( Result ) { }
2023-07-14 23:19:43 +00:00
rpc AudioTranscription ( TranscriptRequest ) returns ( TranscriptResult ) { }
2026-04-14 14:13:40 +00:00
rpc AudioTranscriptionStream ( TranscriptRequest ) returns ( stream TranscriptStreamResponse ) { }
2023-07-14 23:19:43 +00:00
rpc TTS ( TTSRequest ) returns ( Result ) { }
2026-01-30 10:58:01 +00:00
rpc TTSStream ( TTSRequest ) returns ( stream Reply ) { }
2024-08-24 00:20:28 +00:00
rpc SoundGeneration ( SoundGenerationRequest ) returns ( Result ) { }
2023-08-18 19:23:14 +00:00
rpc TokenizeString ( PredictOptions ) returns ( TokenizationResponse ) { }
rpc Status ( HealthMessage ) returns ( StatusResponse ) { }
2025-07-27 20:02:51 +00:00
rpc Detect ( DetectOptions ) returns ( DetectResponse ) { }
feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis
Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.
New gRPC RPCs (backend/backend.proto):
* FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
* FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse
Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.
New HTTP endpoints under /v1/face/:
* verify — 1:1 image pair same-person decision
* analyze — per-face age + gender (emotion/race reserved)
* register — 1:N enrollment; stores embedding in vector store
* identify — 1:N recognition; detect → embed → StoresFind
* forget — remove a registered face by opaque ID
Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.
Gallery (backend/index.yaml) ships three entries:
* insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage
(~326MB pre-baked; non-commercial research use only)
* insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0)
* insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial)
Python backend (backend/python/insightface/):
* engines.py — FaceEngine protocol with InsightFaceEngine and
OnnxDirectEngine; resolves model paths relative to the backend
directory so the same gallery config works in docker-scratch and
in the e2e-backends rootfs-extraction harness.
* backend.py — gRPC servicer implementing Health, LoadModel, Status,
Embedding, Detect, FaceVerify, FaceAnalyze.
* install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
backend directory so first-run is offline-clean (the final scratch
image only preserves files under /<backend>/).
* test.py — parametrized unit tests over both engines.
Tests:
* Registry unit tests (go test -race ./core/services/facerecognition/...)
— in-memory fake grpc.Backend, table-driven, covers register/
identify/forget/error paths + concurrent access.
* tests/e2e-backends/backend_test.go extended with face caps
(face_detect, face_embed, face_verify, face_analyze); relative
ordering + configurable verifyCeiling per engine.
* Makefile targets: test-extra-backend-insightface-buffalo-l,
-opencv, and the -all aggregate.
* CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
auto-triggered by changes under backend/python/insightface/.
Docs:
* docs/content/features/face-recognition.md — feature page with
license table, quickstart (defaults to the commercial-safe model),
models matrix, API reference, 1:N workflow, storage caveats.
* Cross-refs in object-detection.md, stores.md, embeddings.md, and
whats-new.md.
* Contributor README at backend/python/insightface/README.md.
Verified end-to-end:
* buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
face_verify, face_analyze).
* opencv: 5/5 specs (same minus face_analyze — SFace has no
demographic head; correctly skipped via BACKEND_TEST_CAPS).
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): move engine selection to model gallery, collapse backend entries
The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.
Correct split:
* `backend/index.yaml` now has ONE `insightface` backend entry
shipping the CPU + CUDA 12 container images. The Python backend
bundles both the non-commercial insightface model packs
(buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
weights (YuNet + SFace); the active engine is selected at
LoadModel time via `options: ["engine:..."]`.
* `gallery/index.yaml` gains three model entries —
`insightface-buffalo-l`, `insightface-opencv`,
`insightface-buffalo-s` — each setting the appropriate
`overrides.backend` + `overrides.options` so installing one
actually gives the user the intended engine. This matches how
`rfdetr-base` lives in the model gallery against the `rfdetr`
backend.
The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): cover all supported models in the gallery + drop weight baking
Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.
Gallery now has seven `insightface-*` model entries (gallery/index.yaml):
insightface (family) — non-commercial research use
• buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default
• buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage
• buffalo-s (159MB) — SCRFD-500MF + MBF + genderage
• buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only
(no landmarks, no demographics — analyze
returns empty attributes)
• antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage
OpenCV Zoo family — Apache 2.0 commercial-safe
• opencv — YuNet + SFace fp32 (~40MB)
• opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)
Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:
* OpenCV variants list `files:` with ONNX URIs + SHA-256, so
`local-ai models install insightface-opencv` pulls them into the
models directory exactly like any other gallery-managed model.
* insightface packs (upstream distributes .zip archives only, not
individual ONNX files) auto-download on first LoadModel via
FaceAnalysis' built-in machinery, rooted at the LocalAI models
directory so they live alongside everything else — same pattern
`rfdetr` uses with `inference.get_model()`.
Backend changes (backend/python/insightface/):
* backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
LocalAI models directory) to engines via a `_model_dir` hint.
This replaces the earlier ModelFile-dirname approach; ModelPath
is the canonical "models directory" variable set by the Go loader
(pkg/model/initializers.go:144) and is always populated.
* engines.py::_resolve_model_path — picks up `model_dir` and searches
it (plus basename-in-model-dir) before falling back to the dev
script-dir. This is how OnnxDirectEngine finds gallery-downloaded
YuNet/SFace files by filename only.
* engines.py::_flatten_insightface_pack — new helper that works
around an upstream packaging inconsistency: buffalo_l/s/sc zips
expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
files in a redundant `<name>/` directory. insightface's own
loader looks one level too shallow and fails. We call
`ensure_available()` explicitly, flatten if nested, then hand to
FaceAnalysis.
* engines.py::InsightFaceEngine.prepare — root-resolution order now
includes the `_model_dir` hint so packs download into the LocalAI
models directory by default.
* install.sh — no longer pre-downloads any weights. Everything is
gallery-managed now.
* smoke.py (new) — parametrized smoke test that iterates over every
gallery configuration, simulating the LocalAI install flow
(creates a models dir, fetches OpenCV files with checksum
verification, lets insightface auto-download its packs), then
runs detect + embed + verify (+ analyze where supported) through
the in-process BackendServicer.
* test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
paths; downloads ONNX files to a temp dir at setUpClass time and
passes ModelPath accordingly.
Registry change (core/services/facerecognition/store_registry.go):
* `dim=0` in NewStoreRegistry now means "accept whatever dimension
arrives" — needed because the backend supports 512-d ArcFace/MBF
and 128-d SFace via the same Registry. A non-zero dim still fails
fast with ErrDimensionMismatch.
* core/application plumbs `faceEmbeddingDim = 0`, explaining the
rationale in the comment.
Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.
Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:
PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000
PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000
PASS: insightface-opencv faces=6 dim=128 same-dist=0.000
PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000
7/7 passed
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim
CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:
LoadModel failed: Failed to load face engine:
OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx
Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).
Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs
The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.
Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).
Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).
Response:
{
"embedding": [<dim> floats, L2-normed],
"dim": int, // 512 for ArcFace R50 / MBF, 128 for SFace
"model": "<name>"
}
Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.
Assisted-by: Claude:claude-opus-4-7
* fix(http): map malformed image input + gRPC status codes to proper 4xx
Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.
Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:
* decodeImageInput(s)
Wraps utils.GetContentURIAsBase64 and turns any failure
(invalid URL, not a data-URI, download error, etc.) into
echo.NewHTTPError(400, "invalid image input: ...").
* mapBackendError(err)
Inspects the gRPC status on a backend call error and maps:
INVALID_ARGUMENT → 400 Bad Request
NOT_FOUND → 404 Not Found
FAILED_PRECONDITION → 412 Precondition Failed
Unimplemented → 501 Not Implemented
All other codes fall through unchanged (still 500).
Before, my 1×1 PNG error-path test returned:
HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
HTTP 400 "failed to decode one or both images"
Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.
Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.
Assisted-by: Claude:claude-opus-4-7
* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis
Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.
Two changes:
1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
`files:` list with the pack zip's URI + SHA-256. `local-ai models
install insightface-buffalo-l` now fetches the zip, verifies the
hash, and extracts it into the models directory. No more reliance
on insightface's library-internal `ensure_available()` auto-download
or its hardcoded `BASE_REPO_URL`.
2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
the FaceAnalysis wrapper and drives insightface's `model_zoo`
directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
route each through `model_zoo.get_model()`, build a
`{taskname: model}` dict, loop per-face at inference — are
reimplemented in `InsightFaceEngine`. The actual inference classes
(RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
insightface's — we only replicate the glue, so drift risk against
upstream is minimal.
Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
layout that doesn't match what LocalAI's zip extraction produces.
LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
`<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
`_locate_insightface_pack` helper searches both locations plus
legacy paths and returns whichever has ONNX files. Replaces the
earlier `_flatten_insightface_pack` helper (which tried to fight
FaceAnalysis's layout expectations; now we just find the files
wherever they are).
Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched
The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".
Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.
Face analyze cap dropped since buffalo_sc has no age/gender head.
Assisted-by: Claude:claude-opus-4-7[1m]
* feat(face-recognition): surface face-recognition in advertised feature maps
The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:
* api_instructions — the machine-readable capability index at
GET /api/instructions. Added `face-recognition` as a dedicated
instruction area with an intro that calls out the in-memory
registry caveat and the /v1/face/embed vs /v1/embeddings split.
* auth/permissions — added FeatureFaceRecognition constant, routed
all six face endpoints through it so admins can gate them per-user
like any other API feature. Default ON (matches the other API
features).
* React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
follow-up (noted in the plan).
Instruction count bumped 9 → 10; test updated.
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(agents): capture advertising-surface steps in the endpoint guide
Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.
Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.
Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 19:55:41 +00:00
rpc FaceVerify ( FaceVerifyRequest ) returns ( FaceVerifyResponse ) { }
rpc FaceAnalyze ( FaceAnalyzeRequest ) returns ( FaceAnalyzeResponse ) { }
feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend
Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).
Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): add 1:N identify + register/forget endpoints
Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).
Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).
As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): gallery, docs, CI and e2e coverage
- backend/index.yaml: speaker-recognition backend entry + CPU and
CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
deliberate placeholder — the HF URI must be curl'd and its hash
filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
Helper resolveFaceFixture is reused as-is — the only thing face/voice
share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
speaker-recognition-{ecapa,all} targets. Audio fixtures default to
VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
build entries — scripts/changed-backends.js auto-picks these up.
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): wire a working /v1/voice/analyze head
Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.
Defaults to two open-licence HuggingFace checkpoints:
- audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
age regression + 3-way gender (female / male / child).
- superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.
Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.
Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.
Adds transformers to the backend's CPU and CUDA-12 requirements.
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256
Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): soundfile loader + honest analyze default
Two issues surfaced on first end-to-end smoke with the actual backend
image:
1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
for audio decoding. Switch SpeechBrainEngine._load_waveform to the
already-present soundfile (listed in requirements.txt) plus a numpy
linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
codepath we never exercise (torchaudio's ffmpeg backend).
2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
24-ft-age-gender, but AutoModelForAudioClassification silently
mangles that checkpoint — it reports the age head weights as
UNEXPECTED and re-initialises the classifier head with random
values, so the "gender" output is noise and there is no age output
at all. Make age/gender opt-in instead (empty default; users wire
a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
age_gender_model: option). Emotion keeps its working Superb default.
Also broaden _infer_age_gender's tensor-shape handling and catch
runtime exceptions so a dodgy age/gender head never takes down the
whole analyze call.
Docs and README updated to match the new policy.
Verified with the branch-scoped gallery on localhost:
- voice/embed → 192-d ECAPA-TDNN vector
- voice/verify → same-clip dist≈6e-08 verified=true; cross-speaker
dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze → emotion populated, age/gender omitted (opt-in)
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec
Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):
1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
(the dataset is gated). Swap them for the speechbrain test samples
served from github.com/speechbrain/speechbrain/raw/develop/ —
public, no auth, correct 16kHz mono format.
2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
file1/file2 were same-speaker. The speechbrain samples are three
different speakers (example1/2/5), and there is no easy un-gated
source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
are all license- or size-gated for CI use). Replace the ceiling
check with a relative-ordering assertion: d(pair) > d(same-clip)
for both file2 and file3 — that's enough to prove the embeddings
encode speaker info, and it works with any three non-identical
clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
asserted.
Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.
Assisted-by: Claude:claude-opus-4-7
* fix(ci): checkout with submodules in the reusable backend_build workflow
The kokoros Rust backend build fails with
failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file
because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.
The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.
Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.
Assisted-by: Claude:claude-opus-4-7
2026-04-23 10:07:14 +00:00
rpc VoiceVerify ( VoiceVerifyRequest ) returns ( VoiceVerifyResponse ) { }
rpc VoiceAnalyze ( VoiceAnalyzeRequest ) returns ( VoiceAnalyzeResponse ) { }
rpc VoiceEmbed ( VoiceEmbedRequest ) returns ( VoiceEmbedResponse ) { }
2024-03-22 20:14:04 +00:00
rpc StoresSet ( StoresSetOptions ) returns ( Result ) { }
rpc StoresDelete ( StoresDeleteOptions ) returns ( Result ) { }
rpc StoresGet ( StoresGetOptions ) returns ( StoresGetResult ) { }
rpc StoresFind ( StoresFindOptions ) returns ( StoresFindResult ) { }
2024-04-24 22:19:02 +00:00
rpc Rerank ( RerankRequest ) returns ( RerankResult ) { }
2024-10-01 12:41:20 +00:00
rpc GetMetrics ( MetricsRequest ) returns ( MetricsResponse ) ;
2024-11-20 13:48:40 +00:00
rpc VAD ( VADRequest ) returns ( VADResponse ) { }
2026-01-22 23:38:28 +00:00
feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp (#9654)
* feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp
Closes #1648.
OpenAI-style multipart endpoint that returns "who spoke when". Single
endpoint instead of the issue's three-endpoint sketch (refactor /vad,
/vad/embedding, /diarization) — the typical client wants one call, and
embeddings can land later as a sibling without breaking this surface.
Response shape borrows from Pyannote/Deepgram: segments carry a
normalised SPEAKER_NN id (zero-padded, stable across the response) plus
the raw backend label, optional per-segment text when the backend bundles
ASR, and a speakers summary in verbose_json. response_format also accepts
rttm so consumers can pipe straight into pyannote.metrics / dscore.
Backends:
* vibevoice-cpp — Diarize() reuses the existing vv_capi_asr pass.
vibevoice's ASR prompt asks the model to emit
[{Start,End,Speaker,Content}] natively, so diarization is a by-product
of the same pass; include_text=true preserves the transcript per
segment, otherwise we drop it.
* sherpa-onnx — wraps the upstream SherpaOnnxOfflineSpeakerDiarization
C API (pyannote segmentation + speaker-embedding extractor + fast
clustering). libsherpa-shim grew config builders, a SetClustering
wrapper for per-call num_clusters/threshold overrides, and a
segment_at accessor (purego can't read field arrays out of
SherpaOnnxOfflineSpeakerDiarizationSegment[] directly).
Plumbing: new Diarize gRPC RPC + DiarizeRequest / DiarizeSegment /
DiarizeResponse messages, threaded through interface.go, base, server,
client, embed. Default Base impl returns unimplemented.
Capability surfaces all updated: FLAG_DIARIZATION usecase,
FeatureAudioDiarization permission (default-on), RouteFeatureRegistry
entries for /v1/audio/diarization and /audio/diarization, audio
instruction-def description widened, CAP_DIARIZATION JS symbol,
swagger regenerated, /api/instructions discovery map updated.
Tests:
* core/backend: speaker-label normalisation (first-seen → SPEAKER_NN,
per-speaker totals, nil-safety, fallback to backend NumSpeakers when
no segments).
* core/http/endpoints/openai: RTTM rendering (file-id basename, negative
duration clamping, fallback id).
* tests/e2e: mock-backend grew a deterministic Diarize that emits
raw labels "5","2","5" so the e2e suite verifies SPEAKER_NN
remapping, verbose_json speakers summary + transcript pass-through
(gated by include_text), RTTM bytes content-type, and rejection of
unknown response_format. mock-diarize model config registered with
known_usecases=[FLAG_DIARIZATION] to bypass the backend-name guard.
Docs: new features/audio-diarization.md (request/response, RTTM example,
sherpa-onnx + vibevoice setup), cross-link from audio-to-text.md, entry
in whats-new.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(diarization): correct sherpa-onnx symbol name + lint cleanup
CI failures on #9654:
* sherpa-onnx-grpc-{tts,transcription} and sherpa-onnx-realtime panicked
at backend startup with `undefined symbol: SherpaOnnxDestroyOfflineSpeakerDiarizationResult`.
Upstream's actual symbol is SherpaOnnxOfflineSpeakerDiarizationDestroyResult
(Destroy in the middle, not the prefix); the rest of the diarization
surface follows the same naming pattern. The mismatched name made
purego.RegisterLibFunc fail at dlopen time and crashed the gRPC server
before the BeforeAll could probe Health, taking down every sherpa-onnx
test job — not just the diarization-related ones.
* golangci-lint flagged 5 errcheck violations on new defer cleanups
(os.RemoveAll / Close / conn.Close); wrap each in a `defer func() { _ = X() }()`
closure (matches the pattern other LocalAI files use for new code, since
pre-existing bare defers are grandfathered in via new-from-merge-base).
* golangci-lint also flagged forbidigo violations: the new
diarization_test.go files used testing.T-style `t.Errorf` / `t.Fatalf`,
which are forbidden by the project's coding-style policy
(.agents/coding-style.md). Convert both files to Ginkgo/Gomega
Describe/It with Expect(...) — they get picked up by the existing
TestBackend / TestOpenAI suites, no new suite plumbing needed.
* modernize linter: tightened the diarization segment loop to
`for i := range int(numSegments)` (Go 1.22+ idiom).
Verified locally: golangci-lint with new-from-merge-base=origin/master
reports 0 issues across all touched packages, and the four mocked
diarization e2e specs in tests/e2e/mock_backend_test.go still pass.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): convert non-WAV input via ffmpeg + raise ASR token budget
Confirmed end-to-end against a real LocalAI instance with vibevoice-asr-q4_k
loaded and the multi-speaker MP3 sample at vibevoice.cpp/samples/2p_argument.mp3:
both /v1/audio/transcriptions and /v1/audio/diarization now succeed and
return correctly attributed speaker turns for the full clip.
Two latent issues surfaced once the diarization endpoint actually exercised
the backend with a non-trivial input:
1. vv_capi_asr only accepts WAV via load_wav_24k_mono. The previous code
passed the uploaded path straight through, so anything that wasn't
already a 24 kHz mono s16le WAV failed at the C side with rc=-8 and
the very unhelpful "vv_capi_asr failed". prepareWavInput shells out
to ffmpeg ("-ar 24000 -ac 1 -acodec pcm_s16le") in a per-call temp
dir, matching the rate the model was trained on; both AudioTranscription
and Diarize now route through it. This is the same shape sherpa-onnx
uses (utils.AudioToWav), but vibevoice needs 24 kHz rather than 16 kHz
so we don't reuse that helper.
2. The C ABI's max_new_tokens defaults to 256 when 0 is passed. That's
fine for a five-second clip but not for anything past ~10 s — vibevoice
stops mid-JSON, the parse fails, and the caller sees a hard error.
Pass a much larger budget (16 384 ≈ ~9 minutes of speech at the
model's ~30 tok/s rate); generation stops at EOS so this is a cap
rather than a target.
3. As a defensive belt-and-braces, mirror AudioTranscription's existing
"fall back to a single segment if the model emits non-JSON text"
pattern in Diarize, so partial / unusual model output never produces
a 500. This kept the endpoint usable while diagnosing (1) and (2),
and is the right behaviour to keep.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): pass valid WAVs through directly so ffmpeg is not required at runtime
Spotted by tests-e2e-backend (1.25.x): the previous fix forced every
incoming audio file through `ffmpeg -ar 24000 ...`, which meant the
backend container — which does not ship ffmpeg — failed even for the
existing happy path where the caller already uploads a WAV. The
container-side error was:
rpc error: code = Unknown desc = vibevoice-cpp: ffmpeg convert to
24k mono wav: exec: "ffmpeg": executable file not found in $PATH
Reading vibevoice.cpp's audio_io.cpp, `load_wav_24k_mono` uses drwav and
already accepts any PCM/IEEE-float WAV at any sample rate, downmixes
multi-channel input to mono, and resamples to 24 kHz internally. So the
only inputs that genuinely need an external converter are non-WAV
formats (MP3, OGG, FLAC, ...).
Detect WAVs by RIFF/WAVE magic at bytes 0..3 / 8..11 and pass them
straight through with a no-op cleanup; everything else still goes
through ffmpeg with the same 24 kHz mono s16le target. The result:
* Container builds without ffmpeg keep working for WAV uploads
(the e2e-backends fixture is jfk.wav at 16 kHz mono s16le).
* MP3 and other non-WAV inputs still get the new ffmpeg conversion
path so the diarization endpoint stays useful.
* If the caller uploads a non-WAV but ffmpeg isn't on PATH, the
surfaced error is still descriptive enough to act on.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(ci): make gcc-14 install in Dockerfile.golang best-effort for jammy bases
The LocalVQE PR (bb033b16) made `gcc-14 g++-14` an unconditional apt
install in backend/Dockerfile.golang and pointed update-alternatives at
them. That works on the default `BASE_IMAGE=ubuntu:24.04` (noble has
gcc-14 in main), but every Go backend that builds on
`nvcr.io/nvidia/l4t-jetpack:r36.4.0` — jammy under the hood — now fails
at the apt step:
E: Unable to locate package gcc-14
This blocked unrelated jobs:
backend-jobs(*-nvidia-l4t-arm64-{stablediffusion-ggml, sam3-cpp, whisper,
acestep-cpp, qwen3-tts-cpp, vibevoice-cpp}). LocalVQE itself is only
matrix-built on ubuntu:24.04 (CPU + Vulkan), so it doesn't actually
need gcc-14 anywhere else.
Make the gcc-14 install conditional on the package being available in
the configured apt repos. On noble: identical behaviour to today (gcc-14
installed, update-alternatives points at it). On jammy: skip the
gcc-14 stanza entirely and let build-essential's default gcc take over,
which is what the other Go backends compile with anyway.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-05 13:10:13 +00:00
rpc Diarize ( DiarizeRequest ) returns ( DiarizeResponse ) { }
2026-03-13 20:37:15 +00:00
rpc AudioEncode ( AudioEncodeRequest ) returns ( AudioEncodeResult ) { }
rpc AudioDecode ( AudioDecodeRequest ) returns ( AudioDecodeResult ) { }
feat: add LocalVQE backend and audio transformations UI (#9640)
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI
Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.
Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
bidirectional AudioTransformStream for low-latency frame-by-frame use.
This is the first bidi stream in the proto; per-frame unary at LocalVQE's
16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
embed,interface,base} with paired-channel ergonomics.
LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
wrapper needed because LocalVQE handles CPU feature selection internally
via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
LocalVQE runs single-threaded at ~1× realtime instead of the documented
~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).
HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
endpoint, accepts audio + optional reference + params[*]=v form fields.
Persists inputs alongside the output in GeneratedContentDir/audio so the
React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
(interleaved stereo mic+ref in, mono out). JSON session.update envelope
for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
utils.AudioToWav (with passthrough fast-path), so the user can upload any
format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
no /audio/transformations endpoint), traced via TraceMiddleware.
Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
huggingface.co/LocalAI-io/LocalVQE).
React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
section (sibling of Tools / Biometrics) — the page is enhancement, not
generation, so it lives outside Studio. Two AudioInput components
(Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
the speakers — the mic naturally picks up speaker bleed, giving a real
(mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
and useAudioPeaks hook (shared module-scoped AudioContext to avoid
hitting browser context limits with three players on one page); migrated
TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
the history entry stores all three URLs so clicking re-renders the full
triple, not just the output.
Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
two and let GPU-class hardware route through Vulkan in the gallery
capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
(8 specs covering tabs, file upload, multipart shape, history, errors).
Docs:
- New docs/content/features/audio-transform.md covering the (audio,
reference) mental model, batch + WebSocket wire formats, LocalVQE param
keys, and a YAML config example. Cross-links from text-to-audio and
audio-to-text feature pages.
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-04 20:07:11 +00:00
rpc AudioTransform ( AudioTransformRequest ) returns ( AudioTransformResult ) { }
rpc AudioTransformStream ( stream AudioTransformFrameRequest ) returns ( stream AudioTransformFrameResponse ) { }
feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)
* feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase
Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model:
single engine handles VAD, transcription, LLM, and TTS in one bidirectional
stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline.
Backend
- backend/python/liquid-audio/ — new Python gRPC backend wrapping the
`liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets,
Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/
Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch
on `liquid_audio.utils.snapshot_download` so absolute local paths from
LocalAI's gallery resolve without a HF round-trip. soundfile in place of
torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle).
- backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,
interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream
(config/frame/control oneof in; typed event+pcm+meta out).
- core/services/nodes/{health_mock,inflight}_test.go — add stubs for the
new RPC to the test fakes.
Config + capabilities
- core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudio
ToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row.
- core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups
membership in both speech-input and audio-output groups so a lone flag
still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases
branch.
Realtime endpoint
- core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig()
so the gate is unit-testable; accept realtime_audio models and self-fill
empty pipeline slots with the model's own name (user-pinned slots win).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil
cfg, empty pipeline, legacy pipeline, self-contained realtime_audio,
user-pinned VAD slot, and partial legacy pipeline.
UI + endpoints
- core/http/routes/ui.go — /api/pipeline-models accepts either a legacy
VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a
self_contained flag so the Talk page can collapse the four cards.
- core/http/routes/ui_api.go — realtime_audio in usecaseFilters.
- core/http/routes/ui_pipeline_models_test.go — covers both code paths.
- core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of
the four-slot grid; rename Edit Pipeline → Edit Model Config; less
pipeline-specific wording.
- core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new
realtime_audio filter button + i18n.
- core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO.
- core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset
fields, surfaced when backend === liquid-audio, plumbed via
extra_options on submit/export/import.
Gallery + importer
- gallery/liquid-audio.yaml — config template with known_usecases:
[realtime_audio, chat, tts, transcript, vad].
- gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by
mode option. Fixed pre-existing `transcribe` typo on the asr entry
(loader silently dropped the unknown string → entry never surfaced as a
transcript model).
- gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format
`<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching
common_chat_params_init_lfm2 in vendored llama.cpp.
- core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector
matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice
preferences plumbed through to options.
- core/gallery/importers/importers.go — register LiquidAudioImporter
before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument
regex pair on the LFM2 pythonic format.
Build matrix
- .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13,
l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12
is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's
3.12 floor).
- backend/index.yaml — anchor + 13 image entries.
- Makefile — .NOTPARALLEL, prepare-test-extra, test-extra,
docker-build-liquid-audio.
Docs
- .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real
any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call
detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands)
remain.
- .agents/api-endpoints-and-auth.md — expand the capability-surface
checklist with every place a new FLAG_* needs to be registered.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): function calling + history cap for any-to-any models
Three pieces, all on the realtime_audio path that just landed:
1. liquid-audio backend (backend/python/liquid-audio/backend.py):
- _build_chat_state grows a `tools_prelude` arg.
- new _render_tools_prelude parses request.Tools (the OpenAI Chat
Completions function array realtime.go already serialises) and
emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn
ahead of the user history. Mirrors gallery/lfm.yaml's `function:`
template so the model sees the same prompt shape whether served
via llama-cpp or here. Without this the backend silently dropped
tools — function calling was wired end-to-end on the Go side but
the model never saw a tool list.
2. Realtime history cap (core/http/endpoints/openai/realtime.go):
- Session grows MaxHistoryItems int; default picked by new
defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5
1.5B degrades quickly past a handful of turns), 0/unlimited for
legacy pipelines composing larger LLMs.
- triggerResponse runs conv.Items through trimRealtimeItems before
building conversationHistory. Helper walks the cut left if it
would orphan a function_call_output, so tool result + call pairs
stay intact.
- realtime_gate_test.go: specs for defaultMaxHistoryItems and
trimRealtimeItems (zero cap, under cap, over cap, tool-call pair
preservation).
3. Talk page (core/http/react-ui/src/pages/Talk.jsx):
- Reuses the chat page's MCP plumbing — useMCPClient hook,
ClientMCPDropdown component, same auto-connect/disconnect effect
pattern. No bespoke tool registry, no new REST endpoints; tools
come from whichever MCP servers the user toggles on, exactly as
on the chat page.
- sendSessionUpdate now passes session.tools=getToolsForLLM(); the
update re-fires when the active server set changes mid-session.
- New response.function_call_arguments.done handler executes via
the hook's executeTool (which round-trips through the MCP client
SDK), then replies with conversation.item.create
{type:function_call_output} + response.create so the model
completes its turn with the tool output. Mirrors chat's
client-side agentic loop, translated to the realtime wire shape.
UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes
react-ui/dist into the runtime image). Backend.py changes can be
swapped live in /backends/<id>/backend.py + /backend/shutdown.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page
Mirrors the chat-page metadata.localai_assistant flow so users can ask the
realtime model what's loaded / installed / configured. Tools are run
server-side via the same in-process MCP holder that powers the chat
modality — no transport switch, no proxy, no new wire protocol.
Wire:
- core/http/endpoints/openai/realtime.go:
- RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin
helper mirrors chat.go's requireAssistantAccess (no-op when auth
disabled, else requires auth.RoleAdmin).
- Session grows AssistantExecutor mcpTools.ToolExecutor.
- runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin,
fail closed if DisableLocalAIAssistant or the holder has no tools,
DiscoverTools and inject into session.Tools, prepend
holder.SystemPrompt() to instructions.
- Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run
ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip
the function_call_arguments client emit (the client can't execute
these — it doesn't know about them). After the loop, if any
assistant tool ran, trigger another response so the model speaks the
result. Mirrors chat's agentic loop, driven server-side rather than
via client round-trip.
- core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest
gains `localai_assistant` (JSON omitempty). Handshake calls
isCurrentUserAdmin and builds RealtimeSessionOptions.
- core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode"
checkbox under the Tools dropdown; passes localai_assistant: true to
realtimeApi.call's body, captured in the connect callback's deps.
Mirroring chat's pattern means the in-process MCP tools surface "just
works" for the Talk page without exposing a Streamable-HTTP MCP endpoint
(which was the alternative). Clients with their own MCP servers can
still use the existing ClientMCPDropdown path in parallel; the realtime
handler distinguishes them by AssistantExecutor.IsTool() at dispatch
time.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): render Manage Mode tool calls in the Talk transcript
Previously the realtime endpoint only emitted response.output_item.added
for the FunctionCall item, and Talk.jsx's switch ignored the event — so
server-side tool runs were invisible in the UI. The model would speak
the result but the user had no way to see what tool was actually
called.
realtime.go: after executing an assistant tool inproc, emit a second
output_item.added/.done pair for the FunctionCallOutput item. Mirrors
the way the chat page displays tool_call + tool_result blocks.
Talk.jsx: handle both response.output_item.added and .done. Render
FunctionCall (with arguments) and FunctionCallOutput (pretty-printed
JSON when possible) as two transcript entries — `tool_call` with the
wrench icon, `tool_result` with the clipboard icon, both in mono-space
secondary-colour. Resets streamingRef after the result so the next
assistant text delta starts a fresh transcript entry instead of
appending to the previous turn.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools
Fallout from a review pass on the Manage Mode patches:
- Bound the server-side agentic loop. triggerResponse used to recurse on
executedAssistantTool with no cap — a model that kept calling tools
would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors
useChat.js's maxToolTurns). Public triggerResponse is now a thin shim
over triggerResponseAtTurn(toolTurn int); recursion increments the
counter and stops at the cap with an xlog.Warn.
- Preserve Manage Mode tools across client session.update. The handler
used to blindly overwrite session.Tools, so toggling a client MCP
server mid-session silently wiped the in-process admin tools. Session
now caches the original AssistantTools slice at session creation and
the session.update handler merges them back in (client names win on
collision — the client is explicit).
- strconv.ParseBool for the localai_assistant query param instead of
hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata.
- Talk.jsx: render both tool_call and tool_result on
response.output_item.done instead of splitting them across .added and
.done. The server's event pairing (added → done) stays correct; the
UI just doesn't need to inspect both phases of the same item. One
switch case instead of two, no behavioural change.
Out of scope (noted for follow-ups): extract a shared assistant-tools
helper between chat.go and realtime.go (duplication is small enough
that two parallel implementations stay readable for now), and an i18n
key for the Manage Mode helper text (Talk.jsx doesn't use i18n
anywhere else yet).
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(test-extra): wire liquid-audio backend smoke test
The backend ships test.py + a `make test` target and is listed in
backend-matrix.yml, so scripts/changed-backends.js already writes a
`liquid-audio=true|false` output when files under backend/python/liquid-audio/
change. The workflow just wasn't reading it.
- Expose the `liquid-audio` output on the detect-changes job
- Add a tests-liquid-audio job that runs `make` + `make test` in
backend/python/liquid-audio, gated on the per-backend detect flag
The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode
short-circuits before any HuggingFace download (backend.py:192), so the
job needs neither weights nor a GPU. The full-inference path remains
gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set.
The four new Go test files (core/gallery/importers/liquid-audio_test.go,
core/http/endpoints/openai/realtime_gate_test.go,
core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go)
are already picked up by the existing test.yml workflow via `make test` →
`ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-13 19:57:27 +00:00
// AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
// that load a speech-to-speech model consume input audio frames and emit
// interleaved audio + transcript + tool-call deltas as typed events.
// Backends without S2S support return UNIMPLEMENTED.
rpc AudioToAudioStream ( stream AudioToAudioRequest ) returns ( stream AudioToAudioResponse ) { }
feat: add LocalVQE backend and audio transformations UI (#9640)
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI
Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.
Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
bidirectional AudioTransformStream for low-latency frame-by-frame use.
This is the first bidi stream in the proto; per-frame unary at LocalVQE's
16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
embed,interface,base} with paired-channel ergonomics.
LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
wrapper needed because LocalVQE handles CPU feature selection internally
via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
LocalVQE runs single-threaded at ~1× realtime instead of the documented
~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).
HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
endpoint, accepts audio + optional reference + params[*]=v form fields.
Persists inputs alongside the output in GeneratedContentDir/audio so the
React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
(interleaved stereo mic+ref in, mono out). JSON session.update envelope
for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
utils.AudioToWav (with passthrough fast-path), so the user can upload any
format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
no /audio/transformations endpoint), traced via TraceMiddleware.
Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
huggingface.co/LocalAI-io/LocalVQE).
React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
section (sibling of Tools / Biometrics) — the page is enhancement, not
generation, so it lives outside Studio. Two AudioInput components
(Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
the speakers — the mic naturally picks up speaker bleed, giving a real
(mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
and useAudioPeaks hook (shared module-scoped AudioContext to avoid
hitting browser context limits with three players on one page); migrated
TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
the history entry stores all three URLs so clicking re-renders the full
triple, not just the output.
Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
two and let GPU-class hardware route through Vulkan in the gallery
capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
(8 specs covering tabs, file upload, multipart shape, history, errors).
Docs:
- New docs/content/features/audio-transform.md covering the (audio,
reference) mental model, batch + WebSocket wire formats, LocalVQE param
keys, and a YAML config example. Cross-links from text-to-audio and
audio-to-text feature pages.
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-04 20:07:11 +00:00
2026-01-22 23:38:28 +00:00
rpc ModelMetadata ( ModelOptions ) returns ( ModelMetadataResponse ) { }
2026-03-21 01:08:02 +00:00
// Fine-tuning RPCs
rpc StartFineTune ( FineTuneRequest ) returns ( FineTuneJobResult ) { }
rpc FineTuneProgress ( FineTuneProgressRequest ) returns ( stream FineTuneProgressUpdate ) { }
rpc StopFineTune ( FineTuneStopRequest ) returns ( Result ) { }
rpc ListCheckpoints ( ListCheckpointsRequest ) returns ( ListCheckpointsResponse ) { }
rpc ExportModel ( ExportModelRequest ) returns ( Result ) { }
2026-03-21 23:56:34 +00:00
// Quantization RPCs
rpc StartQuantization ( QuantizationRequest ) returns ( QuantizationJobResult ) { }
rpc QuantizationProgress ( QuantizationProgressRequest ) returns ( stream QuantizationProgressUpdate ) { }
rpc StopQuantization ( QuantizationStopRequest ) returns ( Result ) { }
2026-03-29 22:47:27 +00:00
2024-10-01 12:41:20 +00:00
}
// Define the empty request
message MetricsRequest { }
message MetricsResponse {
int32 slot_id = 1 ;
string prompt_json_for_slot = 2 ; // Stores the prompt as a JSON string.
float tokens_per_second = 3 ;
int32 tokens_generated = 4 ;
int32 prompt_tokens_processed = 5 ;
2024-04-24 22:19:02 +00:00
}
message RerankRequest {
string query = 1 ;
repeated string documents = 2 ;
int32 top_n = 3 ;
}
message RerankResult {
Usage usage = 1 ;
repeated DocumentResult results = 2 ;
}
message Usage {
int32 total_tokens = 1 ;
int32 prompt_tokens = 2 ;
}
message DocumentResult {
int32 index = 1 ;
string text = 2 ;
float relevance_score = 3 ;
2024-03-22 20:14:04 +00:00
}
message StoresKey {
repeated float Floats = 1 ;
}
message StoresValue {
bytes Bytes = 1 ;
}
message StoresSetOptions {
repeated StoresKey Keys = 1 ;
repeated StoresValue Values = 2 ;
}
message StoresDeleteOptions {
repeated StoresKey Keys = 1 ;
}
message StoresGetOptions {
repeated StoresKey Keys = 1 ;
}
message StoresGetResult {
repeated StoresKey Keys = 1 ;
repeated StoresValue Values = 2 ;
}
message StoresFindOptions {
StoresKey Key = 1 ;
int32 TopK = 2 ;
}
message StoresFindResult {
repeated StoresKey Keys = 1 ;
repeated StoresValue Values = 2 ;
repeated float Similarities = 3 ;
2023-07-14 23:19:43 +00:00
}
message HealthMessage { }
// The request message containing the user's name.
message PredictOptions {
string Prompt = 1 ;
int32 Seed = 2 ;
int32 Threads = 3 ;
int32 Tokens = 4 ;
int32 TopK = 5 ;
int32 Repeat = 6 ;
int32 Batch = 7 ;
int32 NKeep = 8 ;
float Temperature = 9 ;
float Penalty = 10 ;
bool F16KV = 11 ;
bool DebugMode = 12 ;
repeated string StopPrompts = 13 ;
bool IgnoreEOS = 14 ;
float TailFreeSamplingZ = 15 ;
float TypicalP = 16 ;
float FrequencyPenalty = 17 ;
float PresencePenalty = 18 ;
int32 Mirostat = 19 ;
float MirostatETA = 20 ;
float MirostatTAU = 21 ;
bool PenalizeNL = 22 ;
string LogitBias = 23 ;
bool MLock = 25 ;
bool MMap = 26 ;
bool PromptCacheAll = 27 ;
bool PromptCacheRO = 28 ;
string Grammar = 29 ;
string MainGPU = 30 ;
string TensorSplit = 31 ;
float TopP = 32 ;
string PromptCachePath = 33 ;
bool Debug = 34 ;
2023-07-14 23:19:43 +00:00
repeated int32 EmbeddingTokens = 35 ;
string Embeddings = 36 ;
2023-07-25 17:05:27 +00:00
float RopeFreqBase = 37 ;
float RopeFreqScale = 38 ;
float NegativePromptScale = 39 ;
string NegativePrompt = 40 ;
2023-09-14 15:44:16 +00:00
int32 NDraft = 41 ;
2023-11-11 12:14:59 +00:00
repeated string Images = 42 ;
2024-04-11 17:20:22 +00:00
bool UseTokenizerTemplate = 43 ;
repeated Message Messages = 44 ;
2024-09-19 09:21:59 +00:00
repeated string Videos = 45 ;
2024-09-19 10:26:53 +00:00
repeated string Audios = 46 ;
2024-09-28 15:23:56 +00:00
string CorrelationId = 47 ;
2025-11-07 20:23:50 +00:00
string Tools = 48 ; // JSON array of available tools/functions for tool calling
string ToolChoice = 49 ; // JSON string or object specifying tool choice behavior
2025-11-16 12:27:36 +00:00
int32 Logprobs = 50 ; // Number of top logprobs to return (maps to OpenAI logprobs parameter)
int32 TopLogprobs = 51 ; // Number of top logprobs to return per token (maps to OpenAI top_logprobs parameter)
2026-03-05 21:50:10 +00:00
map < string , string > Metadata = 52 ; // Generic per-request metadata (e.g., enable_thinking)
2026-03-21 23:57:15 +00:00
float MinP = 53 ; // Minimum probability sampling threshold (0.0 = disabled)
2023-07-14 23:19:43 +00:00
}
2026-03-08 21:21:57 +00:00
// ToolCallDelta represents an incremental tool call update from the C++ parser.
// Used for both streaming (partial diffs) and non-streaming (final tool calls).
message ToolCallDelta {
int32 index = 1 ; // tool call index (0-based)
string id = 2 ; // tool call ID (e.g., "call_abc123")
string name = 3 ; // function name (set on first appearance)
string arguments = 4 ; // arguments chunk (incremental in streaming, full in non-streaming)
}
// ChatDelta represents incremental content/reasoning/tool_call updates parsed by the C++ backend.
message ChatDelta {
string content = 1 ; // content text delta
string reasoning_content = 2 ; // reasoning/thinking text delta
repeated ToolCallDelta tool_calls = 3 ; // tool call deltas
}
2023-07-14 23:19:43 +00:00
// The response message containing the result
message Reply {
2023-07-27 16:41:04 +00:00
bytes message = 1 ;
2024-04-15 17:47:11 +00:00
int32 tokens = 2 ;
int32 prompt_tokens = 3 ;
2025-01-17 16:05:58 +00:00
double timing_prompt_processing = 4 ;
double timing_token_generation = 5 ;
2025-05-25 20:25:05 +00:00
bytes audio = 6 ;
2025-11-16 12:27:36 +00:00
bytes logprobs = 7 ; // JSON-encoded logprobs data matching OpenAI format
2026-03-08 21:21:57 +00:00
repeated ChatDelta chat_deltas = 8 ; // Parsed chat deltas from C++ autoparser (streaming + non-streaming)
2023-07-14 23:19:43 +00:00
}
2025-02-02 12:25:03 +00:00
message GrammarTrigger {
string word = 1 ;
}
2023-07-14 23:19:43 +00:00
message ModelOptions {
string Model = 1 ;
int32 ContextSize = 2 ;
int32 Seed = 3 ;
int32 NBatch = 4 ;
bool F16Memory = 5 ;
bool MLock = 6 ;
bool MMap = 7 ;
bool VocabOnly = 8 ;
bool LowVRAM = 9 ;
bool Embeddings = 10 ;
bool NUMA = 11 ;
int32 NGPULayers = 12 ;
string MainGPU = 13 ;
string TensorSplit = 14 ;
2023-07-14 23:19:43 +00:00
int32 Threads = 15 ;
2023-07-27 19:56:05 +00:00
float RopeFreqBase = 17 ;
float RopeFreqScale = 18 ;
2023-08-02 22:51:08 +00:00
float RMSNormEps = 19 ;
int32 NGQA = 20 ;
2023-08-07 20:39:10 +00:00
string ModelFile = 21 ;
2025-04-19 13:52:29 +00:00
2023-08-09 06:38:51 +00:00
// Diffusers
string PipelineType = 26 ;
string SchedulerType = 27 ;
bool CUDA = 28 ;
2023-08-15 23:11:42 +00:00
float CFGScale = 29 ;
2023-08-17 21:38:59 +00:00
bool IMG2IMG = 30 ;
string CLIPModel = 31 ;
string CLIPSubfolder = 32 ;
int32 CLIPSkip = 33 ;
2023-12-13 18:20:22 +00:00
string ControlNet = 48 ;
2023-08-22 16:48:06 +00:00
string Tokenizer = 34 ;
2023-08-25 19:58:46 +00:00
// LLM (llama.cpp)
string LoraBase = 35 ;
string LoraAdapter = 36 ;
2023-11-11 17:40:48 +00:00
float LoraScale = 42 ;
2023-08-25 19:58:46 +00:00
bool NoMulMatQ = 37 ;
2023-09-14 15:44:16 +00:00
string DraftModel = 39 ;
2024-03-22 20:14:04 +00:00
2023-09-04 17:25:23 +00:00
string AudioPath = 38 ;
2023-09-22 13:52:38 +00:00
// vllm
string Quantization = 40 ;
2024-03-01 21:48:53 +00:00
float GPUMemoryUtilization = 50 ;
bool TrustRemoteCode = 51 ;
bool EnforceEager = 52 ;
int32 SwapSpace = 53 ;
int32 MaxModelLen = 54 ;
2024-04-20 14:37:02 +00:00
int32 TensorParallelSize = 55 ;
2024-10-23 13:46:06 +00:00
string LoadFormat = 58 ;
2025-02-18 18:27:58 +00:00
bool DisableLogStatus = 66 ;
string DType = 67 ;
int32 LimitImagePerPrompt = 68 ;
int32 LimitVideoPerPrompt = 69 ;
int32 LimitAudioPerPrompt = 70 ;
2023-11-11 12:14:59 +00:00
string MMProj = 41 ;
2023-11-11 17:40:48 +00:00
string RopeScaling = 43 ;
float YarnExtFactor = 44 ;
float YarnAttnFactor = 45 ;
float YarnBetaFast = 46 ;
float YarnBetaSlow = 47 ;
2024-01-25 23:13:21 +00:00
string Type = 49 ;
2024-05-13 17:07:51 +00:00
2025-08-31 15:59:09 +00:00
string FlashAttention = 56 ;
2024-05-13 17:07:51 +00:00
bool NoKVOffload = 57 ;
2024-10-31 11:12:22 +00:00
string ModelPath = 59 ;
2024-11-05 14:14:33 +00:00
repeated string LoraAdapters = 60 ;
repeated float LoraScales = 61 ;
2024-12-03 21:41:22 +00:00
repeated string Options = 62 ;
2024-12-06 09:23:59 +00:00
string CacheTypeKey = 63 ;
string CacheTypeValue = 64 ;
2025-02-02 12:25:03 +00:00
repeated GrammarTrigger GrammarTriggers = 65 ;
2025-05-22 19:49:30 +00:00
bool Reranking = 71 ;
2025-06-28 19:26:07 +00:00
repeated string Overrides = 72 ;
feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map
LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.
Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.
Operators can now write:
engine_args:
data_parallel_size: 8
enable_expert_parallel: true
all2all_backend: deepep_low_latency
speculative_config:
method: deepseek_mtp
num_speculative_tokens: 3
kv_cache_dtype: fp8
without further proto/Go/Python plumbing per field.
Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel
vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.
cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.
Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(vllm): bot job to bump cublas13 vLLM wheel pin
vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.
The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* docs(vllm): document engine_args and speculative decoding
The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-28 22:49:28 +00:00
// EngineArgs carries a JSON-encoded map of backend-native engine arguments
// applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
// Unknown keys produce an error at LoadModel time.
string EngineArgs = 73 ;
2023-07-14 23:19:43 +00:00
}
message Result {
string message = 1 ;
bool success = 2 ;
2023-07-14 23:19:43 +00:00
}
message EmbeddingResult {
repeated float embeddings = 1 ;
2023-07-14 23:19:43 +00:00
}
message TranscriptRequest {
string dst = 2 ;
string language = 3 ;
uint32 threads = 4 ;
2024-06-24 17:21:22 +00:00
bool translate = 5 ;
2025-09-10 17:09:28 +00:00
bool diarize = 6 ;
2025-12-18 13:40:45 +00:00
string prompt = 7 ;
2026-04-14 14:13:40 +00:00
float temperature = 8 ;
repeated string timestamp_granularities = 9 ;
bool stream = 10 ;
2023-07-14 23:19:43 +00:00
}
message TranscriptResult {
repeated TranscriptSegment segments = 1 ;
string text = 2 ;
2026-04-14 14:13:40 +00:00
string language = 3 ;
float duration = 4 ;
}
message TranscriptStreamResponse {
string delta = 1 ;
TranscriptResult final_result = 2 ;
2023-07-14 23:19:43 +00:00
}
2026-05-05 22:32:52 +00:00
message TranscriptWord {
int64 start = 1 ;
int64 end = 2 ;
string text = 3 ;
}
2023-07-14 23:19:43 +00:00
message TranscriptSegment {
int32 id = 1 ;
int64 start = 2 ;
int64 end = 3 ;
string text = 4 ;
repeated int32 tokens = 5 ;
feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299)
* feat(proto): add speaker field to TranscriptSegment for diarization
Add speaker field to the gRPC TranscriptSegment message and map it
through the Go schema, enabling backends to return speaker labels.
Signed-off-by: eureka928 <meobius123@gmail.com>
* feat(whisperx): add whisperx backend for transcription with diarization
Add Python gRPC backend using WhisperX for speech-to-text with
word-level timestamps, forced alignment, and speaker diarization
via pyannote-audio when HF_TOKEN is provided.
Signed-off-by: eureka928 <meobius123@gmail.com>
* feat(whisperx): register whisperx backend in Makefile
Signed-off-by: eureka928 <meobius123@gmail.com>
* feat(whisperx): add whisperx meta and image entries to index.yaml
Signed-off-by: eureka928 <meobius123@gmail.com>
* ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): unpin torch versions and use CPU index for cpu requirements
Address review feedback:
- Use --extra-index-url for CPU torch wheels to reduce size
- Remove torch version pins, let uv resolve compatible versions
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): pin torch ROCm variant to fix CI build failure
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): pin torch CPU variant to fix uv resolution failure
Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra
index instead of picking torch==2.8.0+cu128 from PyPI, which pulls
unresolvable CUDA dependencies.
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure
uv's default first-match strategy finds torch on PyPI before checking
the extra index, causing it to pick torch==2.8.0+cu128 instead of the
CPU variant. This makes whisperx's transitive torch dependency
unresolvable. Using unsafe-best-match lets uv consider all indexes.
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(whisperx): drop +cpu local version suffix to fix uv resolution failure
PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the
issue where uv cannot locate an explicit +cpu local version specifier.
This aligns with the pattern used by all other CPU backends.
Signed-off-by: eureka928 <meobius123@gmail.com>
* fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution
uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4,
+rocm6.3) in pinned requirements. The --extra-index-url already points
to the correct ROCm wheel index and --index-strategy unsafe-best-match
(set in libbackend.sh) ensures the ROCm variant is preferred.
Applies the same fix as 7f5d72e8 (which resolved this for +cpu) across
all 14 hipblas requirements files.
Signed-off-by: eureka928 <meobius123@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: eureka928 <meobius123@gmail.com>
* revert: scope hipblas suffix fix to whisperx only
Reverts changes to non-whisperx hipblas requirements files per
maintainer review — other backends are building fine with the +rocm
local version suffix.
Signed-off-by: eureka928 <meobius123@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: eureka928 <meobius123@gmail.com>
---------
Signed-off-by: eureka928 <meobius123@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:33:12 +00:00
string speaker = 6 ;
2026-05-05 22:32:52 +00:00
repeated TranscriptWord words = 7 ;
2023-07-14 23:19:43 +00:00
}
message GenerateImageRequest {
int32 height = 1 ;
int32 width = 2 ;
int32 step = 4 ;
int32 seed = 5 ;
string positive_prompt = 6 ;
string negative_prompt = 7 ;
string dst = 8 ;
2023-08-17 21:38:59 +00:00
string src = 9 ;
2023-08-14 21:12:00 +00:00
// Diffusers
2023-08-17 21:38:59 +00:00
string EnableParameters = 10 ;
int32 CLIPSkip = 11 ;
2025-09-10 17:09:28 +00:00
2025-07-30 20:42:34 +00:00
// Reference images for models that support them (e.g., Flux Kontext)
repeated string ref_images = 12 ;
2023-07-14 23:19:43 +00:00
}
2025-04-26 16:05:01 +00:00
message GenerateVideoRequest {
string prompt = 1 ;
2025-08-28 08:26:42 +00:00
string negative_prompt = 2 ; // Negative prompt for video generation
string start_image = 3 ; // Path or base64 encoded image for the start frame
string end_image = 4 ; // Path or base64 encoded image for the end frame
int32 width = 5 ;
int32 height = 6 ;
int32 num_frames = 7 ; // Number of frames to generate
int32 fps = 8 ; // Frames per second
int32 seed = 9 ;
float cfg_scale = 10 ; // Classifier-free guidance scale
int32 step = 11 ; // Number of inference steps
string dst = 12 ; // Output path for the generated video
2025-04-26 16:05:01 +00:00
}
2023-07-14 23:19:43 +00:00
message TTSRequest {
string text = 1 ;
string model = 2 ;
string dst = 3 ;
2024-03-14 22:08:34 +00:00
string voice = 4 ;
2024-06-01 18:26:27 +00:00
optional string language = 5 ;
2023-07-14 23:19:43 +00:00
}
2023-08-18 19:23:14 +00:00
2024-11-20 13:48:40 +00:00
message VADRequest {
repeated float audio = 1 ;
}
message VADSegment {
float start = 1 ;
float end = 2 ;
}
message VADResponse {
repeated VADSegment segments = 1 ;
}
feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp (#9654)
* feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp
Closes #1648.
OpenAI-style multipart endpoint that returns "who spoke when". Single
endpoint instead of the issue's three-endpoint sketch (refactor /vad,
/vad/embedding, /diarization) — the typical client wants one call, and
embeddings can land later as a sibling without breaking this surface.
Response shape borrows from Pyannote/Deepgram: segments carry a
normalised SPEAKER_NN id (zero-padded, stable across the response) plus
the raw backend label, optional per-segment text when the backend bundles
ASR, and a speakers summary in verbose_json. response_format also accepts
rttm so consumers can pipe straight into pyannote.metrics / dscore.
Backends:
* vibevoice-cpp — Diarize() reuses the existing vv_capi_asr pass.
vibevoice's ASR prompt asks the model to emit
[{Start,End,Speaker,Content}] natively, so diarization is a by-product
of the same pass; include_text=true preserves the transcript per
segment, otherwise we drop it.
* sherpa-onnx — wraps the upstream SherpaOnnxOfflineSpeakerDiarization
C API (pyannote segmentation + speaker-embedding extractor + fast
clustering). libsherpa-shim grew config builders, a SetClustering
wrapper for per-call num_clusters/threshold overrides, and a
segment_at accessor (purego can't read field arrays out of
SherpaOnnxOfflineSpeakerDiarizationSegment[] directly).
Plumbing: new Diarize gRPC RPC + DiarizeRequest / DiarizeSegment /
DiarizeResponse messages, threaded through interface.go, base, server,
client, embed. Default Base impl returns unimplemented.
Capability surfaces all updated: FLAG_DIARIZATION usecase,
FeatureAudioDiarization permission (default-on), RouteFeatureRegistry
entries for /v1/audio/diarization and /audio/diarization, audio
instruction-def description widened, CAP_DIARIZATION JS symbol,
swagger regenerated, /api/instructions discovery map updated.
Tests:
* core/backend: speaker-label normalisation (first-seen → SPEAKER_NN,
per-speaker totals, nil-safety, fallback to backend NumSpeakers when
no segments).
* core/http/endpoints/openai: RTTM rendering (file-id basename, negative
duration clamping, fallback id).
* tests/e2e: mock-backend grew a deterministic Diarize that emits
raw labels "5","2","5" so the e2e suite verifies SPEAKER_NN
remapping, verbose_json speakers summary + transcript pass-through
(gated by include_text), RTTM bytes content-type, and rejection of
unknown response_format. mock-diarize model config registered with
known_usecases=[FLAG_DIARIZATION] to bypass the backend-name guard.
Docs: new features/audio-diarization.md (request/response, RTTM example,
sherpa-onnx + vibevoice setup), cross-link from audio-to-text.md, entry
in whats-new.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(diarization): correct sherpa-onnx symbol name + lint cleanup
CI failures on #9654:
* sherpa-onnx-grpc-{tts,transcription} and sherpa-onnx-realtime panicked
at backend startup with `undefined symbol: SherpaOnnxDestroyOfflineSpeakerDiarizationResult`.
Upstream's actual symbol is SherpaOnnxOfflineSpeakerDiarizationDestroyResult
(Destroy in the middle, not the prefix); the rest of the diarization
surface follows the same naming pattern. The mismatched name made
purego.RegisterLibFunc fail at dlopen time and crashed the gRPC server
before the BeforeAll could probe Health, taking down every sherpa-onnx
test job — not just the diarization-related ones.
* golangci-lint flagged 5 errcheck violations on new defer cleanups
(os.RemoveAll / Close / conn.Close); wrap each in a `defer func() { _ = X() }()`
closure (matches the pattern other LocalAI files use for new code, since
pre-existing bare defers are grandfathered in via new-from-merge-base).
* golangci-lint also flagged forbidigo violations: the new
diarization_test.go files used testing.T-style `t.Errorf` / `t.Fatalf`,
which are forbidden by the project's coding-style policy
(.agents/coding-style.md). Convert both files to Ginkgo/Gomega
Describe/It with Expect(...) — they get picked up by the existing
TestBackend / TestOpenAI suites, no new suite plumbing needed.
* modernize linter: tightened the diarization segment loop to
`for i := range int(numSegments)` (Go 1.22+ idiom).
Verified locally: golangci-lint with new-from-merge-base=origin/master
reports 0 issues across all touched packages, and the four mocked
diarization e2e specs in tests/e2e/mock_backend_test.go still pass.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): convert non-WAV input via ffmpeg + raise ASR token budget
Confirmed end-to-end against a real LocalAI instance with vibevoice-asr-q4_k
loaded and the multi-speaker MP3 sample at vibevoice.cpp/samples/2p_argument.mp3:
both /v1/audio/transcriptions and /v1/audio/diarization now succeed and
return correctly attributed speaker turns for the full clip.
Two latent issues surfaced once the diarization endpoint actually exercised
the backend with a non-trivial input:
1. vv_capi_asr only accepts WAV via load_wav_24k_mono. The previous code
passed the uploaded path straight through, so anything that wasn't
already a 24 kHz mono s16le WAV failed at the C side with rc=-8 and
the very unhelpful "vv_capi_asr failed". prepareWavInput shells out
to ffmpeg ("-ar 24000 -ac 1 -acodec pcm_s16le") in a per-call temp
dir, matching the rate the model was trained on; both AudioTranscription
and Diarize now route through it. This is the same shape sherpa-onnx
uses (utils.AudioToWav), but vibevoice needs 24 kHz rather than 16 kHz
so we don't reuse that helper.
2. The C ABI's max_new_tokens defaults to 256 when 0 is passed. That's
fine for a five-second clip but not for anything past ~10 s — vibevoice
stops mid-JSON, the parse fails, and the caller sees a hard error.
Pass a much larger budget (16 384 ≈ ~9 minutes of speech at the
model's ~30 tok/s rate); generation stops at EOS so this is a cap
rather than a target.
3. As a defensive belt-and-braces, mirror AudioTranscription's existing
"fall back to a single segment if the model emits non-JSON text"
pattern in Diarize, so partial / unusual model output never produces
a 500. This kept the endpoint usable while diagnosing (1) and (2),
and is the right behaviour to keep.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): pass valid WAVs through directly so ffmpeg is not required at runtime
Spotted by tests-e2e-backend (1.25.x): the previous fix forced every
incoming audio file through `ffmpeg -ar 24000 ...`, which meant the
backend container — which does not ship ffmpeg — failed even for the
existing happy path where the caller already uploads a WAV. The
container-side error was:
rpc error: code = Unknown desc = vibevoice-cpp: ffmpeg convert to
24k mono wav: exec: "ffmpeg": executable file not found in $PATH
Reading vibevoice.cpp's audio_io.cpp, `load_wav_24k_mono` uses drwav and
already accepts any PCM/IEEE-float WAV at any sample rate, downmixes
multi-channel input to mono, and resamples to 24 kHz internally. So the
only inputs that genuinely need an external converter are non-WAV
formats (MP3, OGG, FLAC, ...).
Detect WAVs by RIFF/WAVE magic at bytes 0..3 / 8..11 and pass them
straight through with a no-op cleanup; everything else still goes
through ffmpeg with the same 24 kHz mono s16le target. The result:
* Container builds without ffmpeg keep working for WAV uploads
(the e2e-backends fixture is jfk.wav at 16 kHz mono s16le).
* MP3 and other non-WAV inputs still get the new ffmpeg conversion
path so the diarization endpoint stays useful.
* If the caller uploads a non-WAV but ffmpeg isn't on PATH, the
surfaced error is still descriptive enough to act on.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(ci): make gcc-14 install in Dockerfile.golang best-effort for jammy bases
The LocalVQE PR (bb033b16) made `gcc-14 g++-14` an unconditional apt
install in backend/Dockerfile.golang and pointed update-alternatives at
them. That works on the default `BASE_IMAGE=ubuntu:24.04` (noble has
gcc-14 in main), but every Go backend that builds on
`nvcr.io/nvidia/l4t-jetpack:r36.4.0` — jammy under the hood — now fails
at the apt step:
E: Unable to locate package gcc-14
This blocked unrelated jobs:
backend-jobs(*-nvidia-l4t-arm64-{stablediffusion-ggml, sam3-cpp, whisper,
acestep-cpp, qwen3-tts-cpp, vibevoice-cpp}). LocalVQE itself is only
matrix-built on ubuntu:24.04 (CPU + Vulkan), so it doesn't actually
need gcc-14 anywhere else.
Make the gcc-14 install conditional on the package being available in
the configured apt repos. On noble: identical behaviour to today (gcc-14
installed, update-alternatives points at it). On jammy: skip the
gcc-14 stanza entirely and let build-essential's default gcc take over,
which is what the other Go backends compile with anyway.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-05 13:10:13 +00:00
// --- Speaker diarization messages ---
//
// Pure speaker diarization: "who spoke when". Returns time-stamped segments
// labelled with cluster IDs (the same string for the same speaker across
// segments). Some backends (e.g. vibevoice.cpp) produce diarization as a
// by-product of ASR and may also fill in `text` per segment; backends with a
// dedicated diarization pipeline (e.g. sherpa-onnx pyannote) leave `text`
// empty and emit only the segmentation.
message DiarizeRequest {
string dst = 1 ; // path to audio file (HTTP layer materialises uploads to a temp file)
uint32 threads = 2 ;
string language = 3 ; // optional; only meaningful for transcription-bundling backends
int32 num_speakers = 4 ; // exact speaker count if known (>0 forces); 0 = auto
int32 min_speakers = 5 ; // hint when auto-detecting; 0 = unset
int32 max_speakers = 6 ; // hint when auto-detecting; 0 = unset
float clustering_threshold = 7 ; // distance threshold when num_speakers unknown; 0 = backend default
float min_duration_on = 8 ; // discard segments shorter than this (seconds); 0 = backend default
float min_duration_off = 9 ; // merge gaps shorter than this (seconds); 0 = backend default
bool include_text = 10 ; // when the backend can emit per-segment transcript for free, ask it to populate `text`
}
message DiarizeSegment {
int32 id = 1 ;
float start = 2 ; // seconds
float end = 3 ; // seconds
string speaker = 4 ; // backend-emitted speaker label (e.g. "0", "SPEAKER_00")
string text = 5 ; // optional per-segment transcript (empty unless include_text and supported)
}
message DiarizeResponse {
repeated DiarizeSegment segments = 1 ;
int32 num_speakers = 2 ; // count of distinct speaker labels in `segments`
float duration = 3 ; // total audio duration in seconds (0 if unknown)
string language = 4 ; // optional, when the backend bundles transcription
}
2024-08-24 00:20:28 +00:00
message SoundGenerationRequest {
string text = 1 ;
string model = 2 ;
string dst = 3 ;
optional float duration = 4 ;
optional float temperature = 5 ;
optional bool sample = 6 ;
optional string src = 7 ;
optional int32 src_divisor = 8 ;
2026-02-05 11:04:53 +00:00
optional bool think = 9 ;
optional string caption = 10 ;
optional string lyrics = 11 ;
optional int32 bpm = 12 ;
optional string keyscale = 13 ;
optional string language = 14 ;
optional string timesignature = 15 ;
optional bool instrumental = 17 ;
2024-08-24 00:20:28 +00:00
}
2023-08-18 19:23:14 +00:00
message TokenizationResponse {
int32 length = 1 ;
repeated int32 tokens = 2 ;
}
message MemoryUsageData {
uint64 total = 1 ;
map < string , uint64 > breakdown = 2 ;
}
message StatusResponse {
enum State {
UNINITIALIZED = 0 ;
BUSY = 1 ;
READY = 2 ;
ERROR = - 1 ;
}
State state = 1 ;
MemoryUsageData memory = 2 ;
2024-03-22 20:14:04 +00:00
}
2024-04-11 17:20:22 +00:00
message Message {
string role = 1 ;
string content = 2 ;
2025-11-07 20:23:50 +00:00
// Optional fields for OpenAI-compatible message format
string name = 3 ; // Tool name (for tool messages)
string tool_call_id = 4 ; // Tool call ID (for tool messages)
string reasoning_content = 5 ; // Reasoning content (for thinking models)
string tool_calls = 6 ; // Tool calls as JSON string (for assistant messages with tool calls)
2025-01-17 16:05:58 +00:00
}
2025-07-27 20:02:51 +00:00
message DetectOptions {
string src = 1 ;
2026-04-09 19:49:11 +00:00
string prompt = 2 ; // Text prompt (for SAM 3 PCS mode)
repeated float points = 3 ; // Point coordinates as [x1, y1, label1, x2, y2, label2, ...] (label: 1=pos, 0=neg)
repeated float boxes = 4 ; // Box coordinates as [x1, y1, x2, y2, ...]
float threshold = 5 ; // Detection confidence threshold
2025-07-27 20:02:51 +00:00
}
message Detection {
float x = 1 ;
float y = 2 ;
float width = 3 ;
float height = 4 ;
float confidence = 5 ;
string class_name = 6 ;
2026-04-09 19:49:11 +00:00
bytes mask = 7 ; // PNG-encoded binary segmentation mask
2025-07-27 20:02:51 +00:00
}
message DetectResponse {
repeated Detection Detections = 1 ;
}
2026-01-22 23:38:28 +00:00
feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis
Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.
New gRPC RPCs (backend/backend.proto):
* FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
* FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse
Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.
New HTTP endpoints under /v1/face/:
* verify — 1:1 image pair same-person decision
* analyze — per-face age + gender (emotion/race reserved)
* register — 1:N enrollment; stores embedding in vector store
* identify — 1:N recognition; detect → embed → StoresFind
* forget — remove a registered face by opaque ID
Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.
Gallery (backend/index.yaml) ships three entries:
* insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage
(~326MB pre-baked; non-commercial research use only)
* insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0)
* insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial)
Python backend (backend/python/insightface/):
* engines.py — FaceEngine protocol with InsightFaceEngine and
OnnxDirectEngine; resolves model paths relative to the backend
directory so the same gallery config works in docker-scratch and
in the e2e-backends rootfs-extraction harness.
* backend.py — gRPC servicer implementing Health, LoadModel, Status,
Embedding, Detect, FaceVerify, FaceAnalyze.
* install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
backend directory so first-run is offline-clean (the final scratch
image only preserves files under /<backend>/).
* test.py — parametrized unit tests over both engines.
Tests:
* Registry unit tests (go test -race ./core/services/facerecognition/...)
— in-memory fake grpc.Backend, table-driven, covers register/
identify/forget/error paths + concurrent access.
* tests/e2e-backends/backend_test.go extended with face caps
(face_detect, face_embed, face_verify, face_analyze); relative
ordering + configurable verifyCeiling per engine.
* Makefile targets: test-extra-backend-insightface-buffalo-l,
-opencv, and the -all aggregate.
* CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
auto-triggered by changes under backend/python/insightface/.
Docs:
* docs/content/features/face-recognition.md — feature page with
license table, quickstart (defaults to the commercial-safe model),
models matrix, API reference, 1:N workflow, storage caveats.
* Cross-refs in object-detection.md, stores.md, embeddings.md, and
whats-new.md.
* Contributor README at backend/python/insightface/README.md.
Verified end-to-end:
* buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
face_verify, face_analyze).
* opencv: 5/5 specs (same minus face_analyze — SFace has no
demographic head; correctly skipped via BACKEND_TEST_CAPS).
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): move engine selection to model gallery, collapse backend entries
The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.
Correct split:
* `backend/index.yaml` now has ONE `insightface` backend entry
shipping the CPU + CUDA 12 container images. The Python backend
bundles both the non-commercial insightface model packs
(buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
weights (YuNet + SFace); the active engine is selected at
LoadModel time via `options: ["engine:..."]`.
* `gallery/index.yaml` gains three model entries —
`insightface-buffalo-l`, `insightface-opencv`,
`insightface-buffalo-s` — each setting the appropriate
`overrides.backend` + `overrides.options` so installing one
actually gives the user the intended engine. This matches how
`rfdetr-base` lives in the model gallery against the `rfdetr`
backend.
The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): cover all supported models in the gallery + drop weight baking
Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.
Gallery now has seven `insightface-*` model entries (gallery/index.yaml):
insightface (family) — non-commercial research use
• buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default
• buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage
• buffalo-s (159MB) — SCRFD-500MF + MBF + genderage
• buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only
(no landmarks, no demographics — analyze
returns empty attributes)
• antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage
OpenCV Zoo family — Apache 2.0 commercial-safe
• opencv — YuNet + SFace fp32 (~40MB)
• opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)
Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:
* OpenCV variants list `files:` with ONNX URIs + SHA-256, so
`local-ai models install insightface-opencv` pulls them into the
models directory exactly like any other gallery-managed model.
* insightface packs (upstream distributes .zip archives only, not
individual ONNX files) auto-download on first LoadModel via
FaceAnalysis' built-in machinery, rooted at the LocalAI models
directory so they live alongside everything else — same pattern
`rfdetr` uses with `inference.get_model()`.
Backend changes (backend/python/insightface/):
* backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
LocalAI models directory) to engines via a `_model_dir` hint.
This replaces the earlier ModelFile-dirname approach; ModelPath
is the canonical "models directory" variable set by the Go loader
(pkg/model/initializers.go:144) and is always populated.
* engines.py::_resolve_model_path — picks up `model_dir` and searches
it (plus basename-in-model-dir) before falling back to the dev
script-dir. This is how OnnxDirectEngine finds gallery-downloaded
YuNet/SFace files by filename only.
* engines.py::_flatten_insightface_pack — new helper that works
around an upstream packaging inconsistency: buffalo_l/s/sc zips
expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
files in a redundant `<name>/` directory. insightface's own
loader looks one level too shallow and fails. We call
`ensure_available()` explicitly, flatten if nested, then hand to
FaceAnalysis.
* engines.py::InsightFaceEngine.prepare — root-resolution order now
includes the `_model_dir` hint so packs download into the LocalAI
models directory by default.
* install.sh — no longer pre-downloads any weights. Everything is
gallery-managed now.
* smoke.py (new) — parametrized smoke test that iterates over every
gallery configuration, simulating the LocalAI install flow
(creates a models dir, fetches OpenCV files with checksum
verification, lets insightface auto-download its packs), then
runs detect + embed + verify (+ analyze where supported) through
the in-process BackendServicer.
* test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
paths; downloads ONNX files to a temp dir at setUpClass time and
passes ModelPath accordingly.
Registry change (core/services/facerecognition/store_registry.go):
* `dim=0` in NewStoreRegistry now means "accept whatever dimension
arrives" — needed because the backend supports 512-d ArcFace/MBF
and 128-d SFace via the same Registry. A non-zero dim still fails
fast with ErrDimensionMismatch.
* core/application plumbs `faceEmbeddingDim = 0`, explaining the
rationale in the comment.
Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.
Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:
PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000
PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000
PASS: insightface-opencv faces=6 dim=128 same-dist=0.000
PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000
7/7 passed
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim
CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:
LoadModel failed: Failed to load face engine:
OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx
Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).
Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs
The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.
Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).
Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).
Response:
{
"embedding": [<dim> floats, L2-normed],
"dim": int, // 512 for ArcFace R50 / MBF, 128 for SFace
"model": "<name>"
}
Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.
Assisted-by: Claude:claude-opus-4-7
* fix(http): map malformed image input + gRPC status codes to proper 4xx
Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.
Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:
* decodeImageInput(s)
Wraps utils.GetContentURIAsBase64 and turns any failure
(invalid URL, not a data-URI, download error, etc.) into
echo.NewHTTPError(400, "invalid image input: ...").
* mapBackendError(err)
Inspects the gRPC status on a backend call error and maps:
INVALID_ARGUMENT → 400 Bad Request
NOT_FOUND → 404 Not Found
FAILED_PRECONDITION → 412 Precondition Failed
Unimplemented → 501 Not Implemented
All other codes fall through unchanged (still 500).
Before, my 1×1 PNG error-path test returned:
HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
HTTP 400 "failed to decode one or both images"
Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.
Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.
Assisted-by: Claude:claude-opus-4-7
* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis
Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.
Two changes:
1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
`files:` list with the pack zip's URI + SHA-256. `local-ai models
install insightface-buffalo-l` now fetches the zip, verifies the
hash, and extracts it into the models directory. No more reliance
on insightface's library-internal `ensure_available()` auto-download
or its hardcoded `BASE_REPO_URL`.
2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
the FaceAnalysis wrapper and drives insightface's `model_zoo`
directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
route each through `model_zoo.get_model()`, build a
`{taskname: model}` dict, loop per-face at inference — are
reimplemented in `InsightFaceEngine`. The actual inference classes
(RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
insightface's — we only replicate the glue, so drift risk against
upstream is minimal.
Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
layout that doesn't match what LocalAI's zip extraction produces.
LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
`<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
`_locate_insightface_pack` helper searches both locations plus
legacy paths and returns whichever has ONNX files. Replaces the
earlier `_flatten_insightface_pack` helper (which tried to fight
FaceAnalysis's layout expectations; now we just find the files
wherever they are).
Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched
The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".
Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.
Face analyze cap dropped since buffalo_sc has no age/gender head.
Assisted-by: Claude:claude-opus-4-7[1m]
* feat(face-recognition): surface face-recognition in advertised feature maps
The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:
* api_instructions — the machine-readable capability index at
GET /api/instructions. Added `face-recognition` as a dedicated
instruction area with an intro that calls out the in-memory
registry caveat and the /v1/face/embed vs /v1/embeddings split.
* auth/permissions — added FeatureFaceRecognition constant, routed
all six face endpoints through it so admins can gate them per-user
like any other API feature. Default ON (matches the other API
features).
* React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
follow-up (noted in the plan).
Instruction count bumped 9 → 10; test updated.
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(agents): capture advertising-surface steps in the endpoint guide
Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.
Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.
Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 19:55:41 +00:00
// --- Face recognition messages ---
message FacialArea {
float x = 1 ;
float y = 2 ;
float w = 3 ;
float h = 4 ;
}
message FaceVerifyRequest {
string img1 = 1 ; // base64-encoded image
string img2 = 2 ; // base64-encoded image
float threshold = 3 ; // cosine-distance threshold; 0 = use backend default
feat(insightface): add antispoofing (liveness) detection (#9515)
* feat(insightface): add antispoofing (liveness) detection
Light up the anti_spoofing flag that was parked during the first pass.
Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 +
MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is
set. Failed liveness on either image vetoes FaceVerify regardless of
embedding similarity. Every insightface* gallery entry now ships the
MiniFASNet ONNX weights so existing packs light up after reinstall.
Setting the flag against a model without the MiniFASNet files returns
FAILED_PRECONDITION (HTTP 412) with a clear install message — no
silent is_real=false.
FaceVerifyResponse gained per-image img{1,2}_is_real and
img{1,2}_antispoof_score (proto 9-12); FaceAnalysis's existing
is_real/antispoof_score fields are now populated. Schema fields are
pointers so they are fully absent from the JSON response when
anti_spoofing was not requested — avoids collapsing "not checked" with
"checked and fake" under Go's omitempty on bool.
Validated end-to-end over HTTP against a local install:
- verify + anti_spoofing, both real -> verified=true, score ~0.76
- verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false
- analyze + anti_spoofing -> is_real and score per face
- flag against model without MiniFASNet -> HTTP 412 fail-loud
Assisted-by: Claude:claude-opus-4-7 go vet
* test(insightface): wire test target into test-extra
The root Makefile's `test-extra` already runs
`$(MAKE) -C backend/python/insightface test`, but the backend's
Makefile never defined the target — so the command silently errored
and the suite was never executed in CI. Adding the two-line target
(matching ace-step/Makefile) hooks `test.sh` → `runUnittests` →
`python -m unittest test.py`, which discovers both the pre-existing
engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the
new AntispoofingTest. Each class skips gracefully when its weights
can't be downloaded from a network-restricted runner.
Assisted-by: Claude:claude-opus-4-7
* test(insightface): exercise antispoofing in e2e-backends (both paths)
Add a `face_antispoof` capability to the Ginkgo e2e suite and extend
the existing FaceVerify + FaceAnalyze specs with liveness assertions
covering BOTH paths:
real fixture -> is_real=true, score>0, verified stays true
spoof fixture -> is_real=false, verified vetoed to false
The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo
mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to
classify as is_real=false with score ~0.013. That makes the assertion
deterministic across CI runs; synthetic/derived spoofs fool the model
unpredictably and would be flaky.
Makefile wires it up end-to-end:
- New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with
pinned SHAs, matching the gallery entries.
- insightface-antispoof-models target shared by both backend configs.
- FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL.
- Both e2e targets (buffalo-sc + opencv) now:
* depend on insightface-antispoof-models
* pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS
* include face_antispoof in BACKEND_TEST_CAPS
backend_test.go adds the new capability constant and a faceSpoofFile
fixture resolved the same way as faceFile1/2/3. Spoof assertions are
gated on both capFaceAntispoof AND faceSpoofFile being set, so a test
config that omits the spoof fixture degrades gracefully to "real path
only" instead of failing.
Assisted-by: Claude:claude-opus-4-7 go vet
2026-04-23 16:28:15 +00:00
bool anti_spoofing = 4 ; // run MiniFASNet liveness on each image; failed liveness forces verified=false
feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis
Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.
New gRPC RPCs (backend/backend.proto):
* FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
* FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse
Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.
New HTTP endpoints under /v1/face/:
* verify — 1:1 image pair same-person decision
* analyze — per-face age + gender (emotion/race reserved)
* register — 1:N enrollment; stores embedding in vector store
* identify — 1:N recognition; detect → embed → StoresFind
* forget — remove a registered face by opaque ID
Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.
Gallery (backend/index.yaml) ships three entries:
* insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage
(~326MB pre-baked; non-commercial research use only)
* insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0)
* insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial)
Python backend (backend/python/insightface/):
* engines.py — FaceEngine protocol with InsightFaceEngine and
OnnxDirectEngine; resolves model paths relative to the backend
directory so the same gallery config works in docker-scratch and
in the e2e-backends rootfs-extraction harness.
* backend.py — gRPC servicer implementing Health, LoadModel, Status,
Embedding, Detect, FaceVerify, FaceAnalyze.
* install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
backend directory so first-run is offline-clean (the final scratch
image only preserves files under /<backend>/).
* test.py — parametrized unit tests over both engines.
Tests:
* Registry unit tests (go test -race ./core/services/facerecognition/...)
— in-memory fake grpc.Backend, table-driven, covers register/
identify/forget/error paths + concurrent access.
* tests/e2e-backends/backend_test.go extended with face caps
(face_detect, face_embed, face_verify, face_analyze); relative
ordering + configurable verifyCeiling per engine.
* Makefile targets: test-extra-backend-insightface-buffalo-l,
-opencv, and the -all aggregate.
* CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
auto-triggered by changes under backend/python/insightface/.
Docs:
* docs/content/features/face-recognition.md — feature page with
license table, quickstart (defaults to the commercial-safe model),
models matrix, API reference, 1:N workflow, storage caveats.
* Cross-refs in object-detection.md, stores.md, embeddings.md, and
whats-new.md.
* Contributor README at backend/python/insightface/README.md.
Verified end-to-end:
* buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
face_verify, face_analyze).
* opencv: 5/5 specs (same minus face_analyze — SFace has no
demographic head; correctly skipped via BACKEND_TEST_CAPS).
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): move engine selection to model gallery, collapse backend entries
The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.
Correct split:
* `backend/index.yaml` now has ONE `insightface` backend entry
shipping the CPU + CUDA 12 container images. The Python backend
bundles both the non-commercial insightface model packs
(buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
weights (YuNet + SFace); the active engine is selected at
LoadModel time via `options: ["engine:..."]`.
* `gallery/index.yaml` gains three model entries —
`insightface-buffalo-l`, `insightface-opencv`,
`insightface-buffalo-s` — each setting the appropriate
`overrides.backend` + `overrides.options` so installing one
actually gives the user the intended engine. This matches how
`rfdetr-base` lives in the model gallery against the `rfdetr`
backend.
The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): cover all supported models in the gallery + drop weight baking
Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.
Gallery now has seven `insightface-*` model entries (gallery/index.yaml):
insightface (family) — non-commercial research use
• buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default
• buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage
• buffalo-s (159MB) — SCRFD-500MF + MBF + genderage
• buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only
(no landmarks, no demographics — analyze
returns empty attributes)
• antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage
OpenCV Zoo family — Apache 2.0 commercial-safe
• opencv — YuNet + SFace fp32 (~40MB)
• opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)
Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:
* OpenCV variants list `files:` with ONNX URIs + SHA-256, so
`local-ai models install insightface-opencv` pulls them into the
models directory exactly like any other gallery-managed model.
* insightface packs (upstream distributes .zip archives only, not
individual ONNX files) auto-download on first LoadModel via
FaceAnalysis' built-in machinery, rooted at the LocalAI models
directory so they live alongside everything else — same pattern
`rfdetr` uses with `inference.get_model()`.
Backend changes (backend/python/insightface/):
* backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
LocalAI models directory) to engines via a `_model_dir` hint.
This replaces the earlier ModelFile-dirname approach; ModelPath
is the canonical "models directory" variable set by the Go loader
(pkg/model/initializers.go:144) and is always populated.
* engines.py::_resolve_model_path — picks up `model_dir` and searches
it (plus basename-in-model-dir) before falling back to the dev
script-dir. This is how OnnxDirectEngine finds gallery-downloaded
YuNet/SFace files by filename only.
* engines.py::_flatten_insightface_pack — new helper that works
around an upstream packaging inconsistency: buffalo_l/s/sc zips
expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
files in a redundant `<name>/` directory. insightface's own
loader looks one level too shallow and fails. We call
`ensure_available()` explicitly, flatten if nested, then hand to
FaceAnalysis.
* engines.py::InsightFaceEngine.prepare — root-resolution order now
includes the `_model_dir` hint so packs download into the LocalAI
models directory by default.
* install.sh — no longer pre-downloads any weights. Everything is
gallery-managed now.
* smoke.py (new) — parametrized smoke test that iterates over every
gallery configuration, simulating the LocalAI install flow
(creates a models dir, fetches OpenCV files with checksum
verification, lets insightface auto-download its packs), then
runs detect + embed + verify (+ analyze where supported) through
the in-process BackendServicer.
* test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
paths; downloads ONNX files to a temp dir at setUpClass time and
passes ModelPath accordingly.
Registry change (core/services/facerecognition/store_registry.go):
* `dim=0` in NewStoreRegistry now means "accept whatever dimension
arrives" — needed because the backend supports 512-d ArcFace/MBF
and 128-d SFace via the same Registry. A non-zero dim still fails
fast with ErrDimensionMismatch.
* core/application plumbs `faceEmbeddingDim = 0`, explaining the
rationale in the comment.
Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.
Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:
PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000
PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000
PASS: insightface-opencv faces=6 dim=128 same-dist=0.000
PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000
7/7 passed
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim
CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:
LoadModel failed: Failed to load face engine:
OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx
Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).
Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs
The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.
Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).
Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).
Response:
{
"embedding": [<dim> floats, L2-normed],
"dim": int, // 512 for ArcFace R50 / MBF, 128 for SFace
"model": "<name>"
}
Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.
Assisted-by: Claude:claude-opus-4-7
* fix(http): map malformed image input + gRPC status codes to proper 4xx
Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.
Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:
* decodeImageInput(s)
Wraps utils.GetContentURIAsBase64 and turns any failure
(invalid URL, not a data-URI, download error, etc.) into
echo.NewHTTPError(400, "invalid image input: ...").
* mapBackendError(err)
Inspects the gRPC status on a backend call error and maps:
INVALID_ARGUMENT → 400 Bad Request
NOT_FOUND → 404 Not Found
FAILED_PRECONDITION → 412 Precondition Failed
Unimplemented → 501 Not Implemented
All other codes fall through unchanged (still 500).
Before, my 1×1 PNG error-path test returned:
HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
HTTP 400 "failed to decode one or both images"
Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.
Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.
Assisted-by: Claude:claude-opus-4-7
* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis
Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.
Two changes:
1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
`files:` list with the pack zip's URI + SHA-256. `local-ai models
install insightface-buffalo-l` now fetches the zip, verifies the
hash, and extracts it into the models directory. No more reliance
on insightface's library-internal `ensure_available()` auto-download
or its hardcoded `BASE_REPO_URL`.
2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
the FaceAnalysis wrapper and drives insightface's `model_zoo`
directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
route each through `model_zoo.get_model()`, build a
`{taskname: model}` dict, loop per-face at inference — are
reimplemented in `InsightFaceEngine`. The actual inference classes
(RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
insightface's — we only replicate the glue, so drift risk against
upstream is minimal.
Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
layout that doesn't match what LocalAI's zip extraction produces.
LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
`<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
`_locate_insightface_pack` helper searches both locations plus
legacy paths and returns whichever has ONNX files. Replaces the
earlier `_flatten_insightface_pack` helper (which tried to fight
FaceAnalysis's layout expectations; now we just find the files
wherever they are).
Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched
The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".
Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.
Face analyze cap dropped since buffalo_sc has no age/gender head.
Assisted-by: Claude:claude-opus-4-7[1m]
* feat(face-recognition): surface face-recognition in advertised feature maps
The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:
* api_instructions — the machine-readable capability index at
GET /api/instructions. Added `face-recognition` as a dedicated
instruction area with an intro that calls out the in-memory
registry caveat and the /v1/face/embed vs /v1/embeddings split.
* auth/permissions — added FeatureFaceRecognition constant, routed
all six face endpoints through it so admins can gate them per-user
like any other API feature. Default ON (matches the other API
features).
* React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
follow-up (noted in the plan).
Instruction count bumped 9 → 10; test updated.
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(agents): capture advertising-surface steps in the endpoint guide
Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.
Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.
Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 19:55:41 +00:00
}
message FaceVerifyResponse {
bool verified = 1 ;
float distance = 2 ; // 1 - cosine_similarity
float threshold = 3 ;
float confidence = 4 ; // 0-100
string model = 5 ; // e.g. "buffalo_l"
FacialArea img1_area = 6 ;
FacialArea img2_area = 7 ;
float processing_time_ms = 8 ;
feat(insightface): add antispoofing (liveness) detection (#9515)
* feat(insightface): add antispoofing (liveness) detection
Light up the anti_spoofing flag that was parked during the first pass.
Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 +
MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is
set. Failed liveness on either image vetoes FaceVerify regardless of
embedding similarity. Every insightface* gallery entry now ships the
MiniFASNet ONNX weights so existing packs light up after reinstall.
Setting the flag against a model without the MiniFASNet files returns
FAILED_PRECONDITION (HTTP 412) with a clear install message — no
silent is_real=false.
FaceVerifyResponse gained per-image img{1,2}_is_real and
img{1,2}_antispoof_score (proto 9-12); FaceAnalysis's existing
is_real/antispoof_score fields are now populated. Schema fields are
pointers so they are fully absent from the JSON response when
anti_spoofing was not requested — avoids collapsing "not checked" with
"checked and fake" under Go's omitempty on bool.
Validated end-to-end over HTTP against a local install:
- verify + anti_spoofing, both real -> verified=true, score ~0.76
- verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false
- analyze + anti_spoofing -> is_real and score per face
- flag against model without MiniFASNet -> HTTP 412 fail-loud
Assisted-by: Claude:claude-opus-4-7 go vet
* test(insightface): wire test target into test-extra
The root Makefile's `test-extra` already runs
`$(MAKE) -C backend/python/insightface test`, but the backend's
Makefile never defined the target — so the command silently errored
and the suite was never executed in CI. Adding the two-line target
(matching ace-step/Makefile) hooks `test.sh` → `runUnittests` →
`python -m unittest test.py`, which discovers both the pre-existing
engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the
new AntispoofingTest. Each class skips gracefully when its weights
can't be downloaded from a network-restricted runner.
Assisted-by: Claude:claude-opus-4-7
* test(insightface): exercise antispoofing in e2e-backends (both paths)
Add a `face_antispoof` capability to the Ginkgo e2e suite and extend
the existing FaceVerify + FaceAnalyze specs with liveness assertions
covering BOTH paths:
real fixture -> is_real=true, score>0, verified stays true
spoof fixture -> is_real=false, verified vetoed to false
The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo
mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to
classify as is_real=false with score ~0.013. That makes the assertion
deterministic across CI runs; synthetic/derived spoofs fool the model
unpredictably and would be flaky.
Makefile wires it up end-to-end:
- New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with
pinned SHAs, matching the gallery entries.
- insightface-antispoof-models target shared by both backend configs.
- FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL.
- Both e2e targets (buffalo-sc + opencv) now:
* depend on insightface-antispoof-models
* pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS
* include face_antispoof in BACKEND_TEST_CAPS
backend_test.go adds the new capability constant and a faceSpoofFile
fixture resolved the same way as faceFile1/2/3. Spoof assertions are
gated on both capFaceAntispoof AND faceSpoofFile being set, so a test
config that omits the spoof fixture degrades gracefully to "real path
only" instead of failing.
Assisted-by: Claude:claude-opus-4-7 go vet
2026-04-23 16:28:15 +00:00
bool img1_is_real = 9 ; // anti-spoofing result when enabled
float img1_antispoof_score = 10 ;
bool img2_is_real = 11 ;
float img2_antispoof_score = 12 ;
feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis
Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.
New gRPC RPCs (backend/backend.proto):
* FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
* FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse
Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.
New HTTP endpoints under /v1/face/:
* verify — 1:1 image pair same-person decision
* analyze — per-face age + gender (emotion/race reserved)
* register — 1:N enrollment; stores embedding in vector store
* identify — 1:N recognition; detect → embed → StoresFind
* forget — remove a registered face by opaque ID
Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.
Gallery (backend/index.yaml) ships three entries:
* insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage
(~326MB pre-baked; non-commercial research use only)
* insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0)
* insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial)
Python backend (backend/python/insightface/):
* engines.py — FaceEngine protocol with InsightFaceEngine and
OnnxDirectEngine; resolves model paths relative to the backend
directory so the same gallery config works in docker-scratch and
in the e2e-backends rootfs-extraction harness.
* backend.py — gRPC servicer implementing Health, LoadModel, Status,
Embedding, Detect, FaceVerify, FaceAnalyze.
* install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
backend directory so first-run is offline-clean (the final scratch
image only preserves files under /<backend>/).
* test.py — parametrized unit tests over both engines.
Tests:
* Registry unit tests (go test -race ./core/services/facerecognition/...)
— in-memory fake grpc.Backend, table-driven, covers register/
identify/forget/error paths + concurrent access.
* tests/e2e-backends/backend_test.go extended with face caps
(face_detect, face_embed, face_verify, face_analyze); relative
ordering + configurable verifyCeiling per engine.
* Makefile targets: test-extra-backend-insightface-buffalo-l,
-opencv, and the -all aggregate.
* CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
auto-triggered by changes under backend/python/insightface/.
Docs:
* docs/content/features/face-recognition.md — feature page with
license table, quickstart (defaults to the commercial-safe model),
models matrix, API reference, 1:N workflow, storage caveats.
* Cross-refs in object-detection.md, stores.md, embeddings.md, and
whats-new.md.
* Contributor README at backend/python/insightface/README.md.
Verified end-to-end:
* buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
face_verify, face_analyze).
* opencv: 5/5 specs (same minus face_analyze — SFace has no
demographic head; correctly skipped via BACKEND_TEST_CAPS).
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): move engine selection to model gallery, collapse backend entries
The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.
Correct split:
* `backend/index.yaml` now has ONE `insightface` backend entry
shipping the CPU + CUDA 12 container images. The Python backend
bundles both the non-commercial insightface model packs
(buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
weights (YuNet + SFace); the active engine is selected at
LoadModel time via `options: ["engine:..."]`.
* `gallery/index.yaml` gains three model entries —
`insightface-buffalo-l`, `insightface-opencv`,
`insightface-buffalo-s` — each setting the appropriate
`overrides.backend` + `overrides.options` so installing one
actually gives the user the intended engine. This matches how
`rfdetr-base` lives in the model gallery against the `rfdetr`
backend.
The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): cover all supported models in the gallery + drop weight baking
Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.
Gallery now has seven `insightface-*` model entries (gallery/index.yaml):
insightface (family) — non-commercial research use
• buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default
• buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage
• buffalo-s (159MB) — SCRFD-500MF + MBF + genderage
• buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only
(no landmarks, no demographics — analyze
returns empty attributes)
• antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage
OpenCV Zoo family — Apache 2.0 commercial-safe
• opencv — YuNet + SFace fp32 (~40MB)
• opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)
Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:
* OpenCV variants list `files:` with ONNX URIs + SHA-256, so
`local-ai models install insightface-opencv` pulls them into the
models directory exactly like any other gallery-managed model.
* insightface packs (upstream distributes .zip archives only, not
individual ONNX files) auto-download on first LoadModel via
FaceAnalysis' built-in machinery, rooted at the LocalAI models
directory so they live alongside everything else — same pattern
`rfdetr` uses with `inference.get_model()`.
Backend changes (backend/python/insightface/):
* backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
LocalAI models directory) to engines via a `_model_dir` hint.
This replaces the earlier ModelFile-dirname approach; ModelPath
is the canonical "models directory" variable set by the Go loader
(pkg/model/initializers.go:144) and is always populated.
* engines.py::_resolve_model_path — picks up `model_dir` and searches
it (plus basename-in-model-dir) before falling back to the dev
script-dir. This is how OnnxDirectEngine finds gallery-downloaded
YuNet/SFace files by filename only.
* engines.py::_flatten_insightface_pack — new helper that works
around an upstream packaging inconsistency: buffalo_l/s/sc zips
expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
files in a redundant `<name>/` directory. insightface's own
loader looks one level too shallow and fails. We call
`ensure_available()` explicitly, flatten if nested, then hand to
FaceAnalysis.
* engines.py::InsightFaceEngine.prepare — root-resolution order now
includes the `_model_dir` hint so packs download into the LocalAI
models directory by default.
* install.sh — no longer pre-downloads any weights. Everything is
gallery-managed now.
* smoke.py (new) — parametrized smoke test that iterates over every
gallery configuration, simulating the LocalAI install flow
(creates a models dir, fetches OpenCV files with checksum
verification, lets insightface auto-download its packs), then
runs detect + embed + verify (+ analyze where supported) through
the in-process BackendServicer.
* test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
paths; downloads ONNX files to a temp dir at setUpClass time and
passes ModelPath accordingly.
Registry change (core/services/facerecognition/store_registry.go):
* `dim=0` in NewStoreRegistry now means "accept whatever dimension
arrives" — needed because the backend supports 512-d ArcFace/MBF
and 128-d SFace via the same Registry. A non-zero dim still fails
fast with ErrDimensionMismatch.
* core/application plumbs `faceEmbeddingDim = 0`, explaining the
rationale in the comment.
Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.
Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:
PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000
PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000
PASS: insightface-opencv faces=6 dim=128 same-dist=0.000
PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000
7/7 passed
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim
CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:
LoadModel failed: Failed to load face engine:
OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx
Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).
Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs
The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.
Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).
Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).
Response:
{
"embedding": [<dim> floats, L2-normed],
"dim": int, // 512 for ArcFace R50 / MBF, 128 for SFace
"model": "<name>"
}
Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.
Assisted-by: Claude:claude-opus-4-7
* fix(http): map malformed image input + gRPC status codes to proper 4xx
Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.
Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:
* decodeImageInput(s)
Wraps utils.GetContentURIAsBase64 and turns any failure
(invalid URL, not a data-URI, download error, etc.) into
echo.NewHTTPError(400, "invalid image input: ...").
* mapBackendError(err)
Inspects the gRPC status on a backend call error and maps:
INVALID_ARGUMENT → 400 Bad Request
NOT_FOUND → 404 Not Found
FAILED_PRECONDITION → 412 Precondition Failed
Unimplemented → 501 Not Implemented
All other codes fall through unchanged (still 500).
Before, my 1×1 PNG error-path test returned:
HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
HTTP 400 "failed to decode one or both images"
Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.
Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.
Assisted-by: Claude:claude-opus-4-7
* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis
Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.
Two changes:
1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
`files:` list with the pack zip's URI + SHA-256. `local-ai models
install insightface-buffalo-l` now fetches the zip, verifies the
hash, and extracts it into the models directory. No more reliance
on insightface's library-internal `ensure_available()` auto-download
or its hardcoded `BASE_REPO_URL`.
2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
the FaceAnalysis wrapper and drives insightface's `model_zoo`
directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
route each through `model_zoo.get_model()`, build a
`{taskname: model}` dict, loop per-face at inference — are
reimplemented in `InsightFaceEngine`. The actual inference classes
(RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
insightface's — we only replicate the glue, so drift risk against
upstream is minimal.
Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
layout that doesn't match what LocalAI's zip extraction produces.
LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
`<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
`_locate_insightface_pack` helper searches both locations plus
legacy paths and returns whichever has ONNX files. Replaces the
earlier `_flatten_insightface_pack` helper (which tried to fight
FaceAnalysis's layout expectations; now we just find the files
wherever they are).
Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched
The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".
Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.
Face analyze cap dropped since buffalo_sc has no age/gender head.
Assisted-by: Claude:claude-opus-4-7[1m]
* feat(face-recognition): surface face-recognition in advertised feature maps
The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:
* api_instructions — the machine-readable capability index at
GET /api/instructions. Added `face-recognition` as a dedicated
instruction area with an intro that calls out the in-memory
registry caveat and the /v1/face/embed vs /v1/embeddings split.
* auth/permissions — added FeatureFaceRecognition constant, routed
all six face endpoints through it so admins can gate them per-user
like any other API feature. Default ON (matches the other API
features).
* React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
follow-up (noted in the plan).
Instruction count bumped 9 → 10; test updated.
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(agents): capture advertising-surface steps in the endpoint guide
Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.
Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.
Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 19:55:41 +00:00
}
message FaceAnalyzeRequest {
string img = 1 ; // base64-encoded image
repeated string actions = 2 ; // subset of ["age","gender","emotion","race"]; empty = all-supported
bool anti_spoofing = 3 ;
}
message FaceAnalysis {
FacialArea region = 1 ;
float face_confidence = 2 ;
float age = 3 ;
string dominant_gender = 4 ; // "Man" | "Woman"
map < string , float > gender = 5 ;
string dominant_emotion = 6 ; // reserved; empty in MVP
map < string , float > emotion = 7 ;
string dominant_race = 8 ; // not populated
map < string , float > race = 9 ;
bool is_real = 10 ; // anti-spoofing result when enabled
float antispoof_score = 11 ;
}
message FaceAnalyzeResponse {
repeated FaceAnalysis faces = 1 ;
}
feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend
Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).
Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): add 1:N identify + register/forget endpoints
Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).
Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).
As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): gallery, docs, CI and e2e coverage
- backend/index.yaml: speaker-recognition backend entry + CPU and
CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
deliberate placeholder — the HF URI must be curl'd and its hash
filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
Helper resolveFaceFixture is reused as-is — the only thing face/voice
share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
speaker-recognition-{ecapa,all} targets. Audio fixtures default to
VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
build entries — scripts/changed-backends.js auto-picks these up.
Assisted-by: Claude:claude-opus-4-7
* feat(voice-recognition): wire a working /v1/voice/analyze head
Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.
Defaults to two open-licence HuggingFace checkpoints:
- audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
age regression + 3-way gender (female / male / child).
- superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.
Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.
Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.
Adds transformers to the backend's CPU and CUDA-12 requirements.
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256
Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): soundfile loader + honest analyze default
Two issues surfaced on first end-to-end smoke with the actual backend
image:
1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
for audio decoding. Switch SpeechBrainEngine._load_waveform to the
already-present soundfile (listed in requirements.txt) plus a numpy
linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
codepath we never exercise (torchaudio's ffmpeg backend).
2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
24-ft-age-gender, but AutoModelForAudioClassification silently
mangles that checkpoint — it reports the age head weights as
UNEXPECTED and re-initialises the classifier head with random
values, so the "gender" output is noise and there is no age output
at all. Make age/gender opt-in instead (empty default; users wire
a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
age_gender_model: option). Emotion keeps its working Superb default.
Also broaden _infer_age_gender's tensor-shape handling and catch
runtime exceptions so a dodgy age/gender head never takes down the
whole analyze call.
Docs and README updated to match the new policy.
Verified with the branch-scoped gallery on localhost:
- voice/embed → 192-d ECAPA-TDNN vector
- voice/verify → same-clip dist≈6e-08 verified=true; cross-speaker
dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze → emotion populated, age/gender omitted (opt-in)
Assisted-by: Claude:claude-opus-4-7
* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec
Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):
1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
(the dataset is gated). Swap them for the speechbrain test samples
served from github.com/speechbrain/speechbrain/raw/develop/ —
public, no auth, correct 16kHz mono format.
2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
file1/file2 were same-speaker. The speechbrain samples are three
different speakers (example1/2/5), and there is no easy un-gated
source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
are all license- or size-gated for CI use). Replace the ceiling
check with a relative-ordering assertion: d(pair) > d(same-clip)
for both file2 and file3 — that's enough to prove the embeddings
encode speaker info, and it works with any three non-identical
clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
asserted.
Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.
Assisted-by: Claude:claude-opus-4-7
* fix(ci): checkout with submodules in the reusable backend_build workflow
The kokoros Rust backend build fails with
failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file
because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.
The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.
Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.
Assisted-by: Claude:claude-opus-4-7
2026-04-23 10:07:14 +00:00
// --- Voice (speaker) recognition messages ---
//
// Analogous to the Face* messages above, but for speaker biometrics.
// Audio fields accept a filesystem path (same convention as
// TranscriptRequest.dst). The HTTP layer materialises base64 / URL /
// data-URI inputs to a temp file before calling the gRPC backend.
message VoiceVerifyRequest {
string audio1 = 1 ; // path to first audio clip
string audio2 = 2 ; // path to second audio clip
float threshold = 3 ; // cosine-distance threshold; 0 = use backend default
bool anti_spoofing = 4 ; // reserved for future AASIST bolt-on
}
message VoiceVerifyResponse {
bool verified = 1 ;
float distance = 2 ; // 1 - cosine_similarity
float threshold = 3 ;
float confidence = 4 ; // 0-100
string model = 5 ; // e.g. "speechbrain/spkrec-ecapa-voxceleb"
float processing_time_ms = 6 ;
}
message VoiceAnalyzeRequest {
string audio = 1 ; // path to audio clip
repeated string actions = 2 ; // subset of ["age","gender","emotion"]; empty = all-supported
}
message VoiceAnalysis {
float start = 1 ; // segment start time in seconds (0 if single-utterance)
float end = 2 ; // segment end time in seconds
float age = 3 ;
string dominant_gender = 4 ;
map < string , float > gender = 5 ;
string dominant_emotion = 6 ;
map < string , float > emotion = 7 ;
}
message VoiceAnalyzeResponse {
repeated VoiceAnalysis segments = 1 ;
}
message VoiceEmbedRequest {
string audio = 1 ; // path to audio clip
}
message VoiceEmbedResponse {
repeated float embedding = 1 ;
string model = 2 ;
}
2026-03-08 21:21:57 +00:00
message ToolFormatMarkers {
string format_type = 1 ; // "json_native", "tag_with_json", "tag_with_tagged"
// Tool section markers
string section_start = 2 ; // e.g., "<tool_call>", "[TOOL_CALLS]"
string section_end = 3 ; // e.g., "</tool_call>"
string per_call_start = 4 ; // e.g., "<|tool_call_begin|>"
string per_call_end = 5 ; // e.g., "<|tool_call_end|>"
// Function name markers (TAG_WITH_JSON / TAG_WITH_TAGGED)
string func_name_prefix = 6 ; // e.g., "<function="
string func_name_suffix = 7 ; // e.g., ">"
string func_close = 8 ; // e.g., "</function>"
// Argument markers (TAG_WITH_TAGGED)
string arg_name_prefix = 9 ; // e.g., "<param="
string arg_name_suffix = 10 ; // e.g., ">"
string arg_value_prefix = 11 ;
string arg_value_suffix = 12 ; // e.g., "</param>"
string arg_separator = 13 ; // e.g., "\n"
// JSON format fields (JSON_NATIVE)
string name_field = 14 ; // e.g., "name"
string args_field = 15 ; // e.g., "arguments"
string id_field = 16 ; // e.g., "id"
bool fun_name_is_key = 17 ;
bool tools_array_wrapped = 18 ;
2026-03-20 07:12:21 +00:00
reserved 19 ;
2026-03-08 21:21:57 +00:00
// Reasoning markers
string reasoning_start = 20 ; // e.g., "<think>"
string reasoning_end = 21 ; // e.g., "</think>"
// Content markers
string content_start = 22 ;
string content_end = 23 ;
// Args wrapper markers
string args_start = 24 ; // e.g., "<args>"
string args_end = 25 ; // e.g., "</args>"
// JSON parameter ordering
string function_field = 26 ; // e.g., "function" (wrapper key in JSON)
repeated string parameter_order = 27 ;
// Generated ID field (alternative field name for generated IDs)
string gen_id_field = 28 ; // e.g., "call_id"
// Call ID markers (position and delimiters for tool call IDs)
string call_id_position = 29 ; // "none", "pre_func_name", "between_func_and_args", "post_args"
string call_id_prefix = 30 ; // e.g., "[CALL_ID]"
string call_id_suffix = 31 ; // e.g., ""
}
2026-03-13 20:37:15 +00:00
message AudioEncodeRequest {
bytes pcm_data = 1 ;
int32 sample_rate = 2 ;
int32 channels = 3 ;
map < string , string > options = 4 ;
}
message AudioEncodeResult {
repeated bytes frames = 1 ;
int32 sample_rate = 2 ;
int32 samples_per_frame = 3 ;
}
message AudioDecodeRequest {
repeated bytes frames = 1 ;
map < string , string > options = 2 ;
}
message AudioDecodeResult {
bytes pcm_data = 1 ;
int32 sample_rate = 2 ;
int32 samples_per_frame = 3 ;
}
feat: add LocalVQE backend and audio transformations UI (#9640)
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI
Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.
Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
bidirectional AudioTransformStream for low-latency frame-by-frame use.
This is the first bidi stream in the proto; per-frame unary at LocalVQE's
16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
embed,interface,base} with paired-channel ergonomics.
LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
wrapper needed because LocalVQE handles CPU feature selection internally
via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
LocalVQE runs single-threaded at ~1× realtime instead of the documented
~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).
HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
endpoint, accepts audio + optional reference + params[*]=v form fields.
Persists inputs alongside the output in GeneratedContentDir/audio so the
React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
(interleaved stereo mic+ref in, mono out). JSON session.update envelope
for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
utils.AudioToWav (with passthrough fast-path), so the user can upload any
format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
no /audio/transformations endpoint), traced via TraceMiddleware.
Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
huggingface.co/LocalAI-io/LocalVQE).
React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
section (sibling of Tools / Biometrics) — the page is enhancement, not
generation, so it lives outside Studio. Two AudioInput components
(Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
the speakers — the mic naturally picks up speaker bleed, giving a real
(mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
and useAudioPeaks hook (shared module-scoped AudioContext to avoid
hitting browser context limits with three players on one page); migrated
TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
the history entry stores all three URLs so clicking re-renders the full
triple, not just the output.
Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
two and let GPU-class hardware route through Vulkan in the gallery
capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
(8 specs covering tabs, file upload, multipart shape, history, errors).
Docs:
- New docs/content/features/audio-transform.md covering the (audio,
reference) mental model, batch + WebSocket wire formats, LocalVQE param
keys, and a YAML config example. Cross-links from text-to-audio and
audio-to-text feature pages.
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-04 20:07:11 +00:00
// Generic audio transform: an audio-in, audio-out operation, optionally
// conditioned on a second reference signal. Concrete transforms include
// AEC + noise suppression + dereverberation (LocalVQE), voice conversion
// (reference = target speaker), pitch shifting, etc.
message AudioTransformRequest {
string audio_path = 1 ; // required, primary input file path
string reference_path = 2 ; // optional auxiliary; empty => zero-fill
string dst = 3 ; // required, output file path
map < string , string > params = 4 ; // backend-specific tuning
}
message AudioTransformResult {
string dst = 1 ;
int32 sample_rate = 2 ;
int32 samples = 3 ;
bool reference_provided = 4 ;
}
// Bidirectional streaming audio transform. The first message MUST carry a
// Config; subsequent messages carry Frames. A second Config mid-stream
// resets streaming state before the next frame.
message AudioTransformFrameRequest {
oneof payload {
AudioTransformStreamConfig config = 1 ;
AudioTransformFrame frame = 2 ;
}
}
message AudioTransformStreamConfig {
enum SampleFormat {
F32_LE = 0 ;
S16_LE = 1 ;
}
SampleFormat sample_format = 1 ;
int32 sample_rate = 2 ; // 0 => backend default
int32 frame_samples = 3 ; // 0 => backend default
map < string , string > params = 4 ;
bool reset = 5 ; // reset streaming state before next frame
}
message AudioTransformFrame {
bytes audio_pcm = 1 ; // frame_samples samples in stream's format
bytes reference_pcm = 2 ; // empty => zero-fill (silent reference)
}
message AudioTransformFrameResponse {
bytes pcm = 1 ;
int64 frame_index = 2 ;
}
feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)
* feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase
Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model:
single engine handles VAD, transcription, LLM, and TTS in one bidirectional
stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline.
Backend
- backend/python/liquid-audio/ — new Python gRPC backend wrapping the
`liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets,
Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/
Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch
on `liquid_audio.utils.snapshot_download` so absolute local paths from
LocalAI's gallery resolve without a HF round-trip. soundfile in place of
torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle).
- backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,
interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream
(config/frame/control oneof in; typed event+pcm+meta out).
- core/services/nodes/{health_mock,inflight}_test.go — add stubs for the
new RPC to the test fakes.
Config + capabilities
- core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudio
ToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row.
- core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups
membership in both speech-input and audio-output groups so a lone flag
still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases
branch.
Realtime endpoint
- core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig()
so the gate is unit-testable; accept realtime_audio models and self-fill
empty pipeline slots with the model's own name (user-pinned slots win).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil
cfg, empty pipeline, legacy pipeline, self-contained realtime_audio,
user-pinned VAD slot, and partial legacy pipeline.
UI + endpoints
- core/http/routes/ui.go — /api/pipeline-models accepts either a legacy
VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a
self_contained flag so the Talk page can collapse the four cards.
- core/http/routes/ui_api.go — realtime_audio in usecaseFilters.
- core/http/routes/ui_pipeline_models_test.go — covers both code paths.
- core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of
the four-slot grid; rename Edit Pipeline → Edit Model Config; less
pipeline-specific wording.
- core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new
realtime_audio filter button + i18n.
- core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO.
- core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset
fields, surfaced when backend === liquid-audio, plumbed via
extra_options on submit/export/import.
Gallery + importer
- gallery/liquid-audio.yaml — config template with known_usecases:
[realtime_audio, chat, tts, transcript, vad].
- gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by
mode option. Fixed pre-existing `transcribe` typo on the asr entry
(loader silently dropped the unknown string → entry never surfaced as a
transcript model).
- gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format
`<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching
common_chat_params_init_lfm2 in vendored llama.cpp.
- core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector
matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice
preferences plumbed through to options.
- core/gallery/importers/importers.go — register LiquidAudioImporter
before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument
regex pair on the LFM2 pythonic format.
Build matrix
- .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13,
l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12
is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's
3.12 floor).
- backend/index.yaml — anchor + 13 image entries.
- Makefile — .NOTPARALLEL, prepare-test-extra, test-extra,
docker-build-liquid-audio.
Docs
- .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real
any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call
detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands)
remain.
- .agents/api-endpoints-and-auth.md — expand the capability-surface
checklist with every place a new FLAG_* needs to be registered.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): function calling + history cap for any-to-any models
Three pieces, all on the realtime_audio path that just landed:
1. liquid-audio backend (backend/python/liquid-audio/backend.py):
- _build_chat_state grows a `tools_prelude` arg.
- new _render_tools_prelude parses request.Tools (the OpenAI Chat
Completions function array realtime.go already serialises) and
emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn
ahead of the user history. Mirrors gallery/lfm.yaml's `function:`
template so the model sees the same prompt shape whether served
via llama-cpp or here. Without this the backend silently dropped
tools — function calling was wired end-to-end on the Go side but
the model never saw a tool list.
2. Realtime history cap (core/http/endpoints/openai/realtime.go):
- Session grows MaxHistoryItems int; default picked by new
defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5
1.5B degrades quickly past a handful of turns), 0/unlimited for
legacy pipelines composing larger LLMs.
- triggerResponse runs conv.Items through trimRealtimeItems before
building conversationHistory. Helper walks the cut left if it
would orphan a function_call_output, so tool result + call pairs
stay intact.
- realtime_gate_test.go: specs for defaultMaxHistoryItems and
trimRealtimeItems (zero cap, under cap, over cap, tool-call pair
preservation).
3. Talk page (core/http/react-ui/src/pages/Talk.jsx):
- Reuses the chat page's MCP plumbing — useMCPClient hook,
ClientMCPDropdown component, same auto-connect/disconnect effect
pattern. No bespoke tool registry, no new REST endpoints; tools
come from whichever MCP servers the user toggles on, exactly as
on the chat page.
- sendSessionUpdate now passes session.tools=getToolsForLLM(); the
update re-fires when the active server set changes mid-session.
- New response.function_call_arguments.done handler executes via
the hook's executeTool (which round-trips through the MCP client
SDK), then replies with conversation.item.create
{type:function_call_output} + response.create so the model
completes its turn with the tool output. Mirrors chat's
client-side agentic loop, translated to the realtime wire shape.
UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes
react-ui/dist into the runtime image). Backend.py changes can be
swapped live in /backends/<id>/backend.py + /backend/shutdown.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page
Mirrors the chat-page metadata.localai_assistant flow so users can ask the
realtime model what's loaded / installed / configured. Tools are run
server-side via the same in-process MCP holder that powers the chat
modality — no transport switch, no proxy, no new wire protocol.
Wire:
- core/http/endpoints/openai/realtime.go:
- RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin
helper mirrors chat.go's requireAssistantAccess (no-op when auth
disabled, else requires auth.RoleAdmin).
- Session grows AssistantExecutor mcpTools.ToolExecutor.
- runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin,
fail closed if DisableLocalAIAssistant or the holder has no tools,
DiscoverTools and inject into session.Tools, prepend
holder.SystemPrompt() to instructions.
- Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run
ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip
the function_call_arguments client emit (the client can't execute
these — it doesn't know about them). After the loop, if any
assistant tool ran, trigger another response so the model speaks the
result. Mirrors chat's agentic loop, driven server-side rather than
via client round-trip.
- core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest
gains `localai_assistant` (JSON omitempty). Handshake calls
isCurrentUserAdmin and builds RealtimeSessionOptions.
- core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode"
checkbox under the Tools dropdown; passes localai_assistant: true to
realtimeApi.call's body, captured in the connect callback's deps.
Mirroring chat's pattern means the in-process MCP tools surface "just
works" for the Talk page without exposing a Streamable-HTTP MCP endpoint
(which was the alternative). Clients with their own MCP servers can
still use the existing ClientMCPDropdown path in parallel; the realtime
handler distinguishes them by AssistantExecutor.IsTool() at dispatch
time.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): render Manage Mode tool calls in the Talk transcript
Previously the realtime endpoint only emitted response.output_item.added
for the FunctionCall item, and Talk.jsx's switch ignored the event — so
server-side tool runs were invisible in the UI. The model would speak
the result but the user had no way to see what tool was actually
called.
realtime.go: after executing an assistant tool inproc, emit a second
output_item.added/.done pair for the FunctionCallOutput item. Mirrors
the way the chat page displays tool_call + tool_result blocks.
Talk.jsx: handle both response.output_item.added and .done. Render
FunctionCall (with arguments) and FunctionCallOutput (pretty-printed
JSON when possible) as two transcript entries — `tool_call` with the
wrench icon, `tool_result` with the clipboard icon, both in mono-space
secondary-colour. Resets streamingRef after the result so the next
assistant text delta starts a fresh transcript entry instead of
appending to the previous turn.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools
Fallout from a review pass on the Manage Mode patches:
- Bound the server-side agentic loop. triggerResponse used to recurse on
executedAssistantTool with no cap — a model that kept calling tools
would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors
useChat.js's maxToolTurns). Public triggerResponse is now a thin shim
over triggerResponseAtTurn(toolTurn int); recursion increments the
counter and stops at the cap with an xlog.Warn.
- Preserve Manage Mode tools across client session.update. The handler
used to blindly overwrite session.Tools, so toggling a client MCP
server mid-session silently wiped the in-process admin tools. Session
now caches the original AssistantTools slice at session creation and
the session.update handler merges them back in (client names win on
collision — the client is explicit).
- strconv.ParseBool for the localai_assistant query param instead of
hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata.
- Talk.jsx: render both tool_call and tool_result on
response.output_item.done instead of splitting them across .added and
.done. The server's event pairing (added → done) stays correct; the
UI just doesn't need to inspect both phases of the same item. One
switch case instead of two, no behavioural change.
Out of scope (noted for follow-ups): extract a shared assistant-tools
helper between chat.go and realtime.go (duplication is small enough
that two parallel implementations stay readable for now), and an i18n
key for the Manage Mode helper text (Talk.jsx doesn't use i18n
anywhere else yet).
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(test-extra): wire liquid-audio backend smoke test
The backend ships test.py + a `make test` target and is listed in
backend-matrix.yml, so scripts/changed-backends.js already writes a
`liquid-audio=true|false` output when files under backend/python/liquid-audio/
change. The workflow just wasn't reading it.
- Expose the `liquid-audio` output on the detect-changes job
- Add a tests-liquid-audio job that runs `make` + `make test` in
backend/python/liquid-audio, gated on the per-backend detect flag
The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode
short-circuits before any HuggingFace download (backend.py:192), so the
job needs neither weights nor a GPU. The full-inference path remains
gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set.
The four new Go test files (core/gallery/importers/liquid-audio_test.go,
core/http/endpoints/openai/realtime_gate_test.go,
core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go)
are already picked up by the existing test.yml workflow via `make test` →
`ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-13 19:57:27 +00:00
// === AudioToAudioStream messages =========================================
//
// Bidirectional stream between the LocalAI core and an any-to-any audio
// model. The client opens the stream with a Config payload, then alternates
// Frame (input audio) and Control (turn boundaries, function-call results,
// session updates) payloads. The server streams back typed events: audio
// frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
// `meta`; the stream ends with a `response.done` (success) or `error` event.
message AudioToAudioRequest {
oneof payload {
AudioToAudioConfig config = 1 ;
AudioToAudioFrame frame = 2 ;
AudioToAudioControl control = 3 ;
}
}
message AudioToAudioConfig {
// PCM format for client→server audio. 0 => backend default
// (16 kHz for the LFM2-Audio Conformer encoder).
int32 input_sample_rate = 1 ;
// Preferred server→client audio rate. 0 => backend default
// (24 kHz for the LFM2-Audio vocoder).
int32 output_sample_rate = 2 ;
// Optional system prompt override. Empty => backend chooses based on
// mode (e.g. "Respond with interleaved text and audio.").
string system_prompt = 3 ;
// Optional baked-voice id. Models that only ship a fixed set of
// voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
// this against their voice table; an empty string keeps the default.
string voice = 4 ;
// JSON-encoded array of tool definitions in OpenAI Chat Completions
// format. Empty => no tools.
string tools = 5 ;
// Free-form sampling / decoding parameters (temperature, top_k,
// max_new_tokens, audio_top_k, etc).
map < string , string > params = 6 ;
// True => reset any session-scoped state before processing further
// frames on this stream. The first Config implicitly resets.
bool reset = 7 ;
}
message AudioToAudioFrame {
// Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
// is a valid "user finished speaking" marker without trailing audio.
bytes pcm = 1 ;
// Marks the last frame of a user turn. The backend may begin emitting
// a response immediately after seeing this.
bool end_of_input = 2 ;
}
message AudioToAudioControl {
// Free-form control event names. Initial set:
// "input_audio_buffer.commit" — user finished speaking
// "response.cancel" — abort in-flight generation
// "conversation.item.create" — inject a non-audio item (e.g.
// function_call_output as JSON in
// `payload`)
// "session.update" — re-configure mid-stream
string event = 1 ;
// Event-specific JSON payload.
bytes payload = 2 ;
}
message AudioToAudioResponse {
// Event identifies what this frame carries. Mirrors the OpenAI Realtime
// API server-event names where applicable. Initial set:
// "response.audio.delta"
// "response.audio_transcript.delta"
// "response.function_call_arguments.delta"
// "response.function_call_arguments.done"
// "response.done"
// "error"
string event = 1 ;
// Populated when event = response.audio.delta.
bytes pcm = 2 ;
// Populated alongside pcm to identify its rate. 0 => same as the
// session's negotiated output_sample_rate.
int32 sample_rate = 3 ;
// JSON payload for non-PCM events (transcript chunk, tool args, error
// body).
bytes meta = 4 ;
// Monotonic per-stream counter, useful for client reordering and
// debugging.
int64 sequence = 5 ;
}
2026-01-22 23:38:28 +00:00
message ModelMetadataResponse {
bool supports_thinking = 1 ;
string rendered_template = 2 ; // The rendered chat template with enable_thinking=true (empty if not applicable)
2026-03-08 21:21:57 +00:00
ToolFormatMarkers tool_format = 3 ; // Auto-detected tool format markers from differential template analysis
2026-04-18 18:30:13 +00:00
string media_marker = 4 ; // Marker the backend expects in the prompt for each multimodal input (images/audio/video). Empty when the backend does not use a marker.
2026-01-22 23:38:28 +00:00
}
2026-03-21 01:08:02 +00:00
// Fine-tuning messages
message FineTuneRequest {
// Model identification
string model = 1 ; // HF model name or local path
string training_type = 2 ; // "lora", "loha", "lokr", "full" — what parameters to train
string training_method = 3 ; // "sft", "dpo", "grpo", "rloo", "reward", "kto", "orpo", "network_training"
// Adapter config (universal across LoRA/LoHa/LoKr for LLM + diffusion)
int32 adapter_rank = 10 ; // LoRA rank (r), default 16
int32 adapter_alpha = 11 ; // scaling factor, default 16
float adapter_dropout = 12 ; // default 0.0
repeated string target_modules = 13 ; // layer names to adapt
// Universal training hyperparameters
float learning_rate = 20 ; // default 2e-4
int32 num_epochs = 21 ; // default 3
int32 batch_size = 22 ; // default 2
int32 gradient_accumulation_steps = 23 ; // default 4
int32 warmup_steps = 24 ; // default 5
int32 max_steps = 25 ; // 0 = use epochs
int32 save_steps = 26 ; // 0 = only save final
float weight_decay = 27 ; // default 0.01
bool gradient_checkpointing = 28 ;
string optimizer = 29 ; // adamw_8bit, adamw, sgd, adafactor, prodigy
int32 seed = 30 ; // default 3407
string mixed_precision = 31 ; // fp16, bf16, fp8, no
// Dataset
string dataset_source = 40 ; // HF dataset ID, local file/dir path
string dataset_split = 41 ; // train, test, etc.
// Output
string output_dir = 50 ;
string job_id = 51 ; // client-assigned or auto-generated
// Resume training from a checkpoint
string resume_from_checkpoint = 55 ; // path to checkpoint dir to resume from
// Backend-specific AND method-specific extensibility
map < string , string > extra_options = 60 ;
}
message FineTuneJobResult {
string job_id = 1 ;
bool success = 2 ;
string message = 3 ;
}
message FineTuneProgressRequest {
string job_id = 1 ;
}
message FineTuneProgressUpdate {
string job_id = 1 ;
int32 current_step = 2 ;
int32 total_steps = 3 ;
float current_epoch = 4 ;
float total_epochs = 5 ;
float loss = 6 ;
float learning_rate = 7 ;
float grad_norm = 8 ;
float eval_loss = 9 ;
float eta_seconds = 10 ;
float progress_percent = 11 ;
string status = 12 ; // queued, caching, loading_model, loading_dataset, training, saving, completed, failed, stopped
string message = 13 ;
string checkpoint_path = 14 ; // set when a checkpoint is saved
string sample_path = 15 ; // set when a sample is generated (video/image backends)
map < string , float > extra_metrics = 16 ; // method-specific metrics
}
message FineTuneStopRequest {
string job_id = 1 ;
bool save_checkpoint = 2 ;
}
message ListCheckpointsRequest {
string output_dir = 1 ;
}
message ListCheckpointsResponse {
repeated CheckpointInfo checkpoints = 1 ;
}
message CheckpointInfo {
string path = 1 ;
int32 step = 2 ;
float epoch = 3 ;
float loss = 4 ;
string created_at = 5 ;
}
message ExportModelRequest {
string checkpoint_path = 1 ;
string output_path = 2 ;
string export_format = 3 ; // lora, loha, lokr, merged_16bit, merged_4bit, gguf, diffusers
string quantization_method = 4 ; // for GGUF: q4_k_m, q5_k_m, q8_0, f16, etc.
string model = 5 ; // base model name (for merge operations)
map < string , string > extra_options = 6 ;
}
2026-03-21 23:56:34 +00:00
// Quantization messages
message QuantizationRequest {
string model = 1 ; // HF model name or local path
string quantization_type = 2 ; // q4_k_m, q5_k_m, q8_0, f16, etc.
string output_dir = 3 ; // where to write output files
string job_id = 4 ; // client-assigned job ID
map < string , string > extra_options = 5 ; // hf_token, custom flags, etc.
}
message QuantizationJobResult {
string job_id = 1 ;
bool success = 2 ;
string message = 3 ;
}
message QuantizationProgressRequest {
string job_id = 1 ;
}
message QuantizationProgressUpdate {
string job_id = 1 ;
float progress_percent = 2 ;
string status = 3 ; // queued, downloading, converting, quantizing, completed, failed, stopped
string message = 4 ;
string output_file = 5 ; // set when completed — path to the output GGUF file
map < string , float > extra_metrics = 6 ; // e.g. file_size_mb, compression_ratio
}
message QuantizationStopRequest {
string job_id = 1 ;
}
2026-03-29 22:47:27 +00:00