+++
disableToc = false
title = "Speaker Diarization"
weight = 17
url = "/features/audio-diarization/"
+++

Speaker diarization answers the question **"who spoke when?"** — given an audio clip with multiple speakers, it returns time-stamped segments labelled with a stable speaker ID (`SPEAKER_00`, `SPEAKER_01`, …).

LocalAI exposes this through the `/v1/audio/diarization` endpoint, modelled after `/v1/audio/transcriptions`. Two backends are supported today:

- **[sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)** — pyannote-3.0 segmentation + a speaker-embedding extractor (3D-Speaker, NeMo, WeSpeaker) + fast clustering. Pure diarization — no transcription cost. Recommended when you only need speaker turns.
- **[vibevoice.cpp](https://github.com/microsoft/VibeVoice)** — produces speaker-labelled segments as a by-product of its long-form ASR pass, so you can optionally get a transcript per segment for free.

Because diarization is exposed as a regular OpenAI-compatible endpoint, any HTTP client works. There is no Python dependency on pyannote or NeMo on the consumer side.

## Endpoint

```
POST /v1/audio/diarization
Content-Type: multipart/form-data
```

| Field | Type | Description |
|-------|------|-------------|
| `file` | file (required) | audio file in any format `ffmpeg` accepts |
| `model` | string (required) | name of the diarization-capable model |
| `num_speakers` | int | exact speaker count when known (a value > 0 forces that count; 0 auto-detects) |
| `min_speakers` | int | hint when auto-detecting |
| `max_speakers` | int | hint when auto-detecting |
| `clustering_threshold` | float | cosine distance threshold used when `num_speakers` is unknown |
| `min_duration_on` | float | discard segments shorter than this many seconds |
| `min_duration_off` | float | merge gaps shorter than this many seconds |
| `language` | string | only meaningful for backends that bundle ASR (e.g. vibevoice) |
| `include_text` | bool | populate per-segment `text` when the backend produces a transcript in the same pass (e.g. vibevoice) |
| `response_format` | string | `json` (default), `verbose_json`, or `rttm` |
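
As a rough illustration of what `min_duration_on` and `min_duration_off` mean, the Python sketch below applies both rules to a list of speaker turns. It is not the backend's implementation: the `smooth` helper, the tuple layout, and the order of operations (merge short gaps first, then drop short segments) are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical layout for this sketch: (speaker, start, end), times in seconds.
Segment = Tuple[str, float, float]


def smooth(segments: List[Segment],
           min_duration_on: float,
           min_duration_off: float) -> List[Segment]:
    """Merge same-speaker gaps shorter than min_duration_off,
    then discard segments shorter than min_duration_on."""
    merged: List[Segment] = []
    for speaker, start, end in sorted(segments, key=lambda s: s[1]):
        if merged:
            prev_speaker, prev_start, prev_end = merged[-1]
            # A short pause by the same speaker is treated as one turn.
            if prev_speaker == speaker and start - prev_end < min_duration_off:
                merged[-1] = (speaker, prev_start, max(prev_end, end))
                continue
        merged.append((speaker, start, end))
    # Drop turns that are too short to be kept.
    return [s for s in merged if s[2] - s[1] >= min_duration_on]
```

Larger `min_duration_off` values produce fewer, longer turns; larger `min_duration_on` values remove brief interjections.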

### Response — `json` (default)

Compact payload, no transcription, no per-speaker summary:

```json
{
  "task": "diarize",
  "duration": 12.34,
  "num_speakers": 2,
  "segments": [
    {"id": 0, "speaker": "SPEAKER_00", "label": "0", "start": 0.00, "end": 2.34},
    {"id": 1, "speaker": "SPEAKER_01", "label": "1", "start": 2.34, "end": 4.10}
  ]
}
```

`speaker` is the normalized, zero-padded label clients should display. `label` preserves the raw backend-emitted ID for clients that maintain their own speaker dictionary.
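
A client can consume this payload with a few lines of code. The sketch below renders one line per segment and, when the caller supplies its own dictionary keyed by the raw `label`, substitutes a display name. The `timeline` helper and the `names` mapping are hypothetical and only illustrate how the two fields might be used together.

```python
from typing import Dict, List, Optional


def timeline(payload: dict, names: Optional[Dict[str, str]] = None) -> List[str]:
    """Return one formatted line per segment of a `json` diarization response."""
    names = names or {}
    lines = []
    for seg in payload["segments"]:
        # Prefer the client's own name for the raw label; fall back to SPEAKER_NN.
        who = names.get(seg["label"], seg["speaker"])
        lines.append(f'{seg["start"]:7.2f} - {seg["end"]:7.2f}  {who}')
    return lines


sample = {
    "segments": [
        {"id": 0, "speaker": "SPEAKER_00", "label": "0", "start": 0.00, "end": 2.34},
        {"id": 1, "speaker": "SPEAKER_01", "label": "1", "start": 2.34, "end": 4.10},
    ]
}

for line in timeline(sample, names={"0": "Alice"}):
    print(line)
```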

### Response — `verbose_json`

Adds per-speaker totals and (when the backend supports it and `include_text=true`) the per-segment transcript:

```json
{
  "task": "diarize",
  "duration": 12.34,
  "language": "en",
  "num_speakers": 2,
  "segments": [
    {"id": 0, "speaker": "SPEAKER_00", "label": "0", "start": 0.00, "end": 2.34, "text": "Hello, world."},
    {"id": 1, "speaker": "SPEAKER_01", "label": "1", "start": 2.34, "end": 4.10, "text": "How are you?"}
  ],
  "speakers": [
    {"id": "SPEAKER_00", "label": "0", "total_speech_duration": 2.34, "segment_count": 1},
    {"id": "SPEAKER_01", "label": "1", "total_speech_duration": 1.76, "segment_count": 1}
  ]
}
```
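
When only the compact `json` format is requested, a similar per-speaker summary can be derived on the client. The sketch below assumes that a speaker's total is the sum of `end - start` over that speaker's segments; the `summarize` helper is hypothetical and may not match the server's rounding.

```python
from typing import Dict, List


def summarize(segments: List[dict]) -> List[dict]:
    """Build per-speaker totals from a list of diarization segments."""
    totals: Dict[str, dict] = {}
    for seg in segments:
        entry = totals.setdefault(seg["speaker"], {
            "id": seg["speaker"],
            "label": seg["label"],
            "total_speech_duration": 0.0,
            "segment_count": 0,
        })
        entry["total_speech_duration"] += seg["end"] - seg["start"]
        entry["segment_count"] += 1
    # dicts preserve insertion order, so speakers appear in first-seen order.
    return list(totals.values())
```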

### Response — `rttm`

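NIST RTTM, the standard interchange format used by `pyannote.metrics` / `dscore`. The fourth and fifth fields are the turn's start time and duration (not its end time), both in seconds: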

```
SPEAKER audio 1 0.000 2.340 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.340 1.760 <NA> <NA> SPEAKER_01 <NA> <NA>
```

Returned as `Content-Type: text/plain; charset=utf-8`.
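
Because RTTM stores a duration rather than an end time, clients that want `(speaker, start, end)` tuples need a small conversion. The following is a minimal sketch; the `parse_rttm` helper is hypothetical and only reads the fields shown in the example above.

```python
from typing import List, Tuple


def parse_rttm(text: str) -> List[Tuple[str, float, float]]:
    """Convert RTTM SPEAKER lines into (speaker, start, end) tuples."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and any record type other than SPEAKER.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        # Round to milliseconds to avoid floating-point noise in the sum.
        turns.append((fields[7], start, round(start + duration, 3)))
    return turns
```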

## Quick start

```bash
curl http://localhost:8080/v1/audio/diarization \
  -H "Content-Type: multipart/form-data" \
  -F file="@meeting.wav" \
  -F model="pyannote-diarization" \
  -F num_speakers=3
```
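
The same request can be sent from Python using only the standard library. This is a sketch under stated assumptions: `build_multipart` and `diarize` are hypothetical helper names, the server is assumed to listen on `localhost:8080` as in the curl example, and the file part is sent as `application/octet-stream`.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields: dict, filename: str, data: bytes):
    """Encode form fields plus one `file` part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + data + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def diarize(path: str, model: str, base_url: str = "http://localhost:8080", **params):
    """POST an audio file to /v1/audio/diarization and return the parsed JSON."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"model": model, **params}, os.path.basename(path), fh.read()
        )
    req = urllib.request.Request(
        f"{base_url}/v1/audio/diarization",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a LocalAI instance running, `diarize("meeting.wav", "pyannote-diarization", num_speakers=3)` would send the equivalent of the curl command above. Note that `json.load` only suits the `json` and `verbose_json` formats; an `rttm` response is plain text.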

## Backend setup — sherpa-onnx (pure diarization)

Sherpa-onnx needs two ONNX models: pyannote segmentation and a speaker-embedding extractor. Place them under your LocalAI models directory and reference them from the YAML:

```yaml
name: pyannote-diarization
backend: sherpa-onnx
type: diarization
parameters:
  model: sherpa-onnx-pyannote-segmentation-3-0/model.onnx
options:
  - diarize.embedding_model=3dspeaker_speech_campplus_sv_zh-cn_16k-common.onnx
  # Optional clustering knobs (per-call DiarizeRequest fields override these):
  - diarize.threshold=0.5
  - diarize.min_duration_on=0.3
  - diarize.min_duration_off=0.5
known_usecases:
  - FLAG_DIARIZATION
```

Both `model:` and `diarize.embedding_model=` are resolved relative to the LocalAI models directory.

## Backend setup — vibevoice.cpp (diarization + ASR)

vibevoice.cpp's ASR mode emits `[{Start, End, Speaker, Content}]` natively, so a single pass gives both diarization and transcription:

```yaml
name: vibevoice-diarize
backend: vibevoice-cpp
parameters:
  model: vibevoice-asr.gguf
options:
  - type=asr
  - tokenizer=vibevoice-tokenizer.gguf
known_usecases:
  - FLAG_DIARIZATION
  - FLAG_TRANSCRIPT
```

Pass `include_text=true` on the request to populate the `text` field on each diarization segment.

```bash
curl http://localhost:8080/v1/audio/diarization \
  -H "Content-Type: multipart/form-data" \
  -F file="@interview.wav" \
  -F model="vibevoice-diarize" \
  -F include_text=true \
  -F response_format=verbose_json
```
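
Once `text` is populated, a `verbose_json` response can be flattened into a readable transcript. The sketch below joins consecutive segments from the same speaker into one line; the `to_transcript` helper and its output layout are illustrative assumptions, not something LocalAI provides.

```python
def to_transcript(payload: dict) -> str:
    """Render 'SPEAKER_NN: text' lines, merging consecutive same-speaker turns."""
    turns = []
    for seg in payload.get("segments", []):
        text = seg.get("text", "").strip()
        if not text:
            # Segment has no transcript (e.g. include_text was not set).
            continue
        if turns and turns[-1][0] == seg["speaker"]:
            turns[-1] = (seg["speaker"], turns[-1][1] + " " + text)
        else:
            turns.append((seg["speaker"], text))
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)
```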

## Notes

- **Speaker identity across files**: speaker IDs (`SPEAKER_00`, `SPEAKER_01`, …) are local to each request. To track the same person across multiple recordings, combine `/v1/audio/diarization` with `/v1/voice/embed` (speaker embedding) and maintain your own embedding store.
- **Hints vs. forces**: `num_speakers` overrides clustering when set; `min_speakers` / `max_speakers` are advisory and only honored by backends that expose a range hint. vibevoice.cpp ignores them — its model picks the count itself.
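- **Sample rate**: for sherpa-onnx, input is automatically converted to 16 kHz mono via ffmpeg before the backend sees it, because pyannote-3.0 segmentation requires 16 kHz. vibevoice.cpp reads WAV uploads directly and resamples them to 24 kHz mono internally; non-WAV uploads (MP3, OGG, FLAC, ...) are converted with ffmpeg first, so ffmpeg must be on the backend's `PATH` for those formats.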
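
For the cross-file speaker identity note above, matching against a client-side embedding store usually comes down to a nearest-neighbour lookup. The sketch below uses cosine similarity over plain lists of floats. It makes no assumption about the response shape of `/v1/voice/embed`; the `identify` helper, the in-memory `store` dictionary, and the `0.7` threshold are all hypothetical and would need tuning for a real embedding model.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(embedding: List[float],
             store: Dict[str, List[float]],
             threshold: float = 0.7) -> Optional[str]:
    """Return the best-matching known speaker, or None if nothing reaches the threshold."""
    best_name, best_score = None, threshold
    for name, known in store.items():
        score = cosine_similarity(embedding, known)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

A `None` result would typically mean "new speaker", at which point the client can add the embedding to its store under a new name.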