+++
disableToc = false
title = "Audio Transform"
weight = 17
url = "/features/audio-transform/"
+++

The audio-transform endpoints take **audio in** and emit **audio out**, optionally
conditioned on a second reference audio signal. The category is generic by
design — concrete operations include joint **acoustic echo cancellation +
noise suppression + dereverberation** (LocalVQE), voice conversion (reference
= target speaker), pitch shifting, audio super-resolution, and so on.

The first shipping backend is [LocalVQE](https://github.com/localai-org/LocalVQE),
a 1.3 M-parameter GGML-based model that performs joint AEC, noise suppression,
and dereverberation on 16 kHz mono speech at ~9.6× realtime on a desktop CPU.
It is derived from the Microsoft DeepVQE paper.

## The mental model

Every audio-transform request carries:

- **`audio`** — the primary input file (required).
- **`reference`** — an auxiliary signal whose meaning is backend-specific (optional).
  - For echo cancellation: the loopback / far-end signal played through the speakers.
  - For voice conversion: the target speaker's reference clip.
  - For pitch / style transfer: a tonal or style reference.
  - When omitted, the backend treats it as silence and degrades gracefully (LocalVQE,
    for example, then performs only noise suppression and dereverberation).
- **`params`** — a generic `key=value` map forwarded to the backend.
  - LocalVQE keys: `noise_gate=true|false`, `noise_gate_threshold_dbfs=<float>`.

This shape mirrors WebRTC's `ProcessStream(near)` / `ProcessReverseStream(far)`
APM API, NVIDIA Maxine's `NvAFX_Run` paired-stream signature, and the ICASSP
AEC challenge 2-channel WAV convention.

## Batch endpoint

`POST /audio/transformations` (alias `POST /audio/transform`) — multipart
form-data, returns audio bytes.

| Field | Type | Required | Notes |
|---|---|---|---|
| `model` | string | yes | Audio-transform model id (e.g. `localvqe`) |
| `audio` | file   | yes | Primary input audio |
| `reference` | file | no | Optional auxiliary signal |
| `response_format` | string | no | `wav` (default), `mp3`, `ogg`, `flac` |
| `sample_rate` | int | no | Desired output sample rate |
| `params[<key>]` | string | no | Repeated; forwarded to backend |

Example (LocalVQE: cancel echo, suppress noise, gate residual):

```bash
curl -X POST http://localhost:8080/audio/transformations \
  -F model=localvqe \
  -F audio=@mic.wav \
  -F reference=@loopback.wav \
  -F 'params[noise_gate]=true' \
  -F 'params[noise_gate_threshold_dbfs]=-50' \
  -o enhanced.wav
```

When `reference` is omitted, LocalVQE zero-fills the reference channel and
the operation reduces to noise suppression + dereverberation.
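
The non-file form fields from the table above can also be assembled programmatically. The following is a minimal Python sketch, not part of LocalAI: the helper name `build_form_fields` is hypothetical, and it assumes booleans should be sent as the literal strings `true` / `false`, as in the curl example.

```python
def build_form_fields(model, params=None, response_format=None, sample_rate=None):
    """Return the non-file multipart fields for POST /audio/transformations.

    Hypothetical helper for illustration; the field names follow the table above.
    """
    fields = {"model": model}
    if response_format is not None:
        fields["response_format"] = response_format
    if sample_rate is not None:
        fields["sample_rate"] = str(sample_rate)
    # Backend parameters travel as repeated params[<key>] form fields.
    for key, value in (params or {}).items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumed boolean spelling
        fields[f"params[{key}]"] = str(value)
    return fields


fields = build_form_fields(
    "localvqe",
    params={"noise_gate": True, "noise_gate_threshold_dbfs": -50},
)
print(fields)
```

With the third-party `requests` library, the resulting dict could be passed as `data=fields`, with the `audio` and `reference` files supplied through `files=`.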

## Streaming endpoint

`GET /audio/transformations/stream` — bidirectional WebSocket. The first
client message is a JSON envelope; subsequent client messages are binary
PCM frames; the server emits binary PCM frames at the same cadence.

### Wire format

**Client → server** (text frame, first):

```json
{
  "type": "session.update",
  "model": "localvqe",
  "sample_format": "S16_LE",
  "sample_rate": 16000,
  "frame_samples": 256,
  "params": { "noise_gate": "true" }
}
```

`sample_format` is `S16_LE` (16-bit signed little-endian) or `F32_LE` (32-bit
float little-endian, [-1, 1]). `frame_samples` defaults to the backend's
preferred hop length (256 = 16 ms for LocalVQE).

**Client → server** (binary frames, subsequent): interleaved stereo PCM,
channel 0 = audio (mic), channel 1 = reference. Frame size:
`frame_samples × 2 channels × sample_size`. For `S16_LE` at 256 samples that
is 1024 bytes per frame; for `F32_LE` it is 2048 bytes. If the reference is
silent (no auxiliary signal), send zeros on channel 1.

**Server → client** (binary frames): mono PCM in the same format,
`frame_samples × sample_size` bytes (512 bytes for `S16_LE`, 1024 for `F32_LE`).
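
The frame layouts above can be checked with a short Python sketch that uses only the standard `struct` module. The helper names `pack_client_frame` and `unpack_server_frame` are hypothetical and exist only for illustration; the byte counts follow the formulas in this section.

```python
import struct

BYTES_PER_SAMPLE = {"S16_LE": 2, "F32_LE": 4}
STRUCT_CODE = {"S16_LE": "h", "F32_LE": "f"}


def pack_client_frame(mic, ref=None, sample_format="S16_LE"):
    """Interleave mic (channel 0) and reference (channel 1) into one binary frame."""
    if ref is None:
        ref = [0] * len(mic)  # no auxiliary signal: send zeros on channel 1
    if len(ref) != len(mic):
        raise ValueError("mic and ref must have the same number of samples")
    interleaved = [sample for pair in zip(mic, ref) for sample in pair]
    # "<" forces little-endian with no padding.
    return struct.pack(f"<{len(interleaved)}{STRUCT_CODE[sample_format]}", *interleaved)


def unpack_server_frame(payload, sample_format="S16_LE"):
    """Decode a mono server frame back into a list of samples."""
    count = len(payload) // BYTES_PER_SAMPLE[sample_format]
    return list(struct.unpack(f"<{count}{STRUCT_CODE[sample_format]}", payload))


frame = pack_client_frame([100] * 256)
print(len(frame))  # 256 samples x 2 channels x 2 bytes = 1024
```

A real client would typically use NumPy or a typed array for speed; plain lists keep the sketch dependency-free.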

**Mid-stream control** (text frame): another `session.update` resets the
streaming state when its `reset` field is true; a `session.close` text frame
ends the session cleanly.
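
As a rough illustration, the text-frame control messages could be built as below. This is a sketch under assumptions: the function names are hypothetical, the `session.close` frame is assumed to carry only a `type` field, and whether a mid-stream `session.update` must repeat the full configuration is not specified here.

```python
import json


def session_update(model, sample_format="S16_LE", sample_rate=16000,
                   frame_samples=256, params=None, reset=False):
    """Build a session.update envelope as a JSON text frame."""
    message = {
        "type": "session.update",
        "model": model,
        "sample_format": sample_format,
        "sample_rate": sample_rate,
        "frame_samples": frame_samples,
    }
    if params:
        # The example envelope sends param values as strings ("true").
        message["params"] = {
            key: ("true" if value else "false") if isinstance(value, bool) else str(value)
            for key, value in params.items()
        }
    if reset:
        message["reset"] = True  # asks the server to reset streaming state
    return json.dumps(message)


def session_close():
    """Build a session.close text frame (assumed to need only a type field)."""
    return json.dumps({"type": "session.close"})


print(session_update("localvqe", params={"noise_gate": True}))
print(session_update("localvqe", reset=True))
print(session_close())
```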

### Latency

LocalVQE has 16 ms algorithmic latency (one hop). At runtime it uses ~1.66 ms
of CPU time per frame on a modern desktop, leaving the rest of the 16 ms
budget for network and downstream playback.
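
The figures above are related by simple arithmetic, reproduced here in Python using only numbers quoted on this page:

```python
SAMPLE_RATE_HZ = 16_000   # LocalVQE input rate
FRAME_SAMPLES = 256       # one hop
CPU_MS_PER_FRAME = 1.66   # approximate figure quoted above

hop_ms = FRAME_SAMPLES * 1000 / SAMPLE_RATE_HZ   # duration of one frame
realtime_factor = hop_ms / CPU_MS_PER_FRAME      # how much faster than realtime
headroom_ms = hop_ms - CPU_MS_PER_FRAME          # left for network and playback

print(f"hop length: {hop_ms:.1f} ms")
print(f"realtime factor: {realtime_factor:.1f}x")
print(f"headroom per frame: {headroom_ms:.2f} ms")
```

The resulting realtime factor rounds to the ~9.6× figure quoted for a desktop CPU.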

## Backend-specific tuning (LocalVQE)

| `params[<key>]` | Type | Default | Effect |
|---|---|---|---|
| `noise_gate` | bool | `false` | Enable the RMS-based residual-echo gate, applied after overlap-add (OLA) |
| `noise_gate_threshold_dbfs` | float | `-45.0` | Gate threshold in dBFS; frames below are zeroed |

The gate is most useful in far-end-only / silent-near-end stretches where the
model's residual would otherwise sound like buffering or amplified noise floor.
A reasonable starting point is `-50` dBFS.
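
To make the threshold concrete, here is a conceptual Python sketch of an RMS gate. It is not LocalVQE's implementation: the function names are hypothetical, samples are assumed to be floats in [-1, 1], and 0 dBFS is taken to mean an RMS of 1.0.

```python
import math


def rms_dbfs(frame):
    """RMS level of a float frame in dBFS (0 dBFS = RMS of 1.0, an assumption)."""
    if not frame:
        return float("-inf")
    mean_square = sum(sample * sample for sample in frame) / len(frame)
    if mean_square == 0.0:
        return float("-inf")
    # 10*log10(mean square) equals 20*log10(RMS).
    return 10.0 * math.log10(mean_square)


def gate_frame(frame, threshold_dbfs=-45.0):
    """Zero the frame when its RMS level falls below the threshold."""
    if rms_dbfs(frame) < threshold_dbfs:
        return [0.0] * len(frame)
    return list(frame)


quiet = [0.001] * 256   # about -60 dBFS: below the gate, so zeroed
loud = [0.1] * 256      # about -20 dBFS: above the gate, so passed through
print(max(gate_frame(quiet)), max(gate_frame(loud)))
```

Lowering the threshold (for example to `-50`) lets more low-level signal through, while raising it silences more frames.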

## Configuring a model

```yaml
name: localvqe
backend: localvqe
parameters:
  model: localvqe-v1.1-1.3M-f32.gguf

# Backend-specific defaults can be set in Options[]; per-request
# params[*] form fields override.
#
# `backend` and `device` route through the upstream localvqe options
# builder so you can force a non-default GGML backend (e.g. `Vulkan`) or
# pin to a specific GPU index. Leave both unset to keep the CPU default.
options:
- noise_gate=true
- noise_gate_threshold_dbfs=-50
# - backend=Vulkan
# - device=0
```

## See also

- [Text to Audio (TTS)]({{< relref "text-to-audio.md" >}})
- [Audio to Text]({{< relref "audio-to-text.md" >}})
- [LocalVQE upstream](https://github.com/localai-org/LocalVQE)
- [DeepVQE paper (Indenbom et al., Interspeech 2023)](https://arxiv.org/abs/2306.03177)