LocalAI/core/schema/audio_transform.go

53 lines
2.4 KiB
Go
Raw Normal View History

feat: add LocalVQE backend and audio transformations UI (#9640) feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI Introduce a generic "audio transform" capability for any audio-in / audio-out operation (echo cancellation, noise suppression, dereverberation, voice conversion, etc.) and ship LocalVQE as the first backend implementation. Backend protocol: - Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and bidirectional AudioTransformStream for low-latency frame-by-frame use. This is the first bidi stream in the proto; per-frame unary at LocalVQE's 16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server, embed,interface,base} with paired-channel ergonomics. LocalVQE backend (backend/go/localvqe/): - Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE wrapper needed because LocalVQE handles CPU feature selection internally via GGML_BACKEND_DL. - Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it LocalVQE runs single-threaded at ~1× realtime instead of the documented ~9.6×. - Reference-length policy: zero-pad short refs, truncate long ones (the trailing portion can't have leaked into a mic that wasn't recording). - Ginkgo test suite (9 always-on specs + 2 model-gated). HTTP layer: - POST /audio/transformations (alias /audio/transform): multipart batch endpoint, accepts audio + optional reference + params[*]=v form fields. Persists inputs alongside the output in GeneratedContentDir/audio so the React UI history can replay past (audio, reference, output) triples. - GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames (interleaved stereo mic+ref in, mono out). JSON session.update envelope for config; constants hoisted in core/schema/audio_transform.go. - ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing utils.AudioToWav (with passthrough fast-path), so the user can upload any format / rate without seeing the model's strict 16 kHz constraint. - BackendTraceAudioTransform integration so /api/backend-traces and the Traces UI light up with audio_snippet base64 and timing. - Routes registered under routes/localai.go (LocalAI extension; OpenAI has no /audio/transformations endpoint), traced via TraceMiddleware. Auth + capability + importer: - FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on, in APIFeatures), three RouteFeatureRegistry rows. - localvqe added to knownPrefOnlyBackends with modality "audio-transform". - Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on huggingface.co/LocalAI-io/LocalVQE). React UI: - New /app/transform page surfaced via a dedicated "Enhance" sidebar section (sibling of Tools / Biometrics) — the page is enhancement, not generation, so it lives outside Studio. Two AudioInput components (Upload + Record tabs, drag-drop, mic capture). - Echo-test button: records mic while playing the loaded reference through the speakers — the mic naturally picks up speaker bleed, giving a real (mic, ref) pair for AEC testing without leaving the UI. - Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls) and useAudioPeaks hook (shared module-scoped AudioContext to avoid hitting browser context limits with three players on one page); migrated TTS, Sound, Traces audio blocks to use it. - Past runs saved in localStorage via useMediaHistory('audio-transform') — the history entry stores all three URLs so clicking re-renders the full triple, not just the output. Build + e2e: - 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm, SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those two and let GPU-class hardware route through Vulkan in the gallery capabilities map. - tests-localvqe-grpc-transform job in test-extra.yml (gated on detect-changes.outputs.localvqe). - New audio_transform capability + 4 specs in tests/e2e-backends. - Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js (8 specs covering tabs, file upload, multipart shape, history, errors). Docs: - New docs/content/features/audio-transform.md covering the (audio, reference) mental model, batch + WebSocket wire formats, LocalVQE param keys, and a YAML config example. Cross-links from text-to-audio and audio-to-text feature pages. Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate] Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-04 20:07:11 +00:00
package schema
// @Description Audio transform request body — multipart form-data only.
// `audio` (the primary input file) is required; `reference` (auxiliary
// signal: loopback for echo cancellation, target speaker for voice
// conversion, etc.) is optional. Backend-specific tuning lives in the
// `params[<key>]=<value>` form fields, collected into a generic map so
// the schema doesn't bake in any one transform's vocabulary.
type AudioTransformRequest struct {
BasicModelRequest
Format string `json:"response_format,omitempty" yaml:"response_format,omitempty"` // wav | mp3 | ogg | flac
SampleRate int `json:"sample_rate,omitempty" yaml:"sample_rate,omitempty"` // desired output sample rate; 0 = backend default
Params map[string]string `json:"params,omitempty" yaml:"params,omitempty"` // backend-specific tuning
}
// AudioTransformStreamControl is the JSON envelope used on the
// /audio/transformations/stream WebSocket. The first frame on a new
// connection MUST be a session.update; subsequent frames are binary PCM.
// Server may emit error / session.closed text frames.
type AudioTransformStreamControl struct {
Type string `json:"type"`
Model string `json:"model,omitempty"`
SampleFormat string `json:"sample_format,omitempty"`
SampleRate int `json:"sample_rate,omitempty"`
FrameSamples int `json:"frame_samples,omitempty"`
Params map[string]string `json:"params,omitempty"`
Reset bool `json:"reset,omitempty"`
Error string `json:"error,omitempty"`
}
// AudioTransformStreamControl Type values.
const (
AudioTransformCtrlSessionUpdate = "session.update"
AudioTransformCtrlSessionClose = "session.close"
AudioTransformCtrlSessionClosed = "session.closed"
AudioTransformCtrlError = "error"
)
// AudioTransformStreamControl SampleFormat values (mirror the proto enum
// names so the wire format stays self-describing).
const (
AudioTransformSampleFormatS16LE = "S16_LE"
AudioTransformSampleFormatF32LE = "F32_LE"
)
// LocalVQE param keys — backend-specific but referenced by both the
// HTTP layer (form-field shortcuts, defaults) and the localvqe backend
// itself. Hoisted so renames stay in lockstep.
const (
AudioTransformParamNoiseGate = "noise_gate"
AudioTransformParamNoiseGateThreshold = "noise_gate_threshold_dbfs"
)