LocalAI/core/schema/audio_transform.go

package schema

// @Description Audio transform request body — multipart form-data only.
// `audio` (the primary input file) is required; `reference` (auxiliary
// signal: loopback for echo cancellation, target speaker for voice
// conversion, etc.) is optional. Backend-specific tuning lives in the
// `params[<key>]=<value>` form fields, collected into a generic map so
// the schema doesn't bake in any one transform's vocabulary.
type AudioTransformRequest struct {
	BasicModelRequest
	Format     string            `json:"response_format,omitempty" yaml:"response_format,omitempty"` // wav | mp3 | ogg | flac
	SampleRate int               `json:"sample_rate,omitempty" yaml:"sample_rate,omitempty"`         // desired output sample rate; 0 = backend default
	Params     map[string]string `json:"params,omitempty" yaml:"params,omitempty"`                   // backend-specific tuning
}

// AudioTransformStreamControl is the JSON envelope used on the
// /audio/transformations/stream WebSocket. The first frame on a new
// connection MUST be a session.update; subsequent frames are binary PCM.
// Server may emit error / session.closed text frames.
type AudioTransformStreamControl struct {
	Type         string            `json:"type"`
	Model        string            `json:"model,omitempty"`
	SampleFormat string            `json:"sample_format,omitempty"`
	SampleRate   int               `json:"sample_rate,omitempty"`
	FrameSamples int               `json:"frame_samples,omitempty"`
	Params       map[string]string `json:"params,omitempty"`
	Reset        bool              `json:"reset,omitempty"`
	Error        string            `json:"error,omitempty"`
}

// AudioTransformStreamControl Type values.
const (
	AudioTransformCtrlSessionUpdate = "session.update"
	AudioTransformCtrlSessionClose  = "session.close"
	AudioTransformCtrlSessionClosed = "session.closed"
	AudioTransformCtrlError         = "error"
)

// AudioTransformStreamControl SampleFormat values (mirror the proto enum
// names so the wire format stays self-describing).
const (
	AudioTransformSampleFormatS16LE = "S16_LE"
	AudioTransformSampleFormatF32LE = "F32_LE"
)

// LocalVQE param keys — backend-specific but referenced by both the
// HTTP layer (form-field shortcuts, defaults) and the localvqe backend
// itself. Hoisted so renames stay in lockstep.
const (
	AudioTransformParamNoiseGate          = "noise_gate"
	AudioTransformParamNoiseGateThreshold = "noise_gate_threshold_dbfs"
)
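
The control envelope above can be exercised without a WebSocket. The sketch below is illustrative only: it copies the struct and a few constants so it compiles on its own, and the helpers `newSessionUpdate` and `parseFirstFrame` are hypothetical names, not part of LocalAI. The model name, the `"true"` value for `noise_gate`, and 256 samples per frame (16 ms at 16 kHz) are assumptions chosen for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Copied from the schema so the sketch is self-contained.
type AudioTransformStreamControl struct {
	Type         string            `json:"type"`
	Model        string            `json:"model,omitempty"`
	SampleFormat string            `json:"sample_format,omitempty"`
	SampleRate   int               `json:"sample_rate,omitempty"`
	FrameSamples int               `json:"frame_samples,omitempty"`
	Params       map[string]string `json:"params,omitempty"`
	Reset        bool              `json:"reset,omitempty"`
	Error        string            `json:"error,omitempty"`
}

const (
	AudioTransformCtrlSessionUpdate = "session.update"
	AudioTransformSampleFormatS16LE = "S16_LE"
	AudioTransformParamNoiseGate    = "noise_gate"
)

// newSessionUpdate (hypothetical helper) builds the JSON text frame a
// client would send first on a new connection.
func newSessionUpdate(model string, sampleRate, frameSamples int, params map[string]string) ([]byte, error) {
	return json.Marshal(AudioTransformStreamControl{
		Type:         AudioTransformCtrlSessionUpdate,
		Model:        model,
		SampleFormat: AudioTransformSampleFormatS16LE,
		SampleRate:   sampleRate,
		FrameSamples: frameSamples,
		Params:       params,
	})
}

// parseFirstFrame (hypothetical helper) decodes a text frame and enforces
// the documented rule that the first frame must be a session.update.
func parseFirstFrame(text []byte) (AudioTransformStreamControl, error) {
	var ctrl AudioTransformStreamControl
	if err := json.Unmarshal(text, &ctrl); err != nil {
		return ctrl, fmt.Errorf("invalid control frame: %w", err)
	}
	if ctrl.Type != AudioTransformCtrlSessionUpdate {
		return ctrl, fmt.Errorf("first frame must be %q, got %q",
			AudioTransformCtrlSessionUpdate, ctrl.Type)
	}
	return ctrl, nil
}

func main() {
	frame, err := newSessionUpdate("localvqe-v1-1.3m", 16000, 256,
		map[string]string{AudioTransformParamNoiseGate: "true"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(frame))
	// {"type":"session.update","model":"localvqe-v1-1.3m","sample_format":"S16_LE","sample_rate":16000,"frame_samples":256,"params":{"noise_gate":"true"}}

	ctrl, err := parseFirstFrame(frame)
	fmt.Println(ctrl.Type, err) // session.update <nil>
}
```

Because every field except `type` carries `omitempty`, a frame such as `{"type":"session.close"}` stays minimal on the wire.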
feat: add LocalVQE backend and audio transformations UI (#9640)

feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI

Introduce a generic "audio transform" capability for any audio-in /
audio-out operation (echo cancellation, noise suppression,
dereverberation, voice conversion, etc.) and ship LocalVQE as the first
backend implementation.

Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
  bidirectional AudioTransformStream for low-latency frame-by-frame use.
  This is the first bidi stream in the proto; per-frame unary at
  LocalVQE's 16 ms hop would be RTT-bound. Wire it through
  pkg/grpc/{client,server,embed,interface,base} with paired-channel
  ergonomics.

LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the
  upstream shared lib + its libggml-cpu-.so runtime variants directly;
  no MODULE wrapper is needed because LocalVQE handles CPU feature
  selection internally via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1). Without
  it LocalVQE runs single-threaded at ~1× realtime instead of the
  documented ~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
  trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).

HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
  endpoint, accepts audio + optional reference + params[<key>]=<value>
  form fields. Persists inputs alongside the output in
  GeneratedContentDir/audio so the React UI history can replay past
  (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
  (interleaved stereo mic+ref in, mono out). JSON session.update
  envelope for config; constants hoisted in
  core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the
  existing utils.AudioToWav (with passthrough fast-path), so the user
  can upload any format / rate without seeing the model's strict 16 kHz
  constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
  Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI
  has no /audio/transformations endpoint), traced via TraceMiddleware.

Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform
  (default-on, in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality
  "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
  huggingface.co/LocalAI-io/LocalVQE).

React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
  section (sibling of Tools / Biometrics). The page is enhancement, not
  generation, so it lives outside Studio. Two AudioInput components
  (Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference
  through the speakers. The mic naturally picks up speaker bleed, giving
  a real (mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio
  controls) and useAudioPeaks hook (shared module-scoped AudioContext to
  avoid hitting browser context limits with three players on one page);
  migrated TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform').
  The history entry stores all three URLs so clicking re-renders the
  full triple, not just the output.

Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA,
  ROCm, SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we
  ship those two and let GPU-class hardware route through Vulkan in the
  gallery capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
  detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
  (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:
- New docs/content/features/audio-transform.md covering the (audio,
  reference) mental model, batch + WebSocket wire formats, LocalVQE
  param keys, and a YAML config example. Cross-links from text-to-audio
  and audio-to-text feature pages.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

2026-05-04 20:07:11 +00:00
