Onju Voice Architecture
System Overview
ESP32-S3 voice assistant with bidirectional audio streaming over WiFi to a server running speech recognition and text-to-speech.
┌─────────────────────────────────────────────────────────────┐
│ ESP32-S3 │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐ │
│ │ Mic │───→│ I2S RX │───→│ μ-law │──→│ UDP │ │
│ │ (INMP441)│ │ 16kHz │ │ encode │ │ 3000 │ │
│ └──────────┘ └─────────┘ └──────────┘ └─────┬───┘ │
│ │ │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌─────▼───┐ │
│ │ Speaker │◀───│ I2S TX │◀───│ Opus │◀──│ TCP │ │
│ │(MAX98357)│ │ 16kHz │ │ decode │ │ 3001 │ │
│ └──────────┘ └─────────┘ └──────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
WiFi
│
┌─────────────────────────────────▼───────────────────────────┐
│ Server │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────────┐ │
│ │ UDP │───→│ μ-law │───→│ Speech-to-Text │ │
│ │ 3000 │ │ decode │ │ (Whisper/Deepgram) │ │
│ └─────────┘ └──────────┘ └─────────────────────┘ │
│ │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────────┐ │
│ │ TCP │◀───│ Opus │◀───│ Text-to-Speech │ │
│ │ 3001 │ │ encode │ │ (ElevenLabs/etc) │ │
│ └─────────┘ └──────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Audio Paths
Microphone → Server (UDP + μ-law)
- Sample rate: 16kHz mono, 512 samples/chunk (32ms)
- μ-law compressed: 512 bytes/chunk (16 KB/s) — 2x reduction
- UDP: no retransmissions, no connection overhead — old audio is stale anyway
- DC offset removed per-chunk before encoding
Why μ-law over Opus upstream: μ-law is stateless (sample-by-sample table lookup, ~1% CPU), zero buffering latency, and ASR models handle the quality fine. Opus would add 20-60ms frame buffering and 10-20% CPU for no practical benefit upstream.
Why UDP over TCP: Retransmissions add latency and head-of-line blocking delays newer audio. ASR handles occasional packet loss better than delayed old audio.
Server → Speaker (TCP + Opus)
- Sample rate: 16kHz mono, 320 samples/frame (20ms)
- Opus compressed: ~35-50 bytes/frame (1.5-2 KB/s) — 14-16x reduction
- TCP: reliable ordered delivery required for Opus frame decoding
Why Opus over μ-law downstream: Human ears need better quality than ASR. Opus gives 14-16x compression vs μ-law's 2x, turning a tight 2.2x WiFi margin into 30x+.
Why TCP over UDP: Lost or out-of-order Opus frames cause decode errors. TCP's reliability guarantees are worth the slight latency cost, especially with the playback buffer absorbing jitter.
Device Discovery & Connection
- ESP32 boots and joins WiFi
- Sends multicast announcement to 239.0.0.1:12345 with hostname and git hash
- Server discovers device, learns IP
- Server connects to ESP32's TCP server on port 3001 (ESP32 is the TCP server, not client)
- ESP32 learns server IP from first TCP connection, uses it for UDP mic packets
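The announcement side of this flow is a one-shot multicast datagram. A minimal Python sketch, assuming a simple "hostname + git hash" text payload (the doc does not specify the exact wire format):

```python
import socket

def make_announcement(hostname: str, git_hash: str) -> bytes:
    # Payload layout is an assumption; the doc only says the announcement
    # carries the hostname and git hash.
    return f"{hostname} {git_hash}".encode()

def send_announcement(msg: bytes, group: str = "239.0.0.1", port: int = 12345) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 keeps the multicast announcement on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(msg, (group, port))
    sock.close()
```

On the server side, a socket joined to the same multicast group learns the device's IP from the source address of the received datagram.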
TCP Command Protocol
All commands use a 6-byte header. The server initiates TCP connections to the ESP32.
0xAA — Audio Playback
header[0] = 0xAA
header[1:2] = mic_timeout (seconds, big-endian) — enable mic after audio finishes
header[3] = volume (0-20, bit-shift)
header[4] = LED fade rate (0-255)
header[5] = compression type: 0=PCM, 2=Opus
Followed by length-prefixed Opus frames: [2-byte big-endian length][Opus data]...
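The 0xAA header and the length-prefixed frame stream map directly onto `struct` calls. A sketch of the server-side builder and a frame splitter (names are illustrative):

```python
import struct
from typing import Iterator

def build_audio_header(mic_timeout_s: int, volume: int, fade: int,
                       compression: int = 2) -> bytes:
    """6-byte 0xAA header: cmd, 2-byte big-endian mic timeout, volume,
    LED fade rate, compression type (0=PCM, 2=Opus)."""
    return struct.pack(">BHBBB", 0xAA, mic_timeout_s, volume, fade, compression)

def iter_opus_frames(payload: bytes) -> Iterator[bytes]:
    """Split a [2-byte BE length][Opus data]... stream into frames."""
    off = 0
    while off + 2 <= len(payload):
        (n,) = struct.unpack_from(">H", payload, off)
        off += 2
        yield payload[off:off + n]
        off += n
```

The explicit length prefix is what makes TCP's ordered delivery essential downstream: a lost or reordered prefix would desynchronize every subsequent frame boundary.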
0xBB — Set LEDs
header[0] = 0xBB
header[1] = LED bitmask (bits 0-5)
header[2:4] = RGB color
0xCC — LED Blink (VAD visualization)
header[0] = 0xCC
header[1] = starting intensity (0-255)
header[2:4] = RGB color
header[5] = fade rate
Also extends mic timeout if it's about to expire (VAD_MIC_EXTEND = 5s).
0xDD — Mic Timeout
header[0] = 0xDD
header[1:2] = timeout (seconds, big-endian)
Used to stop mic while server is processing (thinking animation).
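The remaining three commands are pure 6-byte headers. Sketch builders, assuming unused trailing bytes are zero-padded to the fixed header size (the doc states all commands use a 6-byte header but not the padding convention):

```python
import struct

def set_leds(mask: int, r: int, g: int, b: int) -> bytes:
    # 0xBB: header[1] = LED bitmask (bits 0-5), header[2:4] = RGB.
    return struct.pack(">BBBBBB", 0xBB, mask, r, g, b, 0)

def led_blink(intensity: int, r: int, g: int, b: int, fade: int) -> bytes:
    # 0xCC: header[1] = starting intensity, header[2:4] = RGB,
    # header[5] = fade rate. Also extends the mic timeout device-side.
    return struct.pack(">BBBBBB", 0xCC, intensity, r, g, b, fade)

def mic_timeout(seconds: int) -> bytes:
    # 0xDD: header[1:2] = big-endian timeout in seconds.
    return struct.pack(">BHBBB", 0xDD, seconds, 0, 0, 0)
```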
FreeRTOS Task Architecture
The ESP32-S3's dual cores are used to separate concerns:
Core 0 — Arduino loop:
- TCP server: accepts connections, parses headers, handles PCM playback
- Touch/mute input handling
- UART debug commands
Core 1 — Dedicated tasks:
- micTask (4KB stack, priority 1): continuous I2S read → μ-law encode → UDP send
- opusDecodeTask (32KB stack, priority 1): created per playback, reads TCP → Opus decode → I2S write
- updateLedTask (2KB stack, priority 2): 40Hz LED refresh with gamma-corrected fade
The 32KB stack for Opus decoding is necessary because the Opus decoder uses 10-20KB of stack internally.
State Machine
Key state variables controlling behavior:
- isPlaying — blocks mic recording during playback
- mic_timeout — millis() deadline for mic recording; 0 = off
- interruptPlayback — set by center touch to abort current playback
- mute — hardware mute switch state (currently disabled via DISABLE_HARDWARE_MUTE)
- serverIP — learned from first TCP connection; 0.0.0.0 = no server yet
Activation flow:
- Center touch → sets mic_timeout to now + 60s, green LED pulse
- Server sends 0xCC (VAD blink) during speech → extends timeout by 5s if nearly expired
- Server sends 0xDD (stop mic) when transcription complete → thinking animation
- Server sends 0xAA (audio) with response → plays audio, then re-enables mic per header timeout
Playback interruption:
- Center touch during playback → sets interruptPlayback, clears isPlaying
- Opus/PCM task detects flag, stops decoding
- Remaining TCP data drained (up to 1s) without playing
- Mic enabled immediately for 60s
LED System
6 NeoPixel LEDs; only the inner 4 (indices 1-4) are used for animations. The outer two of those (indices 1 and 4) are dimmed by half for a softer visual.
- Pulse-and-fade paradigm: setLed() sets color, starting intensity, and fade rate; updateLedTask ramps intensity down at 40Hz.
- Gamma correction: LUT with gamma 1.8 (lower than typical 2.2 to avoid visible flicker at low PWM levels)
- Audio-reactive: During playback, amplitude of PCM samples drives LED brightness (sampled every 32ms, only ramps up — natural fade handles the down)
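The gamma LUT itself is a one-liner. A sketch of how such a table is typically built (the firmware's exact table size and rounding are assumptions):

```python
def gamma_lut(gamma: float = 1.8, steps: int = 256) -> list[int]:
    """Map linear brightness 0-255 to gamma-corrected PWM values.

    Gamma 1.8 keeps the low end of the curve above the duty cycles where
    NeoPixel flicker becomes visible, at the cost of slightly less
    perceptual linearity than the usual 2.2.
    """
    return [round(255 * (i / (steps - 1)) ** gamma) for i in range(steps)]
```

At fade time, intensity is decremented linearly and the LUT is applied last, so the visible ramp-down stays perceptually smooth.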
Color semantics:
- Green pulse: listening / mic active
- White pulse: audio playback / VAD visualization
- Red pulse: error / cannot listen (muted or no server)
Volume Control
Bit-shift based: PCM samples are left-shifted by the volume value (0-20). Default 14. Set per-playback via the 0xAA header, configurable via NVS.
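One plausible reading of the bit-shift scheme, shown as a hedged sketch: the sample is shifted left by the volume value against a fixed reference shift, so the default of 14 is unity gain, smaller values attenuate by powers of two, and larger values amplify. The reference shift and clamping here are assumptions, not confirmed firmware behavior:

```python
def scale_sample(sample: int, volume: int, ref_shift: int = 14) -> int:
    """Apply a power-of-two volume: net gain is 2**(volume - ref_shift).

    volume=14 passes the sample through unchanged; the result is clamped
    to the 16-bit PCM range to avoid wraparound distortion.
    """
    scaled = (sample << volume) >> ref_shift
    return max(-32768, min(32767, scaled))
```

Shifts are attractive on a microcontroller because they cost one cycle per sample, versus a multiply-and-divide for arbitrary gain values.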
Playback Buffering
TCP → tcpBuffer (512B) → wavData (2MB PSRAM) → I2S DMA → Speaker
- Buffer threshold: 4096 samples (256ms) before starting I2S playback — balances latency vs jitter resilience
- Without PSRAM: falls back to 1024 samples (64ms), 4KB allocation
- I2S DMA: 4 buffers × 512 samples, hardware-driven (no CPU polling)
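The threshold gating amounts to a small state machine: accumulate samples silently until the start threshold is reached, then stream. A minimal sketch (class and method names are illustrative):

```python
class PlaybackBuffer:
    """Hold playback until enough audio is buffered to ride out jitter."""

    def __init__(self, start_threshold: int = 4096):
        # 4096 samples at 16kHz = 256ms of pre-buffered audio.
        self.samples: list[int] = []
        self.start_threshold = start_threshold
        self.playing = False

    def push(self, chunk: list[int]) -> None:
        self.samples.extend(chunk)
        if not self.playing and len(self.samples) >= self.start_threshold:
            self.playing = True  # threshold met: start feeding I2S

    def pull(self, n: int) -> list[int]:
        if not self.playing:
            return [0] * n  # emit silence until the threshold is met
        out, self.samples = self.samples[:n], self.samples[n:]
        return out
```

Raising the threshold trades first-sound latency for resilience to network stalls; 256ms sits comfortably above typical WiFi jitter without feeling sluggish.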
Configuration
Stored in NVS (ESP32 Preferences): WiFi credentials, server hostname, volume, mic timeout. Editable via UART config mode (c command).
UART Debug Commands
r restart, M mic on 10min, m mic off, W/w LED test fast/slow, L/l LEDs max/off, A multicast announce, c config mode.