Onju Voice Architecture
System Overview
ESP32-S3 voice assistant with bidirectional audio streaming over WiFi to a server running speech recognition and text-to-speech.
┌─────────────────────────────────────────────────────────────┐
│ ESP32-S3 │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌─────────┐ │
│ │ Mic │───→│ I2S RX │───→│ μ-law │──→│ UDP │ │
│ │ (INMP441)│ │ 16kHz │ │ encode │ │ 3000 │ │
│ └──────────┘ └─────────┘ └──────────┘ └─────┬───┘ │
│ │ │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌─────▼───┐ │
│ │ Speaker │◀───│ I2S TX │◀───│ Opus │◀──│ TCP │ │
│ │(MAX98357)│ │ 16kHz │ │ decode │ │ 3001 │ │
│ └──────────┘ └─────────┘ └──────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
WiFi
│
┌─────────────────────────────────▼───────────────────────────┐
│ Server │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────────┐ │
│ │ UDP │───→│ μ-law │───→│ Speech-to-Text │ │
│ │ 3000 │ │ decode │ │ (Whisper/Deepgram) │ │
│ └─────────┘ └──────────┘ └─────────────────────┘ │
│ │
│ ┌─────────┐ ┌──────────┐ ┌─────────────────────┐ │
│ │ TCP │◀───│ Opus │◀───│ Text-to-Speech │ │
│ │ 3001 │ │ encode │ │ (ElevenLabs/etc) │ │
│ └─────────┘ └──────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Audio Paths
Microphone → Server (UDP + μ-law)
- Sample rate: 16kHz mono, 512 samples/chunk (32ms)
- μ-law compressed: 512 bytes/chunk (16 KB/s) — 2x reduction
- UDP: no retransmissions, no connection overhead — old audio is stale anyway
- DC offset removed per-chunk before encoding
Why μ-law over Opus upstream: μ-law is stateless (sample-by-sample table lookup, ~1% CPU), zero buffering latency, and ASR models handle the quality fine. Opus would add 20-60ms frame buffering and 10-20% CPU for no practical benefit upstream.
Why UDP over TCP: Retransmissions add latency and head-of-line blocking delays newer audio. ASR handles occasional packet loss better than delayed old audio.
Server → Speaker (TCP + Opus)
- Sample rate: 16kHz mono, 320 samples/frame (20ms)
- Opus compressed: ~35-50 bytes/frame (1.5-2 KB/s) — 14-16x reduction
- TCP: reliable ordered delivery required for Opus frame decoding
Why Opus over μ-law downstream: Human ears need better quality than ASR. Opus gives 14-16x compression vs μ-law's 2x, turning a tight 2.2x WiFi margin into 30x+.
Why TCP over UDP: Lost or out-of-order Opus frames cause decode errors. TCP's reliability guarantees are worth the slight latency cost, especially with the playback buffer absorbing jitter.
Device Discovery & Connection
- ESP32 boots and joins WiFi
- Sends multicast announcement to 239.0.0.1:12345 with hostname and git hash
- Server discovers device, learns IP
- Server connects to ESP32's TCP server on port 3001 (ESP32 is the TCP server, not client)
- ESP32 learns server IP from first TCP connection, uses it for UDP mic packets
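The announcement side of this flow is a one-shot multicast datagram. A minimal Python sketch, assuming a simple "hostname + git hash" text payload (the doc does not specify the exact wire format):

```python
import socket

def make_announcement(hostname: str, git_hash: str) -> bytes:
    # Payload layout is an assumption; the doc only says the announcement
    # carries the hostname and git hash.
    return f"{hostname} {git_hash}".encode()

def send_announcement(msg: bytes, group: str = "239.0.0.1", port: int = 12345) -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 keeps the multicast announcement on the local network segment.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(msg, (group, port))
    sock.close()
```

On the server side, a socket joined to the same multicast group learns the device's IP from the source address of the received datagram.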
TCP Command Protocol
All commands use a 6-byte header. The server initiates TCP connections to the ESP32.
0xAA — Audio Playback
header[0] = 0xAA
header[1:2] = mic_timeout (seconds, big-endian) — enable mic after audio finishes
header[3] = volume (0-20, bit-shift)
header[4] = LED fade rate (0-255)
header[5] = compression type: 0=PCM, 2=Opus
Followed by length-prefixed Opus frames: [2-byte big-endian length][Opus data]...
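The 0xAA header and the length-prefixed frame stream map directly onto `struct` calls. A sketch of the server-side builder and a frame splitter (names are illustrative):

```python
import struct
from typing import Iterator

def build_audio_header(mic_timeout_s: int, volume: int, fade: int,
                       compression: int = 2) -> bytes:
    """6-byte 0xAA header: cmd, 2-byte big-endian mic timeout, volume,
    LED fade rate, compression type (0=PCM, 2=Opus)."""
    return struct.pack(">BHBBB", 0xAA, mic_timeout_s, volume, fade, compression)

def iter_opus_frames(payload: bytes) -> Iterator[bytes]:
    """Split a [2-byte BE length][Opus data]... stream into frames."""
    off = 0
    while off + 2 <= len(payload):
        (n,) = struct.unpack_from(">H", payload, off)
        off += 2
        yield payload[off:off + n]
        off += n
```

The explicit length prefix is what makes TCP's ordered delivery essential downstream: a lost or reordered prefix would desynchronize every subsequent frame boundary.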
0xBB — Set LEDs
header[0] = 0xBB
header[1] = LED bitmask (bits 0-5)
header[2:4] = RGB color
0xCC — LED Blink (VAD visualization)
header[0] = 0xCC
header[1] = starting intensity (0-255)
header[2:4] = RGB color
header[5] = fade rate
Also extends mic timeout if it's about to expire (VAD_MIC_EXTEND = 5s).
0xDD — Mic Timeout
header[0] = 0xDD
header[1:2] = timeout (seconds, big-endian)
Used to stop mic while server is processing (thinking animation).
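The remaining three commands are pure 6-byte headers. Sketch builders, assuming unused trailing bytes are zero-padded to the fixed header size (the doc states all commands use a 6-byte header but not the padding convention):

```python
import struct

def set_leds(mask: int, r: int, g: int, b: int) -> bytes:
    # 0xBB: header[1] = LED bitmask (bits 0-5), header[2:4] = RGB.
    return struct.pack(">BBBBBB", 0xBB, mask, r, g, b, 0)

def led_blink(intensity: int, r: int, g: int, b: int, fade: int) -> bytes:
    # 0xCC: header[1] = starting intensity, header[2:4] = RGB,
    # header[5] = fade rate. Also extends the mic timeout device-side.
    return struct.pack(">BBBBBB", 0xCC, intensity, r, g, b, fade)

def mic_timeout(seconds: int) -> bytes:
    # 0xDD: header[1:2] = big-endian timeout in seconds.
    return struct.pack(">BHBBB", 0xDD, seconds, 0, 0, 0)
```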
FreeRTOS Task Architecture
The ESP32-S3's dual cores are used to separate concerns:
Core 0 — Arduino loop:
- TCP server: accepts connections, parses headers, handles PCM playback
- Touch/mute input handling
- UART debug commands
Core 1 — Dedicated tasks:
- micTask (4KB stack, priority 1): continuous I2S read → μ-law encode → UDP send
- opusDecodeTask (32KB stack, priority 1): created per playback, reads TCP → Opus decode → I2S write
- updateLedTask (2KB stack, priority 2): 40Hz LED refresh with gamma-corrected fade
The 32KB stack for Opus decoding is necessary because the Opus decoder uses 10-20KB of stack internally.
State Machine
Key state variables controlling behavior:
- isPlaying — blocks mic recording during playback
- mic_timeout — millis() deadline for mic recording; 0 = off
- interruptPlayback — set by center touch to abort current playback
- mute — hardware mute switch state (currently disabled via DISABLE_HARDWARE_MUTE)
- serverIP — learned from first TCP connection; 0.0.0.0 = no server yet
Activation flow:
- Center touch → sets mic_timeout to now + 60s, green LED pulse
- Server sends 0xCC (VAD blink) during speech → extends timeout by 5s if nearly expired
- Server sends 0xDD (stop mic) when transcription complete → thinking animation
- Server sends 0xAA (audio) with response → plays audio, then re-enables mic per header timeout
Playback interruption:
- Center touch during playback → sets interruptPlayback, clears isPlaying
- Opus/PCM task detects flag, stops decoding
- Remaining TCP data drained (up to 1s) without playing
- Mic enabled immediately for 60s
LED System
6 NeoPixel LEDs; only the inner 4 (indices 1-4) are used for animations. The outer two of those (indices 1 and 4) are dimmed by half for a softer visual.
- Pulse-and-fade paradigm: setLed() sets color, starting intensity, and fade rate; updateLedTask ramps intensity down at 40Hz.
- Gamma correction: LUT with gamma 1.8 (lower than typical 2.2 to avoid visible flicker at low PWM levels)
- Audio-reactive: During playback, amplitude of PCM samples drives LED brightness (sampled every 32ms, only ramps up — natural fade handles the down)
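The gamma LUT itself is a one-liner. A sketch of how such a table is typically built (the firmware's exact table size and rounding are assumptions):

```python
def gamma_lut(gamma: float = 1.8, steps: int = 256) -> list[int]:
    """Map linear brightness 0-255 to gamma-corrected PWM values.

    Gamma 1.8 keeps the low end of the curve above the duty cycles where
    NeoPixel flicker becomes visible, at the cost of slightly less
    perceptual linearity than the usual 2.2.
    """
    return [round(255 * (i / (steps - 1)) ** gamma) for i in range(steps)]
```

At fade time, intensity is decremented linearly and the LUT is applied last, so the visible ramp-down stays perceptually smooth.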
Color semantics:
- Green pulse: listening / mic active
- White pulse: audio playback / VAD visualization
- Red pulse: error / cannot listen (muted or no server)
Volume Control
Bit-shift based: PCM samples are left-shifted by the volume value (0-20). Default 14. Set per-playback via the 0xAA header, configurable via NVS.
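One plausible reading of the bit-shift scheme, shown as a hedged sketch: the sample is shifted left by the volume value against a fixed reference shift, so the default of 14 is unity gain, smaller values attenuate by powers of two, and larger values amplify. The reference shift and clamping here are assumptions, not confirmed firmware behavior:

```python
def scale_sample(sample: int, volume: int, ref_shift: int = 14) -> int:
    """Apply a power-of-two volume: net gain is 2**(volume - ref_shift).

    volume=14 passes the sample through unchanged; the result is clamped
    to the 16-bit PCM range to avoid wraparound distortion.
    """
    scaled = (sample << volume) >> ref_shift
    return max(-32768, min(32767, scaled))
```

Shifts are attractive on a microcontroller because they cost one cycle per sample, versus a multiply-and-divide for arbitrary gain values.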
Playback Buffering
TCP → tcpBuffer (512B) → wavData (2MB PSRAM) → I2S DMA → Speaker
- Buffer threshold: 4096 samples (256ms) before starting I2S playback — balances latency vs jitter resilience
- Without PSRAM: falls back to 1024 samples (64ms), 4KB allocation
- I2S DMA: 4 buffers × 512 samples, hardware-driven (no CPU polling)
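The threshold gating amounts to a small state machine: accumulate samples silently until the start threshold is reached, then stream. A minimal sketch (class and method names are illustrative):

```python
class PlaybackBuffer:
    """Hold playback until enough audio is buffered to ride out jitter."""

    def __init__(self, start_threshold: int = 4096):
        # 4096 samples at 16kHz = 256ms of pre-buffered audio.
        self.samples: list[int] = []
        self.start_threshold = start_threshold
        self.playing = False

    def push(self, chunk: list[int]) -> None:
        self.samples.extend(chunk)
        if not self.playing and len(self.samples) >= self.start_threshold:
            self.playing = True  # threshold met: start feeding I2S

    def pull(self, n: int) -> list[int]:
        if not self.playing:
            return [0] * n  # emit silence until the threshold is met
        out, self.samples = self.samples[:n], self.samples[n:]
        return out
```

Raising the threshold trades first-sound latency for resilience to network stalls; 256ms sits comfortably above typical WiFi jitter without feeling sluggish.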
Configuration
Stored in NVS (ESP32 Preferences): WiFi credentials, server hostname, volume, mic timeout. Editable via UART config mode (c command).
UART Debug Commands
r restart, M mic on 10min, m mic off, W/w LED test fast/slow, L/l LEDs max/off, A multicast announce, c config mode.