Switch from webrtcvad's binary is_speech to Silero VAD's calibrated
float probability via direct ONNX session calls with numpy. The LSTM
provides temporal smoothing natively, eliminating the sliding window
hack. Frame size changes from 480 (30ms) to 512 (32ms) end-to-end
to match Silero's requirements.
Consolidate pipeline/requirements.txt into root requirements.txt,
swap webrtcvad+setuptools for silero-vad+onnxruntime.
Implements Opus decoding on ESP32 for TTS playback, achieving 14-16x
compression over raw PCM. This improves WiFi throughput margin from 2.2x
to 30x+, enabling reliable operation throughout the home even with poor
WiFi conditions.
Key changes:
- Add Opus decoder to ESP32 firmware with dedicated 32KB FreeRTOS task
- Implement length-prefixed TCP framing for variable-bitrate Opus frames
- Update header protocol: header[5] = compression type (0=PCM, 1=μ-law, 2=Opus)
- Auto-detect USB port in flash and serial monitor scripts
- Add test script with opuslib encoder supporting WAV/M4A/MP3 input
- Document architecture and design rationale for μ-law/UDP (mic) vs Opus/TCP (speaker)
Performance:
- Compression: 640 bytes PCM → 35-50 bytes Opus per 20ms frame (14-16x)
- Bandwidth: 256 kbps → 16 kbps (94% reduction)
- WiFi margin: 2.2x → 30x+ throughput safety margin
- CPU usage: ~10-20% during playback on ESP32-S3
- Quality: High-fidelity voice suitable for human listening
🤖 Generated with [Claude Code](https://claude.com/claude-code)