onju-v2/OPUS_IMPLEMENTATION.md
justLV 0c9c75b3bf Replace webrtcvad with Silero VAD (ONNX, no PyTorch)
Switch from webrtcvad's binary is_speech to Silero VAD's calibrated
float probability via direct ONNX session calls with numpy. The LSTM
provides temporal smoothing natively, eliminating the sliding window
hack. Frame size changes from 480 (30ms) to 512 (32ms) end-to-end
to match Silero's requirements.

Consolidate pipeline/requirements.txt into root requirements.txt,
swap webrtcvad+setuptools for silero-vad+onnxruntime.
2026-02-07 17:00:02 -08:00

# Opus Compression Implementation Plan
## Overview
Add Opus decoding to the ESP32 firmware so it can receive compressed TTS audio from the server over TCP. At 12-16 kbps this is roughly 16-21x smaller than raw PCM (about 8-11x smaller than the current μ-law stream).
## Why Opus?
- **16-21x compression** over raw PCM for 16kHz mono voice at 12-16 kbps (vs 2x for μ-law)
- **High quality** - suitable for human listening at a fraction of μ-law's bitrate
- **Bandwidth target**: 12-16 kbps (vs current 128 kbps with μ-law, 256 kbps raw)
- **WiFi margin**: 4.3x → roughly 35-46x throughput margin (based on 553.9 kbps measured throughput)
- **Resource usage**: estimated 10-20% of one core, ~20 KB heap on ESP32-S3
## Architecture
### Current Flow (μ-law)
```
Server → [PCM 16kHz] → μ-law encode → TCP → ESP32 → μ-law decode → I2S speaker
           32 KB/s                  16 KB/s                          32 KB/s
```
### New Flow (Opus)
```
Server → [PCM 16kHz] → Opus encode → TCP → ESP32 → Opus decode → I2S speaker
           32 KB/s               1.5-2 KB/s                        32 KB/s
```
## Packet Framing
### Current (PCM/μ-law)
- Fixed size chunks: 512 bytes μ-law = 32ms audio
- No frame length needed (fixed size)
### Opus (variable bitrate)
```
[2-byte length][Opus frame data]
```
- Length: uint16_t in bytes (network byte order)
- Frame: Compressed Opus frame
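- Target send size: ~1KB per TCP write, which holds roughly 500-670ms of audio @ 12-16 kbps
- Frame duration: Use 20ms frames (standard for voice); each frame keeps its own 2-byte length prefix, even when several frames are batched into one write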
### Example:
```
For 20ms @ 16kHz @ 12 kbps:
- PCM input: 20ms × 16000 Hz × 2 bytes = 640 bytes
- Opus output: ~30 bytes per 20ms frame
- Accumulate 32 frames (640ms) → ~960 bytes of Opus data + 64 bytes of length prefixes → send as one TCP write
```
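The arithmetic and framing above can be sketched in Python. This is a minimal illustration, not the project's code: the helper names `frame_sizes`, `pack_frames`, and `unpack_frames` are hypothetical, and the Opus byte count is the nominal size at a constant bitrate (real frames vary).

```python
import struct

SAMPLE_RATE = 16000  # Hz, from the plan
FRAME_MS = 20        # Opus frame duration, from the plan
BITRATE = 12000      # bits per second, from the plan


def frame_sizes(sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS, bitrate=BITRATE):
    """Return (pcm_bytes, nominal_opus_bytes) for one frame of 16-bit mono audio."""
    samples = sample_rate * frame_ms // 1000
    pcm_bytes = samples * 2                      # 2 bytes per 16-bit sample
    opus_bytes = bitrate * frame_ms // 1000 // 8
    return pcm_bytes, opus_bytes


def pack_frames(frames):
    """Prefix each frame with a big-endian uint16 length and concatenate them."""
    out = bytearray()
    for frame in frames:
        if len(frame) > 0xFFFF:
            raise ValueError("frame does not fit in a uint16 length prefix")
        out += struct.pack(">H", len(frame)) + frame
    return bytes(out)


def unpack_frames(buf):
    """Split a buffer of length-prefixed frames back into a list of frames."""
    frames, pos = [], 0
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        if pos + length > len(buf):
            raise ValueError("truncated frame")
        frames.append(buf[pos:pos + length])
        pos += length
    return frames
```

Batching 32 nominal frames this way yields 32 × (2 + 30) = 1024 bytes, which is where the ~1KB send size comes from.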
## Implementation Steps
### 1. Add Opus Library to ESP32 Firmware
**Library:** [sh123/esp32_opus_arduino](https://github.com/sh123/esp32_opus_arduino)
**Installation:**
```bash
# Option A: Arduino Library Manager
# Search for "esp32_opus" and install
# Option B: Manual (recommended for control)
cd ~/Arduino/libraries
git clone https://github.com/sh123/esp32_opus_arduino.git
```
**Or use PlatformIO:**
```ini
lib_deps =
    sh123/esp32_opus@^1.0.0
```
### 2. Modify ESP32 Firmware
**Changes to onjuino.ino:**
```cpp
#include <opus.h>

// Opus decoder state
OpusDecoder *opus_decoder = NULL;
const int OPUS_FRAME_SIZE = 320;        // 20ms @ 16kHz
const int OPUS_MAX_FRAME_BYTES = 4000;  // sanity limit on one encoded frame
int16_t opus_pcm_buffer[OPUS_FRAME_SIZE];
// Static buffer keeps up to 4 KB off the task stack
static uint8_t opus_frame[OPUS_MAX_FRAME_BYTES];

void setup() {
    // ... existing setup ...

    // Initialize Opus decoder
    int error;
    opus_decoder = opus_decoder_create(16000, 1, &error);  // 16kHz, mono
    if (error != OPUS_OK) {
        Serial.printf("Opus decoder create failed: %d\n", error);
        opus_decoder = NULL;
    } else {
        Serial.println("Opus decoder initialized");
    }
}

// In TCP handler (replacing current PCM reception)
void handleOpusAudio(WiFiClient &client) {
    if (opus_decoder == NULL) return;  // decoder failed to initialize

    while (client.connected()) {
        // Read 2-byte frame length
        if (client.available() < 2) {
            delay(1);
            continue;
        }
        uint8_t len_bytes[2];
        client.read(len_bytes, 2);
        uint16_t frame_len = (len_bytes[0] << 8) | len_bytes[1];  // network byte order

        // Sanity check
        if (frame_len > OPUS_MAX_FRAME_BYTES) {
            Serial.printf("Invalid frame length: %d\n", frame_len);
            break;
        }

        // Read Opus frame; stop waiting if the connection drops mid-frame
        size_t bytes_read = 0;
        while (bytes_read < frame_len && (client.connected() || client.available())) {
            int avail = client.available();
            if (avail > 0) {
                int to_read = min(avail, (int)(frame_len - bytes_read));
                int n = client.read(opus_frame + bytes_read, to_read);
                if (n > 0) bytes_read += n;  // read() returns -1 on error
            } else {
                delay(1);
            }
        }
        if (bytes_read < frame_len) break;  // incomplete frame, give up

        // Decode Opus frame
        int num_samples = opus_decode(
            opus_decoder,
            opus_frame, frame_len,
            opus_pcm_buffer, OPUS_FRAME_SIZE,
            0  // decode_fec (forward error correction)
        );
        if (num_samples < 0) {
            Serial.printf("Opus decode error: %d\n", num_samples);
            continue;
        }

        // Write to I2S (same as before, but from opus_pcm_buffer)
        // Convert to 32-bit and apply volume
        for (int i = 0; i < num_samples; i++) {
            wavData[totalSamplesRead++] = (int32_t)opus_pcm_buffer[i] << speaker_volume;
        }

        // Drain buffer when full (existing logic)
        // ...
    }
}
```
### 3. Server-Side Test Script
**test_opus_tts.py:**
```python
#!/usr/bin/env python3
"""
Test Opus-compressed TTS streaming to ESP32
"""
import socket
import struct

import opuslib
from pydub import AudioSegment

ESP32_IP = "192.168.68.97"
ESP32_PORT = 3001
WAV_FILE = "recording.wav"

# Opus settings
SAMPLE_RATE = 16000
CHANNELS = 1
FRAME_SIZE = 320  # 20ms @ 16kHz
BITRATE = 12000   # 12 kbps for voice
COMPRESSION_OPUS = 2  # header[5]: 0=PCM, 1=μ-law, 2=Opus


def main():
    # Load audio and convert to 16kHz mono 16-bit PCM
    audio = AudioSegment.from_wav(WAV_FILE)
    audio = audio.set_channels(CHANNELS)
    audio = audio.set_frame_rate(SAMPLE_RATE)
    audio = audio.set_sample_width(2)  # 16-bit
    pcm_data = audio.raw_data
    duration_s = len(pcm_data) / (SAMPLE_RATE * 2)
    print(f"Loaded {len(pcm_data)} bytes of PCM audio ({duration_s:.1f}s)")
    if not pcm_data:
        print("No audio to send")
        return

    # Initialize Opus encoder
    encoder = opuslib.Encoder(SAMPLE_RATE, CHANNELS, opuslib.APPLICATION_VOIP)
    encoder.bitrate = BITRATE

    # Connect to ESP32
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.connect((ESP32_IP, ESP32_PORT))
        print(f"Connected to {ESP32_IP}:{ESP32_PORT}")

        # Send header (0xAA command with Opus flag in the last byte)
        header = bytes([0xAA, 0x00, 60, 14, 5, COMPRESSION_OPUS])
        sock.sendall(header)
        print("Header sent")

        # Encode and send PCM in 20ms frames
        frame_bytes = FRAME_SIZE * 2  # 320 samples * 2 bytes
        total_opus_bytes = 0
        frame_count = 0
        for i in range(0, len(pcm_data), frame_bytes):
            pcm_frame = pcm_data[i:i + frame_bytes]
            # Pad last frame with silence if needed
            if len(pcm_frame) < frame_bytes:
                pcm_frame += b'\x00' * (frame_bytes - len(pcm_frame))

            # Encode to Opus
            opus_frame = encoder.encode(pcm_frame, FRAME_SIZE)

            # Send with big-endian uint16 length prefix; sendall avoids partial writes
            frame_len = len(opus_frame)
            sock.sendall(struct.pack('>H', frame_len) + opus_frame)
            total_opus_bytes += frame_len
            frame_count += 1
            if frame_count % 50 == 0:
                print(f"Sent {frame_count} frames, {total_opus_bytes:,} bytes")

    # Statistics
    compression_ratio = len(pcm_data) / total_opus_bytes
    print("\nRESULTS:")
    print(f"Original PCM: {len(pcm_data):,} bytes")
    print(f"Opus compressed: {total_opus_bytes:,} bytes")
    print(f"Compression: {compression_ratio:.1f}x")
    print(f"Bandwidth: {(total_opus_bytes * 8 / duration_s) / 1000:.1f} kbps")
    print(f"Frames sent: {frame_count}")


if __name__ == '__main__':
    main()
```
**Dependencies:**
```bash
pip install opuslib pydub
```
### 4. Modified Header Format
Add Opus flag to header to indicate compression type:
```cpp
/*
header[0] 0xAA for audio
header[1:2] mic timeout in seconds
header[3] volume
header[4] fade rate
header[5] compression type: 0=PCM, 1=μ-law, 2=Opus
*/
```
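On the server side, the 6-byte header could be built and checked with `struct`. This is a sketch under one assumption not stated in the plan: that the two mic-timeout bytes form a big-endian uint16, which matches the `0x00, 60` bytes the test script sends for a 60-second timeout. `build_header` and `parse_header` are hypothetical helper names.

```python
import struct

COMPRESSION_PCM = 0
COMPRESSION_ULAW = 1
COMPRESSION_OPUS = 2

# B = command byte, H = mic timeout (assumed big-endian uint16),
# B = volume, B = fade rate, B = compression type
HEADER_FORMAT = ">BHBBB"


def build_header(mic_timeout_s, volume, fade_rate, compression):
    """Pack the 6-byte audio header described above."""
    return struct.pack(HEADER_FORMAT, 0xAA, mic_timeout_s, volume, fade_rate, compression)


def parse_header(header):
    """Unpack a 6-byte audio header into a dict, validating the command byte."""
    if len(header) != struct.calcsize(HEADER_FORMAT):
        raise ValueError("header must be exactly 6 bytes")
    cmd, timeout, volume, fade_rate, compression = struct.unpack(HEADER_FORMAT, header)
    if cmd != 0xAA:
        raise ValueError("not an audio header")
    return {
        "mic_timeout_s": timeout,
        "volume": volume,
        "fade_rate": fade_rate,
        "compression": compression,
    }
```

For example, `build_header(60, 14, 5, COMPRESSION_OPUS)` produces the same bytes as the test script's header with the compression byte set to 2.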
## Testing Plan
1. **Install Opus library** on ESP32
2. **Compile and flash** modified firmware
3. **Run test_opus_tts.py** with recording.wav
4. **Verify audio playback** quality
5. **Measure compression ratio** and bandwidth usage
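Before flashing hardware, the wire format can be exercised against a mock receiver on the development machine. The sketch below is an illustration only: `recv_exact`, `receive_stream`, and `loopback_demo` are hypothetical names, the payloads are dummy bytes rather than real Opus frames, and the 4000-byte limit simply mirrors the firmware's sanity check.

```python
import socket
import struct

MAX_FRAME_BYTES = 4000  # same sanity limit as the firmware sketch


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closes first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return bytes(buf)


def receive_stream(sock):
    """Read the 6-byte header, then length-prefixed frames until the peer closes."""
    header = recv_exact(sock, 6)
    if header is None:
        raise ConnectionError("connection closed before header")
    frames = []
    while True:
        prefix = recv_exact(sock, 2)
        if prefix is None:
            break  # end of stream
        (length,) = struct.unpack(">H", prefix)
        if length > MAX_FRAME_BYTES:
            raise ValueError(f"invalid frame length: {length}")
        frame = recv_exact(sock, length)
        if frame is None:
            raise ConnectionError("connection closed mid-frame")
        frames.append(frame)
    return header, frames


def loopback_demo():
    """Send a header and two dummy frames over a local socket pair."""
    sender, receiver = socket.socketpair()
    with sender, receiver:
        sender.sendall(bytes([0xAA, 0x00, 60, 14, 5, 2]))
        for payload in (b"\x01" * 30, b"\x02" * 28):
            sender.sendall(struct.pack(">H", len(payload)) + payload)
        sender.shutdown(socket.SHUT_WR)  # signal end of stream
        return receive_stream(receiver)
```

The same `receive_stream` loop could sit behind a listening socket so that `test_opus_tts.py` can be pointed at the development machine instead of the ESP32.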
## Expected Results
### Bandwidth Comparison
```
Raw PCM: 256 kbps (32 KB/s)
μ-law: 128 kbps (16 KB/s) [2x compression]
Opus: 12-16 kbps (1.5-2 KB/s) [16-21x compression]
```
### WiFi Margin
```
Current: 553.9 kbps throughput / 128 kbps μ-law = 4.3x margin
With Opus: 553.9 kbps throughput / 15 kbps opus = 36.9x margin
```
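The ratios in both tables follow directly from the bitrates. A small Python check, assuming the 553.9 kbps throughput figure quoted above and 16-bit mono PCM at 16kHz (function names are illustrative):

```python
RAW_PCM_KBPS = 16000 * 16 / 1000   # 16kHz x 16-bit mono = 256 kbps
ULAW_KBPS = 16000 * 8 / 1000       # 8 bits per sample = 128 kbps
MEASURED_THROUGHPUT_KBPS = 553.9   # WiFi throughput figure from the plan


def compression_vs_pcm(codec_kbps):
    """How many times smaller the codec stream is than raw PCM."""
    return RAW_PCM_KBPS / codec_kbps


def wifi_margin(codec_kbps, throughput_kbps=MEASURED_THROUGHPUT_KBPS):
    """How many times the available throughput exceeds the stream bitrate."""
    return throughput_kbps / codec_kbps


if __name__ == "__main__":
    for name, kbps in [("μ-law", ULAW_KBPS), ("Opus 16k", 16), ("Opus 12k", 12)]:
        print(f"{name}: {compression_vs_pcm(kbps):.1f}x compression, "
              f"{wifi_margin(kbps):.1f}x margin")
```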
## Fallback Strategy
If Opus proves problematic:
1. **ADPCM**: 4x compression, simpler than Opus
2. **Lower sample rate**: 8kHz instead of 16kHz (2x savings)
3. **Variable bitrate μ-law**: Silence detection to skip packets
## Integration with ElevenLabs
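ElevenLabs may be able to output Opus directly. The `output_format` value below is a placeholder; confirm the exact identifier, sample rate, and container against the ElevenLabs API reference before relying on it: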
```python
audio_stream = client.text_to_speech.convert(
    voice_id=VOICE_ID,
    text=TEXT,
    model_id="eleven_monolingual_v1",
    output_format="opus_16000"  # Native Opus output!
)
```
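This avoids the extra server-side encode step (PCM → Opus on server). The stream would still need to be split into individual Opus frames to match the length-prefixed framing used by the firmware.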
## Memory Considerations
**ESP32-S3 with 2MB PSRAM:**
- Opus decoder: ~20 KB heap (use PSRAM)
- PCM buffer: 8KB (existing)
- Opus frame buffer: ~4KB max
- **Total overhead: ~24 KB** (negligible with 2MB PSRAM)
## CPU Usage
Expected: **10-20% of one core @ 240MHz** for Opus decoding at 16kHz mono.
This leaves plenty of headroom for:
- WiFi/TCP handling
- I2S audio output
- LED visualization
- Touch sensor processing
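The CPU estimate translates into a per-frame decode time that can be checked on the device, for example by timing `opus_decode` with `micros()`. A minimal sketch of that arithmetic, assuming 20ms frames (the function name is illustrative):

```python
FRAME_MS = 20  # Opus frame duration from the plan


def decode_budget_ms(cpu_fraction, frame_ms=FRAME_MS):
    """Decode time per frame that corresponds to a given share of one core."""
    return cpu_fraction * frame_ms


if __name__ == "__main__":
    low, high = decode_budget_ms(0.10), decode_budget_ms(0.20)
    print(f"Expected decode time: {low:.0f}-{high:.0f} ms per {FRAME_MS} ms frame")
```

If measured decode times stay within about 2-4 ms per frame, the 10-20% estimate holds.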
## Next Steps
1. ✅ Research Opus libraries (DONE)
2. ⬜ Install sh123/esp32_opus_arduino library
3. ⬜ Modify onjuino.ino with Opus decoder
4. ⬜ Create test_opus_tts.py script
5. ⬜ Test and validate
6. ⬜ Integrate with ElevenLabs native Opus output
7. ⬜ Update server.py to use Opus for all TTS