mirror of
https://github.com/justLV/onju-v2
synced 2026-04-21 15:47:55 +00:00
Switch from webrtcvad's binary is_speech to Silero VAD's calibrated float probability via direct ONNX session calls with numpy. The LSTM provides temporal smoothing natively, eliminating the sliding window hack. Frame size changes from 480 (30ms) to 512 (32ms) end-to-end to match Silero's requirements. Consolidate pipeline/requirements.txt into root requirements.txt, swap webrtcvad+setuptools for silero-vad+onnxruntime.
334 lines
8.7 KiB
Markdown
334 lines
8.7 KiB
Markdown
# Opus Compression Implementation Plan
|
||
|
||
## Overview
|
||
Add Opus decoding to ESP32 for receiving compressed TTS audio from server over TCP, achieving ~10x compression over raw PCM (or 5x over current μ-law).
|
||
|
||
## Why Opus?
|
||
- **10-16x compression** for 16kHz mono voice (vs 2x for μ-law)
|
||
- **High quality** - suitable for human listening (unlike μ-law)
|
||
- **Bandwidth target**: 12-16 kbps (vs current 128 kbps with μ-law, 256 kbps raw)
|
||
- **WiFi margin**: 4.4x → 20-30x throughput margin
|
||
- **Resource usage**: ~20% CPU, 15-20 KB heap on ESP32-S3
|
||
|
||
## Architecture
|
||
|
||
### Current Flow (μ-law)
|
||
```
|
||
Server → [PCM 16kHz] → μ-law encode → TCP → ESP32 → μ-law decode → I2S speaker
|
||
32 KB/s 16 KB/s 32 KB/s
|
||
```
|
||
|
||
### New Flow (Opus)
|
||
```
|
||
Server → [PCM 16kHz] → Opus encode → TCP → ESP32 → Opus decode → I2S speaker
|
||
32 KB/s 1.5-2 KB/s 32 KB/s
|
||
```
|
||
|
||
## Packet Framing
|
||
|
||
### Current (PCM/μ-law)
|
||
- Fixed size chunks: 512 bytes μ-law = 32ms audio
|
||
- No frame length needed (fixed size)
|
||
|
||
### Opus (variable bitrate)
|
||
```
|
||
[2-byte length][Opus frame data]
|
||
```
|
||
|
||
- Length: uint16_t in bytes (network byte order)
|
||
- Frame: Compressed Opus frame
|
||
- Target frame size: ~1KB raw Opus data = 320-640ms of audio @ 12-16 kbps
|
||
- Frame duration: Use 20ms frames (standard for voice)
|
||
|
||
### Example:
|
||
```
|
||
For 20ms @ 16kHz @ 12 kbps:
|
||
- PCM input: 20ms × 16000 Hz × 2 bytes = 640 bytes
|
||
- Opus output: ~30 bytes per 20ms frame
|
||
- Accumulate 32 frames (640ms) → ~960 bytes → send as one packet
|
||
```
|
||
|
||
## Implementation Steps
|
||
|
||
### 1. Add Opus Library to ESP32 Firmware
|
||
|
||
**Library:** [sh123/esp32_opus_arduino](https://github.com/sh123/esp32_opus_arduino)
|
||
|
||
**Installation:**
|
||
```bash
|
||
# Option A: Arduino Library Manager
|
||
# Search for "esp32_opus" and install
|
||
|
||
# Option B: Manual (recommended for control)
|
||
cd ~/Arduino/libraries
|
||
git clone https://github.com/sh123/esp32_opus_arduino.git
|
||
```
|
||
|
||
**Or use PlatformIO:**
|
||
```ini
|
||
lib_deps =
|
||
sh123/esp32_opus@^1.0.0
|
||
```
|
||
|
||
### 2. Modify ESP32 Firmware
|
||
|
||
**Changes to onjuino.ino:**
|
||
|
||
```cpp
|
||
#include <opus.h>
|
||
|
||
// Opus decoder state
|
||
OpusDecoder *opus_decoder = NULL;
|
||
const int OPUS_FRAME_SIZE = 320; // 20ms @ 16kHz
|
||
int16_t opus_pcm_buffer[OPUS_FRAME_SIZE];
|
||
|
||
void setup() {
|
||
// ... existing setup ...
|
||
|
||
// Initialize Opus decoder
|
||
int error;
|
||
opus_decoder = opus_decoder_create(16000, 1, &error); // 16kHz, mono
|
||
if (error != OPUS_OK) {
|
||
Serial.printf("Opus decoder create failed: %d\n", error);
|
||
} else {
|
||
Serial.println("Opus decoder initialized");
|
||
}
|
||
}
|
||
|
||
// In TCP handler (replacing current PCM reception)
|
||
void handleOpusAudio(WiFiClient &client) {
|
||
while (client.connected()) {
|
||
// Read 2-byte frame length
|
||
if (client.available() < 2) {
|
||
delay(1);
|
||
continue;
|
||
}
|
||
|
||
uint8_t len_bytes[2];
|
||
client.read(len_bytes, 2);
|
||
uint16_t frame_len = (len_bytes[0] << 8) | len_bytes[1];
|
||
|
||
// Sanity check
|
||
if (frame_len > 4000) {
|
||
Serial.printf("Invalid frame length: %d\n", frame_len);
|
||
break;
|
||
}
|
||
|
||
// Read Opus frame
|
||
uint8_t opus_frame[frame_len];
|
||
size_t bytes_read = 0;
|
||
while (bytes_read < frame_len) {
|
||
int avail = client.available();
|
||
if (avail > 0) {
|
||
int to_read = min(avail, (int)(frame_len - bytes_read));
|
||
bytes_read += client.read(opus_frame + bytes_read, to_read);
|
||
} else {
|
||
delay(1);
|
||
}
|
||
}
|
||
|
||
// Decode Opus frame
|
||
int num_samples = opus_decode(
|
||
opus_decoder,
|
||
opus_frame, frame_len,
|
||
opus_pcm_buffer, OPUS_FRAME_SIZE,
|
||
0 // decode_fec (forward error correction)
|
||
);
|
||
|
||
if (num_samples < 0) {
|
||
Serial.printf("Opus decode error: %d\n", num_samples);
|
||
continue;
|
||
}
|
||
|
||
// Write to I2S (same as before, but from opus_pcm_buffer)
|
||
// Convert to 32-bit and apply volume
|
||
for (int i = 0; i < num_samples; i++) {
|
||
wavData[totalSamplesRead++] = (int32_t)opus_pcm_buffer[i] << speaker_volume;
|
||
}
|
||
|
||
// Drain buffer when full (existing logic)
|
||
// ...
|
||
}
|
||
}
|
||
```
|
||
|
||
### 3. Server-Side Test Script
|
||
|
||
**test_opus_tts.py:**
|
||
|
||
```python
|
||
#!/usr/bin/env python3
|
||
"""
|
||
Test Opus-compressed TTS streaming to ESP32
|
||
"""
|
||
import socket
|
||
import struct
|
||
from pydub import AudioSegment
|
||
import opuslib
|
||
|
||
ESP32_IP = "192.168.68.97"
|
||
ESP32_PORT = 3001
|
||
WAV_FILE = "recording.wav"
|
||
|
||
# Opus settings
|
||
SAMPLE_RATE = 16000
|
||
CHANNELS = 1
|
||
FRAME_SIZE = 320 # 20ms @ 16kHz
|
||
BITRATE = 12000 # 12 kbps for voice
|
||
|
||
def main():
|
||
# Load audio
|
||
audio = AudioSegment.from_wav(WAV_FILE)
|
||
audio = audio.set_channels(CHANNELS)
|
||
audio = audio.set_frame_rate(SAMPLE_RATE)
|
||
audio = audio.set_sample_width(2) # 16-bit
|
||
pcm_data = audio.raw_data
|
||
|
||
print(f"Loaded {len(pcm_data)} bytes of PCM audio ({len(pcm_data)/32000:.1f}s)")
|
||
|
||
# Initialize Opus encoder
|
||
encoder = opuslib.Encoder(SAMPLE_RATE, CHANNELS, opuslib.APPLICATION_VOIP)
|
||
encoder.bitrate = BITRATE
|
||
|
||
# Connect to ESP32
|
||
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||
sock.connect((ESP32_IP, ESP32_PORT))
|
||
print(f"Connected to {ESP32_IP}:{ESP32_PORT}")
|
||
|
||
# Send header (0xAA command with Opus flag)
|
||
header = bytes([0xAA, 0x00, 60, 14, 5, 0])
|
||
sock.send(header)
|
||
print("Header sent")
|
||
|
||
# Encode and send PCM in 20ms frames
|
||
frame_bytes = FRAME_SIZE * 2 # 320 samples * 2 bytes
|
||
total_opus_bytes = 0
|
||
frame_count = 0
|
||
|
||
for i in range(0, len(pcm_data), frame_bytes):
|
||
pcm_frame = pcm_data[i:i+frame_bytes]
|
||
|
||
# Pad last frame if needed
|
||
if len(pcm_frame) < frame_bytes:
|
||
pcm_frame += b'\x00' * (frame_bytes - len(pcm_frame))
|
||
|
||
# Encode to Opus
|
||
opus_frame = encoder.encode(pcm_frame, FRAME_SIZE)
|
||
|
||
# Send with length prefix
|
||
frame_len = len(opus_frame)
|
||
sock.send(struct.pack('>H', frame_len)) # Big-endian uint16
|
||
sock.send(opus_frame)
|
||
|
||
total_opus_bytes += frame_len
|
||
frame_count += 1
|
||
|
||
if frame_count % 50 == 0:
|
||
print(f"Sent {frame_count} frames, {total_opus_bytes:,} bytes")
|
||
|
||
sock.close()
|
||
|
||
# Statistics
|
||
compression_ratio = len(pcm_data) / total_opus_bytes
|
||
print(f"\nRESULTS:")
|
||
print(f"Original PCM: {len(pcm_data):,} bytes")
|
||
print(f"Opus compressed: {total_opus_bytes:,} bytes")
|
||
print(f"Compression: {compression_ratio:.1f}x")
|
||
print(f"Bandwidth: {(total_opus_bytes * 8 / (len(pcm_data)/32000)) / 1000:.1f} kbps")
|
||
print(f"Frames sent: {frame_count}")
|
||
|
||
if __name__ == '__main__':
|
||
main()
|
||
```
|
||
|
||
**Dependencies:**
|
||
```bash
|
||
pip install opuslib pydub
|
||
```
|
||
|
||
### 4. Modified Header Format
|
||
|
||
Add Opus flag to header to indicate compression type:
|
||
|
||
```cpp
|
||
/*
|
||
header[0] 0xAA for audio
|
||
header[1:2] mic timeout in seconds
|
||
header[3] volume
|
||
header[4] fade rate
|
||
header[5] compression type: 0=PCM, 1=μ-law, 2=Opus
|
||
*/
|
||
```
|
||
|
||
## Testing Plan
|
||
|
||
1. **Install Opus library** on ESP32
|
||
2. **Compile and flash** modified firmware
|
||
3. **Run test_opus_tts.py** with recording.wav
|
||
4. **Verify audio playback** quality
|
||
5. **Measure compression ratio** and bandwidth usage
|
||
|
||
## Expected Results
|
||
|
||
### Bandwidth Comparison
|
||
```
|
||
Raw PCM: 256 kbps (32 KB/s)
|
||
μ-law: 128 kbps (16 KB/s) [2x compression]
|
||
Opus: 12-16 kbps (1.5-2 KB/s) [16-21x compression]
|
||
```
|
||
|
||
### WiFi Margin
|
||
```
|
||
Current: 553.9 kbps throughput / 128 kbps μ-law = 4.3x margin
|
||
With Opus: 553.9 kbps throughput / 15 kbps opus = 36.9x margin
|
||
```
|
||
|
||
## Fallback Strategy
|
||
|
||
If Opus proves problematic:
|
||
1. **ADPCM**: 4x compression, simpler than Opus
|
||
2. **Lower sample rate**: 8kHz instead of 16kHz (2x savings)
|
||
3. **Variable bitrate μ-law**: Silence detection to skip packets
|
||
|
||
## Integration with ElevenLabs
|
||
|
||
ElevenLabs can output Opus directly:
|
||
|
||
```python
|
||
audio_stream = client.text_to_speech.convert(
|
||
voice_id=VOICE_ID,
|
||
text=TEXT,
|
||
model_id="eleven_monolingual_v1",
|
||
output_format="opus_16000" # Native Opus output!
|
||
)
|
||
```
|
||
|
||
This avoids double-encoding (PCM → Opus on server).
|
||
|
||
## Memory Considerations
|
||
|
||
**ESP32-S3 with 2MB PSRAM:**
|
||
- Opus decoder: ~20 KB heap (use PSRAM)
|
||
- PCM buffer: 8KB (existing)
|
||
- Opus frame buffer: ~4KB max
|
||
- **Total overhead: ~24 KB** (negligible with 2MB PSRAM)
|
||
|
||
## CPU Usage
|
||
|
||
Expected: **10-20% of one core @ 240MHz** for Opus decoding at 16kHz mono.
|
||
|
||
This leaves plenty of headroom for:
|
||
- WiFi/TCP handling
|
||
- I2S audio output
|
||
- LED visualization
|
||
- Touch sensor processing
|
||
|
||
## Next Steps
|
||
|
||
1. ✅ Research Opus libraries (DONE)
|
||
2. ⬜ Install sh123/esp32_opus_arduino library
|
||
3. ⬜ Modify onjuino.ino with Opus decoder
|
||
4. ⬜ Create test_opus_tts.py script
|
||
5. ⬜ Test and validate
|
||
6. ⬜ Integrate with ElevenLabs native Opus output
|
||
7. ⬜ Update server.py to use Opus for all TTS
|