mirror of https://github.com/justLV/onju-v2 synced 2026-04-21 15:47:55 +00:00

justLV 0c9c75b3bf Replace webrtcvad with Silero VAD (ONNX, no PyTorch)

Switch from webrtcvad's binary is_speech to Silero VAD's calibrated
float probability via direct ONNX session calls with numpy. The LSTM
provides temporal smoothing natively, eliminating the sliding window
hack. Frame size changes from 480 (30ms) to 512 (32ms) end-to-end
to match Silero's requirements.

Consolidate pipeline/requirements.txt into root requirements.txt,
swap webrtcvad+setuptools for silero-vad+onnxruntime.

2026-02-07 17:00:02 -08:00


Opus Compression Implementation Plan

Overview

Add Opus decoding to the ESP32 so it can receive compressed TTS audio from the server over TCP, achieving roughly 16-21x compression over raw PCM (8-11x over the current μ-law stream) at 12-16 kbps.

Why Opus?

16-21x compression for 16kHz mono voice at 12-16 kbps (vs 2x for μ-law)
High quality - suitable for human listening even at low bitrates
Bandwidth target: 12-16 kbps (vs current 128 kbps with μ-law, 256 kbps raw)
WiFi margin: 4.3x → roughly 35-46x throughput margin (based on 553.9 kbps measured throughput)
Resource usage: an estimated 10-20% of one core and ~20 KB heap on ESP32-S3

Architecture

Current Flow (μ-law)

Server → [PCM 16kHz] → μ-law encode → TCP → ESP32 → μ-law decode → I2S speaker
         32 KB/s                       16 KB/s           32 KB/s

New Flow (Opus)

Server → [PCM 16kHz] → Opus encode → TCP → ESP32 → Opus decode → I2S speaker
         32 KB/s                      1.5-2 KB/s         32 KB/s

Packet Framing

Current (PCM/μ-law)

Fixed size chunks: 512 bytes μ-law = 32ms audio
No frame length needed (fixed size)

Opus (variable bitrate)

[2-byte length][Opus frame data]

Length: uint16_t in bytes (network byte order)
Frame: Compressed Opus frame
Target packet size: ~1KB of Opus data = roughly 500-670ms of audio @ 12-16 kbps
Frame duration: Use 20ms frames (standard for voice)
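
The length-prefixed framing above can be sketched on the server side as follows. This is a minimal illustration, not project code: `pack_frame`, `unpack_frames`, and `MAX_FRAME_LEN` are hypothetical names, and the 4000-byte limit is borrowed from the sanity check in the firmware sketch later in this plan.

```python
import struct

MAX_FRAME_LEN = 4000  # assumption: same upper bound as the firmware sanity check


def pack_frame(opus_frame: bytes) -> bytes:
    """Prefix one Opus frame with its length as a big-endian uint16."""
    if not 0 < len(opus_frame) <= MAX_FRAME_LEN:
        raise ValueError(f"frame length {len(opus_frame)} out of range")
    return struct.pack(">H", len(opus_frame)) + opus_frame


def unpack_frames(buffer: bytes):
    """Split a byte buffer into complete frames.

    Returns (frames, leftover), where leftover holds the bytes of a frame
    that has not fully arrived yet.
    """
    frames = []
    offset = 0
    while offset + 2 <= len(buffer):
        (length,) = struct.unpack_from(">H", buffer, offset)
        if offset + 2 + length > len(buffer):
            break  # payload not fully received yet
        frames.append(buffer[offset + 2 : offset + 2 + length])
        offset += 2 + length
    return frames, buffer[offset:]
```

Because TCP is a byte stream, a receiver may see a frame split across reads; returning the leftover bytes lets the caller prepend them to the next read.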

Example:

For 20ms @ 16kHz @ 12 kbps:
- PCM input: 20ms × 16000 Hz × 2 bytes = 640 bytes
- Opus output: ~30 bytes per 20ms frame
- Accumulate 32 frames (640ms) → ~960 bytes of Opus data → send as one packet (each frame keeps its own 2-byte length prefix)
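
The arithmetic in the example can be reproduced for other frame durations and bitrates. The helper below is a hypothetical sketch (`frame_sizes` is not part of the project); it treats the Opus frame size as the average implied by the target bitrate, whereas real frames vary in size.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM
LENGTH_PREFIX_BYTES = 2  # uint16 length prefix per frame


def frame_sizes(frame_ms: int, bitrate_bps: int, frames_per_packet: int) -> dict:
    """Return the byte counts implied by a frame duration and target bitrate."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    pcm_bytes = samples * BYTES_PER_SAMPLE
    opus_bytes = bitrate_bps * frame_ms // 8000  # average bytes per frame
    return {
        "samples_per_frame": samples,
        "pcm_bytes_per_frame": pcm_bytes,
        "opus_bytes_per_frame": opus_bytes,
        "packet_ms": frame_ms * frames_per_packet,
        "packet_opus_bytes": opus_bytes * frames_per_packet,
        "packet_bytes_with_prefixes": (opus_bytes + LENGTH_PREFIX_BYTES) * frames_per_packet,
    }


for key, value in frame_sizes(frame_ms=20, bitrate_bps=12_000, frames_per_packet=32).items():
    print(f"{key:28} {value}")
```

With the per-frame length prefix included, a 32-frame batch at 12 kbps comes to about 1 KB on the wire.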

Implementation Steps

1. Add Opus Library to ESP32 Firmware

Library: sh123/esp32_opus_arduino

Installation:

# Option A: Arduino Library Manager
# Search for "esp32_opus" and install

# Option B: Manual (recommended for control)
cd ~/Arduino/libraries
git clone https://github.com/sh123/esp32_opus_arduino.git

Or use PlatformIO:

lib_deps =
    sh123/esp32_opus@^1.0.0

2. Modify ESP32 Firmware

Changes to onjuino.ino:

#include <opus.h>

// Opus decoder state
OpusDecoder *opus_decoder = NULL;
const int OPUS_FRAME_SIZE = 320;  // 20ms @ 16kHz
int16_t opus_pcm_buffer[OPUS_FRAME_SIZE];

void setup() {
    // ... existing setup ...

    // Initialize Opus decoder
    int error;
    opus_decoder = opus_decoder_create(16000, 1, &error);  // 16kHz, mono
    if (error != OPUS_OK) {
        Serial.printf("Opus decoder create failed: %d\n", error);
    } else {
        Serial.println("Opus decoder initialized");
    }
}

// In TCP handler (replacing current PCM reception)
const size_t OPUS_MAX_PACKET = 4000;         // upper bound enforced by the sanity check
static uint8_t opus_frame[OPUS_MAX_PACKET];  // static buffer avoids a large variable-length array on the stack

void handleOpusAudio(WiFiClient &client) {
    if (opus_decoder == NULL) {
        return;  // decoder creation failed in setup()
    }

    while (client.connected()) {
        // Read 2-byte frame length
        if (client.available() < 2) {
            delay(1);
            continue;
        }

        uint8_t len_bytes[2];
        client.read(len_bytes, 2);
        uint16_t frame_len = ((uint16_t)len_bytes[0] << 8) | len_bytes[1];  // network byte order

        // Sanity check
        if (frame_len == 0 || frame_len > OPUS_MAX_PACKET) {
            Serial.printf("Invalid frame length: %u\n", (unsigned)frame_len);
            break;
        }

        // Read Opus frame; stop waiting if the connection drops mid-frame
        size_t bytes_read = 0;
        while (bytes_read < frame_len && (client.connected() || client.available() > 0)) {
            int avail = client.available();
            if (avail > 0) {
                int to_read = min(avail, (int)(frame_len - bytes_read));
                int n = client.read(opus_frame + bytes_read, to_read);
                if (n > 0) {
                    bytes_read += n;
                }
            } else {
                delay(1);
            }
        }
        if (bytes_read < frame_len) {
            break;  // connection closed before the full frame arrived
        }

        // Decode Opus frame
        int num_samples = opus_decode(
            opus_decoder,
            opus_frame, frame_len,
            opus_pcm_buffer, OPUS_FRAME_SIZE,
            0  // decode_fec (forward error correction)
        );

        if (num_samples < 0) {
            Serial.printf("Opus decode error: %d\n", num_samples);
            continue;
        }

        // Write to I2S (same as before, but from opus_pcm_buffer)
        // Convert to 32-bit and apply volume
        for (int i = 0; i < num_samples; i++) {
            wavData[totalSamplesRead++] = (int32_t)opus_pcm_buffer[i] << speaker_volume;
        }

        // Drain buffer when full (existing logic)
        // ...
    }
}

3. Server-Side Test Script

test_opus_tts.py:

#!/usr/bin/env python3
"""
Test Opus-compressed TTS streaming to ESP32
"""
import socket
import struct
from pydub import AudioSegment
import opuslib

ESP32_IP = "192.168.68.97"
ESP32_PORT = 3001
WAV_FILE = "recording.wav"

# Opus settings
SAMPLE_RATE = 16000
CHANNELS = 1
FRAME_SIZE = 320  # 20ms @ 16kHz
BITRATE = 12000   # 12 kbps for voice

def main():
    # Load audio
    audio = AudioSegment.from_wav(WAV_FILE)
    audio = audio.set_channels(CHANNELS)
    audio = audio.set_frame_rate(SAMPLE_RATE)
    audio = audio.set_sample_width(2)  # 16-bit
    pcm_data = audio.raw_data

    print(f"Loaded {len(pcm_data)} bytes of PCM audio ({len(pcm_data)/32000:.1f}s)")

    # Initialize Opus encoder
    encoder = opuslib.Encoder(SAMPLE_RATE, CHANNELS, opuslib.APPLICATION_VOIP)
    encoder.bitrate = BITRATE

    # Connect to ESP32
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((ESP32_IP, ESP32_PORT))
    print(f"Connected to {ESP32_IP}:{ESP32_PORT}")

    # Send header (0xAA command with Opus flag)
    # Layout: 0xAA, mic timeout (2 bytes, 60 s), volume 14, fade rate 5, compression 2 = Opus
    header = bytes([0xAA, 0x00, 60, 14, 5, 2])
    sock.sendall(header)
    print("Header sent")

    # Encode and send PCM in 20ms frames
    frame_bytes = FRAME_SIZE * 2  # 320 samples * 2 bytes
    total_opus_bytes = 0
    frame_count = 0

    for i in range(0, len(pcm_data), frame_bytes):
        pcm_frame = pcm_data[i:i+frame_bytes]

        # Pad last frame if needed
        if len(pcm_frame) < frame_bytes:
            pcm_frame += b'\x00' * (frame_bytes - len(pcm_frame))

        # Encode to Opus
        opus_frame = encoder.encode(pcm_frame, FRAME_SIZE)

        # Send with length prefix (sendall retries until every byte is written)
        frame_len = len(opus_frame)
        sock.sendall(struct.pack('>H', frame_len) + opus_frame)  # Big-endian uint16 + payload

        total_opus_bytes += frame_len
        frame_count += 1

        if frame_count % 50 == 0:
            print(f"Sent {frame_count} frames, {total_opus_bytes:,} bytes")

    sock.close()

    # Statistics
    compression_ratio = len(pcm_data) / total_opus_bytes
    print(f"\nRESULTS:")
    print(f"Original PCM:     {len(pcm_data):,} bytes")
    print(f"Opus compressed:  {total_opus_bytes:,} bytes")
    print(f"Compression:      {compression_ratio:.1f}x")
    print(f"Bandwidth:        {(total_opus_bytes * 8 / (len(pcm_data)/32000)) / 1000:.1f} kbps")
    print(f"Frames sent:      {frame_count}")

if __name__ == '__main__':
    main()

Dependencies:

pip install opuslib pydub

4. Modified Header Format

Add Opus flag to header to indicate compression type:

/*
header[0]   0xAA for audio
header[1:2] mic timeout in seconds
header[3]   volume
header[4]   fade rate
header[5]   compression type: 0=PCM, 1=μ-law, 2=Opus
*/
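
On the server side, the 6-byte header could be built and parsed with `struct` rather than hand-assembled byte lists. The sketch below is illustrative only: `build_audio_header` and `parse_audio_header` are hypothetical names, and the big-endian byte order for the 2-byte mic timeout is an assumption inferred from the test script, which sends `0x00, 60` for a 60-second timeout.

```python
import struct

COMPRESSION_PCM = 0
COMPRESSION_ULAW = 1
COMPRESSION_OPUS = 2

# Assumed layout: command byte, uint16 mic timeout (big-endian), volume,
# fade rate, compression type. ">" disables padding, so this is 6 bytes.
HEADER_FORMAT = ">BHBBB"


def build_audio_header(mic_timeout_s: int, volume: int, fade_rate: int,
                       compression: int) -> bytes:
    """Pack the audio header described in the comment block above."""
    return struct.pack(HEADER_FORMAT, 0xAA, mic_timeout_s, volume, fade_rate, compression)


def parse_audio_header(header: bytes) -> dict:
    """Unpack a 6-byte audio header into named fields."""
    command, mic_timeout_s, volume, fade_rate, compression = struct.unpack(
        HEADER_FORMAT, header
    )
    if command != 0xAA:
        raise ValueError(f"not an audio header: 0x{command:02X}")
    return {
        "mic_timeout_s": mic_timeout_s,
        "volume": volume,
        "fade_rate": fade_rate,
        "compression": compression,
    }
```

For example, `build_audio_header(60, 14, 5, COMPRESSION_OPUS)` yields the same bytes as the test script's header with the compression byte set to 2.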

Testing Plan

Install Opus library on ESP32
Compile and flash modified firmware
Run test_opus_tts.py with recording.wav
Verify audio playback quality
Measure compression ratio and bandwidth usage
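
Before flashing hardware, the framing can be checked against a stand-in receiver on the development machine. The sketch below is a hypothetical helper, not part of the project: `read_exact`, `receive_stream`, and `serve_once` are illustrative names, and it assumes the 6-byte header and 2-byte big-endian length prefix described in this plan. It validates framing only and does not decode audio.

```python
import socket
import struct

HEADER_LEN = 6
MAX_FRAME_LEN = 4000  # assumption: same limit as the firmware sanity check


def read_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; return fewer only if the peer closed the connection."""
    data = bytearray()
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            break
        data.extend(chunk)
    return bytes(data)


def receive_stream(conn: socket.socket):
    """Read the header, then length-prefixed frames until the sender closes."""
    header = read_exact(conn, HEADER_LEN)
    if len(header) < HEADER_LEN or header[0] != 0xAA:
        raise ValueError("missing or invalid audio header")
    frames = []
    while True:
        prefix = read_exact(conn, 2)
        if len(prefix) < 2:
            break  # clean end of stream
        (length,) = struct.unpack(">H", prefix)
        if length == 0 or length > MAX_FRAME_LEN:
            raise ValueError(f"invalid frame length: {length}")
        frame = read_exact(conn, length)
        if len(frame) < length:
            raise ValueError("stream ended mid-frame")
        frames.append(frame)
    return header, frames


def serve_once(port: int = 3001):
    """Accept one connection on the given port and report what was received."""
    with socket.create_server(("0.0.0.0", port)) as server:
        conn, _addr = server.accept()
        with conn:
            header, frames = receive_stream(conn)
    total = sum(len(f) for f in frames)
    print(f"compression type {header[5]}, {len(frames)} frames, {total} bytes")
    return header, frames
```

Calling `serve_once()` in one terminal and pointing `ESP32_IP` in the test script at `127.0.0.1` should report the same frame count that the script prints.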

Expected Results

Bandwidth Comparison

Raw PCM:      256 kbps (32 KB/s)
μ-law:        128 kbps (16 KB/s)  [2x compression]
Opus:         12-16 kbps (1.5-2 KB/s)  [16-21x compression]

WiFi Margin

Current:      553.9 kbps throughput / 128 kbps μ-law = 4.3x margin
With Opus:    553.9 kbps throughput / 15 kbps Opus = 36.9x margin
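
The ratios above follow directly from the quoted bitrates and can be recomputed if the measured throughput changes. A small sketch, using only the figures from this section (the function and constant names are illustrative):

```python
MEASURED_THROUGHPUT_KBPS = 553.9  # WiFi throughput figure quoted above
RAW_PCM_KBPS = 256.0              # 16 kHz x 16-bit mono

STREAMS_KBPS = {
    "Raw PCM": 256.0,
    "mu-law": 128.0,
    "Opus (16 kbps)": 16.0,
    "Opus (15 kbps)": 15.0,
    "Opus (12 kbps)": 12.0,
}


def compression_ratio(stream_kbps: float) -> float:
    """Compression relative to raw 16-bit PCM at 16 kHz."""
    return RAW_PCM_KBPS / stream_kbps


def wifi_margin(stream_kbps: float) -> float:
    """How many times the stream bitrate fits into the measured throughput."""
    return MEASURED_THROUGHPUT_KBPS / stream_kbps


for name, kbps in STREAMS_KBPS.items():
    print(f"{name:<16}{kbps:>6.0f} kbps  "
          f"{compression_ratio(kbps):5.1f}x compression  "
          f"{wifi_margin(kbps):5.1f}x margin")
```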

Fallback Strategy

If Opus proves problematic:

ADPCM: 4x compression, simpler than Opus
Lower sample rate: 8kHz instead of 16kHz (2x savings)
Variable bitrate μ-law: Silence detection to skip packets

Integration with ElevenLabs

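ElevenLabs can output Opus directly (confirm the exact output_format value against the current API documentation; the one below is a placeholder):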

audio_stream = client.text_to_speech.convert(
    voice_id=VOICE_ID,
    text=TEXT,
    model_id="eleven_monolingual_v1",
    output_format="opus_16000"  # Native Opus output!
)

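This would avoid the server-side encode step (PCM → Opus), provided the returned stream can be split into raw Opus frames for the length-prefixed framing described above.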

Memory Considerations

ESP32-S3 with 2MB PSRAM:

Opus decoder: ~20 KB heap (use PSRAM)
PCM buffer: 8KB (existing)
Opus frame buffer: ~4KB max
Total overhead: ~24 KB (negligible with 2MB PSRAM)

CPU Usage

Expected: 10-20% of one core @ 240MHz for Opus decoding at 16kHz mono.

This leaves plenty of headroom for:

WiFi/TCP handling
I2S audio output
LED visualization
Touch sensor processing

Next Steps

✅ Research Opus libraries (DONE)
⬜ Install sh123/esp32_opus_arduino library
⬜ Modify onjuino.ino with Opus decoder
⬜ Create test_opus_tts.py script
⬜ Test and validate
⬜ Integrate with ElevenLabs native Opus output
⬜ Update server.py to use Opus for all TTS
