mirror of https://github.com/justLV/onju-v2 synced 2026-04-21 15:47:55 +00:00

justLV 0c9c75b3bf Replace webrtcvad with Silero VAD (ONNX, no PyTorch)

Switch from webrtcvad's binary is_speech to Silero VAD's calibrated
float probability via direct ONNX session calls with numpy. The LSTM
provides temporal smoothing natively, eliminating the sliding window
hack. Frame size changes from 480 (30ms) to 512 (32ms) end-to-end
to match Silero's requirements.

Consolidate pipeline/requirements.txt into root requirements.txt,
swap webrtcvad+setuptools for silero-vad+onnxruntime.

2026-02-07 17:00:02 -08:00


ESP32 Audio Streaming Test Guide

Quick Start

1. Configure ESP32 Settings

In onjuino/onjuino.ino, adjust these settings (lines 78-84):

#define USE_COMPRESSION true        // Enable μ-law compression (2x bandwidth reduction)
#define USE_LOCAL_VAD true          // Enable local VAD to sleep when silent
#define VAD_RMS_THRESHOLD 3000      // RMS threshold to detect voice (tune based on your mic)
#define VAD_SILENCE_FRAMES 100      // Frames of silence before sleep (100 * 32ms = 3.2 seconds)
#define VAD_WAKEUP_FRAMES 2         // Frames of voice to wake up (2 * 32ms = 64ms)
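
How the three VAD settings interact can be sketched in Python. This is an illustrative model, not the firmware's code: the class name `RmsVad` and its `update` method are hypothetical, and the exact counting rules in onjuino.ino may differ.

```python
import numpy as np

VAD_RMS_THRESHOLD = 3000   # same defaults as the #defines above
VAD_SILENCE_FRAMES = 100
VAD_WAKEUP_FRAMES = 2


def frame_rms(frame):
    """RMS of one frame of int16 samples."""
    x = np.asarray(frame, dtype=np.float64)
    return float(np.sqrt(np.mean(x * x)))


class RmsVad:
    """Hypothetical model: wake after N loud frames, sleep after M quiet ones."""

    def __init__(self, threshold=VAD_RMS_THRESHOLD,
                 silence_frames=VAD_SILENCE_FRAMES,
                 wakeup_frames=VAD_WAKEUP_FRAMES):
        self.threshold = threshold
        self.silence_frames = silence_frames
        self.wakeup_frames = wakeup_frames
        self.awake = False
        self.loud = 0    # consecutive frames at or above the threshold
        self.quiet = 0   # consecutive frames below the threshold

    def update(self, frame):
        """Feed one frame; return True if it would be transmitted."""
        if frame_rms(frame) >= self.threshold:
            self.loud += 1
            self.quiet = 0
            if not self.awake and self.loud >= self.wakeup_frames:
                self.awake = True
        else:
            self.quiet += 1
            self.loud = 0
            if self.awake and self.quiet >= self.silence_frames:
                self.awake = False
        return self.awake
```

With the defaults, two consecutive frames above the threshold (64 ms) start transmission, and 100 consecutive frames below it (3.2 seconds) stop it.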

Testing configurations:

Test	USE_COMPRESSION	USE_LOCAL_VAD	Expected Bandwidth
Baseline	false	false	~32 KB/s (256 kbit/s) continuous
Compression only	true	false	~16 KB/s (128 kbit/s) continuous
VAD only	false	true	~32 KB/s when talking
Both (optimal)	true	true	~16 KB/s when talking
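
The bandwidth figures follow from the audio format: the 32 and 16 in the table are payload kilobytes per second. A quick check, assuming 16 kHz mono audio (512 samples per 32 ms frame), 2 bytes per raw sample and 1 byte per μ-law sample; UDP/IP header overhead is ignored and the function name is illustrative:

```python
SAMPLE_RATE_HZ = 16_000   # assumed: 512 samples / 0.032 s
FRAME_SAMPLES = 512


def stream_rate(bytes_per_sample):
    """Return (packet_bytes, bytes_per_second, kbit_per_second) while streaming."""
    packet_bytes = FRAME_SAMPLES * bytes_per_sample
    bytes_per_second = SAMPLE_RATE_HZ * bytes_per_sample
    kbit_per_second = bytes_per_second * 8 / 1000
    return packet_bytes, bytes_per_second, kbit_per_second


print("raw PCM:", stream_rate(2))   # 1024-byte packets
print("mu-law: ", stream_rate(1))   # 512-byte packets
```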

2. Flash ESP32

# Open Arduino IDE, select your board, upload onjuino/onjuino.ino

3. Run Test Receiver on Mac

Install dependencies:

pip3 install numpy

Run receiver:

# Auto-detect compression mode
python3 test_mic_receiver.py --duration 10 --output test.wav

# Or specify if you know the mode
python3 test_mic_receiver.py --compressed --duration 10 --output test_compressed.wav

4. Analyze Results

The receiver shows real-time stats (the figure it labels kbps is payload kilobytes per second, so 16 corresponds to about 128 kbit/s):

[    5.1s] Packets: 167 | Bandwidth: 15.8 kbps | RMS: 2847 | Mode: μ-law

After recording, you'll see:

Recording complete!
Duration:          10.02 seconds
WAV file size:     320.6 KB
Bytes transmitted: 160.3 KB
Compression ratio: 0.50x
Average bandwidth: 15.9 kbps
Packets received:  334
Packet loss:       0.0%

Tuning VAD Threshold

The VAD_RMS_THRESHOLD value depends on your microphone sensitivity and ambient noise:

Test ambient noise:

# Record silence, watch RMS values
python3 test_mic_receiver.py --duration 5

Note the RMS during silence (e.g., 500-1000)

Test speaking:

# Record yourself talking, watch RMS values
python3 test_mic_receiver.py --duration 5

Note the RMS while speaking (e.g., 3000-8000)

Set threshold between them:

// If silence = 800, speech = 4000, set threshold around 2000-2500
#define VAD_RMS_THRESHOLD 2500
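
The same tuning can be done from saved recordings instead of watching the live RMS values. The sketch below is one possible approach, not part of the repository: the helper names are hypothetical, it assumes 16-bit mono WAV files (for example saved with `--output silence.wav` and `--output speech.wav`), and it assumes the RMS computed here is comparable to the value the firmware checks against VAD_RMS_THRESHOLD.

```python
import wave

import numpy as np


def load_wav_int16(path):
    """Read a 16-bit mono WAV file into an int16 array."""
    with wave.open(path, "rb") as wf:
        return np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)


def frame_rms_values(samples, frame_size=512):
    """RMS of each complete frame (512 samples = 32 ms at 16 kHz)."""
    x = np.asarray(samples, dtype=np.float64)
    n = len(x) // frame_size
    frames = x[: n * frame_size].reshape(n, frame_size)
    return np.sqrt(np.mean(frames ** 2, axis=1))


def suggest_threshold(silence_rms, speech_rms):
    """Midpoint between the top of the noise floor and typical speech level."""
    quiet = float(np.percentile(silence_rms, 95))
    loud = float(np.median(speech_rms))
    if loud <= quiet:
        raise ValueError("speech is not louder than silence; check the recordings")
    return int(round((quiet + loud) / 2))


# Example usage with two recordings (file names are placeholders):
# silence = frame_rms_values(load_wav_int16("silence.wav"))
# speech = frame_rms_values(load_wav_int16("speech.wav"))
# print("#define VAD_RMS_THRESHOLD", suggest_threshold(silence, speech))
```

For silence around 800 and speech around 4000 the midpoint is 2400, which falls inside the 2000-2500 range suggested above.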

Bandwidth Comparison

Configuration	Bandwidth	Power Saving	Audio Quality
Raw PCM, always on	32 KB/s	None	Lossless
μ-law, always on	16 KB/s	None	Good (telephony quality)
Raw PCM, VAD	~10 KB/s avg*	Moderate	Lossless
μ-law, VAD	~5 KB/s avg*	High	Good

*Assuming 30% voice activity (typical conversation)
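
The VAD averages are the streaming rate scaled by the fraction of time the device transmits. The sketch below reproduces them and adds an optional correction that is an assumption, not something measured in this guide: if the device keeps sending during the VAD_SILENCE_FRAMES countdown (3.2 seconds with the defaults), every pause long enough to trigger sleep adds that much transmitted silence. The function and parameter names are illustrative.

```python
def average_rate(active_rate, voice_activity=0.30,
                 pauses_per_minute=0.0, hangover_s=3.2):
    """Average rate, in the same unit as active_rate.

    voice_activity:    fraction of time someone is speaking
    pauses_per_minute: pauses long enough to put the device to sleep
    hangover_s:        silence assumed to be sent before each sleep
                       (VAD_SILENCE_FRAMES * 32 ms)
    """
    duty = voice_activity + pauses_per_minute * hangover_s / 60.0
    return active_rate * min(duty, 1.0)


print(average_rate(32))                        # raw PCM, table assumption
print(average_rate(16))                        # mu-law, table assumption
print(average_rate(16, pauses_per_minute=6))   # mu-law with hangover included
```

Under this model the real average sits somewhat above the table's figures whenever speech is broken into many short utterances.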

Compression Quality Check

Listen to the output WAV files:

# Mac built-in player
afplay test.wav
afplay test_compressed.wav

# Compare side-by-side

μ-law quality should be:

✅ Clear speech
✅ Good for voice recognition (Whisper handles it well)
⚠️ Slightly muffled compared to raw PCM
⚠️ Not suitable for music

Troubleshooting

No packets received:

Check ESP32 Serial output for IP address
Verify ESP32 and Mac are on same network
Check firewall settings
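
To separate network problems from receiver problems, a bare UDP probe that only reports who is sending and how large the packets are can help. This is a minimal sketch using the standard socket module; the function names and the port number 3000 are placeholders, so substitute the port your firmware actually sends to.

```python
import socket


def probe_udp(sock, max_packets=5, timeout_s=5.0):
    """Collect (sender_ip, payload_length) for up to max_packets datagrams."""
    sock.settimeout(timeout_s)
    seen = []
    try:
        while len(seen) < max_packets:
            data, addr = sock.recvfrom(2048)
            seen.append((addr[0], len(data)))
    except socket.timeout:
        pass  # nothing more arrived within the timeout
    return seen


def listen(port=3000):
    """Bind on all interfaces and print what arrives (port is a placeholder)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        packets = probe_udp(sock)
        if not packets:
            print("No packets: check IP address, network, and firewall")
        for ip, size in packets:
            print(f"{size} bytes from {ip}")
```

Calling `listen()` while the ESP32 is streaming should print 512-byte packets in μ-law mode or 1024-byte packets in raw mode; an empty result points at the network or firewall rather than the receiver script.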

High packet loss:

Check WiFi signal strength
Increase VAD_SILENCE_FRAMES so the stream stays active longer between utterances
Try raw PCM mode first (simpler debugging)

VAD not working:

Adjust VAD_RMS_THRESHOLD (see tuning section above)
Check Serial monitor for "VAD: Woke up" / "VAD: Sleeping" messages
Set USE_LOCAL_VAD to false to test without VAD

Compression artifacts:

μ-law is lossy - some quality loss is normal
If the loss is unacceptable, set USE_COMPRESSION to false
Or try ADPCM (4x compression, better quality - future work)

Next Steps

Once basic UDP streaming is working:

Integrate with your existing server.py VAD pipeline
Update server to handle compressed packets
Consider WebSocket for playback direction
Add streaming TTS for lower latency

Server Integration

Update server/server.py to handle compression:

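import numpy as np

# Add μ-law decode table (same as test receiver): one PCM value per μ-law byte
ULAW_DECODE_TABLE = np.array([...])

def decode_ulaw(ulaw_bytes):
    """Map each μ-law byte to its PCM sample via table lookup."""
    return ULAW_DECODE_TABLE[np.frombuffer(ulaw_bytes, dtype=np.uint8)]

# In listen_detect function:
data, addr = sock.recvfrom(2048)

if len(data) == 512:     # Compressed: 512 samples x 1 byte (μ-law)
    samples = decode_ulaw(data)
elif len(data) == 1024:  # Raw: 512 samples x 2 bytes (int16 PCM)
    samples = np.frombuffer(data, dtype=np.int16)
else:
    samples = None       # Unexpected packet size: skip this packet

# Continue with existing VAD pipeline...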
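
The decode table is elided above and should be copied from the test receiver. If the firmware uses standard G.711 μ-law (an assumption; this guide does not say which variant it encodes), an equivalent table can also be generated rather than typed out. The function name below is illustrative.

```python
import numpy as np


def build_ulaw_decode_table():
    """Return a 256-entry int16 table mapping G.711 mu-law bytes to PCM."""
    u = np.arange(256, dtype=np.int32) ^ 0xFF   # bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    # 0x84 is the G.711 bias added before encoding and removed after decoding
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return np.where(sign != 0, -magnitude, magnitude).astype(np.int16)


table = build_ulaw_decode_table()
samples = table[np.frombuffer(b"\xff\x00\x80", dtype=np.uint8)]
print(samples)   # silence, most negative, most positive
```

Comparing this output against the receiver's table is a quick way to confirm both ends agree on the encoding before wiring it into the server.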
