onju-v2/TESTING.md
justLV 0c9c75b3bf Replace webrtcvad with Silero VAD (ONNX, no PyTorch)
Switch from webrtcvad's binary is_speech to Silero VAD's calibrated
float probability via direct ONNX session calls with numpy. The LSTM
provides temporal smoothing natively, eliminating the sliding window
hack. Frame size changes from 480 (30ms) to 512 (32ms) end-to-end
to match Silero's requirements.

Consolidate pipeline/requirements.txt into root requirements.txt,
swap webrtcvad+setuptools for silero-vad+onnxruntime.
2026-02-07 17:00:02 -08:00

ESP32 Audio Streaming Test Guide

Quick Start

1. Configure ESP32 Settings

In onjuino/onjuino.ino, adjust these settings (lines 78-84):

#define USE_COMPRESSION true        // Enable μ-law compression (2x bandwidth reduction)
#define USE_LOCAL_VAD true          // Enable local VAD to sleep when silent
#define VAD_RMS_THRESHOLD 3000      // RMS threshold to detect voice (tune based on your mic)
#define VAD_SILENCE_FRAMES 100      // Frames of silence before sleep (100 * 32ms = 3.2 seconds)
#define VAD_WAKEUP_FRAMES 2         // Frames of voice to wake up (2 * 32ms = 64ms)
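To make the interaction of the three VAD settings concrete, here is a minimal Python simulation of one plausible wake/sleep state machine. It is only a sketch: it assumes the firmware counts consecutive frames above or below the RMS threshold, which may not match the actual logic in onjuino.ino, and the class name `LocalVadSketch` is hypothetical.

```python
import numpy as np

# Values mirror the #define settings above.
VAD_RMS_THRESHOLD = 3000
VAD_SILENCE_FRAMES = 100
VAD_WAKEUP_FRAMES = 2


class LocalVadSketch:
    """Hypothetical model of an RMS-threshold VAD (not the firmware's code)."""

    def __init__(self, threshold=VAD_RMS_THRESHOLD,
                 silence_frames=VAD_SILENCE_FRAMES,
                 wakeup_frames=VAD_WAKEUP_FRAMES):
        self.threshold = threshold
        self.silence_frames = silence_frames
        self.wakeup_frames = wakeup_frames
        self.awake = False
        self.loud_run = 0   # consecutive frames at or above the threshold
        self.quiet_run = 0  # consecutive frames below the threshold

    def update(self, frame):
        """Feed one frame of int16 samples; return True while streaming."""
        rms = float(np.sqrt(np.mean(frame.astype(np.float64) ** 2)))
        if rms >= self.threshold:
            self.loud_run += 1
            self.quiet_run = 0
            if not self.awake and self.loud_run >= self.wakeup_frames:
                self.awake = True
        else:
            self.quiet_run += 1
            self.loud_run = 0
            if self.awake and self.quiet_run >= self.silence_frames:
                self.awake = False
        return self.awake
```

With the defaults, two loud frames in a row (64 ms) start streaming, and 100 quiet frames in a row (3.2 s) stop it.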

Testing configurations:

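| Test | USE_COMPRESSION | USE_LOCAL_VAD | Expected Bandwidth |
|---|---|---|---|
| Baseline | false | false | ~32 KB/s continuous |
| Compression only | true | false | ~16 KB/s continuous |
| VAD only | false | true | ~32 KB/s when talking |
| Both (optimal) | true | true | ~16 KB/s when talking |

Bandwidth is in kilobytes per second: 16 kHz × 2 bytes per sample = 32 KB/s for raw PCM, halved by μ-law. The test receiver prints the same quantity labeled "kbps".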

2. Flash ESP32

# Open Arduino IDE, select your board, upload onjuino/onjuino.ino

3. Run Test Receiver on Mac

Install dependencies:

pip3 install numpy

Run receiver:

# Auto-detect compression mode
python3 test_mic_receiver.py --duration 10 --output test.wav

# Or specify if you know the mode
python3 test_mic_receiver.py --compressed --duration 10 --output test_compressed.wav

4. Analyze Results

The receiver will show real-time stats:

[    5.1s] Packets: 167 | Bandwidth: 15.8 kbps | RMS: 2847 | Mode: μ-law

After recording, you'll see:

Recording complete!
Duration:          10.02 seconds
WAV file size:     320.6 KB
Bytes transmitted: 160.3 KB
Compression ratio: 0.50x
Average bandwidth: 15.9 kbps
Packets received:  334
Packet loss:       0.0%

Tuning VAD Threshold

The VAD_RMS_THRESHOLD value depends on your microphone sensitivity and ambient noise:

  1. Test ambient noise:

    # Record silence, watch RMS values
    python3 test_mic_receiver.py --duration 5
    

    Note the RMS during silence (e.g., 500-1000)

  2. Test speaking:

    # Record yourself talking, watch RMS values
    python3 test_mic_receiver.py --duration 5
    

    Note the RMS while speaking (e.g., 3000-8000)

  3. Set threshold between them:

    // If silence = 800, speech = 4000, set threshold around 2000-2500
    #define VAD_RMS_THRESHOLD 2500
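The three tuning steps can also be run offline on saved recordings. The sketch below assumes mono 16-bit WAV files and 512-sample frames, and it assumes the firmware computes RMS the same way. The helper names and the midpoint heuristic are illustrative choices, not part of the project.

```python
import wave

import numpy as np


def load_wav(path):
    """Read a mono 16-bit WAV file into an int16 array (assumed format)."""
    with wave.open(path, "rb") as wf:
        return np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)


def frame_rms(samples, frame_size=512):
    """RMS of each complete frame; a trailing partial frame is dropped."""
    n = len(samples) // frame_size
    frames = samples[: n * frame_size].astype(np.float64).reshape(n, frame_size)
    return np.sqrt(np.mean(frames ** 2, axis=1))


def suggest_threshold(silence_samples, speech_samples, frame_size=512):
    """Midpoint between a high percentile of noise RMS and median speech RMS."""
    noise = np.percentile(frame_rms(silence_samples, frame_size), 95)
    voice = np.median(frame_rms(speech_samples, frame_size))
    return int((noise + voice) / 2)
```

For example, `suggest_threshold(load_wav("silence.wav"), load_wav("speech.wav"))`, where both file names are placeholders for your own recordings. Treat the result as a starting point and confirm it on the device.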
    

Bandwidth Comparison

| Configuration | Bandwidth | Power Saving | Audio Quality |
|---|---|---|---|
| Raw PCM, always on | 32 KB/s | None | Perfect |
| μ-law, always on | 16 KB/s | None | Good (telephony quality) |
| Raw PCM, VAD | ~10 KB/s avg* | Moderate | Perfect |
| μ-law, VAD | ~5 KB/s avg* | High | Good |

*Assuming 30% voice activity (typical conversation)
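The figures in this table follow from simple arithmetic, sketched below. It assumes a 16 kHz sample rate (512 samples per 32 ms frame), 2 bytes per raw sample and 1 byte per μ-law sample, and it ignores UDP/IP header overhead. The function name is hypothetical.

```python
SAMPLE_RATE_HZ = 16000  # assumed: 512 samples per 32 ms frame


def expected_bandwidth(compressed, voice_activity=1.0,
                       sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (kilobytes/s, kilobits/s) of audio payload.

    voice_activity is the fraction of time the device is streaming:
    1.0 for always-on, about 0.3 for the VAD rows in the table.
    """
    bytes_per_sample = 1 if compressed else 2  # mu-law vs 16-bit PCM
    bytes_per_sec = sample_rate_hz * bytes_per_sample * voice_activity
    return bytes_per_sec / 1000, bytes_per_sec * 8 / 1000
```

Raw PCM always-on comes out at 32 kilobytes per second, which is 256 kilobits per second. With μ-law and 30% voice activity the average is about 4.8 kilobytes per second, which the table rounds to 5.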

Compression Quality Check

Listen to the output WAV files:

# Mac built-in player
afplay test.wav
afplay test_compressed.wav

# Compare side-by-side

μ-law quality should be:

  • Clear speech
  • Good for voice recognition (Whisper handles it well)
  • ⚠️ Slightly muffled compared to raw PCM
  • ⚠️ Not suitable for music

Troubleshooting

No packets received:

  • Check ESP32 Serial output for IP address
  • Verify ESP32 and Mac are on same network
  • Check firewall settings
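To separate network problems from receiver problems, a bare UDP listener can confirm whether any datagram reaches the Mac at all. This is a minimal sketch: the function name is hypothetical, and the port in the usage example is a placeholder that must be replaced with the port the firmware actually sends to.

```python
import socket


def wait_for_packet(port, timeout_s=5.0, host="0.0.0.0"):
    """Wait for one UDP datagram; return (length, sender_addr) or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        sock.settimeout(timeout_s)
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            return None
        return len(data), addr


if __name__ == "__main__":
    PORT = 12345  # placeholder: use the port configured in the firmware
    result = wait_for_packet(PORT)
    if result is None:
        print("No packets within timeout: check IP, network, and firewall")
    else:
        print(f"Got {result[0]} bytes from {result[1][0]}")
```

If this listener sees packets but the test receiver does not, the problem is more likely in the receiver's arguments than in the network.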

High packet loss:

  • Check WiFi signal strength
  • Increase VAD_SILENCE_FRAMES so the device keeps streaming longer before it sleeps
  • Try raw PCM mode first (simpler debugging)

VAD not working:

  • Adjust VAD_RMS_THRESHOLD (see tuning section above)
  • Check Serial monitor for "VAD: Woke up" / "VAD: Sleeping" messages
  • Set USE_LOCAL_VAD false to test without VAD

Compression artifacts:

  • μ-law is lossy - some quality loss is normal
  • If unacceptable, use USE_COMPRESSION false
  • Or try ADPCM (4x compression, better quality - future work)

Next Steps

Once basic UDP streaming is working:

  1. Integrate with your existing server.py VAD pipeline
  2. Update server to handle compressed packets
  3. Consider WebSocket for playback direction
  4. Add streaming TTS for lower latency

Server Integration

Update server/server.py to handle compression:

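import numpy as np

# Add μ-law decode table (same as test receiver): one entry per μ-law byte value
ULAW_DECODE_TABLE = np.array([...])

def decode_ulaw(ulaw_bytes):
    """Map each μ-law byte to its PCM sample via table lookup."""
    return ULAW_DECODE_TABLE[np.frombuffer(ulaw_bytes, dtype=np.uint8)]

# In listen_detect function:
data, addr = sock.recvfrom(2048)

if len(data) == 512:  # Compressed: 512 samples x 1 byte (μ-law)
    samples = decode_ulaw(data)
elif len(data) == 1024:  # Raw: 512 samples x 2 bytes (int16 PCM)
    samples = np.frombuffer(data, dtype=np.int16)
else:
    samples = None  # Unexpected packet size: skip this packet

# Continue with existing VAD pipeline (only when samples is not None)...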
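The `[...]` placeholder above stands for a 256-entry lookup table. If the firmware encodes with standard G.711 μ-law (an assumption; verify against the table in the test receiver), the table can be generated rather than typed out. The function name below is hypothetical.

```python
import numpy as np


def build_ulaw_decode_table():
    """Build a 256-entry int16 table using the standard G.711 mu-law expansion."""
    # mu-law bytes are stored bit-inverted; undo that first.
    u = (~np.arange(256, dtype=np.uint8)).astype(np.int32)
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    # 0x84 (132) is the G.711 bias added before encoding.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return np.where(sign != 0, -magnitude, magnitude).astype(np.int16)


ULAW_DECODE_TABLE = build_ulaw_decode_table()
```

Under this encoding, byte `0xFF` decodes to 0, and bytes `0x00` and `0x80` decode to the extremes -32124 and +32124. If decoded audio sounds distorted, the firmware may be using a different variant.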