# ESP32 Audio Streaming Test Guide

## Quick Start

### 1. Configure ESP32 Settings

In `onjuino/onjuino.ino`, adjust these settings (lines 78-84):

```cpp
#define USE_COMPRESSION true      // Enable μ-law compression (2x bandwidth reduction)
#define USE_LOCAL_VAD true        // Enable local VAD to sleep when silent
#define VAD_RMS_THRESHOLD 3000    // RMS threshold to detect voice (tune based on your mic)
#define VAD_SILENCE_FRAMES 100    // Frames of silence before sleep (100 * 32ms = 3.2 seconds)
#define VAD_WAKEUP_FRAMES 2       // Frames of voice to wake up (2 * 32ms = 64ms)
```
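To make the three VAD settings concrete, here is a minimal Python sketch of the kind of hysteresis they imply. It is an illustration under assumptions, not the firmware's code: the class name `RmsVadGate`, its `update` method, and the exact counting rules (consecutive frames, `>=` comparison) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RmsVadGate:
    """Hypothetical RMS gate mirroring the three firmware settings above."""
    rms_threshold: float = 3000.0   # VAD_RMS_THRESHOLD
    silence_frames: int = 100       # VAD_SILENCE_FRAMES
    wakeup_frames: int = 2          # VAD_WAKEUP_FRAMES
    awake: bool = False
    _loud: int = 0                  # consecutive loud frames while asleep
    _quiet: int = 0                 # consecutive quiet frames while awake

    def update(self, frame_rms: float) -> bool:
        """Feed one frame's RMS; return True if the frame should be sent."""
        loud = frame_rms >= self.rms_threshold
        if self.awake:
            # Any loud frame resets the silence counter.
            self._quiet = 0 if loud else self._quiet + 1
            if self._quiet >= self.silence_frames:
                self.awake, self._quiet = False, 0
        else:
            # Wake only after enough loud frames in a row.
            self._loud = self._loud + 1 if loud else 0
            if self._loud >= self.wakeup_frames:
                self.awake, self._loud = True, 0
        return self.awake
```

With the defaults, two loud frames in a row (64 ms) wake the gate, and 100 quiet frames in a row (3.2 s) put it back to sleep.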

**Testing configurations:**

| Test | USE_COMPRESSION | USE_LOCAL_VAD | Expected Bandwidth |
|------|----------------|---------------|-------------------|
| Baseline | false | false | ~32 kB/s continuous |
| Compression only | true | false | ~16 kB/s continuous |
| VAD only | false | true | ~32 kB/s when talking |
| Both (optimal) | true | true | ~16 kB/s when talking |

Bandwidth is audio payload in kilobytes per second: a raw packet carries 1024 bytes and a μ-law packet 512 bytes per 32 ms frame.
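The figures in the table correspond to audio payload in kilobytes per second, which can be checked with a little arithmetic. The sketch below assumes 512 samples per 32 ms frame (taken from the frame comments and packet sizes in this guide) and ignores UDP/IP header overhead; the function name `payload_rate` is made up for illustration.

```python
FRAME_SAMPLES = 512      # samples per frame (assumed from the 32 ms frame comments)
FRAME_SECONDS = 0.032    # 32 ms per frame


def payload_rate(bytes_per_sample: int) -> tuple[float, float]:
    """Return (kilobytes per second, kilobits per second) of audio payload."""
    bytes_per_second = FRAME_SAMPLES * bytes_per_sample / FRAME_SECONDS
    return bytes_per_second / 1000, bytes_per_second * 8 / 1000


raw_kBps, raw_kbit = payload_rate(2)    # 16-bit PCM: 2 bytes per sample
ulaw_kBps, ulaw_kbit = payload_rate(1)  # 8-bit μ-law: 1 byte per sample
print(f"raw PCM: {raw_kBps:.0f} kB/s ({raw_kbit:.0f} kbit/s)")
print(f"μ-law:   {ulaw_kBps:.0f} kB/s ({ulaw_kbit:.0f} kbit/s)")
```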

### 2. Flash ESP32

```bash
# Open Arduino IDE, select your board, upload onjuino/onjuino.ino
```

### 3. Run Test Receiver on Mac

Install dependencies:
```bash
pip3 install numpy
```

Run receiver:
```bash
# Auto-detect compression mode
python3 test_mic_receiver.py --duration 10 --output test.wav

# Or specify if you know the mode
python3 test_mic_receiver.py --compressed --duration 10 --output test_compressed.wav
```

### 4. Analyze Results

The receiver will show real-time stats:
```
[ 5.1s] Packets: 167 | Bandwidth: 15.8 kbps | RMS: 2847 | Mode: μ-law
```

After recording, you'll see:
```
Recording complete!
Duration: 10.02 seconds
WAV file size: 320.6 KB
Bytes transmitted: 160.3 KB
Compression ratio: 0.50x
Average bandwidth: 15.9 kbps
Packets received: 334
Packet loss: 0.0%
```

Judging by these figures, the value labeled kbps is in kilobytes per second: 160.3 KB transmitted in 10.02 seconds is about 16 KB/s.

## Tuning VAD Threshold

The `VAD_RMS_THRESHOLD` value depends on your microphone sensitivity and ambient noise:

1. **Test ambient noise:**
   ```bash
   # Record silence, watch RMS values
   python3 test_mic_receiver.py --duration 5
   ```
   Note the RMS during silence (e.g., 500-1000)

2. **Test speaking:**
   ```bash
   # Record yourself talking, watch RMS values
   python3 test_mic_receiver.py --duration 5
   ```
   Note the RMS while speaking (e.g., 3000-8000)

3. **Set threshold between them:**
   ```cpp
   // If silence = 800, speech = 4000, set threshold around 2000-2500
   #define VAD_RMS_THRESHOLD 2500
   ```
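The three steps above can be sketched in Python. This is only one simple heuristic, assuming you have noted a typical silence RMS and a typical speech RMS from the receiver output; the helper names `frame_rms` and `suggest_threshold` are hypothetical and not part of `test_mic_receiver.py`.

```python
import numpy as np


def frame_rms(samples: np.ndarray) -> float:
    """RMS of one frame of int16 samples (computed in float to avoid overflow)."""
    x = samples.astype(np.float64)
    return float(np.sqrt(np.mean(x * x)))


def suggest_threshold(silence_rms: float, speech_rms: float) -> int:
    """Midpoint between the two levels; one simple heuristic, not the only choice."""
    if speech_rms <= silence_rms:
        raise ValueError("speech RMS must exceed silence RMS")
    return int(round((silence_rms + speech_rms) / 2))


print(suggest_threshold(800, 4000))  # 2400
```

For the example values in step 3 this gives 2400, inside the suggested 2000-2500 range. A lower value makes the device wake more readily; a higher value rejects more background noise.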

## Bandwidth Comparison

| Configuration | Bandwidth | Power Saving | Audio Quality |
|--------------|-----------|--------------|---------------|
| Raw PCM, always on | 32 kB/s | None | Perfect |
| μ-law, always on | 16 kB/s | None | Good (telephony quality) |
| Raw PCM, VAD | ~10 kB/s avg* | Moderate | Perfect |
| μ-law, VAD | ~5 kB/s avg* | High | Good |

*Assuming 30% voice activity (typical conversation)
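The starred averages are the continuous rate scaled by the fraction of time someone is speaking. A minimal sketch of that arithmetic follows; the function name `average_rate` is hypothetical, and the estimate ignores the extra frames sent during the `VAD_SILENCE_FRAMES` hangover after each utterance, so real averages will be somewhat higher.

```python
def average_rate(continuous_rate: float, voice_activity: float) -> float:
    """Average rate when packets are only sent during speech."""
    if not 0.0 <= voice_activity <= 1.0:
        raise ValueError("voice_activity must be between 0 and 1")
    return continuous_rate * voice_activity


print(average_rate(32, 0.30))  # raw PCM with VAD, same unit as the input rate
print(average_rate(16, 0.30))  # μ-law with VAD
```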

## Compression Quality Check

Listen to the output WAV files:
```bash
# Mac built-in player: play both recordings back to back to compare
afplay test.wav
afplay test_compressed.wav
```

μ-law quality should be:
- ✅ Clear speech
- ✅ Good for voice recognition (Whisper handles it well)
- ⚠️ Slightly muffled compared to raw PCM
- ⚠️ Not suitable for music
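Besides listening, the loss can be estimated numerically. The sketch below uses the textbook continuous μ-law companding formula with μ = 255 and a uniform 8-bit quantizer, which approximates but is not identical to the segmented G.711 encoding a firmware implementation would typically use. The 16 kHz sample rate and the 440 Hz test tone are assumptions chosen for illustration.

```python
import numpy as np

MU = 255.0


def ulaw_compress(x: np.ndarray) -> np.ndarray:
    """Continuous μ-law companding of samples in [-1, 1]."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)


def ulaw_expand(y: np.ndarray) -> np.ndarray:
    """Inverse of ulaw_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU


def roundtrip_snr_db(x: np.ndarray) -> float:
    """Quantize the companded signal to 8 bits and report SNR in dB."""
    codes = np.round((ulaw_compress(x) + 1.0) / 2.0 * 255.0)   # 256 levels
    decoded = ulaw_expand(codes / 255.0 * 2.0 - 1.0)
    noise = x - decoded
    return float(10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2)))


t = np.arange(16000) / 16000.0                 # one second at 16 kHz (assumed rate)
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)     # 440 Hz test tone
print(f"SNR after 8-bit μ-law round trip: {roundtrip_snr_db(tone):.1f} dB")
```

A signal-to-noise ratio in the high 30s of dB is typical for 8-bit μ-law, which is consistent with "telephony quality" speech and audible artifacts on music.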

## Troubleshooting

**No packets received:**
- Check ESP32 Serial output for IP address
- Verify ESP32 and Mac are on same network
- Check firewall settings
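To separate network problems from receiver problems, a bare UDP listener can confirm whether any packets reach the Mac at all. This is a minimal sketch with hypothetical helper names; the port is deliberately left as a parameter because this guide does not state which one the firmware sends to.

```python
import socket


def open_listener(port: int, host: str = "0.0.0.0") -> socket.socket:
    """Bind a UDP socket; pass the port your firmware sends to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def wait_for_packet(sock: socket.socket, timeout: float = 5.0):
    """Return (payload size, sender address) or None if nothing arrives in time."""
    sock.settimeout(timeout)
    try:
        data, addr = sock.recvfrom(2048)
    except socket.timeout:
        return None
    return len(data), addr
```

Call `open_listener` with your port, then `wait_for_packet` on the returned socket. A result of `None` points at the network, firewall, or the ESP32's target address; a payload size of 512 or 1024 bytes means packets are arriving and the issue is in the receiver.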

**High packet loss:**
- Check WiFi signal strength
- Increase `VAD_SILENCE_FRAMES` so the device stays awake longer and the connection stays active
- Try raw PCM mode first (simpler debugging)

**VAD not working:**
- Adjust `VAD_RMS_THRESHOLD` (see tuning section above)
- Check Serial monitor for "VAD: Woke up" / "VAD: Sleeping" messages
- Set `USE_LOCAL_VAD false` to test without VAD

**Compression artifacts:**
- μ-law is lossy - some quality loss is normal
- If unacceptable, use `USE_COMPRESSION false`
- Or try ADPCM (4x compression, better quality - future work)

## Next Steps

Once basic UDP streaming is working:
1. Integrate with your existing `server.py` VAD pipeline
2. Update server to handle compressed packets
3. Consider WebSocket for playback direction
4. Add streaming TTS for lower latency

## Server Integration

Update `server/server.py` to handle compression:

```python
import numpy as np

# Add μ-law decode table (same as test receiver)
ULAW_DECODE_TABLE = np.array([...])  # one entry per 8-bit μ-law code, elided here

def decode_ulaw(ulaw_bytes):
    # Map each μ-law byte to its decoded sample via table lookup
    return ULAW_DECODE_TABLE[np.frombuffer(ulaw_bytes, dtype=np.uint8)]

# In listen_detect function:
data, addr = sock.recvfrom(2048)

if len(data) == 512:      # Compressed: 512 μ-law bytes
    samples = decode_ulaw(data)
elif len(data) == 1024:   # Raw: 512 int16 samples
    samples = np.frombuffer(data, dtype=np.int16)

# Continue with existing VAD pipeline...
```
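The decode table is elided in the snippet above, and the project's own table lives in the test receiver. For reference, here is a self-contained sketch that builds a table using the standard G.711 μ-law expansion and dispatches on packet size. It assumes the firmware emits standard G.711 μ-law bytes and little-endian PCM; `build_ulaw_decode_table` and `decode_packet` are hypothetical names, not functions from `server.py`.

```python
import numpy as np

ULAW_BIAS = 0x84  # bias used by G.711 μ-law


def build_ulaw_decode_table() -> np.ndarray:
    """Build a 256-entry int16 table following the G.711 μ-law expansion."""
    table = np.zeros(256, dtype=np.int16)
    for code in range(256):
        u = ~code & 0xFF                    # μ-law bytes are stored inverted
        exponent = (u >> 4) & 0x07
        mantissa = u & 0x0F
        magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
        table[code] = -magnitude if u & 0x80 else magnitude
    return table


ULAW_DECODE_TABLE = build_ulaw_decode_table()


def decode_packet(data: bytes) -> np.ndarray:
    """Return int16 samples, guessing the encoding from the packet size."""
    if len(data) == 512:                    # one byte per sample: μ-law
        return ULAW_DECODE_TABLE[np.frombuffer(data, dtype=np.uint8)]
    if len(data) == 1024:                   # two bytes per sample: raw PCM
        return np.frombuffer(data, dtype="<i2")
    raise ValueError(f"unexpected packet size: {len(data)} bytes")
```

Raising on an unexpected size makes truncated or foreign packets visible instead of leaving `samples` undefined. Inside a receive loop you may prefer to log and skip such packets.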