Mirror of https://github.com/justLV/onju-v2 — synced 2026-04-21 07:37:34 +00:00

Prepare repo for v2 release: rewrite README, clean up dev scripts, embed ASR server

- Rewrite README with v2 features (OpenClaw, M5 Echo, Opus, pluggable backends), fold ARCHITECTURE.md and PIPELINE.md content inline
- Remove dev-only test scripts (streaming TTS, UDP recv, qwen3 bench, etc.)
- Remove redundant m5_echo/flash.sh and terminal.py (root scripts handle both)
- Consolidate credentials to .template naming, remove .example
- Embed parakeet-mlx ASR server as optional dependency (pipeline/services/asr_server.py)
- Default LLM to Claude Haiku 4.5 via OpenRouter, local example uses Gemma 4 E4B
- Update pyproject.toml with metadata, bump to 2.0.0
- Clean up .gitignore

Parent: 81452009d7
Commit: 398f89dca7

21 changed files with 330 additions and 1897 deletions
### .gitignore (vendored) — 17 changed lines

```gitignore
@@ -9,25 +9,22 @@ __pycache__/
*.egg-info/
dist/
build/
.venv/

# macOS
.DS_Store
.AppleDouble
.LSOverride

# Audio test files
recording.wav

# Local development
server/local.py
# Secrets & local config
pipeline/config.yaml
.venv/
m5_echo/credentials.h

# Runtime data
# Runtime data & test output
data/
logs/
*.wav
!data/.gitkeep

# Claude
.claude/
m5_echo/credentials.h
m5_echo/mic_test.wav
m5_echo/build/
```
### ARCHITECTURE.md — 177 lines deleted

@@ -1,177 +0,0 @@
# Onju Voice Architecture

## System Overview

ESP32-S3 voice assistant with bidirectional audio streaming over WiFi to a server running speech recognition and text-to-speech.

```
┌─────────────────────────────────────────────────────────────┐
│                          ESP32-S3                           │
│  ┌──────────┐    ┌─────────┐    ┌──────────┐   ┌─────────┐  │
│  │   Mic    │───→│ I2S RX  │───→│  μ-law   │──→│   UDP   │  │
│  │ (INMP441)│    │  16kHz  │    │  encode  │   │  3000   │  │
│  └──────────┘    └─────────┘    └──────────┘   └────┬────┘  │
│                                                     │       │
│  ┌──────────┐    ┌─────────┐    ┌──────────┐   ┌────▼────┐  │
│  │ Speaker  │◀───│ I2S TX  │◀───│   Opus   │◀──│   TCP   │  │
│  │(MAX98357)│    │  16kHz  │    │  decode  │   │  3001   │  │
│  └──────────┘    └─────────┘    └──────────┘   └─────────┘  │
└─────────────────────────────────────────────────────────────┘
                              WiFi
                               │
┌──────────────────────────────▼──────────────────────────────┐
│                           Server                            │
│  ┌─────────┐    ┌──────────┐    ┌─────────────────────┐     │
│  │   UDP   │───→│  μ-law   │───→│  Speech-to-Text     │     │
│  │  3000   │    │  decode  │    │  (Whisper/Deepgram) │     │
│  └─────────┘    └──────────┘    └─────────────────────┘     │
│                                                             │
│  ┌─────────┐    ┌──────────┐    ┌─────────────────────┐     │
│  │   TCP   │◀───│   Opus   │◀───│  Text-to-Speech     │     │
│  │  3001   │    │  encode  │    │  (ElevenLabs/etc)   │     │
│  └─────────┘    └──────────┘    └─────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
```

## Audio Paths

### Microphone → Server (UDP + μ-law)

- Sample rate: 16kHz mono, 512 samples/chunk (32ms)
- μ-law compressed: 512 bytes/chunk (16 KB/s) — 2x reduction
- UDP: no retransmissions, no connection overhead — old audio is stale anyway
- DC offset removed per-chunk before encoding

**Why μ-law over Opus upstream:** μ-law is stateless (sample-by-sample table lookup, ~1% CPU), adds zero buffering latency, and ASR models handle the reduced quality fine. Opus would add 20-60ms of frame buffering and 10-20% CPU for no practical benefit upstream.

**Why UDP over TCP:** Retransmissions add latency, and head-of-line blocking delays newer audio. ASR handles occasional packet loss better than delayed old audio.
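The 2x reduction in the upstream path comes from G.711 μ-law. As an illustrative sketch (the firmware uses its own C table-lookup implementation; Python's stdlib `audioop.lin2ulaw` does the same through 3.12), here is a from-scratch encoder:

```python
BIAS = 0x84   # standard G.711 bias (132)
CLIP = 32635  # clamp so the biased magnitude fits in 15 bits

def ulaw_encode_sample(pcm: int) -> int:
    """Encode one signed 16-bit PCM sample to an 8-bit mu-law byte (G.711)."""
    sign = 0x80 if pcm < 0 else 0
    magnitude = min(abs(pcm), CLIP) + BIAS
    # Segment (exponent) = position of the highest set bit above bit 7
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        mask >>= 1
        exponent -= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # G.711 stores bits inverted

def encode_chunk(samples) -> bytes:
    """512 samples (32 ms @ 16 kHz) in -> 512 bytes out: the 2x reduction."""
    return bytes(ulaw_encode_sample(s) for s in samples)
```

The per-sample cost is a handful of shifts and compares, which is why the firmware can do it inline in `micTask`.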
### Server → Speaker (TCP + Opus)

- Sample rate: 16kHz mono, 320 samples/frame (20ms)
- Opus compressed: ~35-50 bytes/frame (1.5-2 KB/s) — 14-16x reduction
- TCP: reliable ordered delivery required for Opus frame decoding

**Why Opus over μ-law downstream:** Human ears need better quality than ASR does. Opus gives 14-16x compression vs μ-law's 2x, turning a tight 2.2x WiFi margin into 30x+.

**Why TCP over UDP:** Lost or out-of-order Opus frames cause decode errors. TCP's reliability guarantees are worth the slight latency cost, especially with the playback buffer absorbing jitter.

## Device Discovery & Connection

1. ESP32 boots and joins WiFi
2. Sends a multicast announcement to `239.0.0.1:12345` with hostname and git hash
3. Server discovers the device and learns its IP
4. **Server connects to the ESP32's TCP server** on port 3001 (the ESP32 is the TCP server, not the client)
5. ESP32 learns the server IP from the first TCP connection and uses it for UDP mic packets
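The announcement side of step 2 can be emulated from Python for testing. A minimal sketch (the payload format here is an assumption for illustration — the firmware's exact wire format may differ):

```python
import socket

MCAST_GRP, MCAST_PORT = "239.0.0.1", 12345

def make_announcement(hostname: str, git_hash: str) -> bytes:
    # Illustrative payload; check the firmware for the real format.
    return f"{hostname} {git_hash}".encode()

def announce(payload: bytes) -> None:
    """Send one discovery announcement, as the ESP32 does on boot."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL 1 keeps the announcement on the local network segment
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    sock.sendto(payload, (MCAST_GRP, MCAST_PORT))
    sock.close()
```

Because the ESP32 acts as the TCP *server* (step 4), an emulator only needs to announce itself and listen on port 3001.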
## TCP Command Protocol

All commands use a 6-byte header. The server initiates TCP connections to the ESP32.

### 0xAA — Audio Playback
```
header[0]   = 0xAA
header[1:2] = mic_timeout (seconds, big-endian) — enable mic after audio finishes
header[3]   = volume (0-20, bit-shift)
header[4]   = LED fade rate (0-255)
header[5]   = compression type: 0=PCM, 2=Opus
```
Followed by length-prefixed Opus frames: `[2-byte big-endian length][Opus data]...`

A zero-length frame (`0x00 0x00`) signals end of speech — the ESP32 exits `opusDecodeTask`, clears `isPlaying`, and re-enables the mic. The TCP connection may stay open for reuse.
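Server-side, a complete 0xAA stream (header, frames, end marker) can be assembled like this — a sketch with example field values, not the pipeline's actual sender:

```python
import struct

def build_play_command(opus_frames, mic_timeout_s=60, volume=14, fade=5):
    """Assemble a 0xAA playback command: 6-byte header, length-prefixed
    Opus frames, then the zero-length end-of-speech marker."""
    header = bytes([
        0xAA,                          # command: audio playback
        (mic_timeout_s >> 8) & 0xFF,
        mic_timeout_s & 0xFF,          # header[1:2]: big-endian mic timeout
        volume,                        # header[3]: bit-shift volume (0-20)
        fade,                          # header[4]: LED fade rate
        2,                             # header[5]: compression type, 2 = Opus
    ])
    body = b"".join(struct.pack(">H", len(f)) + f for f in opus_frames)
    return header + body + struct.pack(">H", 0)  # 0x00 0x00 ends the stream
```

Sending the final `0x00 0x00` is what lets the device re-enable the mic without tearing down the connection.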
### 0xBB — Set LEDs
```
header[0]   = 0xBB
header[1]   = LED bitmask (bits 0-5)
header[2:4] = RGB color
```

### 0xCC — LED Blink (VAD visualization)
```
header[0]   = 0xCC
header[1]   = starting intensity (0-255)
header[2:4] = RGB color
header[5]   = fade rate
```
Also extends the mic timeout if it is about to expire (`VAD_MIC_EXTEND` = 5s).

### 0xDD — Mic Timeout
```
header[0]   = 0xDD
header[1:2] = timeout (seconds, big-endian)
```
Used to stop the mic while the server is processing (thinking animation).

## FreeRTOS Task Architecture

The ESP32-S3's dual cores are used to separate concerns:

**Core 0 — Arduino loop:**
- TCP server: accepts connections, parses headers, handles PCM playback
- Touch/mute input handling
- UART debug commands

**Core 1 — Dedicated tasks:**
- `micTask` (4KB stack, priority 1): continuous I2S read → μ-law encode → UDP send
- `opusDecodeTask` (32KB stack, priority 1): created per playback, reads TCP → Opus decode → I2S write
- `updateLedTask` (2KB stack, priority 2): 40Hz LED refresh with gamma-corrected fade

The 32KB stack for Opus decoding is necessary because the Opus decoder uses 10-20KB of stack internally.

## State Machine

Key state variables controlling behavior:

- `isPlaying` — blocks mic recording during playback
- `mic_timeout` — `millis()` deadline for mic recording; 0 = off
- `interruptPlayback` — set by center touch to abort the current playback
- `mute` — hardware mute switch state (currently disabled via `DISABLE_HARDWARE_MUTE`)
- `serverIP` — learned from the first TCP connection; `0.0.0.0` = no server yet

**Activation flow:**
1. Center touch → sets `mic_timeout` to now + 60s, green LED pulse
2. Server sends 0xCC (VAD blink) during speech → extends the timeout by 5s if nearly expired
3. Server sends 0xDD (stop mic) when transcription is complete → thinking animation
4. Server sends 0xAA (audio) with the response → plays audio, then re-enables the mic per the header timeout

**Playback interruption:**
1. Center touch during playback → sets `interruptPlayback`, clears `isPlaying`
2. Opus/PCM task detects the flag and stops decoding
3. Remaining TCP data is drained (up to 1s) without playing
4. Mic is enabled immediately for 60s

## LED System

6 NeoPixel LEDs, with only the inner 4 (indices 1-4) used for animations. The edge LEDs of that group (1, 4) are dimmed by half for a softer visual.

- **Pulse-and-fade paradigm:** `setLed()` sets color, starting intensity, and fade rate. `updateLedTask` ramps intensity down at 40Hz.
- **Gamma correction:** LUT with gamma 1.8 (lower than the typical 2.2 to avoid visible flicker at low PWM levels)
- **Audio-reactive:** during playback, the amplitude of PCM samples drives LED brightness (sampled every 32ms, only ramps up — the natural fade handles the down)

**Color semantics:**
- Green pulse: listening / mic active
- White pulse: audio playback / VAD visualization
- Red pulse: error / cannot listen (muted or no server)

## Volume Control

Bit-shift based: PCM samples are left-shifted by the volume value (0-20) when widened to 32-bit I2S words. Default is 14. Set per playback via the 0xAA header, configurable via NVS.
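In Python terms, the shift amounts to this (a sketch of the firmware's C logic, which does `(int32_t)sample << speaker_volume`):

```python
def widen_with_volume(sample16: int, volume: int = 14) -> int:
    """Widen a signed 16-bit PCM sample into a 32-bit I2S word by left-shifting
    by the volume value. At the default of 14, a full-scale 16-bit sample
    occupies 30 bits of the 32-bit word, leaving 2 bits of headroom."""
    if not 0 <= volume <= 20:
        raise ValueError("volume must be 0-20")
    return sample16 << volume
```

Each step of 1 doubles the amplitude, so the 0-20 range spans roughly 120 dB in 6 dB steps.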
## Playback Buffering

```
TCP → tcpBuffer (512B) → wavData (2MB PSRAM) → I2S DMA → Speaker
```

- **Buffer threshold:** 4096 samples (256ms) before starting I2S playback — balances latency vs jitter resilience
- **Without PSRAM:** falls back to 1024 samples (64ms), 4KB allocation
- **I2S DMA:** 4 buffers × 512 samples, hardware-driven (no CPU polling)
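The threshold logic can be sketched as follows (illustrative Python; the firmware implements this in C over the PSRAM `wavData` buffer):

```python
class PlaybackBuffer:
    """Hold decoded PCM and gate playback on a start threshold, so a burst
    of network jitter cannot starve I2S right after playback begins."""

    def __init__(self, start_threshold: int = 4096, sample_rate: int = 16000):
        self.samples = []
        self.start_threshold = start_threshold  # 4096 samples = 256 ms @ 16 kHz
        self.sample_rate = sample_rate
        self.playing = False

    def push(self, pcm) -> bool:
        """Add decoded samples; returns True once playback may start."""
        self.samples.extend(pcm)
        if not self.playing and len(self.samples) >= self.start_threshold:
            self.playing = True
        return self.playing

    def buffered_ms(self) -> float:
        return 1000 * len(self.samples) / self.sample_rate
```

Raising the threshold trades first-sound latency for resilience; the no-PSRAM fallback of 1024 samples accepts more underrun risk to fit in internal RAM.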
## Configuration

Stored in NVS (ESP32 Preferences): WiFi credentials, server hostname, volume, mic timeout. Editable via the UART config mode (`c` command).

## UART Debug Commands

`r` restart, `M` mic on 10min, `m` mic off, `W`/`w` LED test fast/slow, `L`/`l` LEDs max/off, `A` multicast announce, `c` config mode.

@@ -1,334 +0,0 @@
# Opus Compression Implementation Plan

## Overview
Add Opus decoding to the ESP32 for receiving compressed TTS audio from the server over TCP, achieving 16-21x compression over raw PCM (8-10x over the current μ-law).

## Why Opus?
- **16-21x compression** for 16kHz mono voice at 12-16 kbps (vs 2x for μ-law)
- **High quality** — suitable for human listening (unlike μ-law)
- **Bandwidth target**: 12-16 kbps (vs the current 128 kbps with μ-law, 256 kbps raw)
- **WiFi margin**: 4.4x → 20-30x throughput margin
- **Resource usage**: ~20% CPU, 15-20 KB heap on ESP32-S3

## Architecture

### Current Flow (μ-law)
```
Server → [PCM 16kHz] → μ-law encode → TCP → ESP32 → μ-law decode → I2S speaker
          32 KB/s        16 KB/s                       32 KB/s
```

### New Flow (Opus)
```
Server → [PCM 16kHz] → Opus encode → TCP → ESP32 → Opus decode → I2S speaker
          32 KB/s       1.5-2 KB/s                    32 KB/s
```

## Packet Framing

### Current (PCM/μ-law)
- Fixed-size chunks: 512 bytes μ-law = 32ms audio
- No frame length needed (fixed size)

### Opus (variable bitrate)
```
[2-byte length][Opus frame data]
```

- Length: `uint16_t` in bytes (network byte order)
- Frame: compressed Opus frame
- Target packet size: ~1KB of raw Opus data ≈ 500-680ms of audio @ 12-16 kbps
- Frame duration: use 20ms frames (standard for voice)

### Example
```
For 20ms @ 16kHz @ 12 kbps:
- PCM input: 20ms × 16000 Hz × 2 bytes = 640 bytes
- Opus output: ~30 bytes per 20ms frame
- Accumulate 32 frames (640ms) → ~960 bytes → send as one packet
```
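The arithmetic in the example can be checked directly:

```python
# Sanity-check the framing numbers above (20 ms frames, 16 kHz, 12 kbps).
SAMPLE_RATE = 16000      # Hz
FRAME_MS = 20            # Opus frame duration
BYTES_PER_SAMPLE = 2     # 16-bit PCM
BITRATE = 12_000         # bits/s

samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000      # 320 samples
pcm_per_frame = samples_per_frame * BYTES_PER_SAMPLE    # 640 bytes in
opus_per_frame = BITRATE // 8 * FRAME_MS // 1000        # ~30 bytes out

frames_per_packet = 32
packet_bytes = frames_per_packet * opus_per_frame       # ~960 bytes
packet_duration_ms = frames_per_packet * FRAME_MS       # 640 ms
```

So each ~1KB packet carries a bit over half a second of audio at 12 kbps, and the per-frame compression is 640/30 ≈ 21x.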
## Implementation Steps

### 1. Add Opus Library to ESP32 Firmware

**Library:** [sh123/esp32_opus_arduino](https://github.com/sh123/esp32_opus_arduino)

**Installation:**
```bash
# Option A: Arduino Library Manager
# Search for "esp32_opus" and install

# Option B: Manual (recommended for control)
cd ~/Arduino/libraries
git clone https://github.com/sh123/esp32_opus_arduino.git
```

**Or use PlatformIO:**
```ini
lib_deps =
    sh123/esp32_opus@^1.0.0
```

### 2. Modify ESP32 Firmware

**Changes to onjuino.ino:**
```cpp
#include <opus.h>

// Opus decoder state
OpusDecoder *opus_decoder = NULL;
const int OPUS_FRAME_SIZE = 320;          // 20ms @ 16kHz
const int MAX_OPUS_FRAME_BYTES = 4000;    // sanity limit for one frame
int16_t opus_pcm_buffer[OPUS_FRAME_SIZE];
uint8_t opus_frame[MAX_OPUS_FRAME_BYTES]; // fixed buffer; avoids a VLA on the stack

void setup() {
  // ... existing setup ...

  // Initialize the Opus decoder
  int error;
  opus_decoder = opus_decoder_create(16000, 1, &error);  // 16kHz, mono
  if (error != OPUS_OK) {
    Serial.printf("Opus decoder create failed: %d\n", error);
  } else {
    Serial.println("Opus decoder initialized");
  }
}

// In TCP handler (replacing current PCM reception)
void handleOpusAudio(WiFiClient &client) {
  while (client.connected()) {
    // Read the 2-byte frame length
    if (client.available() < 2) {
      delay(1);
      continue;
    }

    uint8_t len_bytes[2];
    client.read(len_bytes, 2);
    uint16_t frame_len = (len_bytes[0] << 8) | len_bytes[1];

    // Sanity check
    if (frame_len > MAX_OPUS_FRAME_BYTES) {
      Serial.printf("Invalid frame length: %d\n", frame_len);
      break;
    }

    // Read the Opus frame
    size_t bytes_read = 0;
    while (bytes_read < frame_len) {
      int avail = client.available();
      if (avail > 0) {
        int to_read = min(avail, (int)(frame_len - bytes_read));
        bytes_read += client.read(opus_frame + bytes_read, to_read);
      } else {
        delay(1);
      }
    }

    // Decode the Opus frame
    int num_samples = opus_decode(
        opus_decoder,
        opus_frame, frame_len,
        opus_pcm_buffer, OPUS_FRAME_SIZE,
        0  // decode_fec (forward error correction) off
    );

    if (num_samples < 0) {
      Serial.printf("Opus decode error: %d\n", num_samples);
      continue;
    }

    // Write to I2S (same as before, but from opus_pcm_buffer):
    // convert to 32-bit and apply the bit-shift volume
    for (int i = 0; i < num_samples; i++) {
      wavData[totalSamplesRead++] = (int32_t)opus_pcm_buffer[i] << speaker_volume;
    }

    // Drain buffer when full (existing logic)
    // ...
  }
}
```
### 3. Server-Side Test Script

**test_opus_tts.py:**

```python
#!/usr/bin/env python3
"""Test Opus-compressed TTS streaming to ESP32."""
import socket
import struct

from pydub import AudioSegment
import opuslib

ESP32_IP = "192.168.68.97"
ESP32_PORT = 3001
WAV_FILE = "recording.wav"

# Opus settings
SAMPLE_RATE = 16000
CHANNELS = 1
FRAME_SIZE = 320   # 20ms @ 16kHz
BITRATE = 12000    # 12 kbps for voice


def main():
    # Load audio
    audio = AudioSegment.from_wav(WAV_FILE)
    audio = audio.set_channels(CHANNELS)
    audio = audio.set_frame_rate(SAMPLE_RATE)
    audio = audio.set_sample_width(2)  # 16-bit
    pcm_data = audio.raw_data

    print(f"Loaded {len(pcm_data)} bytes of PCM audio ({len(pcm_data)/32000:.1f}s)")

    # Initialize the Opus encoder
    encoder = opuslib.Encoder(SAMPLE_RATE, CHANNELS, opuslib.APPLICATION_VOIP)
    encoder.bitrate = BITRATE

    # Connect to ESP32
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((ESP32_IP, ESP32_PORT))
    print(f"Connected to {ESP32_IP}:{ESP32_PORT}")

    # Send header (0xAA command; header[5] = 2 flags Opus compression)
    header = bytes([0xAA, 0x00, 60, 14, 5, 2])
    sock.sendall(header)
    print("Header sent")

    # Encode and send PCM in 20ms frames
    frame_bytes = FRAME_SIZE * 2  # 320 samples * 2 bytes
    total_opus_bytes = 0
    frame_count = 0

    for i in range(0, len(pcm_data), frame_bytes):
        pcm_frame = pcm_data[i:i+frame_bytes]

        # Pad the last frame if needed
        if len(pcm_frame) < frame_bytes:
            pcm_frame += b'\x00' * (frame_bytes - len(pcm_frame))

        # Encode to Opus
        opus_frame = encoder.encode(pcm_frame, FRAME_SIZE)

        # Send with length prefix
        frame_len = len(opus_frame)
        sock.sendall(struct.pack('>H', frame_len))  # big-endian uint16
        sock.sendall(opus_frame)

        total_opus_bytes += frame_len
        frame_count += 1

        if frame_count % 50 == 0:
            print(f"Sent {frame_count} frames, {total_opus_bytes:,} bytes")

    # Zero-length frame signals end of speech
    sock.sendall(struct.pack('>H', 0))
    sock.close()

    # Statistics
    compression_ratio = len(pcm_data) / total_opus_bytes
    print("\nRESULTS:")
    print(f"Original PCM: {len(pcm_data):,} bytes")
    print(f"Opus compressed: {total_opus_bytes:,} bytes")
    print(f"Compression: {compression_ratio:.1f}x")
    print(f"Bandwidth: {(total_opus_bytes * 8 / (len(pcm_data)/32000)) / 1000:.1f} kbps")
    print(f"Frames sent: {frame_count}")


if __name__ == '__main__':
    main()
```
**Dependencies:**
```bash
pip install opuslib pydub
```

### 4. Modified Header Format

Add a compression-type byte to the header:

```cpp
/*
header[0]   0xAA for audio
header[1:2] mic timeout in seconds (big-endian)
header[3]   volume
header[4]   fade rate
header[5]   compression type: 0=PCM, 1=μ-law, 2=Opus
*/
```
## Testing Plan

1. **Install the Opus library** on the ESP32
2. **Compile and flash** the modified firmware
3. **Run test_opus_tts.py** with recording.wav
4. **Verify audio playback** quality
5. **Measure compression ratio** and bandwidth usage

## Expected Results

### Bandwidth Comparison
```
Raw PCM:  256 kbps    (32 KB/s)
μ-law:    128 kbps    (16 KB/s)     [2x compression]
Opus:     12-16 kbps  (1.5-2 KB/s)  [16-21x compression]
```

### WiFi Margin
```
Current:   553.9 kbps throughput / 128 kbps μ-law = 4.3x margin
With Opus: 553.9 kbps throughput / 15 kbps Opus   = 36.9x margin
```

## Fallback Strategy

If Opus proves problematic:
1. **ADPCM**: 4x compression, simpler than Opus
2. **Lower sample rate**: 8kHz instead of 16kHz (2x savings)
3. **Variable-bitrate μ-law**: silence detection to skip packets

## Integration with ElevenLabs

ElevenLabs can output Opus directly:

```python
audio_stream = client.text_to_speech.convert(
    voice_id=VOICE_ID,
    text=TEXT,
    model_id="eleven_monolingual_v1",
    output_format="opus_16000"  # Native Opus output
)
```

This avoids double-encoding (PCM → Opus on the server).

## Memory Considerations

**ESP32-S3 with 2MB PSRAM:**
- Opus decoder: ~20 KB heap (use PSRAM)
- PCM buffer: 8KB (existing)
- Opus frame buffer: ~4KB max
- **Total added overhead: ~24 KB** (negligible with 2MB PSRAM)

## CPU Usage

Expected: **10-20% of one core @ 240MHz** for Opus decoding at 16kHz mono.

This leaves plenty of headroom for:
- WiFi/TCP handling
- I2S audio output
- LED visualization
- Touch sensor processing

## Next Steps

1. ✅ Research Opus libraries (DONE)
2. ⬜ Install the sh123/esp32_opus_arduino library
3. ⬜ Modify onjuino.ino with the Opus decoder
4. ⬜ Create the test_opus_tts.py script
5. ⬜ Test and validate
6. ⬜ Integrate with ElevenLabs native Opus output
7. ⬜ Update server.py to use Opus for all TTS
### PIPELINE.md — 79 lines deleted

@@ -1,79 +0,0 @@
# Pipeline Server

Async voice pipeline that connects ESP32 onju-voice devices to ASR, LLM, and TTS services.

```
ESP32 (mic) ──UDP/μ-law──▶ Pipeline ──HTTP──▶ ASR Service
                              │
                              ├──▶ LLM (OpenAI-compatible)
                              │
                              ├──▶ TTS (ElevenLabs)
                              │
ESP32 (speaker) ◀──TCP/Opus──┘
```

## Prerequisites

**ASR Service** — [parakeet-asr-server](https://github.com/justLV/parakeet-asr-server) running on port 8100.

**LLM** — Any OpenAI-compatible server. Examples:
```bash
# Local (mlx_lm)
mlx_lm.server --model mlx-community/gemma-3-4b-it-qat-4bit --port 8080

# Local (Ollama)
ollama serve  # default port 11434

# Hosted — just set base_url and api_key in config.yaml
```

**TTS** — ElevenLabs API key (add to `config.yaml`).
## Setup

```bash
# From the repo root
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# macOS: install system libraries
brew install opus portaudio
```

## Configuration

```bash
cp pipeline/config.yaml.example pipeline/config.yaml
# Edit config.yaml with your API keys and preferences
```

## Running

Ensure the prerequisite services are running, then start the pipeline from the repo root:

```bash
source .venv/bin/activate
python -m pipeline.main
```

## Test Client

A Python script that emulates an ESP32 device (TCP server, Opus decoding, mic streaming):

```bash
# From the repo root
python test_client.py                # localhost
python test_client.py 192.168.1.50   # remote server
python test_client.py --no-mic       # playback only
```
## Config Reference

| Key | Description | Notes |
|-----|-------------|-------|
| `asr.url` | ASR service endpoint | Default: `http://localhost:8100` |
| `llm.base_url` | OpenAI-compatible API base | Ollama, mlx_lm, OpenRouter, OpenAI |
| `llm.model` | Model name | Passed to the chat completions API |
| `tts.backend` | TTS provider | Currently: `elevenlabs` |
| `vad.*` | Voice activity detection | Tune thresholds for sensitivity |
| `network.*` | Ports | UDP 3000 (mic), TCP 3001 (speaker), multicast 239.0.0.1:12345 |
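Putting the table together, a hypothetical `pipeline/config.yaml` might look like the sketch below. The top-level sections come from the table above; the sub-keys under `vad` and `network` and all values are illustrative placeholders — see `config.yaml.example` for the real schema:

```yaml
asr:
  url: http://localhost:8100          # parakeet-asr-server

llm:
  base_url: http://localhost:11434/v1 # any OpenAI-compatible endpoint
  model: gemma-3-4b-it-qat-4bit       # passed to the chat completions API
  api_key: YOUR_KEY_HERE              # hosted providers only

tts:
  backend: elevenlabs

vad:
  threshold: 0.5                      # placeholder; tune for sensitivity

network:
  udp_port: 3000                      # mic upstream
  tcp_port: 3001                      # speaker downstream
  multicast_addr: 239.0.0.1
  multicast_port: 12345
```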
### README.md — 298 changed lines

@@ -1,174 +1,244 @@
# Onju Voice 🍐🔈
# Onju Voice v2

💫 [DEMOs](https://twitter.com/justLV)
Enable multiple "Google Home" speakers to connect to a Mac Mini (or other local server) for talking to your agent(s) over your local WiFi.

A hackable AI home assistant platform using the Google Nest Mini (2nd gen) form factor, consisting of:
* a custom PCB designed to be a drop-in replacement for the original, using the ESP32-S3 for audio processing
* a server for handling transcription, response generation, and text-to-speech for multiple devices on the same network

This repo consists of:
* A custom PCB designed as a drop-in replacement for the original Google Nest Mini (2nd gen), using the ESP32-S3 for audio processing and WiFi connectivity
* An async server pipeline handling ASR -> TTS for multiple devices on the same network, compatible with any LLM or agent platform such as OpenClaw 🦞

_(This repo focuses on the experimental conversational LLM aspect to replicate some of the functionality shown in the demos, not on being a full-fledged replacement for a home assistant. It is not actively maintained, but all source code and design files are released for anyone else to pick up from here.)_

> This is the successor to [onju-voice](https://github.com/justLV/onju-voice). The original repo remains available as a reference but is no longer actively maintained.

<img src="images/header_white.jpg" width="960">

## Overview
## What's new in v2

This repo contains firmware, server code and some example applications, intended to be as accessible as possible for getting up and running, i.e.:
* [Firmware](#-firmware) for the custom PCB can be programmed using the Arduino IDE and a USB cable (no ESP-IDF installation required)
* [Server code](#%EF%B8%8F-server) has minimal requirements besides running Whisper locally, and should run on most devices that you can leave plugged in, whether macOS / Linux / Windows etc.
* [Hardware](#-hardware) can be ordered from [PCBWay](https://www.pcbway.com/project/shareproject/Onju_Voice_d33625a1.html), and Altium design files are included

<img src="images/rich.png">

* **OpenClaw managed backend** 🦞 -- delegate conversation history and session management to an [OpenClaw](https://github.com/openclaw) gateway for centralized, multi-device orchestration
* **Opus compression** -- 14-16x downstream compression (server to speaker) for better audio quality over WiFi
* **Streaming-ready architecture** -- designed for sentence-level TTS streaming and agentic tool-calling loops (see [Voice Agent Architecture](#voice-agent-architecture))
* **Modular async pipeline** -- replaced the monolithic server with a pluggable architecture for ASR, LLM, TTS, and other backends
* **Any LLM** -- works with any OpenAI-compatible API (Ollama, mlx_lm, Gemini, OpenRouter, Claude, etc.)
* **Pluggable TTS** -- ElevenLabs (recommended) or local via [mlx-audio](https://github.com/lucasnewman/mlx-audio) for fully offline operation
* **Silero VAD** -- server-side voice activity detection with configurable thresholds, replacing webrtcvad
* **VAD-aware interruption** -- tap to interrupt playback and start speaking immediately
* **M5 Echo support** -- get started with a [$13 dev kit](https://shop.m5stack.com/products/atom-echo-smart-speaker-dev-kit) instead of ordering a custom PCB
* **One-command flashing** -- `./flash.sh` handles compilation, WiFi credential generation (from the macOS Keychain), and upload. No Arduino IDE or manual configuration required
## Example applications
* 📩 Querying and replying to messages (using a [custom Maubot plugin](https://github.com/justLV/onju-voice-maubot) & Beeper)
* 💡 Light control with [Home Assistant](#-home-assistant)
* 📝 Adding and retrieving notes/memos for the LLM to craft a response with

## Supported devices

*Not included:*
* 👥 Multiple voice characters. I'll leave it to the user to clone voices as they deem fair use. Also, from experience, LLMs below GPT-4 don't follow instructions consistently enough to reliably respond in different characters AND perform multiple function calls with complicated prompts.

| | Onjuino (custom PCB) | M5Stack ATOM Echo |
|---|---|---|
| **Board** | ESP32-S3 | ESP32-PICO-D4 |
| **Interaction** | Capacitive touch: tap to start, double-tap to end | Physical button: hold to talk |
| **Mic** | I2S (INMP441) | PDM (SPM1423) |
| **Speaker** | MAX98357A, 6 NeoPixel LEDs | NS4168, 1 SK6812 LED |
| **PSRAM** | Yes (2MB playback buffer) | No (smaller buffers) |
| **Audio upstream** | mu-law 16kHz UDP (16 KB/s) | mu-law 16kHz UDP (16 KB/s) |
| **Audio downstream** | Opus 16kHz TCP (~1.5 KB/s) | Opus 16kHz TCP (~1.5 KB/s) |

## Current features of the device <> server platform
* Auto-discovery of devices using multicast announcements
* Remembering conversation history and voice settings for each device
* Sending & receiving audio data from the device, packed as 16-bit, 16kHz (UDP sending, TCP receiving partially buffered into PSRAM)
* Speaker and microphone visualization with the LEDs, and custom LED control via the server
* Mute switch functionality, tap-to-wake for enabling the microphone, and setting the mic timeout via the server
* Device-level logging to individual files and console output using `rich`

Both targets use the same network protocol and connect to the same server. See the [M5 Echo README](m5_echo/README.md) for hardware-specific details.

## Limitations of this release:
* The Arduino IDE doesn't (yet) support Espressif's audio SDKs, such as [ESP-ADF](https://github.com/espressif/esp-adf), [ESP-Skainet](https://github.com/espressif/esp-skainet) etc. For these demos they aren't strictly required, but building on ESP-IDF with these SDKs would unlock features such as:
  * VAD (Voice Activity Detection) - in this example VAD is offloaded to the server using webrtcvad, and the listening period is extended by either tapping the device or by the server sending mic keep-alive timeouts (network traffic is really minimal at 16-bit, 16kHz)
  * AEC (Acoustic Echo Cancellation) - to let you talk over the assistant by removing the speaker output from the audio input
  * BSS (Blind Source Separation) - lets you use both mics to isolate speakers by location, plus other noise suppression
  * Wakewords and other on-device commands - I'm not a believer in these given how finicky they can be, and think all command logic should be handled by layers of language models on the server.
* The server currently only does transcription locally, and uses:
  * OpenAI for generating responses & function calls, but if you have the hardware you could run a local LLM, using something like ToolLLM for calling APIs to add almost any capability you'd wish.
  * Text-to-speech from ElevenLabs - fair to say the easiest to get running, and the fastest and most expressive option out there, but FWIR the data policy is a little dubious, so be careful about sending anything too sensitive. I'd really like to see comparably performing open-source options that you can run locally.
* Conversation flow is highly serialized, i.e. recording > transcription > LLM > TTS must finish each step before moving on to the next. Not included here: feeding incomplete transcriptions to a smaller model, and streaming slower LLMs like GPT-4 to ElevenLabs with streaming responses sent back - it's currently a little too hacky to include in this release.
* No wakeword usage, mostly intentional as I feel uttering a wake-word before every response is a terrible experience. This currently uses a combination of VAD, mic timeouts sent from the server, tap-to-wake, mute switch usage etc. Not included here: experiments running a smaller, faster LLM for classification on a running transcription before handing off to a larger LLM with a specific prompt.
## Architecture
|
||||
|
||||
## Other areas for improvement
|
||||
These are things I didn't get time to implement but I believe would be invaluable and pretty achievable
|
||||
* Speaker diarization - know who is saying what, and have the LLM enage in multi-user conversations or infer when it isn't being spoken to
|
||||
* Interruptions - requires AEC for simultaneous listening and playback
|
||||
* Smaller local models/LLM's for running classification, detecting intent and routing to larger LLM's
|
||||
|
||||
# Installation

## 🖥️ Server

Ensure you can install [Whisper](https://github.com/openai/whisper) and run at least the base model, following any debugging steps they have if not. If you can get past that, it should be as simple as:

```
cd server
pip install -r requirements.txt
```

```
ESP32 Device                          Server Pipeline
┌──────────────────────────────┐  ┌──────────────────────────────────────┐
│ Mic ─→ I2S RX ─→ mu-law ──────UDP 3000──→ mu-law decode ─→ VAD ─→ ASR │
│                              │  │                                      │
│ Speaker ←─ I2S TX ←─ Opus ←──TCP 3001──← Opus encode ←─ TTS ←─ LLM    │
└──────────────────────────────┘  └──────────────────────────────────────┘
```
Adjust settings in `config.yaml`, and tweak aspects such as how much silence is needed before processing starts, trading off snappiness against cutting off the user.
**Why mu-law upstream:** Stateless sample-by-sample encoding (~1% CPU), zero buffering latency. ASR models handle the quality fine.
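The stateless property is easy to see in a minimal NumPy sketch of a G.711 mu-law encoder (illustrative only; the firmware implements this in C on the ESP32):

```python
import numpy as np

BIAS, CLIP = 0x84, 32635

def mulaw_encode(pcm: np.ndarray) -> bytes:
    """G.711 mu-law: int16 PCM in, one byte per sample out (2x compression).
    Each byte depends only on its own sample, so a lost UDP packet never
    corrupts later audio."""
    x = pcm.astype(np.int32)
    sign = np.where(x < 0, 0x80, 0)
    mag = np.clip(np.abs(x), 0, CLIP) + BIAS
    exponent = np.floor(np.log2(mag)).astype(np.int32) - 7   # 0..7
    mantissa = (mag >> (exponent + 3)) & 0x0F
    return ((~(sign | (exponent << 4) | mantissa)) & 0xFF).astype(np.uint8).tobytes()
```

Decoding inverts this with the table approach shown later in the test scripts; the round trip is lossy but well within what ASR tolerates.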
Add your ElevenLabs token to `credentials.json` and ensure you have a cloned voice in your account that you set in `config.yaml` under `elevenlabs_default_voice`.

You'll also need a greeting WAV set in `config.yaml` under `greeting_wav`, which will be sent to devices when they connect to WiFi. This is up to you to record or procure ([e.g.](https://github.com/ytdl-org/youtube-dl)).
**Why Opus downstream:** Human ears need better quality than ASR, and Opus decoding is easier for an ESP32. Opus gives 14-16x compression vs mu-law's 2x, and TCP ensures reliable ordered delivery for the stateful codec.
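The compression figures above follow from simple bitrate arithmetic (the ~16-18 kbps Opus voice bitrate here is an assumed encoder setting, not a value from this repo):

```python
# 16 kHz, 16-bit mono capture
raw_kbps = 16000 * 16 / 1000   # 256 kbps uncompressed PCM
mulaw_kbps = 16000 * 8 / 1000  # 128 kbps -> 2x
opus_lo, opus_hi = 16, 18      # assumed Opus voice bitrate range (kbps)

print(raw_kbps / mulaw_kbps)                   # 2.0
print(raw_kbps / opus_hi, raw_kbps / opus_lo)  # roughly 14x .. 16x
```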
A small subset of the config parameters can be set as optional arguments when running the script. For example, the following runs the server with note-taking, Home Assistant, Maubot, and real sending of messages enabled (a safeguard disabled by default), plus a smaller English-only Whisper model for transcription:

`python server.py --n --ha --mb --send --whisper base.en`

### Device discovery
1. ESP32 boots and joins WiFi
2. Sends multicast announcement to `239.0.0.1:12345` with hostname, git hash, and PTT flag
3. Server discovers device and connects to its TCP server on port 3001
4. ESP32 learns server IP from the TCP connection and starts sending mic audio via UDP
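The server side of this handshake can be sketched in a few lines (the announcement payload format shown in the comment is an assumption; see the test scripts for the real parsing):

```python
import socket
import struct

MCAST_GRP, MCAST_PORT = "239.0.0.1", 12345

def wait_for_device(timeout=30.0):
    """Block until a device announces itself; return (announcement, device_ip)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    # Join the multicast group so announcements reach us
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    data, addr = sock.recvfrom(1024)   # e.g. b"hostname|githash|ptt" (assumed shape)
    sock.close()
    return data.decode(errors="replace"), addr[0]
```

After this returns, the server opens the TCP connection to `addr[0]:3001` and the device learns where to send UDP mic audio.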
### 🏡 Home Assistant

I recommend setting this up on the same server, or on one that is always plugged in on your network, following the [Docker Compose instructions](https://www.home-assistant.io/installation/linux#docker-compose).

Then go through onboarding, set up a user, name your devices, and get a Long-Lived token to add to `credentials.json` together with the URL, e.g. `http://my-local-server:8123/`.

### TCP command protocol
All commands use a 6-byte header. The server initiates TCP connections to the ESP32.
| Byte 0 | Command | Payload |
|---|---|---|
| `0xAA` | Audio playback | mic_timeout (2B), volume, LED fade, compression type, then length-prefixed Opus frames |
| `0xBB` | Set LEDs | LED bitmask, RGB color |
| `0xCC` | LED blink (VAD) | intensity, RGB color, fade rate |
| `0xDD` | Mic timeout | timeout in seconds (2B) |

### 🤖 Maubot

Follow the instructions [here](https://github.com/justLV/onju-home-maubot) to set up Maubot with your Beeper account. Ensure the correct URL is set in `config.yaml`, set `send_replies` to True if your friends are forgiving of the odd mistake, and set a `footer`.

Don't have Beeper yet and can't wait? [Try setting up a Matrix bridge yourself](https://docs.mau.fi/bridges/go/imessage/mac/setup.html) and a custom function definition for OpenAI function calling (and share how you did it!)
A zero-length Opus frame (`0x00 0x00`) signals end of speech.
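For illustration, the 0xAA header and frame framing can be packed like this. This is a sketch: the field order follows the table above, but the exact byte order is an assumption, so check the firmware source before relying on it:

```python
import struct

def audio_playback_header(mic_timeout_s: int, volume: int, led_fade: int,
                          compression: int = 1) -> bytes:
    """6-byte 0xAA command header (byte order assumed big-endian)."""
    return struct.pack(">BHBBB", 0xAA, mic_timeout_s, volume, led_fade, compression)

def opus_frame(frame: bytes) -> bytes:
    """Length-prefixed Opus frame for the playback stream."""
    return struct.pack(">H", len(frame)) + frame

# A zero-length frame terminates playback
END_OF_SPEECH = opus_frame(b"")
```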
Following this example you can also integrate e-mail.
### FreeRTOS task layout

| Core | Task | Purpose |
|---|---|---|
| Core 0 | Arduino loop | TCP server, touch/mute input, UART debug |
| Core 1 | `micTask` | I2S read, mu-law encode, UDP send |
| Core 1 | `opusDecodeTask` | TCP read, Opus decode, I2S write (created per playback) |
| Core 1 | `updateLedTask` | 40Hz LED refresh with gamma-corrected fade |

## 📟 Firmware

Irrespective of what you use for development, the quickest and least error-prone setup for building and flashing firmware is probably installing the [Arduino IDE](https://www.arduino.cc/en/software), and then using that IDE, or your preference (e.g. VSCode with Copilot), for development.
* Add the ESP32 boards as detailed [here](https://docs.espressif.com/projects/arduino-esp32/en/latest/installing.html) (TL;DR add `https://espressif.github.io/arduino-esp32/package_esp32_index.json` to `Preferences > Additional Boards Manager URLs`)
* Under Boards Manager, install "esp32" by Espressif Systems
* Under Library Manager, install "Adafruit NeoPixel Library" and "esp32_opus" by sh123 (the `flash.sh` script installs these automatically)
* Clone this repo to `Documents/Arduino` for simplicity
* Add your WiFi credentials to `credentials.h`
* Run `bash setup-git-hash.sh` to add a header with the git hash (optional). This updates automatically after commits and helps track which firmware your devices are running from the server side.
* Open File > Sketchbook > onju-home > onjuino
* Select Tools > Board > esp32 > ESP32S3 Dev Module
* Under Tools ensure:
  * USB CDC on Boot is set to Enabled
  * PSRAM is set to OPI PSRAM
  * The board is plugged in and a Port is selected (you may need to install USB bridge drivers as detailed by Espressif; don't worry if the name is incorrect)
* Build and upload
* If the device doesn't reset, press the reset button. In Serial Monitor you can also send `r` to reset the device (assuming it has already booted)

### Conversation backends
The pipeline supports two conversation backends, selectable via `config.yaml`:
**Local** (`conversation.backend: "local"`): Manages conversation history locally with per-device JSON persistence. Sends the full message history on each LLM request. Works with any OpenAI-compatible endpoint.
**OpenClaw Managed** (`conversation.backend: "managed"`): Delegates session management to an [OpenClaw](https://github.com/openclaw) gateway. Only sends the latest user message -- OpenClaw tracks history server-side using the device ID as the session key. Set `OPENCLAW_GATEWAY_TOKEN` in your environment and point `base_url` at your gateway.
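The practical difference shows up in the request payload; a sketch (field names here are illustrative, not the exact wire format of either backend):

```python
def build_request(backend: str, history: list, user_msg: str, device_id: str) -> dict:
    """Contrast the two backends: local resends everything, managed sends one turn."""
    if backend == "local":
        # Full message history accompanies every request (OpenAI chat format)
        return {"messages": history + [{"role": "user", "content": user_msg}]}
    # Managed: OpenClaw keys the server-side session on the device ID
    return {"session": device_id, "message": user_msg}
```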
## Installation

### Server

```bash
# Clone and set up Python environment
git clone https://github.com/justLV/onju-v2.git
cd onju-v2
uv venv && source .venv/bin/activate
uv pip install -e .

# macOS: install system libraries for Opus encoding
brew install opus portaudio

# Configure
cp pipeline/config.yaml.example pipeline/config.yaml
# Edit config.yaml with your API keys and preferences
```
**ASR** -- an embedded [parakeet-mlx](https://github.com/senstella/parakeet-mlx) server is included (Apple Silicon):

```bash
uv pip install -e ".[asr]"
python -m pipeline.services.asr_server  # runs on port 8100
```

Or point `asr.url` in config.yaml at any Whisper-compatible endpoint.
**LLM** -- any OpenAI-compatible server:

```bash
# Local (mlx_lm on Apple Silicon)
mlx_lm.server --model unsloth/gemma-4-E4B-it-UD-MLX-4bit --port 8080

# Local (Ollama)
ollama run gemma4:e4b

# Cloud -- just set base_url and api_key in config.yaml (default: Haiku via OpenRouter)
```
**TTS** -- [ElevenLabs](https://elevenlabs.io) is the default (set your API key in config.yaml). For fully offline TTS you can use [mlx-audio](https://github.com/lucasnewman/mlx-audio): `uv pip install -e ".[tts-local]"`, then set e.g. `tts.backend: "qwen3"` in config.yaml. I don't think this is the best quality available; it's just included as a reference for a local TTS!
**Run:**

```bash
source .venv/bin/activate
python -m pipeline.main
```
### Firmware

Both targets can be compiled and flashed from the command line:

```bash
# Flash onjuino (default)
./flash.sh

# Flash the M5 Echo target
./flash.sh m5_echo

# Flash to a specific port
./flash.sh /dev/cu.usbmodem1234

# Compile only (no device needed)
./flash.sh compile

# Regenerate WiFi credentials from macOS Keychain (falls back to manual entry)
./flash.sh --regen
```
**Note:** Always run `./flash.sh compile` after making changes to `onjuino.ino` to verify your code compiles before committing.
Requires `arduino-cli`:
```bash
# macOS
brew install arduino-cli
arduino-cli core install esp32:esp32
```

The flash script auto-installs the required libraries (Adafruit NeoPixel, esp32_opus).
WiFi credentials are generated from your macOS Keychain on first flash, or you can edit the `credentials.h.template` files manually.

For Arduino IDE users: select **ESP32S3 Dev Module** (onjuino) or **ESP32 Dev Module** (M5 Echo), enable **USB CDC on Boot** and **OPI PSRAM** (onjuino only), then build and upload.

## 🧩 Hardware
<p float="left">
  <img src="images/copper.png" width="48%" />
  <img src="images/render.png" width="48%" />
</p>
[Preview schematics & PCB here](https://365.altium.com/files/77C755F4-7195-4B29-93AA-0C10A2471AC3). You should be able to download files there; otherwise they are in the `hardware` folder in Altium format. Feel free to modify & improve this design and share your updates!

You can order PCBA's directly from PCBWay [here](https://www.pcbway.com/project/shareproject/Onju_Voice_d33625a1.html). I've used a few suppliers and they are among the most reliable I've experienced for turnkey assembly at that price point, so I'm happy to point business their way. (Other options of selling single units, with margins, ended up forcing a price point above the Google Nest Mini itself, and wouldn't allow shipment into the EU/UK without certification, so I abandoned this.)
If you don't have a custom PCB, you can use the [M5Stack ATOM Echo](https://shop.m5stack.com/products/atom-echo-smart-speaker-dev-kit). I'd recommend adding a Battery Base ([link](https://shop.m5stack.com/products/atomic-battery-base-200mah), [video](https://www.youtube.com/watch?v=OMg3epr53Ns)).

I will be sharing more detailed instructions for replacement.
Replacement gaskets for the microphone & LEDs can be made using [adhesive foam](https://www.amazon.com/gp/product/B07KCJ31J9) and a [punch set](https://www.amazon.com/gp/product/B087D2Z43F), for example.

## Configuration reference

See [`pipeline/config.yaml.example`](pipeline/config.yaml.example) for all options. Key sections:
| Section | What it controls |
|---|---|
| `asr` | Speech-to-text service URL |
| `conversation.backend` | `"local"` or `"managed"` (OpenClaw) |
| `conversation.local` | LLM endpoint, model, system prompt, message history |
| `conversation.managed` | OpenClaw gateway URL, auth token, message channel |
| `tts` | TTS backend (`"elevenlabs"` or `"qwen3"`), voice settings |
| `vad` | Voice activity detection thresholds and timing |
| `network` | UDP/TCP/multicast ports |
| `device` | Volume, mic timeout, LED settings, greeting audio |

### Environment variables

| Variable | Used by |
|---|---|
| `OPENROUTER_API_KEY` | Local backend via OpenRouter (default) |
| `ANTHROPIC_API_KEY` | Local backend via the Anthropic API directly |
| `OPENCLAW_GATEWAY_TOKEN` | Managed (OpenClaw) backend |

## ❓Questions

### Does this replace my Google Nest Mini?

While this replicates the interfaces of the Google Nest Mini, don't expect a 1:1 replacement; for example, it is not intended to be a music playback device (although there is probably no reason it couldn't be developed into one). It's also worth reiterating that, like the Google Nest Mini, this requires a separate server, although that server can be in your home running local models instead of in a Google datacenter.

**The original is well tested, maintained, certified and works out of the box, while this is essentially a dev board with some neat examples for you to build on top of.**
### What if I don't have a Google Nest Mini but still want to use this?

Fortunately they're still being sold, and you may find deals for <$40, which is pretty good for the quality of speaker and form factor. I picked up quite a few from eBay; just make sure you get the 2nd gen.

The adventurous can try replacement shells from [AliExpress](https://www.aliexpress.us/item/3256803723188315.html) for example, but you'll still need a base, power input, mute switch, speaker & mount, capacitive touch panels, replacement gaskets etc. A hero out there could design a custom enclosure that fits an off-the-shelf speaker.

### But I'm really impatient and want to get hacking away! What can I do?

a) If you can commit to making significant contributions to the codebase and/or major contributions to the board design or RF review, we may be able to make early samples available.

b) If you don't need the form factor, don't mind rolling up your sleeves, and have some HW experience, you can breadboard it out with readily available components until you can get your hands on an order. Here are the components that should get a demo running (🌸 Adafruit links for convenience, but shop around wherever you'd like):

* ESP32-S3 devboard, ideally w/ PSRAM (e.g. [QT Py S3](https://www.adafruit.com/product/5700) or [ESP32-S3](https://www.adafruit.com/product/5364))
* [Microphone](https://www.adafruit.com/product/3421) (only need 1 for the Arduino implementation; ensure it's a SPH0645 to limit debugging)
* [Amplifier](https://www.adafruit.com/product/3006)
* [Speaker](https://www.adafruit.com/product/1313)
* [Neopixel LED strip](https://www.adafruit.com/product/1426) - just set the firmware to the correct #
* [Breadboard & wire kit](https://www.adafruit.com/product/3314) (you can use protruding pieces of wire for cap touch)

You'll need to update the `custom_boards.h` with your pin mapping.

## Testing

```bash
# Emulate an ESP32 device (no hardware needed)
python test_client.py               # localhost
python test_client.py 192.168.1.50  # remote server

# Test speaker output (send an audio file to a device over TCP with Opus encoding)
python test_speaker.py <device-ip>

# Test mic input (receive and record UDP audio)
python test_mic.py --duration 10

# Serial monitor (auto-detects USB port)
python serial_monitor.py test.wav
```

## Voice agent architecture

The current pipeline implements a simple listen-transcribe-respond-speak loop. The architecture is designed to evolve toward a voice-native agent loop where the LLM can call tools, narrate results, and stream speech in real time:

```
LLM (streaming) ──→ sentence buffer ──→ TTS ──→ Opus encode ──→ ESP32
        │
        └──→ tool calls ──→ execute ──→ feed results back ──→ LLM continues
```
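The sentence-buffering stage of that loop can be sketched in a few lines (function names are illustrative, not the pipeline's actual API):

```python
import re

_SENTENCE_END = re.compile(r"([.!?])\s+")

def stream_to_tts(token_stream, speak):
    """Flush complete sentences to TTS as LLM tokens stream in,
    so playback can start before the full reply is generated."""
    buf = ""
    for token in token_stream:
        buf += token
        # Emit every complete sentence currently in the buffer
        while (m := _SENTENCE_END.search(buf)):
            speak(buf[:m.end(1)].strip())
            buf = buf[m.end():]
    if buf.strip():
        speak(buf.strip())  # trailing fragment
```

Tool calls would interleave here: when the stream yields a tool-call event instead of text, execute it, feed the result back, and keep buffering.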
## UART debug commands

Both firmware targets support serial commands at 115200 baud:

| Key | Action |
|---|---|
| `r` | Reboot |
| `M` | Enable mic for 10 minutes |
| `m` | Disable mic |
| `A` | Re-send multicast announcement |
| `c` | Enter config mode (WiFi, server, volume) |
| `W`/`w` | LED test fast/slow (onjuino) |
| `P` | Play 440Hz test tone (M5 Echo) |

## **🍐 PR's, issues, suggestions & general feedback welcome! 🏡**
## License
MIT
171 TESTING.md
@ -1,171 +0,0 @@
# ESP32 Audio Streaming Test Guide

## Quick Start

### 1. Configure ESP32 Settings

In `onjuino/onjuino.ino`, adjust these settings (lines 78-84):

```cpp
#define USE_COMPRESSION true     // Enable μ-law compression (2x bandwidth reduction)
#define USE_LOCAL_VAD true       // Enable local VAD to sleep when silent
#define VAD_RMS_THRESHOLD 3000   // RMS threshold to detect voice (tune based on your mic)
#define VAD_SILENCE_FRAMES 100   // Frames of silence before sleep (100 * 32ms = 3.2 seconds)
#define VAD_WAKEUP_FRAMES 2      // Frames of voice to wake up (2 * 32ms = 64ms)
```

**Testing configurations:**

| Test | USE_COMPRESSION | USE_LOCAL_VAD | Expected Bandwidth |
|------|----------------|---------------|-------------------|
| Baseline | false | false | ~32 kbps continuous |
| Compression only | true | false | ~16 kbps continuous |
| VAD only | false | true | ~32 kbps when talking |
| Both (optimal) | true | true | ~16 kbps when talking |

### 2. Flash ESP32

```bash
# Open Arduino IDE, select your board, upload onjuino/onjuino.ino
```

### 3. Run Test Receiver on Mac

Install dependencies:
```bash
pip3 install numpy
```

Run receiver:
```bash
# Auto-detect compression mode
python3 test_mic_receiver.py --duration 10 --output test.wav

# Or specify if you know the mode
python3 test_mic_receiver.py --compressed --duration 10 --output test_compressed.wav
```

### 4. Analyze Results

The receiver will show real-time stats:
```
[ 5.1s] Packets: 167 | Bandwidth: 15.8 kbps | RMS: 2847 | Mode: μ-law
```

After recording, you'll see:
```
Recording complete!
  Duration: 10.02 seconds
  WAV file size: 320.6 KB
  Bytes transmitted: 160.3 KB
  Compression ratio: 0.50x
  Average bandwidth: 15.9 kbps
  Packets received: 334
  Packet loss: 0.0%
```

## Tuning VAD Threshold

The `VAD_RMS_THRESHOLD` value depends on your microphone sensitivity and ambient noise:

1. **Test ambient noise:**
   ```bash
   # Record silence, watch RMS values
   python3 test_mic_receiver.py --duration 5
   ```
   Note the RMS during silence (e.g., 500-1000)

2. **Test speaking:**
   ```bash
   # Record yourself talking, watch RMS values
   python3 test_mic_receiver.py --duration 5
   ```
   Note the RMS while speaking (e.g., 3000-8000)

3. **Set threshold between them:**
   ```cpp
   // If silence = 800, speech = 4000, set threshold around 2000-2500
   #define VAD_RMS_THRESHOLD 2500
   ```
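The rule of thumb in step 3 can be scripted as a simple midpoint between the two measured RMS values (a sketch; pick whatever margin suits your room):

```python
def pick_vad_threshold(silence_rms: int, speech_rms: int) -> int:
    """Midpoint between ambient-noise RMS and speech RMS, per the rule above."""
    return (silence_rms + speech_rms) // 2

print(pick_vad_threshold(800, 4000))  # 2400, inside the suggested 2000-2500 band
```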
## Bandwidth Comparison

| Configuration | Bandwidth | Power Saving | Audio Quality |
|--------------|-----------|--------------|---------------|
| Raw PCM, always on | 32 kbps | None | Perfect |
| μ-law, always on | 16 kbps | None | Good (telephony quality) |
| Raw PCM, VAD | ~10 kbps avg* | Moderate | Perfect |
| μ-law, VAD | ~5 kbps avg* | High | Good |

*Assuming 30% voice activity (typical conversation)
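The starred averages are just the continuous rate scaled by the duty cycle:

```python
def avg_bandwidth(continuous_kbps: float, voice_activity: float) -> float:
    """With VAD, the link is only active while voice is detected."""
    return continuous_kbps * voice_activity

# 30% activity: raw PCM 32 -> ~10, mu-law 16 -> ~5 (matches the table)
print(avg_bandwidth(32, 0.3), avg_bandwidth(16, 0.3))
```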
## Compression Quality Check

Listen to the output WAV files:
```bash
# Mac built-in player
afplay test.wav
afplay test_compressed.wav

# Compare side-by-side
```

μ-law quality should be:
- ✅ Clear speech
- ✅ Good for voice recognition (Whisper handles it well)
- ⚠️ Slightly muffled compared to raw PCM
- ⚠️ Not suitable for music

## Troubleshooting

**No packets received:**
- Check ESP32 Serial output for IP address
- Verify ESP32 and Mac are on same network
- Check firewall settings

**High packet loss:**
- Check WiFi signal strength
- Reduce `VAD_SILENCE_FRAMES` to keep connection active
- Try raw PCM mode first (simpler debugging)

**VAD not working:**
- Adjust `VAD_RMS_THRESHOLD` (see tuning section above)
- Check Serial monitor for "VAD: Woke up" / "VAD: Sleeping" messages
- Set `USE_LOCAL_VAD false` to test without VAD

**Compression artifacts:**
- μ-law is lossy - some quality loss is normal
- If unacceptable, use `USE_COMPRESSION false`
- Or try ADPCM (4x compression, better quality - future work)

## Next Steps

Once basic UDP streaming is working:
1. Integrate with your existing server.py VAD pipeline
2. Update server to handle compressed packets
3. Consider WebSocket for playback direction
4. Add streaming TTS for lower latency

## Server Integration

Update `server/server.py` to handle compression:

```python
import numpy as np

# Add μ-law decode table (same as test receiver)
ULAW_DECODE_TABLE = np.array([...])

def decode_ulaw(ulaw_bytes):
    return ULAW_DECODE_TABLE[np.frombuffer(ulaw_bytes, dtype=np.uint8)]

# In listen_detect function:
data, addr = sock.recvfrom(2048)

if len(data) == 512:   # Compressed
    samples = decode_ulaw(data)
elif len(data) == 1024:  # Raw
    samples = np.frombuffer(data, dtype=np.int16)

# Continue with existing VAD pipeline...
```
@ -1,6 +1,6 @@
# M5 Echo - Push-to-Talk Voice Client

Firmware for the [M5Stack ATOM Echo](https://shop.m5stack.com/products/atom-echo-smart-speaker-dev-kit) that connects to the pipeline server using the same protocol as onjuino, but with push-to-talk (PTT) instead of voice-activity detection (VAD).

## How it differs from onjuino

@ -69,17 +69,18 @@ Config mode is also accessible during WiFi connection (if the stored SSID is wro
## Building and flashing

From the repo root:

```bash
./flash.sh m5_echo           # auto-detect port, compile and flash
./flash.sh m5_echo compile   # compile only
./flash.sh m5_echo --regen   # regenerate WiFi credentials from keychain
```

Requires `arduino-cli` with the `esp32:esp32` core. The flash script auto-installs required libraries (Adafruit NeoPixel, esp32_opus).

## Testing

```bash
python serial_monitor.py   # serial monitor (auto-detects USB port)
```
150 m5_echo/flash.sh
@ -1,150 +0,0 @@
#!/bin/bash
set -euo pipefail

REPO="$(cd "$(dirname "$0")" && pwd)"
TEMPLATE="$REPO/credentials.h.template"
OUTPUT="$REPO/credentials.h"
FQBN="esp32:esp32:pico32:UploadSpeed=115200"
BUILD_DIR="$REPO/build"

COMPILE_ONLY=false
REGEN=false
FORCE_COMPILE=false
NO_MONITOR=false
PORT=""

for arg in "$@"; do
  case "$arg" in
    compile|compile-only) COMPILE_ONLY=true ;;
    --regen) REGEN=true; FORCE_COMPILE=true ;;
    --force) FORCE_COMPILE=true ;;
    --no-monitor) NO_MONITOR=true ;;
    -h|--help)
      echo "Usage: flash.sh [options] [port]"
      echo "  flash.sh                       # Auto-detect port and upload"
      echo "  flash.sh /dev/cu.usbserial-xxx # Upload to specific port"
      echo "  flash.sh compile               # Compile only"
      echo "  flash.sh --regen               # Regenerate WiFi credentials"
      echo "  flash.sh --force               # Force recompile"
      echo "  flash.sh --no-monitor          # Skip serial monitor after flash"
      exit 0 ;;
    /dev/*) PORT="$arg" ;;
    *) echo "Unknown option: $arg"; exit 1 ;;
  esac
done

# ── Credentials ──────────────────────────────────────────────
if [ -f "$OUTPUT" ] && [ "$REGEN" = false ]; then
  echo "Using existing credentials.h (pass --regen to regenerate)"
else
  WIFI_SSID=""
  WIFI_IF=$(networksetup -listallhardwareports 2>/dev/null | awk '/Wi-Fi/{getline; print $2}')
  WIFI_IF="${WIFI_IF:-en0}"

  WIFI_SSID=$(networksetup -getairportnetwork "$WIFI_IF" 2>/dev/null | sed 's/Current Wi-Fi Network: //')
  if [ -z "$WIFI_SSID" ] || [[ "$WIFI_SSID" == *"not associated"* ]] || [[ "$WIFI_SSID" == *"Error"* ]]; then
    WIFI_SSID=""
  fi

  if [ -z "$WIFI_SSID" ]; then
    PREFERRED=$(networksetup -listpreferredwirelessnetworks "$WIFI_IF" 2>/dev/null | tail -n +2 | sed 's/^[[:space:]]*//')
    if [ -n "$PREFERRED" ]; then
      TOP_SSID=$(echo "$PREFERRED" | head -1)
      echo "Known WiFi networks:"
      echo "$PREFERRED" | head -5 | cat -n
      echo ""
      read -p "WiFi SSID [$TOP_SSID]: " WIFI_SSID
      WIFI_SSID="${WIFI_SSID:-$TOP_SSID}"
    fi
  fi

  [ -z "$WIFI_SSID" ] && read -p "WiFi SSID: " WIFI_SSID
  [ -z "$WIFI_SSID" ] && { echo "ERROR: No WiFi SSID"; exit 1; }
  echo "WiFi SSID: $WIFI_SSID"

  echo "Retrieving WiFi password from Keychain..."
  WIFI_PASSWORD=$(security find-generic-password -wa "$WIFI_SSID" 2>/dev/null || true)
  if [ -z "$WIFI_PASSWORD" ]; then
    read -sp "WiFi password: " WIFI_PASSWORD
    echo ""
  fi
  [ -z "$WIFI_PASSWORD" ] && { echo "ERROR: No WiFi password"; exit 1; }

  sed -e "s|{{WIFI_SSID}}|${WIFI_SSID}|g" \
      -e "s|{{WIFI_PASSWORD}}|${WIFI_PASSWORD}|g" \
      "$TEMPLATE" > "$OUTPUT"
  echo "Generated credentials.h"
fi
echo ""

# ── Check if compile needed ──────────────────────────────────
NEEDS_COMPILE=true
if [ "$FORCE_COMPILE" = false ] && [ -d "$BUILD_DIR" ]; then
  BIN=$(find "$BUILD_DIR" -name "*.bin" -maxdepth 1 2>/dev/null | head -1)
  if [ -n "$BIN" ]; then
    NEWER=$(find "$REPO" -maxdepth 1 \( -name "*.ino" -o -name "*.h" \) -newer "$BIN" 2>/dev/null | head -1)
    [ -z "$NEWER" ] && NEEDS_COMPILE=false
  fi
fi

# ── Compile ──────────────────────────────────────────────────
compile_firmware() {
  if [ "$NEEDS_COMPILE" = true ]; then
    echo "Compiling..."
    cd "$REPO"
    arduino-cli compile --fqbn "$FQBN" --build-path "$BUILD_DIR" m5_echo.ino || exit 1
    echo "Compilation successful!"
  else
    echo "No source changes, skipping compile"
  fi
}

if [ "$COMPILE_ONLY" = true ]; then
  compile_firmware
  exit 0
fi

compile_firmware

# ── Detect port ──────────────────────────────────────────────
if [ -z "$PORT" ]; then
  PORT=$(ls /dev/cu.usbserial-* 2>/dev/null | head -n 1 || true)
  if [ -z "$PORT" ]; then
    PORT=$(ls /dev/cu.usbmodem* 2>/dev/null | head -n 1 || true)
  fi
  if [ -z "$PORT" ]; then
    echo "Error: No USB serial port found"
    exit 1
  fi
  echo "Auto-detected port: $PORT"
fi

echo ""
echo "Flashing to $PORT..."

pkill -f "serial_monitor" 2>/dev/null || true
pkill -f "python.*serial" 2>/dev/null || true
sleep 1

cd "$REPO"
arduino-cli upload --fqbn "$FQBN" --port "$PORT" --input-dir "$BUILD_DIR" m5_echo.ino

if [ $? -eq 0 ]; then
  echo ""
  echo "Upload successful!"
  if [ "$NO_MONITOR" = true ]; then
    exit 0
  fi
  echo ""
  echo "Starting serial monitor..."
  sleep 2
  # Try the repo-level serial monitor, fall back to arduino-cli
  if [ -f "$REPO/../serial_monitor.py" ]; then
    python3 "$REPO/../serial_monitor.py" "$PORT"
  else
    arduino-cli monitor -p "$PORT" -c baudrate=115200
  fi
else
  echo "Upload failed. Try: hold the button while plugging in USB, then run again."
  exit 1
fi
@ -1,59 +0,0 @@
#!/usr/bin/env python3
"""
M5 Echo serial terminal + test tools.

Commands (type in terminal):
  P - Play local 440Hz tone (pure I2S, no network)
  T - Raw mic test (prints sample values)
  M - Force mic on for 10s
  m - Force mic off
  A - Send multicast announcement
  r - Reboot device
  c - Enter config mode (ssid/pass/server/volume)
  q - Quit this terminal
"""

import serial
import sys
import threading
import time

PORT = sys.argv[1] if len(sys.argv) > 1 else None

if not PORT:
    import glob
    ports = glob.glob('/dev/cu.usbserial-*')
    if not ports:
        print("No /dev/cu.usbserial-* found")
        sys.exit(1)
    PORT = ports[0]

print(f"Connecting to {PORT}...")
s = serial.Serial(PORT, 115200, timeout=0.1)

def reader():
    while True:
        try:
            line = s.readline()
            if line:
                print(line.decode('utf-8', errors='replace'), end='')
        except Exception:
            break

t = threading.Thread(target=reader, daemon=True)
t.start()

print("Connected. Type single-char commands (P=tone, T=mic test, r=reboot, q=quit)")
print()

try:
    while True:
        ch = input()
        if ch == 'q':
            break
        if ch:
            s.write(ch[0].encode())
except (KeyboardInterrupt, EOFError):
    pass

s.close()
@ -1,179 +0,0 @@
#!/usr/bin/env python3
"""
Integration test for M5 Echo firmware.
1. Listens for multicast announcement
2. Waits for button press (UDP mic audio)
3. Records a few seconds of mic audio, saves to wav
4. Sends a 440Hz test tone back via TCP
"""

import socket
import struct
import time
import wave
import math
import sys
import threading

DEVICE_IP = None
UDP_PORT = 3000
TCP_PORT = 3001
SAMPLE_RATE = 16000
TONE_FREQ = 440
TONE_DURATION = 2.0  # seconds
SPEAKER_VOLUME = 12

# μ-law decode (same as firmware)
def ulaw_to_linear(ulaw_byte):
    BIAS = 0x84
    ulaw_byte = ~ulaw_byte & 0xFF
    sign = ulaw_byte & 0x80
    exponent = (ulaw_byte >> 4) & 0x07
    mantissa = ulaw_byte & 0x0F
    sample = ((mantissa << 3) + BIAS) << exponent
    sample -= BIAS
    if sign:
        sample = -sample
    return sample

def generate_tone(freq, duration, sample_rate, volume):
    """Generate PCM 16-bit tone, returns bytes (little-endian)"""
    samples = int(sample_rate * duration)
    data = bytearray()
    for i in range(samples):
        t = i / sample_rate
        # Use a moderate amplitude so volume shift doesn't clip
        val = int(8000 * math.sin(2 * math.pi * freq * t))
        data += struct.pack('<h', val)
    return bytes(data)

def listen_multicast():
    """Listen for device multicast announcement"""
    global DEVICE_IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', 12345))
    mreq = struct.pack("4sl", socket.inet_aton("239.0.0.1"), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(5)
    try:
        data, addr = sock.recvfrom(1024)
        DEVICE_IP = addr[0]
        print(f"Device announced: {data.decode()} from {DEVICE_IP}")
    except socket.timeout:
        print("No multicast received (device may have already booted)")
    sock.close()

def receive_mic_audio(duration=3.0):
    """Listen for UDP mic audio packets, decode μ-law, save to wav"""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('', UDP_PORT))
|
||||
sock.settimeout(30)
|
||||
|
||||
print(f"\nWaiting for mic audio on UDP port {UDP_PORT}...")
|
||||
print("Press and hold the button on the ATOM Echo to talk")
|
||||
|
||||
all_samples = []
|
||||
packets = 0
|
||||
start = None
|
||||
|
||||
try:
|
||||
while True:
|
||||
data, addr = sock.recvfrom(2048)
|
||||
if start is None:
|
||||
start = time.time()
|
||||
if DEVICE_IP is None:
|
||||
globals()['DEVICE_IP'] = addr[0]
|
||||
print(f"Receiving audio from {addr[0]}... (hold button for {duration}s)")
|
||||
|
||||
# Decode μ-law
|
||||
for byte in data:
|
||||
all_samples.append(ulaw_to_linear(byte))
|
||||
packets += 1
|
||||
|
||||
if time.time() - start > duration:
|
||||
break
|
||||
except socket.timeout:
|
||||
print("Timeout waiting for audio")
|
||||
sock.close()
|
||||
return False
|
||||
|
||||
sock.close()
|
||||
print(f"Received {packets} packets, {len(all_samples)} samples ({len(all_samples)/SAMPLE_RATE:.1f}s)")
|
||||
|
||||
# Save to wav
|
||||
with wave.open('mic_test.wav', 'w') as wf:
|
||||
wf.setnchannels(1)
|
||||
wf.setsampwidth(2)
|
||||
wf.setframerate(SAMPLE_RATE)
|
||||
for s in all_samples:
|
||||
wf.writeframes(struct.pack('<h', max(-32768, min(32767, s))))
|
||||
|
||||
print("Saved to mic_test.wav")
|
||||
|
||||
# Check if we got real audio (not just silence)
|
||||
peak = max(abs(s) for s in all_samples) if all_samples else 0
|
||||
rms = (sum(s*s for s in all_samples) / len(all_samples)) ** 0.5 if all_samples else 0
|
||||
print(f"Peak: {peak}, RMS: {rms:.0f}")
|
||||
if peak < 100:
|
||||
print("WARNING: Audio appears silent - check mic")
|
||||
return True
|
||||
|
||||
def send_tone():
|
||||
"""Send a test tone to the device via TCP"""
|
||||
if DEVICE_IP is None:
|
||||
print("No device IP known, skipping tone test")
|
||||
return False
|
||||
|
||||
print(f"\nSending {TONE_FREQ}Hz tone to {DEVICE_IP}:{TCP_PORT}...")
|
||||
|
||||
tone_data = generate_tone(TONE_FREQ, TONE_DURATION, SAMPLE_RATE, SPEAKER_VOLUME)
|
||||
|
||||
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||||
sock.settimeout(5)
|
||||
try:
|
||||
sock.connect((DEVICE_IP, TCP_PORT))
|
||||
except (socket.timeout, ConnectionRefusedError) as e:
|
||||
print(f"Failed to connect: {e}")
|
||||
return False
|
||||
|
||||
# Send 6-byte header: 0xAA, timeout(2), volume, fade_rate, compression(0=PCM)
|
||||
timeout_sec = 5
|
||||
header = bytes([
|
||||
0xAA,
|
||||
(timeout_sec >> 8) & 0xFF, timeout_sec & 0xFF,
|
||||
SPEAKER_VOLUME,
|
||||
10, # fade rate
|
||||
0, # compression = PCM
|
||||
])
|
||||
sock.send(header)
|
||||
sock.send(tone_data)
|
||||
sock.close()
|
||||
print(f"Sent {len(tone_data)} bytes ({TONE_DURATION}s tone)")
|
||||
return True
|
||||
|
||||
if __name__ == '__main__':
|
||||
print("=== M5 Echo Audio Integration Test ===\n")
|
||||
|
||||
# If device IP passed as arg, use it
|
||||
if len(sys.argv) > 1:
|
||||
DEVICE_IP = sys.argv[1]
|
||||
print(f"Using device IP: {DEVICE_IP}")
|
||||
else:
|
||||
listen_multicast()
|
||||
|
||||
# Step 1: Test mic (receive audio)
|
||||
mic_ok = receive_mic_audio(duration=3.0)
|
||||
|
||||
# Step 2: Test speaker (send tone)
|
||||
if mic_ok:
|
||||
print("\nRelease the button now, then press Enter to send test tone...")
|
||||
input()
|
||||
spk_ok = send_tone()
|
||||
|
||||
if spk_ok:
|
||||
time.sleep(TONE_DURATION + 0.5)
|
||||
print("\nDid you hear the tone? (y/n)")
|
||||
|
||||
print("\n=== Test complete ===")
|
||||
|
|
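The scalar `ulaw_to_linear` above decodes one byte at a time, which is fine for a short test capture. For longer recordings the same table logic vectorizes cleanly with numpy; a sketch for illustration (not part of the repo, and `ulaw_decode_vec` is a hypothetical name):

```python
import numpy as np


def ulaw_decode_vec(data: bytes) -> np.ndarray:
    """Vectorized μ-law decode, mirroring the scalar bias/shift logic."""
    BIAS = 0x84
    u = ~np.frombuffer(data, dtype=np.uint8) & 0xFF   # complement, as in the scalar version
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    # widen to int32 before shifting so large exponents don't overflow uint8
    samples = (((mantissa.astype(np.int32) << 3) + BIAS) << exponent) - BIAS
    return np.where(sign != 0, -samples, samples).astype(np.int16)
```

The output range is at most ±32124, so it always fits in `int16` without clipping.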
@@ -1,7 +0,0 @@
#ifndef CREDENTIALS_H
#define CREDENTIALS_H

#define WIFI_SSID "your_ssid"
#define WIFI_PASSWORD "your_password"

#endif
@@ -5,17 +5,17 @@ conversation:
   backend: "local"  # "local" or "managed" (e.g. OpenClaw)

   local:
-    base_url: "http://localhost:8080/v1"  # mlx_lm.server, Ollama, OpenRouter, Gemini, etc.
-    api_key: "none"  # set key or use ${ENV_VAR} reference
-    model: "mlx-community/gemma-3-4b-it-qat-4bit"
+    base_url: "https://openrouter.ai/api/v1"  # OpenRouter, Ollama, mlx_lm.server, Gemini, etc.
+    api_key: "${OPENROUTER_API_KEY}"  # set key or use ${ENV_VAR} reference
+    model: "anthropic/claude-haiku-4.5"
     max_messages: 20
     max_tokens: 300
     system_prompt: "You are a helpful voice assistant. Keep responses concise (under 2 sentences)."
     persist_dir: "data/conversations"  # per-device message history (omit to disable)
-    # Gemini example:
-    # base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
-    # api_key: "${GEMINI_API_KEY}"
-    # model: "gemini-3.1-flash-lite-preview"
+    # Local example (Ollama):
+    # base_url: "http://localhost:11434/v1"
+    # api_key: "none"
+    # model: "gemma4:e4b"

   managed:
     base_url: "http://127.0.0.1:18789/v1"  # OpenClaw gateway
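The `${ENV_VAR}` syntax in `api_key` implies the config loader substitutes environment variables when the YAML is read. A minimal sketch of that substitution, assuming a plain `${VAR}` pattern (`expand_env` is a hypothetical helper for illustration, not the repo's actual loader):

```python
import os
import re

# matches ${VAR_NAME} where VAR_NAME is a valid environment variable identifier
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str) -> str:
    """Replace ${VAR} references with the environment value (empty string if unset)."""
    return _ENV_REF.sub(lambda m: os.environ.get(m.group(1), ""), value)
```

With this, `expand_env("${OPENROUTER_API_KEY}")` yields the key from the environment, while literal values such as `"none"` pass through unchanged.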
pipeline/services/asr_server.py (new file, +111)
@@ -0,0 +1,111 @@
"""
Embedded ASR server using parakeet-mlx (Apple Silicon).

Run as a separate process:
    python -m pipeline.services.asr_server
    python -m pipeline.services.asr_server --port 8100 --model mlx-community/parakeet-tdt-0.6b-v3

Install dependencies:
    uv pip install -e ".[asr]"
"""

import logging
import os
import tempfile
import time
import traceback

from fastapi import FastAPI, File, Request, UploadFile
from fastapi.responses import JSONResponse

MODEL_ID = os.environ.get("ASR_MODEL", "mlx-community/parakeet-tdt-0.6b-v3")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("parakeet")

app = FastAPI()
model = None


@app.exception_handler(Exception)
async def unhandled_exception_handler(request: Request, exc: Exception):
    logger.error(
        "Unhandled exception on %s %s\n%s",
        request.method,
        request.url.path,
        traceback.format_exc(),
    )
    return JSONResponse(status_code=500, content={"error": str(exc)})


@app.on_event("startup")
async def load_model():
    global model
    from parakeet_mlx import from_pretrained

    tic = time.time()
    try:
        model = from_pretrained(MODEL_ID)
        logger.info("Model %s loaded in %.1fs", MODEL_ID, time.time() - tic)
    except Exception:
        logger.error("Failed to load model %s\n%s", MODEL_ID, traceback.format_exc())
        raise


@app.get("/health")
async def health():
    return {"status": "ok" if model else "loading", "model": MODEL_ID}


@app.post("/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    raw = await audio.read()

    ext = os.path.splitext(audio.filename or "audio.wav")[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as f:
        f.write(raw)
        tmp_path = f.name

    try:
        tic = time.time()
        result = model.transcribe(tmp_path)
        elapsed = time.time() - tic
    except Exception:
        logger.error(
            "Transcription failed for file %s (ext=%s)\n%s",
            audio.filename,
            ext,
            traceback.format_exc(),
        )
        raise
    finally:
        os.unlink(tmp_path)

    text = result.text.strip()
    duration_s = result.sentences[-1].end if result.sentences else 0.0

    return {
        "text": text,
        "duration_s": round(duration_s, 2),
        "transcribe_time_s": round(elapsed, 3),
    }


if __name__ == "__main__":
    import argparse

    import uvicorn

    parser = argparse.ArgumentParser(description="Parakeet ASR server")
    parser.add_argument("--port", type=int, default=8100)
    parser.add_argument("--host", default="0.0.0.0")
    parser.add_argument("--model", default=None, help="Override ASR_MODEL env var")
    args = parser.parse_args()

    if args.model:
        MODEL_ID = args.model

    uvicorn.run(app, host=args.host, port=args.port)
@@ -1,7 +1,11 @@
 [project]
 name = "onju-voice"
-version = "0.1.0"
+version = "2.0.0"
+description = "A hackable AI home assistant platform using the Google Nest Mini form factor"
 requires-python = ">=3.11"
+license = "MIT"
+authors = [{ name = "justLV" }]
+readme = "README.md"
 dependencies = [
     "httpx",
     "numpy",
@@ -15,9 +19,14 @@ dependencies = [
     "pyserial",
 ]

+[project.urls]
+Homepage = "https://github.com/justLV/onju-voice"
+Repository = "https://github.com/justLV/onju-v2"
+
 [tool.setuptools.packages.find]
 include = ["pipeline*"]

 [project.optional-dependencies]
+asr = ["fastapi", "uvicorn", "parakeet-mlx", "python-multipart"]
 tts-local = ["mlx-audio>=0.3.1"]
 mic = ["pyaudio"]
@@ -1,97 +0,0 @@
#!/usr/bin/env python3
"""
Greet an ESP32 (enables mic) then record its audio via UDP.

Usage:
    python record_from_esp32.py <ip>
    python record_from_esp32.py 192.168.1.50 --duration 15 --output recording.wav
"""
import argparse
import asyncio
import socket
import time
import wave

import numpy as np

from pipeline.audio import decode_ulaw
from pipeline.protocol import send_audio

SAMPLE_RATE = 16000
CHUNK_SIZE = 512


async def main():
    parser = argparse.ArgumentParser(description="Greet ESP32 and record mic audio")
    parser.add_argument("ip", help="ESP32 IP address")
    parser.add_argument("--port", type=int, default=3001, help="TCP port (default: 3001)")
    parser.add_argument("--udp-port", type=int, default=3000, help="UDP port (default: 3000)")
    parser.add_argument("--duration", type=int, default=10, help="Recording duration in seconds")
    parser.add_argument("--output", type=str, default="recording.wav", help="Output WAV file")
    parser.add_argument("--mic-timeout", type=int, default=60, help="Mic timeout in seconds")
    parser.add_argument("--volume", type=int, default=14)
    args = parser.parse_args()

    print(f"Greeting {args.ip}:{args.port} (enabling mic for {args.mic_timeout}s)...")
    await send_audio(args.ip, args.port, b"",
                     mic_timeout=args.mic_timeout, volume=args.volume, fade=5)
    print("Mic enabled")
    await asyncio.sleep(0.5)

    print(f"Recording from UDP :{args.udp_port} for {args.duration}s...")
    print("Talk now!\n")

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", args.udp_port))
    sock.settimeout(1.0)

    audio_frames = []
    packet_count = 0
    start_time = time.time()

    try:
        while (time.time() - start_time) < args.duration:
            try:
                data, addr = sock.recvfrom(2048)
                packet_count += 1

                if len(data) == CHUNK_SIZE:
                    samples = decode_ulaw(data)
                elif len(data) == CHUNK_SIZE * 2:
                    samples = np.frombuffer(data, dtype=np.int16)
                else:
                    continue

                audio_frames.append(samples)

                if packet_count % 10 == 0:
                    elapsed = time.time() - start_time
                    rms = np.sqrt(np.mean(samples.astype(np.float32) ** 2))
                    print(f"\r[{elapsed:4.1f}s] Packets: {packet_count:3d} | RMS: {rms:5.0f}", end="", flush=True)

            except socket.timeout:
                continue

    except KeyboardInterrupt:
        print("\n\nStopped")

    sock.close()

    if not audio_frames:
        print("\nNo audio received!")
        return

    audio_data = np.concatenate(audio_frames)
    duration = len(audio_data) / SAMPLE_RATE

    with wave.open(args.output, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(audio_data.tobytes())

    print(f"\n\nSaved {duration:.1f}s to {args.output} ({packet_count} packets)")


if __name__ == "__main__":
    asyncio.run(main())
@@ -1,38 +0,0 @@
#!/bin/bash
# Run Opus test while monitoring serial output

ESP32_IP=${1:-192.168.68.95}
AUDIO_FILE=${2:-/Users/justin/Desktop/her_8s.m4a}

echo "Starting serial monitor in background..."
~/.local/share/mise/installs/python/3.12.12/bin/python3 -c "
import glob, serial, time

port = sorted(glob.glob('/dev/cu.usbmodem*'))[0]
ser = serial.Serial(port, 115200, timeout=0.1)

print('=== SERIAL OUTPUT ===')
while True:
    try:
        if ser.in_waiting:
            line = ser.readline().decode('utf-8', errors='ignore').strip()
            if line:
                print(f'[ESP32] {line}')
    except:
        break
    time.sleep(0.01)
" &
SERIAL_PID=$!

sleep 2

echo ""
echo "Running Opus test..."
~/.local/share/mise/installs/python/3.12.12/bin/python3 test_opus_tts.py "$ESP32_IP" "$AUDIO_FILE"

echo ""
echo "Waiting for remaining serial output..."
sleep 3

kill $SERIAL_PID 2>/dev/null
echo "Done"
@@ -16,8 +16,8 @@ import glob

 def find_usb_port():
     """Auto-detect USB serial port"""
-    # Look for ESP32 USB ports
-    ports = glob.glob('/dev/cu.usbmodem*')
+    # Look for ESP32 USB ports (usbmodem = ESP32-S3, usbserial = ESP32-PICO/M5)
+    ports = glob.glob('/dev/cu.usbmodem*') + glob.glob('/dev/cu.usbserial-*')
     if ports:
         return sorted(ports)[0]  # Return first match
     return None
@@ -1,334 +0,0 @@
#!/usr/bin/env python3
"""
Benchmark Qwen3-TTS 1.7B (4-bit vs 8-bit) via mlx-audio server.

Restarts the server between models for accurate memory measurement.
Also generates voice-cloned samples using data/her.mp3.

Usage: .venv/bin/python test_qwen3_tts.py
"""

import io
import os
import signal
import subprocess
import sys
import time
import json
import urllib.request
import wave

SERVER = "http://localhost:8880"
VENV_PYTHON = os.path.join(os.path.dirname(__file__), ".venv", "bin", "python")
REF_AUDIO = os.path.join(os.path.dirname(__file__), "data", "her.mp3")

# Focus on 1.7B: 4-bit vs 8-bit
MODELS = [
    ("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-4bit", "Vivian"),
    ("mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit", "Vivian"),
]

# For voice cloning we need Base models (CustomVoice doesn't support ref_audio)
CLONE_MODELS = [
    "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-4bit",
    "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit",
]

SENTENCES = [
    "Hey there!",
    "I'd be happy to help you with that.",
    "So the main thing to keep in mind is that the process has three steps.",
    "First, you'll want to gather all the required documents and submit them through the online portal.",
    "Let me know if you need any clarification!",
]

CLONE_TEXT = "I think I understand what you're saying. Sometimes things just don't work out the way we planned, and that's okay."


def get_memory_mb():
    """Get wired+active memory on macOS via vm_stat."""
    try:
        out = subprocess.check_output(["vm_stat"], text=True)
        page_size = 16384  # Apple Silicon
        stats = {}
        for line in out.strip().split("\n")[1:]:
            parts = line.split(":")
            if len(parts) == 2:
                key = parts[0].strip()
                val = parts[1].strip().rstrip(".")
                try:
                    stats[key] = int(val)
                except ValueError:
                    pass
        wired = stats.get("Pages wired down", 0) * page_size
        active = stats.get("Pages active", 0) * page_size
        return (wired + active) / (1024 * 1024)
    except Exception:
        return 0


def kill_server():
    """Kill any process on port 8880."""
    try:
        pids = subprocess.check_output(["lsof", "-ti", ":8880"], text=True).strip()
        for pid in pids.split("\n"):
            if pid:
                os.kill(int(pid), signal.SIGKILL)
        time.sleep(2)
    except Exception:
        pass


def start_server():
    """Start the mlx-audio server and wait for it to be ready."""
    kill_server()
    proc = subprocess.Popen(
        [VENV_PYTHON, "-m", "mlx_audio.server", "--host", "0.0.0.0", "--port", "8880"],
        stdout=open("/tmp/qwen3-tts.log", "w"),
        stderr=subprocess.STDOUT,
    )
    # Wait for the server to respond
    for i in range(30):
        time.sleep(2)
        try:
            urllib.request.urlopen(f"{SERVER}/", timeout=3)
            return proc
        except Exception:
            pass
    print("  ERROR: Server failed to start in 60s")
    return proc


def wav_duration(wav_bytes):
    """Get duration in seconds from WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        return wf.getnframes() / wf.getframerate()


def tts_request(text, model, voice="Vivian"):
    """Make a TTS request and measure timings."""
    payload = json.dumps({
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": "wav",
    }).encode()

    req = urllib.request.Request(
        f"{SERVER}/v1/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

    t_start = time.perf_counter()
    resp = urllib.request.urlopen(req, timeout=180)

    first_chunk = resp.read(4096)
    t_first_byte = time.perf_counter()

    chunks = [first_chunk]
    while True:
        chunk = resp.read(65536)
        if not chunk:
            break
        chunks.append(chunk)
    t_done = time.perf_counter()
    resp.close()

    audio_bytes = b"".join(chunks)
    audio_dur = wav_duration(audio_bytes)

    return {
        "ttfab": t_first_byte - t_start,
        "total_time": t_done - t_start,
        "audio_duration": audio_dur,
        "audio_bytes": len(audio_bytes),
        "rtf": (t_done - t_start) / audio_dur if audio_dur > 0 else 0,
        "raw": audio_bytes,
    }


def run_benchmark(model, voice):
    """Benchmark a single model with a fresh server restart."""
    short_name = model.split("/")[-1]
    print(f"\n{'='*70}")
    print(f"Model: {short_name}")
    print(f"{'='*70}")

    # Fresh server start for accurate memory
    print("Starting fresh server...")
    mem_before = get_memory_mb()
    proc = start_server()

    # Warmup
    print("Warming up (downloads model on first run)...")
    try:
        warmup = tts_request("Warming up.", model, voice)
        print(f"  Warmup: {warmup['total_time']:.2f}s")
    except Exception as e:
        print(f"  ERROR: {e}")
        try:
            with open("/tmp/qwen3-tts.log") as f:
                for line in reversed(f.readlines()):
                    if "Error" in line or "error" in line:
                        print(f"  Log: {line.strip()[:120]}")
                        break
        except Exception:
            pass
        print("  Skipping.\n")
        kill_server()
        return None

    # Measure memory after the model has loaded
    time.sleep(1)
    mem_after = get_memory_mb()
    mem_delta = mem_after - mem_before
    print(f"  Memory delta: ~{mem_delta:.0f} MB")

    # Run the benchmark (2 passes — use the second for stable numbers)
    for pass_num in range(2):
        results = []
        for sentence in SENTENCES:
            r = tts_request(sentence, model, voice)
            results.append(r)
            time.sleep(0.1)

        if pass_num == 0:
            print("  Pass 1 done (warmup pass), running pass 2...")

    # Print results from pass 2
    hdr = f"{'#':<3} {'TTFAB':>7} {'Total':>7} {'Audio':>7} {'RTF':>6} Text"
    print(f"\n{hdr}")
    print("-" * 75)

    for i, r in enumerate(results):
        preview = SENTENCES[i][:45] + "..." if len(SENTENCES[i]) > 45 else SENTENCES[i]
        print(f"{i+1:<3} {r['ttfab']:>6.2f}s {r['total_time']:>6.2f}s {r['audio_duration']:>6.2f}s {r['rtf']:>5.2f}x {preview}")

    avg_ttfab = sum(r["ttfab"] for r in results) / len(results)
    avg_rtf = sum(r["rtf"] for r in results) / len(results)
    total_tts = sum(r["total_time"] for r in results)
    total_audio = sum(r["audio_duration"] for r in results)
    min_ttfab = min(r["ttfab"] for r in results)
    max_ttfab = max(r["ttfab"] for r in results)

    print(f"\n  Avg TTFAB: {avg_ttfab:.2f}s (min: {min_ttfab:.2f}s, max: {max_ttfab:.2f}s)")
    print(f"  Avg RTF:   {avg_rtf:.2f}x (lower = faster)")
    print(f"  Totals:    {total_tts:.1f}s generation -> {total_audio:.1f}s audio")
    print(f"  Memory:    ~{mem_delta:.0f} MB over baseline")

    kill_server()

    return {
        "model": short_name,
        "avg_ttfab": avg_ttfab,
        "min_ttfab": min_ttfab,
        "avg_rtf": avg_rtf,
        "mem_delta_mb": mem_delta,
        "total_audio": total_audio,
    }


def generate_clone_samples(ref_audio_path):
    """Generate voice-cloned samples from reference audio."""
    print(f"\n{'='*70}")
    print("VOICE CLONING SAMPLES")
    print(f"Reference: {ref_audio_path} (first 3 seconds)")
    print(f"{'='*70}")

    # Extract the first 3s as wav for reference
    from pydub import AudioSegment
    audio = AudioSegment.from_mp3(ref_audio_path)
    clip = audio[:3000]  # first 3 seconds
    clip = clip.set_channels(1).set_frame_rate(24000)
    ref_wav_path = "/tmp/her_ref_3s.wav"
    clip.export(ref_wav_path, format="wav")
    print(f"  Extracted 3s reference -> {ref_wav_path}")

    # Use the Python API directly for voice cloning, since the server's
    # /v1/audio/speech endpoint doesn't easily support ref_audio
    for clone_model in CLONE_MODELS:
        short = clone_model.split("/")[-1]
        print(f"\n--- {short} ---")
        print("  Loading model directly (no server)...")
        try:
            mem_before = get_memory_mb()
            t0 = time.perf_counter()

            from mlx_audio.tts.utils import load_model
            import numpy as np
            import soundfile as sf

            model = load_model(clone_model)
            t_load = time.perf_counter() - t0
            mem_after = get_memory_mb()
            print(f"  Model loaded in {t_load:.1f}s (mem delta: ~{mem_after - mem_before:.0f} MB)")

            # Generate with voice cloning
            t0 = time.perf_counter()
            results = list(model.generate(
                text=CLONE_TEXT,
                ref_audio=ref_wav_path,
                ref_text="",  # let it auto-detect or leave empty
            ))
            t_gen = time.perf_counter() - t0

            if results and hasattr(results[0], 'audio') and len(results[0].audio) > 0:
                audio_data = np.array(results[0].audio)
                sr = getattr(results[0], 'sample_rate', 24000)
                if sr is None:
                    sr = 24000
                duration = len(audio_data) / sr
                rtf = t_gen / duration if duration > 0 else 0

                out_path = f"data/clone_sample_{short}.wav"
                sf.write(out_path, audio_data, sr)
                print(f"  Generated: {out_path}")
                print(f"  Duration: {duration:.2f}s, Time: {t_gen:.2f}s, RTF: {rtf:.2f}x")
            else:
                print("  WARNING: No audio generated")

            # Clean up the model from memory
            del model

        except Exception as e:
            print(f"  ERROR: {e}")
            import traceback
            traceback.print_exc()


if __name__ == "__main__":
    print("Qwen3-TTS 1.7B Benchmark: 4-bit vs 8-bit")
    print(f"Server: {SERVER}")
    print(f"Machine: {os.uname().machine}, {os.uname().sysname}\n")

    # Part 1: Benchmark 1.7B 4-bit vs 8-bit with preset voices
    summaries = []
    for model, voice in MODELS:
        result = run_benchmark(model, voice)
        if result:
            summaries.append(result)

    if summaries:
        print(f"\n{'='*70}")
        print("1.7B COMPARISON: 4-bit vs 8-bit")
        print(f"{'='*70}")
        print(f"{'Model':<45} {'TTFAB':>7} {'Min TTFAB':>10} {'RTF':>6} {'Mem':>8}")
        print("-" * 80)
        for s in summaries:
            print(f"{s['model']:<45} {s['avg_ttfab']:>6.2f}s {s['min_ttfab']:>9.2f}s {s['avg_rtf']:>5.2f}x {s['mem_delta_mb']:>6.0f}MB")

    # Part 2: Voice cloning samples
    if os.path.exists(REF_AUDIO):
        generate_clone_samples(REF_AUDIO)
    else:
        print(f"\nSkipping voice cloning: {REF_AUDIO} not found")

    print("\nDone!")
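The `wav_duration` helper in the benchmark above needs no TTS server to verify: a silent WAV built in memory exercises the same frames/framerate arithmetic. An illustrative sketch (`make_silence` is not part of the script):

```python
import io
import wave


def wav_duration(wav_bytes: bytes) -> float:
    """Duration in seconds of a WAV file held in memory, as in the benchmark."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        return wf.getnframes() / wf.getframerate()


def make_silence(seconds: float, sample_rate: int = 24000) -> bytes:
    """Build a mono 16-bit PCM WAV of silence, entirely in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buf.getvalue()
```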
@@ -1,115 +0,0 @@
#!/usr/bin/env python3
"""
Test streaming TTS to an ESP32 device.
Generates speech via ElevenLabs, Opus-encodes, and streams over TCP.

Usage:
    python test_streaming_tts.py <ip> [--text "Hello world"]
    python test_streaming_tts.py 192.168.1.50
    python test_streaming_tts.py 192.168.1.50 --text "Testing one two three"
"""
import argparse
import asyncio
import io
import struct
import socket
import time

import opuslib
from pydub import AudioSegment
from elevenlabs import ElevenLabs

from pipeline.main import load_config

SAMPLE_RATE = 16000
OPUS_FRAME_SIZE = 320
DEFAULT_TEXT = "Hello! This is a streaming text to speech test. Notice how the audio starts playing before the full sentence is generated."


def main():
    parser = argparse.ArgumentParser(description="Stream TTS to ESP32")
    parser.add_argument("ip", help="Device IP address")
    parser.add_argument("--port", type=int, default=3001, help="TCP port (default: 3001)")
    parser.add_argument("--volume", type=int, default=14)
    parser.add_argument("--text", default=DEFAULT_TEXT)
    parser.add_argument("--voice", default=None, help="ElevenLabs voice ID (default: from config)")
    args = parser.parse_args()

    config = load_config()
    el_cfg = config["tts"]["elevenlabs"]
    api_key = el_cfg["api_key"]
    voice_id = args.voice or el_cfg["voices"].get(el_cfg.get("default_voice", "Rachel"), "21m00Tcm4TlvDq8ikWAM")

    print(f"Text: {args.text}")
    print(f"Target: {args.ip}:{args.port}")
    print()

    client = ElevenLabs(api_key=api_key)
    encoder = opuslib.Encoder(SAMPLE_RATE, 1, opuslib.APPLICATION_VOIP)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(10.0)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.connect((args.ip, args.port))

    header = bytes([0xAA, 0x00, 60, args.volume, 5, 2])  # compression=2 (Opus)
    sock.send(header)

    start_time = time.time()
    first_chunk_time = None
    total_pcm = 0
    total_opus = 0
    opus_frames = 0
    pcm_buffer = b""

    audio_stream = client.text_to_speech.convert(
        voice_id=voice_id,
        text=args.text,
        model_id="eleven_monolingual_v1",
        output_format="pcm_16000",
    )

    for pcm_chunk in audio_stream:
        if first_chunk_time is None:
            first_chunk_time = time.time()
            print(f"First audio chunk: {first_chunk_time - start_time:.3f}s")

        total_pcm += len(pcm_chunk)
        pcm_buffer += pcm_chunk

        frame_bytes = OPUS_FRAME_SIZE * 2
        while len(pcm_buffer) >= frame_bytes:
            frame = pcm_buffer[:frame_bytes]
            pcm_buffer = pcm_buffer[frame_bytes:]
            opus_frame = encoder.encode(frame, OPUS_FRAME_SIZE)
            sock.send(struct.pack(">H", len(opus_frame)))
            sock.send(opus_frame)
            total_opus += len(opus_frame)
            opus_frames += 1

    # Flush the remainder, zero-padded to a full frame
    if pcm_buffer:
        frame_bytes = OPUS_FRAME_SIZE * 2
        pcm_buffer += b"\x00" * (frame_bytes - len(pcm_buffer))
        opus_frame = encoder.encode(pcm_buffer[:frame_bytes], OPUS_FRAME_SIZE)
        sock.send(struct.pack(">H", len(opus_frame)))
        sock.send(opus_frame)
        total_opus += len(opus_frame)
        opus_frames += 1

    sock.close()
    end_time = time.time()

    audio_duration = total_pcm / (SAMPLE_RATE * 2)
    ratio = total_pcm / total_opus if total_opus else 0

    print("\nResults:")
    print(f"  Pipeline time:       {end_time - start_time:.2f}s")
    print(f"  Time to first audio: {first_chunk_time - start_time:.3f}s")
    print(f"  Audio duration:      {audio_duration:.1f}s")
    print(f"  Opus frames:         {opus_frames}")
    print(f"  Compression:         {ratio:.1f}x ({total_pcm:,} PCM -> {total_opus:,} Opus)")


if __name__ == "__main__":
    main()
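The stream above frames each Opus packet with a 2-byte big-endian length prefix before it hits the TCP socket. A sketch of that framing and the inverse parse a receiver would run (illustrative only; these helper names are not from the firmware):

```python
import struct


def frame_packets(packets: list[bytes]) -> bytes:
    """Prefix each packet with a 2-byte big-endian length, as sent over TCP."""
    return b"".join(struct.pack(">H", len(p)) + p for p in packets)


def parse_frames(stream: bytes) -> list[bytes]:
    """Inverse of frame_packets: split a framed byte stream back into packets."""
    packets, i = [], 0
    while i + 2 <= len(stream):
        (n,) = struct.unpack_from(">H", stream, i)  # length prefix
        packets.append(stream[i + 2:i + 2 + n])
        i += 2 + n
    return packets
```

Since Opus packets are well under 65535 bytes, a `>H` prefix is always sufficient; the receiver reads two bytes, then exactly that many payload bytes, and feeds each packet to the decoder.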
@@ -1,15 +0,0 @@
#!/usr/bin/env python3
"""Quick test: can we receive UDP on port 3000?"""
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 3000))
sock.settimeout(15)
print("Listening on :3000 — press PTT on M5 within 15s...")
try:
    data, addr = sock.recvfrom(2048)
    print(f"Got {len(data)}B from {addr}")
except socket.timeout:
    print("No packets received (timeout)")
finally:
    sock.close()