mirror of https://github.com/justLV/onju-v2 synced 2026-04-21 15:47:55 +00:00

Google Home mini "jailbreak" for conversational AI agents

Find a file

justLV bf1ceb3e69 Remove redundant top-level default_voice from TTS config device.py now reads default_voice from tts.elevenlabs directly.		2026-04-08 13:37:20 -07:00
hardware	Move Schematic.pdf from images/ to hardware/	2026-04-08 13:02:08 -07:00
images	Move Schematic.pdf from images/ to hardware/	2026-04-08 13:02:08 -07:00
m5_echo	Prepare repo for v2 release: rewrite README, clean up dev scripts, embed ASR server	2026-04-08 13:00:15 -07:00
onjuino	Prepare repo for v2 release: rewrite README, clean up dev scripts, embed ASR server	2026-04-08 13:00:15 -07:00
pipeline	Remove redundant top-level default_voice from TTS config	2026-04-08 13:37:20 -07:00
.gitignore	Prepare repo for v2 release: rewrite README, clean up dev scripts, embed ASR server	2026-04-08 13:00:15 -07:00
flash.sh	Check for .ino.bin artifact to detect stale/missing builds	2026-04-08 10:53:58 -07:00
LICENSE	Create LICENSE	2023-08-08 19:17:47 -07:00
pyproject.toml	Prepare repo for v2 release: rewrite README, clean up dev scripts, embed ASR server	2026-04-08 13:00:15 -07:00
README.md	Add OpenClaw setup script and documentation	2026-04-08 13:22:09 -07:00
serial_monitor.py	Prepare repo for v2 release: rewrite README, clean up dev scripts, embed ASR server	2026-04-08 13:00:15 -07:00
setup-git-hash.sh	add to public repo	2023-08-08 20:32:52 -07:00
setup_openclaw.sh	Add OpenClaw setup script and documentation	2026-04-08 13:22:09 -07:00
test_client.py	Replace webrtcvad with Silero VAD (ONNX, no PyTorch)	2026-02-07 17:00:02 -08:00
test_mic.py	Prepare repo for v2 release: rewrite README, clean up dev scripts, embed ASR server	2026-04-08 13:00:15 -07:00
test_speaker.py	Add PTT device support, IIR DC offset fix, control API, test script updates	2026-04-06 14:22:20 -07:00

README.md

Onju Voice v2 (OnjuClaw 🍐🦞 ?)

Enable multiple "Google Home" speakers to connect to a Mac Mini (or other local server) for talking to your agent(s) over your local WiFi.

This repo consists of:

A custom PCB designed as a drop-in replacement to the original Google Nest Mini (2nd gen), using the ESP32-S3 for audio processing and WiFi connectivity
An async server pipeline handling ASR -> TTS from multiple devices on the same network to be compatible with any LLM or agent platforms like OpenClaw 🦞

This is the successor to onju-voice. The original repo remains available as a reference but is no longer actively maintained.

What's new in v2

OpenClaw managed backend 🦞 -- delegate conversation history and session management to an OpenClaw gateway for centralized, multi-device orchestration
Opus compression -- 14-16x downstream compression (server to speaker) for better audio quality over WiFi
Streaming-ready architecture -- designed for sentence-level TTS streaming and agentic tool-calling loops
Modular async pipeline -- replaced the monolithic server with a pluggable architecture for ASR, LLM, and TTS backends etc.
Any LLM -- works with any OpenAI-compatible API (Ollama, mlx_lm, Gemini, OpenRouter, Claude, etc.)
Pluggable TTS -- ElevenLabs (recommended) or local via mlx-audio for fully offline operation
Silero VAD -- server-side voice activity detection with configurable thresholds, replacing webrtcvad
VAD-aware interruption -- tap to interrupt playback and start speaking immediately
M5 Echo support -- get started with a $13 dev kit instead of ordering a custom PCB
One-command flashing -- ./flash.sh handles compilation, WiFi credential generation (from macOS Keychain), and upload. No Arduino IDE or manual configuration required

Supported devices

	Onjuino (custom PCB)	M5Stack ATOM Echo
Board	ESP32-S3	ESP32-PICO-D4
Interaction	Capacitive touch: tap to start, double-tap to end	Physical button: hold to talk
Mic	I2S (INMP441)	PDM (SPM1423)
Speaker	MAX98357A, 6 NeoPixel LEDs	NS4168, 1 SK6812 LED
PSRAM	Yes (2MB playback buffer)	No (smaller buffers)
Audio upstream	mu-law 16kHz UDP (16 KB/s)	mu-law 16kHz UDP (16 KB/s)
Audio downstream	Opus 16kHz TCP (~1.5 KB/s)	Opus 16kHz TCP (~1.5 KB/s)

Both targets use the same network protocol and connect to the same server. See the M5 Echo README for hardware-specific details.

Architecture

                ESP32 Device                              Server Pipeline
  ┌──────────────────────────────┐       ┌──────────────────────────────────────┐
  │  Mic > I2S RX > mu-law =======UDP 3000===> mu-law decode > VAD > ASR        │
  │                              │       │                                      │
  │  Speaker < I2S TX < Opus <===TCP 3001<=== Opus encode < TTS < LLM           │
  └──────────────────────────────┘       └──────────────────────────────────────┘

Why mu-law upstream: Stateless sample-by-sample encoding (~1% CPU), zero buffering latency. ASR models handle the quality fine.

Why Opus downstream: Human ears need better quality than ASR, and Opus decoding is easier for an ESP32. Opus gives 14-16x compression vs mu-law's 2x, and TCP ensures reliable ordered delivery for the stateful codec.

Device discovery

ESP32 boots and joins WiFi
Sends multicast announcement to 239.0.0.1:12345 with hostname, git hash, and PTT flag
Server discovers device and connects to its TCP server on port 3001
ESP32 learns server IP from the TCP connection and starts sending mic audio via UDP

TCP command protocol

All commands use a 6-byte header. The server initiates TCP connections to the ESP32.

Byte 0	Command	Payload
`0xAA`	Audio playback	mic_timeout(2B), volume, LED fade, compression type, then length-prefixed Opus frames
`0xBB`	Set LEDs	LED bitmask, RGB color
`0xCC`	LED blink (VAD)	intensity, RGB color, fade rate
`0xDD`	Mic timeout	timeout in seconds (2B)

A zero-length Opus frame (0x00 0x00) signals end of speech.

FreeRTOS task layout

Core	Task	Purpose
Core 0	Arduino loop	TCP server, touch/mute input, UART debug
Core 1	`micTask`	I2S read, mu-law encode, UDP send
Core 1	`opusDecodeTask`	TCP read, Opus decode, I2S write (created per playback)
Core 1	`updateLedTask`	40Hz LED refresh with gamma-corrected fade

Conversation backends

The pipeline supports two conversation backends, selectable via config.yaml:

Local (conversation.backend: "local"): Manages conversation history locally with per-device JSON persistence. Sends the full message history on each LLM request. Works with any OpenAI-compatible endpoint.

OpenClaw Managed (conversation.backend: "managed"): Delegates session management to an OpenClaw gateway. Only sends the latest user message -- OpenClaw tracks history server-side using the device ID as the session key. Set OPENCLAW_GATEWAY_TOKEN in your environment and point base_url at your gateway.

Setting up OpenClaw

If you have OpenClaw installed, a setup script is included:

./setup_openclaw.sh

This will:

Enable the chat completions HTTP endpoint on the gateway
Append a voice mode prompt to ~/.openclaw/workspace/AGENTS.md (tells the agent to respond in concise, speech-friendly prose when the message channel is onju-voice)
Restart the gateway

Then set conversation.backend: "managed" in pipeline/config.yaml and ensure OPENCLAW_GATEWAY_TOKEN is set in your environment.

Installation

Server

# Clone and set up Python environment
git clone https://github.com/justLV/onju-v2.git
cd onju-v2
uv venv && source .venv/bin/activate
uv pip install -e .

# macOS: install system libraries for Opus encoding
brew install opus portaudio

# Configure
cp pipeline/config.yaml.example pipeline/config.yaml
# Edit config.yaml with your API keys and preferences

ASR -- an embedded parakeet-mlx server is included (Apple Silicon):

uv pip install -e ".[asr]"
python -m pipeline.services.asr_server  # runs on port 8100

Or point asr.url in config.yaml at any Whisper-compatible endpoint.

LLM -- any OpenAI-compatible server:

# Local (mlx_lm on Apple Silicon)
mlx_lm.server --model unsloth/gemma-4-E4B-it-UD-MLX-4bit --port 8080

# Local (Ollama)
ollama run gemma4:e4b

# Cloud -- just set base_url and api_key in config.yaml (default: Haiku via OpenRouter)

TTS -- ElevenLabs is the default (set your API key in config.yaml). For fully offline TTS, you can use mlx-audio (uv pip install -e ".[tts-local]", then set tts.backend: "qwen3" for example in config.yaml - I don't think this is the best quality, just including as reference for a local TTS!).

Run:

source .venv/bin/activate
python -m pipeline.main

Firmware

Both targets can be compiled and flashed from the command line:

# Flash onjuino (default)
./flash.sh

# Flash M5 Echo
./flash.sh m5_echo

# Compile only (no device needed)
./flash.sh compile

# Regenerate WiFi credentials from macOS Keychain, defaults to manual entry
./flash.sh --regen

Requires arduino-cli:

# macOS
brew install arduino-cli
arduino-cli core install esp32:esp32

The flash script auto-installs required libraries (Adafruit NeoPixel, esp32_opus).

WiFi credentials are generated from your macOS Keychain on first flash, or you can edit the credentials.h.template files manually.

For Arduino IDE users: select ESP32S3 Dev Module (onjuino) or ESP32 Dev Module (M5 Echo), enable USB CDC on Boot and OPI PSRAM (onjuino only), then build and upload.

Hardware

Preview schematics & PCB | Order from PCBWay | Altium source files and schematics in hardware/.

If you don't have a custom PCB, you can use the M5Stack ATOM Echo. I'd recommend adding a Battery (Biscuit) Base (link)

Configuration reference

See pipeline/config.yaml.example for all options. Key sections:

Section	What it controls
`asr`	Speech-to-text service URL
`conversation.backend`	`"local"` or `"managed"` (OpenClaw)
`conversation.local`	LLM endpoint, model, system prompt, message history
`conversation.managed`	OpenClaw gateway URL, auth token, message channel
`tts`	TTS backend (`"elevenlabs"` or `"qwen3"`), voice settings
`vad`	Voice activity detection thresholds and timing
`network`	UDP/TCP/multicast ports
`device`	Volume, mic timeout, LED settings, greeting audio

Environment variables

Variable	Used by
`OPENROUTER_API_KEY`	Local backend via OpenRouter (default)
`ANTHROPIC_API_KEY`	Local backend via Anthropic API directly
`OPENCLAW_GATEWAY_TOKEN`	Managed (OpenClaw) backend

Testing

# Emulate an ESP32 device (no hardware needed)
python test_client.py                  # localhost
python test_client.py 192.168.1.50     # remote server

# Test speaker output (send audio file to device w/ TCP and Opus encoding)
python test_speaker.py <device-ip>

# Test mic input (receive and record UDP audio)
python test_mic.py --duration 10

# Serial monitor (auto-detects USB port)
python serial_monitor.py test.wav

UART debug commands

Both firmware targets support serial commands at 115200 baud:

Key	Action
`r`	Reboot
`M`	Enable mic for 10 minutes
`m`	Disable mic
`A`	Re-send multicast announcement
`c`	Enter config mode (WiFi, server, volume)
`W`/`w`	LED test fast/slow (onjuino)
`P`	Play 440Hz test tone (M5 Echo)

License

MIT