Long-press detection was in loop() which blocks during TCP audio handling.
Moved to dedicated touchTask on Core 1 that polls every 20ms regardless
of what loop() is doing.
- Add long-press detection (1.5s hold) on center touch to explicitly end call:
stops mic, interrupts playback, shows slow amber LED pulse
- Rewrite touch handler: ISR records touch start, loop() polls for release
to distinguish short press (<1.5s) from long press (>=1.5s)
- Add callActive state to track call lifecycle (tap to start, long-hold to end)
- Short press when idle shows subtle green flash (server confirms with full
pulse once WebRTC call is established)
- Reduce default mic timeout from 60s to 20s (server VAD extends when active)
- Guard 0xCC handler: don't extend mic after user explicitly ended call
- Reset callActive on natural mic timeout
--warmup validates LLM and TTS backends on startup with test requests,
logging timing and response validation. --persist (off by default)
restores device state across restarts with message sanitization to
ensure proper role alternation for Gemma 3's chat template.
Adds mlx-audio-based Qwen3-TTS as an alternative to ElevenLabs,
enabling fully offline voice synthesis with voice cloning from a
short reference audio clip. Benchmarked at 0.52x RTF (sub-realtime)
on Apple Silicon with the 1.7B-Base-4bit model.
Switch from webrtcvad's binary is_speech to Silero VAD's calibrated
float probability via direct ONNX session calls with numpy. The LSTM
provides temporal smoothing natively, eliminating the sliding window
hack. Frame size changes from 480 (30ms) to 512 (32ms) end-to-end
to match Silero's requirements.
Consolidate pipeline/requirements.txt into root requirements.txt,
swap webrtcvad+setuptools for silero-vad+onnxruntime.
Move venv to repo root with combined requirements.txt, fix libopus/portaudio
discovery on macOS, replace deprecated audioop with numpy u-law encoder,
add colored pipeline logging with suppressed third-party noise, fix mic
deadlock on non-speech rejection, fix localhost IP mismatch for test client,
add VAD visualization bar, tune VAD for conversational speech, and move
runtime data to gitignored data/ directory.
Pipeline: async voice pipeline replacing monolithic threaded server.
ASR, LLM, and TTS are independent pluggable services. ASR calls
external parakeet-asr-server, LLM uses any OpenAI-compatible
endpoint, TTS uses ElevenLabs with pluggable backend interface.
Firmware: add mDNS hostname resolution as fallback when multicast
discovery doesn't work. Resolves configured server_hostname via
MDNS.queryHost() on boot, falls back to multicast if resolution fails.
Also adds test_client.py that emulates an ESP32 device for testing
without hardware (TCP server, Opus decode, mic streaming).
With Opus compression providing consistent frame delivery, we can
safely reduce the jitter buffer from 8192 samples (512ms) to 4096
samples (256ms), cutting latency in half.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Added section explaining how to use flash_firmware.sh for:
- Compile-only mode (no ESP32 needed)
- Auto-detect and flash
- Flash to specific port
Emphasized using compile-only mode to verify code before committing.
🤖 Generated with Claude Code (https://claude.com/claude-code)
Added compile-only mode that skips upload:
./flash_firmware.sh compile
Updated usage:
- flash_firmware.sh # Auto-detect and upload
- flash_firmware.sh /dev/cu.usbmodem1 # Upload to specific port
- flash_firmware.sh compile # Compile only, no upload
Useful for verifying code compiles without needing ESP32 connected.
🤖 Generated with Claude Code (https://claude.com/claude-code)
Changed all mic activation paths to use 60s timeout:
- Center tap to start call: 30s → 60s
- Center tap to interrupt: 30s → 60s
- After assistant audio: Enforce minimum 60s (was using server value)
Behavior:
- Tap center → mic enabled for 60s
- Assistant speaks → mic auto-enabled for 60s after playback
- Tap during playback → interrupts and mic enabled for 60s
This ensures users always have adequate time to respond without
premature timeout, matching the intended UX.
🤖 Generated with Claude Code (https://claude.com/claude-code)
Added 800ms debounce to all touch pads (left, center, right) to prevent
accidental multiple touches from interrupting audio playback.
Changes:
- Added debounce timing variables (lines 57-61)
- Implemented debounce logic in gotTouch1/2/3 handlers (lines 967-1032)
- Each touch pad has independent debounce timer
- Touches within 800ms of previous touch are ignored
This prevents issues where:
- Center tap would trigger multiple times from single press
- Audio playback would be interrupted repeatedly
- User experience was degraded by touch sensitivity
The 800ms window provides good balance between preventing hardware
bounces and maintaining responsive feel for legitimate user input.
🤖 Generated with Claude Code
Enhances test_streaming_tts.py to support optional Opus encoding for
streaming TTS audio from ElevenLabs to ESP32.
Features:
- Add --opus flag to enable Opus compression
- Accept ESP32 IP as command-line argument
- Buffer PCM chunks into 20ms frames (640 bytes) for Opus encoding
- Send with length-prefixed framing (compatible with ESP32 decoder)
- Display compression statistics when using Opus
Usage:
python test_streaming_tts.py [ESP32_IP] [--opus]
Results with Opus:
- Compression ratio: ~14.5x (248KB PCM → 17KB Opus)
- Bandwidth: 256 kbps → ~17 kbps (93% reduction)
- Maintains streaming latency (~2s to first chunk)
- High quality voice for human listening
Tested successfully with ElevenLabs API streaming to ESP32-S3.
Allows user to interrupt TTS playback mid-stream by tapping the center
touch button. Enables immediate voice input without waiting for assistant
to finish speaking.
Implementation:
- Add interruptPlayback volatile flag for ISR-safe signaling
- Opus decode task checks flag on each frame decode iteration
- PCM playback checks flag on each buffer read iteration
- On interrupt: stop decoding, clear I2S DMA buffers, drain TCP
- TCP drain runs for 1s to discard in-flight audio from server
- Skip silence buffer flush when interrupted (exit immediately)
- Enable microphone with 30s timeout for user response
Behavior:
- Latency: ~500ms (acceptable - next buffer iteration)
- Visual feedback: Green LED indicates listening mode
- Server timeout value still respected (gives user time to speak)
- Works for both Opus and PCM audio streams
User flow:
1. User taps during playback
2. Audio stops within ~500ms
3. Green LED pulses (listening mode)
4. Microphone enabled for 30s
5. User can speak immediately
Implements Opus decoding on ESP32 for TTS playback, achieving 14-16x
compression over raw PCM. This improves WiFi throughput margin from 2.2x
to 30x+, enabling reliable operation throughout the home even with poor
WiFi conditions.
Key changes:
- Add Opus decoder to ESP32 firmware with dedicated 32KB FreeRTOS task
- Implement length-prefixed TCP framing for variable-bitrate Opus frames
- Update header protocol: header[5] = compression type (0=PCM, 1=μ-law, 2=Opus)
- Auto-detect USB port in flash and serial monitor scripts
- Add test script with opuslib encoder supporting WAV/M4A/MP3 input
- Document architecture and design rationale for μ-law/UDP (mic) vs Opus/TCP (speaker)
Performance:
- Compression: 640 bytes PCM → 35-50 bytes Opus per 20ms frame (14-16x)
- Bandwidth: 256 kbps → 16 kbps (94% reduction)
- WiFi margin: 2.2x → 30x+ throughput safety margin
- CPU usage: ~10-20% during playback on ESP32-S3
- Quality: High-fidelity voice suitable for human listening
🤖 Generated with [Claude Code](https://claude.com/claude-code)
This commit adds audio compression and fixes critical I2S configuration
issues that prevented audio playback on ESP32-S3 V3 boards.
Key Changes:
- Fix I2S channel from RIGHT to LEFT (V3 board requirement)
- Fix deprecated I2S_COMM_FORMAT_I2S to I2S_COMM_FORMAT_STAND_I2S
- Add μ-law compression for 2x bandwidth reduction (960→480 bytes/packet)
- Add DC offset removal for microphone to fix compression artifacts
- Add DISABLE_HARDWARE_MUTE option for boards without mute switch
- Improve mute button behavior to control mic_timeout
New Files:
- onjuino/audio_compression.h: μ-law encoding/decoding implementation
- flash_firmware.sh: Automated compilation and flashing script
- serial_monitor.py: Interactive serial monitor with auto-reconnect
- test_mic_receiver.py: UDP audio recording and compression testing
- test_speaker.py: Speaker testing with local WAV file
- test_streaming_tts.py: ElevenLabs streaming TTS performance testing
- record_from_esp32.py: Simple recording script for testing
- TESTING.md: Testing documentation
Fixes:
- I2S audio output now works on V3 boards (issue #57, #75)
- Microphone compression produces valid audio data
- Serial reset command now works properly (ESP.restart vs esp_restart)
🤖 Generated with [Claude Code](https://claude.com/claude-code)