onju-v2/ARCHITECTURE.md
justLV e4d7bc7ca5 End-of-speech protocol, LED tweaks, call-end guard
- Handle zero-length Opus frame (0x00 0x00) as end-of-speech marker:
  exits opusDecodeTask cleanly, clears isPlaying, re-enables mic
- Zero I2S DMA buffer on opusDecodeTask exit (prevents stale DMA)
- Reject 0xAA audio commands when callActive is false (prevents
  bridge from restarting playback after user double-tapped to end)
- Don't reset mic_timeout after playback if call was ended
- LED: white flash for tap/interrupt, red-orange for call end
- Pipeline: append end-of-speech marker to Opus TCP payload
- ARCHITECTURE.md: document end-of-speech marker protocol
2026-04-07 16:41:59 -07:00

Onju Voice Architecture

System Overview

ESP32-S3 voice assistant with bidirectional audio streaming over WiFi to a server running speech recognition and text-to-speech.

┌─────────────────────────────────────────────────────────────┐
│                         ESP32-S3                             │
│  ┌──────────┐    ┌─────────┐    ┌──────────┐   ┌─────────┐ │
│  │   Mic    │───→│ I2S RX  │───→│ μ-law    │──→│   UDP   │ │
│  │ (INMP441)│    │ 16kHz   │    │ encode   │   │  3000   │ │
│  └──────────┘    └─────────┘    └──────────┘   └─────┬───┘ │
│                                                        │     │
│  ┌──────────┐    ┌─────────┐    ┌──────────┐   ┌─────▼───┐ │
│  │ Speaker  │◀───│ I2S TX  │◀───│  Opus    │◀──│   TCP   │ │
│  │(MAX98357)│    │ 16kHz   │    │ decode   │   │  3001   │ │
│  └──────────┘    └─────────┘    └──────────┘   └─────────┘ │
└─────────────────────────────────────────────────────────────┘
                                 WiFi
                                  │
┌─────────────────────────────────▼───────────────────────────┐
│                           Server                             │
│  ┌─────────┐    ┌──────────┐    ┌─────────────────────┐    │
│  │   UDP   │───→│  μ-law   │───→│  Speech-to-Text     │    │
│  │  3000   │    │  decode  │    │  (Whisper/Deepgram) │    │
│  └─────────┘    └──────────┘    └─────────────────────┘    │
│                                                              │
│  ┌─────────┐    ┌──────────┐    ┌─────────────────────┐    │
│  │   TCP   │◀───│  Opus    │◀───│  Text-to-Speech     │    │
│  │  3001   │    │  encode  │    │  (ElevenLabs/etc)   │    │
│  └─────────┘    └──────────┘    └─────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Audio Paths

Microphone → Server (UDP + μ-law)

  • Sample rate: 16kHz mono, 512 samples/chunk (32ms)
  • μ-law compressed: 512 bytes/chunk (16 KB/s) — 2x reduction
  • UDP: no retransmissions, no connection overhead — old audio is stale anyway
  • DC offset removed per-chunk before encoding

Why μ-law over Opus upstream: μ-law is stateless (sample-by-sample table lookup, ~1% CPU), zero buffering latency, and ASR models handle the quality fine. Opus would add 20-60ms frame buffering and 10-20% CPU for no practical benefit upstream.
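The upstream encoding step can be sketched as follows. This is a minimal illustration of standard G.711 μ-law encoding with per-chunk DC removal, not the firmware's code: the function names are hypothetical, and the per-sample result is computed directly here, whereas a table-lookup implementation would precompute these bytes.

```python
BIAS = 0x84    # standard G.711 mu-law bias (132)
CLIP = 32635   # largest magnitude that still fits in 15 bits after adding the bias


def ulaw_encode_sample(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # The segment (exponent) is given by the highest set bit, searched from bit 14 down.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # mu-law bytes are transmitted bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def encode_chunk(samples: list[int]) -> bytes:
    """Remove the chunk's DC offset (its mean), then mu-law encode every sample."""
    dc = round(sum(samples) / len(samples))
    return bytes(ulaw_encode_sample(s - dc) for s in samples)
```

A 512-sample chunk becomes exactly 512 bytes, which matches the 16 KB/s figure at 16kHz.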

Why UDP over TCP: Retransmissions add latency and head-of-line blocking delays newer audio. ASR handles occasional packet loss better than delayed old audio.

Server → Speaker (TCP + Opus)

  • Sample rate: 16kHz mono, 320 samples/frame (20ms)
  • Opus compressed: ~35-50 bytes/frame (about 1.75-2.5 KB/s at 50 frames/s) — roughly 14-16x reduction at typical frame sizes
  • TCP: reliable ordered delivery required for Opus frame decoding

Why Opus over μ-law downstream: Human ears need better quality than ASR. Opus gives 14-16x compression vs μ-law's 2x, turning a tight 2.2x WiFi margin into 30x+.

Why TCP over UDP: Lost or out-of-order Opus frames cause decode errors. TCP's reliability guarantees are worth the slight latency cost, especially with the playback buffer absorbing jitter.

Device Discovery & Connection

  1. ESP32 boots and joins WiFi
  2. Sends multicast announcement to 239.0.0.1:12345 with hostname and git hash
  3. Server discovers device, learns IP
  4. Server connects to ESP32's TCP server on port 3001 (ESP32 is the TCP server, not client)
  5. ESP32 learns server IP from first TCP connection, uses it for UDP mic packets
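A server-side listener for steps 2 and 3 might look like the sketch below. The multicast group and port come from the list above; the payload layout (hostname and git hash as whitespace-separated ASCII) is an assumption made for illustration, as are all function names.

```python
import socket
import struct

MCAST_GROUP = "239.0.0.1"
MCAST_PORT = 12345


def open_discovery_socket() -> socket.socket:
    """Return a UDP socket joined to the announcement multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    # ip_mreq: group address followed by the local interface (0.0.0.0 = any).
    mreq = struct.pack("4s4s", socket.inet_aton(MCAST_GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def parse_announcement(payload: bytes) -> tuple[str, str]:
    """Split an announcement into (hostname, git_hash).

    Assumption: the two fields are whitespace-separated ASCII text.
    """
    hostname, git_hash = payload.decode("ascii").split()
    return hostname, git_hash


def wait_for_device(sock: socket.socket) -> tuple[str, str, str]:
    """Block until one announcement arrives; return (ip, hostname, git_hash)."""
    payload, (ip, _port) = sock.recvfrom(1024)
    hostname, git_hash = parse_announcement(payload)
    return ip, hostname, git_hash
```

Once `wait_for_device` returns an address, step 4 would be something like `socket.create_connection((ip, 3001))`.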

TCP Command Protocol

All commands use a 6-byte header. The server initiates TCP connections to the ESP32.

0xAA — Audio Playback

header[0]   = 0xAA
header[1:2] = mic_timeout (seconds, big-endian) — enable mic after audio finishes
header[3]   = volume (0-20, bit-shift)
header[4]   = LED fade rate (0-255)
header[5]   = compression type: 0=PCM, 2=Opus

When the compression type is 2 (Opus), the header is followed by length-prefixed Opus frames: [2-byte big-endian length][Opus data]...

A zero-length frame (0x00 0x00) signals end of speech — the ESP32 exits opusDecodeTask, clears isPlaying, and re-enables the mic. The TCP connection may stay open for reuse.
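The framing described above can be sketched from the server's side as follows. The byte layout is taken from the header table; the function names are hypothetical, and the frames in the usage example are placeholder bytes rather than real Opus packets.

```python
import struct

CMD_AUDIO = 0xAA
COMPRESSION_OPUS = 2
END_OF_SPEECH = b"\x00\x00"  # zero-length frame marks end of speech


def build_audio_command(mic_timeout_s, volume, led_fade, opus_frames):
    """Return the full TCP payload: 6-byte header, framed packets, end marker."""
    header = struct.pack(
        ">BHBBB", CMD_AUDIO, mic_timeout_s, volume, led_fade, COMPRESSION_OPUS
    )
    body = bytearray()
    for frame in opus_frames:
        if not frame:
            # An empty frame would be indistinguishable from the end marker.
            raise ValueError("Opus frames must be non-empty")
        body += struct.pack(">H", len(frame)) + frame
    return header + bytes(body) + END_OF_SPEECH


def iter_frames(body):
    """Yield frames from the bytes after the header, stopping at the marker."""
    offset = 0
    while True:
        (length,) = struct.unpack_from(">H", body, offset)
        offset += 2
        if length == 0:
            return
        yield body[offset:offset + length]
        offset += length
```

For example, `build_audio_command(30, 14, 5, frames)` asks the device to play the frames at volume 14 and reopen the mic for 30 seconds afterwards; `iter_frames` mirrors what a receiver would do with the same bytes.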

0xBB — Set LEDs

header[0]   = 0xBB
header[1]   = LED bitmask (bits 0-5)
header[2:4] = RGB color

0xCC — VAD Blink

header[0]   = 0xCC
header[1]   = starting intensity (0-255)
header[2:4] = RGB color
header[5]   = fade rate

Also extends mic timeout if it's about to expire (VAD_MIC_EXTEND = 5s).

0xDD — Mic Timeout

header[0]   = 0xDD
header[1:2] = timeout (seconds, big-endian)

Used to stop mic while server is processing (thinking animation).
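As a rough sketch, the 0xCC and 0xDD headers could be packed like this on the server. The field positions follow the tables above; the R, G, B byte order and the zero padding of the unused 0xDD bytes are assumptions, and the function names are hypothetical.

```python
import struct


def build_vad_blink(intensity, rgb, fade_rate):
    """0xCC header: starting intensity, RGB color, fade rate (one byte each)."""
    r, g, b = rgb  # assumption: bytes 2-4 are ordered R, G, B
    return struct.pack(">6B", 0xCC, intensity, r, g, b, fade_rate)


def build_mic_timeout(seconds):
    """0xDD header: 16-bit big-endian timeout in seconds.

    Assumption: the three remaining header bytes are unused and sent as zero.
    """
    return struct.pack(">BH3x", 0xDD, seconds)
```

Both functions return exactly 6 bytes, so they fit the fixed header size used by every command.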

FreeRTOS Task Architecture

The ESP32-S3's dual cores are used to separate concerns:

Core 0 — Arduino loop:

  • TCP server: accepts connections, parses headers, handles PCM playback
  • Touch/mute input handling
  • UART debug commands

Core 1 — Dedicated tasks:

  • micTask (4KB stack, priority 1): continuous I2S read → μ-law encode → UDP send
  • opusDecodeTask (32KB stack, priority 1): created per-playback, reads TCP → Opus decode → I2S write
  • updateLedTask (2KB stack, priority 2): 40Hz LED refresh with gamma-corrected fade

The 32KB stack for Opus decoding is necessary because the Opus decoder uses 10-20KB of stack internally.

State Machine

Key state variables controlling behavior:

  • isPlaying — blocks mic recording during playback
  • mic_timeout — millis() deadline for mic recording; 0 = off
  • interruptPlayback — set by center touch to abort current playback
  • mute — hardware mute switch state (currently disabled via DISABLE_HARDWARE_MUTE)
  • serverIP — learned from first TCP connection; 0.0.0.0 = no server yet

Activation flow:

  1. Center touch → sets mic_timeout to now + 60s, green LED pulse
  2. Server sends 0xCC (VAD blink) during speech → extends timeout by 5s if nearly expired
  3. Server sends 0xDD (stop mic) when transcription complete → thinking animation
  4. Server sends 0xAA (audio) with response → plays audio, then re-enables mic per header timeout
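The activation flow can be modelled as a small deadline-based state object. This is a simplified model for reasoning about the timing, not the firmware's logic: the class and method names are hypothetical, and "nearly expired" is assumed to mean that less than the 5s extension remains.

```python
TOUCH_WINDOW_MS = 60_000    # center touch opens the mic for 60 s
VAD_MIC_EXTEND_MS = 5_000   # extension applied by a 0xCC VAD blink


class MicGate:
    def __init__(self):
        self.mic_timeout = 0      # millis() deadline; 0 = off
        self.is_playing = False

    def on_touch(self, now_ms):
        self.mic_timeout = now_ms + TOUCH_WINDOW_MS

    def on_vad_blink(self, now_ms):
        # Assumption: extend only while the mic is still open and fewer than
        # VAD_MIC_EXTEND_MS milliseconds remain before the deadline.
        remaining = self.mic_timeout - now_ms
        if 0 < remaining < VAD_MIC_EXTEND_MS:
            self.mic_timeout = now_ms + VAD_MIC_EXTEND_MS

    def on_mic_timeout_cmd(self, now_ms, seconds):
        # 0xDD: a zero timeout switches the mic off.
        self.mic_timeout = now_ms + seconds * 1000 if seconds else 0

    def on_playback_start(self):
        self.is_playing = True

    def on_playback_end(self, now_ms, header_timeout_s):
        # 0xAA: the header's mic_timeout takes effect once audio finishes.
        self.is_playing = False
        self.on_mic_timeout_cmd(now_ms, header_timeout_s)

    def mic_active(self, now_ms):
        return not self.is_playing and now_ms < self.mic_timeout
```

Keeping a single deadline rather than a separate on/off flag means every event reduces to "move the deadline", which is why 0 can double as the off state.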

Playback interruption:

  1. Center touch during playback → sets interruptPlayback, clears isPlaying
  2. Opus/PCM task detects flag, stops decoding
  3. Remaining TCP data drained (up to 1s) without playing
  4. Mic enabled immediately for 60s
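Step 3 above, draining leftover TCP data for up to one second, could be expressed as the following loop. It is written against Python's socket API purely to illustrate the behaviour; the function name and the 512-byte read size are assumptions.

```python
import socket
import time


def drain_socket(sock, max_seconds=1.0, chunk=512):
    """Read and discard pending data until EOF or the time budget runs out.

    Returns the number of bytes discarded.
    """
    deadline = time.monotonic() + max_seconds
    discarded = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)
        try:
            data = sock.recv(chunk)
        except socket.timeout:
            break  # nothing more arrived within the budget
        if not data:
            break  # peer closed the connection
        discarded += len(data)
    return discarded
```

Bounding the drain by time rather than by byte count keeps the device responsive even if the sender keeps streaming after the interrupt.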

LED System

The board has 6 NeoPixel LEDs, of which only the inner 4 (indices 1-4) are used for animations. The two outer LEDs of that group (indices 1 and 4) are dimmed by half for a softer visual.

  • Pulse-and-fade paradigm: setLed() sets color, starting intensity, and fade rate. updateLedTask ramps intensity down at 40Hz.
  • Gamma correction: LUT with gamma 1.8 (lower than typical 2.2 to avoid visible flicker at low PWM levels)
  • Audio-reactive: During playback, amplitude of PCM samples drives LED brightness (sampled every 32ms, only ramps up — natural fade handles the down)
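The pulse-and-fade behaviour can be sketched as below. The gamma value and refresh rate come from the list above; treating the fade rate as "intensity units subtracted per tick" is an assumption, and the class is a hypothetical model rather than the firmware's updateLedTask.

```python
GAMMA = 1.8
REFRESH_HZ = 40  # tick() is meant to be called at this rate

# 8-bit lookup table: linear intensity (0-255) -> gamma-corrected level (0-255).
GAMMA_LUT = [round(255 * (i / 255) ** GAMMA) for i in range(256)]


class LedPulse:
    def __init__(self):
        self.color = (0, 0, 0)
        self.intensity = 0
        self.fade_rate = 0

    def set_led(self, color, intensity, fade_rate):
        """Start a pulse: color, starting intensity, and fade rate."""
        self.color, self.intensity, self.fade_rate = color, intensity, fade_rate

    def react(self, level):
        # Audio-reactive path: only ever raises brightness; tick() fades it.
        self.intensity = max(self.intensity, level)

    def tick(self):
        """One refresh: return the gamma-corrected RGB output, then fade."""
        scale = GAMMA_LUT[self.intensity]
        out = tuple(c * scale // 255 for c in self.color)
        # Assumption: fade_rate is the intensity subtracted per tick.
        self.intensity = max(0, self.intensity - self.fade_rate)
        return out
```

Under this assumption a pulse started at intensity 255 with fade rate 51 is dark after six ticks, or 150ms at 40Hz.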

Color semantics:

  • Green pulse: listening / mic active
  • White pulse: audio playback / VAD visualization
  • Red pulse: error / cannot listen (muted or no server)

Volume Control

Volume is bit-shift based: PCM samples are left-shifted by the volume value (0-20, default 14), so each step doubles the amplitude (about 6 dB). It is set per playback via the 0xAA header and is configurable via NVS.
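A small sketch of what shift-based volume implies numerically. It assumes 16-bit input samples widened into 32-bit output words and saturation on overflow; neither detail is stated above, and the function names are hypothetical.

```python
import math

INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)
DEFAULT_VOLUME = 14


def apply_volume(sample, volume=DEFAULT_VOLUME):
    """Left-shift a 16-bit PCM sample into a 32-bit output word."""
    if not 0 <= volume <= 20:
        raise ValueError("volume must be in 0..20")
    shifted = sample << volume  # same as sample * 2**volume, also for negatives
    # Assumption: clamp to the int32 range instead of letting the value wrap.
    return max(INT32_MIN, min(INT32_MAX, shifted))


def relative_gain_db(volume, reference=DEFAULT_VOLUME):
    """Gain of one volume setting relative to another, in decibels."""
    return 20 * math.log10(2) * (volume - reference)
```

Under these assumptions a full-scale 16-bit sample reaches the 32-bit limit at volume 16, so higher settings would only add headroom for quiet material.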

Playback Buffering

TCP → tcpBuffer (512B) → wavData (2MB PSRAM) → I2S DMA → Speaker
  • Buffer threshold: 4096 samples (256ms) before starting I2S playback — balances latency vs jitter resilience
  • Without PSRAM: falls back to 1024 samples (64ms), 4KB allocation
  • I2S DMA: 4 buffers × 512 samples, hardware-driven (no CPU polling)
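The start threshold can be illustrated with a simple gate that holds back playback until enough decoded samples have accumulated. The thresholds are the ones listed above; the class and its names are hypothetical, and the handling of clips shorter than the threshold is left out.

```python
SAMPLE_RATE = 16_000
THRESHOLD_PSRAM = 4096      # samples buffered before playback starts
THRESHOLD_NO_PSRAM = 1024   # fallback when no PSRAM is available


def prebuffer_ms(threshold_samples):
    """Latency added by waiting for the threshold, in milliseconds."""
    return threshold_samples * 1000 / SAMPLE_RATE


class PlaybackBuffer:
    def __init__(self, has_psram=True):
        self.threshold = THRESHOLD_PSRAM if has_psram else THRESHOLD_NO_PSRAM
        self.samples = []
        self.started = False

    def push(self, decoded):
        """Append decoded samples; return True once playback may start."""
        self.samples.extend(decoded)
        if not self.started and len(self.samples) >= self.threshold:
            self.started = True
        return self.started
```

With 320-sample Opus frames, the 4096-sample threshold is crossed on the 13th frame (4160 samples), so playback would begin roughly 260ms into the stream. A real implementation would also need to start playback when the stream ends before the threshold is reached.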

Configuration

Stored in NVS (ESP32 Preferences): WiFi credentials, server hostname, volume, mic timeout. Editable via UART config mode (c command).

UART Debug Commands

  • r: restart
  • M: mic on for 10 min
  • m: mic off
  • W / w: LED test, fast / slow
  • L / l: LEDs max / off
  • A: multicast announce
  • c: config mode