Stream agentic LLM responses, add contextual stall classifier, rename backends

- SSE sentence-level streaming: consume agent deltas, split on sentence
  boundaries (handles no-space chunk joins), synthesize+send each sentence
  as it forms; intermediate sends keep mic_timeout=0
- Gemini-backed stall classifier for agentic mode only: narrow to
  retrieval-only, pass prev user/assistant for context awareness, avoid
  action promises the stall can't honor, sub-second latency via
  reasoning_effort=none
- Rename backends: local -> conversational, managed -> agentic
  (files, classes, config keys)
- PTT interrupt fix: set device.interrupted when button-press frames
  arrive mid-response and keep buffering so the next utterance captures
  cleanly instead of being dropped
- Startup summary log showing ASR, LLM, STALL, and TTS config at a glance
- run.sh launcher with Homebrew libopus path for macOS
- voice_prompt config for per-turn agentic reminders; inline continuity
  note injection so the agent knows what the stall just said aloud
- README section on streaming, stalls, and the first-turn OpenClaw caveat
This commit is contained in:
justLV 2026-04-12 13:55:59 -07:00
parent 19aca75ba8
commit dccb6ced15
14 changed files with 2965 additions and 135 deletions

4
.gitignore vendored
View file

@ -26,5 +26,9 @@ logs/
*.wav
!data/.gitkeep
# Local debug scripts (may contain personal test queries)
test_stall.py
test_stream.py
# Claude
.claude/

View file

@ -12,7 +12,7 @@ This repo consists of:
## What's new in v2
* **OpenClaw managed backend** 🦞 -- delegate conversation history and session management to an [OpenClaw](https://github.com/openclaw) gateway for centralized, multi-device orchestration
* **Agentic backend** 🦞 -- delegate conversation history, session management, and tool execution to an [OpenClaw](https://github.com/openclaw) gateway for centralized, multi-device orchestration
* **Opus compression** -- 14-16x downstream compression (server to speaker) for better audio quality over WiFi
* **Streaming-ready architecture** -- designed for sentence-level TTS streaming and agentic tool-calling loops
* **Modular async pipeline** -- replaced the monolithic server with a pluggable architecture for ASR, LLM, and TTS backends etc.
@ -85,9 +85,19 @@ A zero-length Opus frame (`0x00 0x00`) signals end of speech.
The pipeline supports two conversation backends, selectable via `config.yaml`:
**Local** (`conversation.backend: "local"`): Manages conversation history locally with per-device JSON persistence. Sends the full message history on each LLM request. Works with any OpenAI-compatible endpoint.
**Conversational** (`conversation.backend: "conversational"`): Plain chat-completions backend. Manages conversation history client-side with per-device JSON persistence and sends the full message history on each LLM request. Works with any OpenAI-compatible endpoint (OpenRouter, Gemini, Ollama, mlx_lm.server, etc.). Good for simple voice chat with no tool use.
**OpenClaw Managed** (`conversation.backend: "managed"`): Delegates session management to an [OpenClaw](https://github.com/openclaw) gateway. Only sends the latest user message -- OpenClaw tracks history server-side using the device ID as the session key. Set `OPENCLAW_GATEWAY_TOKEN` in your environment and point `base_url` at your gateway.
**Agentic** (`conversation.backend: "agentic"`): Delegates session management and tool execution to a remote agent gateway like [OpenClaw](https://github.com/openclaw). Only sends the latest user message — the gateway tracks history server-side using the device ID as the session key, and runs its own tool loop (web search, file access, multi-step work). Set `OPENCLAW_GATEWAY_TOKEN` in your environment and point `base_url` at your gateway.
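For example, switching the pipeline to the agentic backend is a small `config.yaml` change (fragment abridged from `config.yaml.example`; adjust `base_url` to your gateway):

```yaml
conversation:
  backend: "agentic"            # or "conversational" for plain chat
  agentic:
    base_url: "http://127.0.0.1:18789/v1"   # OpenClaw gateway
    api_key: "${OPENCLAW_GATEWAY_TOKEN}"
    model: "openclaw/default"
    message_channel: "onju-voice"
```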
### Streaming and stalls
Agentic requests can take 5-60+ seconds while the gateway runs tools, so the pipeline is built to make that wait feel responsive.
**Sentence-level streaming.** The agent's SSE response is consumed delta by delta. A splitter (`pipeline/conversation/__init__.py:sentence_chunks`) buffers text until it hits a sentence boundary, then hands the sentence to TTS and pushes the audio to the device. Any narration the agent emits between tool calls gets spoken aloud as it arrives, while the next tool runs in the background. Intermediate sends use `mic_timeout=0` so the mic only reopens after the final chunk.
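A condensed sketch of that loop (the real path in `pipeline/main.py` also handles stalls, interrupts, and TTS failures; `synthesize_and_send` here is a stand-in for the TTS, Opus encode, and TCP send steps):

```python
from pipeline.conversation import sentence_chunks

async def speak_streaming_reply(backend, user_text, synthesize_and_send):
    """Flush each completed sentence as it forms; only the last one reopens the mic."""
    sentences = []
    async for sentence in sentence_chunks(backend.stream(user_text)):
        sentences.append(sentence)
        if len(sentences) > 1:
            # Everything before the newest sentence is safe to speak as a
            # non-final chunk (mic_timeout=0 keeps the device's mic closed).
            await synthesize_and_send(sentences[-2], is_final=False)
    if sentences:
        await synthesize_and_send(sentences[-1], is_final=True)
```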
**Contextual stall phrases.** Before the main agent call, the pipeline fires a fast classifier that decides whether the question needs a brief spoken acknowledgment. Conversational questions return `NONE` and get no stall. Tool-needing questions get a short personality-matched phrase that plays within about a second while the agent works. The stall text is then injected back into the agent's user message as a parenthetical continuity note so it doesn't repeat itself. Configure in `conversation.stall`.
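The knobs, abridged from `config.yaml.example` (the full classifier prompt lives there too):

```yaml
conversation:
  stall:
    enabled: true
    model: "gemini-2.5-flash"
    reasoning_effort: "none"   # disable thinking for sub-second latency
    timeout: 1.5               # seconds; skip the stall if the classifier is slower
```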
**The first-turn caveat with OpenClaw.** OpenClaw's OpenAI-compatible endpoint buffers all content from the first agent turn until the first round of tool execution completes. If the model generates an opening sentence and then calls a tool, that sentence is held server-side until the tool finishes. Narration between *subsequent* tool rounds streams fine. This is why the stall classifier exists: it gives the user a fast spoken acknowledgment that bypasses the gateway's first-turn buffering. See `pipeline/conversation/stall.py`.
### Setting up OpenClaw
@ -102,7 +112,7 @@ This will:
2. Append a voice mode prompt to `~/.openclaw/workspace/AGENTS.md` (tells the agent to respond in concise, speech-friendly prose when the message channel is `onju-voice`)
3. Restart the gateway
Then set `conversation.backend: "managed"` in `pipeline/config.yaml` and ensure `OPENCLAW_GATEWAY_TOKEN` is set in your environment.
Then set `conversation.backend: "agentic"` in `pipeline/config.yaml` and ensure `OPENCLAW_GATEWAY_TOKEN` is set in your environment.
## Installation
@ -197,9 +207,10 @@ See [`pipeline/config.yaml.example`](pipeline/config.yaml.example) for all optio
| Section | What it controls |
|---|---|
| `asr` | Speech-to-text service URL |
| `conversation.backend` | `"local"` or `"managed"` (OpenClaw) |
| `conversation.local` | LLM endpoint, model, system prompt, message history |
| `conversation.managed` | OpenClaw gateway URL, auth token, message channel |
| `conversation.backend` | `"conversational"` (plain chat) or `"agentic"` (OpenClaw, tools) |
| `conversation.conversational` | LLM endpoint, model, system prompt, message history |
| `conversation.agentic` | OpenClaw gateway URL, auth token, message channel |
| `conversation.stall` | Fast classifier that decides if the agentic backend needs a brief spoken stall |
| `tts` | TTS backend (`"elevenlabs"` or `"qwen3"`), voice settings |
| `vad` | Voice activity detection thresholds and timing |
| `network` | UDP/TCP/multicast ports |
@ -209,9 +220,10 @@ See [`pipeline/config.yaml.example`](pipeline/config.yaml.example) for all optio
| Variable | Used by |
|---|---|
| `OPENROUTER_API_KEY` | Local backend via OpenRouter (default) |
| `ANTHROPIC_API_KEY` | Local backend via Anthropic API directly |
| `OPENCLAW_GATEWAY_TOKEN` | Managed (OpenClaw) backend |
| `OPENROUTER_API_KEY` | Conversational backend via OpenRouter |
| `GEMINI_API_KEY` | Conversational backend via Gemini, and the stall classifier |
| `ANTHROPIC_API_KEY` | Conversational backend via Anthropic API directly |
| `OPENCLAW_GATEWAY_TOKEN` | Agentic (OpenClaw) backend |
## Testing

View file

@ -2,17 +2,72 @@ asr:
url: "http://localhost:8100" # parakeet-asr-server
conversation:
backend: "managed" # "managed" (e.g. OpenClaw) or "local" (conversational only)
backend: "agentic" # "agentic" (e.g. OpenClaw, with tools) or "conversational" (plain chat)
managed:
stall:
enabled: true # decide if a stall phrase is needed while the agent works
base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
api_key: "${GEMINI_API_KEY}"
model: "gemini-2.5-flash"
reasoning_effort: "none" # disable thinking for sub-second latency (Gemini 2.5 Flash only)
max_tokens: 200
timeout: 1.5 # seconds; skip stall if slower than this
prompt: |
You are the bridge voice for a voice assistant. Your only job is to
decide if the user's latest utterance needs a brief spoken placeholder
while the assistant runs a slow information lookup in the background.
{recent_context}
The user just said: {user_text}
Output NONE if any of these apply:
- The assistant can answer from its own knowledge or creativity
(the assistant is itself a capable language model — facts,
jokes, opinions, explanations, general knowledge).
- It's conversational, small talk, a partial thought, or a
request for the assistant to keep talking ("go on", "tell me
more"). A follow-up that asks for NEW information about a
different parameter ("what about Saturday?", "same but for
Tuesday") is still a real lookup and should stall.
- The user is asking the assistant to DO something rather than
FIND something — any action verb like remember, save, schedule,
book, send, add, mark, delete, create, update. You are the
bridge voice, not the agent executing anything. You cannot
honestly promise an action from here, so let the agent confirm
it itself. This rule applies even if carrying out the action
happens to involve some internal lookup.
Only generate a stall if the user is asking a question whose answer
genuinely requires a slow external fetch — live web search, current
data, file read, API call, real-time info. Retrieval is the only
reason to stall.
When a lookup is needed, generate a three-to-seven-word spoken stall
ending with a period. Warm, present, friend energy. A tiny reaction
plus a signal that you're about to go look it up. If the user named a
place, person, or thing worth acknowledging, use its name rather than
"that". No predictions about the answer, no promises of action, no
call-center filler.
Output ONLY the stall phrase OR the literal word NONE. No quotes, no
explanation, no preamble.
agentic:
base_url: "http://127.0.0.1:18789/v1" # OpenClaw gateway
api_key: "${OPENCLAW_GATEWAY_TOKEN}" # env var reference
model: "openclaw/default"
max_tokens: 300
message_channel: "onju-voice" # x-openclaw-message-channel header
# provider_model: "anthropic/claude-opus-4-6" # optional: override backend LLM
voice_prompt: >- # prepended to every user message as a reminder
[voice: this is spoken input transcribed from a microphone and your entire
response will be read aloud by TTS on a small speaker. Write only plain
spoken prose — no markdown, no lists, no structured reports, no code. If
your research produces detailed findings, save them to a file and just
give a brief spoken summary. Remember, keep it conversational.]
local:
conversational:
base_url: "https://openrouter.ai/api/v1" # OpenRouter, Ollama, mlx_lm.server, Gemini, etc.
api_key: "${OPENROUTER_API_KEY}" # set key or use ${ENV_VAR} reference
model: "anthropic/claude-haiku-4.5"
@ -20,7 +75,7 @@ conversation:
max_tokens: 300
system_prompt: "You are a helpful voice assistant. Keep responses concise (under 2 sentences)."
persist_dir: "data/conversations" # per-device message history (omit to disable)
# Local example (Ollama):
# Fully local example (Ollama):
# base_url: "http://localhost:11434/v1"
# api_key: "none"
# model: "gemma4:e4b"

View file

@ -1,16 +1,47 @@
import re
from typing import AsyncIterator
from pipeline.conversation.base import ConversationBackend
from pipeline.conversation.local import LocalConversation
from pipeline.conversation.managed import ManagedConversation
from pipeline.conversation.conversational import ConversationalBackend
from pipeline.conversation.agentic import AgenticBackend
# Primary: punctuation followed by whitespace (safe, standard).
_SENTENCE_END = re.compile(r"[.!?\n]+\s+")
# Fallback: punctuation with no space, but only when preceded by a lowercase
# letter and followed by uppercase. Catches OpenClaw's chunk-boundary joins
# ("now.The") without breaking abbreviations like "U.S." (uppercase before dot).
_SENTENCE_END_NOSPACE = re.compile(r"(?<=[a-z])[.!?]+(?=[A-Z])")
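# Illustrative behavior of the fallback pattern (made-up strings, not test fixtures):
#   "Found it now.The forecast says rain."  ->  "Found it now." / "The forecast says rain."
#   "per U.S.GAAP rules"                    ->  no split (uppercase "S" precedes the dot)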
async def sentence_chunks(deltas: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer text deltas and yield one sentence at a time, plus any trailing
    fragment when the stream ends."""
    buffer = ""
    async for delta in deltas:
        buffer += delta
        while True:
            m = _SENTENCE_END.search(buffer)
            if not m:
                m = _SENTENCE_END_NOSPACE.search(buffer)
            if not m:
                break
            sentence = buffer[: m.end()].strip()
            buffer = buffer[m.end():]
            if sentence:
                yield sentence
    tail = buffer.strip()
    if tail:
        yield tail
def create_backend(config: dict, device_id: str) -> ConversationBackend:
    """Create a conversation backend based on config."""
    conv_cfg = config["conversation"]
    backend = conv_cfg.get("backend", "local")
    backend = conv_cfg.get("backend", "conversational")
    if backend == "local":
        return LocalConversation(conv_cfg["local"], device_id)
    elif backend == "managed":
        return ManagedConversation(conv_cfg["managed"], device_id)
    if backend == "conversational":
        return ConversationalBackend(conv_cfg["conversational"], device_id)
    elif backend == "agentic":
        return AgenticBackend(conv_cfg["agentic"], device_id)
    else:
        raise ValueError(f"Unknown conversation backend: {backend}")

View file

@ -0,0 +1,78 @@
import logging
import os
import re
from typing import AsyncIterator
from openai import AsyncOpenAI
log = logging.getLogger(__name__)
def _resolve_env(value: str) -> str:
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
class AgenticBackend:
    """Delegates conversation memory and tool execution to a remote agent
    service (e.g. OpenClaw). Only sends the latest user message; the remote
    service tracks session history via the session key derived from the
    device ID, and runs its own tool loop server-side."""
    def __init__(self, cfg: dict, device_id: str):
        self.cfg = cfg
        self.device_id = device_id
        self.message_channel = cfg.get("message_channel", "onju-voice")
        self.client = AsyncOpenAI(
            base_url=cfg["base_url"],
            api_key=_resolve_env(cfg.get("api_key", "none")),
            default_headers={
                "x-openclaw-message-channel": self.message_channel,
            },
        )
    def _build_kwargs(self, user_text: str, extra_context: str | None = None) -> dict:
        voice_prompt = self.cfg.get("voice_prompt")
        parts = []
        if voice_prompt:
            parts.append(voice_prompt)
        if extra_context:
            parts.append(extra_context)
        parts.append(user_text)
        content = "\n\n".join(parts)
        kwargs = dict(
            model=self.cfg.get("model", "openclaw/default"),
            messages=[{"role": "user", "content": content}],
            max_tokens=self.cfg.get("max_tokens", 300),
            user=self.device_id,
        )
        if self.cfg.get("provider_model"):
            kwargs["extra_headers"] = {"x-openclaw-model": self.cfg["provider_model"]}
        return kwargs
    async def send(self, user_text: str, extra_context: str | None = None) -> str:
        response = await self.client.chat.completions.create(
            **self._build_kwargs(user_text, extra_context)
        )
        text = response.choices[0].message.content or ""
        log.debug(f"[{self.device_id}] managed LLM: {text}")
        return text
    async def stream(self, user_text: str, extra_context: str | None = None) -> AsyncIterator[str]:
        kwargs = self._build_kwargs(user_text, extra_context)
        kwargs["stream"] = True
        stream = await self.client.chat.completions.create(**kwargs)
        async for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            if delta:
                yield delta
    def reset(self) -> None:
        pass # session reset would require an API call if supported
    def get_messages(self) -> list[dict]:
        return [] # history lives on the remote service
    def set_messages(self, messages: list[dict]) -> None:
        pass # no-op — remote service owns the history
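# Shape of each outgoing request (values illustrative; the OpenAI client posts
# to {base_url}/chat/completions):
#   header   x-openclaw-message-channel: onju-voice
#   body     model="openclaw/default", user="<device_id>",
#            messages=[{"role": "user", "content": "<voice_prompt>\n\n<continuity note>\n\n<transcript>"}]
#   optional header x-openclaw-model: <provider_model> when configured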

View file

@ -1,10 +1,14 @@
from typing import Protocol, runtime_checkable
from typing import AsyncIterator, Protocol, runtime_checkable
@runtime_checkable
class ConversationBackend(Protocol):
    async def send(self, user_text: str) -> str:
        """Send a user message, return the assistant's response."""
    async def send(self, user_text: str, extra_context: str | None = None) -> str:
        """Send a user message, return the full assistant response."""
        ...
    def stream(self, user_text: str, extra_context: str | None = None) -> AsyncIterator[str]:
        """Send a user message, yield assistant text deltas as they arrive."""
        ...
    def reset(self) -> None:
View file

@ -2,6 +2,7 @@ import json
import logging
import os
import re
from typing import AsyncIterator
from openai import AsyncOpenAI
@ -12,8 +13,10 @@ def _resolve_env(value: str) -> str:
return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
class LocalConversation:
"""Manages conversation history locally and sends full context to any OpenAI-compatible endpoint."""
class ConversationalBackend:
"""Simple chat-completions backend: manages conversation history on the
client and sends the full context to any OpenAI-compatible endpoint.
Good for plain LLM chat with no tool use."""
def __init__(self, cfg: dict, device_id: str):
self.cfg = cfg
@ -34,29 +37,56 @@ class LocalConversation:
self.messages: list[dict] = self._load() or [{"role": "system", "content": cfg["system_prompt"]}]
async def send(self, user_text: str) -> str:
self._sanitize()
self.messages.append({"role": "user", "content": user_text})
def _build_kwargs(self) -> dict:
kwargs = dict(
model=self.cfg["model"],
messages=self.messages,
max_tokens=self.cfg.get("max_tokens", 300),
)
if self.cfg.get("thinking_budget") is not None:
kwargs["extra_body"] = {
"google": {"thinking_config": {"thinking_budget": self.cfg["thinking_budget"]}}
}
# Gemini 2.5 via OpenAI-compat: disable thinking with reasoning_effort.
# https://ai.google.dev/gemini-api/docs/openai
if self.cfg.get("reasoning_effort"):
kwargs["reasoning_effort"] = self.cfg["reasoning_effort"]
return kwargs
response = await self.client.chat.completions.create(**kwargs)
text = response.choices[0].message.content or ""
def _finalize(self, text: str) -> None:
self.messages.append({"role": "assistant", "content": text})
self._prune()
self.save()
def _wrap_user(self, user_text: str, extra_context: str | None) -> str:
return f"{extra_context}\n\n{user_text}" if extra_context else user_text
async def send(self, user_text: str, extra_context: str | None = None) -> str:
self._sanitize()
self.messages.append({"role": "user", "content": self._wrap_user(user_text, extra_context)})
response = await self.client.chat.completions.create(**self._build_kwargs())
text = response.choices[0].message.content or ""
self._finalize(text)
log.debug(f"[{self.device_id}] LLM: {text}")
return text
async def stream(self, user_text: str, extra_context: str | None = None) -> AsyncIterator[str]:
self._sanitize()
self.messages.append({"role": "user", "content": self._wrap_user(user_text, extra_context)})
kwargs = self._build_kwargs()
kwargs["stream"] = True
stream = await self.client.chat.completions.create(**kwargs)
parts: list[str] = []
try:
async for chunk in stream:
if not chunk.choices:
continue
delta = chunk.choices[0].delta.content or ""
if delta:
parts.append(delta)
yield delta
finally:
self._finalize("".join(parts))
def reset(self) -> None:
self.messages = [{"role": "system", "content": self.cfg["system_prompt"]}]
self.save()

View file

@ -1,59 +0,0 @@
import logging
import os
import re
from openai import AsyncOpenAI
log = logging.getLogger(__name__)
def _resolve_env(value: str) -> str:
return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
class ManagedConversation:
"""Delegates conversation memory to a remote service (OpenClaw, etc).
Only sends the latest user message; the remote service tracks session history
via the session key derived from the device ID.
"""
def __init__(self, cfg: dict, device_id: str):
self.cfg = cfg
self.device_id = device_id
self.message_channel = cfg.get("message_channel", "onju-voice")
self.client = AsyncOpenAI(
base_url=cfg["base_url"],
api_key=_resolve_env(cfg.get("api_key", "none")),
default_headers={
"x-openclaw-message-channel": self.message_channel,
},
)
async def send(self, user_text: str) -> str:
kwargs = dict(
model=self.cfg.get("model", "openclaw/default"),
messages=[{"role": "user", "content": user_text}],
max_tokens=self.cfg.get("max_tokens", 300),
user=self.device_id,
)
extra_headers = {}
if self.cfg.get("provider_model"):
extra_headers["x-openclaw-model"] = self.cfg["provider_model"]
if extra_headers:
kwargs["extra_headers"] = extra_headers
response = await self.client.chat.completions.create(**kwargs)
text = response.choices[0].message.content or ""
log.debug(f"[{self.device_id}] managed LLM: {text}")
return text
def reset(self) -> None:
pass # session reset would require an API call if supported
def get_messages(self) -> list[dict]:
return [] # history lives on the remote service
def set_messages(self, messages: list[dict]) -> None:
pass # no-op — remote service owns the history

View file

@ -0,0 +1,100 @@
import asyncio
import logging
import os
import re
from openai import AsyncOpenAI
log = logging.getLogger(__name__)
def _resolve_env(value: str) -> str:
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
_client: AsyncOpenAI | None = None
_client_key: tuple | None = None
def _get_client(cfg: dict) -> AsyncOpenAI:
    """Lazy singleton so we reuse the HTTP connection pool across turns."""
    global _client, _client_key
    key = (cfg["base_url"], cfg.get("api_key", ""))
    if _client is None or _client_key != key:
        _client = AsyncOpenAI(
            base_url=cfg["base_url"],
            api_key=_resolve_env(cfg.get("api_key", "none")),
        )
        _client_key = key
    return _client
async def decide_stall(
    user_text: str,
    config: dict,
    prev_user: str | None = None,
    prev_assistant: str | None = None,
) -> str | None:
    """Ask a fast classifier model whether the voice assistant should say a
    brief stall phrase before answering. Returns the stall text to speak, or
    None if the query is conversational and needs no stall (or the classifier
    failed/timed out).
    Only runs in agentic mode; conversational backends already respond
    quickly and don't benefit from a stall.
    Pass the previous user/assistant exchange as context so the classifier
    can recognize follow-ups, continuation cues ("go on"), and mid-conversation
    prefaces ("one more thing") as conversational rather than tool-needing."""
    conv_cfg = config.get("conversation", {})
    if conv_cfg.get("backend") != "agentic":
        return None
    cfg = conv_cfg.get("stall")
    if not cfg or not cfg.get("enabled", False):
        return None
    timeout = cfg.get("timeout", 1.5)
    if prev_user or prev_assistant:
        context_block = "Previous turn in this conversation:\n"
        if prev_user:
            context_block += f" User said: {prev_user}\n"
        if prev_assistant:
            context_block += f" You replied: {prev_assistant}\n"
    else:
        context_block = "(No previous turn — this is the start of the conversation.)"
    prompt = (
        cfg["prompt"]
        .replace("{user_text}", user_text)
        .replace("{recent_context}", context_block)
    )
    kwargs = dict(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=cfg.get("max_tokens", 200),
    )
    # Gemini 2.5 models think by default — disable it for the stall call since
    # we need sub-second latency and the classification is trivial.
    # https://ai.google.dev/gemini-api/docs/openai
    if cfg.get("reasoning_effort"):
        kwargs["reasoning_effort"] = cfg["reasoning_effort"]
    try:
        response = await asyncio.wait_for(
            _get_client(cfg).chat.completions.create(**kwargs),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        log.info(f"STALL timeout after {timeout}s — skipping")
        return None
    except Exception as e:
        log.warning(f"STALL classifier failed: {e}")
        return None
    text = (response.choices[0].message.content or "").strip()
    if not text or text.upper().startswith("NONE"):
        return None
    return text
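# Illustrative outcomes (made-up phrasings, not fixtures):
#   "what's the weather in Kyoto tomorrow?"  ->  "Let me check Kyoto's forecast."
#   "tell me a joke about llamas"            ->  None (answerable from model knowledge)
#   "remind me to call Sam at five"          ->  None (action request; the agent confirms it itself)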

View file

@ -24,6 +24,7 @@ class Device:
        self.voice = el_cfg.get("default_voice", "Emma")
        self.ptt = ptt
        self.vad = None if ptt else VAD(config)
        self.last_user_text: str | None = None
        self.last_response: str | None = None
        self.led_power = 0
        self.led_update_time = 0.0

View file

@ -13,7 +13,8 @@ import numpy as np
import yaml
from pipeline.audio import decode_ulaw, opus_encode, opus_frames_to_tcp_payload, pcm_to_wav
from pipeline.conversation import create_backend
from pipeline.conversation import create_backend, sentence_chunks
from pipeline.conversation import stall as stall_mod
from pipeline.device import Device, DeviceManager
from pipeline.protocol import send_audio, send_led_blink, open_led_connection, write_led_blink, close_led_connection
from pipeline.services import asr, tts
@ -94,9 +95,13 @@ async def udp_listener(config: dict, manager: DeviceManager, utterance_queue: as
pcm = decode_ulaw(data)
if device.ptt:
# PTT: just buffer, no VAD needed
if device.processing:
continue
# PTT frames arriving while we're still responding means the user
# pressed the button to interrupt. The button press is the intent
# signal; keep buffering so these frames become the next utterance
# once the current turn bails out.
if device.processing and not device.interrupted.is_set():
log.info(f"PTT interrupt from {device.hostname}")
device.interrupted.set()
device.ptt_buffer.append(pcm)
else:
# VOX: run VAD
@ -228,41 +233,163 @@ async def process_utterances(config: dict, manager: DeviceManager, utterance_que
log.info(f"Interrupted before LLM")
continue
# Conversation
# Streaming LLM → sentence-buffered TTS → Opus → TCP.
# Intermediate sends use mic_timeout=0 so the mic only reopens after
# the final chunk has played out.
sample_rate = config["audio"]["sample_rate"]
opus_frame_size = config["audio"]["opus_frame_size"]
turn_t0 = time.monotonic()
full_response: list[str] = []
pending: str | None = None # sentence waiting to be flushed
sent_partial = False # any non-final chunk already sent?
first_sentence_at: float | None = None
stream_start_at: float | None = None
async def send_sentence(sentence: str, is_final: bool) -> bool:
"""Synthesize, encode, and push one sentence. Returns True on
success, False if interrupted or TTS failed."""
nonlocal sent_partial
if device.interrupted.is_set():
log.info(f"Interrupted before TTS")
return False
try:
pcm = await tts.synthesize(sentence, device.voice, config)
except Exception as e:
log.error(f"TTS failed: {e}")
return False
if device.interrupted.is_set():
log.info(f"Interrupted before send")
return False
frames = opus_encode(pcm, sample_rate, opus_frame_size)
payload = opus_frames_to_tcp_payload(frames)
mic_timeout = dev_cfg["default_mic_timeout"] if is_final else 0
log.info(f"SEND [+{time.monotonic() - turn_t0:.2f}s] "
f"{len(frames)} opus frames to {device.ip} "
f"({'final' if is_final else 'partial'}: {sentence!r})")
await send_audio(device.ip, tcp_port, payload,
mic_timeout=mic_timeout,
volume=dev_cfg["default_volume"],
fade=dev_cfg["led_fade"])
if not is_final:
sent_partial = True
return True
async def reopen_mic_if_needed():
"""If we already sent partial audio with mic_timeout=0, the mic
is closed push an empty audio to reopen it on recovery."""
if sent_partial and not device.ptt:
try:
await send_audio(device.ip, tcp_port, b"",
mic_timeout=dev_cfg["default_mic_timeout"],
volume=0, fade=0)
except Exception:
pass
# Stall decision (agentic mode only; blocking, capped by config timeout).
# Passes the previous exchange so the classifier can recognize
# continuations ("go on") and prefaces ("one more thing") as
# conversational rather than tool-needing.
stall_text: str | None = None
if config["conversation"].get("backend") == "agentic":
stall_text = await stall_mod.decide_stall(
text,
config,
prev_user=device.last_user_text,
prev_assistant=device.last_response,
)
stall_decided_at = time.monotonic() - turn_t0
if stall_text:
log.info(f"STALL [+{stall_decided_at:.2f}s] decided: {stall_text!r}")
else:
log.info(f"STALL [+{stall_decided_at:.2f}s] NONE")
device.last_user_text = text
# Fire stall TTS+send in parallel with OpenClaw warming up.
stall_task: asyncio.Task | None = None
extra_context: str | None = None
if stall_text:
stall_task = asyncio.create_task(send_sentence(stall_text, is_final=False))
extra_context = (
f"(You already said aloud to the user: \"{stall_text}\""
f"don't repeat this phrase, continue naturally with the answer.)"
)
aborted = False
try:
response_text = await device.conversation.send(text)
stream_start_at = time.monotonic()
async for sentence in sentence_chunks(
device.conversation.stream(text, extra_context=extra_context)
):
full_response.append(sentence)
if first_sentence_at is None:
first_sentence_at = time.monotonic()
ttfs_turn = first_sentence_at - turn_t0
ttfs_stream = first_sentence_at - stream_start_at
log.info(f"LLM first sentence [+{ttfs_turn:.2f}s turn / "
f"{ttfs_stream:.2f}s stream]: {sentence}")
else:
log.debug(f"LLM sentence: {sentence}")
# Make sure the stall audio has finished sending before we
# start pushing OpenClaw content to the device.
if stall_task is not None and not stall_task.done():
await stall_task
stall_task = None
# Flush the *previous* sentence as non-final; whichever
# sentence is last when the stream ends becomes the final.
if pending is not None:
if not await send_sentence(pending, is_final=False):
aborted = True
break
pending = sentence
except Exception as e:
log.error(f"LLM failed: {e}")
if stall_task is not None:
try:
await stall_task
except Exception:
pass
await reopen_mic_if_needed()
continue
# Drain the stall task if it's still pending (e.g. LLM stream
# returned zero content).
if stall_task is not None:
try:
await stall_task
except Exception:
pass
stall_task = None
if aborted:
await reopen_mic_if_needed()
continue
if pending is not None:
if not await send_sentence(pending, is_final=True):
await reopen_mic_if_needed()
continue
elif sent_partial:
# The stall played but OpenClaw returned nothing — reopen mic.
if not device.ptt:
await send_audio(device.ip, tcp_port, b"",
mic_timeout=dev_cfg["default_mic_timeout"],
volume=0, fade=0)
elif not device.ptt:
# No content from the LLM and no stall — still reopen the mic.
await send_audio(device.ip, tcp_port, b"",
mic_timeout=dev_cfg["default_mic_timeout"],
volume=0, fade=0)
response_text = " ".join(full_response)
device.last_response = response_text
log.info(f"LLM {response_text}")
# Check for interrupt before TTS
if device.interrupted.is_set():
log.info(f"Interrupted before TTS")
continue
# TTS
try:
pcm_response = await tts.synthesize(response_text, device.voice, config)
log.info(f"TTS {len(pcm_response)} bytes ({len(pcm_response)/32000:.1f}s)")
except Exception as e:
log.error(f"TTS failed: {e}")
continue
# Check for interrupt before sending audio
if device.interrupted.is_set():
log.info(f"Interrupted before send")
continue
# Opus encode and send
frames = opus_encode(pcm_response, config["audio"]["sample_rate"], config["audio"]["opus_frame_size"])
payload = opus_frames_to_tcp_payload(frames)
log.info(f"SEND {len(frames)} opus frames to {device.ip}")
await send_audio(device.ip, tcp_port, payload,
mic_timeout=dev_cfg["default_mic_timeout"],
volume=dev_cfg["default_volume"],
fade=dev_cfg["led_fade"])
elapsed = time.monotonic() - turn_t0
ttfs = f"{first_sentence_at - turn_t0:.2f}s" if first_sentence_at else ""
log.info(f"LLM [{ttfs} first / {elapsed:.2f}s total / "
f"{len(full_response)} sentences / {len(response_text)} chars] "
f"{response_text}")
except Exception as e:
log.error(f"Pipeline error ({device.hostname}): {e}")
@ -372,6 +499,41 @@ def _http_respond(writer: asyncio.StreamWriter, status: int, body: str):
writer.write(f"HTTP/1.1 {status} {reason}\r\nContent-Type: application/json\r\nContent-Length: {len(body)}\r\nConnection: close\r\n\r\n{body}".encode())
def _log_startup_summary(config: dict) -> None:
"""Log the active endpoints and models so it's obvious at a glance how the
pipeline is configured for this run."""
conv_cfg = config["conversation"]
backend_name = conv_cfg.get("backend", "conversational")
backend_cfg = conv_cfg.get(backend_name, {})
log.info("Pipeline server starting")
log.info(f" ASR {config['asr']['url']}")
if backend_name == "agentic":
model = backend_cfg.get("provider_model") or backend_cfg.get("model", "?")
log.info(f" LLM agentic: {model} @ {backend_cfg.get('base_url', '?')} "
f"(channel={backend_cfg.get('message_channel', '?')})")
stall_cfg = conv_cfg.get("stall", {})
if stall_cfg.get("enabled"):
log.info(f" STALL {stall_cfg.get('model', '?')} @ {stall_cfg.get('base_url', '?')} "
f"(timeout={stall_cfg.get('timeout', 1.5)}s)")
else:
log.info(" STALL disabled")
else:
log.info(f" LLM conversational: {backend_cfg.get('model', '?')} "
f"@ {backend_cfg.get('base_url', '?')}")
tts_cfg = config["tts"]
tts_backend = tts_cfg.get("backend", "?")
if tts_backend == "elevenlabs":
el = tts_cfg.get("elevenlabs", {})
vox = el.get("default_voice", "?")
ptt = el.get("default_voice_ptt", vox)
log.info(f" TTS elevenlabs: VOX={vox} PTT={ptt}")
else:
log.info(f" TTS {tts_backend}")
async def warmup(config: dict):
"""Validate conversation backend and TTS are reachable."""
log.info("Warming up conversation backend and TTS...")
@ -446,7 +608,7 @@ async def main(config_path: str = None, do_warmup: bool = False, devices: list[s
utterance_queue = asyncio.Queue()
log.info("Pipeline server starting")
_log_startup_summary(config)
await asyncio.gather(
udp_listener(config, manager, utterance_queue),
multicast_listener(config, manager),

22
run.sh Executable file
View file

@ -0,0 +1,22 @@
#!/bin/bash
# Run the onju-voice pipeline server.
#
# Usage:
# ./run.sh # default config
# ./run.sh --warmup # warmup LLM+TTS on startup
# ./run.sh --device onju=10.0.0.5 # pre-register a device
cd "$(dirname "$0")"
# opuslib uses ctypes.util.find_library('opus'), which on macOS does not search
# Homebrew prefixes. Point the dynamic loader at the brew opus lib if present.
if [ "$(uname)" = "Darwin" ] && command -v brew >/dev/null 2>&1; then
if opus_prefix="$(brew --prefix opus 2>/dev/null)" && [ -d "$opus_prefix/lib" ]; then
export DYLD_FALLBACK_LIBRARY_PATH="$opus_prefix/lib:${DYLD_FALLBACK_LIBRARY_PATH:-/usr/local/lib:/usr/lib}"
else
echo "Warning: 'opus' not installed via Homebrew. Run: brew install opus"
fi
fi
echo "Starting onju-voice pipeline..."
uv run python -m pipeline.main "$@"

View file

@ -9,10 +9,12 @@ VOICE_SECTION='# Voice mode
When the message channel is `onju-voice`, your response will be spoken aloud by TTS on a small speaker. The user'\''s input is transcribed, so expect errors and infer meaning generously.
- Format: no emojis, markdown, URLs, file paths, tables, or structured data of any kind. Everything gets pronounced literally or breaks the cadence of speech. Respond in plain prose only.
- Length: one to two sentences. If a topic needs depth, give the headline and offer to elaborate — "Short answer is yes, want me to walk through why?" Never dump information. Voice is conversation, not briefing.
- Format: your output goes directly to a text-to-speech engine and out a speaker. No markdown, no backticks, no asterisks, no bullet points, no numbered lists, no emojis, no URLs. Never mention file names, folder paths, code snippets, or config names — the listener cannot see them and they sound terrible read aloud. If you need to refer to something technical, describe it in plain words: "I updated your search settings" not "I edited openclaw dot json." For social media handles, say the name naturally — "Jason Beale on X" not "at jabeale." Everything you write gets pronounced exactly as-is, so write only clean spoken prose.
- Length: one to two sentences per spoken chunk. If a topic needs depth, give the headline and offer to elaborate — "Short answer is yes, want me to walk through why?" Never dump information. Voice is conversation, not briefing.
- Long outputs: if your research or work produces detailed results — full reports, lists of findings, code, configs — save them to a file so the user can review later at a screen. Then give a brief spoken summary of what you found and mention that the details are saved. Never read out a long report, a list of bullet points, or structured data over voice.
- Cadence: speak the way a thoughtful friend speaks out loud. Warm, direct, unhurried. Use contractions. Say numbers the human way ("about three thousand", not "3,247"). Spell out abbreviations that don'\''t have a natural spoken form. Skip jargon when plain language exists.
- Tools: you have full tool access, but the user only hears your final reply — they don'\''t see tool output, intermediate steps, or your reasoning. Translate structured results into prose, picking the one or two facts that actually matter.
- Tools: you have full tool access, but the user only hears your final reply — they don'\''t see tool output, intermediate steps, or your reasoning. You can narrate what you'\''re doing as you go — short, casual check-ins like "Got it, checking the docs now" or "Okay, writing that up" are great between tool calls. Just keep each update to one short sentence and never read out code, file names, or technical details. Translate structured results into prose, picking the one or two facts that actually matter.
- Don'\''t open with a stall phrase: skip "on it" / "give me a sec" / "alright, pulling that up" style openers. The pipeline handles brief acknowledgments for slow requests separately, so when you speak, start with the actual answer or status update.
- Character: you are the same assistant the user talks to everywhere else. Same memory, same personality, same relationship. Voice changes only the form of your response, never the substance.'
echo "==> Enabling chat completions endpoint on OpenClaw gateway..."
@ -20,7 +22,19 @@ openclaw config set gateway.http.endpoints.chatCompletions.enabled true
if [ -f "$AGENTS_MD" ]; then
if grep -qF "$VOICE_MARKER" "$AGENTS_MD"; then
echo "==> Voice mode section already present in AGENTS.md, skipping."
echo "==> Replacing existing voice mode section in AGENTS.md..."
# Remove everything from "# Voice mode" to the next H1 or end-of-file,
# then append the updated section.
python3 -c "
import re, sys
text = open(sys.argv[1]).read()
# Strip old voice section: from '# Voice mode' to next ^# heading or EOF
text = re.sub(r'(?m)^# Voice mode\n.*?(?=^# |\Z)', '', text, flags=re.DOTALL).rstrip()
open(sys.argv[1], 'w').write(text + '\n')
" "$AGENTS_MD"
echo "" >> "$AGENTS_MD"
echo "$VOICE_SECTION" >> "$AGENTS_MD"
echo "==> Voice mode section replaced."
else
echo "" >> "$AGENTS_MD"
echo "$VOICE_SECTION" >> "$AGENTS_MD"
@ -35,4 +49,4 @@ fi
echo "==> Restarting OpenClaw gateway..."
openclaw gateway restart
echo "==> Done. Set conversation.backend: \"managed\" in pipeline/config.yaml to use OpenClaw."
echo "==> Done. Set conversation.backend: \"agentic\" in pipeline/config.yaml to use OpenClaw."

2376
uv.lock Normal file

File diff suppressed because it is too large