Stream agentic LLM responses, add contextual stall classifier, rename backends

- SSE sentence-level streaming: consume agent deltas, split on sentence
  boundaries (handles no-space chunk joins), synthesize+send each sentence
  as it forms; intermediate sends keep mic_timeout=0
- Gemini-backed stall classifier for agentic mode only: narrow to
  retrieval-only, pass prev user/assistant for context awareness, avoid
  action promises the stall can't honor, sub-second latency via
  reasoning_effort=none
- Rename backends: local -> conversational, managed -> agentic
  (files, classes, config keys)
- PTT interrupt fix: set device.interrupted when button-press frames
  arrive mid-response and keep buffering so the next utterance captures
  cleanly instead of being dropped
- Startup summary log showing ASR, LLM, STALL, and TTS config at a glance
- run.sh launcher with Homebrew libopus path for macOS
- voice_prompt config for per-turn agentic reminders; inline continuity
  note injection so the agent knows what the stall just said aloud
- README section on streaming, stalls, and the first-turn OpenClaw caveat
This commit is contained in:
justLV 2026-04-12 13:55:59 -07:00
parent 19aca75ba8
commit dccb6ced15
14 changed files with 2965 additions and 135 deletions

4
.gitignore vendored
View file

@ -26,5 +26,9 @@ logs/
*.wav
!data/.gitkeep
# Local debug scripts (may contain personal test queries)
test_stall.py
test_stream.py
# Claude
.claude/

View file

@ -12,7 +12,7 @@ This repo consists of:
## What's new in v2
* **OpenClaw managed backend** 🦞 -- delegate conversation history and session management to an [OpenClaw](https://github.com/openclaw) gateway for centralized, multi-device orchestration
* **Agentic backend** 🦞 -- delegate conversation history, session management, and tool execution to an [OpenClaw](https://github.com/openclaw) gateway for centralized, multi-device orchestration
* **Opus compression** -- 14-16x downstream compression (server to speaker) for better audio quality over WiFi
* **Streaming-ready architecture** -- designed for sentence-level TTS streaming and agentic tool-calling loops
* **Modular async pipeline** -- replaced the monolithic server with a pluggable architecture for ASR, LLM, and TTS backends etc.
@ -85,9 +85,19 @@ A zero-length Opus frame (`0x00 0x00`) signals end of speech.
The pipeline supports two conversation backends, selectable via `config.yaml`:
**Local** (`conversation.backend: "local"`): Manages conversation history locally with per-device JSON persistence. Sends the full message history on each LLM request. Works with any OpenAI-compatible endpoint.
**Conversational** (`conversation.backend: "conversational"`): Plain chat-completions backend. Manages conversation history client-side with per-device JSON persistence and sends the full message history on each LLM request. Works with any OpenAI-compatible endpoint (OpenRouter, Gemini, Ollama, mlx_lm.server, etc.). Good for simple voice chat with no tool use.
**OpenClaw Managed** (`conversation.backend: "managed"`): Delegates session management to an [OpenClaw](https://github.com/openclaw) gateway. Only sends the latest user message -- OpenClaw tracks history server-side using the device ID as the session key. Set `OPENCLAW_GATEWAY_TOKEN` in your environment and point `base_url` at your gateway.
**Agentic** (`conversation.backend: "agentic"`): Delegates session management and tool execution to a remote agent gateway like [OpenClaw](https://github.com/openclaw). Only sends the latest user message — the gateway tracks history server-side using the device ID as the session key, and runs its own tool loop (web search, file access, multi-step work). Set `OPENCLAW_GATEWAY_TOKEN` in your environment and point `base_url` at your gateway.
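For example, switching the pipeline to the agentic backend is a small `config.yaml` change (fragment abridged from `config.yaml.example`; adjust `base_url` to your gateway):

```yaml
conversation:
  backend: "agentic"            # or "conversational" for plain chat
  agentic:
    base_url: "http://127.0.0.1:18789/v1"   # OpenClaw gateway
    api_key: "${OPENCLAW_GATEWAY_TOKEN}"
    model: "openclaw/default"
    message_channel: "onju-voice"
```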
### Streaming and stalls
Agentic requests can take 5-60+ seconds while the gateway runs tools, so the pipeline is built to make that wait feel responsive.
**Sentence-level streaming.** The agent's SSE response is consumed delta by delta. A splitter (`pipeline/conversation/__init__.py:sentence_chunks`) buffers text until it hits a sentence boundary, then hands the sentence to TTS and pushes the audio to the device. Any narration the agent emits between tool calls gets spoken aloud as it arrives, while the next tool runs in the background. Intermediate sends use `mic_timeout=0` so the mic only reopens after the final chunk.
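A condensed sketch of that loop (the real path in `pipeline/main.py` also handles stalls, interrupts, and TTS failures; `synthesize_and_send` here is a stand-in for the TTS, Opus encode, and TCP send steps):

```python
from pipeline.conversation import sentence_chunks

async def speak_streaming_reply(backend, user_text, synthesize_and_send):
    """Flush each completed sentence as it forms; only the last one reopens the mic."""
    sentences = []
    async for sentence in sentence_chunks(backend.stream(user_text)):
        sentences.append(sentence)
        if len(sentences) > 1:
            # Everything before the newest sentence is safe to speak as a
            # non-final chunk (mic_timeout=0 keeps the device's mic closed).
            await synthesize_and_send(sentences[-2], is_final=False)
    if sentences:
        await synthesize_and_send(sentences[-1], is_final=True)
```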
**Contextual stall phrases.** Before the main agent call, the pipeline fires a fast classifier that decides whether the question needs a brief spoken acknowledgment. Conversational questions return `NONE` and get no stall. Tool-needing questions get a short personality-matched phrase that plays within about a second while the agent works. The stall text is then injected back into the agent's user message as a parenthetical continuity note so it doesn't repeat itself. Configure in `conversation.stall`.
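The knobs, abridged from `config.yaml.example` (the full classifier prompt lives there too):

```yaml
conversation:
  stall:
    enabled: true
    model: "gemini-2.5-flash"
    reasoning_effort: "none"   # disable thinking for sub-second latency
    timeout: 1.5               # seconds; skip the stall if the classifier is slower
```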
**The first-turn caveat with OpenClaw.** OpenClaw's OpenAI-compatible endpoint buffers all content from the first agent turn until the first round of tool execution completes. If the model generates an opening sentence and then calls a tool, that sentence is held server-side until the tool finishes. Narration between *subsequent* tool rounds streams fine. This is why the stall classifier exists: it gives the user a fast spoken acknowledgment that bypasses the gateway's first-turn buffering. See `pipeline/conversation/stall.py`.
### Setting up OpenClaw
@ -102,7 +112,7 @@ This will:
2. Append a voice mode prompt to `~/.openclaw/workspace/AGENTS.md` (tells the agent to respond in concise, speech-friendly prose when the message channel is `onju-voice`)
3. Restart the gateway
Then set `conversation.backend: "managed"` in `pipeline/config.yaml` and ensure `OPENCLAW_GATEWAY_TOKEN` is set in your environment.
Then set `conversation.backend: "agentic"` in `pipeline/config.yaml` and ensure `OPENCLAW_GATEWAY_TOKEN` is set in your environment.
## Installation
@ -197,9 +207,10 @@ See [`pipeline/config.yaml.example`](pipeline/config.yaml.example) for all optio
| Section | What it controls |
|---|---|
| `asr` | Speech-to-text service URL |
| `conversation.backend` | `"local"` or `"managed"` (OpenClaw) |
| `conversation.local` | LLM endpoint, model, system prompt, message history |
| `conversation.managed` | OpenClaw gateway URL, auth token, message channel |
| `conversation.backend` | `"conversational"` (plain chat) or `"agentic"` (OpenClaw, tools) |
| `conversation.conversational` | LLM endpoint, model, system prompt, message history |
| `conversation.agentic` | OpenClaw gateway URL, auth token, message channel |
| `conversation.stall` | Fast classifier that decides if the agentic backend needs a brief spoken stall |
| `tts` | TTS backend (`"elevenlabs"` or `"qwen3"`), voice settings |
| `vad` | Voice activity detection thresholds and timing |
| `network` | UDP/TCP/multicast ports |
@ -209,9 +220,10 @@ See [`pipeline/config.yaml.example`](pipeline/config.yaml.example) for all optio
| Variable | Used by |
|---|---|
| `OPENROUTER_API_KEY` | Local backend via OpenRouter (default) |
| `ANTHROPIC_API_KEY` | Local backend via Anthropic API directly |
| `OPENCLAW_GATEWAY_TOKEN` | Managed (OpenClaw) backend |
| `OPENROUTER_API_KEY` | Conversational backend via OpenRouter |
| `GEMINI_API_KEY` | Conversational backend via Gemini, and the stall classifier |
| `ANTHROPIC_API_KEY` | Conversational backend via Anthropic API directly |
| `OPENCLAW_GATEWAY_TOKEN` | Agentic (OpenClaw) backend |
## Testing

View file

@ -2,17 +2,72 @@ asr:
url: "http://localhost:8100" # parakeet-asr-server
conversation:
backend: "managed" # "managed" (e.g. OpenClaw) or "local" (conversational only)
backend: "agentic" # "agentic" (e.g. OpenClaw, with tools) or "conversational" (plain chat)
managed:
stall:
enabled: true # decide if a stall phrase is needed while the agent works
base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
api_key: "${GEMINI_API_KEY}"
model: "gemini-2.5-flash"
reasoning_effort: "none" # disable thinking for sub-second latency (Gemini 2.5 Flash only)
max_tokens: 200
timeout: 1.5 # seconds; skip stall if slower than this
prompt: |
You are the bridge voice for a voice assistant. Your only job is to
decide if the user's latest utterance needs a brief spoken placeholder
while the assistant runs a slow information lookup in the background.
{recent_context}
The user just said: {user_text}
Output NONE if any of these apply:
- The assistant can answer from its own knowledge or creativity
(the assistant is itself a capable language model — facts,
jokes, opinions, explanations, general knowledge).
- It's conversational, small talk, a partial thought, or a
request for the assistant to keep talking ("go on", "tell me
more"). A follow-up that asks for NEW information about a
different parameter ("what about Saturday?", "same but for
Tuesday") is still a real lookup and should stall.
- The user is asking the assistant to DO something rather than
FIND something — any action verb like remember, save, schedule,
book, send, add, mark, delete, create, update. You are the
bridge voice, not the agent executing anything. You cannot
honestly promise an action from here, so let the agent confirm
it itself. This rule applies even if carrying out the action
happens to involve some internal lookup.
Only generate a stall if the user is asking a question whose answer
genuinely requires a slow external fetch — live web search, current
data, file read, API call, real-time info. Retrieval is the only
reason to stall.
When a lookup is needed, generate a three-to-seven-word spoken stall
ending with a period. Warm, present, friend energy. A tiny reaction
plus a signal that you're about to go look it up. If the user named a
place, person, or thing worth acknowledging, use its name rather than
"that". No predictions about the answer, no promises of action, no
call-center filler.
Output ONLY the stall phrase OR the literal word NONE. No quotes, no
explanation, no preamble.
agentic:
base_url: "http://127.0.0.1:18789/v1" # OpenClaw gateway
api_key: "${OPENCLAW_GATEWAY_TOKEN}" # env var reference
model: "openclaw/default"
max_tokens: 300
message_channel: "onju-voice" # x-openclaw-message-channel header
# provider_model: "anthropic/claude-opus-4-6" # optional: override backend LLM
voice_prompt: >- # prepended to every user message as a reminder
[voice: this is spoken input transcribed from a microphone and your entire
response will be read aloud by TTS on a small speaker. Write only plain
spoken prose — no markdown, no lists, no structured reports, no code. If
your research produces detailed findings, save them to a file and just
give a brief spoken summary. Remember, keep it conversational.]
local:
conversational:
base_url: "https://openrouter.ai/api/v1" # OpenRouter, Ollama, mlx_lm.server, Gemini, etc.
api_key: "${OPENROUTER_API_KEY}" # set key or use ${ENV_VAR} reference
model: "anthropic/claude-haiku-4.5"
@ -20,7 +75,7 @@ conversation:
max_tokens: 300
system_prompt: "You are a helpful voice assistant. Keep responses concise (under 2 sentences)."
persist_dir: "data/conversations" # per-device message history (omit to disable)
# Local example (Ollama):
# Fully local example (Ollama):
# base_url: "http://localhost:11434/v1"
# api_key: "none"
# model: "gemma4:e4b"

View file

@ -1,16 +1,47 @@
import re
from typing import AsyncIterator
from pipeline.conversation.base import ConversationBackend
from pipeline.conversation.local import LocalConversation
from pipeline.conversation.managed import ManagedConversation
from pipeline.conversation.conversational import ConversationalBackend
from pipeline.conversation.agentic import AgenticBackend
# Primary: punctuation followed by whitespace (safe, standard).
_SENTENCE_END = re.compile(r"[.!?\n]+\s+")
# Fallback: punctuation with no space, but only when preceded by a lowercase
# letter and followed by uppercase. Catches OpenClaw's chunk-boundary joins
# ("now.The") without breaking abbreviations like "U.S." (uppercase before dot).
_SENTENCE_END_NOSPACE = re.compile(r"(?<=[a-z])[.!?]+(?=[A-Z])")
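# Illustrative behavior of the fallback pattern (made-up strings, not test fixtures):
#   "Found it now.The forecast says rain."  ->  "Found it now." / "The forecast says rain."
#   "per U.S.GAAP rules"                    ->  no split (uppercase "S" precedes the dot)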
async def sentence_chunks(deltas: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer text deltas and yield one sentence at a time, plus any trailing
    fragment when the stream ends."""
    buffer = ""
    async for delta in deltas:
        buffer += delta
        while True:
            m = _SENTENCE_END.search(buffer)
            if not m:
                m = _SENTENCE_END_NOSPACE.search(buffer)
            if not m:
                break
            sentence = buffer[: m.end()].strip()
            buffer = buffer[m.end():]
            if sentence:
                yield sentence
    tail = buffer.strip()
    if tail:
        yield tail
def create_backend(config: dict, device_id: str) -> ConversationBackend:
    """Create a conversation backend based on config."""
    conv_cfg = config["conversation"]
    backend = conv_cfg.get("backend", "local")
    backend = conv_cfg.get("backend", "conversational")
    if backend == "local":
        return LocalConversation(conv_cfg["local"], device_id)
    elif backend == "managed":
        return ManagedConversation(conv_cfg["managed"], device_id)
    if backend == "conversational":
        return ConversationalBackend(conv_cfg["conversational"], device_id)
    elif backend == "agentic":
        return AgenticBackend(conv_cfg["agentic"], device_id)
    else:
        raise ValueError(f"Unknown conversation backend: {backend}")

View file

@ -0,0 +1,78 @@
import logging
import os
import re
from typing import AsyncIterator
from openai import AsyncOpenAI
log = logging.getLogger(__name__)
def _resolve_env(value: str) -> str:
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
class AgenticBackend:
    """Delegates conversation memory and tool execution to a remote agent
    service (e.g. OpenClaw). Only sends the latest user message; the remote
    service tracks session history via the session key derived from the
    device ID, and runs its own tool loop server-side."""
    def __init__(self, cfg: dict, device_id: str):
        self.cfg = cfg
        self.device_id = device_id
        self.message_channel = cfg.get("message_channel", "onju-voice")
        self.client = AsyncOpenAI(
            base_url=cfg["base_url"],
            api_key=_resolve_env(cfg.get("api_key", "none")),
            default_headers={
                "x-openclaw-message-channel": self.message_channel,
            },
        )
    def _build_kwargs(self, user_text: str, extra_context: str | None = None) -> dict:
        voice_prompt = self.cfg.get("voice_prompt")
        parts = []
        if voice_prompt:
            parts.append(voice_prompt)
        if extra_context:
            parts.append(extra_context)
        parts.append(user_text)
        content = "\n\n".join(parts)
        kwargs = dict(
            model=self.cfg.get("model", "openclaw/default"),
            messages=[{"role": "user", "content": content}],
            max_tokens=self.cfg.get("max_tokens", 300),
            user=self.device_id,
        )
        if self.cfg.get("provider_model"):
            kwargs["extra_headers"] = {"x-openclaw-model": self.cfg["provider_model"]}
        return kwargs
    async def send(self, user_text: str, extra_context: str | None = None) -> str:
        response = await self.client.chat.completions.create(
            **self._build_kwargs(user_text, extra_context)
        )
        text = response.choices[0].message.content or ""
        log.debug(f"[{self.device_id}] managed LLM: {text}")
        return text
    async def stream(self, user_text: str, extra_context: str | None = None) -> AsyncIterator[str]:
        kwargs = self._build_kwargs(user_text, extra_context)
        kwargs["stream"] = True
        stream = await self.client.chat.completions.create(**kwargs)
        async for chunk in stream:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content or ""
            if delta:
                yield delta
    def reset(self) -> None:
        pass # session reset would require an API call if supported
    def get_messages(self) -> list[dict]:
        return [] # history lives on the remote service
    def set_messages(self, messages: list[dict]) -> None:
        pass # no-op — remote service owns the history
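# Shape of each outgoing request (values illustrative; the OpenAI client posts
# to {base_url}/chat/completions):
#   header   x-openclaw-message-channel: onju-voice
#   body     model="openclaw/default", user="<device_id>",
#            messages=[{"role": "user", "content": "<voice_prompt>\n\n<continuity note>\n\n<transcript>"}]
#   optional header x-openclaw-model: <provider_model> when configured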

View file

@ -1,10 +1,14 @@
from typing import Protocol, runtime_checkable
from typing import AsyncIterator, Protocol, runtime_checkable
@runtime_checkable
class ConversationBackend(Protocol):
    async def send(self, user_text: str) -> str:
        """Send a user message, return the assistant's response."""
    async def send(self, user_text: str, extra_context: str | None = None) -> str:
        """Send a user message, return the full assistant response."""
        ...
    def stream(self, user_text: str, extra_context: str | None = None) -> AsyncIterator[str]:
        """Send a user message, yield assistant text deltas as they arrive."""
        ...
    def reset(self) -> None:
View file

@ -2,6 +2,7 @@ import json
import logging
import os
import re
from typing import AsyncIterator
from openai import AsyncOpenAI
@ -12,8 +13,10 @@ def _resolve_env(value: str) -> str:
return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
class LocalConversation:
"""Manages conversation history locally and sends full context to any OpenAI-compatible endpoint."""
class ConversationalBackend:
"""Simple chat-completions backend: manages conversation history on the
client and sends the full context to any OpenAI-compatible endpoint.
Good for plain LLM chat with no tool use."""
def __init__(self, cfg: dict, device_id: str):
self.cfg = cfg
@ -34,29 +37,56 @@ class LocalConversation:
self.messages: list[dict] = self._load() or [{"role": "system", "content": cfg["system_prompt"]}]
async def send(self, user_text: str) -> str:
self._sanitize()
self.messages.append({"role": "user", "content": user_text})
def _build_kwargs(self) -> dict:
kwargs = dict(
model=self.cfg["model"],
messages=self.messages,
max_tokens=self.cfg.get("max_tokens", 300),
)
if self.cfg.get("thinking_budget") is not None:
kwargs["extra_body"] = {
"google": {"thinking_config": {"thinking_budget": self.cfg["thinking_budget"]}}
}
# Gemini 2.5 via OpenAI-compat: disable thinking with reasoning_effort.
# https://ai.google.dev/gemini-api/docs/openai
if self.cfg.get("reasoning_effort"):
kwargs["reasoning_effort"] = self.cfg["reasoning_effort"]
return kwargs
response = await self.client.chat.completions.create(**kwargs)
text = response.choices[0].message.content or ""
def _finalize(self, text: str) -> None:
self.messages.append({"role": "assistant", "content": text})
self._prune()
self.save()
def _wrap_user(self, user_text: str, extra_context: str | None) -> str:
return f"{extra_context}\n\n{user_text}" if extra_context else user_text
async def send(self, user_text: str, extra_context: str | None = None) -> str:
self._sanitize()
self.messages.append({"role": "user", "content": self._wrap_user(user_text, extra_context)})
response = await self.client.chat.completions.create(**self._build_kwargs())
text = response.choices[0].message.content or ""
self._finalize(text)
log.debug(f"[{self.device_id}] LLM: {text}")
return text
async def stream(self, user_text: str, extra_context: str | None = None) -> AsyncIterator[str]:
self._sanitize()
self.messages.append({"role": "user", "content": self._wrap_user(user_text, extra_context)})
kwargs = self._build_kwargs()
kwargs["stream"] = True
stream = await self.client.chat.completions.create(**kwargs)
parts: list[str] = []
try:
async for chunk in stream:
if not chunk.choices:
continue
delta = chunk.choices[0].delta.content or ""
if delta:
parts.append(delta)
yield delta
finally:
self._finalize("".join(parts))
def reset(self) -> None:
self.messages = [{"role": "system", "content": self.cfg["system_prompt"]}]
self.save()

View file

@ -1,59 +0,0 @@
import logging
import os
import re
from openai import AsyncOpenAI
log = logging.getLogger(__name__)
def _resolve_env(value: str) -> str:
return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
class ManagedConversation:
"""Delegates conversation memory to a remote service (OpenClaw, etc).
Only sends the latest user message; the remote service tracks session history
via the session key derived from the device ID.
"""
def __init__(self, cfg: dict, device_id: str):
self.cfg = cfg
self.device_id = device_id
self.message_channel = cfg.get("message_channel", "onju-voice")
self.client = AsyncOpenAI(
base_url=cfg["base_url"],
api_key=_resolve_env(cfg.get("api_key", "none")),
default_headers={
"x-openclaw-message-channel": self.message_channel,
},
)
async def send(self, user_text: str) -> str:
kwargs = dict(
model=self.cfg.get("model", "openclaw/default"),
messages=[{"role": "user", "content": user_text}],
max_tokens=self.cfg.get("max_tokens", 300),
user=self.device_id,
)
extra_headers = {}
if self.cfg.get("provider_model"):
extra_headers["x-openclaw-model"] = self.cfg["provider_model"]
if extra_headers:
kwargs["extra_headers"] = extra_headers
response = await self.client.chat.completions.create(**kwargs)
text = response.choices[0].message.content or ""
log.debug(f"[{self.device_id}] managed LLM: {text}")
return text
def reset(self) -> None:
pass # session reset would require an API call if supported
def get_messages(self) -> list[dict]:
return [] # history lives on the remote service
def set_messages(self, messages: list[dict]) -> None:
pass # no-op — remote service owns the history

View file

@ -0,0 +1,100 @@
import asyncio
import logging
import os
import re
from openai import AsyncOpenAI
log = logging.getLogger(__name__)
def _resolve_env(value: str) -> str:
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)
_client: AsyncOpenAI | None = None
_client_key: tuple | None = None
def _get_client(cfg: dict) -> AsyncOpenAI:
    """Lazy singleton so we reuse the HTTP connection pool across turns."""
    global _client, _client_key
    key = (cfg["base_url"], cfg.get("api_key", ""))
    if _client is None or _client_key != key:
        _client = AsyncOpenAI(
            base_url=cfg["base_url"],
            api_key=_resolve_env(cfg.get("api_key", "none")),
        )
        _client_key = key
    return _client
async def decide_stall(
    user_text: str,
    config: dict,
    prev_user: str | None = None,
    prev_assistant: str | None = None,
) -> str | None:
    """Ask a fast classifier model whether the voice assistant should say a
    brief stall phrase before answering. Returns the stall text to speak, or
    None if the query is conversational and needs no stall (or the classifier
    failed/timed out).
    Only runs in agentic mode; conversational backends already respond
    quickly and don't benefit from a stall.
    Pass the previous user/assistant exchange as context so the classifier
    can recognize follow-ups, continuation cues ("go on"), and mid-conversation
    prefaces ("one more thing") as conversational rather than tool-needing."""
    conv_cfg = config.get("conversation", {})
    if conv_cfg.get("backend") != "agentic":
        return None
    cfg = conv_cfg.get("stall")
    if not cfg or not cfg.get("enabled", False):
        return None
    timeout = cfg.get("timeout", 1.5)
    if prev_user or prev_assistant:
        context_block = "Previous turn in this conversation:\n"
        if prev_user:
            context_block += f" User said: {prev_user}\n"
        if prev_assistant:
            context_block += f" You replied: {prev_assistant}\n"
    else:
        context_block = "(No previous turn — this is the start of the conversation.)"
    prompt = (
        cfg["prompt"]
        .replace("{user_text}", user_text)
        .replace("{recent_context}", context_block)
    )
    kwargs = dict(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=cfg.get("max_tokens", 200),
    )
    # Gemini 2.5 models think by default — disable it for the stall call since
    # we need sub-second latency and the classification is trivial.
    # https://ai.google.dev/gemini-api/docs/openai
    if cfg.get("reasoning_effort"):
        kwargs["reasoning_effort"] = cfg["reasoning_effort"]
    try:
        response = await asyncio.wait_for(
            _get_client(cfg).chat.completions.create(**kwargs),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        log.info(f"STALL timeout after {timeout}s — skipping")
        return None
    except Exception as e:
        log.warning(f"STALL classifier failed: {e}")
        return None
    text = (response.choices[0].message.content or "").strip()
    if not text or text.upper().startswith("NONE"):
        return None
    return text
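# Illustrative outcomes (made-up phrasings, not fixtures):
#   "what's the weather in Kyoto tomorrow?"  ->  "Let me check Kyoto's forecast."
#   "tell me a joke about llamas"            ->  None (answerable from model knowledge)
#   "remind me to call Sam at five"          ->  None (action request; the agent confirms it itself)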

View file

@ -24,6 +24,7 @@ class Device:
        self.voice = el_cfg.get("default_voice", "Emma")
        self.ptt = ptt
        self.vad = None if ptt else VAD(config)
        self.last_user_text: str | None = None
        self.last_response: str | None = None
        self.led_power = 0
        self.led_update_time = 0.0

View file

@ -13,7 +13,8 @@ import numpy as np
import yaml
from pipeline.audio import decode_ulaw, opus_encode, opus_frames_to_tcp_payload, pcm_to_wav
from pipeline.conversation import create_backend
from pipeline.conversation import create_backend, sentence_chunks
from pipeline.conversation import stall as stall_mod
from pipeline.device import Device, DeviceManager
from pipeline.protocol import send_audio, send_led_blink, open_led_connection, write_led_blink, close_led_connection
from pipeline.services import asr, tts
@ -94,9 +95,13 @@ async def udp_listener(config: dict, manager: DeviceManager, utterance_queue: as
pcm = decode_ulaw(data)
if device.ptt:
# PTT: just buffer, no VAD needed
if device.processing:
continue
# PTT frames arriving while we're still responding means the user
# pressed the button to interrupt. The button press is the intent
# signal; keep buffering so these frames become the next utterance
# once the current turn bails out.
if device.processing and not device.interrupted.is_set():
log.info(f"PTT interrupt from {device.hostname}")
device.interrupted.set()
device.ptt_buffer.append(pcm)
else:
# VOX: run VAD
@ -228,41 +233,163 @@ async def process_utterances(config: dict, manager: DeviceManager, utterance_que
log.info(f"Interrupted before LLM")
continue
# Conversation
# Streaming LLM → sentence-buffered TTS → Opus → TCP.
# Intermediate sends use mic_timeout=0 so the mic only reopens after
# the final chunk has played out.
sample_rate = config["audio"]["sample_rate"]
opus_frame_size = config["audio"]["opus_frame_size"]
turn_t0 = time.monotonic()
full_response: list[str] = []
pending: str | None = None # sentence waiting to be flushed
sent_partial = False # any non-final chunk already sent?
first_sentence_at: float | None = None
stream_start_at: float | None = None
async def send_sentence(sentence: str, is_final: bool) -> bool:
"""Synthesize, encode, and push one sentence. Returns True on
success, False if interrupted or TTS failed."""
nonlocal sent_partial
if device.interrupted.is_set():
log.info(f"Interrupted before TTS")
return False
try:
pcm = await tts.synthesize(sentence, device.voice, config)
except Exception as e:
log.error(f"TTS failed: {e}")
return False
if device.interrupted.is_set():
log.info(f"Interrupted before send")
return False
frames = opus_encode(pcm, sample_rate, opus_frame_size)
payload = opus_frames_to_tcp_payload(frames)
mic_timeout = dev_cfg["default_mic_timeout"] if is_final else 0
log.info(f"SEND [+{time.monotonic() - turn_t0:.2f}s] "
f"{len(frames)} opus frames to {device.ip} "
f"({'final' if is_final else 'partial'}: {sentence!r})")
await send_audio(device.ip, tcp_port, payload,
mic_timeout=mic_timeout,
volume=dev_cfg["default_volume"],
fade=dev_cfg["led_fade"])
if not is_final:
sent_partial = True
return True
async def reopen_mic_if_needed():
"""If we already sent partial audio with mic_timeout=0, the mic
is closed push an empty audio to reopen it on recovery."""
if sent_partial and not device.ptt:
try:
await send_audio(device.ip, tcp_port, b"",
mic_timeout=dev_cfg["default_mic_timeout"],
volume=0, fade=0)
except Exception:
pass
# Stall decision (agentic mode only; blocking, capped by config timeout).
# Passes the previous exchange so the classifier can recognize
# continuations ("go on") and prefaces ("one more thing") as
# conversational rather than tool-needing.
stall_text: str | None = None
if config["conversation"].get("backend") == "agentic":
stall_text = await stall_mod.decide_stall(
text,
config,
prev_user=device.last_user_text,
prev_assistant=device.last_response,
)
stall_decided_at = time.monotonic() - turn_t0
if stall_text:
log.info(f"STALL [+{stall_decided_at:.2f}s] decided: {stall_text!r}")
else:
log.info(f"STALL [+{stall_decided_at:.2f}s] NONE")
device.last_user_text = text
# Fire stall TTS+send in parallel with OpenClaw warming up.
stall_task: asyncio.Task | None = None
extra_context: str | None = None
if stall_text:
stall_task = asyncio.create_task(send_sentence(stall_text, is_final=False))
extra_context = (
f"(You already said aloud to the user: \"{stall_text}\""
f"don't repeat this phrase, continue naturally with the answer.)"
)
aborted = False
try:
response_text = await device.conversation.send(text)
stream_start_at = time.monotonic()
async for sentence in sentence_chunks(
device.conversation.stream(text, extra_context=extra_context)
):
full_response.append(sentence)
if first_sentence_at is None:
first_sentence_at = time.monotonic()
ttfs_turn = first_sentence_at - turn_t0
ttfs_stream = first_sentence_at - stream_start_at
log.info(f"LLM first sentence [+{ttfs_turn:.2f}s turn / "
f"{ttfs_stream:.2f}s stream]: {sentence}")
else:
log.debug(f"LLM sentence: {sentence}")
# Make sure the stall audio has finished sending before we
# start pushing OpenClaw content to the device.
if stall_task is not None and not stall_task.done():
await stall_task
stall_task = None
# Flush the *previous* sentence as non-final; whichever
# sentence is last when the stream ends becomes the final.
if pending is not None:
if not await send_sentence(pending, is_final=False):
aborted = True
break
pending = sentence
except Exception as e:
log.error(f"LLM failed: {e}")
if stall_task is not None:
try:
await stall_task
except Exception:
pass
await reopen_mic_if_needed()
continue
# Drain the stall task if it's still pending (e.g. LLM stream
# returned zero content).
if stall_task is not None:
try:
await stall_task
except Exception:
pass
stall_task = None
if aborted:
await reopen_mic_if_needed()
continue
if pending is not None:
if not await send_sentence(pending, is_final=True):
await reopen_mic_if_needed()
continue
elif sent_partial:
# The stall played but OpenClaw returned nothing — reopen mic.
if not device.ptt:
await send_audio(device.ip, tcp_port, b"",
mic_timeout=dev_cfg["default_mic_timeout"],
volume=0, fade=0)
elif not device.ptt:
# No content from the LLM and no stall — still reopen the mic.
await send_audio(device.ip, tcp_port, b"",
mic_timeout=dev_cfg["default_mic_timeout"],
volume=0, fade=0)
response_text = " ".join(full_response)
device.last_response = response_text
log.info(f"LLM {response_text}")
# Check for interrupt before TTS
if device.interrupted.is_set():
log.info(f"Interrupted before TTS")
continue
# TTS
try:
pcm_response = await tts.synthesize(response_text, device.voice, config)
log.info(f"TTS {len(pcm_response)} bytes ({len(pcm_response)/32000:.1f}s)")
except Exception as e:
log.error(f"TTS failed: {e}")
continue
# Check for interrupt before sending audio
if device.interrupted.is_set():
log.info(f"Interrupted before send")
continue
# Opus encode and send
frames = opus_encode(pcm_response, config["audio"]["sample_rate"], config["audio"]["opus_frame_size"])
payload = opus_frames_to_tcp_payload(frames)
log.info(f"SEND {len(frames)} opus frames to {device.ip}")
await send_audio(device.ip, tcp_port, payload,
mic_timeout=dev_cfg["default_mic_timeout"],
volume=dev_cfg["default_volume"],
fade=dev_cfg["led_fade"])
elapsed = time.monotonic() - turn_t0
ttfs = f"{first_sentence_at - turn_t0:.2f}s" if first_sentence_at else ""
log.info(f"LLM [{ttfs} first / {elapsed:.2f}s total / "
f"{len(full_response)} sentences / {len(response_text)} chars] "
f"{response_text}")
except Exception as e:
log.error(f"Pipeline error ({device.hostname}): {e}")
@ -372,6 +499,41 @@ def _http_respond(writer: asyncio.StreamWriter, status: int, body: str):
writer.write(f"HTTP/1.1 {status} {reason}\r\nContent-Type: application/json\r\nContent-Length: {len(body)}\r\nConnection: close\r\n\r\n{body}".encode())
def _log_startup_summary(config: dict) -> None:
"""Log the active endpoints and models so it's obvious at a glance how the
pipeline is configured for this run."""
conv_cfg = config["conversation"]
backend_name = conv_cfg.get("backend", "conversational")
backend_cfg = conv_cfg.get(backend_name, {})
log.info("Pipeline server starting")
log.info(f" ASR {config['asr']['url']}")
if backend_name == "agentic":
model = backend_cfg.get("provider_model") or backend_cfg.get("model", "?")
log.info(f" LLM agentic: {model} @ {backend_cfg.get('base_url', '?')} "
f"(channel={backend_cfg.get('message_channel', '?')})")
stall_cfg = conv_cfg.get("stall", {})
if stall_cfg.get("enabled"):
log.info(f" STALL {stall_cfg.get('model', '?')} @ {stall_cfg.get('base_url', '?')} "
f"(timeout={stall_cfg.get('timeout', 1.5)}s)")
else:
log.info(" STALL disabled")
else:
log.info(f" LLM conversational: {backend_cfg.get('model', '?')} "
f"@ {backend_cfg.get('base_url', '?')}")
tts_cfg = config["tts"]
tts_backend = tts_cfg.get("backend", "?")
if tts_backend == "elevenlabs":
el = tts_cfg.get("elevenlabs", {})
vox = el.get("default_voice", "?")
ptt = el.get("default_voice_ptt", vox)
log.info(f" TTS elevenlabs: VOX={vox} PTT={ptt}")
else:
log.info(f" TTS {tts_backend}")
async def warmup(config: dict):
"""Validate conversation backend and TTS are reachable."""
log.info("Warming up conversation backend and TTS...")
@ -446,7 +608,7 @@ async def main(config_path: str = None, do_warmup: bool = False, devices: list[s
utterance_queue = asyncio.Queue()
log.info("Pipeline server starting")
_log_startup_summary(config)
await asyncio.gather(
udp_listener(config, manager, utterance_queue),
multicast_listener(config, manager),

22
run.sh Executable file
View file

@ -0,0 +1,22 @@
#!/bin/bash
# Run the onju-voice pipeline server.
#
# Usage:
# ./run.sh # default config
# ./run.sh --warmup # warmup LLM+TTS on startup
# ./run.sh --device onju=10.0.0.5 # pre-register a device
cd "$(dirname "$0")"
# opuslib uses ctypes.util.find_library('opus'), which on macOS does not search
# Homebrew prefixes. Point the dynamic loader at the brew opus lib if present.
if [ "$(uname)" = "Darwin" ] && command -v brew >/dev/null 2>&1; then
if opus_prefix="$(brew --prefix opus 2>/dev/null)" && [ -d "$opus_prefix/lib" ]; then
export DYLD_FALLBACK_LIBRARY_PATH="$opus_prefix/lib:${DYLD_FALLBACK_LIBRARY_PATH:-/usr/local/lib:/usr/lib}"
else
echo "Warning: 'opus' not installed via Homebrew. Run: brew install opus"
fi
fi
echo "Starting onju-voice pipeline..."
uv run python -m pipeline.main "$@"

View file

@ -9,10 +9,12 @@ VOICE_SECTION='# Voice mode
When the message channel is `onju-voice`, your response will be spoken aloud by TTS on a small speaker. The user'\''s input is transcribed, so expect errors and infer meaning generously.
- Format: no emojis, markdown, URLs, file paths, tables, or structured data of any kind. Everything gets pronounced literally or breaks the cadence of speech. Respond in plain prose only.
- Length: one to two sentences. If a topic needs depth, give the headline and offer to elaborate — "Short answer is yes, want me to walk through why?" Never dump information. Voice is conversation, not briefing.
- Format: your output goes directly to a text-to-speech engine and out a speaker. No markdown, no backticks, no asterisks, no bullet points, no numbered lists, no emojis, no URLs. Never mention file names, folder paths, code snippets, or config names — the listener cannot see them and they sound terrible read aloud. If you need to refer to something technical, describe it in plain words: "I updated your search settings" not "I edited openclaw dot json." For social media handles, say the name naturally — "Jason Beale on X" not "at jabeale." Everything you write gets pronounced exactly as-is, so write only clean spoken prose.
- Length: one to two sentences per spoken chunk. If a topic needs depth, give the headline and offer to elaborate — "Short answer is yes, want me to walk through why?" Never dump information. Voice is conversation, not briefing.
- Long outputs: if your research or work produces detailed results — full reports, lists of findings, code, configs — save them to a file so the user can review later at a screen. Then give a brief spoken summary of what you found and mention that the details are saved. Never read out a long report, a list of bullet points, or structured data over voice.
- Cadence: speak the way a thoughtful friend speaks out loud. Warm, direct, unhurried. Use contractions. Say numbers the human way ("about three thousand", not "3,247"). Spell out abbreviations that don'\''t have a natural spoken form. Skip jargon when plain language exists.
- Tools: you have full tool access, but the user only hears your final reply — they don'\''t see tool output, intermediate steps, or your reasoning. Translate structured results into prose, picking the one or two facts that actually matter.
- Tools: you have full tool access, but the user only hears your final reply — they don'\''t see tool output, intermediate steps, or your reasoning. You can narrate what you'\''re doing as you go — short, casual check-ins like "Got it, checking the docs now" or "Okay, writing that up" are great between tool calls. Just keep each update to one short sentence and never read out code, file names, or technical details. Translate structured results into prose, picking the one or two facts that actually matter.
- Don'\''t open with a stall phrase: skip "on it" / "give me a sec" / "alright, pulling that up" style openers. The pipeline handles brief acknowledgments for slow requests separately, so when you speak, start with the actual answer or status update.
- Character: you are the same assistant the user talks to everywhere else. Same memory, same personality, same relationship. Voice changes only the form of your response, never the substance.'
echo "==> Enabling chat completions endpoint on OpenClaw gateway..."
@ -20,7 +22,19 @@ openclaw config set gateway.http.endpoints.chatCompletions.enabled true
if [ -f "$AGENTS_MD" ]; then
if grep -qF "$VOICE_MARKER" "$AGENTS_MD"; then
echo "==> Voice mode section already present in AGENTS.md, skipping."
echo "==> Replacing existing voice mode section in AGENTS.md..."
# Remove everything from "# Voice mode" to the next H1 or end-of-file,
# then append the updated section.
python3 -c "
import re, sys
text = open(sys.argv[1]).read()
# Strip old voice section: from '# Voice mode' to next ^# heading or EOF
text = re.sub(r'(?m)^# Voice mode\n.*?(?=^# |\Z)', '', text, flags=re.DOTALL).rstrip()
open(sys.argv[1], 'w').write(text + '\n')
" "$AGENTS_MD"
echo "" >> "$AGENTS_MD"
echo "$VOICE_SECTION" >> "$AGENTS_MD"
echo "==> Voice mode section replaced."
else
echo "" >> "$AGENTS_MD"
echo "$VOICE_SECTION" >> "$AGENTS_MD"
@ -35,4 +49,4 @@ fi
echo "==> Restarting OpenClaw gateway..."
openclaw gateway restart
echo "==> Done. Set conversation.backend: \"managed\" in pipeline/config.yaml to use OpenClaw."
echo "==> Done. Set conversation.backend: \"agentic\" in pipeline/config.yaml to use OpenClaw."

2376
uv.lock Normal file

File diff suppressed because it is too large