mirror of
https://github.com/justLV/onju-v2
synced 2026-04-21 15:47:55 +00:00
Rework the stall prompt to distinguish LOOKUP (say something specific, three-to-seven words) from ACTION (content-free backchannel, two-to-five words, no action verbs or promises) and restructure test_stall.py to group cases by expected label for easier manual review.
149 lines
6.3 KiB
Text
asr:
  url: "http://localhost:8100" # parakeet-asr-server

conversation:
  backend: "agentic" # "agentic" (e.g. OpenClaw, with tools) or "conversational" (plain chat)

  stall:
    enabled: true # decide if a stall phrase is needed while the agent works
    base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
    api_key: "${GEMINI_API_KEY}"
    model: "gemini-2.5-flash"
    reasoning_effort: "none" # disable thinking for sub-second latency (Gemini 2.5 Flash only)
    max_tokens: 200
    timeout: 1.5 # seconds; skip stall if slower than this
    prompt: |
      You are the bridge voice for a voice assistant — a short, natural
      utterance you speak immediately while the real assistant starts
      working. Your job is to decide whether the user's latest
      utterance needs one, and if so, to say it.

      {recent_context}

      The user just said: {user_text}

      If the assistant can answer entirely from its own knowledge or
      creativity — facts, opinions, jokes, explanations, general
      knowledge, small talk, a partial thought, or a request to keep
      talking — output the literal word NONE. The assistant is itself
      a capable language model and doesn't need bridge audio for
      anything it can just answer. Note: a follow-up that changes a
      parameter in a previous lookup is a fresh lookup, not small talk.

      Otherwise the assistant is about to do slow agentic work — a
      live lookup, a file or API call, or an action like scheduling,
      saving, sending, remembering, or updating something — and you
      should speak a brief, warm bridge phrase while that runs. Two
      situations:

      Asking FOR information. React naturally and signal you're going
      to go look. Roughly three to seven words, friend energy, specific
      to what the user actually mentioned — use the name of the place,
      person, or thing instead of vague filler. Never predict the
      answer.

      Asking you to DO something. You are ONLY the bridge voice —
      you have no authority to commit to the action, and the real
      agent will confirm it itself once it's done.

      Your job: speak a short listener-sound that tells the user
      "I heard you" without actually responding to the substance of
      their request. Two to five words, warm and natural, like the
      reaction a friend gives mid-conversation to show they're
      following. It should feel like a backchannel, not a reply.

      Content test you must pass: if a third party read ONLY your
      phrase, without the user's message, they should be unable to
      guess what the user asked for. That means:
      - No verb form of the action — no "adding", "saving",
        "scheduling", "sending", "marking", "reminding", "noting",
        "creating", "updating", "setting up", "putting", etc.
      - No naming of the thing being acted on.
      - No "I'll", "I will", "let me", "I'm going to", "on it",
        "will do", "right away".

      The common failure mode is helpfully narrating the action
      ("Okay, adding that…", "Sure, I'll remember that…") — that
      is exactly what NOT to do, because you cannot honestly make
      that promise. Stay content-free.

      Write fresh each time — don't reach for stock phrases. Match the
      user's register: relaxed if they were relaxed, brisk if they
      were brisk. Keep it under seven words either way. End with
      normal spoken punctuation.

      Output ONLY the spoken phrase, or the literal word NONE. No
      quotes, no explanation, no preamble.
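    # Illustrative behavior sketch (hypothetical inputs, not taken from
    # test_stall.py):
    #   "what's the capital of France?" -> NONE (answerable from model knowledge)
    #   "how's the weather in Tokyo?"   -> lookup bridge, e.g. "Ooh, checking Tokyo for you."
    #   "remind me to call mom at five" -> content-free backchannel, e.g. "Mm-hm, yeah."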

  agentic:
    base_url: "http://127.0.0.1:18789/v1" # OpenClaw gateway
    api_key: "${OPENCLAW_GATEWAY_TOKEN}" # env var reference
    model: "openclaw/default"
    max_tokens: 300
    message_channel: "onju-voice" # x-openclaw-message-channel header
    # provider_model: "anthropic/claude-opus-4-6" # optional: override backend LLM
    voice_prompt: >- # prepended to every user message as a reminder
      [voice: this is spoken input transcribed from a microphone and your entire
      response will be read aloud by TTS on a small speaker. Write only plain
      spoken prose — no markdown, no lists, no structured reports, no code. If
      your research produces detailed findings, save them to a file and just
      give a brief spoken summary. Remember, keep it conversational.]

  conversational:
    base_url: "https://openrouter.ai/api/v1" # OpenRouter, Ollama, mlx_lm.server, Gemini, etc.
    api_key: "${OPENROUTER_API_KEY}" # set key or use ${ENV_VAR} reference
    model: "anthropic/claude-haiku-4.5"
    max_messages: 20
    max_tokens: 300
    system_prompt: "You are a helpful voice assistant. Keep responses concise (under 2 sentences)."
    persist_dir: "data/conversations" # per-device message history (omit to disable)
    # Fully local example (Ollama):
    # base_url: "http://localhost:11434/v1"
    # api_key: "none"
    # model: "gemma4:e4b"
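    # Cloud example: Gemini via its OpenAI-compatible endpoint (same base_url
    # and key as the stall block above; the model name is one plausible choice):
    # base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
    # api_key: "${GEMINI_API_KEY}"
    # model: "gemini-2.5-flash"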

tts:
  backend: "elevenlabs" # "local" or "elevenlabs" (cloud)
  local:
    url: "http://localhost:8880"
    model: "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-4bit"
    ref_audio: ""
    ref_text: ""
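    # Voice-cloning sketch (hypothetical paths: this assumes ref_audio takes a
    # short WAV sample of the target voice and ref_text its exact transcript):
    # ref_audio: "data/my_voice_sample.wav"
    # ref_text: "Transcript of the sample, read verbatim."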
  elevenlabs:
    api_key: "" # your ElevenLabs API key
    default_voice: "Archer"
    default_voice_ptt: "Emma" # PTT devices (smaller speaker)
    voices:
      Archer: "Fahco4VZzobUeiPqni1S" # British conversational male
      Emma: "56bWURjYFHyYyVf490Dp" # female, better on small speakers
      Rachel: "21m00Tcm4TlvDq8ikWAM" # add your voice IDs here

vad:
  threshold: 0.5 # speech onset probability
  neg_threshold: 0.35 # speech offset probability (hysteresis)
  silence_time: 1.5 # seconds of trailing silence before the utterance ends
  pre_buffer_s: 1.0 # seconds of audio kept from before speech onset
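  # Hysteresis sketch with the values above: a frame opens speech once its
  # probability reaches 0.5, and speech only closes after probability falls
  # below 0.35, so brief mid-utterance dips between 0.35 and 0.5 don't
  # split the segment.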

network:
  udp_port: 3000
  tcp_port: 3001
  multicast_group: "239.0.0.1"
  multicast_port: 12345
  control_port: 3002

audio:
  sample_rate: 16000
  chunk_size: 512 # 32ms at 16kHz (matches ESP32 SAMPLE_CHUNK_SIZE)
  opus_frame_size: 320 # 20ms at 16kHz

device:
  default_volume: 15
  default_mic_timeout: 60
  led_fade: 2
  led_power: 50
  led_update_period: 0.25
  greeting: false
  greeting_wav: "data/hello_imhere.wav"

logging:
  level: "INFO"