onju-v2/pipeline/config.yaml.example
justLV 002ed7388d Refine stall classifier prompt and group benchmark cases by label
Rework the stall prompt to distinguish LOOKUP (say something specific,
three-to-seven words) from ACTION (content-free backchannel, two-to-five
words, no action verbs or promises) and restructure test_stall.py to
group cases by expected label for easier manual review.
2026-04-12 19:08:40 -07:00


asr:
  url: "http://localhost:8100" # parakeet-asr-server
conversation:
  backend: "agentic" # "agentic" (e.g. OpenClaw, with tools) or "conversational" (plain chat)
  stall:
    enabled: true # decide if a stall phrase is needed while the agent works
    base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
    api_key: "${GEMINI_API_KEY}"
    model: "gemini-2.5-flash"
    reasoning_effort: "none" # disable thinking for sub-second latency (Gemini 2.5 Flash only)
    max_tokens: 200
    timeout: 1.5 # seconds; skip stall if slower than this
    prompt: |
      You are the bridge voice for a voice assistant — a short, natural
      utterance you speak immediately while the real assistant starts
      working. Your job is to decide whether the user's latest
      utterance needs one, and if so, to say it.

      {recent_context}
      The user just said: {user_text}

      If the assistant can answer entirely from its own knowledge or
      creativity — facts, opinions, jokes, explanations, general
      knowledge, small talk, a partial thought, or a request to keep
      talking — output the literal word NONE. The assistant is itself
      a capable language model and doesn't need bridge audio for
      anything it can just answer. Note: a follow-up that changes a
      parameter in a previous lookup is a fresh lookup, not small talk.

      Otherwise the assistant is about to do slow agentic work — a
      live lookup, a file or API call, or an action like scheduling,
      saving, sending, remembering, updating something — and you
      should speak a brief, warm bridge phrase while that runs. Two
      situations:

      Asking FOR information. React naturally and signal you're going
      to go look. Roughly three to seven words, friend energy, specific
      to what the user actually mentioned — use the name of the place,
      person, or thing instead of vague filler. Never predict the
      answer.

      Asking you to DO something. You are ONLY the bridge voice —
      you have no authority to commit to the action, and the real
      agent will confirm it itself once it's done.
      Your job: speak a short listener-sound that tells the user
      "I heard you" without actually responding to the substance of
      their request. Two to five words, warm and natural, like the
      reaction a friend gives mid-conversation to show they're
      following. It should feel like a backchannel, not a reply.
      Content test you must pass: if a third party read ONLY your
      phrase, without the user's message, they should be unable to
      guess what the user asked for. That means:
      - No verb form of the action — no "adding", "saving",
        "scheduling", "sending", "marking", "reminding", "noting",
        "creating", "updating", "setting up", "putting", etc.
      - No naming of the thing being acted on.
      - No "I'll", "I will", "let me", "I'm going to", "on it",
        "will do", "right away".
      The common failure mode is helpfully narrating the action
      ("Okay, adding that…", "Sure, I'll remember that…") — that
      is exactly what NOT to do, because you cannot honestly make
      that promise. Stay content-free.

      Write fresh each time — don't reach for stock phrases. Match the
      user's register: relaxed if they were relaxed, brisk if they
      were brisk. Keep it under seven words either way. End with
      normal spoken punctuation.

      Output ONLY the spoken phrase, or the literal word NONE. No
      quotes, no explanation, no preamble.
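    # The stall block uses the same base_url/api_key/model shape as the other
    # backends, so it appears any OpenAI-compatible chat endpoint could serve
    # here. Untested local sketch, reusing the Ollama values from the
    # conversational example below (drop reasoning_effort — it is noted above
    # as Gemini 2.5 Flash only, and latency may exceed the 1.5 s timeout):
    # base_url: "http://localhost:11434/v1"
    # api_key: "none"
    # model: "gemma4:e4b"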
  agentic:
    base_url: "http://127.0.0.1:18789/v1" # OpenClaw gateway
    api_key: "${OPENCLAW_GATEWAY_TOKEN}" # env var reference
    model: "openclaw/default"
    max_tokens: 300
    message_channel: "onju-voice" # x-openclaw-message-channel header
    # provider_model: "anthropic/claude-opus-4-6" # optional: override backend LLM
    voice_prompt: >- # prepended to every user message as a reminder
      [voice: this is spoken input transcribed from a microphone and your entire
      response will be read aloud by TTS on a small speaker. Write only plain
      spoken prose — no markdown, no lists, no structured reports, no code. If
      your research produces detailed findings, save them to a file and just
      give a brief spoken summary. Remember, keep it conversational.]
  conversational:
    base_url: "https://openrouter.ai/api/v1" # OpenRouter, Ollama, mlx_lm.server, Gemini, etc.
    api_key: "${OPENROUTER_API_KEY}" # set key or use ${ENV_VAR} reference
    model: "anthropic/claude-haiku-4.5"
    max_messages: 20
    max_tokens: 300
    system_prompt: "You are a helpful voice assistant. Keep responses concise (under 2 sentences)."
    persist_dir: "data/conversations" # per-device message history (omit to disable)
    # Fully local example (Ollama):
    # base_url: "http://localhost:11434/v1"
    # api_key: "none"
    # model: "gemma4:e4b"
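    # Gemini example (endpoint, key, and model mirror the stall section above):
    # base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
    # api_key: "${GEMINI_API_KEY}"
    # model: "gemini-2.5-flash"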
tts:
  backend: "elevenlabs" # "local" or "elevenlabs" (cloud)
  local:
    url: "http://localhost:8880"
    model: "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-4bit"
    ref_audio: ""
    ref_text: ""
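    # ref_audio/ref_text look like reference-voice inputs for the local TTS
    # model (a clip plus its exact transcript). The values below are
    # hypothetical placeholders, not files shipped with the repo:
    # ref_audio: "data/my_voice_sample.wav"
    # ref_text: "Exact transcript of my_voice_sample.wav"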
  elevenlabs:
    api_key: "" # your ElevenLabs API key
    default_voice: "Archer"
    default_voice_ptt: "Emma" # PTT devices (smaller speaker)
    voices:
      Archer: "Fahco4VZzobUeiPqni1S" # British conversational male
      Emma: "56bWURjYFHyYyVf490Dp" # female, better on small speakers
      Rachel: "21m00Tcm4TlvDq8ikWAM" # add your voice IDs here
vad:
  threshold: 0.5 # speech onset probability
  neg_threshold: 0.35 # speech offset probability (hysteresis)
  silence_time: 1.5 # seconds of trailing silence that ends an utterance
  pre_buffer_s: 1.0 # seconds of audio kept from before speech onset
network:
  udp_port: 3000
  tcp_port: 3001
  multicast_group: "239.0.0.1"
  multicast_port: 12345
  control_port: 3002
audio:
  sample_rate: 16000
  chunk_size: 512 # 32ms at 16kHz (matches ESP32 SAMPLE_CHUNK_SIZE)
  opus_frame_size: 320 # 20ms at 16kHz
device:
  default_volume: 15
  default_mic_timeout: 60
  led_fade: 2
  led_power: 50
  led_update_period: 0.25
  greeting: false
  greeting_wav: "data/hello_imhere.wav"
logging:
  level: "INFO"