mirror of
https://github.com/justLV/onju-v2
synced 2026-04-21 15:47:55 +00:00
Rework the stall prompt to distinguish LOOKUP (say something specific, three-to-seven words) from ACTION (content-free backchannel, two-to-five words, no action verbs or promises) and restructure test_stall.py to group cases by expected label for easier manual review.
149 lines
6.3 KiB
Text
asr:
  url: "http://localhost:8100" # parakeet-asr-server

conversation:
  backend: "agentic" # "agentic" (e.g. OpenClaw, with tools) or "conversational" (plain chat)

  stall:
    enabled: true # decide if a stall phrase is needed while the agent works
    base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
    api_key: "${GEMINI_API_KEY}"
    model: "gemini-2.5-flash"
    reasoning_effort: "none" # disable thinking for sub-second latency (Gemini 2.5 Flash only)
    max_tokens: 200
    timeout: 1.5 # seconds; skip stall if slower than this
    prompt: |
      You are the bridge voice for a voice assistant — a short, natural
      utterance you speak immediately while the real assistant starts
      working. Your job is to decide whether the user's latest
      utterance needs one, and if so, to say it.

      {recent_context}

      The user just said: {user_text}

      If the assistant can answer entirely from its own knowledge or
      creativity — facts, opinions, jokes, explanations, general
      knowledge, small talk, a partial thought, or a request to keep
      talking — output the literal word NONE. The assistant is itself
      a capable language model and doesn't need bridge audio for
      anything it can just answer. Note: a follow-up that changes a
      parameter in a previous lookup is a fresh lookup, not small talk.

      Otherwise the assistant is about to do slow agentic work — a
      live lookup, a file or API call, or an action like scheduling,
      saving, sending, remembering, or updating something — and you
      should speak a brief, warm bridge phrase while that runs. Two
      situations:

      Asking FOR information. React naturally and signal you're going
      to go look. Roughly three to seven words, friend energy, specific
      to what the user actually mentioned — use the name of the place,
      person, or thing instead of vague filler. Never predict the
      answer.

      Asking you to DO something. You are ONLY the bridge voice —
      you have no authority to commit to the action, and the real
      agent will confirm it itself once it's done.

      Your job: speak a short listener-sound that tells the user
      "I heard you" without actually responding to the substance of
      their request. Two to five words, warm and natural, like the
      reaction a friend gives mid-conversation to show they're
      following. It should feel like a backchannel, not a reply.

      Content test you must pass: if a third party read ONLY your
      phrase, without the user's message, they should be unable to
      guess what the user asked for. That means:
      - No verb form of the action — no "adding", "saving",
        "scheduling", "sending", "marking", "reminding", "noting",
        "creating", "updating", "setting up", "putting", etc.
      - No naming of the thing being acted on.
      - No "I'll", "I will", "let me", "I'm going to", "on it",
        "will do", "right away".

      The common failure mode is helpfully narrating the action
      ("Okay, adding that…", "Sure, I'll remember that…") — that
      is exactly what NOT to do, because you cannot honestly make
      that promise. Stay content-free.

      Write fresh each time — don't reach for stock phrases. Match the
      user's register: relaxed if they were relaxed, brisk if they
      were brisk. Keep it under seven words either way. End with
      normal spoken punctuation.

      Output ONLY the spoken phrase, or the literal word NONE. No
      quotes, no explanation, no preamble.
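    # Illustrative behavior sketch (hypothetical inputs, not taken from
    # test_stall.py):
    #   "what's the capital of France?" -> NONE (answerable from model knowledge)
    #   "how's the weather in Tokyo?"   -> lookup bridge, e.g. "Ooh, checking Tokyo for you."
    #   "remind me to call mom at five" -> content-free backchannel, e.g. "Mm-hm, yeah."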

  agentic:
    base_url: "http://127.0.0.1:18789/v1" # OpenClaw gateway
    api_key: "${OPENCLAW_GATEWAY_TOKEN}" # env var reference
    model: "openclaw/default"
    max_tokens: 300
    message_channel: "onju-voice" # x-openclaw-message-channel header
    # provider_model: "anthropic/claude-opus-4-6" # optional: override backend LLM
    voice_prompt: >- # prepended to every user message as a reminder
      [voice: this is spoken input transcribed from a microphone and your entire
      response will be read aloud by TTS on a small speaker. Write only plain
      spoken prose — no markdown, no lists, no structured reports, no code. If
      your research produces detailed findings, save them to a file and just
      give a brief spoken summary. Remember, keep it conversational.]

  conversational:
    base_url: "https://openrouter.ai/api/v1" # OpenRouter, Ollama, mlx_lm.server, Gemini, etc.
    api_key: "${OPENROUTER_API_KEY}" # set key or use ${ENV_VAR} reference
    model: "anthropic/claude-haiku-4.5"
    max_messages: 20
    max_tokens: 300
    system_prompt: "You are a helpful voice assistant. Keep responses concise (under 2 sentences)."
    persist_dir: "data/conversations" # per-device message history (omit to disable)
    # Fully local example (Ollama):
    # base_url: "http://localhost:11434/v1"
    # api_key: "none"
    # model: "gemma4:e4b"
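    # Cloud example: Gemini via its OpenAI-compatible endpoint (same base_url
    # and key as the stall block above; the model name is one plausible choice):
    # base_url: "https://generativelanguage.googleapis.com/v1beta/openai/"
    # api_key: "${GEMINI_API_KEY}"
    # model: "gemini-2.5-flash"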

tts:
  backend: "elevenlabs" # "local" or "elevenlabs" (cloud)
  local:
    url: "http://localhost:8880"
    model: "mlx-community/Qwen3-TTS-12Hz-1.7B-Base-4bit"
    ref_audio: ""
    ref_text: ""
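    # Voice-cloning sketch (hypothetical paths: this assumes ref_audio takes a
    # short WAV sample of the target voice and ref_text its exact transcript):
    # ref_audio: "data/my_voice_sample.wav"
    # ref_text: "Transcript of the sample, read verbatim."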
  elevenlabs:
    api_key: "" # your ElevenLabs API key
    default_voice: "Archer"
    default_voice_ptt: "Emma" # PTT devices (smaller speaker)
    voices:
      Archer: "Fahco4VZzobUeiPqni1S" # British conversational male
      Emma: "56bWURjYFHyYyVf490Dp" # female, better on small speakers
      Rachel: "21m00Tcm4TlvDq8ikWAM" # add your voice IDs here

vad:
  threshold: 0.5 # speech onset probability
  neg_threshold: 0.35 # speech offset probability (hysteresis)
  silence_time: 1.5 # seconds of trailing silence before the utterance ends
  pre_buffer_s: 1.0 # seconds of audio kept from before speech onset
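  # Hysteresis sketch with the values above: a frame opens speech once its
  # probability reaches 0.5, and speech only closes after probability falls
  # below 0.35, so brief mid-utterance dips between 0.35 and 0.5 don't
  # split the segment.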

network:
  udp_port: 3000
  tcp_port: 3001
  multicast_group: "239.0.0.1"
  multicast_port: 12345
  control_port: 3002

audio:
  sample_rate: 16000
  chunk_size: 512 # 32ms at 16kHz (matches ESP32 SAMPLE_CHUNK_SIZE)
  opus_frame_size: 320 # 20ms at 16kHz

device:
  default_volume: 15
  default_mic_timeout: 60
  led_fade: 2
  led_power: 50
  led_update_period: 0.25
  greeting: false
  greeting_wav: "data/hello_imhere.wav"

logging:
  level: "INFO"