VoiceOps: The First Full-Duplex Voice Bot for OpenClaw
We built real-time voice command infrastructure for our AI agent platform — speak into Discord, the agent reasons, the agent speaks back. No button presses. No mode-switching. One continuous operational loop, now open source.
>_The VoiceOps Objective
The target was clear: remove keyboard friction for high-tempo operations. Operators should be able to issue commands by voice, receive machine reasoning in real time, and hear responses immediately through a reliable speech layer.
This is not a toy assistant feature. It is an operational interface designed for speed, continuity, and decision support under real workload pressure. VoiceOps is the first fully operational full-duplex voice integration for OpenClaw — and it is now open source on GitHub.
Input Layer
Live voice capture and utterance segmentation through Discord's Opus codec with silence-gated VAD and energy-based noise rejection.
Output Layer
Full-duplex response channel that routes through the OpenClaw Gateway, synthesizes speech locally via kokoro-js, and delivers audio back to the voice channel with no per-turn API cost.
>_What Is Full Duplex? (Plain English)
The phone call analogy
A walkie-talkie requires you to press a button to speak, and stop speaking for the other party to respond. A phone call is full duplex — both parties can speak and listen at the same time, without any buttons or modes.
VoiceOps works like a phone call: you speak naturally into your Discord voice channel, the bot hears you in real time, reasons through your command with the AI agent, and speaks back — all without interrupting your workflow or waiting for you to signal readiness.
Most voice-enabled bots are half-duplex — push to talk, wait for the bot, listen, repeat. The UX feels like a radio exchange. Full duplex removes that friction entirely. You speak when you need to speak. The bot responds when it is ready. The channel stays open.
>_The Pipeline
The production pipeline runs in six discrete stages. Each stage has a clearly defined input, output, and failure contract. The design is intentionally linear — no shared global state, no concurrent stage overlap, one audio utterance processed completely before the next begins.
Discord Voice Capture
Raw Opus-encoded audio streams in from a Discord voice channel via @discordjs/voice. The receiver yields per-user AudioReceiveStream objects keyed by Discord user ID, allowing per-operator audio isolation at the protocol layer.
Opus codec · @discordjs/voice · per-user stream isolation
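The per-user isolation can be sketched as a buffer registry keyed by Discord user ID. The `AUTHORIZED_USER_ID` value and frame shape below are illustrative assumptions; in the real pipeline the frames come from `@discordjs/voice` `AudioReceiveStream` objects.

```javascript
// Sketch: per-user utterance buffering keyed by Discord user ID.
// AUTHORIZED_USER_ID is a hypothetical operator ID for illustration.
const AUTHORIZED_USER_ID = "123456789012345678";

const buffers = new Map(); // userId -> array of PCM frame buffers

function onFrame(userId, frame) {
  // Protocol-level isolation: drop audio from any other speaker.
  if (userId !== AUTHORIZED_USER_ID) return false;
  if (!buffers.has(userId)) buffers.set(userId, []);
  buffers.get(userId).push(frame);
  return true;
}

// Frames from the operator are kept; all others are rejected.
onFrame(AUTHORIZED_USER_ID, Buffer.from([0, 1]));   // accepted
onFrame("999999999999999999", Buffer.from([2, 3])); // rejected
```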
Silence-Gated VAD
EndBehaviorType.AfterSilence(800) marks the utterance boundary — 800 milliseconds of silence closes the stream and signals end-of-turn. An RMS energy gate (threshold 0.008) discards near-silent frames before they reach the ASR layer, preventing expensive API calls on background noise.
800ms silence gate · RMS 0.008 energy threshold · zero ONNX dependency
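The energy gate itself is a few lines of arithmetic. A minimal sketch, assuming 16-bit signed PCM frames normalized to [-1, 1] (the frame shape is an assumption; the 0.008 threshold is from the design above):

```javascript
// RMS energy gate: frames whose normalized RMS falls below 0.008
// are treated as background noise and never reach the ASR layer.
const RMS_THRESHOLD = 0.008;

function frameRms(samples) {
  // samples: Int16Array of PCM samples, normalized to [-1, 1]
  let sumSquares = 0;
  for (const s of samples) {
    const x = s / 32768;
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

function passesGate(samples) {
  return frameRms(samples) >= RMS_THRESHOLD;
}

// A near-silent frame is discarded; audible speech passes.
const silent = new Int16Array(960).fill(10);   // RMS ~0.0003
const speech = new Int16Array(960).fill(3000); // RMS ~0.09
```

Because the gate is a pure threshold on frame energy, its behavior is fully deterministic: the same buffer always produces the same decision.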
Whisper ASR
The segmented utterance buffer is shipped to OpenAI Whisper API as a WAV-encoded audio blob. Whisper returns the transcript in 500 milliseconds to 1.5 seconds depending on utterance length and server load. The model handles ambient noise, accents, and partial sentences gracefully.
OpenAI Whisper API · 500ms–1.5s · WAV input
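The WAV encoding step is just a 44-byte RIFF header prepended to the raw PCM. A sketch of that wrapping (the sample rate and channel count defaults here are assumptions; match them to your decoded Opus output):

```javascript
// Wrap a raw PCM buffer in a standard 44-byte RIFF/WAVE header
// before shipping it to the Whisper API as an audio blob.
function pcmToWav(pcm, sampleRate = 48000, channels = 1, bitsPerSample = 16) {
  const blockAlign = (channels * bitsPerSample) / 8;
  const byteRate = sampleRate * blockAlign;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size (PCM)
  header.writeUInt16LE(1, 20);              // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);     // data chunk size
  return Buffer.concat([header, pcm]);
}
```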
OpenClaw Gateway v3
The transcript hits the OpenClaw Gateway WebSocket on ws://127.0.0.1:18789 as a chat.send frame. The Gateway routes to the configured agent (Gemini Flash by default), executes tool calls if needed, and streams the response back. Agent reasoning adds 1 to 3 seconds depending on task complexity.
WebSocket · chat.send · Gateway v3 protocol · 1–3s reasoning
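A sketch of what building a `chat.send` frame might look like. The frame type and endpoint are from the text above, but the payload field names (`session`, `text`, `source`) are assumptions for illustration, not the authoritative Gateway v3 schema:

```javascript
// Build a chat.send frame for the Gateway WebSocket.
// Field names in the payload are hypothetical placeholders.
const GATEWAY_URL = "ws://127.0.0.1:18789";

function buildChatSendFrame(transcript, sessionId) {
  return JSON.stringify({
    type: "chat.send",    // frame type named in the article
    payload: {
      session: sessionId, // hypothetical session identifier
      text: transcript,
      source: "voice",    // mark voice origin for command gating
    },
  });
}
```

Tagging the frame with its voice origin is what lets the Gateway apply the stricter voice-command allowlist described in the security model.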
kokoro-js TTS
The agent response text is sent to a child_process subprocess running tts-worker.mjs. kokoro-js synthesizes speech using an 82MB ONNX model warm-loaded at startup. Warm synthesis completes in under 300 milliseconds. The subprocess exits cleanly (or crashes with exit 7 from WASM cleanup — both are accepted as success) and returns WAV bytes via stdout.
kokoro-js · 82MB ONNX · <300ms warm · subprocess isolation
Discord Voice Playback
The WAV buffer is wrapped in a Discord AudioResource and handed to the VoiceConnection player. Discord handles jitter buffering and Opus re-encoding for transmission. The synthesized response plays back in the voice channel within approximately 200 milliseconds of buffer handoff.
WAV → AudioResource · Discord TX · ~200ms buffering
Latency Breakdown
Real-world latency is 3 to 7 seconds end to end. Claims of 1 to 2 seconds were not credible under realistic network and reasoning load — the adversarial audit caught this before implementation began.
| Stage | Time |
|---|---|
| VAD / silence detection | ~800ms |
| Opus decode + buffer | <20ms |
| Whisper ASR (5s clip) | 500ms–1.5s |
| Agent reasoning | 1–3s |
| kokoro-js TTS (warm) | <300ms |
| Discord TX buffering | ~200ms |
| Total | 3–7 seconds |
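Summing the per-stage minima and maxima from the table gives roughly a 2.5–5.8 second envelope, consistent with the rounded 3–7 second real-world figure once network variance is added:

```javascript
// Sanity check on the latency table: sum stage minima and maxima
// (values in ms; open-ended "<" bounds treated as 0 at the minimum).
const stages = [
  { name: "VAD / silence detection", min: 800,  max: 800 },
  { name: "Opus decode + buffer",    min: 0,    max: 20 },
  { name: "Whisper ASR (5s clip)",   min: 500,  max: 1500 },
  { name: "Agent reasoning",         min: 1000, max: 3000 },
  { name: "kokoro-js TTS (warm)",    min: 0,    max: 300 },
  { name: "Discord TX buffering",    min: 200,  max: 200 },
];

const totalMin = stages.reduce((sum, s) => sum + s.min, 0); // 2500ms
const totalMax = stages.reduce((sum, s) => sum + s.max, 0); // 5820ms
```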
>_Research Before Build: The Adversarial Protocol
The architecture was stress-tested by three AI personas — The Visionary, The Empiricist, and The Critic — before a single line of code was written. The objective was to expose weak assumptions before implementation locked them into production.
The Visionary
Expansive, opportunity-focused, and biased toward strategic upside. Proposes architectures at ambition scale.
The Empiricist
Facts-only, evidence-grounded, and intolerant of unsupported latency or compatibility claims.
The Critic
Adversarial by design, tasked with breaking weak arguments and uncovering hidden dependency risks.
The protocol produced 2 kill-grade flaws and 3 significant wounds. Every finding arrived before implementation started.
KILL 1 — VAD Library Rejection
The proposed voice activity detection library (@ricky0123/vad-node v0.0.3) was pre-alpha with 8+ months of stagnation. Critically, it required onnxruntime-node@1.21.0 — directly conflicting with kokoro-js's requirement for onnxruntime-node@1.24.2. Building production reliability on a pre-alpha dependency with an irresolvable version conflict was rejected outright.
KILL 2 — GPU-Accelerated Local ASR
GPU-accelerated local speech recognition was assumed viable. The Empiricist checked: the target kernel (6.17) sits outside ROCm's supported range (supported up to 6.2). CPU-only execution and Vulkan compute backend were identified as the real viable paths. The GPU path was killed before any toolchain investment was made.
Wounds: Survivable but Requiring Treatment
Methodology Validation
These failures were discovered before implementation began. This was not brainstorming — it was adversarial research that removed failure paths before they reached production. Both kills would have caused days of debugging and rework had they been discovered post-implementation.
>_The ONNX Version Conflict
Technical Root Cause
Both the Silero VAD library and kokoro-js are ONNX-based — but they require different versions of onnxruntime-node:
- @ricky0123/vad-node requires onnxruntime-node@1.21.0
- kokoro-js requires onnxruntime-node@1.24.2
Rather than fight version pinning — a battle with no clean resolution — we dropped the external VAD library entirely and used @discordjs/voice's built-in EndBehaviorType.AfterSilence(800) combined with an RMS energy gate that discards near-silent frames before any API call. Zero ONNX conflicts. Zero VAD library install risk. The silence gate is deterministic, predictable, and maintenance-free.
>_TTS Engine Comparison
Five TTS engines were evaluated before selecting kokoro-js as the default. The selection criteria were latency, output quality, per-turn cost, and installation complexity. The winner had to be good enough to not embarrass the agent, fast enough to not break the conversational loop, and cheap enough to run at any volume.
| Engine | Latency | Cost |
|---|---|---|
| ★kokoro-js | <300ms warm | $0/turn |
| piper-tts | <1s | $0/turn |
| edge-tts | 1–2s | $0 cloud |
| espeak-ng | <100ms | $0/turn |
| ElevenLabs Starter | 300–800ms | $0.108/turn |
kokoro-js ships as a pure npm dependency with an 82MB ONNX model file. No separate binary installation, no Python environment, no API key. The model loads once at startup and warm synthesis completes in under 300 milliseconds — faster than any cloud TTS option at zero per-turn cost.
>_The Subprocess Isolation Pattern
This was the most unexpected engineering challenge of the entire build.
The WASM Process Exit Problem
kokoro-js uses an Emscripten-compiled phonemizer that calls process.exit(7) during WASM cleanup. Running synthesis in the main process would kill the entire bot on every TTS call — after the first voice response, the bot would terminate.
The solution: tts-worker.mjs runs as a child_process subprocess with a clean protocol boundary:
- Main process spawns tts-worker.mjs as a child_process subprocess.
- Text to synthesize is written to the subprocess's stdin.
- The subprocess synthesizes with kokoro-js and writes WAV bytes to stdout.
- The main process collects stdout chunks into a complete WAV buffer.
- The subprocess exits with 0 or 7 — both are accepted as synthesis success.
- The WASM crash is fully contained; the bot process is unaffected.
The subprocess pattern converts a show-stopping process termination into a safely contained side effect. The main bot process stays alive indefinitely while WASM does whatever it needs to do on exit.
>_TTS Benchmark Results
Measured on a modern 6-core x86_64 CPU, fp32 precision, cold subprocess load (first synthesis after spawn includes model load time). RTF = Real Time Factor — how long synthesis takes relative to audio playback duration. RTF below 1.0 means the synthesis completes before the audio finishes playing.
| Phrase | Wall Time | RTF |
|---|---|---|
| "Acknowledged." | 1908ms | 1.14x |
| "Standby." | 1964ms | 1.21x |
| "The operation has completed successfully..." | 2839ms | 0.56x |
| "I am ready for your next command." | 2554ms | 0.76x |
RTF < 1.0 means the agent is thinking ahead. For longer phrases, kokoro-js finishes synthesis before the audio finishes playing — the pipeline can begin buffering the next response while the current one is still in the speaker. Cold subprocess load dominates for short phrases. Warm synthesis (second call and beyond) consistently delivers under 300ms.
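The RTF figures reduce to a single ratio. The audio durations below are back-calculated from the table, so they are approximations:

```javascript
// RTF = synthesis wall time / audio playback duration.
// RTF below 1.0 means synthesis finished before playback would.
function rtf(wallMs, audioMs) {
  return wallMs / audioMs;
}

// "Acknowledged.": 1908ms wall over ~1674ms of audio -> RTF ~1.14
// Long phrase:     2839ms wall over ~5070ms of audio -> RTF ~0.56
```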
>_Cost Analysis
Voice quality has a severe cost gradient. Premium cloud voice generation is roughly 200 times more expensive than local neural synthesis for the same conversational workload. The choice of TTS engine is as much a financial decision as a technical one.
| Configuration | Per Turn | Monthly (50 turns/day) |
|---|---|---|
| ★kokoro-js + Whisper API + Gemini Flash | $0.0005 | ~$0.75 |
| ★★kokoro-js + whisper.cpp local + Gemini Flash | $0.00 | $0.00 |
| ElevenLabs Starter + Gemini Flash | $0.108 | ~$163 |
| ElevenLabs + Claude Opus fallback | ~$0.19 | ~$288 |
Cost Constraint
Tiered voice is not optional — it is economic control. The fully local path (whisper.cpp + kokoro-js + Gemini Flash) runs at genuine zero API cost in daily operation. Premium cloud TTS is reserved for moments where clarity carries true operational value.
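The monthly figures above follow directly from per-turn cost times volume (the article rounds $162 up to ~$163):

```javascript
// Reproduce the monthly cost figures: per-turn cost x 50 turns/day
// x 30 days. Per-turn numbers are taken from the tier table above.
function monthlyCost(perTurn, turnsPerDay = 50, days = 30) {
  return perTurn * turnsPerDay * days;
}

monthlyCost(0.0005); // kokoro-js + Whisper API tier -> $0.75/month
monthlyCost(0.108);  // ElevenLabs Starter tier      -> $162/month
```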
>_Security Model
Voice access is enforced at protocol level. Audio capture is restricted to one authorized Discord user ID, so the bot only processes speech from the intended operator. No ambient microphone access, no shared channel promiscuity.
Voice-sourced commands run through a restricted tool allowlist with no destructive operations permitted over voice alone.
Any destructive action requires explicit text confirmation, regardless of voice confidence score or utterance clarity.
Prompt injection risk from misrecognized utterances is mitigated through command gating and confirmation boundaries enforced at the Gateway layer.
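The gating logic above can be sketched as a small decision function. The tool names here are illustrative assumptions; the real allowlist is enforced at the Gateway layer:

```javascript
// Voice-command gating sketch: a restricted allowlist plus a text
// confirmation requirement for destructive actions.
const VOICE_ALLOWLIST = new Set(["status", "read_logs", "list_tasks"]);
const DESTRUCTIVE = new Set(["delete_task", "shutdown", "deploy"]);

function gateVoiceCommand(tool, { textConfirmed = false } = {}) {
  if (DESTRUCTIVE.has(tool)) {
    // Destructive actions never run on voice alone, regardless of
    // ASR confidence — explicit text confirmation is required.
    return textConfirmed ? "allow" : "require_text_confirmation";
  }
  // Non-destructive tools still must be on the voice allowlist.
  return VOICE_ALLOWLIST.has(tool) ? "allow" : "deny";
}
```

Keeping the default path "deny" means a misrecognized utterance that maps to an unknown tool name does nothing, which is the main line of defense against prompt injection via the ASR layer.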
>_What Shipped
The VoiceOps system is functional and running live tests across the OpenClaw agent environment. Voice activity benchmarks confirm sub-1ms processing per audio frame after the one-time cold-start initialization.
Delivered Capabilities
- Full-duplex voice pipeline: speak naturally, receive spoken responses, no mode-switching.
- Silence-gated VAD using EndBehaviorType.AfterSilence(800) with RMS energy gate — zero ONNX dependencies.
- Whisper ASR integration with WAV encoding and 500ms–1.5s transcription latency.
- OpenClaw Gateway v3 WebSocket integration via chat.send for full agent reasoning.
- kokoro-js TTS with subprocess isolation containing the WASM process.exit(7) crash.
- Tiered TTS architecture: local neural ($0/turn), cloud fallback ($0.108/turn), robotic emergency fallback.
- User-level audio isolation: bot only processes speech from the authorized Discord user ID.
- Voice-sourced command restrictions with confirmation guardrails for all destructive operations.
- Pre-build adversarial audit with 2 kills and 3 wounds caught before a single line of implementation code was written.
>_Open Source on GitHub
VoiceOps is now open source. The full implementation — pipeline, ASR integration, TTS subprocess worker, Gateway WebSocket client, and voice manager — is on GitHub. Inspect it, fork it, run it against your own OpenClaw instance.
GreyforgeLabs / voiceops
Full-duplex voice bot for OpenClaw. Discord voice capture → Whisper ASR → Gateway v3 agent → kokoro-js TTS → Discord playback.
github.com/GreyforgeLabs/voiceops
Complete Pipeline
All 6 stages implemented and documented
TTS Subprocess
tts-worker.mjs with WASM crash containment
Gateway Client
v3 WebSocket protocol with chat.send