February 27, 2026 · 15 min read

VoiceOps: The First Full-Duplex Voice Bot for OpenClaw

We built real-time voice command infrastructure for our AI agent platform — speak into Discord, the agent reasons, the agent speaks back. No button presses. No mode-switching. One continuous operational loop, now open source.


>_The VoiceOps Objective

The target was clear: remove keyboard friction for high-tempo operations. Operators should be able to issue commands by voice, receive machine reasoning in real time, and hear responses immediately through a reliable speech layer.

This is not a toy assistant feature. It is an operational interface designed for speed, continuity, and decision support under real workload pressure. VoiceOps is the first fully operational full-duplex voice integration for OpenClaw — and it is now open source on GitHub.

Input Layer

Live voice capture and utterance segmentation through Discord's Opus codec with silence-gated VAD and energy-based noise rejection.

Output Layer

Full-duplex response channel that routes through the OpenClaw Gateway, synthesizes speech locally via kokoro-js, and delivers audio back to the voice channel with no per-turn API cost.


>_What Is Full Duplex? (Plain English)

The phone call analogy

A walkie-talkie requires you to press a button to speak, and stop speaking for the other party to respond. A phone call is full duplex — both parties can speak and listen at the same time, without any buttons or modes.

VoiceOps works like a phone call: you speak naturally into your Discord voice channel, the bot hears you in real time, reasons through your command with the AI agent, and speaks back — all without interrupting your workflow or waiting for you to signal readiness.

Most voice-enabled bots are half-duplex — push to talk, wait for the bot, listen, repeat. The UX feels like a radio exchange. Full duplex removes that friction entirely. You speak when you need to speak. The bot responds when it is ready. The channel stays open.


>_The Pipeline

The production pipeline runs in six discrete stages. Each stage has a clearly defined input, output, and failure contract. The design is intentionally linear — no shared global state, no concurrent stage overlap, one audio utterance processed completely before the next begins.

01

Discord Voice Capture

Raw Opus-encoded audio streams in from a Discord voice channel via @discordjs/voice. The receiver yields per-user AudioReceiveStream objects keyed by Discord user ID, allowing per-operator audio isolation at the protocol layer.

Opus codec · @discordjs/voice · per-user stream isolation

02

Silence-Gated VAD

EndBehaviorType.AfterSilence(800) marks the utterance boundary — 800 milliseconds of silence closes the stream and signals end-of-turn. An RMS energy gate (threshold 0.008) discards near-silent frames before they reach the ASR layer, preventing expensive API calls on background noise.

800ms silence gate · RMS 0.008 energy threshold · zero ONNX dependency
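The energy gate reduces to a few lines of arithmetic. This is a sketch, not the project's exact internals — the function names and the assumption of 16-bit signed PCM frames are illustrative; the 0.008 threshold is the one quoted above.

```javascript
// Illustrative RMS energy gate for stage 02. Assumes 16-bit signed PCM
// frames; function names are hypothetical, the threshold is the post's.
const RMS_THRESHOLD = 0.008;

// Root-mean-square energy of a frame, normalized to [0, 1].
function frameRms(samples /* Int16Array */) {
  let sumSquares = 0;
  for (const s of samples) {
    const norm = s / 32768; // scale Int16 to [-1, 1)
    sumSquares += norm * norm;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Frames below the threshold are dropped before they ever reach the ASR
// layer, so background hiss never costs a Whisper API call.
function passesEnergyGate(samples) {
  return frameRms(samples) >= RMS_THRESHOLD;
}
```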

03

Whisper ASR

The segmented utterance buffer is shipped to OpenAI Whisper API as a WAV-encoded audio blob. Whisper returns the transcript in 500 milliseconds to 1.5 seconds depending on utterance length and server load. The model handles ambient noise, accents, and partial sentences gracefully.

OpenAI Whisper API · 500ms–1.5s · WAV input

04

OpenClaw Gateway v3

The transcript hits the OpenClaw Gateway WebSocket on ws://127.0.0.1:18789 as a chat.send frame. The Gateway routes to the configured agent (Gemini Flash by default), executes tool calls if needed, and streams the response back. Agent reasoning adds 1 to 3 seconds depending on task complexity.

WebSocket · chat.send · Gateway v3 protocol · 1–3s reasoning
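A chat.send frame might be assembled as below. The post only specifies the endpoint and the frame type, so the exact field names here are assumptions, not the Gateway v3 schema.

```javascript
// Illustrative chat.send frame for the Gateway v3 WebSocket. Only the
// endpoint (ws://127.0.0.1:18789) and frame type (chat.send) come from
// the post; the field names are assumptions.
function buildChatSendFrame(transcript, sessionId) {
  return JSON.stringify({
    type: 'chat.send',   // frame type named in the post
    session: sessionId,  // hypothetical session identifier
    text: transcript,    // the Whisper transcript
  });
}

// With the 'ws' client, sending would look roughly like:
//   const ws = new WebSocket('ws://127.0.0.1:18789');
//   ws.on('open', () => ws.send(buildChatSendFrame('status report', 'voiceops')));
```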

05

kokoro-js TTS

The agent response text is sent to a child_process subprocess running tts-worker.mjs. kokoro-js synthesizes speech using an 82MB ONNX model warm-loaded at startup. Warm synthesis completes in under 300 milliseconds. The subprocess exits cleanly (or crashes with exit 7 from WASM cleanup — both are accepted as success) and returns WAV bytes via stdout.

kokoro-js · 82MB ONNX · <300ms warm · subprocess isolation

06

Discord Voice Playback

The WAV buffer is wrapped in a Discord AudioResource and handed to the VoiceConnection player. Discord handles jitter buffering and Opus re-encoding for transmission. The synthesized response plays back in the voice channel within approximately 200 milliseconds of buffer handoff.

WAV → AudioResource · Discord TX · ~200ms buffering

Latency Breakdown

Real-world latency is 3 to 7 seconds end to end. Claims of 1 to 2 seconds were not credible under realistic network and reasoning load — the adversarial audit caught this before implementation began.

Stage | Time
VAD / silence detection | ~800ms
Opus decode + buffer | <20ms
Whisper ASR (5s clip) | 500ms–1.5s
Agent reasoning | 1–3s
kokoro-js TTS (warm) | <300ms
Discord TX buffering | ~200ms
Total | 3–7 seconds
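Summing the per-stage budgets shows how the range composes. The stage values are taken straight from the table; the totals land just under the published 3–7 second window, with the remainder absorbed by network jitter and cold paths.

```javascript
// Latency budget from the table above. "<" entries use 0 as a lower
// bound; totals are pure arithmetic, not new measurements.
const stagesMs = [
  { name: 'VAD / silence detection', min: 800,  max: 800  },
  { name: 'Opus decode + buffer',    min: 0,    max: 20   },
  { name: 'Whisper ASR',             min: 500,  max: 1500 },
  { name: 'Agent reasoning',         min: 1000, max: 3000 },
  { name: 'kokoro-js TTS (warm)',    min: 0,    max: 300  },
  { name: 'Discord TX buffering',    min: 200,  max: 200  },
];

const totalMin = stagesMs.reduce((t, s) => t + s.min, 0); // 2500ms
const totalMax = stagesMs.reduce((t, s) => t + s.max, 0); // 5820ms
```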

>_Research Before Build: The Adversarial Protocol

The architecture was stress-tested by three AI personas — The Visionary, The Empiricist, and The Critic — before a single line of code was written. The objective was to expose weak assumptions before code locked them into production.

The Visionary

Expansive, opportunity-focused, and biased toward strategic upside. Proposes architectures at ambition scale.

The Empiricist

Facts-only, evidence-grounded, and intolerant of unsupported latency or compatibility claims.

The Critic

Adversarial by design, tasked with breaking weak arguments and uncovering hidden dependency risks.

The protocol produced 2 kill-grade flaws and 3 significant wounds. Every finding arrived before implementation started.

KILL 1 — VAD Library Rejection

The proposed voice activity detection library (@ricky0123/vad-node v0.0.3) was pre-alpha with 8+ months of stagnation. Critically, it required onnxruntime-node@1.21.0 — directly conflicting with kokoro-js's requirement for onnxruntime-node@1.24.2. Building production reliability on a pre-alpha dependency with an irresolvable version conflict was rejected outright.

KILL 2 — GPU-Accelerated Local ASR

GPU-accelerated local speech recognition was assumed viable. The Empiricist checked: the target kernel (6.17) sits outside ROCm's supported range, which tops out at 6.2. CPU-only execution and a Vulkan compute backend were identified as the real viable paths. The GPU path was killed before any toolchain investment was made.

Wounds: Survivable but Requiring Treatment

WOUND — One-Shot Stream Subscriptions: receiver.subscribe() streams are one-shot — without re-subscribing on every utterance end, the bot goes permanently deaf after the first command. The fix: re-subscribe inside the stream-end handler.
WOUND — Discord Speaking Event Conflation: The Discord speaking event is server-side and unreliable for VAD purposes. The design conflated energy presence with true voice activity detection, so background noise triggered false positives on every breath and ambient sound.
WOUND — Latency Fantasy: Initial latency claims of 1 to 2 seconds end-to-end were not credible under realistic conditions. Agent reasoning alone contributes 1 to 3 seconds. Realistic measured latency is 3 to 7 seconds — acceptable for an operational tool, but not what the initial spec suggested.
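The one-shot subscription fix amounts to a loop: capture an utterance, hand it off, subscribe again. The `receiver.subscribe(userId, opts)` shape follows @discordjs/voice, but the helper itself is an illustrative sketch.

```javascript
// Sketch of the re-subscribe fix for the one-shot stream wound. The
// receiver API shape follows @discordjs/voice; the helper is illustrative.
function attachUtteranceLoop(receiver, userId, opts, onUtterance) {
  const stream = receiver.subscribe(userId, opts);
  const chunks = [];
  stream.on('data', (chunk) => chunks.push(chunk));
  // When the silence gate closes the stream, deliver the utterance and
  // immediately re-subscribe -- otherwise the bot goes deaf after one turn.
  stream.on('end', () => {
    onUtterance(Buffer.concat(chunks));
    attachUtteranceLoop(receiver, userId, opts, onUtterance);
  });
}
```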

Methodology Validation

These failures were discovered before implementation began. This was not brainstorming — it was adversarial research that removed failure paths before they reached production. Both kills would have caused days of debugging and rework had they been discovered post-implementation.


>_The ONNX Version Conflict

Technical Root Cause

Both the Silero VAD library and kokoro-js are ONNX-based — but they require different versions of onnxruntime-node:

@ricky0123/vad-node

requires onnxruntime-node@1.21.0

kokoro-js

requires onnxruntime-node@1.24.2

Rather than fight version pinning — a battle with no clean resolution — we dropped the external VAD library entirely and used @discordjs/voice's built-in EndBehaviorType.AfterSilence(800) combined with an RMS energy gate that discards near-silent frames before any API call. Zero ONNX conflicts. Zero VAD library install risk. The silence gate is deterministic, predictable, and maintenance-free.


>_TTS Engine Comparison

Five TTS engines were evaluated before selecting kokoro-js as the default. The selection criteria were latency, output quality, per-turn cost, and installation complexity. The winner had to be good enough to not embarrass the agent, fast enough to not break the conversational loop, and cheap enough to run at any volume.

Engine | Latency | Cost
kokoro-js | <300ms warm | $0/turn
piper-tts | <1s | $0/turn
edge-tts | 1–2s | $0 (cloud)
espeak-ng | <100ms | $0/turn
ElevenLabs Starter | 300–800ms | $0.108/turn

kokoro-js ships as a pure npm dependency with an 82MB ONNX model file. No separate binary installation, no Python environment, no API key. The model loads once at startup and warm synthesis completes in under 300 milliseconds — faster than any cloud TTS option at zero per-turn cost.


>_The Subprocess Isolation Pattern

This was the most unexpected engineering challenge of the entire build.

The WASM Process Exit Problem

kokoro-js uses an Emscripten-compiled phonemizer that calls process.exit(7) during WASM cleanup. Running synthesis in the main process would kill the entire bot on every TTS call — after the first voice response, the bot would terminate.

The solution: tts-worker.mjs runs as a child_process subprocess with a clean protocol boundary:

01

Main process spawns tts-worker.mjs as a child_process subprocess

02

Text to synthesize is written to the subprocess's stdin

03

Subprocess synthesizes with kokoro-js and writes WAV bytes to stdout

04

Main process collects stdout chunks into a complete WAV buffer

05

Subprocess exits with 0 or 7 — both are accepted as synthesis success

06

WASM crash is fully contained; the bot process is unaffected

The subprocess pattern converts a show-stopping process termination into a safely contained side effect. The main bot process stays alive indefinitely while WASM does whatever it needs to do on exit.


>_TTS Benchmark Results

Measured on a modern 6-core x86_64 CPU, fp32 precision, cold subprocess load (first synthesis after spawn includes model load time). RTF = Real Time Factor — how long synthesis takes relative to audio playback duration. RTF below 1.0 means the synthesis completes before the audio finishes playing.

Phrase | Wall Time | RTF
"Acknowledged." | 1908ms | 1.14x
"Standby." | 1964ms | 1.21x
"The operation has completed successfully..." | 2839ms | 0.56x
"I am ready for your next command." | 2554ms | 0.76x

RTF < 1.0 means the agent is thinking ahead. For longer phrases, kokoro-js finishes synthesis before the audio finishes playing — the pipeline can begin buffering the next response while the current one is still in the speaker. Cold subprocess load dominates for short phrases. Warm synthesis (second call and beyond) consistently delivers under 300ms.
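RTF is just a ratio, and the table can be sanity-checked from it. The audio durations below are back-computed from the table (duration = wall time / RTF), so treat them as approximations rather than measured values.

```javascript
// RTF = synthesis wall time / audio playback duration. Below 1.0, the
// audio outlasts the synthesis that produced it.
function realTimeFactor(wallTimeMs, audioDurationMs) {
  return wallTimeMs / audioDurationMs;
}

// "Acknowledged."        -> 1908ms wall over ~1674ms of audio -> RTF ~1.14
// "The operation has..." -> 2839ms wall over ~5070ms of audio -> RTF ~0.56
```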


>_Cost Analysis

Voice quality has a severe cost gradient. Premium cloud voice generation is roughly 200 times more expensive than local neural synthesis for the same conversational workload. The choice of TTS engine is as much a financial decision as a technical one.

Configuration | Monthly (at 50 turns/day) | Per-turn
kokoro-js + Whisper API + Gemini Flash | ~$0.75/month | $0.0005/turn
kokoro-js + whisper.cpp local + Gemini Flash | $0.00/month | $0.00/turn
ElevenLabs Starter + Gemini Flash | ~$163/month | $0.108/turn
ElevenLabs + Claude Opus fallback | ~$288/month | ~$0.19/turn
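The monthly figures are just per-turn cost times volume. Assuming a 30-day month at the post's 50 turns/day:

```javascript
// Per-turn cost x volume, assuming a 30-day month. The per-turn rates
// are from the tiers above; the 30-day assumption is ours.
function monthlyCost(perTurnUsd, turnsPerDay, days = 30) {
  return perTurnUsd * turnsPerDay * days;
}

// kokoro-js + Whisper API + Gemini Flash: 0.0005 * 50 * 30 = $0.75/month
// ElevenLabs Starter + Gemini Flash:      0.108  * 50 * 30 = $162/month (~$163 in the post)
```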

Cost Constraint

Tiered voice is not optional — it is economic control. The fully local path (whisper.cpp + kokoro-js + Gemini Flash) runs at genuine zero API cost in daily operation. Premium cloud TTS is reserved for moments where clarity carries true operational value.


>_Security Model

Voice access is enforced at protocol level. Audio capture is restricted to one authorized Discord user ID, so the bot only processes speech from the intended operator. No ambient microphone access, no shared channel promiscuity.

Voice-sourced commands run through a restricted tool allowlist with no destructive operations permitted over voice alone.

Any destructive action requires explicit text confirmation, regardless of voice confidence score or utterance clarity.

Prompt injection risk from misrecognized utterances is mitigated through command gating and confirmation boundaries enforced at the Gateway layer.
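The gating rules above compose into one predicate. The user ID, allowlist contents, and helper name here are all placeholders for illustration; the rules themselves (single authorized user, no destructive operations over voice alone) are the ones the post describes.

```javascript
// Illustrative voice-command gate. IDs and tool names are placeholders;
// the three rules are the post's security model.
const AUTHORIZED_USER_ID = '123456789012345678'; // placeholder Discord user ID
const VOICE_TOOL_ALLOWLIST = new Set(['status', 'read_logs', 'search']); // hypothetical tools
const DESTRUCTIVE_TOOLS = new Set(['deploy', 'delete', 'shutdown']);     // hypothetical tools

function canRunOverVoice(userId, tool) {
  if (userId !== AUTHORIZED_USER_ID) return false; // per-user audio isolation
  if (DESTRUCTIVE_TOOLS.has(tool)) return false;   // requires explicit text confirmation
  return VOICE_TOOL_ALLOWLIST.has(tool);           // restricted allowlist
}
```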


>_What Shipped

The VoiceOps system is functional and running live tests across the OpenClaw agent environment. Voice activity benchmarks confirm sub-1ms processing per audio frame after the one-time cold-start initialization.

Delivered Capabilities

  • Full-duplex voice pipeline: speak naturally, receive spoken responses, no mode-switching.
  • Silence-gated VAD using EndBehaviorType.AfterSilence(800) with RMS energy gate — zero ONNX dependencies.
  • Whisper ASR integration with WAV encoding and 500ms–1.5s transcription latency.
  • OpenClaw Gateway v3 WebSocket integration via chat.send for full agent reasoning.
  • kokoro-js TTS with subprocess isolation containing the WASM process.exit(7) crash.
  • Tiered TTS architecture: local neural ($0/turn), cloud fallback ($0.108/turn), robotic emergency fallback.
  • User-level audio isolation: bot only processes speech from the authorized Discord user ID.
  • Voice-sourced command restrictions with confirmation guardrails for all destructive operations.
  • Pre-build adversarial audit with 2 kills and 3 wounds caught before a single line of implementation code was written.

>_Open Source on GitHub

VoiceOps is now open source. The full implementation — pipeline, ASR integration, TTS subprocess worker, Gateway WebSocket client, and voice manager — is on GitHub. Inspect it, fork it, run it against your own OpenClaw instance.

GreyforgeLabs / voiceops

Full-duplex voice bot for OpenClaw. Discord voice capture → Whisper ASR → Gateway v3 agent → kokoro-js TTS → Discord playback.

github.com/GreyforgeLabs/voiceops

Complete Pipeline

All 6 stages implemented and documented

TTS Subprocess

tts-worker.mjs with WASM crash containment

Gateway Client

v3 WebSocket protocol with chat.send