February 27, 2026 · 15 min read

VoiceOps: The First Full-Duplex Voice Bot for OpenClaw

We built real-time voice command infrastructure for our autonomous agent platform - speak into Discord, the agent reasons, the agent speaks back. No button presses. No mode-switching. One continuous operational loop, now open source.


>_The VoiceOps Objective

The target was clear: remove keyboard friction for high-tempo operations. Operators should be able to issue commands by voice, receive machine reasoning in real time, and hear responses immediately through a reliable speech layer.

This is not a toy assistant feature. It is an operational interface designed for speed, continuity, and decision support under real workload pressure. VoiceOps is the first fully operational full-duplex voice integration for OpenClaw - building on early ForgeClaw operating work - and it is now open source on GitHub.

Input Layer

Live voice capture and utterance segmentation through Discord's Opus codec with silence-gated VAD and energy-based noise rejection.

Output Layer

Full-duplex response channel that routes through the OpenClaw Gateway, synthesizes speech locally via kokoro-js, and delivers audio back to the voice channel with no per-turn API cost.


>_What Is Full Duplex? (Plain English)

The phone call analogy

A walkie-talkie requires you to press a button to speak, and stop speaking for the other party to respond. A phone call is full duplex - both parties can speak and listen at the same time, without any buttons or modes.

VoiceOps works like a phone call: you speak naturally into your Discord voice channel, the bot hears you in real time, reasons through your command with the autonomous agent, and speaks back - all without interrupting your workflow or waiting for you to signal readiness.

Most voice-enabled bots are half-duplex - push to talk, wait for the bot, listen, repeat. The UX feels like a radio exchange. Full duplex removes that friction entirely. You speak when you need to speak. The bot responds when it is ready. The channel stays open.


>_The Pipeline

The production pipeline runs in six discrete stages. Each stage has a clearly defined input, output, and failure contract. The design is intentionally linear - no shared global state, no concurrent stage overlap, one audio utterance processed completely before the next begins.

01

Discord Voice Capture

Audio streams in from Discord voice through the standard voice library. Per-user stream separation lets the system isolate authorized speakers at the protocol layer.

Discord voice · per-user stream isolation

02

Silence-Gated VAD

EndBehaviorType.AfterSilence with an 800 ms duration marks the utterance boundary - 800 milliseconds of silence closes the stream and signals end-of-turn. An RMS energy gate (threshold 0.008) discards near-silent frames before they reach the ASR layer, preventing expensive API calls on background noise.

800ms silence gate · RMS 0.008 energy threshold · zero ONNX dependency
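The energy gate reduces to a few lines of arithmetic. A minimal sketch (not the shipped implementation), assuming 16-bit signed mono PCM frames; the 0.008 threshold matches the value quoted above:

```javascript
// RMS energy gate over one 16-bit PCM frame. Frames below the
// threshold are dropped before they ever reach the ASR layer.
const RMS_THRESHOLD = 0.008;

function rmsEnergy(pcm16) {
  let sumSquares = 0;
  for (const sample of pcm16) {
    const s = sample / 32768; // normalize int16 to [-1, 1)
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / pcm16.length);
}

function passesEnergyGate(pcm16) {
  return rmsEnergy(pcm16) >= RMS_THRESHOLD;
}
```

Because the gate is plain arithmetic over the frame buffer, it is deterministic and carries no model dependency - which is exactly what made it the escape hatch from the ONNX conflict described later.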

03

Whisper ASR

The segmented utterance buffer is shipped to OpenAI Whisper API as a WAV-encoded audio blob. Whisper returns the transcript in 500 milliseconds to 1.5 seconds depending on utterance length and server load. The model handles ambient noise, accents, and partial sentences gracefully.

OpenAI Whisper API · 500ms-1.5s · WAV input
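Shipping the utterance as a WAV blob means wrapping the raw PCM in a 44-byte RIFF header first. A minimal sketch, assuming 16-bit mono PCM at 16 kHz - the shipped pipeline's exact sample rate and channel layout may differ:

```javascript
// Wrap raw 16-bit mono PCM (Int16Array) in a minimal WAV container.
function pcmToWav(pcm, sampleRate = 16000) {
  const dataSize = pcm.byteLength;
  const buf = Buffer.alloc(44 + dataSize);
  buf.write("RIFF", 0);
  buf.writeUInt32LE(36 + dataSize, 4); // file size minus 8
  buf.write("WAVE", 8);
  buf.write("fmt ", 12);
  buf.writeUInt32LE(16, 16);            // fmt chunk size
  buf.writeUInt16LE(1, 20);             // audio format = PCM
  buf.writeUInt16LE(1, 22);             // channels = mono
  buf.writeUInt32LE(sampleRate, 24);
  buf.writeUInt32LE(sampleRate * 2, 28); // byte rate (16-bit mono)
  buf.writeUInt16LE(2, 32);             // block align
  buf.writeUInt16LE(16, 34);            // bits per sample
  buf.write("data", 36);
  buf.writeUInt32LE(dataSize, 40);
  Buffer.from(pcm.buffer, pcm.byteOffset, dataSize).copy(buf, 44);
  return buf;
}
```

The resulting buffer is what gets posted to the transcription endpoint as a file upload.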

04

OpenClaw Gateway v3

The transcript passes through a local gateway to the configured agent. The gateway routes the task, handles tool calls when allowed, and streams the response back for speech synthesis.

Gateway protocol · agent routing · streaming response

05

kokoro-js TTS

The agent response text is sent through an isolated synthesis worker. kokoro-js generates the spoken response, while the main voice process stays insulated from worker-level failures.

kokoro-js · local TTS · worker isolation

06

Discord Voice Playback

The WAV buffer is wrapped in a Discord AudioResource and handed to the VoiceConnection player. Discord handles jitter buffering and Opus re-encoding for transmission. The synthesized response plays back in the voice channel within approximately 200 milliseconds of buffer handoff.

WAV → AudioResource · Discord TX · ~200ms buffering
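The six stages above compose into one linear loop. A self-contained sketch of that control flow - the stage-function names here are hypothetical and injected as dependencies; the real implementations are the Discord, Whisper, Gateway, and kokoro-js integrations described above:

```javascript
// Intentionally linear: one utterance is processed completely
// before the next begins. No shared global state between stages.
async function handleUtterance(pcmBuffer, deps) {
  const wav = deps.encodeWav(pcmBuffer);         // stage 2→3 boundary
  const transcript = await deps.transcribe(wav); // Whisper ASR
  const reply = await deps.askAgent(transcript); // Gateway v3
  const speech = await deps.synthesize(reply);   // kokoro-js worker
  await deps.play(speech);                       // Discord playback
}
```

Dependency injection keeps each stage swappable - local whisper.cpp for the cloud API, or a premium TTS tier for kokoro-js - without touching the loop itself.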

Latency Breakdown

Real-world latency is 3 to 7 seconds end to end. Claims of 1 to 2 seconds were not credible under realistic network and reasoning load - the adversarial audit caught this before implementation began.

Stage                       Time
VAD / silence detection     ~800ms
Opus decode + buffer        <20ms
Whisper ASR (5s clip)       500ms-1.5s
Agent reasoning             1-3s
kokoro-js TTS (warm)        <300ms
Discord TX buffering        ~200ms
Total                       3-7 seconds
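Summing the stage budgets is a useful sanity check on the headline range. The static budgets alone land at roughly 2.5-5.8 seconds; the quoted 3-7 second real-world figure adds network jitter and reasoning variance on top:

```javascript
// Stage budgets from the table above, in seconds (lo/hi bounds).
const stages = [
  { name: "VAD / silence detection", lo: 0.8, hi: 0.8 },
  { name: "Opus decode + buffer",    lo: 0.0, hi: 0.02 },
  { name: "Whisper ASR",             lo: 0.5, hi: 1.5 },
  { name: "Agent reasoning",         lo: 1.0, hi: 3.0 },
  { name: "kokoro-js TTS (warm)",    lo: 0.0, hi: 0.3 },
  { name: "Discord TX buffering",    lo: 0.2, hi: 0.2 },
];

const floor = stages.reduce((sum, s) => sum + s.lo, 0);   // best case
const ceiling = stages.reduce((sum, s) => sum + s.hi, 0); // budget ceiling
```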

>_Research Before Build: The Adversarial Protocol

Before implementation, the architecture was stress-tested by three autonomous personas - The Visionary, The Empiricist, and The Critic - using our adversarial research protocol. The objective was to expose weak assumptions before code locked them into production.

The Visionary

Expansive, opportunity-focused, and biased toward strategic upside. Proposes architectures at ambition scale.

The Empiricist

Facts-only, evidence-grounded, and intolerant of unsupported latency or compatibility claims.

The Critic

Adversarial by design, tasked with breaking weak arguments and uncovering hidden dependency risks.

The protocol produced 2 kill-grade flaws and 3 significant wounds. Every finding arrived before implementation started.

KILL 1 - VAD Library Rejection

The proposed voice activity detection library (@ricky0123/vad-node v0.0.3) was pre-alpha with 8+ months of stagnation. Critically, it required onnxruntime-node@1.21.0 - directly conflicting with kokoro-js's requirement for onnxruntime-node@1.24.2. Building production reliability on a pre-alpha dependency with an irresolvable version conflict was rejected outright.

KILL 2 - GPU-Accelerated Local ASR

GPU-accelerated local speech recognition was assumed viable. The Empiricist checked the driver and platform fit before implementation and found the path too brittle for the first release. CPU-first and portable fallback paths were preserved instead.

Wounds: Survivable but Requiring Treatment

WOUND - One-Shot Stream Subscriptions: receiver.subscribe() streams are one-shot - without re-subscribing on every utterance end, the bot goes permanently deaf after the first command. The fix: re-subscribe inside the stream-end handler.
WOUND - Discord Speaking Event Conflation: The Discord speaking event is server-side and unreliable for VAD purposes, and energy-based detection was being conflated with true voice activity detection - background noise triggered false positives on every breath and ambient sound.
WOUND - Latency Fantasy: Initial latency claims of 1 to 2 seconds end-to-end were not credible under realistic conditions. Agent reasoning alone contributes 1 to 3 seconds. Realistic measured latency is 3 to 7 seconds - acceptable for an operational tool, but not what the initial spec suggested.

Methodology Validation

These failures were discovered before implementation began. This was not brainstorming - it was adversarial research that removed failure paths before they reached production. Both kills would have caused days of debugging and rework had they been discovered post-implementation.


>_The ONNX Version Conflict

Technical Root Cause

Both the Silero VAD library and kokoro-js are ONNX-based - but they require different versions of onnxruntime-node:

@ricky0123/vad-node

requires onnxruntime-node@1.21.0

kokoro-js

requires onnxruntime-node@1.24.2

Rather than fight version pinning - a battle with no clean resolution - we dropped the external VAD library entirely and used @discordjs/voice's built-in EndBehaviorType.AfterSilence(800) combined with an RMS energy gate that discards near-silent frames before any API call. Zero ONNX conflicts. Zero VAD library install risk. The silence gate is deterministic, predictable, and maintenance-free.


>_TTS Engine Comparison

Five TTS engines were evaluated before selecting kokoro-js as the default. The selection criteria were latency, output quality, per-turn cost, and installation complexity. The winner had to be good enough to not embarrass the agent, fast enough to not break the conversational loop, and cheap enough to run at any volume.

Engine                 Latency      Cost
kokoro-js              <300ms warm  $0/turn
piper-tts              <1s          $0/turn
edge-tts               1-2s         $0 (cloud)
espeak-ng              <100ms       $0/turn
ElevenLabs Starter     300-800ms    $0.108/turn

kokoro-js ships as a pure npm dependency with an 82MB ONNX model file. No separate binary installation, no Python environment, no API key. The model loads once at startup and warm synthesis completes in under 300 milliseconds - faster than any cloud TTS option at zero per-turn cost.


>_The Subprocess Isolation Pattern

This was the most unexpected engineering challenge of the entire build.

The Synthesis Isolation Problem

Run inline, the local speech stack could terminate the host process during its own cleanup. The solution: speech synthesis runs behind an isolated worker boundary, so a synthesis failure is contained and the main voice process survives to take the next turn.

01

Main process sends text to an isolated synthesis worker.

02

The worker generates speech and returns a complete audio buffer.

03

Worker-level failures stay contained inside the synthesis boundary.

04

The voice process remains available for the next turn.

The worker pattern converts a show-stopping synthesis failure into a contained side effect. The main voice process stays available.


>_TTS Benchmark Results

Measured on a modern 6-core x86_64 CPU, fp32 precision, cold subprocess load (first synthesis after spawn includes model load time). RTF = Real Time Factor - how long synthesis takes relative to audio playback duration. RTF below 1.0 means the synthesis completes before the audio finishes playing.

Phrase                                          Wall Time   RTF
"Acknowledged."                                 1908ms      1.14x
"Standby."                                      1964ms      1.21x
"The operation has completed successfully..."   2839ms      0.56x
"I am ready for your next command."             2554ms      0.76x

RTF < 1.0 means the agent is thinking ahead. For longer phrases, kokoro-js finishes synthesis before the audio finishes playing - the pipeline can begin buffering the next response while the current one is still in the speaker. Cold subprocess load dominates for short phrases. Warm synthesis (second call and beyond) consistently delivers under 300ms.
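RTF reduces to a single division, and the pre-buffering rule to a single comparison - a minimal sketch of both:

```javascript
// RTF = synthesis wall time / audio playback duration.
function realTimeFactor(synthWallMs, audioDurationMs) {
  return synthWallMs / audioDurationMs;
}

// RTF < 1.0: synthesis finishes before playback does, so the pipeline
// can begin preparing the next response while the current one plays.
function canPrebufferNext(rtf) {
  return rtf < 1.0;
}
```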


>_Cost Analysis

Voice quality has a severe cost gradient. Premium cloud voice generation is roughly 200 times more expensive than local neural synthesis for the same conversational workload. The choice of TTS engine is as much a financial decision as a technical one.

kokoro-js + Whisper API + Gemini Flash

~$0.75/month at 50/day

$0.0005/turn

★★ kokoro-js + whisper.cpp (local) + Gemini Flash

$0.00/month

$0.00/turn

ElevenLabs Starter + Gemini Flash

~$163/month at 50/day

$0.108/turn

ElevenLabs + Claude Opus fallback

~$288/month at 50/day

~$0.19/turn
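The monthly figures above are straightforward per-turn arithmetic, assuming 50 turns/day over a 30-day month (small deltas from the quoted figures come from rounding):

```javascript
// Monthly cost from a per-turn price at a fixed daily volume.
function monthlyCost(perTurnUsd, turnsPerDay = 50, days = 30) {
  return perTurnUsd * turnsPerDay * days;
}
```

At these volumes the gradient is stark: the same 1,500 turns cost $0.75 on the local-TTS tier and over $160 on premium cloud voice.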

Cost Constraint

Tiered voice is not optional - it is economic control. The fully local path (whisper.cpp + kokoro-js + Gemini Flash) runs at genuine zero API cost in daily operation. Premium cloud TTS is reserved for moments where clarity carries true operational value.


>_Security Model

Voice access is enforced at protocol level, following the same ForgeOps doctrine that governs Greyforge operational interfaces. The bot processes only authorized voice input. No ambient microphone access, no open shared-channel capture.

Voice-sourced commands run through a restricted tool allowlist with no destructive operations permitted over voice alone.

Any destructive action requires explicit text confirmation, regardless of voice confidence score or utterance clarity.

Prompt injection risk from misrecognized utterances is mitigated through command gating and confirmation boundaries enforced at the Gateway layer.
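The gating rule itself is simple to state in code. A minimal sketch - the allowlist contents and function names here are illustrative, not the shipped Gateway policy:

```javascript
// Hypothetical allowlist of tools reachable by voice alone.
const VOICE_TOOL_ALLOWLIST = new Set(["status", "read_logs", "list_tasks"]);

function gateVoiceCommand(tool, { confirmedViaText = false } = {}) {
  if (VOICE_TOOL_ALLOWLIST.has(tool)) return { allowed: true };
  // Anything outside the allowlist needs explicit text confirmation,
  // regardless of ASR confidence or utterance clarity.
  return confirmedViaText
    ? { allowed: true }
    : { allowed: false, reason: "text confirmation required" };
}
```

Because the check keys on the tool name rather than the transcript, a misrecognized or injected utterance can at worst request a harmless read-only tool.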


>_What Shipped

The VoiceOps system is functional and running live tests across the OpenClaw agent environment. Voice activity benchmarks confirm sub-1ms processing per audio frame after the one-time cold-start initialization.

Delivered Capabilities

  • Full-duplex voice pipeline: speak naturally, receive spoken responses, no mode-switching.
  • Silence-gated voice activity detection with no heavyweight model dependency.
  • Whisper ASR integration with practical short-turn transcription latency.
  • Gateway integration for full agent reasoning.
  • kokoro-js TTS with isolated worker containment.
  • Tiered TTS architecture with local-first synthesis and fallback paths.
  • Authorized-user voice isolation.
  • Voice-sourced command restrictions with confirmation guardrails for all destructive operations.
  • Pre-build adversarial audit with 2 kills and 3 wounds caught before a single line of implementation code was written.

>_Open Source on GitHub

VoiceOps is now open source. The full implementation - pipeline, ASR integration, TTS subprocess worker, Gateway WebSocket client, and voice manager - is on GitHub. Inspect it, fork it, run it against your own OpenClaw instance.

GreyforgeLabs / voiceops

Full-duplex voice bot for OpenClaw. Discord voice capture → Whisper ASR → Gateway v3 agent → kokoro-js TTS → Discord playback.

github.com/GreyforgeLabs/voiceops

Complete Pipeline

All 6 stages implemented and documented

TTS Worker

Isolated local synthesis boundary

Gateway Client

Streaming agent response path