VoiceOps: The First Full-Duplex Voice Bot for OpenClaw
We built real-time voice command infrastructure for our autonomous agent platform - speak into Discord, the agent reasons, the agent speaks back. No button presses. No mode-switching. One continuous operational loop, now open source.

>_The VoiceOps Objective
The target was clear: remove keyboard friction for high-tempo operations. Operators should be able to issue commands by voice, receive machine reasoning in real time, and hear responses immediately through a reliable speech layer.
This is not a toy assistant feature. It is an operational interface designed for speed, continuity, and decision support under real workload pressure. VoiceOps is the first fully operational full-duplex voice integration for OpenClaw - building on early ForgeClaw operating work - and it is now open source on GitHub.
Input Layer
Live voice capture and utterance segmentation through Discord's Opus codec with silence-gated VAD and energy-based noise rejection.
Output Layer
Full-duplex response channel that routes through the OpenClaw Gateway, synthesizes speech locally via kokoro-js, and delivers audio back to the voice channel with no per-turn API cost.
>_What Is Full Duplex? (Plain English)
The phone call analogy
A walkie-talkie requires you to press a button to speak, and stop speaking for the other party to respond. A phone call is full duplex - both parties can speak and listen at the same time, without any buttons or modes.
VoiceOps works like a phone call: you speak naturally into your Discord voice channel, the bot hears you in real time, reasons through your command with the autonomous agent, and speaks back - all without interrupting your workflow or waiting for you to signal readiness.
Most voice-enabled bots are half-duplex - push to talk, wait for the bot, listen, repeat. The UX feels like a radio exchange. Full duplex removes that friction entirely. You speak when you need to speak. The bot responds when it is ready. The channel stays open.
>_The Pipeline
The production pipeline runs in six discrete stages. Each stage has a clearly defined input, output, and failure contract. The design is intentionally linear - no shared global state, no concurrent stage overlap, one audio utterance processed completely before the next begins.
Discord Voice Capture
Audio streams in from Discord voice through the standard voice library. Per-user stream separation lets the system isolate authorized speakers at the protocol layer.
Discord voice · per-user stream isolation
Silence-Gated VAD
EndBehaviorType.AfterSilence with an 800 ms duration marks the utterance boundary - 800 milliseconds of silence closes the stream and signals end-of-turn. An RMS energy gate (threshold 0.008) discards near-silent frames before they reach the ASR layer, preventing expensive API calls on background noise.
800ms silence gate · RMS 0.008 energy threshold · zero ONNX dependency
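The energy gate is simple enough to show in full. A minimal sketch, assuming 16-bit signed PCM frames (960 samples ≈ 20 ms at 48 kHz); the threshold (0.008) is the pipeline's configured value, and the function names are illustrative:

```typescript
// RMS threshold from the pipeline config; frames below it never reach ASR.
const RMS_THRESHOLD = 0.008;

// Root-mean-square energy of a PCM frame, normalized to [0, 1].
function frameRms(frame: Int16Array): number {
  let sumSquares = 0;
  for (const sample of frame) {
    const normalized = sample / 32768; // scale 16-bit PCM to [-1, 1)
    sumSquares += normalized * normalized;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Gate: near-silent frames are dropped before any API call is made.
function passesEnergyGate(frame: Int16Array): boolean {
  return frameRms(frame) >= RMS_THRESHOLD;
}
```

Because the gate is a pure function of the frame buffer, it adds microseconds per frame and no model dependency.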
Whisper ASR
The segmented utterance buffer is shipped to OpenAI Whisper API as a WAV-encoded audio blob. Whisper returns the transcript in 500 milliseconds to 1.5 seconds depending on utterance length and server load. The model handles ambient noise, accents, and partial sentences gracefully.
OpenAI Whisper API · 500ms-1.5s · WAV input
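Whisper's transcription endpoint wants a container format, so the raw PCM utterance buffer is wrapped in a WAV header before upload. A hedged sketch of that wrapping step (16-bit mono assumed; the 16 kHz sample rate is an assumption - Discord decodes at 48 kHz, and any downsampling is omitted here):

```typescript
// Wrap raw 16-bit mono PCM in a minimal 44-byte WAV container.
function pcmToWav(pcm: Int16Array, sampleRate = 16000): Buffer {
  const dataSize = pcm.length * 2;       // 16-bit = 2 bytes per sample
  const buf = Buffer.alloc(44 + dataSize);
  buf.write("RIFF", 0);
  buf.writeUInt32LE(36 + dataSize, 4);   // RIFF chunk size
  buf.write("WAVE", 8);
  buf.write("fmt ", 12);
  buf.writeUInt32LE(16, 16);             // fmt chunk size
  buf.writeUInt16LE(1, 20);              // audio format: PCM
  buf.writeUInt16LE(1, 22);              // channels: mono
  buf.writeUInt32LE(sampleRate, 24);
  buf.writeUInt32LE(sampleRate * 2, 28); // byte rate
  buf.writeUInt16LE(2, 32);              // block align
  buf.writeUInt16LE(16, 34);             // bits per sample
  buf.write("data", 36);
  buf.writeUInt32LE(dataSize, 40);
  for (let i = 0; i < pcm.length; i++) buf.writeInt16LE(pcm[i], 44 + i * 2);
  return buf;
}
```

The resulting buffer is what gets POSTed to the transcription endpoint as a file field in a multipart form.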
OpenClaw Gateway v3
The transcript passes through a local gateway to the configured agent. The gateway routes the task, handles tool calls when allowed, and streams the response back for speech synthesis.
Gateway protocol · agent routing · streaming response
kokoro-js TTS
The agent response text is sent through an isolated synthesis worker. kokoro-js generates the spoken response, while the main voice process stays insulated from worker-level failures.
kokoro-js · local TTS · worker isolation
Discord Voice Playback
The WAV buffer is wrapped in a Discord AudioResource and handed to the VoiceConnection player. Discord handles jitter buffering and Opus re-encoding for transmission. The synthesized response plays back in the voice channel within approximately 200 milliseconds of buffer handoff.
WAV → AudioResource · Discord TX · ~200ms buffering
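The six stages above run as one strictly linear loop: each utterance completes the full chain before the next begins. A minimal sketch of that stage contract (stage names and signatures are illustrative, and the real handlers are asynchronous):

```typescript
// Each stage is a pure input -> output step with a defined failure
// contract; composing them yields the one-utterance-at-a-time loop.
type Stage<In, Out> = (input: In) => Out;

// first fully completes before second starts - no shared state,
// no concurrent stage overlap.
function linearPipeline<A, B, C>(
  first: Stage<A, B>,
  second: Stage<B, C>,
): Stage<A, C> {
  return (input) => second(first(input));
}
```

Chaining all six stages this way keeps the failure contract explicit: a stage either returns its output or the whole turn fails, and no later stage ever starts early.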
Latency Breakdown
Real-world latency is 3 to 7 seconds end to end. Claims of 1 to 2 seconds were not credible under realistic network and reasoning load - the adversarial audit caught this before implementation began.
| Stage | Time |
|---|---|
| VAD / silence detection | ~800ms |
| Opus decode + buffer | <20ms |
| Whisper ASR (5s clip) | 500ms-1.5s |
| Agent reasoning | 1-3s |
| kokoro-js TTS (warm) | <300ms |
| Discord TX buffering | ~200ms |
| Total | 3-7 seconds |
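Summing the table's per-stage figures is a useful sanity check. The nominal best and worst cases come to roughly 2.5 s and 5.8 s; real-world network jitter and longer agent reasoning push the observed range to the quoted 3-7 seconds. A small sketch of that arithmetic (stage keys are illustrative):

```typescript
// [best, worst] case per stage, in milliseconds, from the table above.
const stages: Record<string, [number, number]> = {
  vad:        [800, 800],
  opusDecode: [0, 20],
  whisperAsr: [500, 1500],
  agent:      [1000, 3000],
  tts:        [0, 300],
  discordTx:  [200, 200],
};

// Sum the per-stage bounds into an end-to-end [best, worst] range.
function totalRange(): [number, number] {
  let lo = 0, hi = 0;
  for (const [min, max] of Object.values(stages)) { lo += min; hi += max; }
  return [lo, hi]; // → [2500, 5820] ms
}
```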
>_Research Before Build: The Adversarial Protocol
Before a single line of code was written, the architecture was stress-tested by three autonomous personas - The Visionary, The Empiricist, and The Critic - under our adversarial research protocol. The objective was to expose weak assumptions before code locked them into production.
The Visionary
Expansive, opportunity-focused, and biased toward strategic upside. Proposes architectures at ambition scale.
The Empiricist
Facts-only, evidence-grounded, and intolerant of unsupported latency or compatibility claims.
The Critic
Adversarial by design, tasked with breaking weak arguments and uncovering hidden dependency risks.
The protocol produced 2 kill-grade flaws and 3 significant wounds. Every finding arrived before implementation started.
KILL 1 - VAD Library Rejection
The proposed voice activity detection library (@ricky0123/vad-node v0.0.3) was pre-alpha with 8+ months of stagnation. Critically, it required onnxruntime-node@1.21.0 - directly conflicting with kokoro-js's requirement for onnxruntime-node@1.24.2. Building production reliability on a pre-alpha dependency with an irresolvable version conflict was rejected outright.
KILL 2 - GPU-Accelerated Local ASR
GPU-accelerated local speech recognition was assumed viable. The Empiricist checked the driver and platform fit before implementation and found the path too brittle for the first release. CPU-first and portable fallback paths were preserved instead.
Wounds: Survivable but Requiring Treatment
Methodology Validation
These failures were discovered before implementation began. This was not brainstorming - it was adversarial research that removed failure paths before they reached production. Both kills would have caused days of debugging and rework had they been discovered post-implementation.
>_The ONNX Version Conflict
Technical Root Cause
Both the Silero VAD library and kokoro-js are ONNX-based - but they require different versions of onnxruntime-node:
@ricky0123/vad-node
requires onnxruntime-node@1.21.0
kokoro-js
requires onnxruntime-node@1.24.2
Rather than fight version pinning - a battle with no clean resolution - we dropped the external VAD library entirely and used @discordjs/voice's built-in EndBehaviorType.AfterSilence behavior (800 ms duration) combined with an RMS energy gate that discards near-silent frames before any API call. Zero ONNX conflicts. Zero VAD library install risk. The silence gate is deterministic, predictable, and maintenance-free.
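The replacement logic is deterministic enough to express as a pure function. A sketch of the silence-close rule over a stream of per-frame RMS energies (20 ms Opus frames assumed; constants match the shipped configuration, the function name is illustrative):

```typescript
const FRAME_MS = 20;          // Discord Opus frame duration
const SILENCE_CLOSE_MS = 800; // silence budget that ends a turn
const RMS_THRESHOLD = 0.008;  // energy gate threshold

// Returns the index of the frame that closes the utterance, or -1 if
// the stream ended without accumulating 800ms of consecutive silence.
function utteranceCloseIndex(frameEnergies: number[]): number {
  let silentMs = 0;
  for (let i = 0; i < frameEnergies.length; i++) {
    // Any loud frame resets the silence budget.
    silentMs = frameEnergies[i] < RMS_THRESHOLD ? silentMs + FRAME_MS : 0;
    if (silentMs >= SILENCE_CLOSE_MS) return i;
  }
  return -1;
}
```

No model, no runtime dependency, no version pin: the whole VAD layer is a counter and a comparison.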
>_TTS Engine Comparison
Five TTS engines were evaluated before selecting kokoro-js as the default. The selection criteria were latency, output quality, per-turn cost, and installation complexity. The winner had to be good enough to not embarrass the agent, fast enough to not break the conversational loop, and cheap enough to run at any volume.
| Engine | Latency | Cost |
|---|---|---|
| ★kokoro-js | <300ms warm | $0/turn |
| piper-tts | <1s | $0/turn |
| edge-tts | 1-2s | $0 cloud |
| espeak-ng | <100ms | $0/turn |
| ElevenLabs Starter | 300-800ms | $0.108/turn |
kokoro-js ships as a pure npm dependency with an 82MB ONNX model file. No separate binary installation, no Python environment, no API key. The model loads once at startup and warm synthesis completes in under 300 milliseconds - faster than any cloud TTS option at zero per-turn cost.
>_The Subprocess Isolation Pattern
This was the most unexpected engineering challenge of the entire build.
The Synthesis Isolation Problem
The local speech stack could terminate the host process during cleanup if synthesis ran inline - a single bad teardown would take the entire voice bot down mid-conversation.
The solution: speech synthesis runs behind an isolated worker boundary, so playback can fail safely without taking the main process down.
Main process sends text to an isolated synthesis worker.
The worker generates speech and returns a complete audio buffer.
Worker-level failures stay contained inside the synthesis boundary.
The voice process remains available for the next turn.
The worker pattern converts a show-stopping synthesis failure into a contained side effect. The main voice process stays available.
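The shape of the pattern can be sketched with a child process standing in for the real kokoro-js worker. This is a simplified, synchronous illustration - the shipped worker is asynchronous - and the inline script and function name are hypothetical:

```typescript
import { spawnSync } from "node:child_process";

// Synthesis runs in a separate process, so a worker crash surfaces as an
// error result in the parent rather than a parent-process exit.
function synthesizeIsolated(text: string): { ok: boolean; audio?: Buffer } {
  const worker = spawnSync("node", [
    "-e",
    // Simulated worker: crash on empty input, otherwise emit fake audio.
    `const t = process.argv[1];
     if (!t) process.exit(1);
     process.stdout.write("fake-audio:" + t);`,
    text,
  ]);
  if (worker.status !== 0) return { ok: false }; // contained failure
  return { ok: true, audio: worker.stdout };
}
```

A non-zero exit from the worker becomes `{ ok: false }` and the main loop simply moves to the next turn - exactly the containment the prose above describes.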
>_TTS Benchmark Results
Measured on a modern 6-core x86_64 CPU, fp32 precision, cold subprocess load (first synthesis after spawn includes model load time). RTF = Real Time Factor - how long synthesis takes relative to audio playback duration. RTF below 1.0 means the synthesis completes before the audio finishes playing.
| Phrase | Wall Time | RTF |
|---|---|---|
| "Acknowledged." | 1908ms | 1.14x |
| "Standby." | 1964ms | 1.21x |
| "The operation has completed successfully..." | 2839ms | 0.56x |
| "I am ready for your next command." | 2554ms | 0.76x |
RTF < 1.0 means the agent is thinking ahead. For longer phrases, kokoro-js finishes synthesis before the audio finishes playing - the pipeline can begin buffering the next response while the current one is still in the speaker. Cold subprocess load dominates for short phrases. Warm synthesis (second call and beyond) consistently delivers under 300ms.
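The RTF arithmetic is worth making explicit, since it drives the buffering claim. A trivial sketch (function names are illustrative):

```typescript
// RTF = synthesis wall time / audio playback duration.
function realTimeFactor(synthMs: number, audioMs: number): number {
  return synthMs / audioMs;
}

// Below 1.0, synthesis finishes before playback does, so the pipeline
// can start buffering the next response mid-playback.
function canBufferAhead(rtf: number): boolean {
  return rtf < 1.0;
}
```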
>_Cost Analysis
Voice quality has a severe cost gradient. Premium cloud voice generation is roughly 200 times more expensive than local neural synthesis for the same conversational workload. The choice of TTS engine is as much a financial decision as a technical one.
★kokoro-js + Whisper API + Gemini Flash
~$0.75/month at 50/day
$0.0005/turn
★★kokoro-js + whisper.cpp local + Gemini Flash
$0.00/month
$0.00/turn
ElevenLabs Starter + Gemini Flash
~$163/month at 50/day
$0.108/turn
ElevenLabs + Claude Opus fallback
~$288/month at 50/day
~$0.19/turn
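The monthly figures above follow from straightforward per-turn arithmetic (small rounding differences against the quoted numbers are expected):

```typescript
// per-turn cost x turns/day x days in a month.
function monthlyCost(perTurnUsd: number, turnsPerDay = 50, days = 30): number {
  return perTurnUsd * turnsPerDay * days;
}
// kokoro-js + Whisper API tier: 0.0005 * 50 * 30 ≈ $0.75/month
// ElevenLabs Starter tier:      0.108  * 50 * 30 ≈ $162/month
```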
Cost Constraint
Tiered voice is not optional - it is economic control. The fully local path (whisper.cpp + kokoro-js + Gemini Flash) runs at genuine zero API cost in daily operation. Premium cloud TTS is reserved for moments where clarity carries true operational value.
>_Security Model
Voice access is enforced at protocol level, following the same ForgeOps doctrine that governs Greyforge operational interfaces. The bot processes only authorized voice input. No ambient microphone access, no open shared-channel capture.
Voice-sourced commands run through a restricted tool allowlist with no destructive operations permitted over voice alone.
Any destructive action requires explicit text confirmation, regardless of voice confidence score or utterance clarity.
Prompt injection risk from misrecognized utterances is mitigated through command gating and confirmation boundaries enforced at the Gateway layer.
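The gating rules above reduce to a small decision function. A sketch under stated assumptions - the allowlist and destructive-tool contents are illustrative, not the shipped policy:

```typescript
// Tools safe to run from voice alone vs. tools that always need a
// typed confirmation (illustrative entries only).
const VOICE_ALLOWLIST = new Set(["status", "list_tasks", "read_logs"]);
const DESTRUCTIVE = new Set(["delete", "deploy", "shutdown"]);

type GateResult = "execute" | "require_text_confirmation" | "reject";

function gateVoiceCommand(tool: string): GateResult {
  // Destructive tools never run on voice input alone, regardless of
  // transcription confidence.
  if (DESTRUCTIVE.has(tool)) return "require_text_confirmation";
  if (VOICE_ALLOWLIST.has(tool)) return "execute";
  return "reject"; // unknown tools are dropped at the Gateway
}
```

Because the gate keys on the tool name rather than the transcript, a misrecognized or injected utterance can at worst trigger a confirmation prompt, never a destructive action.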
>_What Shipped
The VoiceOps system is functional and running live tests across the OpenClaw agent environment. Voice activity benchmarks confirm sub-1ms processing per audio frame after the one-time cold-start initialization.
Delivered Capabilities
- Full-duplex voice pipeline: speak naturally, receive spoken responses, no mode-switching.
- Silence-gated voice activity detection with no heavyweight model dependency.
- Whisper ASR integration with practical short-turn transcription latency.
- Gateway integration for full agent reasoning.
- kokoro-js TTS with isolated worker containment.
- Tiered TTS architecture with local-first synthesis and fallback paths.
- Authorized-user voice isolation.
- Voice-sourced command restrictions with confirmation guardrails for all destructive operations.
- Pre-build adversarial audit with 2 kills and 3 wounds caught before a single line of implementation code was written.
>_Open Source on GitHub
VoiceOps is now open source. The full implementation - pipeline, ASR integration, TTS subprocess worker, Gateway WebSocket client, and voice manager - is on GitHub. Inspect it, fork it, run it against your own OpenClaw instance.
GreyforgeLabs / voiceops
Full-duplex voice bot for OpenClaw. Discord voice capture → Whisper ASR → Gateway v3 agent → kokoro-js TTS → Discord playback.
github.com/GreyforgeLabs/voiceops
Complete Pipeline
All 6 stages implemented and documented
TTS Worker
Isolated local synthesis boundary
Gateway Client
Streaming agent response path