The Challenge of Voice Integration
A technical exploration of adding text-to-speech and voice commands to AI agent environments.
The Vision
Hands-free AI interaction is a significant productivity multiplier. Imagine a development environment where you can speak commands, hear summaries read aloud, and stay in flow without touching the keyboard or mouse. This is the vision driving voice integration in ForgeClaw and ForgeCast.
The implementation, however, involves navigating a complex landscape of text-to-speech engines, speech recognition services, audio processing, and real-time streaming — each with its own constraints and tradeoffs.
Text-to-Speech: The Options
Modern TTS has evolved dramatically. The options range from fully local processing to cloud-based neural synthesis:
Local Engines (eSpeak, Piper)
Pros: Zero latency, privacy-preserving, works offline
Cons: Quality ranges from robotic (eSpeak) to merely serviceable (Piper), limited voice variety, little emotional inflection
Cloud Neural (ElevenLabs, OpenAI TTS)
Pros: Human-quality voices, emotional range, cloning capabilities
Cons: Latency (500ms-2s), cost per character, requires network
Hybrid (Coqui, Tortoise)
Pros: Local inference with neural quality, trainable voices
Cons: GPU requirements, slower than cloud streaming, complex setup
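Since no single engine wins on every axis, one pragmatic step is to hide the choice behind a common interface so backends can be swapped as constraints change. The sketch below is illustrative only: the `TTSBackend` protocol and the injected cloud client are hypothetical, not ForgeClaw APIs.

```python
# Minimal sketch of a swappable TTS abstraction. Names are illustrative.
from abc import ABC, abstractmethod
import subprocess


class TTSBackend(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Return raw audio bytes for the given text."""


class EspeakBackend(TTSBackend):
    """Local, near-instant, robotic. Requires the espeak-ng binary on PATH."""

    def synthesize(self, text: str) -> bytes:
        # espeak-ng writes a WAV stream to stdout when given --stdout
        result = subprocess.run(
            ["espeak-ng", "--stdout", text],
            capture_output=True, check=True,
        )
        return result.stdout


class CloudNeuralBackend(TTSBackend):
    """Cloud neural voice: higher quality, adds network latency and per-character cost."""

    def __init__(self, client):
        self.client = client  # injected provider SDK/HTTP client, deliberately abstract

    def synthesize(self, text: str) -> bytes:
        # Provider-specific request goes here; details vary by vendor.
        return self.client.synthesize(text)
```

The rest of the agent then talks to `TTSBackend` and never cares whether the audio came from a local process or a network call.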
The Latency Problem
In an AI agent context, the voice pipeline compounds existing latencies. Consider the full request cycle:
- Speech recognition: 200-800ms depending on engine
- Agent processing: 1-5s for complex queries (LLM inference)
- Text-to-speech: 500ms-2s for cloud neural voices
That puts the total round-trip at 1.7-7.8 seconds: easily perceptible, and frustrating for interactive use.
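For reference, the round-trip range is just the sum of the per-stage estimates quoted above (estimates, not new measurements):

```python
# Back-of-the-envelope check of the quoted round-trip range, in milliseconds.
STAGES_MS = {
    "speech_recognition": (200, 800),
    "agent_processing": (1000, 5000),
    "text_to_speech": (500, 2000),
}

best = sum(low for low, _ in STAGES_MS.values()) / 1000    # 1.7 s
worst = sum(high for _, high in STAGES_MS.values()) / 1000  # 7.8 s
print(f"round-trip: {best:.1f}-{worst:.1f} s")
```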
The solution involves streaming — both for LLM responses (token-by-token output) and TTS (audio chunk streaming). But implementing end-to-end streaming is architecturally complex. The agent kernel wasn't designed for partial responses; it expects complete text before proceeding.
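Conceptually, the streaming path looks something like the sketch below: group the token stream into sentence-sized chunks and hand each chunk to TTS while the model is still generating, so the first audio starts playing long before the full response exists. The `tts` and `player` objects are assumed interfaces, not existing ForgeClaw components.

```python
# Sketch of overlapping LLM generation with speech synthesis.
from typing import AsyncIterator


async def sentences(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Buffer streamed tokens and yield roughly sentence-sized chunks."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")) and len(buffer) > 20:
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()


async def speak_streaming(tokens: AsyncIterator[str], tts, player) -> None:
    """Synthesize each chunk as it completes instead of waiting for full text."""
    async for chunk in sentences(tokens):
        audio = await tts.synthesize(chunk)   # assumed async TTS call
        await player.enqueue(audio)           # assumed playback queue
```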
Key Challenge
Retrofitting streaming into an existing synchronous architecture requires careful state management. Partial responses must be cancelable, audio queues must be interruptible, and the UI must gracefully handle incomplete output.
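One way to express the interruptibility requirement, as a rough asyncio sketch rather than the actual kernel code, is a playback queue whose consumer task can be cancelled mid-utterance, discarding anything not yet spoken. It pairs with the `player.enqueue` call in the streaming sketch above.

```python
# Illustrative interruptible audio queue; not the ForgeClaw implementation.
import asyncio


class InterruptiblePlayer:
    def __init__(self, play_chunk):
        self._play_chunk = play_chunk      # assumed coroutine that plays raw audio bytes
        self._queue: asyncio.Queue[bytes] = asyncio.Queue()
        self._task: asyncio.Task | None = None

    async def enqueue(self, audio: bytes) -> None:
        await self._queue.put(audio)
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        # Play queued chunks in order until the queue runs dry.
        while not self._queue.empty():
            await self._play_chunk(self._queue.get_nowait())

    def interrupt(self) -> None:
        """Stop speaking now: cancel playback and drop queued audio."""
        if self._task is not None:
            self._task.cancel()
        while not self._queue.empty():
            self._queue.get_nowait()
```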
Voice Commands: Beyond Transcription
Speech recognition gives us text, but that's just the beginning. The text must be:
- Intent classified: Is this a command, a query, or casual conversation?
- Wake word filtered: Continuous listening needs activation detection
- Noise filtered: Environmental sounds shouldn't trigger actions
- Confirmation handled: Destructive actions need voice affirmation
The agent's existing command parsing can handle the text once recognized, but the intermediate pipeline, sketched below, requires new infrastructure.
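As a rough illustration of that intermediate pipeline, the sketch below uses placeholder keyword rules, a hypothetical wake word, and a simple confidence cutoff; a production classifier would more likely be model-based.

```python
# Illustrative intent gate: confidence filter, wake-word check, intent bucket,
# and a flag for voice confirmation on destructive phrasing.
from dataclasses import dataclass
from enum import Enum, auto


class Intent(Enum):
    COMMAND = auto()
    QUERY = auto()
    CHAT = auto()


@dataclass
class VoiceInput:
    text: str
    confidence: float  # recognizer confidence, 0.0-1.0


WAKE_WORD = "forge"                       # placeholder wake word
DESTRUCTIVE = ("delete", "remove", "reset")


def classify(utterance: VoiceInput) -> Intent | None:
    """Drop low-confidence noise, require the wake word, then bucket intent."""
    if utterance.confidence < 0.6:
        return None                        # likely environmental noise
    text = utterance.text.lower().strip()
    if not text.startswith(WAKE_WORD):
        return None                        # not addressed to the agent
    if text.rstrip().endswith("?"):
        return Intent.QUERY
    return Intent.COMMAND if len(text.split()) <= 8 else Intent.CHAT


def requires_confirmation(text: str) -> bool:
    """Destructive phrasing should be read back and confirmed by voice."""
    return any(verb in text.lower() for verb in DESTRUCTIVE)
```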
Current Implementation State
ForgeClaw's social/voice.py module currently contains a basic TTS wrapper using system speech synthesis. It works for notifications and short announcements, but falls short of the full vision.
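For context, a wrapper of that shape might look like the following sketch, which uses pyttsx3 to delegate to the operating system's speech engine. This is a guess at the pattern, not the actual contents of social/voice.py.

```python
# Hypothetical system-TTS wrapper: blocking speech via the OS engine.
import pyttsx3


class SystemVoice:
    def __init__(self, voice_name: str | None = None):
        self._engine = pyttsx3.init()
        if voice_name:
            # Pick the first installed system voice whose name matches.
            for voice in self._engine.getProperty("voices"):
                if voice_name.lower() in voice.name.lower():
                    self._engine.setProperty("voice", voice.id)
                    break

    def announce(self, text: str) -> None:
        """Blocking speech: fine for short notifications, unsuitable for streaming."""
        self._engine.say(text)
        self._engine.runAndWait()
```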
Current Capabilities
- ✓ Basic TTS for notifications
- ✓ System voice selection
- ○ Streaming TTS (partial)
- ✗ Speech recognition integration
- ✗ Wake word detection
- ✗ Voice command parsing
The Path Forward
Completing voice integration requires several focused efforts:
Phase 1: Streaming Architecture
Refactor the kernel to support token-level response streaming and cancelable operations.
Phase 2: Recognition Pipeline
Integrate Whisper or similar for local speech-to-text with wake word detection.
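As a starting point, a minimal local transcription step could look like the sketch below, using the openai-whisper package and a naive wake-word check on the transcript. Real wake-word detection would run on the audio stream with a lightweight keyword spotter rather than on full transcriptions; this is only the simplest possible baseline.

```python
# Local speech-to-text with a naive wake-word check on the transcript.
import whisper

model = whisper.load_model("base")          # small enough for CPU inference


def transcribe_command(wav_path: str, wake_word: str = "forge") -> str | None:
    """Return the command text if the wake word is present, else None."""
    result = model.transcribe(wav_path)
    text = result["text"].strip()
    lowered = text.lower()
    if wake_word in lowered:
        # Keep only what follows the wake word.
        return text[lowered.index(wake_word) + len(wake_word):].strip(" ,.:")
    return None
```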
Phase 3: Neural TTS
Implement high-quality neural voice synthesis with streaming playback and voices matched to each agent's personality.
Voice integration is one of the most requested features for our agent platforms. The technical challenges are substantial but solvable. As we work through each phase, we'll share what we learn in these chronicles.