Building a Self-Hosted Voice AI Assistant with LiveKit Agents
Building a Self-Hosted Voice AI Assistant with LiveKit Agents
Voice assistants are everywhere — Siri, Alexa, Google Assistant. But they all share the same fundamental trade-off: your audio leaves your device, gets processed on someone else's servers, and your conversation history lives on infrastructure you don't control.
I wanted something different: a voice AI that runs entirely on my own machine, with real-time conversation, no cloud dependencies, and the freedom to swap any component of the pipeline. So I built Voice AI Demo — a fully self-hosted, open-source conversational voice AI assistant powered by LiveKit Agents.
In this post I'll walk through the architecture, the four-stage voice pipeline, how LiveKit orchestrates WebRTC media, and the production patterns that make it reliable.
The Problem
Building a voice AI assistant locally means solving five hard problems:
- Real-time audio transport — You can't use HTTP for streaming audio. You need WebRTC with low-latency media channels
- Voice Activity Detection — When does the user start speaking? When do they stop? Naive approaches break on background noise
- Speech-to-Text — Transcribing audio locally requires a model that's both accurate and fast enough for real-time use
- Language Understanding — The LLM needs to respond naturally and fast. Cloud APIs add latency and privacy concerns
- Text-to-Speech — Synthesizing natural-sounding audio locally without sounding robotic
LiveKit Agents solves problems 1 and 5 elegantly. The rest is about choosing the right models and wiring them together resiliently.
Architecture Overview
At a glance:
Browser → LiveKit (WebRTC) → Agent Worker → VAD → STT → LLM → TTS
↕ ↕
LiveKit Server Ollama (local)
Core components:
| Layer | Technology | Role |
|---|---|---|
| Frontend | Next.js 15, React 19, Agents UI | Voice interface with mic controls, audio visualizer, chat transcript |
| Agent Server | Python 3.12, FastAPI, LiveKit Agents SDK | JWT token generation, health check, pipeline orchestration |
| Media | LiveKit Server (Docker/Go) | WebRTC SFU, room management, job dispatch |
| VAD | Silero VAD | Speech segment and utterance boundary detection |
| STT | faster-whisper (large-v3-turbo) | Local speech-to-text transcription |
| LLM | Ollama (llama3.2:3b) | Local language model via OpenAI-compatible API |
| TTS | Edge-TTS | Microsoft Edge TTS engine for speech synthesis |
How LiveKit Agents Works
LiveKit Agents is the backbone of this system. It provides:
- WebRTC SFU — A Selective Forwarding Unit that routes media between browsers and agent workers
- PipelineAgent — A high-level abstraction that processes audio through config stages (VAD → STT → LLM → TTS)
- Job dispatch — When a user connects, LiveKit automatically dispatches a job to an available agent worker
- JWT authentication — Secure token-based room access with configurable permissions
The data flow is elegant:
- User opens
http://localhost:3000in a browser - Frontend calls
GET /tokenon FastAPI → receives a signed LiveKit JWT - Browser connects to LiveKit Server via WebRTC using the JWT
- LiveKit dispatches a job to the Agent Worker (background thread)
- Agent processes audio through the voice pipeline
- Response audio streams back through LiveKit → browser plays it in real-time
The 4-Stage Voice Pipeline
Stage 1: VAD (Voice Activity Detection)
Silero VAD detects when speech starts and ends. This is critical for two reasons:
- Utterance detection — Know when the user has finished speaking so the pipeline can start processing
- Noise filtering — Silence and background noise are discarded before they reach the STT model
Silero is pre-trained, tiny (~1.7MB), and runs efficiently on CPU. It operates on 30ms audio frames and returns a probability score between 0 and 1 for each frame.
Stage 2: STT (Speech-to-Text)
faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2. It's 4x faster than the original while maintaining accuracy.
The pipeline uses large-v3-turbo — a distilled model that balances accuracy and speed. On Apple Silicon, Metal GPU acceleration brings inference time well under real-time:
Audio duration: 5 seconds
Processing time: ~1.2 seconds (Metal GPU)
Real-time factor: 0.24x
The first run downloads the model (~3GB). Subsequent runs load from cache and are instant.
Stage 3: LLM (Language Model)
Ollama serves Llama 3.2 3B locally via an OpenAI-compatible API. The agent sends transcribed text along with a system prompt, and Ollama streams tokens back.
The model is small enough to run on a MacBook with good performance:
Prompt processing: ~30 tokens/s
Token generation: ~25 tokens/s
End-to-end response: 1-3 seconds for typical queries
The magic is that you can swap to any OpenAI-compatible provider at runtime. Just change .env:
LLM_PROVIDER=ollama # or openai_compatible LLM_MODEL=llama3.2:3b # or llama-3.3-70b-versatile (Groq) LLM_BASE_URL=http://localhost:11434/v1 # or https://api.groq.com/openai/v1
Stage 4: TTS (Text-to-Speech)
Edge-TTS uses Microsoft's Edge browser TTS engine under the hood. It produces natural-sounding speech with multiple voice options (en-US-AriaNeural is the default).
The TTS runs locally, streams audio as it generates, and feeds it back through LiveKit's audio track to the browser. The whole round-trip — from user speaking to hearing the response — completes in under 2 seconds on Apple Silicon.
Safe Wrappers: Error Resilience
Each pipeline stage has a Safe wrapper that catches failures and provides graceful fallbacks:
class SafeSTT: async def transcribe(self, audio: AudioFrame) -> str: try: return await self.stt.transcribe(audio) except Exception as e: logger.error(f"STT failed: {e}") return "" # Empty transcript — LLM handles gracefully
This means a single model failure doesn't crash the entire conversation. If Whisper fails, the user gets a "I didn't catch that" response. If Ollama fails, TTS gets a fallback message. The conversation keeps going.
Frontend: Agents UI
The Next.js frontend uses LiveKit's Agents UI — a shadcn-based React component library:
- Audio visualizer — Real-time waveform display of microphone input
- Chat transcript — Scrollable conversation history with timestamps
- Mic controls — Mute/unmute, push-to-talk, and connection status
- Dark theme — Optimized for the voice-first interface
The frontend is deliberately minimal. The complexity lives in the agent pipeline, not the UI.
Provider Switching
The most powerful feature: every component of the pipeline is swappable at runtime. Here are three example configurations:
Fully local (default):
STT_PROVIDER=whisper
LLM_PROVIDER=ollama
TTS_PROVIDER=edge-tts
Zero API keys required. Everything runs on your machine.
Cloud-powered (faster responses):
STT_PROVIDER=openai_compatible STT_MODEL=whisper-1 STT_BASE_URL=https://api.openai.com/v1 # Add OPENAI_API_KEY to .env
OpenAI's Whisper API is faster than local for very long audio. Good trade-off if you have an API key.
Hybrid (best of both):
LLM_PROVIDER=openai_compatible LLM_MODEL=llama-3.3-70b-versatile LLM_BASE_URL=https://api.groq.com/v1 # Add GROQ_API_KEY to .env
Free Groq inference for the LLM, local Whisper for STT, local Edge-TTS for speech. Fast AND free.
Quick Start
Getting the full stack running takes about 5 minutes:
# 1. Start LiveKit Server (Docker) cp .env.example .env docker compose up -d # 2. Pull LLM model ollama pull llama3.2:3b # 3. Start Agent Service cd agent python -m venv .venv && source .venv/bin/activate pip install -r requirements.txt python main.py # 4. Start Frontend cd frontend npm install npm run dev
Open http://localhost:3000 and click "Start audio". You're talking to a fully local voice AI.
Key Takeaways
- LiveKit Agents is the right abstraction — It handles WebRTC complexity, media routing, job dispatch, and pipeline orchestration so you focus on the AI models
- Local models are production-viable — Whisper large-v3-turbo + Llama 3.2 3B + Edge-TTS on Apple Silicon delivers sub-2s latency with zero API costs
- Provider swappability is essential — The ability to switch STT, LLM, or TTS at runtime (not rebuild time) makes the system adaptable to any environment
- Safe wrappers prevent cascade failures — A single model crash should never take down the conversation. Graceful degradation keeps the UX intact
- Docker for the hard parts — LiveKit Server runs in Docker. The agent and frontend run natively. This separation keeps development fast and deployment flexible
Code & Resources
- Full source code: github.com/aiwithvd/voiceai
- Docker setup: One-command LiveKit server with
docker compose up -d - Testing: 15 agent tests + 4 frontend tests +
--test-modefor CI - Models: Default pipeline uses zero cloud models; Ollama requires one
ollama pull
Want to build production voice AI systems? Let's connect. I advise teams on real-time AI architecture, voice pipeline design, and self-hosted deployment strategies.