LiveKitVoice AIFastAPIWhisperOllamaWebRTCNext.jsSelf-HostedEdge-TTSReal-Time Audio

Building a Self-Hosted Voice AI Assistant with LiveKit Agents

Published May 13, 20267 min read

Building a Self-Hosted Voice AI Assistant with LiveKit Agents

Voice assistants are everywhere — Siri, Alexa, Google Assistant. But they all share the same fundamental trade-off: your audio leaves your device, gets processed on someone else's servers, and your conversation history lives on infrastructure you don't control.

I wanted something different: a voice AI that runs entirely on my own machine, with real-time conversation, no cloud dependencies, and the freedom to swap any component of the pipeline. So I built Voice AI Demo — a fully self-hosted, open-source conversational voice AI assistant powered by LiveKit Agents.

In this post I'll walk through the architecture, the four-stage voice pipeline, how LiveKit orchestrates WebRTC media, and the production patterns that make it reliable.

The Problem

Building a voice AI assistant locally means solving five hard problems:

Real-time audio transport — You can't use HTTP for streaming audio. You need WebRTC with low-latency media channels
Voice Activity Detection — When does the user start speaking? When do they stop? Naive approaches break on background noise
Speech-to-Text — Transcribing audio locally requires a model that's both accurate and fast enough for real-time use
Language Understanding — The LLM needs to respond naturally and fast. Cloud APIs add latency and privacy concerns
Text-to-Speech — Synthesizing natural-sounding audio locally without sounding robotic

LiveKit Agents solves problems 1 and 5 elegantly. The rest is about choosing the right models and wiring them together resiliently.

Architecture Overview

Voice AI Demo Architecture

At a glance:

Browser → LiveKit (WebRTC) → Agent Worker → VAD → STT → LLM → TTS
                  ↕                         ↕
            LiveKit Server              Ollama (local)

Core components:

Layer	Technology	Role
Frontend	Next.js 15, React 19, Agents UI	Voice interface with mic controls, audio visualizer, chat transcript
Agent Server	Python 3.12, FastAPI, LiveKit Agents SDK	JWT token generation, health check, pipeline orchestration
Media	LiveKit Server (Docker/Go)	WebRTC SFU, room management, job dispatch
VAD	Silero VAD	Speech segment and utterance boundary detection
STT	faster-whisper (large-v3-turbo)	Local speech-to-text transcription
LLM	Ollama (llama3.2:3b)	Local language model via OpenAI-compatible API
TTS	Edge-TTS	Microsoft Edge TTS engine for speech synthesis

How LiveKit Agents Works

LiveKit Agents is the backbone of this system. It provides:

WebRTC SFU — A Selective Forwarding Unit that routes media between browsers and agent workers
PipelineAgent — A high-level abstraction that processes audio through config stages (VAD → STT → LLM → TTS)
Job dispatch — When a user connects, LiveKit automatically dispatches a job to an available agent worker
JWT authentication — Secure token-based room access with configurable permissions

The data flow is elegant:

User opens http://localhost:3000 in a browser
Frontend calls GET /token on FastAPI → receives a signed LiveKit JWT
Browser connects to LiveKit Server via WebRTC using the JWT
LiveKit dispatches a job to the Agent Worker (background thread)
Agent processes audio through the voice pipeline
Response audio streams back through LiveKit → browser plays it in real-time

The 4-Stage Voice Pipeline

Stage 1: VAD (Voice Activity Detection)

Silero VAD detects when speech starts and ends. This is critical for two reasons:

Utterance detection — Know when the user has finished speaking so the pipeline can start processing
Noise filtering — Silence and background noise are discarded before they reach the STT model

Silero is pre-trained, tiny (~1.7MB), and runs efficiently on CPU. It operates on 30ms audio frames and returns a probability score between 0 and 1 for each frame.

Stage 2: STT (Speech-to-Text)

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2. It's 4x faster than the original while maintaining accuracy.

The pipeline uses large-v3-turbo — a distilled model that balances accuracy and speed. On Apple Silicon, Metal GPU acceleration brings inference time well under real-time:

Audio duration: 5 seconds
Processing time: ~1.2 seconds (Metal GPU)
Real-time factor: 0.24x

The first run downloads the model (~3GB). Subsequent runs load from cache and are instant.

Stage 3: LLM (Language Model)

Ollama serves Llama 3.2 3B locally via an OpenAI-compatible API. The agent sends transcribed text along with a system prompt, and Ollama streams tokens back.

The model is small enough to run on a MacBook with good performance:

Prompt processing: ~30 tokens/s
Token generation: ~25 tokens/s
End-to-end response: 1-3 seconds for typical queries

The magic is that you can swap to any OpenAI-compatible provider at runtime. Just change .env:

LLM_PROVIDER=ollama  # or openai_compatible
LLM_MODEL=llama3.2:3b  # or llama-3.3-70b-versatile (Groq)
LLM_BASE_URL=http://localhost:11434/v1  # or https://api.groq.com/openai/v1

Stage 4: TTS (Text-to-Speech)

Edge-TTS uses Microsoft's Edge browser TTS engine under the hood. It produces natural-sounding speech with multiple voice options (en-US-AriaNeural is the default).

The TTS runs locally, streams audio as it generates, and feeds it back through LiveKit's audio track to the browser. The whole round-trip — from user speaking to hearing the response — completes in under 2 seconds on Apple Silicon.

Safe Wrappers: Error Resilience

Each pipeline stage has a Safe wrapper that catches failures and provides graceful fallbacks:

class SafeSTT:
    async def transcribe(self, audio: AudioFrame) -> str:
        try:
            return await self.stt.transcribe(audio)
        except Exception as e:
            logger.error(f"STT failed: {e}")
            return ""  # Empty transcript — LLM handles gracefully

This means a single model failure doesn't crash the entire conversation. If Whisper fails, the user gets a "I didn't catch that" response. If Ollama fails, TTS gets a fallback message. The conversation keeps going.

Frontend: Agents UI

The Next.js frontend uses LiveKit's Agents UI — a shadcn-based React component library:

Audio visualizer — Real-time waveform display of microphone input
Chat transcript — Scrollable conversation history with timestamps
Mic controls — Mute/unmute, push-to-talk, and connection status
Dark theme — Optimized for the voice-first interface

The frontend is deliberately minimal. The complexity lives in the agent pipeline, not the UI.

Provider Switching

The most powerful feature: every component of the pipeline is swappable at runtime. Here are three example configurations:

Fully local (default):

STT_PROVIDER=whisper
LLM_PROVIDER=ollama
TTS_PROVIDER=edge-tts

Zero API keys required. Everything runs on your machine.

Cloud-powered (faster responses):

STT_PROVIDER=openai_compatible
STT_MODEL=whisper-1
STT_BASE_URL=https://api.openai.com/v1
# Add OPENAI_API_KEY to .env

OpenAI's Whisper API is faster than local for very long audio. Good trade-off if you have an API key.

Hybrid (best of both):

LLM_PROVIDER=openai_compatible
LLM_MODEL=llama-3.3-70b-versatile
LLM_BASE_URL=https://api.groq.com/v1
# Add GROQ_API_KEY to .env

Free Groq inference for the LLM, local Whisper for STT, local Edge-TTS for speech. Fast AND free.

Quick Start

Getting the full stack running takes about 5 minutes:

# 1. Start LiveKit Server (Docker)
cp .env.example .env
docker compose up -d

# 2. Pull LLM model
ollama pull llama3.2:3b

# 3. Start Agent Service
cd agent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python main.py

# 4. Start Frontend
cd frontend
npm install
npm run dev

Open http://localhost:3000 and click "Start audio". You're talking to a fully local voice AI.

Key Takeaways

LiveKit Agents is the right abstraction — It handles WebRTC complexity, media routing, job dispatch, and pipeline orchestration so you focus on the AI models
Local models are production-viable — Whisper large-v3-turbo + Llama 3.2 3B + Edge-TTS on Apple Silicon delivers sub-2s latency with zero API costs
Provider swappability is essential — The ability to switch STT, LLM, or TTS at runtime (not rebuild time) makes the system adaptable to any environment
Safe wrappers prevent cascade failures — A single model crash should never take down the conversation. Graceful degradation keeps the UX intact
Docker for the hard parts — LiveKit Server runs in Docker. The agent and frontend run natively. This separation keeps development fast and deployment flexible

Code & Resources

Full source code: github.com/aiwithvd/voiceai
Docker setup: One-command LiveKit server with docker compose up -d
Testing: 15 agent tests + 4 frontend tests + --test-mode for CI
Models: Default pipeline uses zero cloud models; Ollama requires one ollama pull

Want to build production voice AI systems? Let's connect. I advise teams on real-time AI architecture, voice pipeline design, and self-hosted deployment strategies.