Back to Blog
LiveKitVoice AIFastAPIWhisperOllamaWebRTCNext.jsSelf-HostedEdge-TTSReal-Time Audio

Building a Self-Hosted Voice AI Assistant with LiveKit Agents

Published May 13, 20267 min read
Building a Self-Hosted Voice AI Assistant with LiveKit Agents

Building a Self-Hosted Voice AI Assistant with LiveKit Agents

Voice assistants are everywhere — Siri, Alexa, Google Assistant. But they all share the same fundamental trade-off: your audio leaves your device, gets processed on someone else's servers, and your conversation history lives on infrastructure you don't control.

I wanted something different: a voice AI that runs entirely on my own machine, with real-time conversation, no cloud dependencies, and the freedom to swap any component of the pipeline. So I built Voice AI Demo — a fully self-hosted, open-source conversational voice AI assistant powered by LiveKit Agents.

In this post I'll walk through the architecture, the four-stage voice pipeline, how LiveKit orchestrates WebRTC media, and the production patterns that make it reliable.

The Problem

Building a voice AI assistant locally means solving five hard problems:

  1. Real-time audio transport — You can't use HTTP for streaming audio. You need WebRTC with low-latency media channels
  2. Voice Activity Detection — When does the user start speaking? When do they stop? Naive approaches break on background noise
  3. Speech-to-Text — Transcribing audio locally requires a model that's both accurate and fast enough for real-time use
  4. Language Understanding — The LLM needs to respond naturally and fast. Cloud APIs add latency and privacy concerns
  5. Text-to-Speech — Synthesizing natural-sounding audio locally without sounding robotic

LiveKit Agents solves problems 1 and 5 elegantly. The rest is about choosing the right models and wiring them together resiliently.

Architecture Overview

Voice AI Demo Architecture

At a glance:

Browser → LiveKit (WebRTC) → Agent Worker → VAD → STT → LLM → TTS
                  ↕                         ↕
            LiveKit Server              Ollama (local)

Core components:

LayerTechnologyRole
FrontendNext.js 15, React 19, Agents UIVoice interface with mic controls, audio visualizer, chat transcript
Agent ServerPython 3.12, FastAPI, LiveKit Agents SDKJWT token generation, health check, pipeline orchestration
MediaLiveKit Server (Docker/Go)WebRTC SFU, room management, job dispatch
VADSilero VADSpeech segment and utterance boundary detection
STTfaster-whisper (large-v3-turbo)Local speech-to-text transcription
LLMOllama (llama3.2:3b)Local language model via OpenAI-compatible API
TTSEdge-TTSMicrosoft Edge TTS engine for speech synthesis

How LiveKit Agents Works

LiveKit Agents is the backbone of this system. It provides:

  • WebRTC SFU — A Selective Forwarding Unit that routes media between browsers and agent workers
  • PipelineAgent — A high-level abstraction that processes audio through config stages (VAD → STT → LLM → TTS)
  • Job dispatch — When a user connects, LiveKit automatically dispatches a job to an available agent worker
  • JWT authentication — Secure token-based room access with configurable permissions

The data flow is elegant:

  1. User opens http://localhost:3000 in a browser
  2. Frontend calls GET /token on FastAPI → receives a signed LiveKit JWT
  3. Browser connects to LiveKit Server via WebRTC using the JWT
  4. LiveKit dispatches a job to the Agent Worker (background thread)
  5. Agent processes audio through the voice pipeline
  6. Response audio streams back through LiveKit → browser plays it in real-time

The 4-Stage Voice Pipeline

Stage 1: VAD (Voice Activity Detection)

Silero VAD detects when speech starts and ends. This is critical for two reasons:

  • Utterance detection — Know when the user has finished speaking so the pipeline can start processing
  • Noise filtering — Silence and background noise are discarded before they reach the STT model

Silero is pre-trained, tiny (~1.7MB), and runs efficiently on CPU. It operates on 30ms audio frames and returns a probability score between 0 and 1 for each frame.

Stage 2: STT (Speech-to-Text)

faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2. It's 4x faster than the original while maintaining accuracy.

The pipeline uses large-v3-turbo — a distilled model that balances accuracy and speed. On Apple Silicon, Metal GPU acceleration brings inference time well under real-time:

Audio duration: 5 seconds
Processing time: ~1.2 seconds (Metal GPU)
Real-time factor: 0.24x

The first run downloads the model (~3GB). Subsequent runs load from cache and are instant.

Stage 3: LLM (Language Model)

Ollama serves Llama 3.2 3B locally via an OpenAI-compatible API. The agent sends transcribed text along with a system prompt, and Ollama streams tokens back.

The model is small enough to run on a MacBook with good performance:

Prompt processing: ~30 tokens/s
Token generation: ~25 tokens/s
End-to-end response: 1-3 seconds for typical queries

The magic is that you can swap to any OpenAI-compatible provider at runtime. Just change .env:

LLM_PROVIDER=ollama  # or openai_compatible
LLM_MODEL=llama3.2:3b  # or llama-3.3-70b-versatile (Groq)
LLM_BASE_URL=http://localhost:11434/v1  # or https://api.groq.com/openai/v1

Stage 4: TTS (Text-to-Speech)

Edge-TTS uses Microsoft's Edge browser TTS engine under the hood. It produces natural-sounding speech with multiple voice options (en-US-AriaNeural is the default).

The TTS runs locally, streams audio as it generates, and feeds it back through LiveKit's audio track to the browser. The whole round-trip — from user speaking to hearing the response — completes in under 2 seconds on Apple Silicon.

Safe Wrappers: Error Resilience

Each pipeline stage has a Safe wrapper that catches failures and provides graceful fallbacks:

class SafeSTT:
    async def transcribe(self, audio: AudioFrame) -> str:
        try:
            return await self.stt.transcribe(audio)
        except Exception as e:
            logger.error(f"STT failed: {e}")
            return ""  # Empty transcript — LLM handles gracefully

This means a single model failure doesn't crash the entire conversation. If Whisper fails, the user gets a "I didn't catch that" response. If Ollama fails, TTS gets a fallback message. The conversation keeps going.

Frontend: Agents UI

The Next.js frontend uses LiveKit's Agents UI — a shadcn-based React component library:

  • Audio visualizer — Real-time waveform display of microphone input
  • Chat transcript — Scrollable conversation history with timestamps
  • Mic controls — Mute/unmute, push-to-talk, and connection status
  • Dark theme — Optimized for the voice-first interface

The frontend is deliberately minimal. The complexity lives in the agent pipeline, not the UI.

Provider Switching

The most powerful feature: every component of the pipeline is swappable at runtime. Here are three example configurations:

Fully local (default):

STT_PROVIDER=whisper
LLM_PROVIDER=ollama
TTS_PROVIDER=edge-tts

Zero API keys required. Everything runs on your machine.

Cloud-powered (faster responses):

STT_PROVIDER=openai_compatible
STT_MODEL=whisper-1
STT_BASE_URL=https://api.openai.com/v1
# Add OPENAI_API_KEY to .env

OpenAI's Whisper API is faster than local for very long audio. Good trade-off if you have an API key.

Hybrid (best of both):

LLM_PROVIDER=openai_compatible
LLM_MODEL=llama-3.3-70b-versatile
LLM_BASE_URL=https://api.groq.com/v1
# Add GROQ_API_KEY to .env

Free Groq inference for the LLM, local Whisper for STT, local Edge-TTS for speech. Fast AND free.

Quick Start

Getting the full stack running takes about 5 minutes:

# 1. Start LiveKit Server (Docker)
cp .env.example .env
docker compose up -d

# 2. Pull LLM model
ollama pull llama3.2:3b

# 3. Start Agent Service
cd agent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python main.py

# 4. Start Frontend
cd frontend
npm install
npm run dev

Open http://localhost:3000 and click "Start audio". You're talking to a fully local voice AI.

Key Takeaways

  1. LiveKit Agents is the right abstraction — It handles WebRTC complexity, media routing, job dispatch, and pipeline orchestration so you focus on the AI models
  2. Local models are production-viable — Whisper large-v3-turbo + Llama 3.2 3B + Edge-TTS on Apple Silicon delivers sub-2s latency with zero API costs
  3. Provider swappability is essential — The ability to switch STT, LLM, or TTS at runtime (not rebuild time) makes the system adaptable to any environment
  4. Safe wrappers prevent cascade failures — A single model crash should never take down the conversation. Graceful degradation keeps the UX intact
  5. Docker for the hard parts — LiveKit Server runs in Docker. The agent and frontend run natively. This separation keeps development fast and deployment flexible

Code & Resources

  • Full source code: github.com/aiwithvd/voiceai
  • Docker setup: One-command LiveKit server with docker compose up -d
  • Testing: 15 agent tests + 4 frontend tests + --test-mode for CI
  • Models: Default pipeline uses zero cloud models; Ollama requires one ollama pull

Want to build production voice AI systems? Let's connect. I advise teams on real-time AI architecture, voice pipeline design, and self-hosted deployment strategies.