LiveKit for Real-Time AI — Voice Agents, Video, and WebRTC in Production

Author: Sanjeev Sharma (@webcoderspeed1)
Introduction
Building real-time video and voice applications is hard: WebRTC is complex, codec negotiation is fragile, and scaling across geographies is expensive.
LiveKit is an open-source WebRTC platform: deploy it yourself or use LiveKit Cloud. Pair it with the OpenAI Realtime API to build voice AI agents that respond to users in milliseconds.
A single developer can build what used to require a team of infrastructure engineers.
- LiveKit as WebRTC Infrastructure (Rooms, Participants, Tracks)
- LiveKit Agents Framework for AI Voice Agents
- OpenAI Realtime API With LiveKit
- Building a Voice AI Agent (STT → LLM → TTS Pipeline)
- Custom LLM Integration With LiveKit Agents
- Video Conferencing Backend With Recording
- LiveKit Cloud vs Self-Hosted
- Node.js Server SDK for Room Management
- Client SDKs (React, Swift, Android)
- Latency Optimisation for Real-Time AI
- Checklist
- Conclusion
LiveKit as WebRTC Infrastructure (Rooms, Participants, Tracks)
LiveKit abstracts WebRTC's complexity. Think of it as a pub/sub system for audio and video.
Rooms: Isolated spaces where participants meet. Each room has a URL participants join.
Participants: Users in a room. Each participant publishes audio and/or video tracks.
Tracks: Streams of media. A participant might publish a microphone track and a camera track.
Set up LiveKit Cloud and install the SDKs. For the Node.js examples:
npm install livekit-server-sdk livekit-client @livekit/components-react
For the Python agent examples:
pip install livekit-agents livekit-plugins-openai livekit-plugins-silero
Get your API key and URL from https://cloud.livekit.io/.
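The examples in this post read credentials from environment variables. A typical .env (variable names as used throughout; values come from your LiveKit Cloud project settings) might look like:

```
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=<api key>
LIVEKIT_API_SECRET=<api secret>
OPENAI_API_KEY=<openai key>
```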
LiveKit Agents Framework for AI Voice Agents
LiveKit provides a Python framework for building agent workflows. Build a voice agent that joins a room and interacts with participants.
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero


async def entrypoint(ctx: JobContext) -> None:
    initial_ctx = llm.ChatContext().append(
        role="system",
        text="You are a helpful voice assistant. Keep responses short and conversational.",
    )
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    print(f"Starting conversation with {participant.identity}")

    assistant = VoiceAssistant(
        vad=ctx.proc.userdata["vad"],   # voice activity detection (preloaded)
        stt=openai.STT(),               # speech-to-text
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(),               # text-to-speech
        chat_ctx=initial_ctx,
    )
    assistant.start(ctx.room, participant)

    await assistant.say(
        "Hi there! I'm your AI assistant. How can I help you today?",
        allow_interruptions=True,
    )


def prewarm(proc: JobProcess) -> None:
    """Preload the VAD model so the first response is fast."""
    proc.userdata["vad"] = silero.VAD.load()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))
Deploy this worker on your server; the livekit-agents CLI supports dev and production modes (e.g. python agent.py dev). When a participant joins a room, the agent automatically joins and starts the conversation.
OpenAI Realtime API With LiveKit
The OpenAI Realtime API accepts streaming audio in and returns streaming audio out. Stream participant audio to OpenAI and stream the response back. The example below approximates the same flow with the standard (non-streaming) endpoints:
import OpenAI, { toFile } from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function transcribeAndRespond(audioBuffer: Buffer): Promise<Buffer> {
  // Speech-to-text (the SDK expects a named file, not a raw Buffer)
  const transcription = await openai.audio.transcriptions.create({
    file: await toFile(audioBuffer, 'audio.wav'),
    model: 'whisper-1'
  });
  const text = transcription.text;

  // Get LLM response
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content:
          'You are a helpful AI assistant. Keep responses concise and natural for voice.'
      },
      { role: 'user', content: text }
    ]
  });
  const assistantText = completion.choices[0].message.content ?? '';

  // Convert text to speech
  const speech = await openai.audio.speech.create({
    model: 'tts-1',
    voice: 'onyx',
    input: assistantText
  });
  return Buffer.from(await speech.arrayBuffer()); // audio bytes
}
This pattern (STT → LLM → TTS) powers voice AI agents. The OpenAI Realtime API collapses the three stages into a single low-latency streaming connection; prefer it over the standard endpoints when latency matters.
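To see where the latency budget goes in practice, a small timing wrapper around each stage helps. This is a hypothetical helper, not part of any SDK:

```typescript
// Hypothetical helper: time any async stage of the STT -> LLM -> TTS pipeline.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  const result = await fn();
  console.log(`${label}: ${Date.now() - start} ms`);
  return result;
}

// Usage sketch (transcribe/complete are your own pipeline functions):
//   const text = await timed('stt', () => transcribe(audio));
//   const reply = await timed('llm', () => complete(text));
```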
Building a Voice AI Agent (STT → LLM → TTS Pipeline)
Chain together speech recognition, language model, and speech synthesis:
import OpenAI, { toFile } from 'openai';

// The audio-frame subscription API differs between LiveKit SDK versions;
// here the participant is modeled as anything that yields raw audio frames.
interface AudioSource {
  getAudioFrames(): AsyncIterable<{ data: Uint8Array }>;
}

class VoiceAgent {
  private audioBuffer: Buffer[] = [];

  constructor(
    private participant: AudioSource,
    private openai: OpenAI
  ) {}

  async run() {
    // Subscribe to the participant's audio
    for await (const frame of this.participant.getAudioFrames()) {
      this.audioBuffer.push(Buffer.from(frame.data));
      // Process roughly every 2 seconds (~100 frames at 20 ms per frame)
      if (this.audioBuffer.length > 100) {
        await this.processAudio();
      }
    }
  }

  private async processAudio() {
    const combined = Buffer.concat(this.audioBuffer);
    this.audioBuffer = [];

    // Speech-to-text
    const transcript = await this.openai.audio.transcriptions.create({
      file: await toFile(combined, 'audio.wav'),
      model: 'whisper-1'
    });
    if (!transcript.text) return;
    console.log(`User: ${transcript.text}`);

    // Generate response
    const completion = await this.openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'You are a helpful voice assistant.' },
        { role: 'user', content: transcript.text }
      ]
    });
    const responseText = completion.choices[0].message.content ?? '';
    console.log(`Assistant: ${responseText}`);

    // Text-to-speech
    const speech = await this.openai.audio.speech.create({
      model: 'tts-1-hd',
      voice: 'alloy',
      input: responseText
    });

    // Send audio back to the participant
    await this.publishAudio(Buffer.from(await speech.arrayBuffer()));
  }

  private async publishAudio(audioData: Buffer) {
    // Publish to an audio track in the room.
    // Implementation depends on the LiveKit SDK version.
  }
}
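The buffer-length threshold above implicitly assumes ~20 ms audio frames. A tiny helper (hypothetical, for illustration) makes that assumption explicit:

```typescript
// Frames needed to cover a time window, given the per-frame duration.
// WebRTC audio frames are commonly 10-20 ms; 2000 ms / 20 ms = 100 frames.
function framesForWindow(windowMs: number, frameMs: number = 20): number {
  return Math.ceil(windowMs / frameMs);
}
```

If your capture pipeline uses a different frame size, derive the threshold from it rather than hard-coding a count.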
Custom LLM Integration With LiveKit Agents
Use any LLM, not just OpenAI. LiveKit Agents supports custom integrations:
from anthropic import AsyncAnthropic

from livekit.agents import JobContext


class ClaudeSession:
    def __init__(self):
        self.client = AsyncAnthropic()
        self.messages = []

    async def answer(self, question: str) -> str:
        self.messages.append({"role": "user", "content": question})
        response = await self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=self.messages,
        )
        answer = response.content[0].text
        self.messages.append({"role": "assistant", "content": answer})
        return answer


# Use in an agent (sketch: the exact STT/TTS wiring depends on your
# livekit-agents version; the key change is calling claude.answer()
# for the LLM step)
async def entry(ctx: JobContext) -> None:
    await ctx.connect()
    participant = await ctx.wait_for_participant()
    claude = ClaudeSession()
    while True:
        user_input = await recognize_speech(participant)  # your STT step
        response = await claude.answer(user_input)
        await speak(response)  # your TTS step
Replace OpenAI with Claude, Llama, or any model you prefer.
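One way to keep that swap painless is to code the agent loop against a minimal interface rather than a concrete provider. The names below are hypothetical; any backend that turns text into text fits:

```typescript
// Provider-agnostic contract for the LLM step of the pipeline.
interface LLMSession {
  answer(question: string): Promise<string>;
}

// Stand-in backend for testing the loop without network calls.
class EchoSession implements LLMSession {
  async answer(question: string): Promise<string> {
    return `You said: ${question}`;
  }
}
```

A Claude, Llama, or OpenAI session then just implements `answer()`, and the rest of the agent never changes.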
Video Conferencing Backend With Recording
LiveKit handles video conferencing and recording:
import * as livekit from 'livekit-server-sdk';

const roomService = new livekit.RoomServiceClient(
  process.env.LIVEKIT_URL,
  process.env.LIVEKIT_API_KEY,
  process.env.LIVEKIT_API_SECRET
);

// Recording is handled by the Egress service, not the room service
const egressClient = new livekit.EgressClient(
  process.env.LIVEKIT_URL,
  process.env.LIVEKIT_API_KEY,
  process.env.LIVEKIT_API_SECRET
);

async function createRoom(roomName: string, maxParticipants: number) {
  return roomService.createRoom({
    name: roomName,
    emptyTimeout: 300, // seconds before an empty room closes
    maxParticipants
  });
}

async function listRooms() {
  return roomService.listRooms();
}

async function startRecording(roomName: string) {
  // Composite recording of the whole room, written to an MP4 file
  const egress = await egressClient.startRoomCompositeEgress(
    roomName,
    {
      file: new livekit.EncodedFileOutput({
        fileType: livekit.EncodedFileType.MP4,
        filepath: `recordings/${roomName}.mp4`
      })
    },
    { layout: 'speaker' }
  );
  return egress.egressId; // keep this to stop the recording later
}

async function stopRecording(egressId: string) {
  await egressClient.stopEgress(egressId);
}

async function getParticipants(roomName: string) {
  return roomService.listParticipants(roomName);
}
On the server, expose a token endpoint that clients call before joining:
import { AccessToken } from 'livekit-server-sdk';

app.get('/api/token', async (req, res) => {
  const roomName = req.query.room as string;
  const participantName = req.query.name as string;

  const token = new AccessToken(
    process.env.LIVEKIT_API_KEY,
    process.env.LIVEKIT_API_SECRET,
    { identity: participantName }
  );
  token.addGrant({
    room: roomName,
    roomJoin: true,
    canPublish: true,
    canPublishData: true,
    canSubscribe: true
  });

  // toJwt() is async in recent SDK versions
  res.json({ token: await token.toJwt() });
});
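The returned token is a signed JWT whose payload carries the grant. Decoded, it looks roughly like this (field names follow the LiveKit token format; values are illustrative):

```
{
  "iss": "<your API key>",
  "sub": "Alice",
  "video": {
    "room": "room-1",
    "roomJoin": true,
    "canPublish": true,
    "canPublishData": true,
    "canSubscribe": true
  },
  "exp": 1735689600
}
```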
Client joins the room:
import { useEffect, useState } from 'react';
import { LiveKitRoom, VideoConference } from '@livekit/components-react';

export function ConferenceComponent() {
  const [token, setToken] = useState<string>();

  // Fetch the token in an effect; components can't await at render time
  useEffect(() => {
    fetch('/api/token?room=room-1&name=Alice')
      .then((r) => r.json())
      .then((data) => setToken(data.token));
  }, []);

  if (!token) return <div>Connecting…</div>;

  return (
    <LiveKitRoom
      serverUrl={process.env.REACT_APP_LIVEKIT_URL}
      token={token}
      onConnected={() => console.log('Connected')}
      onDisconnected={() => console.log('Disconnected')}
    >
      <VideoConference />
    </LiveKitRoom>
  );
}
LiveKit Cloud vs Self-Hosted
LiveKit Cloud (recommended for most teams):
- No ops burden
- Global CDN for low latency
- Automatic scaling
- Pay per participant-minute
- HIPAA/SOC2 compliant
Self-hosted:
- Full control
- Lower cost at massive scale
- Run on your infrastructure
- Manage upgrades and monitoring
Start with Cloud. Self-host if you hit millions of participant-minutes per month.
Node.js Server SDK for Room Management
Manage rooms, participants, and tokens programmatically:
import * as livekit from 'livekit-server-sdk';

const webhook = new livekit.WebhookReceiver(
  process.env.LIVEKIT_API_KEY,
  process.env.LIVEKIT_API_SECRET
);

// LiveKit signs webhooks; receive() needs the raw body plus the Authorization header
app.post(
  '/webhooks/livekit',
  express.raw({ type: 'application/webhook+json' }),
  async (req, res) => {
    const event = await webhook.receive(
      req.body.toString(),
      req.get('Authorization')
    );

    switch (event.event) {
      case 'participant_joined':
        console.log(`${event.participant?.identity} joined`);
        break;
      case 'participant_left':
        console.log(`${event.participant?.identity} left`);
        break;
      case 'egress_started':
        console.log('Recording started');
        break;
      case 'egress_ended':
        console.log('Recording finished:', event.egressInfo?.egressId);
        break;
    }
    res.status(200).send('ok');
  }
);
Webhooks notify your backend of room events. Sync to your database, trigger downstream processes, or send notifications.
Client SDKs (React, Swift, Android)
React (@livekit/components-react):
import { LiveKitRoom, VideoConference } from '@livekit/components-react';

// token comes from your /api/token endpoint
export function VideoComponent({ token }: { token: string }) {
  return (
    <LiveKitRoom serverUrl={process.env.REACT_APP_LIVEKIT_URL} token={token}>
      <VideoConference />
    </LiveKitRoom>
  );
}
Swift (iOS):
import LiveKit
let room = Room()
Task {
do {
try await room.connect(
url: "ws://...",
token: token
)
print("Connected")
} catch {
print("Error: \(error)")
}
}
Android (Kotlin):
// Rooms are created via LiveKit.create(); connect() is a suspend function
val room = LiveKit.create(applicationContext)
lifecycleScope.launch {
    try {
        room.connect(url = "ws://...", token = token)
        println("Connected")
    } catch (e: Exception) {
        println("Error: $e")
    }
}
All SDKs handle WebRTC negotiation, codec handling, and network resilience.
Latency Optimisation for Real-Time AI
Voice AI agents need <500ms latency end-to-end. Optimize:
Use regional servers: LiveKit Cloud runs in multiple regions. Route users to nearest region.
Optimize audio processing: Stream audio in chunks rather than buffering entire sentences.
// audio-processor.ts: runs inside an AudioWorklet; forwards each 128-sample
// block to the main thread immediately instead of buffering whole sentences.
class StreamingProcessor extends AudioWorkletProcessor {
  process(inputs: Float32Array[][]) {
    const input = inputs[0]?.[0];
    if (input) this.port.postMessage(input); // send the chunk right away
    return true; // keep the processor alive
  }
}
registerProcessor('streaming-processor', StreamingProcessor);
Parallel processing: Start synthesizing the first sentences of the response while the LLM is still generating the rest.
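One common way to get that overlap (a sketch, with a hypothetical helper name): flush the LLM's streamed tokens to TTS one sentence at a time, so synthesis of sentence one overlaps generation of sentence two.

```typescript
// Yield complete sentences from an LLM token stream as soon as they close.
// Assumes at most one sentence boundary arrives per token, which holds for
// typical word-level token streams.
function* sentenceChunks(tokens: Iterable<string>): Generator<string> {
  let buffer = '';
  for (const token of tokens) {
    buffer += token;
    // Flush whenever the buffer contains a finished sentence
    const match = buffer.match(/^(.*?[.!?])\s*(.*)$/s);
    if (match) {
      yield match[1].trim();
      buffer = match[2];
    }
  }
  if (buffer.trim()) yield buffer.trim(); // flush any trailing partial sentence
}
```

Each yielded sentence can be handed straight to the TTS call while the chat completion stream is still open.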
Model selection: Smaller models (e.g. whisper-1 for STT, gpt-4o-mini for chat) have lower latency than larger ones.
const transcript = await openai.audio.transcriptions.create({
  file: audioFile,
  model: 'whisper-1',
  language: 'en' // pre-specifying the language skips detection
});
const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini', // smaller models have lower time-to-first-token
  messages: [...]
});
Connection multiplexing: Reuse TCP/WebRTC connections. Don't create new connections per request.
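A minimal sketch of that reuse for API clients (the helper name is hypothetical): construct the client once per process and hand the same instance, with its keep-alive connections, to every request.

```typescript
// Lazily create one shared client per process instead of one per request.
function makeClientCache<T>(factory: () => T): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) instance = factory();
    return instance;
  };
}

// Usage sketch:
//   const getOpenAI = makeClientCache(() => new OpenAI());
//   const client = getOpenAI(); // same instance on every call
```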
Checklist
- Create LiveKit Cloud account
- Set up API key and secret
- Implement token generation endpoint
- Build React/Swift/Android client with video
- Deploy LiveKit Agents for voice AI
- Integrate OpenAI Realtime or custom LLM
- Set up recording
- Configure webhooks for room events
- Test latency and optimize
- Monitor participant-minutes and costs
Conclusion
LiveKit abstracts WebRTC's complexity. Build real-time voice and video applications without infrastructure expertise. Pair with OpenAI Realtime API or custom LLMs to create voice agents that respond in real time.
For voice AI, this is the stack: LiveKit + OpenAI Realtime. For video conferencing, LiveKit + React/Swift SDK. Deploy agents or use Cloud. Let LiveKit handle the rest.
Real-time AI is no longer a luxury for well-funded teams. It's accessible to anyone with a laptop and an API key.