LiveKit for Real-Time AI — Voice Agents, Video, and WebRTC in Production

Introduction

Building real-time video and voice applications is hard: WebRTC is complex, codec negotiation is fragile, and scaling across geographies is expensive.

LiveKit is an open-source WebRTC platform: deploy it yourself or use LiveKit Cloud. Pair it with the OpenAI Realtime API to build voice AI agents that respond to users in well under a second.

A single developer can build what used to require a team of infrastructure engineers.

LiveKit as WebRTC Infrastructure (Rooms, Participants, Tracks)

LiveKit abstracts WebRTC's complexity. Think of it as a pub/sub system for audio and video.

Rooms: Isolated spaces where participants meet. Participants join a room by name, using your project's server URL and an access token.

Participants: Users (or agents) in a room. Each participant can publish and subscribe to audio and video tracks.

Tracks: Streams of media. A participant might publish a microphone track and a camera track.
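The room/participant/track relationships can be sketched as a toy data model (illustration only, not the LiveKit SDK's API):

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    kind: str  # "audio" or "video"
    name: str

@dataclass
class Participant:
    identity: str
    tracks: list[Track] = field(default_factory=list)

    def publish(self, kind: str, name: str) -> Track:
        track = Track(kind, name)
        self.tracks.append(track)
        return track

@dataclass
class Room:
    name: str
    participants: dict[str, Participant] = field(default_factory=dict)

    def join(self, identity: str) -> Participant:
        p = Participant(identity)
        self.participants[identity] = p
        return p

    def subscribed_tracks(self, subscriber: str) -> list[Track]:
        # Pub/sub: a subscriber receives every track published by others
        return [
            t
            for identity, p in self.participants.items()
            if identity != subscriber
            for t in p.tracks
        ]

room = Room("demo")
alice = room.join("alice")
alice.publish("audio", "microphone")
alice.publish("video", "camera")
bob = room.join("bob")
print([t.name for t in room.subscribed_tracks("bob")])  # ['microphone', 'camera']
```

LiveKit's SFU does the same routing for real media, with simulcast and bandwidth adaptation on top.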

Set up LiveKit Cloud:

pip install livekit-agents livekit-plugins-openai livekit-plugins-silero   # Python agents framework
npm install livekit-server-sdk                                             # Node server SDK (used later)

Get your API key and URL from https://cloud.livekit.io/.
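The examples below read credentials from environment variables. A typical .env (values are placeholders):

```
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=APIxxxxxxxx
LIVEKIT_API_SECRET=secretxxxxxxxx
OPENAI_API_KEY=sk-...
```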

LiveKit Agents Framework for AI Voice Agents

LiveKit provides a Python framework for building agent workflows. Build a voice agent that joins a room and interacts with participants.

# Note: the livekit-agents API is still evolving; the names below follow
# the 0.x releases and may differ in your installed version.
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    JobProcess,
    WorkerOptions,
    cli,
    llm,
)
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import openai, silero

async def entrypoint(ctx: JobContext) -> None:
    initial_ctx = llm.ChatContext().append(
        role="system",
        text="You are a friendly voice assistant. Keep responses short.",
    )

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    print(f"Starting conversation with {participant.identity}")

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),                # voice activity detection
        stt=openai.STT(),                     # speech-to-text
        llm=openai.LLM(model="gpt-4o-mini"),  # response generation
        tts=openai.TTS(),                     # text-to-speech
        chat_ctx=initial_ctx,
    )
    assistant.start(ctx.room, participant)

    await assistant.say(
        "Hi there! I'm your AI assistant. How can I help you today?",
        allow_interruptions=True,
    )

def prewarm(proc: JobProcess) -> None:
    """Called on idle workers to preload models for faster response times."""
    proc.userdata["vad"] = silero.VAD.load()

if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm)
    )

Deploy this on your server. When a participant joins a room, the agent automatically joins and starts a conversation.
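If you start the worker through the framework's cli.run_app helper, it exposes a small CLI (assuming the file is saved as agent.py; subcommands may vary by version):

```
# local development with hot reload
python agent.py dev

# production worker
python agent.py start
```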

OpenAI Realtime API With LiveKit

The OpenAI Realtime API accepts streaming audio in and produces audio out, so you can pipe participant audio to OpenAI and stream the response back. As a simpler starting point, the same loop can be built from OpenAI's standard STT, chat, and TTS endpoints:

import OpenAI, { toFile } from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function transcribeAndRespond(audioBuffer: Buffer): Promise<Buffer> {
  // Speech-to-text (the SDK expects a file-like object, not a raw Buffer)
  const transcription = await openai.audio.transcriptions.create({
    file: await toFile(audioBuffer, 'audio.wav'),
    model: 'whisper-1'
  });

  const text = transcription.text;

  // Get LLM response
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content:
          'You are a helpful AI assistant. Keep responses concise and natural for voice.'
      },
      { role: 'user', content: text }
    ]
  });

  const assistantText = completion.choices[0].message.content ?? '';

  // Convert text to speech
  const speech = await openai.audio.speech.create({
    model: 'tts-1',
    voice: 'onyx',
    input: assistantText
  });

  return Buffer.from(await speech.arrayBuffer()); // audio bytes
}

This pattern (STT → LLM → TTS) powers many voice AI agents. The Realtime API fuses the three stages into one streaming model; prefer it over the chained approach when latency matters.
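A back-of-envelope budget shows why the fused approach wins on latency. The per-stage numbers are illustrative assumptions, not benchmarks:

```python
# Hypothetical per-stage latencies (ms) for a chained STT -> LLM -> TTS pipeline
stt_ms, llm_first_token_ms, tts_ms, network_ms = 300, 350, 200, 100

# Chained: each stage waits for the previous one
chained = stt_ms + llm_first_token_ms + tts_ms + network_ms
print(f"chained pipeline: ~{chained} ms")

# Fused speech-to-speech (Realtime-style): roughly one model pass plus network
fused = 350 + network_ms
print(f"fused model: ~{fused} ms")
```

With these placeholder numbers the chained pipeline lands near one second while the fused path stays under half of that, which is why fused models matter for conversational feel.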

Building a Voice AI Agent (STT → LLM → TTS Pipeline)

Chain together speech recognition, language model, and speech synthesis:

// NOTE: receiving media on the server requires a realtime SDK such as
// @livekit/rtc-node; the plain livekit-server-sdk only manages rooms.
// The frame-access API below is a sketch and varies by SDK version.
import * as livekit from 'livekit-server-sdk';
import OpenAI, { toFile } from 'openai';

class VoiceAgent {
  private participant: livekit.RemoteParticipant;
  private openai: OpenAI;
  private audioBuffer: Buffer[] = [];

  constructor(
    participant: livekit.RemoteParticipant,
    openai: OpenAI
  ) {
    this.participant = participant;
    this.openai = openai;
  }

  async run() {
    // Subscribe to the participant's audio
    for await (const frame of this.participant.audioTrack!
      .getAudioFrames()) {
      this.audioBuffer.push(Buffer.from(frame.data));

      // Flush after ~100 frames (≈2 seconds at 20 ms per frame)
      if (this.audioBuffer.length > 100) {
        await this.processAudio();
      }
    }
  }

  private async processAudio() {
    const combined = Buffer.concat(this.audioBuffer);
    this.audioBuffer = [];

    // Speech-to-text (the SDK expects a file-like object, not a raw Buffer)
    const transcript = await this.openai.audio.transcriptions.create({
      file: await toFile(combined, 'audio.wav'),
      model: 'whisper-1'
    });

    if (!transcript.text) return;

    console.log(`User: ${transcript.text}`);

    // Generate response
    const completion = await this.openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        {
          role: 'system',
          content: 'You are a helpful voice assistant.'
        },
        { role: 'user', content: transcript.text }
      ]
    });

    const responseText = completion.choices[0].message.content ?? '';
    console.log(`Assistant: ${responseText}`);

    // Text-to-speech
    const speech = await this.openai.audio.speech.create({
      model: 'tts-1-hd',
      voice: 'alloy',
      input: responseText
    });

    // Send the audio back to the participant
    await this.publishAudio(
      Buffer.from(await speech.arrayBuffer())
    );
  }

  private async publishAudio(audioData: Buffer) {
    // Publish to an audio track in the room;
    // implementation depends on the LiveKit SDK version.
  }
}
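The flush threshold above ("process every 2 seconds" at 100 frames) implies an assumed frame duration. A quick sanity check, taking 20 ms as the frame size (a common Opus/WebRTC default, assumed here rather than read from the SDK):

```python
# How much audio do 100 buffered frames represent?
frame_ms = 20   # assumed per-frame duration
frames = 100

buffered_seconds = frames * frame_ms / 1000
print(buffered_seconds, "seconds")  # 2.0 seconds
```

If your SDK delivers 10 ms frames, the same threshold flushes every second instead; size the buffer from the actual frame duration.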

Custom LLM Integration With LiveKit Agents

Use any LLM, not just OpenAI. LiveKit Agents supports custom integrations:

from livekit.agents import JobContext
from anthropic import AsyncAnthropic

class ClaudeSession:
    def __init__(self):
        self.client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the env
        self.messages = []

    async def answer(self, question: str) -> str:
        self.messages.append({
            "role": "user",
            "content": question
        })

        response = await self.client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=self.messages
        )

        answer = response.content[0].text
        self.messages.append({
            "role": "assistant",
            "content": answer
        })

        return answer

# Use in an agent entrypoint. The voice loop below is a sketch: wire
# ClaudeSession into your framework's STT/TTS hooks (names vary by version).
async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()
    participant = await ctx.wait_for_participant()

    claude = ClaudeSession()

    # while conversation_active:
    #     user_input = await recognize_speech(participant)   # STT
    #     response = await claude.answer(user_input)         # LLM
    #     await speak(response)                              # TTS

Replace OpenAI with Claude, Llama, or any model you prefer.
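To keep providers swappable, the agent loop can depend on a small interface rather than a concrete client. A minimal sketch (the real livekit-agents framework ships its own LLM abstraction; names here are illustrative):

```python
import asyncio
from typing import Protocol

class LLMSession(Protocol):
    """Anything that can answer a user turn."""
    async def answer(self, question: str) -> str: ...

class EchoSession:
    """Stand-in provider for testing the loop without network calls."""
    async def answer(self, question: str) -> str:
        return f"You said: {question}"

async def handle_turn(session: LLMSession, user_input: str) -> str:
    # The loop only sees the interface, so Claude, Llama, or a local
    # model can be dropped in without touching this code.
    return await session.answer(user_input)

print(asyncio.run(handle_turn(EchoSession(), "hello")))  # You said: hello
```

Swapping providers then means implementing answer() once, as ClaudeSession does above.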

Video Conferencing Backend With Recording

LiveKit handles video conferencing and recording:

import * as livekit from 'livekit-server-sdk';

const roomService = new livekit.RoomServiceClient(
  process.env.LIVEKIT_URL!,
  process.env.LIVEKIT_API_KEY!,
  process.env.LIVEKIT_API_SECRET!
);

// Recording is handled by the Egress service, not the room service
const egressClient = new livekit.EgressClient(
  process.env.LIVEKIT_URL!,
  process.env.LIVEKIT_API_KEY!,
  process.env.LIVEKIT_API_SECRET!
);

async function createRoom(roomName: string, maxParticipants: number) {
  return roomService.createRoom({
    name: roomName,
    emptyTimeout: 300, // seconds before an empty room closes
    maxParticipants
  });
}

async function listRooms() {
  return roomService.listRooms();
}

async function startRecording(roomName: string) {
  // Record a composite "speaker" layout of the room to an MP4 file.
  // Exact output options vary by SDK version; check the egress docs.
  const egress = await egressClient.startRoomCompositeEgress(
    roomName,
    {
      file: new livekit.EncodedFileOutput({
        filepath: `recordings/${roomName}.mp4`
      })
    },
    { layout: 'speaker' }
  );

  return egress.egressId; // needed to stop the recording later
}

async function stopRecording(egressId: string) {
  await egressClient.stopEgress(egressId);
}

async function getParticipants(roomName: string) {
  // listParticipants takes the room name directly
  return roomService.listParticipants(roomName);
}

On the client, generate a token and join:

import { AccessToken } from 'livekit-server-sdk';

app.get('/api/token', async (req, res) => {
  const roomName = req.query.room as string;
  const participantName = req.query.name as string;

  const token = new AccessToken(
    process.env.LIVEKIT_API_KEY,
    process.env.LIVEKIT_API_SECRET,
    { identity: participantName }
  );

  token.addGrant({
    room: roomName,
    roomJoin: true,
    canPublish: true,
    canPublishData: true,
    canSubscribe: true
  });

  // toJwt() is async in livekit-server-sdk v2
  res.json({
    token: await token.toJwt()
  });
});

Client joins the room:

import { useEffect, useState } from 'react';
import { LiveKitRoom, VideoConference } from '@livekit/components-react';

export function ConferenceComponent() {
  const [token, setToken] = useState<string>();

  // Fetch the token once on mount; a component can't await in render
  useEffect(() => {
    fetch('/api/token?room=room-1&name=Alice')
      .then((r) => r.json())
      .then((data) => setToken(data.token));
  }, []);

  if (!token) return <div>Connecting…</div>;

  return (
    <LiveKitRoom
      serverUrl={process.env.REACT_APP_LIVEKIT_URL}
      token={token}
      connectOptions={{ autoSubscribe: true }}
      onConnected={() => console.log('Connected')}
      onDisconnected={() => console.log('Disconnected')}
    >
      <VideoConference />
    </LiveKitRoom>
  );
}

LiveKit Cloud vs Self-Hosted

LiveKit Cloud (recommended for most teams):

  • No ops burden
  • Global CDN for low latency
  • Automatic scaling
  • Pay per participant-minute
  • HIPAA/SOC2 compliant

Self-hosted:

  • Full control
  • Lower cost at massive scale
  • Run on your infrastructure
  • Manage upgrades and monitoring

Start with Cloud. Self-host if you hit millions of participant-minutes per month.
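The break-even point is simple arithmetic once you know your numbers. Both figures below are HYPOTHETICAL placeholders; substitute your actual LiveKit Cloud rate and your own infrastructure-plus-ops estimate:

```python
# Hypothetical inputs -- replace with real quotes
cloud_rate_per_participant_minute = 0.0005   # $/participant-minute (placeholder)
self_hosted_fixed_monthly = 3000.0           # servers + ops time (placeholder)

# Below this volume, Cloud is cheaper; above it, self-hosting starts to pay off
break_even_minutes = self_hosted_fixed_monthly / cloud_rate_per_participant_minute
print(f"break-even: {break_even_minutes:,.0f} participant-minutes/month")
```

At these placeholder numbers the crossover sits in the millions of participant-minutes per month, which matches the rule of thumb above.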

Node.js Server SDK for Room Management

Manage rooms, participants, and tokens programmatically:

import * as livekit from 'livekit-server-sdk';

const webhook = new livekit.WebhookReceiver(
  process.env.LIVEKIT_API_KEY!,
  process.env.LIVEKIT_API_SECRET!
);

// Handle LiveKit webhooks. The body must arrive unparsed so the
// signature can be verified, hence express.raw.
app.post(
  '/webhooks/livekit',
  express.raw({ type: 'application/webhook+json' }),
  async (req, res) => {
    // receive() is async in livekit-server-sdk v2
    const event = await webhook.receive(
      req.body.toString(),
      req.get('Authorization')
    );

    switch (event.event) {
      case 'participant_joined':
        console.log(`${event.participant?.identity} joined`);
        break;
      case 'participant_left':
        console.log(`${event.participant?.identity} left`);
        break;
      case 'egress_started':
        console.log('Recording started');
        break;
      case 'egress_ended':
        console.log('Recording finished:', event.egressInfo?.egressId);
        break;
    }

    res.sendStatus(200);
  }
);

Webhooks notify your backend of room events. Sync to your database, trigger downstream processes, or send notifications.

Client SDKs (React, Swift, Android)

React (@livekit/components-react):

import { LiveKitRoom, VideoConference } from '@livekit/components-react';

export function VideoComponent({ token }: { token: string }) {
  return (
    <LiveKitRoom serverUrl={process.env.REACT_APP_LIVEKIT_URL} token={token}>
      <VideoConference />
    </LiveKitRoom>
  );
}

Swift (iOS):

import LiveKit

let room = Room()

Task {
  do {
    try await room.connect(
      url: "ws://...",
      token: token
    )
    print("Connected")
  } catch {
    print("Error: \(error)")
  }
}

Android (Kotlin):

// In livekit-android, Room instances come from LiveKit.create()
// and connect() is a suspend function that throws on failure.
val room = LiveKit.create(applicationContext)

lifecycleScope.launch {
  try {
    room.connect(url = "ws://...", token = token)
    println("Connected")
  } catch (e: Exception) {
    println("Error: $e")
  }
}

All SDKs handle WebRTC negotiation, codec handling, and network resilience.

Latency Optimization for Real-Time AI

Voice AI agents need under ~500 ms of end-to-end latency to feel conversational. Optimize:

  1. Use regional servers: LiveKit Cloud runs in multiple regions. Route users to the nearest one.

  2. Stream audio in chunks: process audio as it arrives rather than buffering entire sentences. In the browser this means an AudioWorklet that forwards frames immediately (sketch; wire the message port to your uplink):

class ForwardingProcessor extends AudioWorkletProcessor {
  process(inputs) {
    // Forward each 128-sample render quantum right away; don't wait for silence
    this.port.postMessage(inputs[0]);
    return true; // keep the processor alive
  }
}

registerProcessor('forwarding-processor', ForwardingProcessor);

  3. Parallel processing: start synthesizing the response while STT is still finishing.

  4. Model selection: smaller models (whisper-1 over the large Whisper variants, gpt-4o-mini over gpt-4) have lower latency than larger ones.

const transcript = await openai.audio.transcriptions.create({
  file: audioFile,
  model: 'whisper-1',
  language: 'en' // pre-specifying the language helps
});

const completion = await openai.chat.completions.create({
  model: 'gpt-4o-mini', // lower latency than gpt-4
  messages: [...]
});

  5. Connection multiplexing: reuse TCP/WebRTC connections. Don't create a new connection per request.
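The connection-reuse point can be illustrated with a toy singleton pattern (the Client class below is hypothetical; its constructor stands in for an expensive TLS/WebRTC handshake):

```python
import functools

class Client:
    instances = 0

    def __init__(self):
        Client.instances += 1  # stands in for handshake/connection cost

@functools.lru_cache(maxsize=1)
def get_client() -> Client:
    # First call constructs the client; every later call returns the same one
    return Client()

for _ in range(1000):
    get_client()  # 1000 "requests", one connection

print(Client.instances)  # 1
```

The same idea applies to real SDK clients: construct them once at module or process scope and pass them around, rather than instantiating per request.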

Checklist

  • Create LiveKit Cloud account
  • Set up API key and secret
  • Implement token generation endpoint
  • Build React/Swift/Android client with video
  • Deploy LiveKit Agents for voice AI
  • Integrate OpenAI Realtime or custom LLM
  • Set up recording
  • Configure webhooks for room events
  • Test latency and optimize
  • Monitor participant-minutes and costs

Conclusion

LiveKit abstracts WebRTC's complexity. Build real-time voice and video applications without infrastructure expertise. Pair with OpenAI Realtime API or custom LLMs to create voice agents that respond in real time.

For voice AI, the stack is LiveKit + OpenAI Realtime. For video conferencing, it is LiveKit + the React/Swift/Android SDKs. Self-host or use Cloud, deploy your agents, and let LiveKit handle the rest.

Real-time AI is no longer a luxury for well-funded teams. It's accessible to anyone with a laptop and an API key.