LlamaIndex Complete Guide — Build RAG Apps
Introduction
LlamaIndex is one of the most widely used frameworks for building Retrieval-Augmented Generation (RAG) systems. It handles the complexity of ingesting, indexing, and retrieving documents at scale. This guide takes you from setup to production-ready systems.
- What is LlamaIndex?
- Installation and Setup
- Loading and Indexing Documents
- Simple Vector Index
- Loading from Persistence
- Advanced Indexing Strategies
- Hierarchical Index
- Keyword Index with Hybrid Search
- Query Engines
- Query Engine with Customization
- Chat Engine for Conversational RAG
- Advanced Document Processing
- Custom Metadata Extraction
- Multi-Document Agents
- Sub-Question Query Engine
- Best Practices
- Conclusion
- FAQ
What is LlamaIndex?
LlamaIndex bridges your documents and language models. It ingests unstructured data (PDFs, web pages, databases), indexes it intelligently, and enables semantic search with LLM-powered question answering.
Core concepts (a minimal example follows the list):
- Documents: Raw text data
- Nodes: Chunks of documents with metadata
- Embeddings: Vector representations of text
- Indexes: Data structures for efficient retrieval
- Query Engines: Interface for asking questions
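To make these concrete, here is a small sketch of building a Document and splitting it into Nodes by hand (the text and metadata are placeholders):
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
# A Document is raw text plus optional metadata
doc = Document(
    text="LlamaIndex turns documents into queryable indexes.",
    metadata={"source": "handbook"},  # illustrative metadata
)
# Nodes are chunks of that document, produced by a splitter
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents([doc])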
Installation and Setup
pip install llama-index llama-index-embeddings-openai llama-index-llms-openai
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure defaults
Settings.llm = OpenAI(model="gpt-4", temperature=0.7)
Settings.embed_model = OpenAIEmbedding()
Loading and Indexing Documents
Simple Vector Index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# Load all documents from directory
documents = SimpleDirectoryReader("./data").load_data()
# Create index (automatically chunks and embeds)
index = VectorStoreIndex.from_documents(documents)
# Save for later
index.storage_context.persist(persist_dir="./storage")
Loading from Persistence
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(
    persist_dir="./storage"
)
index = load_index_from_storage(storage_context)
Advanced Indexing Strategies
Hierarchical Index
Useful when you want both targeted retrieval and high-level summaries over the same documents (note that ComposableGraph is one of LlamaIndex's older composability APIs):
from llama_index.core.indices.composability import ComposableGraph
from llama_index.core import VectorStoreIndex, SummaryIndex
# Create indexes at different levels
vector_index = VectorStoreIndex.from_documents(documents)
summary_index = SummaryIndex.from_documents(documents)
# Route queries intelligently
graph = ComposableGraph.from_indices(
    SummaryIndex,
    [vector_index, summary_index],
    # Summaries describe each child index so queries route correctly
    index_summaries=[
        "Useful for finding specific facts in the documents",
        "Useful for high-level summaries of the documents"
    ]
)
query_engine = graph.as_query_engine()
response = query_engine.query("Find specific information")
Keyword Index with Hybrid Search
from llama_index.core import KeywordTableIndex, VectorStoreIndex
# Create both keyword and vector indexes
keyword_index = KeywordTableIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)
# Use keyword retrieval for precise matching
keyword_engine = keyword_index.as_query_engine()
vector_engine = vector_index.as_query_engine()
# Query each engine separately; the results can be fused (see below)
keyword_result = keyword_engine.query("technical term")
vector_result = vector_engine.query("technical term")
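Rather than merging the two responses by hand, recent LlamaIndex versions also ship a QueryFusionRetriever that fuses rankings from multiple retrievers. A minimal sketch (the top-k and mode values are illustrative):
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
# Fuse keyword and vector retrievers with reciprocal-rank fusion
fusion_retriever = QueryFusionRetriever(
    [keyword_index.as_retriever(), vector_index.as_retriever()],
    similarity_top_k=5,
    num_queries=1,  # disable LLM-based query rewriting
    mode="reciprocal_rerank",
)
hybrid_engine = RetrieverQueryEngine.from_args(fusion_retriever)
response = hybrid_engine.query("technical term")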
Query Engines
Query engines convert natural language to structured retrieval and generation:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
# Default: retrieves chunks and generates answer
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic?")
print(f"Answer: {response}")
print(f"Retrieved nodes: {response.source_nodes}")
Query Engine with Customization
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
query_engine = RetrieverQueryEngine(
    retriever=retriever  # omitting response_synthesizer falls back to the default
)
response = query_engine.query("Question?")
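To swap in a non-default synthesizer, build one explicitly. A sketch using the tree_summarize response mode, which summarizes retrieved chunks bottom-up:
from llama_index.core import get_response_synthesizer
synthesizer = get_response_synthesizer(response_mode="tree_summarize")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
)
response = query_engine.query("Question?")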
Chat Engine for Conversational RAG
# index.as_chat_engine() needs no extra imports
chat_engine = index.as_chat_engine()
# Multi-turn conversation
response1 = chat_engine.chat("What is the main topic?")
response2 = chat_engine.chat("Tell me more about the subtopic")
response3 = chat_engine.chat("How does it relate to X?")
# Chat engine maintains conversation context
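as_chat_engine() also accepts a chat_mode argument; condense_plus_context is a common choice for conversational RAG because it rewrites follow-ups into standalone queries before retrieving. A brief sketch:
# Condense follow-up questions, then retrieve fresh context each turn
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
print(chat_engine.chat("What is the main topic?"))
chat_engine.reset()  # clear conversation history when starting over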
Advanced Document Processing
Custom Metadata Extraction
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
# Build processing pipeline (the parameter is `transformations`)
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),  # chunk documents into nodes first
        TitleExtractor(),
        KeywordExtractor(keywords=5),
        QuestionsAnsweredExtractor(questions=3)
    ]
)
# Process documents
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes)
Multi-Document Agents
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import ReActAgent
# Create indexes for multiple documents
pdf_index = VectorStoreIndex.from_documents(pdf_docs)
csv_index = VectorStoreIndex.from_documents(csv_docs)
# Create tools
pdf_tool = QueryEngineTool.from_defaults(
    query_engine=pdf_index.as_query_engine(),
    name="pdf_search",
    description="Search through PDF documents"
)
csv_tool = QueryEngineTool.from_defaults(
    query_engine=csv_index.as_query_engine(),
    name="csv_search",
    description="Search through CSV data"
)
# Create agent
agent = ReActAgent.from_tools([pdf_tool, csv_tool], verbose=True)
response = agent.chat("Find information across both sources")
Sub-Question Query Engine
For complex questions requiring multiple sub-queries:
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool
# Wrap each engine in a described tool; the built-in question generator
# uses the descriptions to write the sub-questions
tools = [
    QueryEngineTool.from_defaults(query_engine=index1.as_query_engine(),
                                  name="source_one", description="First document set"),
    QueryEngineTool.from_defaults(query_engine=index2.as_query_engine(),
                                  name="source_two", description="Second document set")
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query("Complex multi-faceted question")
Best Practices
- Chunk Wisely: Experiment with chunk sizes (512-2048 tokens)
- Use Metadata Filters: Filter documents by date or category before retrieval (see the sketch after this list)
- Monitor Costs: Track embedding and LLM API usage
- Evaluate Quality: Use metrics to assess retrieval effectiveness
- Persist Indexes: Save indexes to avoid re-indexing
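For the metadata-filter tip above, a minimal sketch (the key/value pair is illustrative):
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
# Restrict retrieval to matching nodes before vector search runs
filters = MetadataFilters(filters=[ExactMatchFilter(key="category", value="finance")])
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("Summarize the finance documents")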
Conclusion
LlamaIndex transforms documents into queryable knowledge bases. With proper indexing strategies and query engines, you can build RAG systems that rival specialized search solutions. Its flexibility accommodates simple use cases and scales to enterprise complexity.
FAQ
Q: How do I handle large document collections? A: Use external vector stores (Pinecone, Weaviate) instead of in-memory storage. LlamaIndex integrates with major providers.
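As a sketch of wiring in an external store (Chroma here; assumes pip install llama-index-vector-stores-chroma chromadb):
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
# Embeddings live in Chroma rather than in-process memory
client = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(chroma_collection=client.get_or_create_collection("docs"))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)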
Q: Can I update indexes without re-indexing everything? A: Yes. LlamaIndex supports incremental indexing for adding documents without full reprocessing.
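For example, inserting one new document into an existing index:
from llama_index.core import Document
# Chunks, embeds, and adds the document without rebuilding the index
index.insert(Document(text="Newly added content."))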
Q: How do I improve retrieval accuracy? A: Experiment with chunk sizes, use hybrid search (keyword plus vector), adjust similarity thresholds, and evaluate with retrieval benchmarks.