LlamaIndex Complete Guide — Build RAG Apps

Sanjeev Sharma
4 min read


Introduction

LlamaIndex is the go-to framework for building Retrieval-Augmented Generation (RAG) systems. It handles the complexity of ingesting, indexing, and retrieving documents at scale. This guide takes you from setup to production-ready systems.

What is LlamaIndex?

LlamaIndex bridges your documents and language models. It ingests unstructured data (PDFs, web pages, databases), indexes it intelligently, and enables semantic search with LLM-powered question answering.

Core concepts:

  • Documents: Raw text data
  • Nodes: Chunks of documents with metadata
  • Embeddings: Vector representations of text
  • Indexes: Data structures for efficient retrieval
  • Query Engines: Interface for asking questions
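These pieces fit together in one loop: documents are split into nodes, nodes are embedded as vectors, and a query is answered by retrieving the nodes whose vectors are closest to the query vector. A minimal, dependency-free sketch of that retrieval step (the bag-of-words `embed` here is a toy stand-in for a real embedding model):

```python
import math

def embed(text: str) -> dict[str, float]:
    # Toy embedding: bag-of-words counts. A real pipeline would call
    # an embedding model (e.g. OpenAIEmbedding) here instead.
    counts: dict[str, float] = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0.0) + 1.0
    return counts

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Nodes": document chunks stored alongside their embeddings (the "index")
nodes = [
    "LlamaIndex is a framework for retrieval augmented generation",
    "Bananas are yellow tropical fruit",
]
index = [(node, embed(node)) for node in nodes]

def retrieve(query: str, top_k: int = 1) -> list[str]:
    # Rank all nodes by similarity to the query and keep the top_k
    qvec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(qvec, pair[1]), reverse=True)
    return [node for node, _ in ranked[:top_k]]

print(retrieve("what is retrieval augmented generation"))
```

A query engine adds one more step on top of this: it feeds the retrieved nodes to the LLM as context for answer generation.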

Installation and Setup

Install the core package plus the OpenAI LLM and embedding integrations:

pip install llama-index llama-index-embeddings-openai llama-index-llms-openai

Then configure global defaults in Python:

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure defaults used by every index and query engine
Settings.llm = OpenAI(model="gpt-4", temperature=0.7)
Settings.embed_model = OpenAIEmbedding()

Loading and Indexing Documents

Simple Vector Index

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load all documents from directory
documents = SimpleDirectoryReader("./data").load_data()

# Create index (automatically chunks and embeds)
index = VectorStoreIndex.from_documents(documents)

# Save for later
index.storage_context.persist(persist_dir="./storage")

Loading from Persistence

from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(
    persist_dir="./storage"
)
index = load_index_from_storage(storage_context)

Advanced Indexing Strategies

Hierarchical Index

Useful for structured documents where some questions need specific facts and others need high-level overviews:

from llama_index.core.indices.composability import ComposableGraph
from llama_index.core import VectorStoreIndex, SummaryIndex

# Create indexes at different levels
vector_index = VectorStoreIndex.from_documents(documents)
summary_index = SummaryIndex.from_documents(documents)

# Compose the indexes; a query is routed to the child index
# whose summary best matches it. (ComposableGraph is a legacy API;
# RouterQueryEngine is the newer alternative.)
graph = ComposableGraph.from_indices(
    SummaryIndex,
    [vector_index, summary_index],
    index_summaries=[
        "Useful for retrieving specific facts from the documents",
        "Useful for high-level summaries of the documents",
    ],
)

query_engine = graph.as_query_engine()
response = query_engine.query("Find specific information")

Keyword Plus Vector Index

Combining keyword and vector retrieval covers both exact-term and semantic matches:

from llama_index.core import KeywordTableIndex, VectorStoreIndex

# Create both keyword and vector indexes
keyword_index = KeywordTableIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)

# Use keyword retrieval for precise matching
keyword_engine = keyword_index.as_query_engine()
vector_engine = vector_index.as_query_engine()

# Combine results
keyword_result = keyword_engine.query("technical term")
vector_result = vector_engine.query("technical term")

Query Engines

Query engines turn a natural-language question into retrieval followed by answer generation:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

# Default: retrieves chunks and generates answer
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic?")

print(f"Answer: {response}")
print(f"Retrieved nodes: {response.source_nodes}")

Query Engine with Customization

from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

retriever = VectorIndexRetriever(index=index, similarity_top_k=5)  # fetch top 5 chunks

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=None  # Use default
)

response = query_engine.query("Question?")

Chat Engine for Conversational RAG

# as_chat_engine() wraps the index with conversation memory
chat_engine = index.as_chat_engine()

# Multi-turn conversation
response1 = chat_engine.chat("What is the main topic?")
response2 = chat_engine.chat("Tell me more about the subtopic")
response3 = chat_engine.chat("How does it relate to X?")

# Chat engine maintains conversation context
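Under the hood, a condense-question chat engine keeps the turn history and rewrites each follow-up into a standalone query before retrieval. A minimal sketch of that bookkeeping (the `condense` method here is a stand-in for the LLM rewriting step; `ToyChatMemory` is a hypothetical name, not a LlamaIndex class):

```python
class ToyChatMemory:
    """Keeps (role, message) turns, like a chat engine's memory buffer."""

    def __init__(self) -> None:
        self.history: list[tuple[str, str]] = []

    def add(self, role: str, message: str) -> None:
        self.history.append((role, message))

    def condense(self, follow_up: str) -> str:
        # Stand-in for the LLM "condense question" step: a real engine
        # asks the LLM to merge history + follow-up into one query.
        context = " | ".join(msg for _, msg in self.history)
        return f"Given [{context}], answer: {follow_up}"

memory = ToyChatMemory()
memory.add("user", "What is the main topic?")
memory.add("assistant", "The document is about LlamaIndex.")
print(memory.condense("Tell me more about the subtopic"))
```

The condensed query, not the bare follow-up, is what gets embedded and retrieved against, which is why follow-ups like "tell me more" still hit the right chunks.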

Advanced Document Processing

Custom Metadata Extraction

from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Build processing pipeline: split into nodes first, then enrich metadata
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        TitleExtractor(),
        KeywordExtractor(keywords=5),
        QuestionsAnsweredExtractor(questions=3)
    ]
)

# Process documents
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes)

Multi-Document Agents

from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import ReActAgent

# Create indexes for multiple documents
pdf_index = VectorStoreIndex.from_documents(pdf_docs)
csv_index = VectorStoreIndex.from_documents(csv_docs)

# Create tools (from_defaults wraps a query engine as an agent tool)
pdf_tool = QueryEngineTool.from_defaults(
    pdf_index.as_query_engine(),
    name="pdf_search",
    description="Search through PDF documents"
)

csv_tool = QueryEngineTool.from_defaults(
    csv_index.as_query_engine(),
    name="csv_search",
    description="Search through CSV data"
)

# Create agent
agent = ReActAgent.from_tools([pdf_tool, csv_tool], verbose=True)

response = agent.chat("Find information across both sources")

Sub-Question Query Engine

For complex questions requiring multiple sub-queries:

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# Wrap each engine as a tool so generated sub-questions can be routed to it
tools = [
    QueryEngineTool.from_defaults(index1.as_query_engine(), name="source_1", description="First data source"),
    QueryEngineTool.from_defaults(index2.as_query_engine(), name="source_2", description="Second data source"),
]

sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)

response = sub_question_engine.query("Complex multi-faceted question")

Best Practices

  1. Chunk Wisely: Experiment with chunk sizes (512-2048 tokens)
  2. Use Metadata Filters: Filter documents by date, category before retrieval
  3. Monitor Costs: Track embedding and LLM API usage
  4. Evaluate Quality: Use metrics to assess retrieval effectiveness
  5. Persist Indexes: Save indexes to avoid re-indexing
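Practice 1 above is easy to prototype. A sliding-window chunker (word-based here for simplicity; LlamaIndex's own splitters work on tokens and sentences) makes the size/overlap trade-off concrete:

```python
def chunk_words(text: str, chunk_size: int = 5, overlap: int = 2) -> list[str]:
    # Slide a window of chunk_size words, stepping by chunk_size - overlap,
    # so each chunk shares `overlap` words with the previous one.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = "one two three four five six seven eight nine"
print(chunk_words(text))
# → ['one two three four five', 'four five six seven eight', 'seven eight nine']
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from both sides, at the cost of storing and embedding some words twice.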

Conclusion

LlamaIndex transforms documents into queryable knowledge bases. With proper indexing strategies and query engines, you can build RAG systems that rival specialized search solutions. Its flexibility accommodates simple use cases and scales to enterprise complexity.

FAQ

Q: How do I handle large document collections? A: Use external vector stores (Pinecone, Weaviate) instead of in-memory storage. LlamaIndex integrates with major providers.

Q: Can I update indexes without re-indexing everything? A: Yes. LlamaIndex supports incremental indexing for adding documents without full reprocessing.

Q: How do I improve retrieval accuracy? A: Experiment with chunk sizes, use hybrid search (keyword plus vector), adjust similarity thresholds, and evaluate with retrieval benchmarks.
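The hybrid-search advice in the last answer is commonly implemented with reciprocal rank fusion (RRF): each document's fused score is the sum of 1/(k + rank) over the keyword and vector result lists. A minimal sketch with hypothetical ranked lists:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Standard RRF: score(doc) = sum over lists of 1 / (k + rank),
    # where rank is the 1-based position of doc in that list.
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_c", "doc_b"]   # from a keyword index
vector_hits = ["doc_b", "doc_a", "doc_d"]    # from a vector index
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# → ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Documents ranked well by both retrievers float to the top without needing to reconcile the two retrievers' incomparable raw scores; the constant k damps the influence of any single top rank.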


Written by

Sanjeev Sharma

Full Stack Engineer · E-mopro