LlamaIndex Complete Guide — Build RAG Apps
Introduction
LlamaIndex is one of the most widely used frameworks for building Retrieval-Augmented Generation (RAG) systems. It handles the complexity of ingesting, indexing, and retrieving documents at scale. This guide takes you from setup to production-ready systems.
- What is LlamaIndex?
- Installation and Setup
- Loading and Indexing Documents
- Simple Vector Index
- Loading from Persistence
- Advanced Indexing Strategies
- Hierarchical Index
- Keyword Index with Hybrid Search
- Query Engines
- Query Engine with Customization
- Chat Engine for Conversational RAG
- Advanced Document Processing
- Custom Metadata Extraction
- Multi-Document Agents
- Sub-Question Query Engine
- Best Practices
- Conclusion
- FAQ
What is LlamaIndex?
LlamaIndex bridges your documents and language models. It ingests unstructured data (PDFs, web pages, databases), indexes it intelligently, and enables semantic search with LLM-powered question answering.
Core concepts (a minimal example follows the list):
- Documents: Raw text data
- Nodes: Chunks of documents with metadata
- Embeddings: Vector representations of text
- Indexes: Data structures for efficient retrieval
- Query Engines: Interface for asking questions
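To make these concrete, here is a small sketch of building a Document and splitting it into Nodes by hand (the text and metadata are placeholders):
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
# A Document is raw text plus optional metadata
doc = Document(
    text="LlamaIndex turns documents into queryable indexes.",
    metadata={"source": "handbook"},  # illustrative metadata
)
# Nodes are chunks of that document, produced by a splitter
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents([doc])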
Installation and Setup
pip install llama-index llama-index-embeddings-openai llama-index-llms-openai
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure defaults
Settings.llm = OpenAI(model="gpt-4", temperature=0.7)
Settings.embed_model = OpenAIEmbedding()
Loading and Indexing Documents
Simple Vector Index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# Load all documents from directory
documents = SimpleDirectoryReader("./data").load_data()
# Create index (automatically chunks and embeds)
index = VectorStoreIndex.from_documents(documents)
# Save for later
index.storage_context.persist(persist_dir="./storage")
Loading from Persistence
from llama_index.core import StorageContext, load_index_from_storage
storage_context = StorageContext.from_defaults(
    persist_dir="./storage"
)
index = load_index_from_storage(storage_context)
Advanced Indexing Strategies
Hierarchical Index
Useful when you want both targeted retrieval and high-level summaries over the same documents (note that ComposableGraph is one of LlamaIndex's older composability APIs):
from llama_index.core.indices.composability import ComposableGraph
from llama_index.core import VectorStoreIndex, SummaryIndex
# Create indexes at different levels
vector_index = VectorStoreIndex.from_documents(documents)
summary_index = SummaryIndex.from_documents(documents)
# Route queries intelligently
graph = ComposableGraph.from_indices(
    SummaryIndex,
    [vector_index, summary_index],
    # Summaries describe each child index so queries route correctly
    index_summaries=[
        "Useful for finding specific facts in the documents",
        "Useful for high-level summaries of the documents"
    ]
)
query_engine = graph.as_query_engine()
response = query_engine.query("Find specific information")
Keyword Index with Hybrid Search
from llama_index.core import KeywordTableIndex, VectorStoreIndex
# Create both keyword and vector indexes
keyword_index = KeywordTableIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)
# Use keyword retrieval for precise matching
keyword_engine = keyword_index.as_query_engine()
vector_engine = vector_index.as_query_engine()
# Query each engine separately; the results can be fused (see below)
keyword_result = keyword_engine.query("technical term")
vector_result = vector_engine.query("technical term")
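Rather than merging the two responses by hand, recent LlamaIndex versions also ship a QueryFusionRetriever that fuses rankings from multiple retrievers. A minimal sketch (the top-k and mode values are illustrative):
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
# Fuse keyword and vector retrievers with reciprocal-rank fusion
fusion_retriever = QueryFusionRetriever(
    [keyword_index.as_retriever(), vector_index.as_retriever()],
    similarity_top_k=5,
    num_queries=1,  # disable LLM-based query rewriting
    mode="reciprocal_rerank",
)
hybrid_engine = RetrieverQueryEngine.from_args(fusion_retriever)
response = hybrid_engine.query("technical term")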
Query Engines
Query engines convert natural language to structured retrieval and generation:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
# Default: retrieves chunks and generates answer
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic?")
print(f"Answer: {response}")
print(f"Retrieved nodes: {response.source_nodes}")
Query Engine with Customization
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
query_engine = RetrieverQueryEngine(
    retriever=retriever  # omitting response_synthesizer falls back to the default
)
response = query_engine.query("Question?")
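To swap in a non-default synthesizer, build one explicitly. A sketch using the tree_summarize response mode, which summarizes retrieved chunks bottom-up:
from llama_index.core import get_response_synthesizer
synthesizer = get_response_synthesizer(response_mode="tree_summarize")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
)
response = query_engine.query("Question?")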
Chat Engine for Conversational RAG
# index.as_chat_engine() needs no extra imports
chat_engine = index.as_chat_engine()
# Multi-turn conversation
response1 = chat_engine.chat("What is the main topic?")
response2 = chat_engine.chat("Tell me more about the subtopic")
response3 = chat_engine.chat("How does it relate to X?")
# Chat engine maintains conversation context
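as_chat_engine() also accepts a chat_mode argument; condense_plus_context is a common choice for conversational RAG because it rewrites follow-ups into standalone queries before retrieving. A brief sketch:
# Condense follow-up questions, then retrieve fresh context each turn
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")
print(chat_engine.chat("What is the main topic?"))
chat_engine.reset()  # clear conversation history when starting over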
Advanced Document Processing
Custom Metadata Extraction
from llama_index.core.extractors import (
    TitleExtractor,
    KeywordExtractor,
    QuestionsAnsweredExtractor
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
# Build processing pipeline (the parameter is `transformations`)
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),  # chunk documents into nodes first
        TitleExtractor(),
        KeywordExtractor(keywords=5),
        QuestionsAnsweredExtractor(questions=3)
    ]
)
# Process documents
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes)
Multi-Document Agents
from llama_index.core import VectorStoreIndex
from llama_index.core.tools import QueryEngineTool
from llama_index.core.agent import ReActAgent
# Create indexes for multiple documents
pdf_index = VectorStoreIndex.from_documents(pdf_docs)
csv_index = VectorStoreIndex.from_documents(csv_docs)
# Create tools
pdf_tool = QueryEngineTool.from_defaults(
    query_engine=pdf_index.as_query_engine(),
    name="pdf_search",
    description="Search through PDF documents"
)
csv_tool = QueryEngineTool.from_defaults(
    query_engine=csv_index.as_query_engine(),
    name="csv_search",
    description="Search through CSV data"
)
# Create agent
agent = ReActAgent.from_tools([pdf_tool, csv_tool], verbose=True)
response = agent.chat("Find information across both sources")
Sub-Question Query Engine
For complex questions requiring multiple sub-queries:
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool
# Wrap each engine in a described tool; the built-in question generator
# uses the descriptions to write the sub-questions
tools = [
    QueryEngineTool.from_defaults(query_engine=index1.as_query_engine(),
                                  name="source_one", description="First document set"),
    QueryEngineTool.from_defaults(query_engine=index2.as_query_engine(),
                                  name="source_two", description="Second document set")
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query("Complex multi-faceted question")
Best Practices
- Chunk Wisely: Experiment with chunk sizes (512-2048 tokens)
- Use Metadata Filters: Filter documents by date or category before retrieval (see the sketch after this list)
- Monitor Costs: Track embedding and LLM API usage
- Evaluate Quality: Use metrics to assess retrieval effectiveness
- Persist Indexes: Save indexes to avoid re-indexing
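For the metadata-filter tip above, a minimal sketch (the key/value pair is illustrative):
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
# Restrict retrieval to matching nodes before vector search runs
filters = MetadataFilters(filters=[ExactMatchFilter(key="category", value="finance")])
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("Summarize the finance documents")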
Conclusion
LlamaIndex transforms documents into queryable knowledge bases. With proper indexing strategies and query engines, you can build RAG systems that rival specialized search solutions. Its flexibility accommodates simple use cases and scales to enterprise complexity.
FAQ
Q: How do I handle large document collections? A: Use external vector stores (Pinecone, Weaviate) instead of in-memory storage. LlamaIndex integrates with major providers.
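As a sketch of wiring in an external store (Chroma here; assumes pip install llama-index-vector-stores-chroma chromadb):
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
# Embeddings live in Chroma rather than in-process memory
client = chromadb.PersistentClient(path="./chroma_db")
vector_store = ChromaVectorStore(chroma_collection=client.get_or_create_collection("docs"))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)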
Q: Can I update indexes without re-indexing everything? A: Yes. LlamaIndex supports incremental indexing for adding documents without full reprocessing.
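For example, inserting one new document into an existing index:
from llama_index.core import Document
# Chunks, embeds, and adds the document without rebuilding the index
index.insert(Document(text="Newly added content."))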
Q: How do I improve retrieval accuracy? A: Experiment with chunk sizes, use hybrid search (keyword plus vector), adjust similarity thresholds, and evaluate with retrieval benchmarks.