Build a RAG Application from Scratch in 2026
Retrieval-Augmented Generation (RAG) is how you make LLMs answer questions about YOUR data — not just their training data. In this guide, you'll build a production-ready RAG pipeline in Python.
- What is RAG?
- Setup
- Step 1: Load and Chunk Documents
- Step 2: Create Embeddings and Vector Store
- Step 3: Build the Retriever
- Step 4: Build the RAG Chain
- Step 5: Add Reranking (Production Must-Have)
- Step 6: FastAPI Endpoint
- RAG Evaluation Checklist
- Common RAG Failures and Fixes
What is RAG?
RAG = Retrieve relevant documents → Augment the LLM prompt with them → Generate a grounded answer.
Without RAG, GPT-4o can only answer from what it learned before its training cutoff. With RAG, it answers from your docs, PDFs, databases, and real-time data.
User Query
↓
Embed query → vector
↓
Search vector DB for similar chunks
↓
Retrieve top-K relevant chunks
↓
Stuff chunks into LLM prompt
↓
LLM generates grounded answer
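Here is the same loop as plain Python before we bring in LangChain. Note that embed(), vector_db.search(), and llm.chat() are hypothetical placeholders, not real APIs; the real equivalents follow in the steps below.

def answer(question: str) -> str:
    # 1. Embed the query (hypothetical embed())
    query_vector = embed(question)
    # 2. Retrieve the most similar chunks (hypothetical vector_db)
    chunks = vector_db.search(query_vector, k=4)
    # 3. Augment the prompt with the retrieved context
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 4. Generate a grounded answer (hypothetical llm)
    return llm.chat(prompt)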
Setup
pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"  # fine for demos; never hardcode keys in production
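For anything beyond a quick demo, keep the key out of source control and load it from the environment instead. A common pattern, assuming a local .env file and the python-dotenv package (not in the install line above):

# pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()  # reads OPENAI_API_KEY from a .env file into the environment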
Step 1: Load and Chunk Documents
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load PDFs from a directory
loader = DirectoryLoader('./docs', glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()
# Smart chunking — overlap keeps context across chunk boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", "!", "?", " "],
    length_function=len,
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} documents")
Chunking tips:
- 500–1000 tokens per chunk works for most use cases; note that chunk_size above counts characters, since length_function=len (see the token-based variant after this list)
- 20% overlap prevents losing context at boundaries
- Smaller chunks = more precise retrieval, less context per chunk
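The tiktoken dependency in the install line is for token-based splitting: if you want chunk_size measured in tokens rather than characters, RecursiveCharacterTextSplitter has a tiktoken-backed constructor. A minimal sketch (the chunk sizes here are illustrative):

# Split by token count instead of character count
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer family used by recent OpenAI models
    chunk_size=800,               # now measured in tokens
    chunk_overlap=160,            # ~20% overlap
)
token_chunks = token_splitter.split_documents(docs)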
Step 2: Create Embeddings and Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# Create and persist vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs"
)
# Load existing store later
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_docs"
)
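Before wiring up a full chain, run a quick sanity check; similarity_search returns the raw nearest-neighbor chunks (the query here is illustrative):

# Do the nearest chunks look relevant to the query?
hits = vectorstore.similarity_search("What is the refund policy?", k=3)
for hit in hits:
    print(hit.metadata.get("source", "unknown"), "->", hit.page_content[:100])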
Step 3: Build the Retriever
# MMR retriever: fetch 20 candidates, keep the 6 most relevant yet diverse
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance — diverse results
    search_kwargs={
        "k": 6,              # Return top 6 chunks
        "fetch_k": 20,       # Fetch 20, then pick 6 most diverse
        "lambda_mult": 0.7,  # 1.0 = pure similarity, 0.0 = pure diversity
    }
)
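Retrievers are LangChain Runnables, so you can invoke one directly to inspect exactly what the LLM will see:

docs = retriever.invoke("What is the refund policy?")
for d in docs:
    print(d.metadata.get("source", "unknown"), "->", d.page_content[:80])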
Step 4: Build the RAG Chain
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt_template = """You are a helpful assistant. Use the following context to answer the question.
If you don't know the answer based on the context, say "I don't have enough information."
Do not make up answers.
Context:
{context}
Question: {question}
Answer:"""
prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
# Query it
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f" - {doc.metadata.get('source', 'unknown')}, page {doc.metadata.get('page', '?')}")
Step 5: Add Reranking (Production Must-Have)
Basic retrieval returns the most similar chunks, not necessarily the most relevant. Reranking fixes this:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Cross-encoder reranker (free, runs locally; requires: pip install sentence-transformers)
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)
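The reranked retriever only helps if the chain actually uses it, so rebuild the chain with compression_retriever in place of the basic one:

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,  # rerank before the chunks reach the LLM
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)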
Step 6: FastAPI Endpoint
# pip install fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

@app.post("/ask")
async def ask(req: QueryRequest):
    # Use the async API so a slow LLM call doesn't block the event loop
    result = await qa_chain.ainvoke({"query": req.question})
    return {
        "answer": result["result"],
        "sources": [
            {"file": doc.metadata.get("source"), "page": doc.metadata.get("page")}
            for doc in result["source_documents"]
        ]
    }
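Run it with uvicorn and test with curl (assuming the code above lives in main.py):

uvicorn main:app --reload
curl -X POST http://localhost:8000/ask -H "Content-Type: application/json" -d '{"question": "What is the refund policy?"}'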
RAG Evaluation Checklist
- Faithfulness: Does the answer match the retrieved context?
- Answer Relevance: Does the answer address the question?
- Context Precision: Were the right chunks retrieved?
- Context Recall: Were all relevant chunks retrieved?
Use the RAGAS library for automated RAG evaluation:
pip install ragas
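A minimal sketch against the RAGAS metrics named above; the sample row is made up purely to show the expected dataset shape (question, answer, retrieved contexts, and a ground-truth reference):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One evaluation row; in practice, collect many from your QA chain
data = {
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are available within 30 days of purchase."],
    "contexts": [["Our refund policy allows returns within 30 days of purchase."]],
    "ground_truth": ["Customers can get a refund within 30 days."],
}
scores = evaluate(Dataset.from_dict(data),
                  metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)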
Common RAG Failures and Fixes
| Problem | Cause | Fix |
|---|---|---|
| Hallucination | LLM ignores retrieved context | Stricter prompt + temperature=0 |
| Wrong chunks retrieved | Poor chunking | Hierarchical chunking + metadata filtering |
| Slow responses | Large context window | Reranking to reduce top-K |
| Out-of-date answers | Stale vector store | Incremental indexing pipeline |