Build a RAG Application with LangChain and OpenAI — Complete Guide 2026

Sanjeev Sharma
4 min read


Build a RAG Application from Scratch in 2026

Retrieval-Augmented Generation (RAG) is how you make LLMs answer questions about YOUR data — not just their training data. In this guide, you'll build a production-ready RAG pipeline in Python.

What is RAG?

RAG = Retrieve relevant documents → Augment the LLM prompt with them → Generate a grounded answer.

Without RAG, GPT-4o can only answer from what it learned before its training cutoff. With RAG, it answers from your docs, PDFs, databases, and real-time data.

User query → embed the query into a vector → search the vector DB for similar chunks → retrieve the top-K chunks → stuff them into the LLM prompt → the LLM generates a grounded answer

Setup

pip install langchain langchain-openai langchain-community chromadb pypdf tiktoken

import os
os.environ["OPENAI_API_KEY"] = "your-key-here"  # in production, load this from the environment instead of hardcoding it

Step 1: Load and Chunk Documents

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load PDFs from a directory
loader = DirectoryLoader('./docs', glob="**/*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()

# Smart chunking — overlap keeps context across chunk boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", "!", "?", " "],
    length_function=len,
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} documents")

Chunking tips:

  • 500–1000 characters per chunk works for most use cases (chunk_size above counts characters, since length_function=len — pass a tokenizer to count tokens instead)
  • ~20% overlap prevents losing context at boundaries
  • Smaller chunks = more precise retrieval, but less context per chunk

Step 2: Create Embeddings and Vector Store

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Create and persist vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs"
)

# Load existing store later
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_docs"
)

Step 3: Build the Retriever

# MMR retriever — returns a diverse top 6, not just the 6 most similar
retriever = vectorstore.as_retriever(
    search_type="mmr",          # Maximum Marginal Relevance — diverse results
    search_kwargs={
        "k": 6,                 # Return top 6 chunks
        "fetch_k": 20,          # Fetch 20, then pick 6 most diverse
        "lambda_mult": 0.7,     # 1.0 = pure similarity, 0.0 = pure diversity
    }
)

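What lambda_mult trades off can be shown with a tiny self-contained MMR implementation (illustrative only — not Chroma's actual code). MMR greedily scores each candidate as λ·(similarity to query) − (1−λ)·(similarity to the closest already-selected doc):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr(query_vec, doc_vecs, k, lambda_mult):
    """Greedily pick k docs, balancing query similarity against diversity."""
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            sim_to_query = cosine(query_vec, doc_vecs[i])
            sim_to_selected = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * sim_to_query - (1 - lambda_mult) * sim_to_selected
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # two near-duplicates + one distinct doc

print(mmr(query, docs, k=2, lambda_mult=1.0))  # → [0, 1]: pure similarity picks the near-duplicates
print(mmr(query, docs, k=2, lambda_mult=0.3))  # → [0, 2]: diversity weighting picks the distinct doc
```

This is why fetch_k matters: MMR can only diversify among the candidates it fetched, so it pulls 20 and keeps the best 6.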
Step 4: Build the RAG Chain

from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt_template = """You are a helpful assistant. Use the following context to answer the question.
If you don't know the answer based on the context, say "I don't have enough information."
Do not make up answers.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

# Query it
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata.get('source', 'unknown')}, page {doc.metadata.get('page', '?')}")

Step 5: Add Reranking (Production Must-Have)

Basic retrieval returns the most similar chunks, not necessarily the most relevant. Reranking fixes this:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# CrossEncoder reranker (free, runs locally)
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

# Rebuild the chain so queries actually go through the reranker
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

Step 6: FastAPI Endpoint

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

@app.post("/ask")
async def ask(req: QueryRequest):
    result = qa_chain.invoke({"query": req.question})
    return {
        "answer": result["result"],
        "sources": [
            {"file": doc.metadata.get("source"), "page": doc.metadata.get("page")}
            for doc in result["source_documents"]
        ]
    }

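Assuming the code above lives in a file named main.py (the filename is illustrative), you can serve and query it like this:

```shell
# Start the server (requires: pip install fastapi uvicorn)
uvicorn main:app --reload --port 8000

# In another terminal, ask a question
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the refund policy?"}'
```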
RAG Evaluation Checklist

  • Faithfulness: Does the answer match the retrieved context?
  • Answer Relevance: Does the answer address the question?
  • Context Precision: Were the right chunks retrieved?
  • Context Recall: Were all relevant chunks retrieved?

Use RAGAS library for automated RAG evaluation:

pip install ragas

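A minimal evaluation run looks roughly like this — a sketch against the ragas 0.1-era API, so check the current docs before copying (the interface changes between releases, metric names here are assumptions, and scoring makes OpenAI API calls):

```python
# Sketch only: ragas evaluation of a single Q/A pair.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

eval_data = Dataset.from_dict({
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are issued within 30 days."],                 # your chain's output
    "contexts": [["Refunds are issued within 30 days of purchase."]], # retrieved chunks
    "ground_truth": ["Refunds are available for 30 days after purchase."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)
```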
Common RAG Failures and Fixes

Problem                | Cause                         | Fix
Hallucination          | LLM ignores retrieved context | Stricter prompt + temperature=0
Wrong chunks retrieved | Poor chunking                 | Hierarchical chunking + metadata filtering
Slow responses         | Large context window          | Reranking to reduce top-K
Out-of-date answers    | Stale vector store            | Incremental indexing pipeline


Written by Sanjeev Sharma
Full Stack Engineer · E-mopro