Build a Production RAG Chatbot in Python: From Zero to Deployed in 2 Hours
Stop watching tutorials. Build a real RAG chatbot with LangChain, ChromaDB, and Claude that answers questions from your own documents. Complete code included.
RAG (Retrieval-Augmented Generation) is the most practical AI pattern for businesses. It lets you build a chatbot that answers questions from YOUR data (company docs, product manuals, research papers, whatever) without fine-tuning a model. In this build, your documents are embedded and stored locally; only the excerpts retrieved for each question are sent to the LLM API.
This tutorial builds a production-quality RAG chatbot from scratch. Not a toy demo. A real system with proper chunking, hybrid search, conversation memory, and source citations. By the end, you’ll have something you can actually deploy.
What We’re Building
A chatbot that:
- Ingests PDF, Markdown, and text documents
- Chunks and embeds them into a vector database
- Retrieves relevant context for each user query
- Generates accurate answers with source citations
- Maintains conversation history
- Runs as a web app you can deploy
Tech stack: Python 3.11+, LangChain, ChromaDB, Claude API, and FastAPI.
Prerequisites
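# Create project directory
mkdir rag-chatbot && cd rag-chatbot
python -m venv venv
source venv/bin/activate
# Install dependencies
# langchain-community provides the document loaders and BM25Retriever,
# rank_bm25 backs BM25Retriever, and sentence-transformers backs HuggingFaceEmbeddings.
# The legacy imports used below (langchain.chains, langchain.memory) assume a pre-1.0 LangChain release.
pip install "langchain<1.0" langchain-community langchain-anthropic langchain-chroma \
    langchain-huggingface sentence-transformers chromadb fastembed rank_bm25 \
    fastapi uvicorn python-multipart pypdf \
    unstructured markdown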
You’ll need an Anthropic API key. Set it as an environment variable:
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
Step 1: Document Loading and Chunking
The quality of your RAG system depends heavily on how you chunk your documents. Chunk too small, you lose context. Chunk too large, you dilute relevant information with noise.
# src/ingestion.py
from pathlib import Path

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredMarkdownLoader,
    TextLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter


def load_documents(docs_dir: str) -> list:
    """Load documents from a directory, handling multiple file types."""
    documents = []
    docs_path = Path(docs_dir)

    # PDF files (PyPDFLoader returns one Document per page)
    for pdf_file in docs_path.glob("**/*.pdf"):
        loader = PyPDFLoader(str(pdf_file))
        documents.extend(loader.load())

    # Markdown files
    for md_file in docs_path.glob("**/*.md"):
        loader = UnstructuredMarkdownLoader(str(md_file))
        documents.extend(loader.load())

    # Text files (explicit encoding avoids platform-dependent decoding errors)
    for txt_file in docs_path.glob("**/*.txt"):
        loader = TextLoader(str(txt_file), encoding="utf-8")
        documents.extend(loader.load())

    print(f"Loaded {len(documents)} document pages")
    return documents


def chunk_documents(documents: list, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split documents into chunks with overlap for context preservation."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # measured in characters because length_function=len
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],  # tried in order: paragraphs, lines, sentences, words
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    return chunks
Why These Settings?
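- chunk_size=1000: Measured in characters (length_function=len), so roughly 150 to 200 English words, or about 250 tokens. Large enough to contain a complete thought, small enough for precise retrieval.
- chunk_overlap=200: A 20% overlap reduces the chance that information is lost at chunk boundaries. If an important passage spans two chunks, the shared 200 characters give each chunk some of the surrounding context.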
- RecursiveCharacterTextSplitter: Tries to split at paragraph boundaries first, then sentences, then words. This produces more coherent chunks than naive fixed-size splitting.
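To see what chunk_size and chunk_overlap do in practice, here is a minimal pure-Python sketch of fixed-size windowing with overlap. It is a simplification for illustration, not how RecursiveCharacterTextSplitter works internally (that splitter also respects the separators), and `sliding_chunks` is a hypothetical helper name.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A synthetic 2,500-character document (invented data for the demo)
doc = "".join(chr(ord("a") + i % 26) for i in range(2500))
pieces = sliding_chunks(doc)

# The last 200 characters of one chunk are the first 200 of the next
shared = pieces[0][-200:] == pieces[1][:200]
print(len(pieces), [len(p) for p in pieces], shared)
```

With these defaults each window advances 800 characters, so a passage that straddles a boundary still appears intact in at least one chunk as long as it is shorter than the overlap.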
Step 2: Embedding and Vector Storage
# src/vectorstore.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Use a high-quality, free embedding model
EMBEDDING_MODEL = "BAAI/bge-base-en-v1.5"
PERSIST_DIR = "./chroma_db"


def get_embeddings():
    """Initialize the embedding model."""
    return HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL,
        model_kwargs={"device": "cpu"},  # Use "cuda" if you have a GPU
        encode_kwargs={"normalize_embeddings": True},
    )


def create_vectorstore(chunks: list) -> Chroma:
    """Create a ChromaDB vector store from document chunks."""
    embeddings = get_embeddings()

    # Drop any previously persisted collection first. Otherwise every rebuild
    # appends the same chunks again and retrieval starts returning duplicates.
    Chroma(persist_directory=PERSIST_DIR, embedding_function=embeddings).delete_collection()

    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
        collection_metadata={"hnsw:space": "cosine"},
    )
    print(f"Created vector store with {vectorstore._collection.count()} vectors")
    return vectorstore


def load_vectorstore() -> Chroma:
    """Load an existing ChromaDB vector store."""
    embeddings = get_embeddings()
    return Chroma(
        persist_directory=PERSIST_DIR,
        embedding_function=embeddings,
    )
Why BGE-base?
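BAAI/bge-base-en-v1.5 is an embedding model with roughly 110M parameters that ranked near the top of the MTEB benchmark for its size when it was released. It’s free, runs locally, and produces 768-dimensional embeddings. On many English retrieval tasks it is competitive with hosted models such as OpenAI’s text-embedding-3-small while costing nothing per request, though you should confirm that on your own documents with the evaluation in Step 7.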
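The vector store above sets normalize_embeddings=True and uses cosine distance. The sketch below shows why those two settings fit together: once vectors are scaled to unit length, cosine similarity reduces to a plain dot product. The 3-dimensional vectors are toy values invented for the demo, not real model output.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0, 0.0]   # toy "query embedding"
chunk = [4.0, 3.0, 0.0]   # toy "chunk embedding"

raw_cosine = cosine_similarity(query, chunk)
normalized_dot = dot(normalize(query), normalize(chunk))
print(round(raw_cosine, 6), round(normalized_dot, 6))
```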
Step 3: The Retrieval Chain
This is where most RAG tutorials fall short. Basic retrieval — “find the 4 most similar chunks” — works for simple questions but fails on complex ones. We’ll implement a hybrid approach.
# src/retriever.py
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma


def create_hybrid_retriever(vectorstore: Chroma, chunks: list, k: int = 4):
    """Create a hybrid retriever combining semantic search and BM25."""
    # Semantic search (vector similarity)
    semantic_retriever = vectorstore.as_retriever(
        search_type="mmr",  # Maximum Marginal Relevance for diversity
        search_kwargs={
            "k": k,
            "fetch_k": k * 3,  # Fetch more, then diversify
            "lambda_mult": 0.7,  # Balance between relevance and diversity
        },
    )

    # Keyword search (BM25); requires the rank_bm25 package
    bm25_retriever = BM25Retriever.from_documents(chunks, k=k)

    # Combine the two rankings with unequal weights
    ensemble_retriever = EnsembleRetriever(
        retrievers=[semantic_retriever, bm25_retriever],
        weights=[0.6, 0.4],  # Slightly favor semantic search
    )
    return ensemble_retriever
Why Hybrid Search?
Semantic search finds conceptually similar content (“machine learning” matches “neural network training”). BM25 finds exact keyword matches (“error code 0x4F2A” matches documents containing that exact string). Real-world queries need both:
| Query Type | Best Retriever |
|---|---|
| “How does authentication work?” | Semantic |
| “Error code AUTH_FAILED_403” | BM25 |
| “What’s the rate limit for the API?” | Both (semantic for concept, BM25 for “rate limit”) |
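To make the weighting concrete, here is a small sketch of weighted Reciprocal Rank Fusion, the general technique LangChain documents for EnsembleRetriever. This is an illustration of the idea rather than the library's code: the constant 60 is the value commonly used for RRF, and the document IDs and rankings are invented.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists: each hit contributes weight / (rank + c)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for one query from each retriever
semantic = ["auth-overview", "oauth-flow", "error-codes"]
bm25 = ["error-codes", "auth-overview", "changelog"]

fused = weighted_rrf([semantic, bm25], weights=[0.6, 0.4])
print(fused)
```

Documents that both retrievers return rise to the top, while a document found by only one retriever still survives in the fused list with a lower score.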
Step 4: The Conversation Chain
# src/chain.py
from langchain_anthropic import ChatAnthropic
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

SYSTEM_PROMPT = """You are a helpful assistant that answers questions based on the provided context.

Rules:
1. ONLY answer based on the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer that question."
2. Always cite your sources by mentioning the document name and relevant section.
3. Be specific and precise. Include numbers, dates, and exact terms from the context.
4. If the question is ambiguous, ask for clarification.
5. Format your response with markdown for readability.

Context:
{context}

Chat History:
{chat_history}

Question: {question}

Answer:"""


def create_chain(retriever):
    """Create the conversational RAG chain."""
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        temperature=0.1,  # Low temperature for factual accuracy
        max_tokens=2048,
    )

    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5,  # Remember last 5 exchanges
    )

    prompt = PromptTemplate(
        template=SYSTEM_PROMPT,
        input_variables=["context", "chat_history", "question"],
    )

    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        combine_docs_chain_kwargs={"prompt": prompt},
        verbose=False,
    )
    return chain
Key Design Decisions
Temperature 0.1: We want factual, consistent answers, not creative ones. A low temperature makes the output more deterministic; keeping the model grounded in the retrieved context is mostly the job of the rules in the prompt.
ConversationBufferWindowMemory with k=5: Remembers the last 5 exchanges. This allows follow-up questions (“What about the pricing?” after asking about a product) without accumulating unlimited context that would slow responses and increase costs.
Claude Sonnet: The best balance of quality and cost for RAG applications. Opus is overkill for most Q&A tasks; Haiku might miss nuance in complex answers.
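The windowed-memory idea can be sketched in a few lines of plain Python. This is not LangChain's implementation; `SessionMemory` and the session IDs are hypothetical names. The sketch also keys history by session, because a single shared chain object means every user shares one conversation history.

```python
from collections import defaultdict, deque


class SessionMemory:
    """Keep only the last k (question, answer) exchanges for each session."""

    def __init__(self, k: int = 5):
        # deque(maxlen=k) silently drops the oldest exchange once k is exceeded
        self._history = defaultdict(lambda: deque(maxlen=k))

    def add(self, session_id: str, question: str, answer: str) -> None:
        self._history[session_id].append((question, answer))

    def as_prompt_text(self, session_id: str) -> str:
        """Render one session's history as text suitable for a prompt."""
        exchanges = self._history.get(session_id, ())
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in exchanges)


memory = SessionMemory(k=2)
for i in range(1, 4):
    memory.add("alice", f"question {i}", f"answer {i}")
memory.add("bob", "hello", "hi")

print(memory.as_prompt_text("alice"))
```

With k=2, the first exchange for "alice" is gone after the third one arrives, and "bob" never sees any of it.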
Step 5: The API Server
# src/api.py
import shutil
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

from .ingestion import load_documents, chunk_documents
from .vectorstore import create_vectorstore
from .retriever import create_hybrid_retriever
from .chain import create_chain

app = FastAPI(title="RAG Chatbot API")

# Wide-open CORS is convenient for local development; restrict origins before deploying
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

DOCS_DIR = "./documents"
Path(DOCS_DIR).mkdir(exist_ok=True)

# Global state (one chain, and therefore one conversation memory, for the whole process)
chain = None
chunks = None


class QueryRequest(BaseModel):
    question: str


class QueryResponse(BaseModel):
    answer: str
    sources: list[dict]


@app.on_event("startup")
async def startup():
    """Initialize the chain on startup if documents exist."""
    global chain, chunks
    if any(Path(DOCS_DIR).iterdir()):
        documents = load_documents(DOCS_DIR)
        chunks = chunk_documents(documents)
        vectorstore = create_vectorstore(chunks)
        retriever = create_hybrid_retriever(vectorstore, chunks)
        chain = create_chain(retriever)


# Plain "def" endpoints run in FastAPI's thread pool, so the blocking
# indexing and LLM calls below do not stall the event loop.
@app.post("/upload")
def upload_document(file: UploadFile = File(...)):
    """Upload a document to the knowledge base."""
    global chain, chunks
    allowed_extensions = {".pdf", ".md", ".txt"}

    # Keep only the final path component so a crafted filename cannot escape DOCS_DIR
    filename = Path(file.filename or "").name
    file_ext = Path(filename).suffix.lower()
    if file_ext not in allowed_extensions:
        raise HTTPException(400, f"Unsupported file type: {file_ext}")

    file_path = Path(DOCS_DIR) / filename
    with open(file_path, "wb") as f:
        shutil.copyfileobj(file.file, f)

    # Re-index all documents
    documents = load_documents(DOCS_DIR)
    chunks = chunk_documents(documents)
    vectorstore = create_vectorstore(chunks)
    retriever = create_hybrid_retriever(vectorstore, chunks)
    chain = create_chain(retriever)
    return {"message": f"Uploaded {filename}", "total_chunks": len(chunks)}


@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest):
    """Query the RAG chatbot."""
    if chain is None:
        raise HTTPException(400, "No documents loaded. Upload documents first.")

    result = chain.invoke({"question": request.question})

    sources = []
    for doc in result.get("source_documents", []):
        sources.append({
            "content": doc.page_content[:200] + "...",
            "source": doc.metadata.get("source", "Unknown"),
            "page": doc.metadata.get("page", None),
        })
    return QueryResponse(answer=result["answer"], sources=sources)


@app.get("/health")
async def health():
    return {"status": "ok", "documents_loaded": chain is not None}
Step 6: Run It
# Create the package structure
mkdir -p src
touch src/__init__.py
# Run the server
uvicorn src.api:app --reload --port 8000
Test with curl:
# Upload a document
curl -X POST http://localhost:8000/upload \
-F "file=@your-document.pdf"
# Ask a question
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What is the return policy?"}'
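The same query can be sent from Python with only the standard library. This is a sketch under the assumption that the server above is running on localhost:8000; `build_query_request` and `ask` are hypothetical helper names. Building the request is separated from sending it, so the request can be inspected without a live server.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumes the uvicorn server started above


def build_query_request(question: str, base_url: str = API_URL) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> dict:
    """Send the question and return the decoded JSON response (needs a running server)."""
    with urllib.request.urlopen(build_query_request(question), timeout=60) as resp:
        return json.load(resp)


request = build_query_request("What is the return policy?")
print(request.get_method(), request.full_url)
```

Calling `ask("What is the return policy?")` against a running server should return a dictionary with the `answer` and `sources` fields defined by QueryResponse.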
Step 7: Evaluation — Don’t Skip This
Most RAG tutorials end at “it works!” That’s not good enough for production. You need to evaluate retrieval quality and answer quality.
# src/evaluate.py
from dataclasses import dataclass


@dataclass
class TestCase:
    question: str
    expected_answer: str
    expected_source: str


# Define test cases based on your documents
TEST_CASES = [
    TestCase(
        question="What is the return policy for electronics?",
        expected_answer="30-day return window",
        expected_source="return-policy.pdf",
    ),
    TestCase(
        question="What are the API rate limits?",
        expected_answer="100 requests per minute",
        expected_source="api-docs.md",
    ),
]


def evaluate_retrieval(retriever, test_cases: list[TestCase]):
    """Evaluate retrieval quality."""
    results = []
    for tc in test_cases:
        docs = retriever.invoke(tc.question)
        sources = [d.metadata.get("source", "") for d in docs]

        # Check if expected source is in retrieved documents
        source_found = any(tc.expected_source in s for s in sources)

        # Check if expected answer content is in retrieved chunks
        content = " ".join([d.page_content for d in docs])
        answer_found = tc.expected_answer.lower() in content.lower()

        results.append({
            "question": tc.question,
            "source_found": source_found,
            "answer_in_context": answer_found,
        })

    # Calculate metrics
    source_recall = sum(r["source_found"] for r in results) / len(results)
    context_recall = sum(r["answer_in_context"] for r in results) / len(results)
    print(f"Source Recall: {source_recall:.1%}")
    print(f"Context Recall: {context_recall:.1%}")
    return results
Target metrics:
- Source Recall > 90%: The correct source document should appear in retrieved chunks for 9 out of 10 queries
- Context Recall > 80%: The information needed to answer should be present in retrieved chunks for 8 out of 10 queries
If you’re below these thresholds, adjust chunk_size, chunk_overlap, or retriever parameters before worrying about the LLM prompt.
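One way to act on these thresholds is to turn the results list into an explicit pass/fail report that also names the failing questions. The sketch below assumes the result dictionaries have the shape produced by evaluate_retrieval; `check_targets` is a hypothetical helper and the sample results are invented.

```python
def check_targets(
    results: list[dict],
    source_target: float = 0.9,
    context_target: float = 0.8,
) -> dict:
    """Compare recall metrics against target thresholds and list failing questions."""
    if not results:
        raise ValueError("no evaluation results to score")
    n = len(results)
    source_recall = sum(r["source_found"] for r in results) / n
    context_recall = sum(r["answer_in_context"] for r in results) / n
    return {
        "source_recall": source_recall,
        "context_recall": context_recall,
        "source_ok": source_recall > source_target,
        "context_ok": context_recall > context_target,
        # Questions to inspect first when tuning chunking or retrieval
        "failed_questions": [
            r["question"]
            for r in results
            if not (r["source_found"] and r["answer_in_context"])
        ],
    }


# Invented sample: every source is found, but two answers are missing from the context
sample = [
    {"question": "q1", "source_found": True, "answer_in_context": True},
    {"question": "q2", "source_found": True, "answer_in_context": True},
    {"question": "q3", "source_found": True, "answer_in_context": False},
    {"question": "q4", "source_found": True, "answer_in_context": True},
    {"question": "q5", "source_found": True, "answer_in_context": False},
]
report = check_targets(sample)
print(report)
```

A pattern like this sample (right document, wrong chunk) usually points at chunk size or overlap rather than at the retriever itself.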
Common Mistakes (and How to Avoid Them)
Mistake 1: Chunks Too Small
If your chunks are only around 200 characters, they lack context. The LLM receives disconnected sentence fragments and hallucinates to fill the gaps. Start with 1,000 characters and adjust based on your content type.
Mistake 2: No Overlap
Zero overlap means information at chunk boundaries is split. If a key paragraph is cut in half, neither chunk contains the full answer. Always use at least 10-20% overlap.
Mistake 3: Ignoring Metadata
Don’t strip metadata from documents. Source filename, page number, section headers — all of these help the LLM cite sources and help you debug retrieval issues.
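One way to put metadata to work is to label each retrieved chunk with its source before it goes into the prompt, so the model has something concrete to cite. In this sketch plain dictionaries stand in for LangChain Document objects, `format_context` is a hypothetical helper, and the zero-based `page` value is an assumption about how the PDF loader numbers pages.

```python
from pathlib import Path


def format_context(chunks: list[dict]) -> str:
    """Render retrieved chunks as numbered, source-labeled blocks of context."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        meta = chunk.get("metadata", {})
        name = Path(meta.get("source", "unknown")).name  # file name without directories
        page = meta.get("page")
        label = f"[{i}] {name}"
        if page is not None:
            label += f", page {page + 1}"  # assumes zero-based page metadata
        blocks.append(f"{label}\n{chunk['page_content']}")
    return "\n\n".join(blocks)


# Invented example chunks
retrieved = [
    {
        "page_content": "Returns are accepted within 30 days.",
        "metadata": {"source": "documents/return-policy.pdf", "page": 2},
    },
    {
        "page_content": "The API allows 100 requests per minute.",
        "metadata": {"source": "documents/api-docs.md"},
    },
]
context = format_context(retrieved)
print(context)
```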
Mistake 4: Too Many Retrieved Chunks
Retrieving 10 chunks when 4 would suffice dilutes the signal with noise. The LLM has to wade through irrelevant content, increasing the chance of hallucination. Start with k=4 and increase only if you’re getting low context recall.
Mistake 5: No Evaluation
You cannot improve what you don’t measure. Build the evaluation pipeline FIRST, before you start tweaking parameters.
Deploying to Production
For a real deployment, you’ll want:
- Authentication: Add API key middleware to protect your endpoints
- Rate limiting: Prevent abuse with slowapi or similar
- Persistent storage: Use a hosted ChromaDB instance or switch to Pinecone/Weaviate for production vector storage
- Monitoring: Log queries, retrieval results, and response times
- Error handling: Graceful degradation when the LLM API is down
# Deploy with Docker (assumes a Dockerfile in the project root that installs the dependencies and starts uvicorn)
docker build -t rag-chatbot .
docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY rag-chatbot
The Bottom Line
RAG is not complicated. The engineering challenge is in the details — chunking strategy, retrieval quality, prompt design, and evaluation. Get these right, and you have a system that can answer questions from any document corpus with high accuracy and full source citations.
The code in this tutorial is production-ready as a starting point. Clone it, load your documents, evaluate, tune, and deploy. You’ll have a working knowledge assistant faster than you can read most AI whitepapers.