Build a Production RAG Chatbot in Python: From Zero to Deployed in 2 Hours

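RAG (Retrieval-Augmented Generation) is one of the most practical AI patterns for businesses. It lets you build a chatbot that answers questions from YOUR data (company docs, product manuals, research papers, whatever) without fine-tuning a model. Your documents and embeddings stay on your own machine, but the retrieved passages are sent to the LLM provider with each query, so check that against your data-handling requirements.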

This tutorial builds a production-quality RAG chatbot from scratch. Not a toy demo. A real system with proper chunking, hybrid search, conversation memory, and source citations. By the end, you’ll have something you can actually deploy.

What We’re Building

A chatbot that:

Ingests PDF, Markdown, and text documents
Chunks and embeds them into a vector database
Retrieves relevant context for each user query
Generates accurate answers with source citations
Maintains conversation history
Runs as a web app you can deploy

Tech stack: Python 3.11+, LangChain, ChromaDB, the Claude API, and FastAPI. The steps below cover the backend; a frontend (React or otherwise) can sit on top of the API.

Prerequisites

# Create project directory
mkdir rag-chatbot && cd rag-chatbot
python -m venv venv
source venv/bin/activate

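# Install dependencies
# The imports below (langchain.chains, langchain.memory, langchain.retrievers)
# use the pre-1.0 LangChain module layout, so pin the version
pip install "langchain<1.0" langchain-community langchain-anthropic \
    langchain-chroma langchain-huggingface langchain-text-splitters \
    chromadb sentence-transformers rank_bm25 \
    fastapi uvicorn python-multipart pypdf \
    unstructured markdown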

You’ll need an Anthropic API key. Set it as an environment variable:

export ANTHROPIC_API_KEY="sk-ant-your-key-here"

Step 1: Document Loading and Chunking

The quality of your RAG system depends heavily on how you chunk your documents. Chunk too small, you lose context. Chunk too large, you dilute relevant information with noise.

# src/ingestion.py
from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredMarkdownLoader,
    TextLoader,
    DirectoryLoader,
)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path


def load_documents(docs_dir: str) -> list:
    """Load documents from a directory, handling multiple file types."""
    documents = []
    docs_path = Path(docs_dir)

    # PDF files
    for pdf_file in docs_path.glob("**/*.pdf"):
        loader = PyPDFLoader(str(pdf_file))
        documents.extend(loader.load())

    # Markdown files
    for md_file in docs_path.glob("**/*.md"):
        loader = UnstructuredMarkdownLoader(str(md_file))
        documents.extend(loader.load())

    # Text files
    for txt_file in docs_path.glob("**/*.txt"):
        loader = TextLoader(str(txt_file), encoding="utf-8")
        documents.extend(loader.load())

    print(f"Loaded {len(documents)} document pages")
    return documents


def chunk_documents(documents: list, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split documents into chunks with overlap for context preservation."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    return chunks

Why These Settings?

chunk_size=1000: Measured in characters (length_function=len), so roughly 150-200 English words. Large enough to contain a complete thought, small enough for precise retrieval.
chunk_overlap=200: 20% overlap reduces the chance that information is lost at chunk boundaries. If an important passage spans two chunks, each one still carries some of the surrounding context.
RecursiveCharacterTextSplitter: Tries to split at paragraph boundaries first, then sentences, then words. This produces more coherent chunks than naive fixed-size splitting.
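
To make the overlap setting concrete, here is a minimal sketch of a plain fixed-size splitter with overlap. It is deliberately simpler than RecursiveCharacterTextSplitter (it ignores paragraph and sentence boundaries), and the function name chunk_text is made up for this illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 2,500 characters of dummy text
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)

print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: the windows share 200 characters
```

Each window starts 800 characters after the previous one, so any passage shorter than 200 characters that straddles a boundary appears whole in at least one chunk.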

Step 2: Embedding and Vector Storage

# src/vectorstore.py
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from pathlib import Path

# Use a high-quality, free embedding model
EMBEDDING_MODEL = "BAAI/bge-base-en-v1.5"
PERSIST_DIR = "./chroma_db"


def get_embeddings():
    """Initialize the embedding model."""
    return HuggingFaceEmbeddings(
        model_name=EMBEDDING_MODEL,
        model_kwargs={"device": "cpu"},  # Use "cuda" if you have a GPU
        encode_kwargs={"normalize_embeddings": True},
    )


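def create_vectorstore(chunks: list) -> Chroma:
    """Create a ChromaDB vector store from document chunks."""
    embeddings = get_embeddings()

    # from_documents adds to an existing persisted collection, so drop any
    # previous copy first; otherwise every re-index duplicates the vectors.
    Chroma(
        persist_directory=PERSIST_DIR,
        embedding_function=embeddings,
    ).delete_collection()

    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=PERSIST_DIR,
        collection_metadata={"hnsw:space": "cosine"},
    )

    # _collection is a private attribute; used here only for a quick sanity check
    print(f"Created vector store with {vectorstore._collection.count()} vectors")
    return vectorstore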


def load_vectorstore() -> Chroma:
    """Load an existing ChromaDB vector store."""
    embeddings = get_embeddings()
    return Chroma(
        persist_directory=PERSIST_DIR,
        embedding_function=embeddings,
    )

Why BGE-base?

BAAI/bge-base-en-v1.5 is a roughly 110M-parameter embedding model that scores well on the MTEB benchmark for its size. It’s free, runs locally, and produces 768-dimensional embeddings. For many RAG applications it is a reasonable alternative to hosted models such as OpenAI’s text-embedding-3-small, with no per-token cost. Verify retrieval quality on your own data before committing to it.
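
The normalize_embeddings=True setting and the cosine distance space used in the code above fit together: once vectors are scaled to unit length, cosine similarity is just a dot product. A small sketch with toy 3-dimensional vectors standing in for the real 768-dimensional ones (the helper names are illustrative, not part of any library):

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [0.2, 0.9, 0.4]
chunk = [0.1, 0.8, 0.6]

# Both routes give the same similarity
print(cosine(query, chunk))
print(dot(normalize(query), normalize(chunk)))
```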

Step 3: The Retrieval Chain

This is where most RAG tutorials fall short. Basic retrieval — “find the 4 most similar chunks” — works for simple questions but fails on complex ones. We’ll implement a hybrid approach.

# src/retriever.py
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_chroma import Chroma


def create_hybrid_retriever(vectorstore: Chroma, chunks: list, k: int = 4):
    """Create a hybrid retriever combining semantic search and BM25."""

    # Semantic search (vector similarity)
    semantic_retriever = vectorstore.as_retriever(
        search_type="mmr",  # Maximum Marginal Relevance for diversity
        search_kwargs={
            "k": k,
            "fetch_k": k * 3,  # Fetch more, then diversify
            "lambda_mult": 0.7,  # Balance between relevance and diversity
        },
    )

    # Keyword search (BM25)
    bm25_retriever = BM25Retriever.from_documents(chunks, k=k)

    # Combine the two result lists, weighting semantic search slightly higher
    ensemble_retriever = EnsembleRetriever(
        retrievers=[semantic_retriever, bm25_retriever],
        weights=[0.6, 0.4],  # Slightly favor semantic search
    )

    return ensemble_retriever
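
The search_type="mmr" setting above deserves a closer look. The following is a simplified sketch of the Maximal Marginal Relevance idea, assuming unit-length vectors so that a dot product serves as similarity; the function names are made up here, and LangChain's own implementation may differ in its details.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.7):
    """Pick up to k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))

    def score(i):
        relevance = dot(query_vec, doc_vecs[i])
        # Similarity to the closest already-selected document (0 if none yet)
        redundancy = max((dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def unit(degrees):
    """2-D unit vector at the given angle, used as a toy embedding."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]


query = unit(0)
docs = [unit(10), unit(12), unit(-20)]  # docs 0 and 1 are near-duplicates

print(mmr_select(query, docs, k=2, lambda_mult=0.7))  # [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With lambda_mult=1.0 the selection is pure relevance and returns both near-duplicates; at 0.7 the second slot goes to the less similar but more distinct document.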

Why Hybrid Search?

Semantic search finds conceptually similar content (“machine learning” matches “neural network training”). BM25 finds exact keyword matches (“error code 0x4F2A” matches documents containing that exact string). Real-world queries need both:

Query Type	Best Retriever
“How does authentication work?”	Semantic
“Error code AUTH_FAILED_403”	BM25
“What’s the rate limit for the API?”	Both (semantic for concept, BM25 for “rate limit”)
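
How do two ranked lists become one? To the best of my knowledge, EnsembleRetriever merges results with a weighted form of Reciprocal Rank Fusion. The sketch below shows that idea on document IDs; the function name, the sample IDs, and the constant c=60 (a commonly used value) are assumptions for illustration, not a copy of the library code.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents contribute more; c damps the effect of rank 1
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


semantic = ["auth-overview", "login-flow", "error-codes"]
bm25 = ["error-codes", "changelog", "auth-overview"]

fused = weighted_rrf([semantic, bm25], weights=[0.6, 0.4])
print(fused)
# ['auth-overview', 'error-codes', 'login-flow', 'changelog']
```

Documents that appear in both lists rise to the top, which is why the hybrid retriever tends to handle both conceptual and exact-match queries.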

Step 4: The Conversation Chain

# src/chain.py
from langchain_anthropic import ChatAnthropic
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain_core.prompts import PromptTemplate


SYSTEM_PROMPT = """You are a helpful assistant that answers questions based on the provided context.

Rules:
1. ONLY answer based on the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer that question."
2. Always cite your sources by mentioning the document name and relevant section.
3. Be specific and precise. Include numbers, dates, and exact terms from the context.
4. If the question is ambiguous, ask for clarification.
5. Format your response with markdown for readability.

Context:
{context}

Chat History:
{chat_history}

Question: {question}

Answer:"""


def create_chain(retriever):
    """Create the conversational RAG chain."""
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        temperature=0.1,  # Low temperature for factual accuracy
        max_tokens=2048,
    )

    memory = ConversationBufferWindowMemory(
        memory_key="chat_history",
        return_messages=True,
        output_key="answer",
        k=5,  # Remember last 5 exchanges
    )

    prompt = PromptTemplate(
        template=SYSTEM_PROMPT,
        input_variables=["context", "chat_history", "question"],
    )

    chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        memory=memory,
        return_source_documents=True,
        combine_docs_chain_kwargs={"prompt": prompt},
        verbose=False,
    )

    return chain

Key Design Decisions

Temperature 0.1: We want factual, consistent answers, not creative ones. Low temperature reduces run-to-run variation; the prompt's instruction to answer only from the provided context is what keeps the model grounded.

ConversationBufferWindowMemory with k=5: Remembers the last 5 exchanges. This allows follow-up questions (“What about the pricing?” after asking about a product) without accumulating unlimited context that would slow responses and increase costs.
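
One thing the window does not handle is separating users: the chain above holds a single memory object, so every client of one server process shares the same history. A minimal sketch of a per-session window, using only the standard library (the class name SessionMemory and the session_id parameter are hypothetical, not LangChain APIs):

```python
from collections import defaultdict, deque


class SessionMemory:
    """Keep the last k question/answer exchanges separately for each session."""

    def __init__(self, k: int = 5):
        # deque(maxlen=k) silently drops the oldest exchange once k is exceeded
        self._histories = defaultdict(lambda: deque(maxlen=k))

    def add_exchange(self, session_id: str, question: str, answer: str) -> None:
        self._histories[session_id].append((question, answer))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        # .get avoids creating an entry for sessions that never spoke
        return list(self._histories.get(session_id, ()))


memory = SessionMemory(k=5)
for i in range(7):
    memory.add_exchange("alice", f"question {i}", f"answer {i}")

print(len(memory.history("alice")))  # 5
print(memory.history("alice")[0])    # ('question 2', 'answer 2')
print(memory.history("bob"))         # []
```

The history for a session could then be passed in explicitly with each query instead of living on the chain.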

Claude Sonnet: A good balance of quality and cost for RAG applications. Opus is usually more than most Q&A tasks need; Haiku might miss nuance in complex answers.

Step 5: The API Server

# src/api.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import shutil
from pathlib import Path

from .ingestion import load_documents, chunk_documents
from .vectorstore import create_vectorstore, load_vectorstore
from .retriever import create_hybrid_retriever
from .chain import create_chain

app = FastAPI(title="RAG Chatbot API")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Fine for local development; restrict to your frontend's origin in production
    allow_methods=["*"],
    allow_headers=["*"],
)

DOCS_DIR = "./documents"
Path(DOCS_DIR).mkdir(exist_ok=True)

# Global state
chain = None
chunks = None


class QueryRequest(BaseModel):
    question: str


class QueryResponse(BaseModel):
    answer: str
    sources: list[dict]


@app.on_event("startup")
async def startup():
    """Initialize the chain on startup if documents exist."""
    global chain, chunks
    if any(Path(DOCS_DIR).iterdir()):
        documents = load_documents(DOCS_DIR)
        chunks = chunk_documents(documents)
        vectorstore = create_vectorstore(chunks)
        retriever = create_hybrid_retriever(vectorstore, chunks)
        chain = create_chain(retriever)


@app.post("/upload")
async def upload_document(file: UploadFile = File(...)):
    """Upload a document to the knowledge base."""
    global chain, chunks

    allowed_extensions = {".pdf", ".md", ".txt"}
    file_ext = Path(file.filename).suffix.lower()
    if file_ext not in allowed_extensions:
        raise HTTPException(400, f"Unsupported file type: {file_ext}")

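    # Keep only the final path component so a crafted filename such as
    # "../../etc/passwd" cannot write outside DOCS_DIR
    safe_name = Path(file.filename).name
    file_path = Path(DOCS_DIR) / safe_name
    with open(file_path, "wb") as f:
        shutil.copyfileobj(file.file, f)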

    # Re-index all documents
    documents = load_documents(DOCS_DIR)
    chunks = chunk_documents(documents)
    vectorstore = create_vectorstore(chunks)
    retriever = create_hybrid_retriever(vectorstore, chunks)
    chain = create_chain(retriever)

    return {"message": f"Uploaded {file.filename}", "total_chunks": len(chunks)}


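# Plain def (not async): FastAPI runs it in a threadpool, so the blocking
# chain.invoke call below does not stall the event loop
@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest):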
    """Query the RAG chatbot."""
    if chain is None:
        raise HTTPException(400, "No documents loaded. Upload documents first.")

    result = chain.invoke({"question": request.question})

    sources = []
    for doc in result.get("source_documents", []):
        sources.append({
            "content": doc.page_content[:200] + "...",
            "source": doc.metadata.get("source", "Unknown"),
            "page": doc.metadata.get("page", None),
        })

    return QueryResponse(answer=result["answer"], sources=sources)


@app.get("/health")
async def health():
    return {"status": "ok", "documents_loaded": chain is not None}

Step 6: Run It

# Create the package structure
mkdir -p src
touch src/__init__.py

# Run the server
uvicorn src.api:app --reload --port 8000

Test with curl:

# Upload a document
curl -X POST http://localhost:8000/upload \
  -F "file=@your-document.pdf"

# Ask a question
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the return policy?"}'
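
If you would rather test from Python than curl, the same /query call can be sketched with the standard library alone. The helper names build_query_request and ask are made up for this example, and the base URL assumes the server from the previous step is running locally.

```python
import json
import urllib.request


def build_query_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for the /query endpoint without sending it."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, base_url: str = "http://localhost:8000") -> dict:
    """Send the question and return the decoded JSON response."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running:
#   result = ask("What is the return policy?")
#   print(result["answer"])
#   for source in result["sources"]:
#       print(source["source"], source["page"])
```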

Step 7: Evaluation — Don’t Skip This

Most RAG tutorials end at “it works!” That’s not good enough for production. You need to evaluate retrieval quality and answer quality.

# src/evaluate.py
from dataclasses import dataclass


@dataclass
class TestCase:
    question: str
    expected_answer: str
    expected_source: str


# Define test cases based on your documents
TEST_CASES = [
    TestCase(
        question="What is the return policy for electronics?",
        expected_answer="30-day return window",
        expected_source="return-policy.pdf",
    ),
    TestCase(
        question="What are the API rate limits?",
        expected_answer="100 requests per minute",
        expected_source="api-docs.md",
    ),
]


def evaluate_retrieval(retriever, test_cases: list[TestCase]):
    """Evaluate retrieval quality."""
    results = []
    for tc in test_cases:
        docs = retriever.invoke(tc.question)
        sources = [d.metadata.get("source", "") for d in docs]

        # Check if expected source is in retrieved documents
        source_found = any(tc.expected_source in s for s in sources)

        # Check if expected answer content is in retrieved chunks
        content = " ".join([d.page_content for d in docs])
        answer_found = tc.expected_answer.lower() in content.lower()

        results.append({
            "question": tc.question,
            "source_found": source_found,
            "answer_in_context": answer_found,
        })

    # Calculate metrics (max(..., 1) avoids dividing by zero on an empty test set)
    total = max(len(results), 1)
    source_recall = sum(r["source_found"] for r in results) / total
    context_recall = sum(r["answer_in_context"] for r in results) / total

    print(f"Source Recall: {source_recall:.1%}")
    print(f"Context Recall: {context_recall:.1%}")
    return results

Target metrics:

Source Recall > 90%: The correct source document should appear in retrieved chunks for 9 out of 10 queries
Context Recall > 80%: The information needed to answer should be present in retrieved chunks for 8 out of 10 queries

If you’re below these thresholds, adjust chunk_size, chunk_overlap, or retriever parameters before worrying about the LLM prompt.
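
The recall numbers above treat a hit at rank 1 the same as a hit at the bottom of the list. If ranking order matters to you, mean reciprocal rank (MRR) is a common complement. A minimal sketch, assuming you have collected the ordered list of source paths for each test question (the function names and sample paths are illustrative):

```python
def reciprocal_rank(retrieved_sources: list[str], expected_source: str) -> float:
    """Return 1/rank of the first retrieved source matching the expected one, else 0."""
    for rank, source in enumerate(retrieved_sources, start=1):
        if expected_source in source:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], str]]) -> float:
    """Average reciprocal rank over (retrieved_sources, expected_source) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(sources, expected) for sources, expected in runs) / len(runs)


runs = [
    (["docs/return-policy.pdf", "docs/faq.md"], "return-policy.pdf"),  # hit at rank 1
    (["docs/faq.md", "docs/api-docs.md"], "api-docs.md"),              # hit at rank 2
    (["docs/faq.md"], "pricing.md"),                                   # miss
]

print(mean_reciprocal_rank(runs))  # 0.5
```

An MRR that lags well behind source recall suggests the right documents are being found but ranked low, which points at retriever weights rather than chunking.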

Common Mistakes (and How to Avoid Them)

Mistake 1: Chunks Too Small

If your chunks are around 200 characters, they lack context. The LLM receives disconnected sentence fragments and is more likely to hallucinate to fill the gaps. Start with 1,000 characters and adjust based on your content type.

Mistake 2: No Overlap

Zero overlap means information at chunk boundaries is split. If a key paragraph is cut in half, neither chunk contains the full answer. Always use at least 10-20% overlap.

Mistake 3: Ignoring Metadata

Don’t strip metadata from documents. Source filename, page number, section headers — all of these help the LLM cite sources and help you debug retrieval issues.

Mistake 4: Too Many Retrieved Chunks

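Retrieving 10 chunks when 4 would suffice dilutes the signal with noise. The LLM has to wade through irrelevant content, increasing the chance of hallucination. Start with k=4 and increase only if you’re getting low context recall. Keep in mind that the hybrid retriever from Step 3 can return up to 2k chunks, since each of the two retrievers contributes k before the lists are merged.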

Mistake 5: No Evaluation

You cannot improve what you don’t measure. Build the evaluation pipeline FIRST, before you start tweaking parameters.

Deploying to Production

For a real deployment, you’ll want:

Authentication: Add API key middleware to protect your endpoints
Rate limiting: Prevent abuse with slowapi or similar
Persistent storage: Use a hosted ChromaDB instance or switch to Pinecone/Weaviate for production vector storage
Monitoring: Log queries, retrieval results, and response times
Error handling: Graceful degradation when the LLM API is down
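
As one example from this list, the core of an API key check can be sketched in a few lines. The header name x-api-key and the function is_authorized are assumptions for illustration; in a FastAPI app you would call something like this from a dependency or middleware and return a 401 when it fails.

```python
import hmac


def is_authorized(headers: dict[str, str], expected_key: str, header_name: str = "x-api-key") -> bool:
    """Check a request's API key header against the expected key."""
    if not expected_key:
        return False  # refuse everything when no key is configured
    # HTTP header names are case-insensitive, so compare in lowercase
    lowered = {name.lower(): value for name, value in headers.items()}
    supplied = lowered.get(header_name.lower(), "")
    # compare_digest runs in constant time, which avoids leaking the key via timing
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))


print(is_authorized({"X-API-Key": "s3cret"}, "s3cret"))  # True
print(is_authorized({"X-API-Key": "wrong"}, "s3cret"))   # False
print(is_authorized({}, "s3cret"))                       # False
```

Load the expected key from an environment variable or secret store rather than hard-coding it.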

# Deploy with Docker (assumes a Dockerfile in the project root; none is provided in this tutorial)
docker build -t rag-chatbot .
docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY rag-chatbot

The Bottom Line

RAG is not complicated. The engineering challenge is in the details — chunking strategy, retrieval quality, prompt design, and evaluation. Get these right, and you have a system that can answer questions from any document corpus with high accuracy and full source citations.

The code in this tutorial is a starting point for a production system, not a finished one. Copy it, load your documents, evaluate, tune, add the hardening listed above, and deploy. You’ll have a working knowledge assistant faster than you can read most AI whitepapers.