How to Deploy AI Models on a Raspberry Pi: Run LLMs on a $75 Computer
You don't need a $10,000 GPU to run AI locally. This tutorial shows you how to deploy small language models on a Raspberry Pi 5 — from hardware setup to running inference in under 30 minutes.
Running AI on a $75 single-board computer sounds absurd until you realize that a quantized 3B parameter model in 2026 is more capable than GPT-3 was in 2022 — and a Raspberry Pi 5 with 8GB RAM can run it at reasonable speeds.
This isn’t about replacing cloud AI. It’s about having a private, offline, always-available AI assistant that costs nothing per query, runs on your home network, and processes your data without sending a single byte to any external server.
Here’s how to set it up.
What You’ll Need
Hardware
| Component | Recommended | Minimum | Cost |
|---|---|---|---|
| Raspberry Pi 5 | 8GB RAM | 4GB RAM | $60-80 |
| MicroSD Card | 64GB A2 | 32GB A1 | $10-15 |
| Power Supply | USB-C 27W official | USB-C 15W | $12 |
| Cooling | Active cooler (fan + heatsink) | Passive heatsink | $5-15 |
| Case | Official Pi 5 case | Any case | $10 |
| NVMe SSD (optional) | 256GB+ via Pi HAT | Not required | $25-40 |
Total cost: $75-160
The 8GB model is strongly recommended. With 4GB, you’re limited to tiny models (1-2B parameters) that are noticeably less capable.
The NVMe SSD isn’t required but significantly improves model loading time (models are 2-4GB files that need to be read from storage).
Software
- Raspberry Pi OS (64-bit, Bookworm or later)
- Ollama or llama.cpp
- Python 3.11+ (for custom scripts)
Step 1: Set Up Your Raspberry Pi
Flash Raspberry Pi OS 64-bit to your microSD card using the Raspberry Pi Imager:
# After first boot, update everything
sudo apt update && sudo apt upgrade -y
# Install essential build tools
sudo apt install -y build-essential cmake git python3-pip
# Check your system
uname -m # Should output: aarch64
free -h # Check available RAM
df -h # Check storage space
Important: You must use the 64-bit OS. The 32-bit version cannot address enough memory for AI models.
Optional: Set Up NVMe SSD
If you have an NVMe HAT:
# Check NVMe is detected
lsblk
# Format and mount
sudo mkfs.ext4 /dev/nvme0n1p1
sudo mkdir /mnt/ssd
sudo mount /dev/nvme0n1p1 /mnt/ssd
# Add to fstab for auto-mount
echo '/dev/nvme0n1p1 /mnt/ssd ext4 defaults 0 2' | sudo tee -a /etc/fstab
# Use SSD for model storage
mkdir -p /mnt/ssd/models
Step 2: Install Ollama (Easy Method)
Ollama is the simplest way to run LLMs locally. It handles model downloading, quantization, and serving with a single binary:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the Ollama service (runs in background)
ollama serve &
Download a Model
For Raspberry Pi 5 (8GB), these models work well:
# Recommended: Phi-3 Mini (3.8B parameters, ~2.3GB)
ollama pull phi3:mini
# Alternative: Gemma 2 2B (~1.6GB)
ollama pull gemma2:2b
# Alternative: Llama 3.2 3B (~2GB)
ollama pull llama3.2:3b
# Alternative: Qwen 2.5 3B (~2GB)
ollama pull qwen2.5:3b
# List downloaded models
ollama list
Test It
# Interactive chat
ollama run phi3:mini
# Single query
ollama run phi3:mini "Explain quantum computing in 3 sentences"
# API endpoint (for integration with other tools)
curl http://localhost:11434/api/generate -d '{
"model": "phi3:mini",
"prompt": "Write a Python function to calculate fibonacci numbers",
"stream": false
}'
Performance Expectations
| Model | Size | Tokens/sec (Pi 5 8GB) | Quality |
|---|---|---|---|
| Qwen 2.5 0.5B | 400MB | 15-20 tok/s | Basic |
| Gemma 2 2B | 1.6GB | 6-10 tok/s | Good |
| Phi-3 Mini 3.8B | 2.3GB | 3-6 tok/s | Very Good |
| Llama 3.2 3B | 2.0GB | 4-7 tok/s | Very Good |
| Mistral 7B (Q4) | 4.1GB | 1-3 tok/s | Excellent (slow) |
3-7 tokens per second is slower than typing speed but fast enough for useful interactions. You won’t be generating novels, but answering questions, summarizing text, and writing code snippets is practical.
Step 3: Install llama.cpp (Advanced Method)
For more control over performance and model selection, build llama.cpp from source:
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build with optimizations for ARM (Raspberry Pi)
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CPU_AARCH64=ON
cmake --build build --config Release -j4
# Verify build
./build/bin/llama-cli --help
Download Models from Hugging Face
# Install huggingface-cli
pip3 install huggingface-hub
# Download a GGUF model (pre-quantized for llama.cpp)
huggingface-cli download \
TheBloke/Phi-3-mini-4k-instruct-GGUF \
phi-3-mini-4k-instruct.Q4_K_M.gguf \
--local-dir ./models/
# Or use the built-in converter for other models
python3 convert_hf_to_gguf.py \
--outfile models/custom-model.gguf \
/path/to/huggingface/model/
Run the Model
# Interactive chat
./build/bin/llama-cli \
-m models/phi-3-mini-4k-instruct.Q4_K_M.gguf \
-c 2048 \ # Context window
-n 512 \ # Max tokens to generate
-t 4 \ # Number of threads (Pi 5 has 4 cores)
--interactive \
-p "You are a helpful assistant."
# Start an API server (OpenAI-compatible)
./build/bin/llama-server \
-m models/phi-3-mini-4k-instruct.Q4_K_M.gguf \
-c 2048 \
-t 4 \
--host 0.0.0.0 \
--port 8080
Quantization Levels
| Quantization | Size Reduction | Quality Loss | Recommended |
|---|---|---|---|
| Q8_0 | ~50% | Minimal | If RAM allows |
| Q6_K | ~60% | Very small | Good balance |
| Q5_K_M | ~65% | Small | Good balance |
| Q4_K_M | ~75% | Moderate | Best for Pi |
| Q3_K_M | ~80% | Notable | Only if needed |
| Q2_K | ~85% | Significant | Not recommended |
For Raspberry Pi, Q4_K_M is the sweet spot — it provides the best balance of model quality and memory usage.
Step 4: Build a Useful Application
Home Assistant AI
Create a local AI that answers questions about your smart home:
#!/usr/bin/env python3
"""Local AI assistant accessible via your home network."""
from flask import Flask, request, jsonify
import requests
app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434/api/generate"
SYSTEM_PROMPT = """You are a helpful home assistant running on a
Raspberry Pi. You help with:
- Answering general knowledge questions
- Writing and explaining code
- Summarizing text
- Giving recipe suggestions
- Helping with math and science
Keep responses concise since you're running on limited hardware."""
@app.route('/ask', methods=['POST'])
def ask():
user_message = request.json.get('message', '')
response = requests.post(OLLAMA_URL, json={
"model": "phi3:mini",
"prompt": f"{SYSTEM_PROMPT}\n\nUser: {user_message}\nAssistant:",
"stream": False,
"options": {
"temperature": 0.7,
"num_predict": 256
}
})
result = response.json()
return jsonify({"response": result['response']})
@app.route('/summarize', methods=['POST'])
def summarize():
text = request.json.get('text', '')
response = requests.post(OLLAMA_URL, json={
"model": "phi3:mini",
"prompt": f"Summarize the following text in 3 bullet points:\n\n{text}",
"stream": False,
"options": {"num_predict": 200}
})
result = response.json()
return jsonify({"summary": result['response']})
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000)
# Install dependencies
pip3 install flask requests
# Run the server
python3 home_assistant.py
# Access from any device on your network
# http://raspberrypi.local:5000/ask
Offline Document Q&A
#!/usr/bin/env python3
"""Ask questions about local documents without internet."""
import os
import requests
OLLAMA_URL = "http://localhost:11434/api/generate"
def read_document(filepath):
with open(filepath, 'r') as f:
return f.read()
def ask_about_document(doc_text, question):
# Truncate document to fit in context window
max_chars = 4000 # ~1000 tokens
if len(doc_text) > max_chars:
doc_text = doc_text[:max_chars] + "\n[Document truncated...]"
prompt = f"""Based on the following document, answer the question.
Document:
{doc_text}
Question: {question}
Answer:"""
response = requests.post(OLLAMA_URL, json={
"model": "phi3:mini",
"prompt": prompt,
"stream": False,
"options": {"num_predict": 300, "temperature": 0.3}
})
return response.json()['response']
# Usage
doc = read_document("/path/to/your/document.txt")
answer = ask_about_document(doc, "What are the main conclusions?")
print(answer)
Step 5: Run as a System Service
Make your AI assistant start automatically on boot:
# Create systemd service file
sudo tee /etc/systemd/system/local-ai.service << 'EOF'
[Unit]
Description=Local AI Assistant
After=network.target
[Service]
Type=simple
User=pi
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=10
Environment=OLLAMA_HOST=0.0.0.0
[Install]
WantedBy=multi-user.target
EOF
# Enable and start the service
sudo systemctl enable local-ai.service
sudo systemctl start local-ai.service
# Check status
sudo systemctl status local-ai.service
Performance Optimization Tips
# 1. Increase swap (helps when RAM is tight)
sudo dphys-swapfile swapoff
sudo sed -i 's/CONF_SWAPSIZE=.*/CONF_SWAPSIZE=4096/' /etc/dphys-swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon
# 2. Set CPU governor to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# 3. Monitor temperature (throttling kills performance)
vcgencmd measure_temp
# 4. Use NVMe SSD for model storage (faster loading)
# 5. Close unnecessary services
sudo systemctl disable bluetooth
sudo systemctl disable avahi-daemon
What You Can Build
With a Raspberry Pi running a local LLM, you can:
- Private AI assistant — Ask questions without data leaving your home
- Document summarizer — Drop files into a folder, get summaries automatically
- Code helper — Local coding assistant for when you’re offline
- Email drafting — Generate email responses locally
- Language learning — Practice conversations with an AI tutor
- Home automation — Natural language control for IoT devices
- Kids’ homework helper — Safe, offline AI tutor with no account required
The key insight: a 3B model on a Raspberry Pi isn’t competing with GPT-5. It’s competing with not having any AI at all — and for many use cases, a local, private, free-to-query AI assistant is more than good enough.
The total cost of ownership after setup: electricity. About $5-10 per year. That’s your entire AI budget. No subscriptions, no API keys, no data privacy concerns. Just a small computer under your desk, answering questions whenever you ask.
Sources
> Want more like this?
Get the best AI insights delivered weekly.
> Related Articles
Web Scraping with AI: Build a Smart Data Extraction Pipeline
Traditional web scraping breaks when websites change layouts. AI-powered scraping understands page structure and extracts data intelligently. Here's how to build one using Python, Beautiful Soup, and Claude.
Create an AI Art Portfolio: From Generation to Gallery in One Weekend
Build a professional AI art portfolio website with curated collections, consistent style, and proper attribution. Covers prompt engineering, style consistency, curation, and deployment.
Build an AI Chrome Extension: Add Claude to Any Webpage in 60 Minutes
Build a Chrome extension that summarizes web pages, answers questions about content, and rewrites selected text — all powered by Claude. Full source code and step-by-step instructions included.
Tags
> Stay in the loop
Weekly AI tools & insights.