How to Deploy AI Models on a Raspberry Pi: Run LLMs on a $75 Computer

Running AI on a $75 single-board computer sounds absurd until you realize that a quantized 3B parameter model in 2026 is more capable than GPT-3 was in 2022 — and a Raspberry Pi 5 with 8GB RAM can run it at reasonable speeds.

This isn’t about replacing cloud AI. It’s about having a private, offline, always-available AI assistant that costs nothing per query, runs on your home network, and processes your data without sending a single byte to any external server.

Here’s how to set it up.

What You’ll Need

Hardware

Component	Recommended	Minimum	Cost
Raspberry Pi 5	8GB RAM	4GB RAM	$60-80
MicroSD Card	64GB A2	32GB A1	$10-15
Power Supply	USB-C 27W official	USB-C 15W	$12
Cooling	Active cooler (fan + heatsink)	Passive heatsink	$5-15
Case	Official Pi 5 case	Any case	$10
NVMe SSD (optional)	256GB+ via Pi HAT	Not required	$25-40

Total cost: $75-160

The 8GB model is strongly recommended. With 4GB, you’re limited to tiny models (1-2B parameters) that are noticeably less capable.

The NVMe SSD isn’t required but significantly improves model loading time (models are 2-4GB files that need to be read from storage).

Software

Raspberry Pi OS (64-bit, Bookworm or later)
Ollama or llama.cpp
Python 3.11+ (for custom scripts)

Step 1: Set Up Your Raspberry Pi

Flash Raspberry Pi OS 64-bit to your microSD card using the Raspberry Pi Imager:

# After first boot, update everything
sudo apt update && sudo apt upgrade -y

# Install essential build tools
sudo apt install -y build-essential cmake git python3-pip

# Check your system
uname -m     # Should output: aarch64
free -h      # Check available RAM
df -h        # Check storage space

Important: You must use the 64-bit OS. The 32-bit version cannot address enough memory for AI models.

Optional: Set Up NVMe SSD

If you have an NVMe HAT:

# Check NVMe is detected
lsblk

# Format and mount
sudo mkfs.ext4 /dev/nvme0n1p1
sudo mkdir /mnt/ssd
sudo mount /dev/nvme0n1p1 /mnt/ssd

# Add to fstab for auto-mount
echo '/dev/nvme0n1p1 /mnt/ssd ext4 defaults 0 2' | sudo tee -a /etc/fstab

# Use SSD for model storage
mkdir -p /mnt/ssd/models

Step 2: Install Ollama (Easy Method)

Ollama is the simplest way to run LLMs locally. It handles model downloading, quantization, and serving with a single binary:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Start the Ollama service (runs in background)
ollama serve &

Download a Model

For Raspberry Pi 5 (8GB), these models work well:

# Recommended: Phi-3 Mini (3.8B parameters, ~2.3GB)
ollama pull phi3:mini

# Alternative: Gemma 2 2B (~1.6GB)
ollama pull gemma2:2b

# Alternative: Llama 3.2 3B (~2GB)
ollama pull llama3.2:3b

# Alternative: Qwen 2.5 3B (~2GB)
ollama pull qwen2.5:3b

# List downloaded models
ollama list

Test It

# Interactive chat
ollama run phi3:mini

# Single query
ollama run phi3:mini "Explain quantum computing in 3 sentences"

# API endpoint (for integration with other tools)
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Write a Python function to calculate fibonacci numbers",
  "stream": false
}'

Performance Expectations

Model	Size	Tokens/sec (Pi 5 8GB)	Quality
Qwen 2.5 0.5B	400MB	15-20 tok/s	Basic
Gemma 2 2B	1.6GB	6-10 tok/s	Good
Phi-3 Mini 3.8B	2.3GB	3-6 tok/s	Very Good
Llama 3.2 3B	2.0GB	4-7 tok/s	Very Good
Mistral 7B (Q4)	4.1GB	1-3 tok/s	Excellent (slow)

3-7 tokens per second is slower than typing speed but fast enough for useful interactions. You won’t be generating novels, but answering questions, summarizing text, and writing code snippets is practical.

Step 3: Install llama.cpp (Advanced Method)

For more control over performance and model selection, build llama.cpp from source:

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with optimizations for ARM (Raspberry Pi)
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CPU_AARCH64=ON

cmake --build build --config Release -j4

# Verify build
./build/bin/llama-cli --help

Download Models from Hugging Face

# Install huggingface-cli
pip3 install huggingface-hub

# Download a GGUF model (pre-quantized for llama.cpp)
huggingface-cli download \
  TheBloke/Phi-3-mini-4k-instruct-GGUF \
  phi-3-mini-4k-instruct.Q4_K_M.gguf \
  --local-dir ./models/

# Or use the built-in converter for other models
python3 convert_hf_to_gguf.py \
  --outfile models/custom-model.gguf \
  /path/to/huggingface/model/

Run the Model

# Interactive chat
./build/bin/llama-cli \
  -m models/phi-3-mini-4k-instruct.Q4_K_M.gguf \
  -c 2048 \       # Context window
  -n 512 \        # Max tokens to generate
  -t 4 \          # Number of threads (Pi 5 has 4 cores)
  --interactive \
  -p "You are a helpful assistant."

# Start an API server (OpenAI-compatible)
./build/bin/llama-server \
  -m models/phi-3-mini-4k-instruct.Q4_K_M.gguf \
  -c 2048 \
  -t 4 \
  --host 0.0.0.0 \
  --port 8080

Quantization Levels

Quantization	Size Reduction	Quality Loss	Recommended
Q8_0	~50%	Minimal	If RAM allows
Q6_K	~60%	Very small	Good balance
Q5_K_M	~65%	Small	Good balance
Q4_K_M	~75%	Moderate	Best for Pi
Q3_K_M	~80%	Notable	Only if needed
Q2_K	~85%	Significant	Not recommended

For Raspberry Pi, Q4_K_M is the sweet spot — it provides the best balance of model quality and memory usage.

Step 4: Build a Useful Application

Home Assistant AI

Create a local AI that answers questions about your smart home:

#!/usr/bin/env python3
"""Local AI assistant accessible via your home network."""

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434/api/generate"

SYSTEM_PROMPT = """You are a helpful home assistant running on a 
Raspberry Pi. You help with:
- Answering general knowledge questions
- Writing and explaining code
- Summarizing text
- Giving recipe suggestions
- Helping with math and science
Keep responses concise since you're running on limited hardware."""

@app.route('/ask', methods=['POST'])
def ask():
    user_message = request.json.get('message', '')
    
    response = requests.post(OLLAMA_URL, json={
        "model": "phi3:mini",
        "prompt": f"{SYSTEM_PROMPT}\n\nUser: {user_message}\nAssistant:",
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 256
        }
    })
    
    result = response.json()
    return jsonify({"response": result['response']})

@app.route('/summarize', methods=['POST'])
def summarize():
    text = request.json.get('text', '')
    
    response = requests.post(OLLAMA_URL, json={
        "model": "phi3:mini",
        "prompt": f"Summarize the following text in 3 bullet points:\n\n{text}",
        "stream": False,
        "options": {"num_predict": 200}
    })
    
    result = response.json()
    return jsonify({"summary": result['response']})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

# Install dependencies
pip3 install flask requests

# Run the server
python3 home_assistant.py

# Access from any device on your network
# http://raspberrypi.local:5000/ask

Offline Document Q&A

#!/usr/bin/env python3
"""Ask questions about local documents without internet."""

import os
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def read_document(filepath):
    with open(filepath, 'r') as f:
        return f.read()

def ask_about_document(doc_text, question):
    # Truncate document to fit in context window
    max_chars = 4000  # ~1000 tokens
    if len(doc_text) > max_chars:
        doc_text = doc_text[:max_chars] + "\n[Document truncated...]"
    
    prompt = f"""Based on the following document, answer the question.
    
Document:
{doc_text}

Question: {question}

Answer:"""
    
    response = requests.post(OLLAMA_URL, json={
        "model": "phi3:mini",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 300, "temperature": 0.3}
    })
    
    return response.json()['response']

# Usage
doc = read_document("/path/to/your/document.txt")
answer = ask_about_document(doc, "What are the main conclusions?")
print(answer)

Step 5: Run as a System Service

Make your AI assistant start automatically on boot:

# Create systemd service file
sudo tee /etc/systemd/system/local-ai.service << 'EOF'
[Unit]
Description=Local AI Assistant
After=network.target

[Service]
Type=simple
User=pi
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=10
Environment=OLLAMA_HOST=0.0.0.0

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl enable local-ai.service
sudo systemctl start local-ai.service

# Check status
sudo systemctl status local-ai.service

Performance Optimization Tips

# 1. Increase swap (helps when RAM is tight)
sudo dphys-swapfile swapoff
sudo sed -i 's/CONF_SWAPSIZE=.*/CONF_SWAPSIZE=4096/' /etc/dphys-swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon

# 2. Set CPU governor to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 3. Monitor temperature (throttling kills performance)
vcgencmd measure_temp

# 4. Use NVMe SSD for model storage (faster loading)
# 5. Close unnecessary services
sudo systemctl disable bluetooth
sudo systemctl disable avahi-daemon

What You Can Build

With a Raspberry Pi running a local LLM, you can:

Private AI assistant — Ask questions without data leaving your home
Document summarizer — Drop files into a folder, get summaries automatically
Code helper — Local coding assistant for when you’re offline
Email drafting — Generate email responses locally
Language learning — Practice conversations with an AI tutor
Home automation — Natural language control for IoT devices
Kids’ homework helper — Safe, offline AI tutor with no account required

The key insight: a 3B model on a Raspberry Pi isn’t competing with GPT-5. It’s competing with not having any AI at all — and for many use cases, a local, private, free-to-query AI assistant is more than good enough.

The total cost of ownership after setup: electricity. About $5-10 per year. That’s your entire AI budget. No subscriptions, no API keys, no data privacy concerns. Just a small computer under your desk, answering questions whenever you ask.

How to Deploy AI Models on a Raspberry Pi: Run LLMs on a $75 Computer

What You’ll Need

Hardware

Software

Step 1: Set Up Your Raspberry Pi

Optional: Set Up NVMe SSD

Step 2: Install Ollama (Easy Method)

Download a Model

Test It

Performance Expectations

Step 3: Install llama.cpp (Advanced Method)

Download Models from Hugging Face

Run the Model

Quantization Levels

Step 4: Build a Useful Application

Home Assistant AI

Offline Document Q&A

Step 5: Run as a System Service

Performance Optimization Tips

What You Can Build

Sources

Share this article

> Want more like this?

> Related Articles

Web Scraping with AI: Build a Smart Data Extraction Pipeline

Create an AI Art Portfolio: From Generation to Gallery in One Weekend

Build an AI Chrome Extension: Add Claude to Any Webpage in 60 Minutes

Tags

> Stay in the loop