TUTORIALS 12 min read

How to Deploy AI Models on a Raspberry Pi: Run LLMs on a $75 Computer

You don't need a $10,000 GPU to run AI locally. This tutorial shows you how to deploy small language models on a Raspberry Pi 5 — from hardware setup to running inference in under 30 minutes.

By EgoistAI ·
How to Deploy AI Models on a Raspberry Pi: Run LLMs on a $75 Computer

Running AI on a $75 single-board computer sounds absurd until you realize that a quantized 3B parameter model in 2026 is more capable than GPT-3 was in 2022 — and a Raspberry Pi 5 with 8GB RAM can run it at reasonable speeds.

This isn’t about replacing cloud AI. It’s about having a private, offline, always-available AI assistant that costs nothing per query, runs on your home network, and processes your data without sending a single byte to any external server.

Here’s how to set it up.


What You’ll Need

Hardware

ComponentRecommendedMinimumCost
Raspberry Pi 58GB RAM4GB RAM$60-80
MicroSD Card64GB A232GB A1$10-15
Power SupplyUSB-C 27W officialUSB-C 15W$12
CoolingActive cooler (fan + heatsink)Passive heatsink$5-15
CaseOfficial Pi 5 caseAny case$10
NVMe SSD (optional)256GB+ via Pi HATNot required$25-40

Total cost: $75-160

The 8GB model is strongly recommended. With 4GB, you’re limited to tiny models (1-2B parameters) that are noticeably less capable.

The NVMe SSD isn’t required but significantly improves model loading time (models are 2-4GB files that need to be read from storage).

Software

  • Raspberry Pi OS (64-bit, Bookworm or later)
  • Ollama or llama.cpp
  • Python 3.11+ (for custom scripts)

Step 1: Set Up Your Raspberry Pi

Flash Raspberry Pi OS 64-bit to your microSD card using the Raspberry Pi Imager:

# After first boot, update everything
sudo apt update && sudo apt upgrade -y

# Install essential build tools
sudo apt install -y build-essential cmake git python3-pip

# Check your system
uname -m     # Should output: aarch64
free -h      # Check available RAM
df -h        # Check storage space

Important: You must use the 64-bit OS. The 32-bit version cannot address enough memory for AI models.

Optional: Set Up NVMe SSD

If you have an NVMe HAT:

# Check NVMe is detected
lsblk

# Format and mount
sudo mkfs.ext4 /dev/nvme0n1p1
sudo mkdir /mnt/ssd
sudo mount /dev/nvme0n1p1 /mnt/ssd

# Add to fstab for auto-mount
echo '/dev/nvme0n1p1 /mnt/ssd ext4 defaults 0 2' | sudo tee -a /etc/fstab

# Use SSD for model storage
mkdir -p /mnt/ssd/models

Step 2: Install Ollama (Easy Method)

Ollama is the simplest way to run LLMs locally. It handles model downloading, quantization, and serving with a single binary:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Start the Ollama service (runs in background)
ollama serve &

Download a Model

For Raspberry Pi 5 (8GB), these models work well:

# Recommended: Phi-3 Mini (3.8B parameters, ~2.3GB)
ollama pull phi3:mini

# Alternative: Gemma 2 2B (~1.6GB)
ollama pull gemma2:2b

# Alternative: Llama 3.2 3B (~2GB)
ollama pull llama3.2:3b

# Alternative: Qwen 2.5 3B (~2GB)
ollama pull qwen2.5:3b

# List downloaded models
ollama list

Test It

# Interactive chat
ollama run phi3:mini

# Single query
ollama run phi3:mini "Explain quantum computing in 3 sentences"

# API endpoint (for integration with other tools)
curl http://localhost:11434/api/generate -d '{
  "model": "phi3:mini",
  "prompt": "Write a Python function to calculate fibonacci numbers",
  "stream": false
}'

Performance Expectations

ModelSizeTokens/sec (Pi 5 8GB)Quality
Qwen 2.5 0.5B400MB15-20 tok/sBasic
Gemma 2 2B1.6GB6-10 tok/sGood
Phi-3 Mini 3.8B2.3GB3-6 tok/sVery Good
Llama 3.2 3B2.0GB4-7 tok/sVery Good
Mistral 7B (Q4)4.1GB1-3 tok/sExcellent (slow)

3-7 tokens per second is slower than typing speed but fast enough for useful interactions. You won’t be generating novels, but answering questions, summarizing text, and writing code snippets is practical.


Step 3: Install llama.cpp (Advanced Method)

For more control over performance and model selection, build llama.cpp from source:

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with optimizations for ARM (Raspberry Pi)
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CPU_AARCH64=ON

cmake --build build --config Release -j4

# Verify build
./build/bin/llama-cli --help

Download Models from Hugging Face

# Install huggingface-cli
pip3 install huggingface-hub

# Download a GGUF model (pre-quantized for llama.cpp)
huggingface-cli download \
  TheBloke/Phi-3-mini-4k-instruct-GGUF \
  phi-3-mini-4k-instruct.Q4_K_M.gguf \
  --local-dir ./models/

# Or use the built-in converter for other models
python3 convert_hf_to_gguf.py \
  --outfile models/custom-model.gguf \
  /path/to/huggingface/model/

Run the Model

# Interactive chat
./build/bin/llama-cli \
  -m models/phi-3-mini-4k-instruct.Q4_K_M.gguf \
  -c 2048 \       # Context window
  -n 512 \        # Max tokens to generate
  -t 4 \          # Number of threads (Pi 5 has 4 cores)
  --interactive \
  -p "You are a helpful assistant."

# Start an API server (OpenAI-compatible)
./build/bin/llama-server \
  -m models/phi-3-mini-4k-instruct.Q4_K_M.gguf \
  -c 2048 \
  -t 4 \
  --host 0.0.0.0 \
  --port 8080

Quantization Levels

QuantizationSize ReductionQuality LossRecommended
Q8_0~50%MinimalIf RAM allows
Q6_K~60%Very smallGood balance
Q5_K_M~65%SmallGood balance
Q4_K_M~75%ModerateBest for Pi
Q3_K_M~80%NotableOnly if needed
Q2_K~85%SignificantNot recommended

For Raspberry Pi, Q4_K_M is the sweet spot — it provides the best balance of model quality and memory usage.


Step 4: Build a Useful Application

Home Assistant AI

Create a local AI that answers questions about your smart home:

#!/usr/bin/env python3
"""Local AI assistant accessible via your home network."""

from flask import Flask, request, jsonify
import requests

app = Flask(__name__)
OLLAMA_URL = "http://localhost:11434/api/generate"

SYSTEM_PROMPT = """You are a helpful home assistant running on a 
Raspberry Pi. You help with:
- Answering general knowledge questions
- Writing and explaining code
- Summarizing text
- Giving recipe suggestions
- Helping with math and science
Keep responses concise since you're running on limited hardware."""

@app.route('/ask', methods=['POST'])
def ask():
    user_message = request.json.get('message', '')
    
    response = requests.post(OLLAMA_URL, json={
        "model": "phi3:mini",
        "prompt": f"{SYSTEM_PROMPT}\n\nUser: {user_message}\nAssistant:",
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_predict": 256
        }
    })
    
    result = response.json()
    return jsonify({"response": result['response']})

@app.route('/summarize', methods=['POST'])
def summarize():
    text = request.json.get('text', '')
    
    response = requests.post(OLLAMA_URL, json={
        "model": "phi3:mini",
        "prompt": f"Summarize the following text in 3 bullet points:\n\n{text}",
        "stream": False,
        "options": {"num_predict": 200}
    })
    
    result = response.json()
    return jsonify({"summary": result['response']})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
# Install dependencies
pip3 install flask requests

# Run the server
python3 home_assistant.py

# Access from any device on your network
# http://raspberrypi.local:5000/ask

Offline Document Q&A

#!/usr/bin/env python3
"""Ask questions about local documents without internet."""

import os
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def read_document(filepath):
    with open(filepath, 'r') as f:
        return f.read()

def ask_about_document(doc_text, question):
    # Truncate document to fit in context window
    max_chars = 4000  # ~1000 tokens
    if len(doc_text) > max_chars:
        doc_text = doc_text[:max_chars] + "\n[Document truncated...]"
    
    prompt = f"""Based on the following document, answer the question.
    
Document:
{doc_text}

Question: {question}

Answer:"""
    
    response = requests.post(OLLAMA_URL, json={
        "model": "phi3:mini",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 300, "temperature": 0.3}
    })
    
    return response.json()['response']

# Usage
doc = read_document("/path/to/your/document.txt")
answer = ask_about_document(doc, "What are the main conclusions?")
print(answer)

Step 5: Run as a System Service

Make your AI assistant start automatically on boot:

# Create systemd service file
sudo tee /etc/systemd/system/local-ai.service << 'EOF'
[Unit]
Description=Local AI Assistant
After=network.target

[Service]
Type=simple
User=pi
ExecStart=/usr/local/bin/ollama serve
Restart=always
RestartSec=10
Environment=OLLAMA_HOST=0.0.0.0

[Install]
WantedBy=multi-user.target
EOF

# Enable and start the service
sudo systemctl enable local-ai.service
sudo systemctl start local-ai.service

# Check status
sudo systemctl status local-ai.service

Performance Optimization Tips

# 1. Increase swap (helps when RAM is tight)
sudo dphys-swapfile swapoff
sudo sed -i 's/CONF_SWAPSIZE=.*/CONF_SWAPSIZE=4096/' /etc/dphys-swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon

# 2. Set CPU governor to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# 3. Monitor temperature (throttling kills performance)
vcgencmd measure_temp

# 4. Use NVMe SSD for model storage (faster loading)
# 5. Close unnecessary services
sudo systemctl disable bluetooth
sudo systemctl disable avahi-daemon

What You Can Build

With a Raspberry Pi running a local LLM, you can:

  • Private AI assistant — Ask questions without data leaving your home
  • Document summarizer — Drop files into a folder, get summaries automatically
  • Code helper — Local coding assistant for when you’re offline
  • Email drafting — Generate email responses locally
  • Language learning — Practice conversations with an AI tutor
  • Home automation — Natural language control for IoT devices
  • Kids’ homework helper — Safe, offline AI tutor with no account required

The key insight: a 3B model on a Raspberry Pi isn’t competing with GPT-5. It’s competing with not having any AI at all — and for many use cases, a local, private, free-to-query AI assistant is more than good enough.

The total cost of ownership after setup: electricity. About $5-10 per year. That’s your entire AI budget. No subscriptions, no API keys, no data privacy concerns. Just a small computer under your desk, answering questions whenever you ask.

Share this article

> Want more like this?

Get the best AI insights delivered weekly.

> Related Articles

Tags

Raspberry Piedge AIlocal LLMllama.cpptutorial

> Stay in the loop

Weekly AI tools & insights.