Build a Recommendation Engine with Python: From Zero to 'You Might Also Like'
Build a content recommendation engine using collaborative filtering, content-based filtering, and embeddings. Full Python implementation with real-world examples and deployment guidance.
Every time Netflix suggests a show you end up binge-watching, or Amazon recommends a product you didn’t know you needed, or Spotify creates a playlist that feels personally curated — a recommendation engine is behind it.
These systems aren’t magic. They’re math. And building a basic one is surprisingly straightforward. In this tutorial, we’ll implement three approaches — collaborative filtering, content-based filtering, and embedding-based recommendations — and combine them into a hybrid system that handles real-world scenarios.
The Three Approaches
Before writing code, understand the fundamental strategies:
1. Collaborative Filtering
"Users who liked what you liked also liked..."
Based on: User behavior patterns
Pro: No content analysis needed
Con: Cold start problem (new users/items)
2. Content-Based Filtering
"Because you liked [item with features X, Y, Z]..."
Based on: Item attributes/features
Pro: Works for new items
Con: Limited discovery (filter bubble)
3. Embedding-Based (Neural)
"Items that are semantically similar..."
Based on: Vector representations
Pro: Captures nuanced similarity
Con: Requires embeddings infrastructure
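All three approaches in this tutorial rank candidates by cosine similarity between vectors: rating vectors, TF-IDF vectors, or embeddings. As a minimal sketch of that shared building block, here is the computation in plain Python. The two rating vectors and the names `alice` and `bob` are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical rating vectors over the same five items (0 = not rated)
alice = [5, 4, 3, 0, 0]
bob = [4, 5, 0, 4, 0]

print(round(cosine(alice, bob), 3))  # 0.749
```

A value near 1 means the two vectors point in nearly the same direction; a value of 0 means they have nothing in common.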
Setup
pip install pandas numpy scikit-learn scipy flask
We’ll use a movie recommendation scenario, but the techniques apply to any domain (products, articles, music, etc.).
Approach 1: Collaborative Filtering
User-Item Matrix
# collaborative.py
"""Collaborative filtering recommendation engine."""
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


class CollaborativeRecommender:
    """Recommend items based on similar users' behavior."""

    def __init__(self, n_neighbors: int = 20):
        self.n_neighbors = n_neighbors
        self.model = NearestNeighbors(
            metric='cosine',
            algorithm='brute',
            n_neighbors=n_neighbors
        )
        self.user_item_matrix = None
        self.user_map = {}          # user_id -> matrix_index
        self.item_map = {}          # item_id -> matrix_index
        self.reverse_item_map = {}  # matrix_index -> item_id

    def fit(self, interactions: pd.DataFrame):
        """
        Fit the model on user-item interactions.

        Args:
            interactions: DataFrame with columns [user_id, item_id, rating]
        """
        # Create mappings
        users = interactions['user_id'].unique()
        items = interactions['item_id'].unique()
        self.user_map = {uid: idx for idx, uid in enumerate(users)}
        self.item_map = {iid: idx for idx, iid in enumerate(items)}
        self.reverse_item_map = {idx: iid for iid, idx in self.item_map.items()}

        # Build sparse user-item matrix (rows = users, columns = items)
        rows = interactions['user_id'].map(self.user_map)
        cols = interactions['item_id'].map(self.item_map)
        values = interactions['rating']
        self.user_item_matrix = csr_matrix(
            (values, (rows, cols)),
            shape=(len(users), len(items))
        )

        # Fit nearest neighbors on user vectors
        self.model.fit(self.user_item_matrix)
        return self

    def recommend(self, user_id, n_recommendations: int = 10) -> list[dict]:
        """
        Get recommendations for a user.

        Returns list of {item_id, score} dicts.
        """
        if user_id not in self.user_map:
            return []
        user_idx = self.user_map[user_id]
        user_vector = self.user_item_matrix[user_idx]

        # Find similar users. Ask for one extra neighbor because the query
        # user is part of the index, and never ask for more users than exist
        # (kneighbors raises an error otherwise).
        k = min(self.n_neighbors + 1, self.user_item_matrix.shape[0])
        distances, indices = self.model.kneighbors(user_vector, n_neighbors=k)

        # Skip first result (the user themselves)
        similar_users = indices.flatten()[1:]
        similarity_scores = 1 - distances.flatten()[1:]

        # Aggregate ratings from similar users
        items_already_rated = set(
            self.user_item_matrix[user_idx].nonzero()[1]
        )
        item_scores = {}
        for neighbor_idx, sim_score in zip(similar_users, similarity_scores):
            neighbor_ratings = self.user_item_matrix[neighbor_idx]
            rated_items = neighbor_ratings.nonzero()[1]
            for item_idx in rated_items:
                if item_idx not in items_already_rated:
                    rating = neighbor_ratings[0, item_idx]
                    if item_idx not in item_scores:
                        item_scores[item_idx] = {'weighted_sum': 0, 'sim_sum': 0}
                    item_scores[item_idx]['weighted_sum'] += sim_score * rating
                    item_scores[item_idx]['sim_sum'] += sim_score

        # Calculate predicted ratings (similarity-weighted average)
        recommendations = []
        for item_idx, scores in item_scores.items():
            if scores['sim_sum'] > 0:
                predicted_rating = scores['weighted_sum'] / scores['sim_sum']
                recommendations.append({
                    'item_id': self.reverse_item_map[item_idx],
                    # Cast to a plain float so the score prints cleanly
                    'score': round(float(predicted_rating), 3)
                })

        # Sort by predicted rating
        recommendations.sort(key=lambda x: x['score'], reverse=True)
        return recommendations[:n_recommendations]
Usage
# Example usage
import pandas as pd
# Assumes the class above is saved as collaborative.py
from collaborative import CollaborativeRecommender

# Sample data
data = pd.DataFrame({
    'user_id': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5],
    'item_id': ['A','B','C','A','B','D','B','C','E','A','D','E','C','D','E'],
    'rating': [5, 4, 3, 4, 5, 4, 5, 4, 3, 3, 5, 4, 4, 3, 5],
})

rec = CollaborativeRecommender(n_neighbors=3)
rec.fit(data)

# Get recommendations for user 1
recs = rec.recommend(user_id=1, n_recommendations=5)
print(recs)
# Item D ranks first (predicted rating about 4.286), then E (about 3.319)
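Working through the sample data by hand shows where the predicted ratings come from. The sketch below recomputes them in plain Python, without scikit-learn: it finds the three users most similar to user 1, then takes a similarity-weighted average of their ratings for each item user 1 has not rated. The helper names `cosine` and `predict` are introduced here for illustration.

```python
import math

# Ratings from the sample data, in item order A, B, C, D, E (0 = not rated)
ratings = {
    1: [5, 4, 3, 0, 0],
    2: [4, 5, 0, 4, 0],
    3: [0, 5, 4, 0, 3],
    4: [3, 0, 0, 5, 4],
    5: [0, 0, 4, 3, 5],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


target = 1
sims = {u: cosine(ratings[target], r) for u, r in ratings.items() if u != target}
# Three most similar users, matching n_neighbors=3
neighbors = sorted(sims, key=sims.get, reverse=True)[:3]


def predict(item_index):
    """Similarity-weighted average over the neighbors who rated the item."""
    rated = [
        (sims[u], ratings[u][item_index])
        for u in neighbors
        if ratings[u][item_index] > 0
    ]
    return sum(s * r for s, r in rated) / sum(s for s, _ in rated)


print(neighbors)             # [2, 3, 4]
print(round(predict(3), 3))  # item D -> 4.286
print(round(predict(4), 3))  # item E -> 3.319
```

Item D scores higher because user 2, the closest neighbor, rated it, while item E is only backed by the less similar users 3 and 4.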
Approach 2: Content-Based Filtering
# content_based.py
"""Content-based filtering using item features."""
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class ContentBasedRecommender:
    """Recommend items based on content similarity."""

    def __init__(self):
        self.tfidf = TfidfVectorizer(
            stop_words='english',
            max_features=5000,
            ngram_range=(1, 2)
        )
        self.item_vectors = None
        self.item_ids = []
        self.item_data = {}

    def fit(self, items: pd.DataFrame, content_column: str = 'description'):
        """
        Fit on item content.

        Args:
            items: DataFrame with columns [item_id, description, ...]
            content_column: Column containing text content
        """
        self.item_ids = items['item_id'].tolist()
        self.item_data = items.set_index('item_id').to_dict('index')
        # Build TF-IDF vectors from content
        self.item_vectors = self.tfidf.fit_transform(items[content_column])
        return self

    def get_similar_items(self, item_id, n: int = 10) -> list[dict]:
        """Find items similar to a given item."""
        if item_id not in self.item_ids:
            return []
        idx = self.item_ids.index(item_id)
        item_vector = self.item_vectors[idx]

        # Compute similarity with all items
        similarities = cosine_similarity(item_vector, self.item_vectors).flatten()

        # Rank all items, then drop the query item by index. Relying on it
        # being first in the ranking can fail when scores are tied.
        ranked = similarities.argsort()[::-1]
        similar_indices = [i for i in ranked if i != idx][:n]

        results = []
        for i in similar_indices:
            results.append({
                'item_id': self.item_ids[i],
                'similarity': round(float(similarities[i]), 3),
            })
        return results

    def recommend_for_user(
        self,
        liked_item_ids: list,
        n_recommendations: int = 10
    ) -> list[dict]:
        """
        Recommend items based on a user's liked items.

        Args:
            liked_item_ids: List of item_ids the user has liked
            n_recommendations: Number of recommendations
        """
        if not liked_item_ids:
            return []

        # Build user profile as average of liked item vectors
        liked_indices = [
            self.item_ids.index(iid)
            for iid in liked_item_ids
            if iid in self.item_ids
        ]
        if not liked_indices:
            return []
        user_profile = self.item_vectors[liked_indices].mean(axis=0)
        # The sparse mean is an np.matrix; convert it to a plain 2D array
        user_profile = np.asarray(user_profile)

        # Compute similarity with all items
        similarities = cosine_similarity(user_profile, self.item_vectors).flatten()

        # Exclude already-liked items
        liked_set = set(liked_item_ids)
        results = []
        for idx in similarities.argsort()[::-1]:
            item_id = self.item_ids[idx]
            if item_id not in liked_set:
                results.append({
                    'item_id': item_id,
                    'score': round(float(similarities[idx]), 3),
                })
            if len(results) >= n_recommendations:
                break
        return results
Usage
import pandas as pd
# Assumes the class above is saved as content_based.py
from content_based import ContentBasedRecommender

items = pd.DataFrame({
    'item_id': ['movie1', 'movie2', 'movie3', 'movie4', 'movie5'],
    'title': [
        'The Matrix', 'Inception', 'Interstellar',
        'The Notebook', 'Pride and Prejudice'
    ],
    'description': [
        'sci-fi action hacker virtual reality dystopian future',
        'sci-fi thriller dreams heist mind-bending inception layers',
        'sci-fi space exploration time travel wormhole family',
        'romance love story couple letters passion drama',
        'romance period drama england love class society',
    ]
})

cb = ContentBasedRecommender()
cb.fit(items, content_column='description')

# Find movies similar to The Matrix
similar = cb.get_similar_items('movie1', n=3)
# Returns three items. The two sci-fi titles (movie2, movie3) come first,
# with low scores because they share only the "sci-fi" terms with movie1;
# the third is a romance title with similarity 0.

# Recommend based on liked movies
recs = cb.recommend_for_user(['movie1', 'movie2'], n_recommendations=3)
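TF-IDF similarity is driven entirely by shared terms, so it helps to look at which terms two descriptions actually have in common. The sketch below approximates the vectorizer's tokenization in plain Python. It assumes the default `token_pattern` of `TfidfVectorizer` (words of two or more characters) and deliberately ignores stop words, bigrams, and IDF weighting.

```python
import re

descriptions = {
    'movie1': 'sci-fi action hacker virtual reality dystopian future',
    'movie2': 'sci-fi thriller dreams heist mind-bending inception layers',
    'movie4': 'romance love story couple letters passion drama',
}

# Same shape as TfidfVectorizer's default token pattern: 2+ word characters
TOKEN = re.compile(r"\b\w\w+\b")


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(TOKEN.findall(text.lower()))


print(sorted(tokens(descriptions['movie1']) & tokens(descriptions['movie2'])))
# ['fi', 'sci']
print(sorted(tokens(descriptions['movie1']) & tokens(descriptions['movie4'])))
# []
```

The hyphen in "sci-fi" splits it into two tokens, and those are the only overlap between the two sci-fi descriptions. With so few shared terms, richer descriptions (or embeddings) are needed for finer-grained similarity.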
Approach 3: Embedding-Based Recommendations
# embedding_recommender.py
"""Recommendation using text embeddings for semantic similarity."""
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class EmbeddingRecommender:
    """Recommend items using semantic embeddings."""

    def __init__(self, api_key: str = None):
        """
        Store pre-computed embeddings in memory.

        api_key is accepted but not used in this class: embeddings are
        generated elsewhere and passed in through add_embeddings().
        """
        self.embeddings = {}  # item_id -> embedding vector
        self.item_ids = []
        self.embedding_matrix = None

    def add_embeddings(self, item_id: str, embedding: list[float]):
        """Add a pre-computed embedding for an item."""
        self.embeddings[item_id] = np.array(embedding)

    def build_index(self):
        """Build the similarity index. Call after adding embeddings and before querying."""
        self.item_ids = list(self.embeddings.keys())
        self.embedding_matrix = np.array(
            [self.embeddings[iid] for iid in self.item_ids]
        )

    def get_similar(self, item_id: str, n: int = 10) -> list[dict]:
        """Find semantically similar items."""
        if item_id not in self.embeddings:
            return []
        query_embedding = self.embeddings[item_id].reshape(1, -1)
        similarities = cosine_similarity(
            query_embedding, self.embedding_matrix
        ).flatten()

        results = []
        for idx in similarities.argsort()[::-1]:
            if self.item_ids[idx] != item_id:
                results.append({
                    'item_id': self.item_ids[idx],
                    'similarity': round(float(similarities[idx]), 4)
                })
            if len(results) >= n:
                break
        return results

    def recommend_for_user(
        self,
        liked_ids: list[str],
        n: int = 10
    ) -> list[dict]:
        """Recommend based on average embedding of liked items."""
        liked_embeddings = [
            self.embeddings[iid] for iid in liked_ids
            if iid in self.embeddings
        ]
        if not liked_embeddings:
            return []
        user_vector = np.mean(liked_embeddings, axis=0).reshape(1, -1)
        similarities = cosine_similarity(
            user_vector, self.embedding_matrix
        ).flatten()

        liked_set = set(liked_ids)
        results = []
        for idx in similarities.argsort()[::-1]:
            if self.item_ids[idx] not in liked_set:
                results.append({
                    'item_id': self.item_ids[idx],
                    'score': round(float(similarities[idx]), 4)
                })
            if len(results) >= n:
                break
        return results
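The user-profile step can be hard to picture without numbers. Here is a dependency-free sketch of the same idea: average the vectors of the liked items, then rank the remaining items by cosine similarity to that average. The 3-dimensional vectors are made up for illustration; real embeddings come from an embedding model and typically have hundreds of dimensions.

```python
import math

# Made-up vectors for illustration only, not output from a real model
embeddings = {
    'movie1': [0.9, 0.1, 0.0],
    'movie2': [0.8, 0.2, 0.1],
    'movie3': [0.7, 0.0, 0.3],
    'movie4': [0.0, 0.9, 0.2],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recommend(liked_ids, n=2):
    """Rank unseen items by similarity to the mean of the liked vectors."""
    liked = [embeddings[i] for i in liked_ids if i in embeddings]
    if not liked:
        return []
    # Element-wise mean across the liked vectors
    profile = [sum(dim) / len(liked) for dim in zip(*liked)]
    scored = [
        (item_id, round(cosine(profile, vec), 4))
        for item_id, vec in embeddings.items()
        if item_id not in liked_ids
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


print([item_id for item_id, _ in recommend(['movie1', 'movie2'])])
# ['movie3', 'movie4']
```

Averaging works well when a user's tastes are focused. For users with several distinct interests, the mean can land between clusters, so scoring against each liked item separately is a common alternative.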
Combining Into a Hybrid System
# hybrid.py
"""Hybrid recommendation engine combining multiple approaches."""


class HybridRecommender:
    """Combine collaborative, content-based, and embedding recommendations."""

    def __init__(
        self,
        collaborative_weight: float = 0.4,
        content_weight: float = 0.3,
        embedding_weight: float = 0.3
    ):
        self.weights = {
            'collaborative': collaborative_weight,
            'content': content_weight,
            'embedding': embedding_weight
        }
        # Assign fitted recommenders to these attributes before calling recommend()
        self.collaborative = None
        self.content_based = None
        self.embedding_based = None

    def recommend(
        self,
        user_id,
        liked_item_ids: list,
        n_recommendations: int = 10
    ) -> list[dict]:
        """Get hybrid recommendations."""
        all_scores = {}  # item_id -> weighted score

        # Collaborative filtering
        if self.collaborative:
            collab_recs = self.collaborative.recommend(
                user_id, n_recommendations=n_recommendations * 2
            )
            # Scale scores to [0, 1] so the three sources are comparable
            max_score = max((r['score'] for r in collab_recs), default=1)
            for rec in collab_recs:
                normalized = rec['score'] / max_score if max_score > 0 else 0
                item_id = rec['item_id']
                all_scores[item_id] = all_scores.get(item_id, 0) + (
                    normalized * self.weights['collaborative']
                )

        # Content-based filtering
        if self.content_based and liked_item_ids:
            content_recs = self.content_based.recommend_for_user(
                liked_item_ids, n_recommendations=n_recommendations * 2
            )
            max_score = max((r['score'] for r in content_recs), default=1)
            for rec in content_recs:
                normalized = rec['score'] / max_score if max_score > 0 else 0
                item_id = rec['item_id']
                all_scores[item_id] = all_scores.get(item_id, 0) + (
                    normalized * self.weights['content']
                )

        # Embedding-based
        if self.embedding_based and liked_item_ids:
            # EmbeddingRecommender.recommend_for_user names this parameter `n`
            embed_recs = self.embedding_based.recommend_for_user(
                liked_item_ids, n=n_recommendations * 2
            )
            max_score = max((r['score'] for r in embed_recs), default=1)
            for rec in embed_recs:
                normalized = rec['score'] / max_score if max_score > 0 else 0
                item_id = rec['item_id']
                all_scores[item_id] = all_scores.get(item_id, 0) + (
                    normalized * self.weights['embedding']
                )

        # Sort by combined score
        results = [
            {'item_id': iid, 'score': round(score, 4)}
            for iid, score in sorted(
                all_scores.items(),
                key=lambda x: x[1],
                reverse=True
            )
        ]
        return results[:n_recommendations]
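The score fusion is easier to follow with concrete numbers. The sketch below applies the same max-normalize-then-weight logic to three hand-written recommendation lists. The function name `fuse` and the input scores are hypothetical; the weights are the defaults from `HybridRecommender`.

```python
def fuse(rec_lists: dict[str, list[dict]], weights: dict[str, float]) -> list[dict]:
    """Max-normalize each source's scores, then sum them with per-source weights."""
    combined: dict[str, float] = {}
    for source, recs in rec_lists.items():
        top = max((r['score'] for r in recs), default=1)
        for r in recs:
            normalized = r['score'] / top if top > 0 else 0
            combined[r['item_id']] = (
                combined.get(r['item_id'], 0) + normalized * weights[source]
            )
    ranked = sorted(combined.items(), key=lambda pair: pair[1], reverse=True)
    return [{'item_id': iid, 'score': round(score, 4)} for iid, score in ranked]


# Hypothetical outputs from the three recommenders
rec_lists = {
    'collaborative': [{'item_id': 'D', 'score': 4.0}, {'item_id': 'E', 'score': 2.0}],
    'content': [{'item_id': 'E', 'score': 0.5}, {'item_id': 'F', 'score': 0.25}],
    'embedding': [{'item_id': 'E', 'score': 0.9}, {'item_id': 'D', 'score': 0.45}],
}
weights = {'collaborative': 0.4, 'content': 0.3, 'embedding': 0.3}

print(fuse(rec_lists, weights))
# [{'item_id': 'E', 'score': 0.8}, {'item_id': 'D', 'score': 0.55}, {'item_id': 'F', 'score': 0.15}]
```

Item E wins even though collaborative filtering ranks D first, because all three sources agree on E. Note that an item missing from a source simply gets no contribution from it, so items recommended by only one source are capped at that source's weight.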
Serving via API
# api.py
"""Simple Flask API for recommendations."""
from flask import Flask, request, jsonify

app = Flask(__name__)

# Initialize your recommenders (load models, data, etc.) before serving.
# The routes below expect `hybrid`, `content_based`, and a
# `get_user_liked_items(user_id)` lookup to be defined here.
# hybrid = HybridRecommender(...)


@app.route('/recommend', methods=['GET'])
def recommend():
    user_id = request.args.get('user_id', type=int)
    n = request.args.get('n', default=10, type=int)
    # Compare against None so that a user_id of 0 is still accepted
    if user_id is None:
        return jsonify({'error': 'user_id required'}), 400
    recommendations = hybrid.recommend(
        user_id=user_id,
        liked_item_ids=get_user_liked_items(user_id),
        n_recommendations=n
    )
    return jsonify({'recommendations': recommendations})


@app.route('/similar', methods=['GET'])
def similar():
    item_id = request.args.get('item_id')
    n = request.args.get('n', default=10, type=int)
    if not item_id:
        return jsonify({'error': 'item_id required'}), 400
    similar_items = content_based.get_similar_items(item_id, n=n)
    return jsonify({'similar_items': similar_items})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
Key Takeaways
Start with collaborative filtering if you have user interaction data (clicks, purchases, ratings). It’s the most powerful approach when you have enough data.
Add content-based filtering for the cold start problem — new items with no interaction data can still be recommended based on their features.
Use embeddings when you need semantic understanding beyond keyword matching. An embedding model understands that “thriller” and “suspense” are related even though they share no words.
Combine all three for production systems. Weight them based on your data quality: more user data = higher collaborative weight. More item metadata = higher content weight.
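One way to act on that last point is to derive the weights from how much history a user has. The sketch below is a hypothetical heuristic, not a tuned rule: the function name `choose_weights`, the thresholds of 5 and 50 interactions, and the linear ramp are all assumptions chosen for illustration. At 50 interactions it reproduces the default 0.4 / 0.3 / 0.3 split used by `HybridRecommender`.

```python
def choose_weights(
    n_interactions: int,
    min_history: int = 5,
    full_history: int = 50,
) -> dict[str, float]:
    """Shift weight toward collaborative filtering as a user's history grows.

    The thresholds and the 0.4 cap are illustrative assumptions, not tuned values.
    """
    if n_interactions < min_history:
        collab = 0.0  # cold start: rely on item features only
    else:
        progress = min(1.0, (n_interactions - min_history) / (full_history - min_history))
        collab = 0.4 * progress
    remainder = (1.0 - collab) / 2  # split the rest between content and embeddings
    return {'collaborative': collab, 'content': remainder, 'embedding': remainder}


print(choose_weights(2))   # collaborative 0.0, content 0.5, embedding 0.5
print(choose_weights(50))  # collaborative 0.4, content 0.3, embedding 0.3
```

In practice these values would be set by offline evaluation or A/B testing rather than picked by hand.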
The difference between a recommendation engine that feels magical and one that feels random isn’t algorithmic sophistication — it’s data quality. Clean, abundant interaction data beats a clever algorithm on sparse data every time. Focus on collecting and cleaning your data before optimizing your models.