
Build a Recommendation Engine with Python: From Zero to 'You Might Also Like'

Build a content recommendation engine using collaborative filtering, content-based filtering, and embeddings. Full Python implementation with real-world examples and deployment guidance.

By EgoistAI

Every time Netflix suggests a show you end up binge-watching, or Amazon recommends a product you didn’t know you needed, or Spotify creates a playlist that feels personally curated — a recommendation engine is behind it.

These systems aren’t magic. They’re math. And building a basic one is surprisingly straightforward. In this tutorial, we’ll implement three approaches — collaborative filtering, content-based filtering, and embedding-based recommendations — and combine them into a hybrid system that handles real-world scenarios.


The Three Approaches

Before writing code, understand the fundamental strategies:

1. Collaborative Filtering
   "Users who liked what you liked also liked..."
   Based on: User behavior patterns
   Pro: No content analysis needed
   Con: Cold start problem (new users/items)

2. Content-Based Filtering
   "Because you liked [item with features X, Y, Z]..."
   Based on: Item attributes/features
   Pro: Works for new items
   Con: Limited discovery (filter bubble)

3. Embedding-Based (Neural)
   "Items that are semantically similar..."
   Based on: Vector representations
   Pro: Captures nuanced similarity
   Con: Requires embeddings infrastructure
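
All three strategies end up ranking candidates by how close two vectors are, and the code in this tutorial uses cosine similarity for that. As a minimal sketch (the helper name cosine and the toy vectors are illustrative, not part of the tutorial's classes):

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel vectors
no_overlap = cosine([1, 0, 0], [0, 3, 0])      # nothing in common
partial = cosine([5, 4, 3], [4, 5, 0])         # some overlap

print(round(same_direction, 3), round(no_overlap, 3), round(partial, 3))
# prints: 1.0 0.0 0.883
```

A score near 1 means the two vectors point the same way (similar taste or similar content), and a score of 0 means they have nothing in common.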

Setup

pip install pandas numpy scikit-learn scipy flask

We’ll use a movie recommendation scenario, but the techniques apply to any domain (products, articles, music, etc.).


Approach 1: Collaborative Filtering

User-Item Matrix

# collaborative.py
"""Collaborative filtering recommendation engine."""

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


class CollaborativeRecommender:
    """Recommend items based on similar users' behavior."""
    
    def __init__(self, n_neighbors: int = 20):
        self.n_neighbors = n_neighbors
        self.model = NearestNeighbors(
            metric='cosine', 
            algorithm='brute',
            n_neighbors=n_neighbors
        )
        self.user_item_matrix = None
        self.user_map = {}  # user_id -> matrix_index
        self.item_map = {}  # item_id -> matrix_index
        self.reverse_item_map = {}  # matrix_index -> item_id
    
    def fit(self, interactions: pd.DataFrame):
        """
        Fit the model on user-item interactions.
        
        Args:
            interactions: DataFrame with columns [user_id, item_id, rating]
        """
        # Create mappings
        users = interactions['user_id'].unique()
        items = interactions['item_id'].unique()
        
        self.user_map = {uid: idx for idx, uid in enumerate(users)}
        self.item_map = {iid: idx for idx, iid in enumerate(items)}
        self.reverse_item_map = {idx: iid for iid, idx in self.item_map.items()}
        
        # Build sparse user-item matrix
        rows = interactions['user_id'].map(self.user_map)
        cols = interactions['item_id'].map(self.item_map)
        values = interactions['rating']
        
        self.user_item_matrix = csr_matrix(
            (values, (rows, cols)),
            shape=(len(users), len(items))
        )
        
        # Fit nearest neighbors on user vectors
        self.model.fit(self.user_item_matrix)
        
        return self
    
    def recommend(self, user_id, n_recommendations: int = 10) -> list[dict]:
        """
        Get recommendations for a user.
        
        Returns list of {item_id, score} dicts.
        """
        if user_id not in self.user_map:
            return []
        
        user_idx = self.user_map[user_id]
        user_vector = self.user_item_matrix[user_idx]
        
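        # Find similar users (cap k so it never exceeds the number of users)
        k = min(self.n_neighbors + 1, self.user_item_matrix.shape[0])
        distances, indices = self.model.kneighbors(
            user_vector, 
            n_neighbors=k
        )
        
        # Drop the user themselves; with tied distances they may not be first
        indices = indices.flatten()
        is_other = indices != user_idx
        similar_users = indices[is_other]
        similarity_scores = 1 - distances.flatten()[is_other]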
        
        # Aggregate ratings from similar users
        items_already_rated = set(
            self.user_item_matrix[user_idx].nonzero()[1]
        )
        
        item_scores = {}
        
        for neighbor_idx, sim_score in zip(similar_users, similarity_scores):
            neighbor_ratings = self.user_item_matrix[neighbor_idx]
            rated_items = neighbor_ratings.nonzero()[1]
            
            for item_idx in rated_items:
                if item_idx not in items_already_rated:
                    rating = neighbor_ratings[0, item_idx]
                    if item_idx not in item_scores:
                        item_scores[item_idx] = {'weighted_sum': 0, 'sim_sum': 0}
                    item_scores[item_idx]['weighted_sum'] += sim_score * rating
                    item_scores[item_idx]['sim_sum'] += sim_score
        
        # Calculate predicted ratings
        recommendations = []
        for item_idx, scores in item_scores.items():
            if scores['sim_sum'] > 0:
                predicted_rating = scores['weighted_sum'] / scores['sim_sum']
                recommendations.append({
                    'item_id': self.reverse_item_map[item_idx],
                    'score': round(float(predicted_rating), 3)  # plain float, JSON-friendly
                })
        
        # Sort by predicted rating
        recommendations.sort(key=lambda x: x['score'], reverse=True)
        return recommendations[:n_recommendations]

Usage

# Example usage
import pandas as pd
from collaborative import CollaborativeRecommender  # the class defined above

# Sample data
data = pd.DataFrame({
    'user_id': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5],
    'item_id': ['A','B','C','A','B','D','B','C','E','A','D','E','C','D','E'],
    'rating':  [5, 4, 3, 4, 5, 4, 5, 4, 3, 3, 5, 4, 4, 3, 5],
})

rec = CollaborativeRecommender(n_neighbors=3)
rec.fit(data)

# Get recommendations for user 1
recs = rec.recommend(user_id=1, n_recommendations=5)
print(recs)
# Items D and E, with predicted ratings of about 4.29 and 3.32
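
The prediction is a similarity-weighted average of the neighbors' ratings. As a rough check, the sketch below recomputes it with plain NumPy for user 1 on the same sample data. The dense vectors and the helper name cosine are illustrative and not part of the class above.

```python
import numpy as np

# Dense rating vectors in item order A, B, C, D, E (0 = not rated)
items = ['A', 'B', 'C', 'D', 'E']
ratings = {
    1: np.array([5, 4, 3, 0, 0], dtype=float),
    2: np.array([4, 5, 0, 4, 0], dtype=float),
    3: np.array([0, 5, 4, 0, 3], dtype=float),
    4: np.array([3, 0, 0, 5, 4], dtype=float),
    5: np.array([0, 0, 4, 3, 5], dtype=float),
}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


target = 1
sims = {u: cosine(ratings[target], v) for u, v in ratings.items() if u != target}
neighbors = sorted(sims, key=sims.get, reverse=True)[:3]  # 3 nearest users

predicted = {}
for idx, item in enumerate(items):
    if ratings[target][idx] > 0:
        continue  # skip items the target user already rated
    raters = [u for u in neighbors if ratings[u][idx] > 0]
    if not raters:
        continue
    weighted_sum = sum(sims[u] * ratings[u][idx] for u in raters)
    sim_sum = sum(sims[u] for u in raters)
    predicted[item] = round(float(weighted_sum / sim_sum), 3)

print(neighbors)  # [2, 3, 4]
print(predicted)  # {'D': 4.286, 'E': 3.319}
```

Item D scores higher because user 2, the closest neighbor, rated it, while E is only backed by the less similar users 3 and 4.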

Approach 2: Content-Based Filtering

# content_based.py
"""Content-based filtering using item features."""

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class ContentBasedRecommender:
    """Recommend items based on content similarity."""
    
    def __init__(self):
        self.tfidf = TfidfVectorizer(
            stop_words='english',
            max_features=5000,
            ngram_range=(1, 2)
        )
        self.item_vectors = None
        self.item_ids = []
        self.item_data = {}
    
    def fit(self, items: pd.DataFrame, content_column: str = 'description'):
        """
        Fit on item content.
        
        Args:
            items: DataFrame with columns [item_id, description, ...]
            content_column: Column containing text content
        """
        self.item_ids = items['item_id'].tolist()
        self.item_data = items.set_index('item_id').to_dict('index')
        
        # Build TF-IDF vectors from content
        self.item_vectors = self.tfidf.fit_transform(items[content_column])
        
        return self
    
    def get_similar_items(self, item_id, n: int = 10) -> list[dict]:
        """Find items similar to a given item."""
        if item_id not in self.item_ids:
            return []
        
        idx = self.item_ids.index(item_id)
        item_vector = self.item_vectors[idx]
        
        # Compute similarity with all items
        similarities = cosine_similarity(item_vector, self.item_vectors).flatten()
        
        # Get top N similar items (excluding itself)
        similar_indices = similarities.argsort()[::-1][1:n+1]
        
        results = []
        for i in similar_indices:
            results.append({
                'item_id': self.item_ids[i],
                'similarity': round(float(similarities[i]), 3),
            })
        
        return results
    
    def recommend_for_user(
        self, 
        liked_item_ids: list, 
        n_recommendations: int = 10
    ) -> list[dict]:
        """
        Recommend items based on a user's liked items.
        
        Args:
            liked_item_ids: List of item_ids the user has liked
            n_recommendations: Number of recommendations
        """
        if not liked_item_ids:
            return []
        
        # Build user profile as average of liked item vectors
        liked_indices = [
            self.item_ids.index(iid) 
            for iid in liked_item_ids 
            if iid in self.item_ids
        ]
        
        if not liked_indices:
            return []
        
        user_profile = self.item_vectors[liked_indices].mean(axis=0)
        user_profile = np.asarray(user_profile)
        
        # Compute similarity with all items
        similarities = cosine_similarity(user_profile, self.item_vectors).flatten()
        
        # Exclude already-liked items
        liked_set = set(liked_item_ids)
        
        results = []
        for idx in similarities.argsort()[::-1]:
            item_id = self.item_ids[idx]
            if item_id not in liked_set:
                results.append({
                    'item_id': item_id,
                    'score': round(float(similarities[idx]), 3),
                })
                if len(results) >= n_recommendations:
                    break
        
        return results

Usage

items = pd.DataFrame({
    'item_id': ['movie1', 'movie2', 'movie3', 'movie4', 'movie5'],
    'title': [
        'The Matrix', 'Inception', 'Interstellar', 
        'The Notebook', 'Pride and Prejudice'
    ],
    'description': [
        'sci-fi action hacker virtual reality dystopian future',
        'sci-fi thriller dreams heist mind-bending inception layers',
        'sci-fi space exploration time travel wormhole family',
        'romance love story couple letters passion drama',
        'romance period drama england love class society',
    ]
})

cb = ContentBasedRecommender()
cb.fit(items, content_column='description')

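# Find movies similar to The Matrix
similar = cb.get_similar_items('movie1', n=3)
# The two other sci-fi titles (movie2, movie3) come back first because they
# share the "sci"/"fi" terms; the third result is a romance title with
# similarity 0.0. Exact scores depend on the TF-IDF weighting.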

# Recommend based on liked movies
recs = cb.recommend_for_user(['movie1', 'movie2'], n_recommendations=3)

Approach 3: Embedding-Based Recommendations

# embedding_recommender.py
"""Recommendation using text embeddings for semantic similarity."""

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class EmbeddingRecommender:
    """Recommend items using semantic embeddings."""
    
    def __init__(self, api_key: str = None):
        """Set up an empty store. Embeddings are added pre-computed via add_embeddings(); api_key is currently unused."""
        self.embeddings = {}  # item_id -> embedding vector
        self.item_ids = []
        self.embedding_matrix = None
    
    def add_embeddings(self, item_id: str, embedding: list[float]):
        """Add a pre-computed embedding for an item."""
        self.embeddings[item_id] = np.array(embedding)
    
    def build_index(self):
        """Build the similarity index. Call after adding embeddings and before any query."""
        self.item_ids = list(self.embeddings.keys())
        self.embedding_matrix = np.array(
            [self.embeddings[iid] for iid in self.item_ids]
        )
    
    def get_similar(self, item_id: str, n: int = 10) -> list[dict]:
        """Find semantically similar items."""
        if item_id not in self.embeddings:
            return []
        
        query_embedding = self.embeddings[item_id].reshape(1, -1)
        similarities = cosine_similarity(
            query_embedding, self.embedding_matrix
        ).flatten()
        
        results = []
        for idx in similarities.argsort()[::-1]:
            if self.item_ids[idx] != item_id:
                results.append({
                    'item_id': self.item_ids[idx],
                    'similarity': round(float(similarities[idx]), 4)
                })
                if len(results) >= n:
                    break
        
        return results
    
    def recommend_for_user(
        self, 
        liked_ids: list[str], 
        n: int = 10
    ) -> list[dict]:
        """Recommend based on average embedding of liked items."""
        liked_embeddings = [
            self.embeddings[iid] for iid in liked_ids 
            if iid in self.embeddings
        ]
        
        if not liked_embeddings:
            return []
        
        user_vector = np.mean(liked_embeddings, axis=0).reshape(1, -1)
        similarities = cosine_similarity(
            user_vector, self.embedding_matrix
        ).flatten()
        
        liked_set = set(liked_ids)
        results = []
        for idx in similarities.argsort()[::-1]:
            if self.item_ids[idx] not in liked_set:
                results.append({
                    'item_id': self.item_ids[idx],
                    'score': round(float(similarities[idx]), 4)
                })
                if len(results) >= n:
                    break
        
        return results
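
To make the "average of liked embeddings" idea concrete, here is a small self-contained sketch. The 3-dimensional vectors are invented for illustration; real embeddings would come from whatever embedding model you use and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical 3-d embeddings (made up for this example)
embeddings = {
    'movie1': np.array([0.9, 0.1, 0.0]),
    'movie2': np.array([0.8, 0.2, 0.1]),
    'movie3': np.array([0.7, 0.0, 0.3]),
    'movie4': np.array([0.0, 0.9, 0.1]),
    'movie5': np.array([0.1, 0.8, 0.2]),
}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


liked = ['movie1', 'movie2']

# The user profile is the mean of the liked items' vectors
user_vector = np.mean([embeddings[iid] for iid in liked], axis=0)

# Rank every item the user has not liked yet by similarity to the profile
ranked = sorted(
    ((iid, cosine(user_vector, vec))
     for iid, vec in embeddings.items() if iid not in liked),
    key=lambda pair: pair[1],
    reverse=True,
)

print([iid for iid, _ in ranked])  # ['movie3', 'movie5', 'movie4']
```

movie3 ranks first because its vector points in nearly the same direction as the averaged profile, which is the same logic recommend_for_user applies to the full embedding matrix.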

Combining Into a Hybrid System

# hybrid.py
"""Hybrid recommendation engine combining multiple approaches."""


class HybridRecommender:
    """Combine collaborative, content-based, and embedding recommendations."""
    
    def __init__(
        self, 
        collaborative_weight: float = 0.4,
        content_weight: float = 0.3,
        embedding_weight: float = 0.3
    ):
        self.weights = {
            'collaborative': collaborative_weight,
            'content': content_weight,
            'embedding': embedding_weight
        }
        self.collaborative = None
        self.content_based = None
        self.embedding_based = None
    
    def recommend(
        self, 
        user_id, 
        liked_item_ids: list,
        n_recommendations: int = 10
    ) -> list[dict]:
        """Get hybrid recommendations."""
        
        all_scores = {}  # item_id -> weighted score
        
        # Collaborative filtering
        if self.collaborative:
            collab_recs = self.collaborative.recommend(
                user_id, n_recommendations=n_recommendations * 2
            )
            max_score = max((r['score'] for r in collab_recs), default=1)
            for rec in collab_recs:
                normalized = rec['score'] / max_score if max_score > 0 else 0
                item_id = rec['item_id']
                all_scores[item_id] = all_scores.get(item_id, 0) + (
                    normalized * self.weights['collaborative']
                )
        
        # Content-based filtering
        if self.content_based and liked_item_ids:
            content_recs = self.content_based.recommend_for_user(
                liked_item_ids, n_recommendations=n_recommendations * 2
            )
            max_score = max((r['score'] for r in content_recs), default=1)
            for rec in content_recs:
                normalized = rec['score'] / max_score if max_score > 0 else 0
                item_id = rec['item_id']
                all_scores[item_id] = all_scores.get(item_id, 0) + (
                    normalized * self.weights['content']
                )
        
        # Embedding-based
        if self.embedding_based and liked_item_ids:
            # EmbeddingRecommender.recommend_for_user takes the count as `n`
            embed_recs = self.embedding_based.recommend_for_user(
                liked_item_ids, n=n_recommendations * 2
            )
            max_score = max((r['score'] for r in embed_recs), default=1)
            for rec in embed_recs:
                normalized = rec['score'] / max_score if max_score > 0 else 0
                item_id = rec['item_id']
                all_scores[item_id] = all_scores.get(item_id, 0) + (
                    normalized * self.weights['embedding']
                )
        
        # Sort by combined score
        results = [
            {'item_id': iid, 'score': round(score, 4)}
            for iid, score in sorted(
                all_scores.items(), 
                key=lambda x: x[1], 
                reverse=True
            )
        ]
        
        return results[:n_recommendations]
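
The fusion step can be tried in isolation. The sketch below applies the same max-normalize-then-weight logic to three hand-written result lists; the function name fuse and all scores are hypothetical and only meant to show how agreement between sources lifts an item.

```python
def fuse(rec_lists, weights):
    """Max-normalize each source's scores, weight them, and sum per item."""
    combined = {}
    for source, recs in rec_lists.items():
        if not recs:
            continue
        max_score = max(r['score'] for r in recs)
        for r in recs:
            normalized = r['score'] / max_score if max_score > 0 else 0.0
            combined[r['item_id']] = (
                combined.get(r['item_id'], 0.0) + normalized * weights[source]
            )
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return [(item_id, round(score, 4)) for item_id, score in ranked]


# Made-up outputs from the three recommenders
rec_lists = {
    'collaborative': [{'item_id': 'D', 'score': 4.0}, {'item_id': 'E', 'score': 3.0}],
    'content': [{'item_id': 'E', 'score': 0.5}, {'item_id': 'F', 'score': 0.25}],
    'embedding': [{'item_id': 'E', 'score': 0.9}, {'item_id': 'D', 'score': 0.45}],
}
weights = {'collaborative': 0.4, 'content': 0.3, 'embedding': 0.3}

fused = fuse(rec_lists, weights)
print(fused)  # [('E', 0.9), ('D', 0.55), ('F', 0.15)]
```

E wins even though collaborative filtering preferred D, because all three sources recommend it. Note that dividing by the maximum puts each source on a comparable scale but does not account for how spread out its scores are.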

Serving via API

# api.py
"""Simple Flask API for recommendations."""

from flask import Flask, request, jsonify

app = Flask(__name__)

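# Initialize your recommenders (load models, data, etc.) before serving.
# Everything below is a placeholder: swap in your own fitted instances.
from content_based import ContentBasedRecommender
from hybrid import HybridRecommender

content_based = ContentBasedRecommender()  # call .fit(...) on your catalog
hybrid = HybridRecommender()
hybrid.content_based = content_based


def get_user_liked_items(user_id):
    """Placeholder: return the item_ids this user liked from your data store."""
    return []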

@app.route('/recommend', methods=['GET'])
def recommend():
    user_id = request.args.get('user_id', type=int)
    n = request.args.get('n', default=10, type=int)
    
    if user_id is None:  # `not user_id` would also reject a valid user_id of 0
        return jsonify({'error': 'user_id required'}), 400
    
    recommendations = hybrid.recommend(
        user_id=user_id,
        liked_item_ids=get_user_liked_items(user_id),
        n_recommendations=n
    )
    
    return jsonify({'recommendations': recommendations})

@app.route('/similar', methods=['GET'])
def similar():
    item_id = request.args.get('item_id')
    n = request.args.get('n', default=10, type=int)
    
    if not item_id:
        return jsonify({'error': 'item_id required'}), 400
    
    similar_items = content_based.get_similar_items(item_id, n=n)
    
    return jsonify({'similar_items': similar_items})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

Key Takeaways

Start with collaborative filtering if you have user interaction data (clicks, purchases, ratings). It’s the most powerful approach when you have enough data.

Add content-based filtering for the cold start problem — new items with no interaction data can still be recommended based on their features.
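
As a minimal sketch of that cold-start path: items with zero interactions are invisible to collaborative filtering, but they can still be ranked against what a user liked. For brevity this uses Jaccard overlap on tag sets rather than the TF-IDF vectors used earlier, and every item, tag, and count here is made up.

```python
def jaccard(a, b):
    """Share of tags two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical catalog: two established items and two brand-new ones
tags = {
    'movie1': {'sci-fi', 'action', 'hacker'},
    'movie4': {'romance', 'drama', 'love'},
    'new_a': {'sci-fi', 'action', 'robots'},
    'new_b': {'romance', 'comedy'},
}
interaction_counts = {'movie1': 120, 'movie4': 85, 'new_a': 0, 'new_b': 0}

liked = ['movie1']
liked_tags = set().union(*(tags[iid] for iid in liked))

# Only items nobody has interacted with yet
cold_items = [iid for iid, count in interaction_counts.items() if count == 0]

ranked = sorted(
    ((iid, jaccard(liked_tags, tags[iid])) for iid in cold_items),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # [('new_a', 0.5), ('new_b', 0.0)]
```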

Use embeddings when you need semantic understanding beyond keyword matching. An embedding model understands that “thriller” and “suspense” are related even though they share no words.
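
A quick way to see the gap: keyword matching gives two descriptions with no shared words a similarity of exactly zero, while vectors that encode meaning can still sit close together. The sketch below uses raw word counts as a stand-in for keyword matching, and the 2-d "embeddings" are hand-picked for illustration, not produced by a real model.

```python
import math
from collections import Counter


def keyword_cosine(text_a, text_b):
    """Cosine similarity over raw word counts (a stand-in for keyword matching)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def vector_cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


# Hypothetical embeddings, chosen by hand to be close together
toy_embedding = {
    'gripping thriller': [0.9, 0.4],
    'suspense story': [0.8, 0.5],
}

keyword_score = keyword_cosine('gripping thriller', 'suspense story')
embedding_score = vector_cosine(*toy_embedding.values())

print(keyword_score, round(embedding_score, 2))  # 0.0 0.99
```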

Combine all three for production systems. Weight them according to your data: the more interaction data you have, the higher the collaborative weight; the richer your item metadata, the higher the content weight.
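
One way to act on this is to derive the weights from how much interaction data a user has. The heuristic below is purely illustrative: the function name choose_weights, the threshold of 50 interactions, and the 0.6 cap are assumptions to tune against your own data, not recommended values.

```python
def choose_weights(n_interactions, full_trust_at=50, max_collab=0.6):
    """Ramp the collaborative weight up with interaction count.

    Hypothetical heuristic: the weight grows linearly until
    `full_trust_at` interactions, then stays at `max_collab`.
    The remainder is split evenly between content and embedding.
    """
    collab = max_collab * min(n_interactions / full_trust_at, 1.0)
    rest = 1.0 - collab
    return {
        'collaborative': round(collab, 3),
        'content': round(rest / 2, 3),
        'embedding': round(rest / 2, 3),
    }


print(choose_weights(0))    # {'collaborative': 0.0, 'content': 0.5, 'embedding': 0.5}
print(choose_weights(25))   # {'collaborative': 0.3, 'content': 0.35, 'embedding': 0.35}
print(choose_weights(200))  # {'collaborative': 0.6, 'content': 0.2, 'embedding': 0.2}
```

The resulting dict has the same keys as the weights used by HybridRecommender, so a new user leans entirely on content and embeddings while a heavy user leans on collaborative filtering.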

The difference between a recommendation engine that feels magical and one that feels random is rarely algorithmic sophistication; it is usually data quality. Clean, abundant interaction data tends to beat a clever algorithm running on sparse data. Focus on collecting and cleaning your data before optimizing your models.
