Intermediate

Multi-Model Recommendation & Personalization

Modern recommendation systems combine multiple model types — embedding models for retrieval, ranking models for relevance, and LLMs for explanation and conversational recommendations — to deliver highly personalized user experiences at scale.

Modern Recommendation Architecture

Today's recommendation systems are multi-stage pipelines. Each stage uses a different model optimized for its specific task: fast retrieval from millions of candidates, precise ranking of hundreds, and rich explanation of the final results.

Architecture Overview
# Multi-Stage Recommendation Pipeline
User Data (behavior, profile, context)
  → Embedding Model (user & item embeddings)
    → Candidate Retrieval (ANN search, ~1000 candidates from millions)
      → Ranking Model (cross-encoder, pointwise/pairwise scoring)
        → Business Rules (filtering, diversity, freshness)
          → LLM (explanation generation, personalized descriptions)
            → Final Recommendations (top 10–50 with explanations)

# Models Used at Each Stage:
# 1. Embedding: Two-Tower model, sentence-transformers, OpenAI ada-002
# 2. Retrieval: FAISS, Pinecone, Weaviate (ANN index)
# 3. Ranking: XGBoost, LightGBM, cross-encoder transformer
# 4. Explanation: Claude, GPT-4, Llama (LLM)

Combining Multiple Model Types

The power of modern recommenders comes from combining different signal types, each captured by a different model:

  • Collaborative filtering: Learns from user-item interaction patterns (users who bought X also bought Y). Implemented as matrix factorization or neural collaborative filtering.
  • Content-based embeddings: Encode item features (text descriptions, images, categories) into dense vectors. Uses sentence-transformers, CLIP, or domain-specific models.
  • Behavioral signals: Click sequences, dwell time, purchase history encoded as sequential embeddings (transformers, GRU4Rec).
  • Contextual features: Time of day, device, location, session intent — fed as features to the ranking model.
  • LLM-generated metadata: Use LLMs to enrich item descriptions, extract tags, generate summaries that improve content-based matching.
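Once each signal produces a per-item score, the simplest way to combine them is late fusion: a weighted sum of the normalized scores. The sketch below illustrates the idea; the signal names and weights are illustrative, not prescriptive — in practice the weights are tuned offline or learned by the ranking model.

```python
def fuse_scores(cf_score: float, content_score: float, recency_score: float,
                weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Blend per-signal scores (each already normalized to [0, 1])
    into a single ranking score via a weighted sum (late fusion)."""
    w_cf, w_content, w_recency = weights
    return w_cf * cf_score + w_content * content_score + w_recency * recency_score

# Rank a few hypothetical candidates by the fused score
candidates = {
    "prod_1": fuse_scores(0.9, 0.4, 0.2),  # strong collaborative signal
    "prod_2": fuse_scores(0.3, 0.8, 0.9),  # fresh, content-similar item
    "prod_3": fuse_scores(0.6, 0.6, 0.5),  # balanced across signals
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

A learned ranker (next section) generalizes this: instead of hand-tuned weights, the model learns how to weigh signals per user and context.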

Two-Tower Retrieval + Cross-Encoder Reranking

The two-tower architecture is the industry standard for large-scale retrieval. A user tower and an item tower independently produce embeddings, enabling precomputation of item embeddings and fast approximate nearest neighbor (ANN) search at query time.

Python - Two-Tower Retrieval Model
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Two-tower retrieval model for candidate generation."""

    def __init__(self, user_features_dim, item_features_dim, embedding_dim=128):
        super().__init__()

        # User tower: maps user features to embedding space.
        # L2 normalization happens in encode_user — a bare function like
        # nn.functional.normalize is not a Module and can't go in nn.Sequential.
        self.user_tower = nn.Sequential(
            nn.Linear(user_features_dim, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )

        # Item tower: maps item features to same embedding space
        self.item_tower = nn.Sequential(
            nn.Linear(item_features_dim, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, embedding_dim),
        )

    def encode_user(self, user_features):
        emb = self.user_tower(user_features)
        return nn.functional.normalize(emb, dim=-1)

    def encode_item(self, item_features):
        emb = self.item_tower(item_features)
        return nn.functional.normalize(emb, dim=-1)

    def forward(self, user_features, item_features):
        user_emb = self.encode_user(user_features)
        item_emb = self.encode_item(item_features)
        # Cosine similarity as relevance score
        return torch.sum(user_emb * item_emb, dim=-1)

# Training with contrastive loss
model = TwoTowerModel(user_features_dim=64, item_features_dim=128)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# In-batch negatives: positives on diagonal, negatives off-diagonal
def contrastive_loss(user_embs, item_embs, temperature=0.07):
    similarity = torch.matmul(user_embs, item_embs.T) / temperature
    labels = torch.arange(similarity.shape[0], device=similarity.device)
    return nn.CrossEntropyLoss()(similarity, labels)
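The payoff of the two-tower design is that item embeddings can be precomputed offline and served from an ANN index. The sketch below shows the serving-side logic with plain NumPy and exact top-k search (an ANN library like FAISS approximates the same `argsort`); the corpus size, dimensions, and random embeddings are stand-ins for real `model.encode_item` output.

```python
import numpy as np

# Precompute L2-normalized item embeddings once (in production these come
# from model.encode_item and are loaded into an ANN index such as FAISS)
rng = np.random.default_rng(0)
item_embs = rng.normal(size=(1000, 128))
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)

def retrieve(user_emb: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the top-k items by cosine similarity.
    With L2-normalized vectors, the dot product IS the cosine similarity."""
    user_emb = user_emb / np.linalg.norm(user_emb)
    scores = item_embs @ user_emb   # one dot product per item
    return np.argsort(-scores)[:k]  # exact top-k; ANN search approximates this

top = retrieve(rng.normal(size=128))
```

At query time only the user tower runs; retrieval is a matrix-vector product (or an ANN lookup), which is what makes scanning millions of items feasible.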

LLM-Enhanced Recommendations

LLMs add a powerful new dimension to recommendation systems: they can explain why an item was recommended, generate personalized descriptions, and enable conversational recommendation experiences.

Generating Recommendation Explanations

Python - Product Recommender with LLM Explanations
import anthropic
import numpy as np
from sentence_transformers import SentenceTransformer

class SmartRecommender:
    def __init__(self):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.llm = anthropic.Anthropic()
        self.product_db = {}  # product_id -> {name, description, category, ...}
        self.product_embeddings = {}  # product_id -> np.array

    def index_products(self, products: list[dict]):
        """Encode all products into the embedding space."""
        for product in products:
            pid = product["id"]
            self.product_db[pid] = product
            # Combine name + description + category for rich embedding
            text = f"{product['name']}. {product['description']}. Category: {product['category']}"
            self.product_embeddings[pid] = self.encoder.encode(text)

    def get_user_embedding(self, user_history: list[str]) -> np.ndarray:
        """Create user embedding from their interaction history."""
        # Average embeddings of interacted products, weighted by recency.
        # Assumes user_history is ordered oldest -> newest.
        embeddings = []
        weights = []
        for i, pid in enumerate(user_history):
            if pid in self.product_embeddings:
                embeddings.append(self.product_embeddings[pid])
                weights.append(1.0 + i * 0.1)  # later (more recent) items weigh more

        if not embeddings:
            raise ValueError("No known products in user history")
        weights = np.array(weights) / sum(weights)
        return np.average(embeddings, axis=0, weights=weights)

    def retrieve_candidates(self, user_embedding: np.ndarray, k: int = 20) -> list:
        """Retrieve top-k similar products using cosine similarity."""
        scores = {}
        for pid, emb in self.product_embeddings.items():
            similarity = np.dot(user_embedding, emb) / (
                np.linalg.norm(user_embedding) * np.linalg.norm(emb)
            )
            scores[pid] = float(similarity)

        top_pids = sorted(scores, key=scores.get, reverse=True)[:k]
        return [(pid, scores[pid]) for pid in top_pids]

    def generate_explanations(self, user_history: list[str],
                               recommendations: list[tuple]) -> list[dict]:
        """Use LLM to generate personalized explanations for each recommendation."""
        # Build context about user's history
        history_items = [self.product_db[pid]["name"]
                        for pid in user_history if pid in self.product_db]
        rec_items = [
            f"- {self.product_db[pid]['name']}: {self.product_db[pid]['description']}"
            for pid, score in recommendations[:5]
        ]

        response = self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"""A user has previously interacted with these products:
{', '.join(history_items[-5:])}

We are recommending these products:
{chr(10).join(rec_items)}

For each recommended product, write a brief 1-sentence personalized explanation
of why this user would like it, based on their history. Format as JSON array:
[{{"product": "name", "explanation": "Because you liked X, ..."}}]"""
            }]
        )

        import json
        explanations = json.loads(response.content[0].text)

        results = []
        for i, (pid, score) in enumerate(recommendations[:5]):
            product = self.product_db[pid]
            results.append({
                "product": product,
                "score": score,
                "explanation": explanations[i]["explanation"] if i < len(explanations) else ""
            })
        return results

    def recommend(self, user_history: list[str]) -> list[dict]:
        """Full recommendation pipeline: embed → retrieve → explain."""
        user_emb = self.get_user_embedding(user_history)
        candidates = self.retrieve_candidates(user_emb)
        # Filter out already-seen products
        candidates = [(pid, s) for pid, s in candidates if pid not in user_history]
        return self.generate_explanations(user_history, candidates)

# Usage
recommender = SmartRecommender()
recommender.index_products(product_catalog)
results = recommender.recommend(user_history=["prod_1", "prod_42", "prod_7"])
for r in results:
    print(f"{r['product']['name']} (score: {r['score']:.3f})")
    print(f"  {r['explanation']}")

Personalized Search with Embeddings + LLM Reranking

Personalized search combines the user's query embedding with their profile embedding and uses an LLM to rerank results for maximum relevance and personalization.

Python - Personalized Search with LLM Reranking
import anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

class PersonalizedSearch:
    def __init__(self):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        self.llm = anthropic.Anthropic()

    def search(self, query: str, user_profile: dict, items: list[dict],
                top_k: int = 10) -> list[dict]:
        """Search with personalized reranking."""
        # Step 1: Semantic search with query embedding
        query_emb = self.encoder.encode(query)
        item_scores = []
        for item in items:
            # Encoding per query keeps the example simple; in production,
            # item embeddings would be precomputed and indexed
            item_emb = self.encoder.encode(item["description"])
            score = float(np.dot(query_emb, item_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(item_emb)
            ))
            item_scores.append((item, score))

        # Step 2: Get top candidates from embedding search
        candidates = sorted(item_scores, key=lambda x: x[1], reverse=True)[:top_k * 2]

        # Step 3: LLM reranking with user profile context
        items_text = "\n".join(
            f"[{i}] {item['name']}: {item['description'][:100]}"
            for i, (item, _) in enumerate(candidates)
        )

        response = self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Rerank these search results for the query "{query}"
considering this user profile:
- Preferences: {user_profile.get('preferences', 'none')}
- Past purchases: {user_profile.get('past_categories', 'none')}
- Budget: {user_profile.get('budget', 'any')}

Items:
{items_text}

Return the indices in order of best match as JSON: [0, 3, 1, ...]"""
            }]
        )

        import json
        reranked_indices = json.loads(response.content[0].text)
        return [candidates[i][0] for i in reranked_indices[:top_k]]
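The reranker above trusts the LLM to return clean JSON, but models sometimes wrap output in markdown fences or emit invalid indices. One defensive pattern — a hypothetical helper, not part of any library — is to extract the array, drop bad indices, and fall back to the original order:

```python
import json
import re

def parse_rerank_indices(llm_text: str, n_candidates: int) -> list[int]:
    """Extract a JSON array of indices from LLM output, tolerating markdown
    code fences, duplicates, and out-of-range values."""
    # Find the first bracketed list of integers anywhere in the text
    match = re.search(r"\[[\d,\s]*\]", llm_text)
    if not match:
        return list(range(n_candidates))  # fall back to embedding order
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return list(range(n_candidates))
    # Keep first occurrence of each valid index, then append any the LLM dropped
    seen = [i for i in dict.fromkeys(raw)
            if isinstance(i, int) and 0 <= i < n_candidates]
    seen += [i for i in range(n_candidates) if i not in seen]
    return seen
```

The fallback matters: a malformed LLM response should degrade to the embedding-ranked order, never to an exception in the serving path.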

Recommendation Architecture by Scale

  • Startup (< 10K items, < 100K users): Retrieval — simple embedding similarity (NumPy); Ranking — LLM reranking or rule-based; Infrastructure — single server, SQLite/Postgres
  • Growth (10K–1M items, 100K–10M users): Retrieval — FAISS or Pinecone (ANN); Ranking — XGBoost/LightGBM ranker; Infrastructure — managed vector DB, Redis cache
  • Enterprise (1M+ items, 10M+ users): Retrieval — two-tower model + distributed ANN; Ranking — deep cross-encoder + multi-objective; Infrastructure — Kubernetes, feature store, A/B platform
  • Hyperscale (100M+ items, 1B+ users): Retrieval — multi-stage (hash + ANN); Ranking — mixture of experts, real-time training; Infrastructure — custom infra, thousands of GPUs

Real-Time Feature Computation

Production recommendation systems need features computed in real time — a user's last click, trending items in the past hour, or current session intent. This requires a feature store architecture:

Python - Real-Time Feature Store Pattern
import redis
import json
from datetime import datetime, timedelta

class RealtimeFeatureStore:
    """Compute and serve real-time features for recommendations."""

    def __init__(self):
        self.redis = redis.Redis(host="localhost", port=6379, db=0)

    def record_interaction(self, user_id: str, item_id: str, event_type: str):
        """Record a user interaction for real-time feature updates."""
        timestamp = datetime.now().isoformat()
        event = json.dumps({
            "item_id": item_id,
            "event": event_type,
            "timestamp": timestamp
        })

        # User's recent interactions (sliding window)
        self.redis.lpush(f"user:{user_id}:recent", event)
        self.redis.ltrim(f"user:{user_id}:recent", 0, 99)

        # Item popularity counter (hourly bucket)
        hour_key = datetime.now().strftime("%Y%m%d%H")
        self.redis.hincrby(f"trending:{hour_key}", item_id, 1)
        self.redis.expire(f"trending:{hour_key}", 86400)

    def get_user_features(self, user_id: str) -> dict:
        """Get real-time user features for recommendation scoring."""
        recent = self.redis.lrange(f"user:{user_id}:recent", 0, 9)
        recent_items = [json.loads(e)["item_id"] for e in recent]

        return {
            "recent_items": recent_items,
            "session_length": len(recent),
            "last_event_type": json.loads(recent[0])["event"] if recent else None
        }

    def get_trending_items(self, hours: int = 24, top_k: int = 50) -> list:
        """Get trending items over the past N hours."""
        counts = {}
        now = datetime.now()
        for h in range(hours):
            hour_key = (now - timedelta(hours=h)).strftime("%Y%m%d%H")
            hour_counts = self.redis.hgetall(f"trending:{hour_key}")
            for item_id, count in hour_counts.items():
                item_id = item_id.decode()
                counts[item_id] = counts.get(item_id, 0) + int(count)

        return sorted(counts.items(), key=lambda x: x[1], reverse=True)[:top_k]

A/B Testing and Online Evaluation

Recommendation systems require rigorous A/B testing because offline metrics (precision, recall, NDCG) often do not correlate perfectly with online business metrics (revenue, engagement, retention).

  • Offline metrics: Precision@K, Recall@K, NDCG, Mean Reciprocal Rank (MRR), catalog coverage, diversity
  • Online metrics: Click-through rate (CTR), conversion rate, revenue per session, time on site, return visits
  • Interleaving experiments: Mix results from two models in a single list and measure which model's items get more clicks. Requires fewer users than traditional A/B tests.
  • Multi-armed bandits: Dynamically allocate more traffic to better-performing models using Thompson Sampling or UCB. Reduces regret during experiments.
  • Long-term effects: Measure user retention and lifetime value, not just immediate clicks. Some models optimize short-term engagement at the cost of long-term satisfaction.
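The bandit idea above can be sketched in a few lines. This toy simulation — with made-up model names and click-through rates — shows Thompson Sampling shifting traffic toward the better model as evidence accumulates: each round, sample a CTR estimate from each model's Beta posterior and serve the winner.

```python
import random

def thompson_pick(stats: dict) -> str:
    """Pick a model via Thompson Sampling: sample a CTR estimate from each
    model's Beta(successes + 1, failures + 1) posterior, serve the argmax."""
    samples = {
        model: random.betavariate(succ + 1, fail + 1)
        for model, (succ, fail) in stats.items()
    }
    return max(samples, key=samples.get)

# Simulate: model_b has a higher true CTR, so it should attract most traffic
random.seed(42)
true_ctr = {"model_a": 0.05, "model_b": 0.08}
stats = {m: (0, 0) for m in true_ctr}   # (clicks, non-clicks) per model
served = {m: 0 for m in true_ctr}

for _ in range(5000):
    m = thompson_pick(stats)
    served[m] += 1
    clicked = random.random() < true_ctr[m]          # simulated user feedback
    succ, fail = stats[m]
    stats[m] = (succ + clicked, fail + (not clicked))
```

Unlike a fixed 50/50 A/B split, the allocation adapts: the losing model keeps receiving just enough traffic to stay measurable, which is exactly the regret reduction the bullet describes.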

Cold Start Solutions with LLM Content Understanding

The cold start problem — making recommendations for new users or new items with no interaction history — is where LLMs provide a significant advantage:

Python - LLM Cold Start Handler
import anthropic

def cold_start_item_enrichment(item: dict) -> dict:
    """Use LLM to generate rich features for new items with no interaction data."""
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Analyze this product for a recommendation system.
Product: {item['name']}
Description: {item['description']}
Category: {item['category']}
Price: ${item.get('price', 'N/A')}

Generate a JSON object with:
- "target_audience": ["audience segment 1", ...],
- "use_cases": ["use case 1", ...],
- "similar_to": ["comparable product types"],
- "keywords": ["semantic keywords for matching"],
- "appeal_factors": ["what makes this appealing"],
- "complementary_categories": ["categories often bought together"]"""
        }]
    )

    import json
    enrichment = json.loads(response.content[0].text)
    item["llm_features"] = enrichment
    return item

def cold_start_user_onboarding(user_preferences: str) -> list[str]:
    """Convert new user's stated preferences into recommendation signals."""
    client = anthropic.Anthropic()

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""A new user described their preferences as:
"{user_preferences}"

Generate a JSON array of 10 search queries that would match products
this user would likely enjoy. Be specific and diverse.
Example: ["wireless noise-canceling headphones under $200", ...]"""
        }]
    )

    import json
    return json.loads(response.content[0].text)
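One simple way to put the LLM-extracted features to work is keyword overlap: score a brand-new item against a user's interest keywords with Jaccard similarity as a fallback signal until real interactions arrive. The helper below is a minimal sketch of that idea, not a library function.

```python
def keyword_overlap_score(item_keywords: set, user_keywords: set) -> float:
    """Jaccard similarity between an item's LLM-extracted keywords and a
    user's interest keywords — a cold-start fallback relevance signal."""
    if not item_keywords or not user_keywords:
        return 0.0
    intersection = len(item_keywords & user_keywords)
    union = len(item_keywords | user_keywords)
    return intersection / union

# A cold item enriched with llm_features scores against onboarding keywords
score = keyword_overlap_score(
    {"wireless", "noise-canceling", "headphones", "travel"},
    {"travel", "headphones", "podcast"},
)
```

As interactions accumulate, this signal is phased out in favor of collaborative and behavioral scores; embedding the keywords with the same sentence-transformer used for the catalog is a natural next step.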

Use Cases

  • E-commerce: Product recommendations with “because you bought X” explanations, personalized homepage, complementary items at checkout
  • Streaming (Netflix, Spotify): Content recommendations using multi-signal ranking — viewing history, explicit ratings, time-of-day patterns, and social graphs
  • News and content feeds: Personalized article ranking balancing relevance, recency, diversity, and avoiding filter bubbles
  • Education (course recommendation): Suggest next courses based on skill gaps, learning pace, career goals, and peer learning paths
  • Job matching: Match candidates to jobs using resume embeddings, skill extraction, experience matching, and culture fit scoring
💡 Architecture tip: Start simple. A content-based recommender with sentence-transformer embeddings and cosine similarity can outperform complex collaborative filtering systems when you have limited interaction data. Add collaborative signals and multi-stage ranking only when you have sufficient user-item interactions (typically 100K+ events).
Bias and fairness: Recommendation systems can amplify existing biases in user data — popular items get recommended more, reinforcing their popularity (the “rich get richer” effect). Monitor for diversity, ensure coverage of your catalog, and consider fairness-aware ranking algorithms to give newer or niche items fair exposure.