Beginner

How Embeddings Work

Understand the mechanics behind embedding models — from transformer architecture and training objectives to dimensionality, cosine similarity, and visualization techniques.

The Transformer Architecture for Embeddings

Modern embedding models are built on the transformer architecture (the same technology behind GPT and BERT). Here is how the process works at a high level:

  1. Tokenization

    The input text is split into tokens (subwords). Depending on the tokenizer, "Embeddings are amazing" might become ["Embed", "dings", "are", "amazing"]. Each token gets an initial vector from a learned vocabulary.

  2. Self-Attention Processing

    The transformer processes all tokens simultaneously through multiple self-attention layers. Each token's representation is updated based on its relationship to every other token in the input. This is how context is captured.

  3. Pooling

    The transformer outputs a vector for each token. To get a single vector for the whole input, a pooling strategy combines these per-token vectors. Common strategies: mean pooling (average all token vectors) or CLS token (use the first token's vector).

  4. Normalization

    The final vector is often L2-normalized to unit length. For unit vectors, cosine similarity equals the dot product, which simplifies distance calculations.
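
The pooling and normalization steps above can be sketched in plain NumPy. The per-token vectors here are random stand-ins for real transformer outputs; only the mechanics matter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for transformer output: 4 tokens, 8 dimensions each
token_vectors = rng.normal(size=(4, 8))

# Mean pooling: average the per-token vectors into one sentence vector
sentence_vec = token_vectors.mean(axis=0)

# L2 normalization: scale the vector to unit length
unit_vec = sentence_vec / np.linalg.norm(sentence_vec)

# For unit vectors, cosine similarity reduces to a dot product
other = rng.normal(size=8)
other_unit = other / np.linalg.norm(other)

cosine = np.dot(sentence_vec, other) / (
    np.linalg.norm(sentence_vec) * np.linalg.norm(other)
)
print(np.isclose(cosine, np.dot(unit_vec, other_unit)))  # True
```

This is why normalized embeddings are convenient: comparing two vectors becomes a single dot product.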

Training Objectives

Embedding models learn by training on tasks that force them to understand meaning:

Contrastive Learning

The model learns to make similar pairs close and dissimilar pairs far apart in vector space. Given an anchor text, a positive example (semantically similar), and negative examples (unrelated), the model is trained to minimize distance to positives and maximize distance to negatives.

Contrastive Learning Concept
# Training objective: bring similar pairs together,
# push dissimilar pairs apart

anchor   = "How to train a neural network"
positive = "Steps for building a deep learning model"  # Similar
negative = "Best Italian restaurants in New York"       # Unrelated

# After training:
# similarity(embed(anchor), embed(positive)) → HIGH (e.g., 0.89)
# similarity(embed(anchor), embed(negative)) → LOW  (e.g., 0.12)
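
A minimal, runnable sketch of that objective, using a triplet margin loss over cosine similarities. The toy vectors and margin value are illustrative, not taken from a real model:

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on cosine similarities: penalize the model
    unless the positive is at least `margin` more similar than the negative."""
    a, p, n = map(l2_normalize, (anchor, positive, negative))
    sim_pos = np.dot(a, p)
    sim_neg = np.dot(a, n)
    return max(0.0, margin - (sim_pos - sim_neg))

# Toy embeddings standing in for embed(anchor), embed(positive), embed(negative)
anchor   = np.array([1.0, 0.9, 0.1])
positive = np.array([0.9, 1.0, 0.2])
negative = np.array([0.1, 0.2, 1.0])

print(triplet_loss(anchor, positive, negative))  # 0.0 — already well separated
print(triplet_loss(anchor, negative, positive))  # > 0 — training would push these apart
```

During training, this loss is minimized by gradient descent over millions of such triplets, which is what shapes the geometry of the embedding space.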

Masked Language Modeling (BERT-style)

Randomly mask tokens in the input and train the model to predict them. This forces the model to learn deep contextual understanding. The resulting internal representations are used as embeddings.
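
The corruption step alone can be sketched as follows (real MLM training pairs this with a network that predicts the masked originals; the token list and mask rate here are just illustrative):

```python
import random

random.seed(7)

tokens = ["the", "model", "learns", "deep", "context", "from", "raw", "text"]
MASK = "[MASK]"
MASK_PROB = 0.15  # BERT masks roughly 15% of tokens

# Replace some tokens with [MASK]; the originals become prediction targets
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < MASK_PROB:
        masked.append(MASK)
        targets[i] = tok
    else:
        masked.append(tok)

print(masked)
print(targets)
```

To fill each blank correctly, the model must use the surrounding tokens, which is precisely what builds contextual representations.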

Dimensionality

Embedding dimension is the number of values in the output vector. Common dimensions include:

Dimensions | Example Models                | Use Case
384        | all-MiniLM-L6-v2              | Lightweight, fast, good for prototyping
768        | all-mpnet-base-v2, BGE-base   | Good balance of quality and speed
1024       | E5-large, BGE-large           | High quality, moderate cost
1536       | OpenAI text-embedding-3-small | Production standard for most applications
3072       | OpenAI text-embedding-3-large | Maximum quality, higher cost and storage
💡
Higher dimensions are not always better. More dimensions capture more nuance but require more storage, memory, and compute. For many applications, 768 or 1536 dimensions are sufficient. Some models support Matryoshka embeddings where you can truncate to lower dimensions with minimal quality loss.
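
For Matryoshka-trained models, truncation is just slicing and re-normalizing, sketched below with a random vector (ordinary embeddings are not trained for this and lose quality if truncated the same way):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and re-normalize to unit length."""
    short = np.asarray(vec)[:dim]
    return short / np.linalg.norm(short)

rng = np.random.default_rng(42)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)  # unit-length 1536-dim embedding

short = truncate_embedding(full, 256)
print(short.shape)            # (256,)
print(np.linalg.norm(short))  # 1.0 — still unit length
```

A 6x reduction in dimensions means 6x less storage and faster similarity search, which is why this trade-off is attractive at scale.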

Cosine Similarity Explained

Cosine similarity measures the angle between two vectors, ignoring their magnitude. It ranges from -1 (opposite) to 1 (identical direction).

Python - Cosine Similarity
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example with simple vectors
cat  = np.array([0.9, 0.1, 0.8, 0.2])
dog  = np.array([0.85, 0.15, 0.75, 0.25])
car  = np.array([0.1, 0.9, 0.15, 0.85])

print(f"cat vs dog: {cosine_similarity(cat, dog):.4f}")  # ~0.99 (very similar)
print(f"cat vs car: {cosine_similarity(cat, car):.4f}")  # ~0.31 (dissimilar)

# Rough interpretation (exact thresholds vary by model):
# > 0.8   → Very similar
# 0.5-0.8 → Somewhat related
# < 0.3   → Unrelated

Visualizing Embeddings

Embeddings live in high-dimensional space (often hundreds to thousands of dimensions). To visualize them, we use dimensionality reduction techniques that project vectors down to 2D or 3D while preserving relative distances.

t-SNE (t-distributed Stochastic Neighbor Embedding)

Best for revealing clusters and local structure. Points that are close in high dimensions stay close in the visualization.

UMAP (Uniform Manifold Approximation and Projection)

Better at preserving both local and global structure. Faster than t-SNE and often preferred for large datasets.

Python - Visualize Embeddings with UMAP
import umap
import matplotlib.pyplot as plt
import numpy as np

# Assume `embeddings` is a numpy array of shape (n_samples, 1536)
# and `labels` is a list of category names

# Reduce to 2D with UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
embeddings_2d = reducer.fit_transform(embeddings)

# Plot
plt.figure(figsize=(12, 8))
for label in set(labels):
    mask = [l == label for l in labels]
    plt.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        label=label, alpha=0.7, s=50
    )
plt.legend()
plt.title("Document Embeddings Visualized with UMAP")
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.show()
Visualization tip: Use UMAP for initial exploration and t-SNE for detailed cluster analysis. Both are stochastic unless you fix a random seed — run multiple times and look for consistent patterns rather than trusting a single visualization.

💡 Try It Yourself

Install sentence-transformers and compute embeddings for 10 sentences about different topics (e.g., 5 about cooking, 5 about programming). Compute pairwise cosine similarities and verify that within-topic pairs score higher.

You should see within-topic pairs scoring clearly higher than cross-topic pairs — often above 0.7 versus below 0.4, though exact values depend on the model.