How Embeddings Work
Understand the mechanics behind embedding models — from transformer architecture and training objectives to dimensionality, cosine similarity, and visualization techniques.
The Transformer Architecture for Embeddings
Modern embedding models are built on the transformer architecture (the same technology behind GPT and BERT). Here is how the process works at a high level:
1. **Tokenization.** The input text is split into tokens (subwords). "Embeddings are amazing" might become ["Embed", "dings", "are", "amazing"], depending on the tokenizer's vocabulary. Each token gets an initial vector from a learned embedding table.
2. **Self-Attention Processing.** The transformer processes all tokens simultaneously through multiple self-attention layers. Each token's representation is updated based on its relationship to every other token in the input. This is how context is captured.
3. **Pooling.** The transformer outputs a vector for each token. To get a single vector for the whole input, a pooling strategy combines these per-token vectors. Common strategies are mean pooling (average all token vectors) and CLS pooling (use the first token's vector).
4. **Normalization.** The final vector is often L2-normalized to unit length, which makes cosine similarity equal the dot product and simplifies distance calculations.
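The last two steps can be sketched in a few lines of numpy. The per-token matrix below is made-up data standing in for real transformer outputs:

```python
import numpy as np

# Hypothetical per-token transformer outputs: 4 tokens, 3 dimensions
# (made-up numbers; a real model produces these)
token_vectors = np.array([
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.3],
    [0.1, 0.9, 0.2],
    [0.3, 0.6, 0.4],
])

# Mean pooling: average the per-token vectors into one sentence vector
sentence_vec = token_vectors.mean(axis=0)

# L2 normalization: rescale to unit length
sentence_vec = sentence_vec / np.linalg.norm(sentence_vec)

print(np.linalg.norm(sentence_vec))  # 1.0 (unit length)
```

Because the result is unit-length, comparing two such embeddings with a plain dot product yields their cosine similarity directly.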
Training Objectives
Embedding models learn by training on tasks that force them to understand meaning:
Contrastive Learning
The model learns to make similar pairs close and dissimilar pairs far apart in vector space. Given an anchor text, a positive example (semantically similar), and negative examples (unrelated), the model is trained to minimize distance to positives and maximize distance to negatives.
```python
# Training objective: bring similar pairs together,
# push dissimilar pairs apart
anchor = "How to train a neural network"
positive = "Steps for building a deep learning model"  # Similar
negative = "Best Italian restaurants in New York"      # Unrelated

# After training:
# similarity(embed(anchor), embed(positive)) → HIGH (e.g., 0.89)
# similarity(embed(anchor), embed(negative)) → LOW  (e.g., 0.12)
```
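One common way to make this objective concrete is a triplet loss with a margin. A minimal numpy sketch, where the three vectors are stand-ins for model outputs rather than real embeddings:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-ins for embed(anchor), embed(positive), embed(negative)
anchor   = np.array([0.9, 0.2, 0.8])
positive = np.array([0.8, 0.3, 0.7])
negative = np.array([0.1, 0.9, 0.2])

# Triplet loss: penalize unless the positive is at least
# `margin` more similar to the anchor than the negative is
margin = 0.5
loss = max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)
print(loss)  # 0.0: this triplet already satisfies the margin
```

During training, gradients from nonzero losses push positives toward the anchor and negatives away, shaping the vector space described above.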
Masked Language Modeling (BERT-style)
Randomly mask tokens in the input and train the model to predict them. This forces the model to learn deep contextual understanding. The resulting internal representations are used as embeddings.
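This is not the real BERT training code, but the masking step itself is simple to sketch: hide roughly 15% of token ids and keep the originals as prediction targets (the input ids below are made up; 103 is the [MASK] id in bert-base-uncased):

```python
import numpy as np

MASK_ID = 103  # the [MASK] token id in bert-base-uncased
token_ids = np.array([101, 7592, 2088, 2003, 6429, 102])  # made-up input ids

# Randomly choose ~15% of positions to hide
rng = np.random.default_rng(seed=0)
mask = rng.random(len(token_ids)) < 0.15

masked_ids = token_ids.copy()
masked_ids[mask] = MASK_ID

# The model receives `masked_ids` and is trained to predict the
# original values at the masked positions, i.e. token_ids[mask]
```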
Dimensionality
Embedding dimension is the number of values in the output vector. Common dimensions include:
| Dimensions | Example Models | Use Case |
|---|---|---|
| 384 | all-MiniLM-L6-v2 | Lightweight, fast, good for prototyping |
| 768 | all-mpnet-base-v2, BGE-base | Good balance of quality and speed |
| 1024 | E5-large, BGE-large | High quality, moderate cost |
| 1536 | OpenAI text-embedding-3-small | Production standard for most applications |
| 3072 | OpenAI text-embedding-3-large | Maximum quality, higher cost and storage |
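Dimension choice directly drives storage cost. A back-of-the-envelope calculation, assuming float32 (4 bytes per value) and one million vectors:

```python
# Rough raw index size for 1 million float32 vectors
N_VECTORS = 1_000_000
BYTES_PER_VALUE = 4  # float32

for dims in [384, 768, 1024, 1536, 3072]:
    gb = N_VECTORS * dims * BYTES_PER_VALUE / 1024**3
    print(f"{dims:>5} dims: {gb:5.2f} GB")
# 384 → 1.43 GB, 768 → 2.86 GB, 1024 → 3.81 GB,
# 1536 → 5.72 GB, 3072 → 11.44 GB
```

Real vector databases add index overhead on top of this, so treat these as lower bounds.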
Cosine Similarity Explained
Cosine similarity measures the angle between two vectors, ignoring their magnitude. It ranges from -1 (opposite) to 1 (identical direction).
```python
import numpy as np

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example with simple vectors
cat = np.array([0.9, 0.1, 0.8, 0.2])
dog = np.array([0.85, 0.15, 0.75, 0.25])
car = np.array([0.1, 0.9, 0.15, 0.85])

print(f"cat vs dog: {cosine_similarity(cat, dog):.4f}")  # ~0.99 (very similar)
print(f"cat vs car: {cosine_similarity(cat, car):.4f}")  # ~0.31 (largely unrelated)

# Rough interpretation (exact thresholds vary by model):
# > 0.8   → Very similar
# 0.5-0.8 → Somewhat related
# < 0.3   → Unrelated
```
Visualizing Embeddings
Embeddings live in high-dimensional space (often hundreds to thousands of dimensions, e.g., 1536). To visualize them, we use dimensionality reduction techniques that project vectors down to 2D or 3D while approximately preserving relative distances.
t-SNE (t-distributed Stochastic Neighbor Embedding)
Best for revealing clusters and local structure: points that are close in high dimensions stay close in the visualization, though distances between clusters are not reliably meaningful.
UMAP (Uniform Manifold Approximation and Projection)
Better at preserving both local and global structure. Faster than t-SNE and often preferred for large datasets.
```python
import umap
import matplotlib.pyplot as plt
import numpy as np

# Assume `embeddings` is a numpy array of shape (n_samples, 1536)
# and `labels` is a list of category names

# Reduce to 2D with UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
embeddings_2d = reducer.fit_transform(embeddings)

# Plot one scatter series per category
plt.figure(figsize=(12, 8))
for label in set(labels):
    mask = np.array([l == label for l in labels])
    plt.scatter(
        embeddings_2d[mask, 0],
        embeddings_2d[mask, 1],
        label=label, alpha=0.7, s=50,
    )
plt.legend()
plt.title("Document Embeddings Visualized with UMAP")
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.show()
```
💡 Try It Yourself
Install sentence-transformers and compute embeddings for 10 sentences about different topics (e.g., 5 about cooking, 5 about programming). Compute pairwise cosine similarities and verify that within-topic pairs score higher.
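For the pairwise step, a vectorized helper computes all similarities at once. The `embeddings` matrix below is placeholder data; substitute the array returned by your model (e.g., `model.encode(sentences)` from sentence-transformers):

```python
import numpy as np

def pairwise_cosine(embeddings: np.ndarray) -> np.ndarray:
    """All-pairs cosine similarity for an (n, d) matrix of embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# Placeholder 2-D data; replace with real sentence embeddings
embeddings = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
sims = pairwise_cosine(embeddings)
print(np.round(sims, 2))  # 3x3 symmetric matrix with 1.0 on the diagonal
```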
Lilly Tech Systems