# Fine-tuning Embeddings

Learn when and how to fine-tune embedding models for domain-specific tasks, including training data preparation, Matryoshka embeddings, and evaluation metrics.
## Why Fine-tune Embeddings?
Off-the-shelf embedding models are trained on general web data. They work well for general-purpose text, but may underperform on domain-specific content:
- Medical/legal/scientific text with specialized terminology.
- Company-specific jargon that general models have never seen.
- Non-English languages with limited training data.
- Specific retrieval patterns where the definition of "similar" differs from general usage.
## Training Data Preparation
The quality of your fine-tuned model depends entirely on the quality of your training data. Two main data formats are shown below:

### Pairs (Positive Only)
Pairs of texts that should be similar. Simplest to create.
```python
from sentence_transformers import InputExample

# Pair format: (anchor, positive)
train_examples = [
    InputExample(texts=[
        "What is the return policy?",            # query
        "Items can be returned within 30 days",  # relevant document
    ]),
    InputExample(texts=[
        "How do I reset my password?",
        "Go to Settings > Security > Reset Password",
    ]),
    InputExample(texts=[
        "What are your shipping options?",
        "We offer standard (5-7 days) and express (1-2 days) shipping",
    ]),
]
```
### Triplets (Anchor, Positive, Negative)

Triplets provide a more informative training signal: each example includes a hard negative, a text that seems related but is not the right answer.
```python
# Triplet format: (anchor, positive, negative)
train_examples = [
    InputExample(texts=[
        "What is the return policy?",            # anchor
        "Items can be returned within 30 days",  # positive (relevant)
        "We have stores in 50 countries",        # hard negative (same domain, wrong answer)
    ]),
    InputExample(texts=[
        "How do I track my order?",
        "Use the tracking link in your confirmation email",
        "Our customer service hours are 9am to 5pm",  # hard negative
    ]),
]
```
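In practice, hard negatives are often mined with the base model itself: score each query against the corpus and take a top-ranked text that is not the labeled positive. A minimal sketch of the selection step (`pick_hard_negative` is an illustrative helper; the `scores` row would come from e.g. `util.cos_sim`):

```python
import numpy as np

def pick_hard_negative(scores, corpus, positive):
    """Return the highest-scoring corpus text that is not the labeled positive.

    scores: 1-D array of similarities between one query and every corpus text.
    """
    for idx in np.argsort(scores)[::-1]:  # best-scoring first
        if corpus[idx] != positive:
            return corpus[idx]
    return None  # corpus contained only the positive
```

Filtering out near-duplicates of the positive (not just exact string matches) helps avoid training on false negatives.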
## Fine-tuning with Sentence Transformers
```python
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

# 1. Load base model
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Prepare training data (pairs)
train_examples = [
    InputExample(texts=["query text", "relevant document"]),
    # ... hundreds or thousands more pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Choose a loss function
# MultipleNegativesRankingLoss works well with positive pairs
train_loss = losses.MultipleNegativesRankingLoss(model)
# For triplets, use TripletLoss instead:
# train_loss = losses.TripletLoss(model)

# 4. Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-model",
    show_progress_bar=True,
)

# 5. Load and use the fine-tuned model
fine_tuned = SentenceTransformer("./fine-tuned-model")
embeddings = fine_tuned.encode(["Your domain-specific text"])
```
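MultipleNegativesRankingLoss works with plain pairs because, within a batch, every other pair's document acts as an in-batch negative. A toy NumPy sketch of the underlying objective (illustrative only, not the library's implementation):

```python
import numpy as np

def mnrl_loss(query_emb, doc_emb, scale=20.0):
    """Toy MultipleNegativesRankingLoss: softmax cross-entropy over
    in-batch similarities, where the i-th document is the positive
    for the i-th query and all other documents are negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    scores = scale * (q @ d.T)  # (batch, batch) similarity matrix
    # Log-softmax over each row; the diagonal is the target class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Larger batches mean more in-batch negatives per example, which is why this loss tends to benefit from the biggest batch size your GPU allows.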
## Matryoshka Embeddings
Matryoshka Representation Learning (MRL) trains models to produce embeddings where the first N dimensions are a valid, lower-dimensional embedding. Like Russian nesting dolls, you can truncate the embedding to any prefix length and still get useful results.
```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Wrap the base loss with Matryoshka loss
base_loss = losses.MultipleNegativesRankingLoss(model)
matryoshka_loss = losses.MatryoshkaLoss(
    model,
    loss=base_loss,
    matryoshka_dims=[384, 256, 128, 64, 32],  # full dimension plus truncations
)

# Train as usual (train_dataloader as in the previous section)
model.fit(
    train_objectives=[(train_dataloader, matryoshka_loss)],
    epochs=3,
)

# After training, any prefix of the embedding is usable:
full_embedding = model.encode("Some text")  # 384 dims
short_embedding = full_embedding[:128]      # also valid: 128 dims
tiny_embedding = full_embedding[:32]        # also valid: 32 dims
```
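One caveat when truncating: the prefix of a unit-length vector is no longer unit length, so re-normalize before treating a dot product as cosine similarity. A small sketch in plain NumPy:

```python
import numpy as np

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` dimensions and rescale to unit length."""
    truncated = np.asarray(embedding, dtype=np.float64)[:dim]
    return truncated / np.linalg.norm(truncated)

vec = np.random.default_rng(0).normal(size=384)
short = truncate_and_normalize(vec, 128)
print(short.shape)                             # (128,)
print(round(float(np.linalg.norm(short)), 6))  # 1.0
```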
OpenAI's text-embedding-3 models support this kind of truncation natively through the API's `dimensions` parameter.

## Evaluation Metrics
Measure your fine-tuned model's quality with these standard metrics:
| Metric | What It Measures | Good Score |
|---|---|---|
| Recall@k | Fraction of relevant documents found in top-k results | > 0.90 for k=10 |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of the first relevant result | > 0.70 |
| NDCG@k | Quality of ranking considering position and relevance grades | > 0.75 for k=10 |
| MAP (Mean Average Precision) | Average precision across all recall levels | > 0.60 |
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np

def evaluate_recall_at_k(model, queries, documents, relevance, k=10):
    """Compute Recall@k for an embedding model."""
    query_embeddings = model.encode(queries)
    doc_embeddings = model.encode(documents)
    scores = util.cos_sim(query_embeddings, doc_embeddings)
    recalls = []
    for i in range(len(queries)):
        top_k_indices = scores[i].argsort(descending=True)[:k].tolist()
        relevant_docs = set(relevance[i])  # indices of relevant docs
        found = len(relevant_docs.intersection(top_k_indices))
        recalls.append(found / len(relevant_docs))
    return np.mean(recalls)

# Compare base vs fine-tuned
base_model = SentenceTransformer("all-MiniLM-L6-v2")
fine_tuned_model = SentenceTransformer("./fine-tuned-model")

base_recall = evaluate_recall_at_k(base_model, test_queries, docs, relevance)
ft_recall = evaluate_recall_at_k(fine_tuned_model, test_queries, docs, relevance)
print(f"Base model Recall@10: {base_recall:.4f}")
print(f"Fine-tuned Recall@10: {ft_recall:.4f}")
```
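MRR from the table above can be computed in the same style. A sketch, assuming `scores` is the query-by-document similarity matrix and `relevance` holds a set of relevant document indices per query:

```python
import numpy as np

def mean_reciprocal_rank(scores, relevance):
    """scores: (num_queries, num_docs) similarity matrix.
    relevance: per-query sets of relevant document indices."""
    rrs = []
    for i, relevant in enumerate(relevance):
        ranked = np.argsort(scores[i])[::-1]  # best-scoring first
        rr = 0.0  # reciprocal rank stays 0 if nothing relevant is found
        for rank, doc_idx in enumerate(ranked, start=1):
            if doc_idx in relevant:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    return float(np.mean(rrs))
```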
## When to Fine-tune vs Use Off-the-Shelf
| Scenario | Recommendation |
|---|---|
| General text search | Use off-the-shelf (OpenAI, Voyage) |
| Domain-specific (medical, legal) | Fine-tune if recall is below target |
| Limited labeled data (<100 pairs) | Use off-the-shelf; not enough data to fine-tune |
| Sufficient labeled data (>1000 pairs) | Fine-tune for measurable improvement |
| Multilingual search | Use multilingual off-the-shelf models |
| Cost-sensitive production | Fine-tune a small open-source model to avoid API costs |
## Cost and Compute Considerations
- GPU requirement: Fine-tuning needs at least one GPU. A single NVIDIA T4 (16GB) is sufficient for most models. Training takes 30 minutes to a few hours.
- Cloud cost: A T4 GPU on cloud costs approximately $0.50–$1.00/hour. A typical fine-tuning run costs $1–$10 total.
- Free options: Google Colab offers free GPU access sufficient for small fine-tuning runs.
- Data cost: The real cost is creating high-quality training pairs. Budget more for data curation than compute.
### 💡 Try It Yourself

Create 50 query-document pairs from your domain, fine-tune `all-MiniLM-L6-v2`, and compare Recall@10 before and after fine-tuning. Even a small dataset can show measurable improvement.