Advanced

Fine-tuning Embeddings

Learn when and how to fine-tune embedding models for domain-specific tasks, how to prepare training data, and how to use Matryoshka embeddings and evaluation metrics.

Why Fine-tune Embeddings?

Off-the-shelf embedding models are trained on general web data. They work well for general-purpose text, but may underperform on domain-specific content:

  • Medical/legal/scientific text with specialized terminology.
  • Company-specific jargon that general models have never seen.
  • Non-English languages with limited training data.
  • Specific retrieval patterns where the definition of "similar" differs from general usage.
💡 Rule of thumb: Try off-the-shelf models first. Fine-tune only when you have measured a quality gap AND have enough training data (typically 1,000+ pairs minimum). Fine-tuning can improve retrieval metrics by 5–15% on domain-specific tasks.

Training Data Preparation

The quality of your fine-tuned model depends entirely on the quality of your training data. There are two main data formats:

Pairs (Positive Only)

Pairs of texts that should be similar. Simplest to create.

Python - Pair Format
from sentence_transformers import InputExample

# Pair format: (anchor, positive)
train_examples = [
    InputExample(texts=[
        "What is the return policy?",           # query
        "Items can be returned within 30 days"   # relevant document
    ]),
    InputExample(texts=[
        "How do I reset my password?",
        "Go to Settings > Security > Reset Password"
    ]),
    InputExample(texts=[
        "What are your shipping options?",
        "We offer standard (5-7 days) and express (1-2 days) shipping"
    ]),
]

Triplets (Anchor, Positive, Negative)

More informative training signal. Each example includes a hard negative — a text that seems related but is not the right answer.

Python - Triplet Format
# Triplet format: (anchor, positive, negative)
train_examples = [
    InputExample(texts=[
        "What is the return policy?",            # anchor
        "Items can be returned within 30 days",  # positive (relevant)
        "We have stores in 50 countries"         # hard negative (same domain, wrong answer)
    ]),
    InputExample(texts=[
        "How do I track my order?",
        "Use the tracking link in your confirmation email",
        "Our customer service hours are 9am to 5pm"  # hard negative
    ]),
]

Fine-tuning with Sentence Transformers

Python - Complete Fine-tuning Pipeline
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

# 1. Load base model
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Prepare training data (pairs)
train_examples = [
    InputExample(texts=["query text", "relevant document"]),
    # ... hundreds or thousands more pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 3. Choose a loss function
# MultipleNegativesRankingLoss works great with pairs
train_loss = losses.MultipleNegativesRankingLoss(model)

# For triplets, use TripletLoss instead:
# train_loss = losses.TripletLoss(model)

# 4. Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./fine-tuned-model",
    show_progress_bar=True,
)

# 5. Load and use the fine-tuned model
fine_tuned = SentenceTransformer("./fine-tuned-model")
embeddings = fine_tuned.encode(["Your domain-specific text"])

Matryoshka Embeddings

Matryoshka Representation Learning (MRL) trains models to produce embeddings where the first N dimensions are a valid, lower-dimensional embedding. Like Russian nesting dolls, you can truncate the embedding to any prefix length and still get useful results.

Python - Training Matryoshka Embeddings
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Wrap the loss with Matryoshka loss
base_loss = losses.MultipleNegativesRankingLoss(model)
matryoshka_loss = losses.MatryoshkaLoss(
    model,
    loss=base_loss,
    matryoshka_dims=[256, 128, 64, 32]  # Train for multiple dimensions
)

# Train as usual (train_dataloader prepared as in the pipeline above)
model.fit(
    train_objectives=[(train_dataloader, matryoshka_loss)],
    epochs=3,
)

# After training, you can use any prefix of the embedding:
full_embedding = model.encode("Some text")       # 384 dims
short_embedding = full_embedding[:128]            # Also valid! 128 dims
tiny_embedding = full_embedding[:32]              # Also valid! 32 dims

Why Matryoshka matters: It lets you trade quality for efficiency at inference time. Use full dimensions for critical searches and truncated dimensions for less important or high-throughput scenarios. OpenAI's text-embedding-3 models already support this via the dimensions parameter.
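One practical detail the snippet above glosses over: cosine similarity assumes unit-length vectors, so after truncating an embedding you should re-normalize it. A small numpy sketch:

```python
import numpy as np

def truncate(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 384-dim model embedding.
full = np.random.default_rng(0).standard_normal(384).astype(np.float32)
short = truncate(full, 128)  # shape (128,), unit norm
```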

Evaluation Metrics

Measure your fine-tuned model's quality with these standard metrics:

| Metric | What It Measures | Good Score |
|---|---|---|
| Recall@k | Fraction of relevant documents found in top-k results | > 0.90 for k=10 |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of the first relevant result | > 0.70 |
| NDCG@k | Quality of ranking considering position and relevance grades | > 0.75 for k=10 |
| MAP (Mean Average Precision) | Average precision across all recall levels | > 0.60 |

Python - Evaluate Retrieval Quality
from sentence_transformers import SentenceTransformer, util
import numpy as np

def evaluate_recall_at_k(model, queries, documents, relevance, k=10):
    """Compute Recall@k for an embedding model."""
    query_embeddings = model.encode(queries)
    doc_embeddings = model.encode(documents)

    scores = util.cos_sim(query_embeddings, doc_embeddings)
    recalls = []

    for i in range(len(queries)):
        top_k_indices = scores[i].argsort(descending=True)[:k].tolist()
        relevant_docs = set(relevance[i])  # indices of relevant docs
        found = len(relevant_docs.intersection(set(top_k_indices)))
        recalls.append(found / len(relevant_docs))

    return np.mean(recalls)

# Compare base vs fine-tuned
base_model = SentenceTransformer("all-MiniLM-L6-v2")
fine_tuned_model = SentenceTransformer("./fine-tuned-model")

base_recall = evaluate_recall_at_k(base_model, test_queries, docs, relevance)
ft_recall = evaluate_recall_at_k(fine_tuned_model, test_queries, docs, relevance)

print(f"Base model Recall@10: {base_recall:.4f}")
print(f"Fine-tuned Recall@10: {ft_recall:.4f}")

When to Fine-tune vs Use Off-the-Shelf

| Scenario | Recommendation |
|---|---|
| General text search | Use off-the-shelf (OpenAI, Voyage) |
| Domain-specific (medical, legal) | Fine-tune if recall is below target |
| Limited labeled data (<100 pairs) | Use off-the-shelf; not enough data to fine-tune |
| Sufficient labeled data (>1,000 pairs) | Fine-tune for measurable improvement |
| Multilingual search | Use multilingual off-the-shelf models |
| Cost-sensitive production | Fine-tune a small open-source model to avoid API costs |

Cost and Compute Considerations

  • GPU requirement: Fine-tuning is impractical without a GPU. A single NVIDIA T4 (16GB) is sufficient for most sentence-transformer models; training runs take 30 minutes to a few hours.
  • Cloud cost: A T4 GPU on cloud costs approximately $0.50–$1.00/hour. A typical fine-tuning run costs $1–$10 total.
  • Free options: Google Colab offers free GPU access sufficient for small fine-tuning runs.
  • Data cost: The real cost is creating high-quality training pairs. Budget more for data curation than compute.

💡 Try It Yourself

Create 50 query-document pairs from your domain, fine-tune all-MiniLM-L6-v2, and compare Recall@10 before and after fine-tuning. Even a small dataset can show measurable improvement.

Start by mining training data from your existing search logs, FAQ pages, or documentation.