Fine-Tuning Multimodal Embedding Models with Sentence Transformers: A Practical Walkthrough

Sentence Transformers has been my go-to library for embedding and reranker models for years. It’s solid, well-maintained, and keeps evolving. The latest update? Full multimodal support for text, images, audio, and video. Tom Aarsen covered the basics in his previous post, but now he’s back with something more useful: a hands-on guide to training and fine-tuning these multimodal models on your own data.

The example he uses is Visual Document Retrieval (VDR). Think of it as finding the right page from a pile of document screenshots based on a text query like “What was the company’s Q3 revenue?” It’s a practical, messy problem that a lot of enterprises face, and the results are impressive.

Why Bother Fine-Tuning?

General-purpose multimodal models like Qwen/Qwen3-VL-Embedding-2B are trained on everything under the sun: image-text pairs, visual QA, document understanding, you name it. That breadth is useful, but it means the model is a jack of all trades and master of none. For a specific task like VDR, you’re leaving performance on the table.

Aarsen’s experiment backs this up. The base model scored an NDCG@10 of 0.888. After fine-tuning on domain-specific data, that rose to 0.947. That’s not a small bump: the fine-tuned model beat every other multimodal model he tested, including ones four times larger. Fine-tuning lets the model learn the patterns that matter for your use case: document layouts, chart structures, table formatting. It’s the difference between a generic tool and a purpose-built one.

The Training Pipeline

The training pipeline for multimodal models is the same as for text-only models. You’ve got your model, dataset, loss function, optional training arguments, an evaluator if you want one, and the trainer that ties it all together. The main difference is that your dataset now includes images (or other modalities), and the model’s processor handles preprocessing automatically. No manual resizing or encoding gymnastics.

Let’s walk through each piece.

Model

You have two options. The straightforward one is fine-tuning an existing multimodal embedding model. You load it with SentenceTransformer, pass in model_kwargs for things like attention implementation and precision, and processor_kwargs to control image resolution. Aarsen’s example uses Flash Attention 2 and bfloat16, with pixel bounds set to keep memory in check:

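from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    # Flash Attention 2 needs the flash-attn package and a supported GPU;
    # bfloat16 roughly halves memory use compared with float32.
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    # Bounds on the pixel count per image; a lower max_pixels means fewer
    # visual tokens per page and less memory.
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)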

The second option is starting from a fresh Vision-Language Model (VLM) checkpoint that hasn’t been trained for embeddings yet. Sentence Transformers tries to detect the architecture, figure out the supported modalities from the processor, and set up the right forward method and pooling. If the auto-detection gets confused, you can edit the sentence_bert_config.json file to fix modality settings and output handling. It’s not perfect, but it’s flexible enough to handle most models.

Either way, the library inspects the processor to determine what modalities are available and adds pooling automatically if needed. You can check with print(model.modalities) or model.supports("image").
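
A quick guard before training can catch a mis-detected architecture early. The sketch below relies only on the modalities attribute and the supports() method mentioned above. The helper name check_modalities and the stub class are hypothetical; the stub exists only so the snippet runs without downloading a model.

```python
def check_modalities(model, required=("text", "image")):
    """Return the required modalities that the model does not support."""
    return [name for name in required if not model.supports(name)]


class _StubModel:
    """Stand-in for a loaded SentenceTransformer (illustration only)."""

    modalities = ["text", "image"]

    def supports(self, modality):
        return modality in self.modalities


stub = _StubModel()
missing = check_modalities(stub, required=("text", "image", "audio"))
print(missing)  # ['audio']
```

With a real model you would pass the loaded SentenceTransformer instead of the stub and stop early if the returned list is not empty.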

Dataset

For VDR, the dataset consists of text queries and document screenshots. Each query has a corresponding positive image (the correct page) and a set of negatives (other pages that don’t match). The dataset format is straightforward: a list of examples, each with a query string and an image path or URL.

Aarsen used a custom dataset for his experiments, but the library supports standard formats like Hugging Face datasets or plain lists of dictionaries. The key is that each example pairs a text query with one or more images.
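
To make the shape concrete, here is a minimal sketch of such a dataset as a plain list of dictionaries, with a small sanity check. The column names, file paths, and the validate_rows helper are all made up for illustration. As far as I know, Sentence Transformers reads the columns of ranking datasets by position (query first, then the positive, then any negatives), which is why the check enforces column order.

```python
# Hypothetical column names, listed in the order the loss would consume them.
COLUMNS = ("query", "positive", "negative")

# Hypothetical rows: the queries and file paths are invented for illustration.
rows = [
    {
        "query": "What was the company's Q3 revenue?",
        "positive": "pages/report_2024_p12.png",
        "negative": "pages/report_2024_p03.png",
    },
    {
        "query": "Which region had the highest churn?",
        "positive": "pages/report_2024_p27.png",
        "negative": "pages/report_2024_p12.png",
    },
]


def validate_rows(rows, columns=COLUMNS):
    """Check column order and that every field is a non-empty string."""
    for index, row in enumerate(rows):
        if tuple(row) != columns:
            raise ValueError(f"row {index}: expected columns {columns}, got {tuple(row)}")
        for name in columns:
            if not isinstance(row[name], str) or not row[name].strip():
                raise ValueError(f"row {index}: empty or non-string field {name!r}")
        if row[columns[1]] == row[columns[2]]:
            raise ValueError(f"row {index}: positive and negative are the same page")
    return len(rows)


print(validate_rows(rows))  # number of rows that passed the checks
```

Turning the paths into loaded images and wrapping the rows in a Hugging Face dataset is left out here; the point is only the pairing of one query with one matching page and one non-matching page.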

Loss Function

This is where things get interesting. For multimodal embedding training, Aarsen recommends CachedMultipleNegativesRankingLoss. It’s a variant of the classic MultipleNegativesRankingLoss that uses gradient caching, so you can train with large effective batch sizes without a matching jump in memory. The idea is simple: for each query, the model learns to rank the correct image higher than all the incorrect ones in the batch.
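
The in-batch ranking objective can be sketched in a few lines of plain Python. This is an illustration of the idea, not the library’s implementation: it skips the caching entirely (which, as I understand it, changes memory use rather than the objective), and the scale of 20.0 on the cosine similarities is an assumption on my part.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where document i is the positive for query i.

    Every other document in the batch acts as a negative for that query.
    """
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d embeddings: each query points at "its" page.
queries = [[1.0, 0.0], [0.0, 1.0]]
pages_aligned = [[0.9, 0.1], [0.1, 0.9]]
pages_swapped = pages_aligned[::-1]

print(in_batch_ranking_loss(queries, pages_aligned))  # close to zero
print(in_batch_ranking_loss(queries, pages_swapped))  # much larger
```

A larger batch means more negatives per query, which is why being able to grow the batch without running out of memory matters for this loss.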

He also combines it with MatryoshkaLoss, which trains the model to produce embeddings that remain useful when truncated to smaller dimensions. This lets you trade off between speed and accuracy at inference time. You can use the full 2048-dimensional embedding for maximum accuracy or a smaller 256-dimensional one for faster retrieval. During training, MatryoshkaLoss applies the underlying loss at each configured dimension, so the leading components of the vector carry the most important information.
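
At inference time, using a smaller Matryoshka dimension amounts to keeping the first components of the vector and re-normalizing. A minimal sketch with a toy 8-dimensional vector follows; the helper name truncate_embedding is hypothetical, and the library has its own way of doing this (a truncate_dim setting on the model, as far as I know).

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


# Toy vector standing in for a 2048-d embedding.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2, 0.3, 0.4]
small = truncate_embedding(full, 4)
print(len(small))
```

Queries and documents have to be truncated to the same dimension before comparing them, since cosine similarity is only defined between vectors of equal length.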

Training Arguments

Nothing groundbreaking here. You set batch size, learning rate, number of epochs, and whether to use mixed precision. Aarsen used a batch size of 32, learning rate of 2e-5, and trained for 3 epochs. The usual advice applies: start with a small learning rate and monitor validation performance.
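
As a rough sketch, the settings above map onto SentenceTransformerTrainingArguments like this. Only the three reported values come from the walkthrough; the bf16 flag is my reading of “mixed precision”, and the output path and the build_training_args helper are hypothetical.

```python
# Values reported above; everything else is left at the library defaults.
HYPERPARAMS = {
    "per_device_train_batch_size": 32,
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "bf16": True,  # assumption: mixed precision on hardware with bfloat16 support
}


def build_training_args(output_dir="models/vdr-finetune"):
    """Create the arguments object; `output_dir` is a hypothetical path."""
    # Imported lazily so the settings can be inspected without the library installed.
    from sentence_transformers import SentenceTransformerTrainingArguments

    return SentenceTransformerTrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```

Keeping the values in one dictionary makes it easy to log them next to the evaluation scores when you compare runs.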

Evaluator

The library comes with built-in evaluators for retrieval tasks. Aarsen used InformationRetrievalEvaluator to compute NDCG@10 on a held-out validation set. This gives you a clear signal of whether your fine-tuning is actually helping. Without an evaluator, you’re flying blind.
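
For intuition about what the metric rewards, here is a small hand-rolled NDCG@k with binary relevance. It is a sketch, not the evaluator’s code; the function name and page identifiers are made up, and it assumes each document appears at most once in the ranking.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: 1 if a document is relevant, else 0."""
    # Discounted gain of the actual ranking (rank 0 is the top result).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Best achievable gain: all relevant documents placed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


ranking = ["page_7", "page_3", "page_9"]
print(ndcg_at_k(ranking, {"page_7"}))  # correct page ranked first
print(ndcg_at_k(ranking, {"page_3"}))  # correct page ranked second, about 0.63
```

With one correct page per query, the score depends only on where that page lands: first place scores 1.0, second place about 0.63, and anything outside the top 10 scores 0.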

Trainer

SentenceTransformerTrainer brings everything together. You pass it the model, training dataset, loss function, evaluator, and training arguments. Call trainer.train() and let it run. The trainer handles checkpointing, logging, and evaluation automatically.
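
Put together, the wiring could look roughly like the sketch below. The run_training helper, the mini-batch size, and the choice of only two Matryoshka dimensions (the two sizes mentioned in this post) are my assumptions, not settings taken from the walkthrough, and the function expects an already loaded model, dataset, and arguments object.

```python
def run_training(model, train_dataset, args, evaluator=None, mini_batch_size=4):
    """Wire model, data, loss, and evaluator into a trainer and run it."""
    # Imported lazily so this function can be defined without the library installed.
    from sentence_transformers import SentenceTransformerTrainer
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        MatryoshkaLoss,
    )

    # In-batch ranking loss with gradient caching; mini_batch_size is a guess
    # that you would tune to your GPU memory.
    base_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=mini_batch_size)
    # Apply the same loss at the full size and at a truncated size.
    loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[2048, 256])

    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=loss,
        evaluator=evaluator,
    )
    trainer.train()
    return trainer
```

Returning the trainer keeps the door open for saving the model or running a final evaluation afterwards.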

Results Worth Noting

The fine-tuned model, tomaarsen/Qwen3-VL-Embedding-2B-vdr, achieved an NDCG@10 of 0.947 on the VDR task. That’s a gain of 0.059 points, or about 6.6% in relative terms, over the base model’s 0.888. More importantly, it outperformed every other VDR model Aarsen tested, including models up to 4x its size. That’s a strong argument for fine-tuning over brute-force scaling.

He also tested the Matryoshka dimensions. The full 2048-dimensional embedding gave the best results, but even at 256 dimensions, the model still outperformed the base model. That’s useful for production systems where latency matters.

Training Multimodal Reranker Models

Aarsen also covers training multimodal reranker models, which are used to re-rank the top results from an embedding-based retrieval step. The process is similar, but the loss function changes. Instead of a ranking loss, you use a cross-entropy loss where the model predicts a relevance score for each query-document pair. The dataset format is also different: each example is a query-image pair with a binary label (relevant or not).
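
The pairwise objective described here is ordinary binary cross-entropy on the reranker’s raw score. A minimal sketch of that formula follows; the logits are invented, and this says nothing about which loss class the library actually uses.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy for one (query, image) pair; label is 0 or 1."""
    # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(logit).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# (hypothetical reranker logit, relevance label)
pairs = [(3.2, 1), (-2.5, 0), (1.0, 0)]
losses = [bce_with_logits(logit, label) for logit, label in pairs]
print(losses)
```

The first two pairs are scored on the correct side and get a small loss; the third is an irrelevant page with a positive score, so it is penalized more heavily.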

Rerankers are more expensive to run than embedding models, but they can significantly improve final accuracy. If you have the budget, it’s worth adding one after the initial retrieval step.
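
The two-stage setup can be sketched with plain functions standing in for the models. Everything here is a toy: the scores are hard-coded, and retrieve_then_rerank is a hypothetical helper rather than a library call.

```python
def retrieve_then_rerank(query, candidates, embed_score, rerank_score, top_k=3):
    """Shortlist with the cheap scorer, then reorder with the expensive one."""
    shortlist = sorted(
        candidates, key=lambda doc: embed_score(query, doc), reverse=True
    )[:top_k]
    return sorted(shortlist, key=lambda doc: rerank_score(query, doc), reverse=True)


# Toy scores standing in for an embedding model and a reranker.
cheap = {"page_1": 0.9, "page_2": 0.8, "page_3": 0.7, "page_4": 0.1}
precise = {"page_1": 0.2, "page_2": 0.95, "page_3": 0.5, "page_4": 0.99}

result = retrieve_then_rerank(
    "What was the company's Q3 revenue?",
    list(cheap),
    embed_score=lambda query, doc: cheap[doc],
    rerank_score=lambda query, doc: precise[doc],
    top_k=3,
)
print(result)  # ['page_2', 'page_3', 'page_1']
```

Note that page_4 never reaches the reranker even though the reranker would have scored it highest. The reranker can only fix the order of what the first stage retrieves, so top_k is a trade-off between cost and recall.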

Additional Resources

Aarsen links to prior blog posts on text-only training, sparse embeddings, and reranker models. The Sentence Transformers documentation has a full training guide, and there are example scripts for common tasks. If you’re serious about multimodal retrieval, start there.

My Take

This is the kind of practical content I wish more library releases included. Aarsen doesn’t just announce a feature; he shows you how to use it, with real numbers and a concrete use case. The VDR example is well-chosen because it’s a common enterprise problem that existing models handle poorly. The fact that a fine-tuned 2B-parameter model beats 8B-parameter competitors says a lot about the value of domain-specific training.

My only criticism is that the post assumes you already know the basics of Sentence Transformers. If you’re new, you’ll need to read the previous post first. But for anyone who’s already using the library, this is a goldmine. The code snippets are clean, the explanation of MatryoshkaLoss is clear, and the results are honest. No hype, just data.

If you’re building a retrieval system for documents, slides, or any visual content, this is the approach I’d start with.