Sentence Transformers Gets Multimodal: Embedding and Reranking Text, Images, Audio, and Video

If you’ve been using Sentence Transformers for text embeddings or reranking, you’re about to get a serious upgrade. The v5.4 release brings multimodal support to the library, meaning you can now encode and compare text, images, audio, and video using the same familiar API. No separate pipelines, no hacky workarounds. It just works.

What’s Actually New?

Traditional embedding models turn text into fixed-size vectors. Multimodal embedding models do the same thing across modalities: they map text, images, audio, and video into a single shared embedding space. That means you can compare a text query against image documents, find video clips matching a description, or build RAG pipelines that don’t care whether your source material is a PDF or a photo.

The reranker side gets the same treatment. Cross-Encoders that used to score text-text pairs can now handle mixed-modality pairs. Want to rank images by relevance to a text query? Done. Need to score a video against an audio description? Also done.

Installation: Pick Your Modalities

The base Sentence Transformers install won’t cut it for multimodal work. You need to pull in the extras:

pip install -U "sentence-transformers[image]"
pip install -U "sentence-transformers[audio]"
pip install -U "sentence-transformers[video]"

Or combine the extras you need in a single command:

pip install -U "sentence-transformers[image,video,train]"

One thing to watch out for: VLM-based models like Qwen3-VL-2B need a GPU with at least 8GB of VRAM. The 8B variants will eat about 20GB. If you’re on CPU, stick with CLIP or text-only models. Don’t say I didn’t warn you.
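If you want to encode that rule of thumb in a script, here is a minimal sketch. The helper pick_embedding_model is hypothetical, the thresholds are just the approximate figures quoted above, and the 8B and CLIP model ids are assumed names rather than anything this post confirms:

```python
from typing import Optional

# Approximate VRAM figures quoted in this post, in GB.
VLM_2B_MIN_VRAM_GB = 8
VLM_8B_MIN_VRAM_GB = 20


def pick_embedding_model(vram_gb: Optional[float]) -> str:
    """Return a model id for the available GPU memory (None means CPU only)."""
    if vram_gb is None or vram_gb < VLM_2B_MIN_VRAM_GB:
        # Assumed id of a CLIP checkpoint; swap in whichever CLIP model you use.
        return "sentence-transformers/clip-ViT-B-32"
    if vram_gb < VLM_8B_MIN_VRAM_GB:
        return "Qwen/Qwen3-VL-Embedding-2B"
    # Assumed id, by analogy with the 2B model named in this post.
    return "Qwen/Qwen3-VL-Embedding-8B"


print(pick_embedding_model(None))
print(pick_embedding_model(12))
print(pick_embedding_model(24))
```

On a real machine you could feed this from torch.cuda.get_device_properties(0).total_memory, after checking torch.cuda.is_available().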

Embedding Images: It’s the Same API

Loading a multimodal model is identical to loading a text-only one:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

The model auto-detects what modalities it supports. No extra configuration needed.

Encoding images works the same way as encoding text. Pass URLs, local file paths, or PIL Image objects:

img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)

Cross-Modal Similarity: The Real Test

This is where it gets interesting. Since both modalities map into the same space, you can compute similarities between text and image embeddings directly:

text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)

The results make sense: “A green car” matches the car image at 0.51, “A bee on a pink flower” hits the bee image at 0.67. The hard negatives correctly score lower.

You’ll notice those scores aren’t close to 1.0. That’s the modality gap in action: embeddings from different modalities cluster in separate regions of the space. Cross-modal similarities are lower than within-modal ones, but the relative ordering holds up, so retrieval still works.
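To see why a modality gap lowers scores without breaking retrieval, here is a toy sketch in plain Python. The 3-dimensional vectors are made up for illustration and are not model output: the first two dimensions stand in for content, and the third is an artificial offset that separates the two modalities.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative only). The third component plays the role of
# the modality gap: +0.5 for text, -0.5 for images.
text = {"car": [1.0, 0.0, 0.5], "bee": [0.0, 1.0, 0.5]}
image = {"car": [1.0, 0.0, -0.5], "bee": [0.0, 1.0, -0.5]}

for query, q_vec in text.items():
    scores = {name: cosine(q_vec, d_vec) for name, d_vec in image.items()}
    best = max(scores, key=scores.get)
    print(query, {k: round(v, 2) for k, v in scores.items()}, "best:", best)
```

In this toy setup the matching pair scores well below 1.0 even though the content dimensions agree exactly, yet each text query still ranks its own image first.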

Encoding Queries vs Documents

For retrieval tasks, use encode_query() and encode_document() instead of plain encode(). Many retrieval models prepend different instruction prompts depending on whether the input is a query or a document. The dedicated methods handle this automatically.

query_emb = model.encode_query("Find images of green cars")
doc_emb = model.encode_document("https://example.com/car.jpg")
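Conceptually, a role-specific prompt is just text placed in front of the input before it is encoded. The sketch below illustrates that idea only: the prompt strings and the apply_prompt helper are hypothetical, and real models ship their own prompts, which the library applies for you.

```python
# Hypothetical prompts for illustration; real models define their own.
PROMPTS = {
    "query": "Retrieve images or documents relevant to the query: ",
    "document": "",
}


def apply_prompt(text: str, role: str) -> str:
    """Prepend the prompt registered for `role` ("query" or "document")."""
    if role not in PROMPTS:
        raise ValueError(f"Unknown role: {role!r}")
    return PROMPTS[role] + text


print(apply_prompt("Find images of green cars", "query"))
print(apply_prompt("A green car parked outside", "document"))
```

Because the query and the document get different prefixes, encoding a query with the document method (or the other way around) can quietly hurt retrieval quality, which is the reason to prefer the dedicated methods.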

Multimodal Reranking

Reranking works similarly. Load a multimodal CrossEncoder and call rank() or predict():

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
scores = model.predict([
    ("A bee on a flower", "https://example.com/bee.jpg"),
    ("A car on the road", "https://example.com/bee.jpg"),
])
print(scores)  # First pair should score higher
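predict() gives you one score per pair, so turning those scores into a ranking is just a sort. Here is a minimal sketch in plain Python; the file names and score values are made up for illustration, and rank_by_score is a hypothetical helper, not part of the library:

```python
def rank_by_score(documents, scores):
    """Return (document, score) pairs sorted from most to least relevant."""
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(documents[i], scores[i]) for i in order]


# Made-up scores standing in for the output of predict().
docs = ["car.jpg", "bee.jpg", "wasp.jpg"]
scores = [0.08, 0.91, 0.47]

for doc, score in rank_by_score(docs, scores):
    print(f"{score:.2f}  {doc}")
```

If you only have one query and a list of candidates, rank() should save you this step by scoring and sorting in one call.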

Input Formats: What Works?

The library accepts a wide range of input types:

Text: strings
Images: URLs, file paths, PIL Image, numpy arrays, torch tensors, base64 strings, bytes
Audio: URLs, file paths, numpy arrays, torch tensors, bytes
Video: URLs, file paths, numpy arrays, torch tensors, bytes (video files or frames)

Check what a model supports with:

print(model.modalities)
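A cheap guard before encoding can save you a confusing error later. This is a sketch under assumptions: it treats modalities as a collection of strings such as "text" and "image", and both StubModel and require_modalities are hypothetical names used only to keep the example self-contained.

```python
class StubModel:
    """Hypothetical stand-in for a loaded model exposing `modalities`."""

    modalities = ["text", "image"]  # assumed string names


def require_modalities(model, needed):
    """Raise ValueError if `model` lacks any modality listed in `needed`."""
    missing = [m for m in needed if m not in model.modalities]
    if missing:
        raise ValueError(f"Model does not support: {', '.join(missing)}")


model = StubModel()
require_modalities(model, ["text", "image"])  # passes silently

try:
    require_modalities(model, ["audio"])
except ValueError as err:
    print(err)
```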

Which Models Can You Use?

Hugging Face has been busy. Here are some of the supported models:

Qwen3-VL-Embedding-2B / 8B: Solid all-rounders for multimodal embedding
Qwen3-VL-Reranker-2B / 8B: The reranker counterparts
CLIP-based models: Lighter, faster, but less capable than VLM-based ones
SigLIP, ViT, and other vision-language models: Various options depending on your needs

VLM-based models are the most capable but need a GPU. CLIP models work fine on CPU if you’re prototyping.

The Bottom Line

Sentence Transformers v5.4 isn’t just a minor update. It opens up genuinely new use cases: visual document retrieval, cross-modal search, multimodal RAG, video clip retrieval. The API stays clean and consistent, which is exactly what you want when you’re already running text pipelines.

If you’re building anything that mixes text with images, audio, or video, this is worth your time. The library handles the complexity so you don’t have to.