Integrate with Sentence Transformers v5.4
Hello!
Pull Request overview
- Integrate this model with Sentence Transformers, so it can be loaded as a `SentenceTransformer`
Details
This PR adds the configuration files needed to load this model directly as a SentenceTransformer via Sentence Transformers. The model uses a feature-extraction Transformer with a Normalize module, producing 512-dimensional normalized embeddings via CLIP's projection layers (get_text_features/get_image_features). The model supports text, image, and composed image+text inputs.
Because this model supports composed image+text retrieval (summing the projected text and image embeddings), I've included a small custom `BGEVLCLIPTransformer` module (`bge_vl_clip_transformer.py`) that subclasses Sentence Transformers' `Transformer`. For the `("image", "text")` compound modality, it runs text and image through their respective forward paths and sums the resulting embeddings. Text-only and image-only inputs are handled directly by the parent class. This requires `trust_remote_code=True` when loading the model with Sentence Transformers.
The custom module also overrides `load` to force `trust_remote_code=False` for the underlying `AutoModel`, since the repo's custom `modeling_MMRet_CLIP.py` has a non-persistent `position_ids` buffer issue on `transformers` v5+. The standard `CLIPModel` loads these weights fine.
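For reference, the fusion step the custom module performs can be sketched in isolation. This is a simplified stand-in (plain NumPy, made-up embeddings), not the module's actual code:

```python
import numpy as np

def fuse(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Late fusion for the ("image", "text") compound modality:
    sum the two projected embeddings. The pipeline's Normalize
    module then L2-normalizes, so cosine similarity reduces to a
    dot product; the normalization is folded in here for clarity."""
    fused = text_emb + image_emb
    return fused / np.linalg.norm(fused, axis=-1, keepdims=True)

# Stand-in 512-dimensional "projected" embeddings, random for illustration
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((1, 512))
image_emb = rng.standard_normal((1, 512))
print(fuse(text_emb, image_emb).shape)  # (1, 512)
```

In the real module the two inputs come from CLIP's `get_text_features`/`get_image_features` projections, and the normalization is done by the separate `Normalize` module at the end of the pipeline.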
Added files:
- `modules.json`: pipeline of `BGEVLCLIPTransformer` & `Normalize`
- `sentence_bert_config.json`: `feature-extraction` task, multimodal config with `get_text_features`/`get_image_features`
- `config_sentence_transformers.json`: cosine similarity
- `bge_vl_clip_transformer.py`: custom Transformer subclass for composed image+text late fusion
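For context, a Sentence Transformers `modules.json` is an ordered list of module specs. For this model it would look roughly like the following (the `idx`/`name`/`path` values here are illustrative, not copied from the PR):

```json
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "bge_vl_clip_transformer.BGEVLCLIPTransformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Normalize",
    "type": "sentence_transformers.models.Normalize"
  }
]
```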
Once the Sentence Transformers v5.4 release is out, the model can be used immediately like so:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/BGE-VL-base", trust_remote_code=True, revision="refs/pr/7")

query_image = "https://huggingface.co/BAAI/BGE-VL-base/resolve/main/assets/cir_query.png"
candidate_1 = "https://huggingface.co/BAAI/BGE-VL-base/resolve/main/assets/cir_candi_1.png"
candidate_2 = "https://huggingface.co/BAAI/BGE-VL-base/resolve/main/assets/cir_candi_2.png"

# Encode text
text_embeddings = model.encode(["A dog sitting on a bench", "A cat sleeping on a couch"])
print(text_embeddings.shape)
# (2, 512)

# Encode images
image_embeddings = model.encode([query_image, candidate_1])
print(image_embeddings.shape)
# (2, 512)

# Composed image retrieval: encode image+text query, compare with image candidates
query_embeddings = model.encode([{
    "image": query_image,
    "text": "Make the background dark, as if the camera has taken the photo at night",
}])
candidate_embeddings = model.encode([candidate_1, candidate_2])
scores = model.similarity(query_embeddings, candidate_embeddings)
print(scores)
# tensor([[0.2645, 0.1251]])
```
And after merging, the revision argument can be dropped.
Note that none of the existing behaviour is affected or changed; this PR only adds an additional way to run this model in a familiar and common format.
- Tom Aarsen