arxiv:2606.15134

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Published on Jun 13 · Submitted by Shubhang Bhatnagar on Jun 17

Abstract

The SAGA framework uses a frozen multimodal large language model to provide attribute-aware supervision for vision encoders through Group Relative Policy Optimization, improving zero-shot image retrieval performance.

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.
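
For readers unfamiliar with the reported metric, the following is a minimal sketch of how Recall@1 is commonly computed for nearest-neighbour retrieval: each embedding queries all the others, and a hit is counted when the closest one shares the query's class. The function name `recall_at_1`, the use of cosine similarity, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
import math


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest neighbour (excluding the query
    itself) has the same class label. Assumes non-zero embeddings."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    hits = 0
    for i, query in enumerate(embeddings):
        # Candidate indices: every item except the query itself.
        candidates = (j for j in range(len(embeddings)) if j != i)
        nearest = max(candidates, key=lambda j: cosine(query, embeddings[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(embeddings)


# Toy 2-D embeddings (hypothetical): two tight clusters, one per class.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
score = recall_at_1(toy_embeddings, toy_labels)
```

On this toy data every query's nearest neighbour lies in its own cluster, so `score` is 1.0. In the zero-shot setting described above, the test classes would be disjoint from the training classes.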

Community

Paper submitter

SAGA uses a frozen multimodal LLM as the reward model for training a retrieval vision encoder. Think RLVR, but aimed at the encoder's representation rather than LLM reasoning.

We show the MLLM an image pair, ask same class or different, and reward correct verdicts with GRPO. Advantages cancel on the attributes the two images share and concentrate on the ones that differ, so one binary reward becomes dense attribute-level gradients on the encoder, with no attribute labels.
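
A rough sketch of the group-relative step, for anyone who wants to see the arithmetic. It assumes the normalization GRPO is usually described with (reward minus group mean, divided by group standard deviation); whether SAGA uses exactly this form is an assumption, and the names `binary_reward` and `group_relative_advantages` are hypothetical. It only shows how one binary reward per sampled verdict becomes a signed advantage, not how that advantage is propagated to the encoder's tokens.

```python
from statistics import mean, pstdev


def binary_reward(verdict, same_class):
    """1.0 if the sampled same/different verdict matches the label, else 0.0."""
    return 1.0 if verdict == same_class else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Centre and scale the rewards within one group of sampled responses.

    eps keeps the division finite when every response gets the same reward,
    in which case all advantages are zero and the group gives no signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One image pair that truly shares a class, four sampled verdicts
# (True means "same class"); the third sample is wrong.
verdicts = [True, True, False, True]
rewards = [binary_reward(v, same_class=True) for v in verdicts]
advantages = group_relative_advantages(rewards)
```

Correct verdicts end up with positive advantages and the wrong one with a negative advantage, and the advantages within a group sum to zero.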

The MLLM is dropped at inference, so there is no deployment overhead. SAGA gains 3 to 6 points of R@1 over SOTA on CUB, Cars, Aircraft, and iNat-Aves.
Feedback welcome!


