Multi-image / interleaved document embeddings?

#1
by vishaal27 - opened

Hi, thanks for the great work and releasing your model checkpoints: this is very useful for the community.

I have a use case where I want to produce embeddings for a multimodal document, i.e., one with interleaved images and text chunks (for example, documents like those in Obelics or MM-C4). Is it possible to get such multimodal embeddings from e5-omni?

Thanks!

@vishaal27 Hi! Yes — e5-omni is designed to produce a single embedding for any non-empty modality composition (text / image / audio / video), including mixed inputs like text+image in one “item”.
So if your backbone input format supports it, you can represent an interleaved multimodal document as a single sequence (e.g., text chunks with images inserted in order) and extract one embedding.
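
Here is a minimal sketch of what that could look like with a Hugging Face-style processor/model interface. The checkpoint id, the `<image>` placeholder convention, and the mean-pooling step are all assumptions for illustration; check the model card for the exact loading and pooling code.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Hypothetical checkpoint id -- substitute the released e5-omni checkpoint.
model_name = "org/e5-omni"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

# Interleaved document: text chunks with images inserted in reading order.
# The "<image>" placeholder is an assumption; the real processor may use a
# different token or an explicit chat/message format.
text = "Intro paragraph. <image> Caption for figure 1. <image> Closing text."
images = [Image.open("fig1.png"), Image.open("fig2.png")]

inputs = processor(text=text, images=images, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)

# Mean pooling over the sequence, then L2-normalize -- swap in whatever
# pooling (e.g., last-token) the model was actually trained with.
embedding = torch.nn.functional.normalize(hidden.mean(dim=1), dim=-1)
```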

That said, for Obelics / MM-C4-style long documents you'll likely hit context limits (we trained with a relatively short max length). In practice we recommend chunking the document into smaller interleaved segments, embedding each segment, and then either indexing the chunk embeddings for retrieval or pooling them (e.g., mean or attention pooling) to form a document-level embedding.
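
As a sketch of the chunk-then-pool option: the `embed_document` and `embed_segment` names here are hypothetical, and `embed_segment` can be any per-segment embedder (e.g., a wrapper around the snippet above).

```python
import torch
from typing import Callable, List, Tuple

def embed_document(
    segments: List[Tuple[str, list]],
    embed_segment: Callable[[str, list], torch.Tensor],
) -> torch.Tensor:
    """Mean-pool per-segment embeddings into one document embedding.

    segments: interleaved (text, images) pieces, each small enough to fit
    within the model's max context length.
    embed_segment: any callable returning an L2-normalized (dim,) vector.
    """
    chunk_embs = torch.stack([embed_segment(t, imgs) for t, imgs in segments])
    # Alternative: skip pooling and index each row of chunk_embs for
    # fine-grained retrieval over segments instead of whole documents.
    doc_emb = chunk_embs.mean(dim=0)
    return torch.nn.functional.normalize(doc_emb, dim=-1)
```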

Haon-Chen changed discussion status to closed
