OmniSIFT-7B

OmniSIFT-7B is the 7B checkpoint for OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models.

OmniSIFT stands for Omni-modal Spatio-temporal Informed Fine-grained Token compression. This checkpoint uses a modified Qwen2.5-Omni architecture and cannot be loaded with the stock transformers.Qwen2_5OmniForConditionalGeneration class directly.

The custom implementation lives in omnisift/; the repository is laid out as follows:

README.md
omnisift/
  modeling_qwen2_5_omni.py
  compression_units.py
OmniSIFT-7B/
  config.json
  model-*.safetensors
  tokenizer and processor files
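
Before importing the custom code, it can help to confirm that a downloaded snapshot has the layout listed above. The following is a minimal sketch that assumes only the paths shown in the tree; `missing_paths` and `EXPECTED` are hypothetical names for illustration, not part of the repository.

```python
from pathlib import Path

# Paths taken from the repository layout above, relative to the snapshot root.
EXPECTED = (
    "omnisift/modeling_qwen2_5_omni.py",
    "omnisift/compression_units.py",
    "OmniSIFT-7B/config.json",
)


def missing_paths(repo_dir, expected=EXPECTED):
    """Return the expected relative paths that do not exist under repo_dir."""
    root = Path(repo_dir)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths(repo_dir)` on the directory returned by `snapshot_download` should give an empty list when the snapshot is complete.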

Inference

import sys
import torch
from huggingface_hub import snapshot_download
from qwen_omni_utils import process_mm_info
from transformers import Qwen2_5OmniProcessor

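# Download the full repository: the custom code and the checkpoint folder.
repo_dir = snapshot_download("dingyue1011/OmniSIFT-7B")
model_dir = f"{repo_dir}/OmniSIFT-7B"
# Make the bundled omnisift package importable.
sys.path.append(repo_dir)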

from omnisift.modeling_qwen2_5_omni import (
    Qwen2_5OmniForConditionalGeneration,
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
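# Text-only output: drop the talker (speech generator) to save GPU memory.
model.disable_talker()
model.thinker.omnizip_config = None
# Compression ratios applied at inference time (see Notes).
model.thinker.compression_config = {
    "rho_audio": 0.5,
    "rho_video": 0.77,
}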
processor = Qwen2_5OmniProcessor.from_pretrained(model_dir)

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Please describe the video."},
        ],
    },
]

use_audio_in_video = True
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
audios, images, videos = process_mm_info(
    conversation, use_audio_in_video=use_audio_in_video
)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=use_audio_in_video,
)
inputs = inputs.to(model.device).to(model.dtype)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        use_audio_in_video=use_audio_in_video,
        do_sample=False,
        return_audio=False,
        max_new_tokens=512,
        use_cache=True,
    )

output = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(output)
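
The conversation structure used above can be wrapped in a small function so that the same system prompt is reused across videos. This is a sketch: `build_conversation` and `SYSTEM_PROMPT` are hypothetical names, and the message schema is copied from the example above.

```python
# System prompt copied verbatim from the inference example.
SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)


def build_conversation(video_path, prompt, system_prompt=SYSTEM_PROMPT):
    """Build a system + user message list in the format used above."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt},
            ],
        },
    ]


conversation = build_conversation("/path/to/video.mp4", "Please describe the video.")
```

The resulting list can be passed to `processor.apply_chat_template` and `process_mm_info` exactly as in the example.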

Notes

Use the included omnisift code when loading this checkpoint.
rho_audio and rho_video control the audio/video compression ratios at inference time.
The model was trained from Qwen2.5-Omni and keeps the original processor/tokenizer files.
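
Because rho_audio and rho_video are plain entries in model.thinker.compression_config, it should be possible to try other settings by reassigning that dictionary before calling generate. The sketch below rests on assumptions: `make_compression_config` is a hypothetical helper, and the check that each value lies in [0, 1] is a guess at the valid range that this card does not confirm.

```python
def make_compression_config(rho_audio=0.5, rho_video=0.77):
    """Build the dict assigned to model.thinker.compression_config.

    Defaults are the values from the inference example. The [0, 1] range
    check is an assumption; the card does not state the valid range.
    """
    for name, value in (("rho_audio", rho_audio), ("rho_video", rho_video)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return {"rho_audio": rho_audio, "rho_video": rho_video}


# With a loaded model (not executed here):
# model.thinker.compression_config = make_compression_config(rho_video=0.5)
```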

Base model: Qwen/Qwen2.5-Omni-7B (this model is a fine-tune of it).

Paper: OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models (2602.04804, published Feb 4).