OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Paper: arXiv:2602.04804
OmniSIFT-7B is the 7B checkpoint for OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models.
OmniSIFT stands for Omni-modal Spatio-temporal Informed Fine-grained Token compression. This checkpoint uses a modified Qwen2.5-Omni architecture and cannot be loaded with the stock transformers.Qwen2_5OmniForConditionalGeneration class directly.
The custom implementation is included in this repository under:
README.md
omnisift/
    modeling_qwen2_5_omni.py
    compression_units.py
OmniSIFT-7B/
    config.json
    model-*.safetensors
    tokenizer and processor files
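Before importing the bundled code, it can help to confirm that a downloaded snapshot matches the layout above. The following is a minimal sketch: `check_layout` is a hypothetical helper, not part of this repository, and it checks only the entries named in the listing.

```python
from pathlib import Path


def check_layout(repo_dir):
    """Return the expected entries that are missing under repo_dir.

    Hypothetical helper: the expected paths are taken from the listing above.
    """
    root = Path(repo_dir)
    expected = [
        "omnisift/modeling_qwen2_5_omni.py",
        "omnisift/compression_units.py",
        "OmniSIFT-7B/config.json",
    ]
    missing = [rel for rel in expected if not (root / rel).is_file()]
    # Weight shards follow the model-*.safetensors pattern from the listing.
    if not any((root / "OmniSIFT-7B").glob("model-*.safetensors")):
        missing.append("OmniSIFT-7B/model-*.safetensors")
    return missing
```

Given the local path returned by `snapshot_download`, an empty list from `check_layout` means the listed files are present.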
Example usage:

import sys
import torch
from huggingface_hub import snapshot_download
from qwen_omni_utils import process_mm_info
from transformers import Qwen2_5OmniProcessor
repo_dir = snapshot_download("dingyue1011/OmniSIFT-7B")
model_dir = f"{repo_dir}/OmniSIFT-7B"
sys.path.append(repo_dir)
from omnisift.modeling_qwen2_5_omni import (
    Qwen2_5OmniForConditionalGeneration,
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)
model.disable_talker()
model.thinker.omnizip_config = None
model.thinker.compression_config = {
    "rho_audio": 0.5,
    "rho_video": 0.77,
}
processor = Qwen2_5OmniProcessor.from_pretrained(model_dir)
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "/path/to/video.mp4"},
            {"type": "text", "text": "Please describe the video."},
        ],
    },
]
use_audio_in_video = True
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
audios, images, videos = process_mm_info(
    conversation, use_audio_in_video=use_audio_in_video
)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=use_audio_in_video,
)
inputs = inputs.to(model.device).to(model.dtype)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        use_audio_in_video=use_audio_in_video,
        do_sample=False,
        return_audio=False,
        max_new_tokens=512,
        use_cache=True,
    )

output = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(output)
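The `compression_config` dictionary in the example takes two keys, `rho_audio` and `rho_video`. When trying several settings, a small helper can catch typos and out-of-range values before they reach the model. This is a sketch only: `make_compression_config` is a hypothetical name, and the assumption that both values are fractions in [0, 1] is inferred from the example values 0.5 and 0.77, not stated in this repository.

```python
def make_compression_config(rho_audio=0.5, rho_video=0.77):
    """Build a dict with the same keys as the example's compression_config.

    Assumption (not confirmed by the repository): both ratios are
    fractions in the closed interval [0, 1].
    """
    config = {"rho_audio": float(rho_audio), "rho_video": float(rho_video)}
    for name, value in config.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return config
```

The result can be assigned exactly as in the example, for instance `model.thinker.compression_config = make_compression_config(0.5, 0.77)`.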
Use the bundled omnisift code when loading this checkpoint.
rho_audio and rho_video control the audio/video compression ratios at inference time.

Base model: Qwen/Qwen2.5-Omni-7B