arxiv:2603.18558

HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Published on Mar 19 · Submitted by Gabriele Serussi on Mar 23
Abstract

HiMu is a training-free framework for long-form video question answering that uses a hierarchical logic tree decomposition and lightweight experts to efficiently select relevant video frames while preserving temporal structure and cross-modal bindings.

AI-generated summary

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.
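The bottom-up fuzzy-logic composition described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes Gödel operators (element-wise min for AND, max for OR) and a simple "B after A" semantics in which A's score is propagated forward via a running maximum; the predicate names and toy scores are invented for the example.

```python
import numpy as np

def fuzzy_and(a, b):
    # Gödel t-norm: conjunction as element-wise minimum.
    return np.minimum(a, b)

def fuzzy_or(a, b):
    # Gödel t-conorm: disjunction as element-wise maximum.
    return np.maximum(a, b)

def after(a, b):
    # Satisfaction of "B after A" at frame t: B must hold at t,
    # and A must have held at some strictly earlier frame.
    a_so_far = np.maximum.accumulate(a)          # best A score seen up to t
    prior_a = np.concatenate(([0.0], a_so_far[:-1]))
    return fuzzy_and(prior_a, b)

# Toy per-frame predicate scores in [0, 1] for a 6-frame clip.
chart = np.array([0.1, 0.9, 0.8, 0.2, 0.1, 0.1])   # "a chart is visible"
speech = np.array([0.0, 0.1, 0.2, 0.9, 0.8, 0.3])  # "the speaker is talking"

curve = after(chart, speech)   # satisfaction of "speech after the chart appeared"
best = int(np.argmax(curve))   # frame where the compositional query peaks
```

The running maximum is one way to make "A happened earlier" differentiable-free and monotone; the paper's actual sequencing and adjacency operators may differ.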

Community

Paper author · Paper submitter

TL;DR: HiMu is a training-free frame selector for long video QA that gets the best of both worlds — structured reasoning without expensive LVLM calls.
The problem: When answering questions about long videos, you need to pick the right frames. Fast methods (like CLIP retrieval) work okay for simple queries but fail on compositional ones ("What did the speaker say after the chart appeared?"). Smart methods (agent-based selectors) handle these well but cost 10–100× more compute.
How HiMu works: One cheap LLM call breaks the question into a logic tree of simple checks — visual appearance (CLIP), object detection, OCR, speech (ASR), and audio (CLAP). Each check scores every frame independently, then fuzzy logic combines the scores into a final ranking. No iterative LVLM reasoning needed.
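The scoring-and-combination step above can be sketched end to end. Everything here is an assumed stand-in for illustration: min-max normalization, a moving-average smoother for cross-modal alignment, and a minimum t-norm to AND the checks; the expert scores are toy numbers, not outputs of CLIP or ASR.

```python
import numpy as np

def minmax_normalize(scores, eps=1e-8):
    # Put heterogeneous expert outputs on a common [0, 1] scale.
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + eps)

def smooth(scores, window=3):
    # Moving average: aligns signals sampled at different rates/modalities.
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="same")

def select_frames(expert_scores, k=2):
    # Normalize and smooth each check, AND them with a minimum t-norm,
    # then return the top-k frame indices in temporal order.
    curves = [smooth(minmax_normalize(s)) for s in expert_scores]
    combined = np.minimum.reduce(curves)
    topk = np.argsort(combined)[::-1][:k]
    return sorted(topk.tolist())

clip_scores = np.array([0.2, 0.7, 0.9, 0.4, 0.3])  # e.g. visual similarity
asr_scores = np.array([0.1, 0.5, 0.8, 0.9, 0.2])   # e.g. speech keyword match
frames = select_frames([clip_scores, asr_scores], k=2)
```

The selected frames would then be passed to any downstream LVLM, which is what makes the selector a drop-in module.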
Results: Matches or beats agent-based methods on Video-MME, LongVideoBench, and HERBench-Lite — at a fraction of the cost. Works as a drop-in module in front of any LVLM.
Paper: https://arxiv.org/abs/2603.18558
