Abstract: Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence. Employing large language models (LLMs) to comprehend video has become an emerging and promising approach. However, this approach incurs high computational costs because of the large number of video tokens, suffers reduced visual clarity as a consequence of token aggregation, and is confronted with irrelevant visual tokens when answering video-related questions. To alleviate these issues, we present an Interactive Visual Adapter (IVA) within LLMs, designed to enhance interaction with fine-grained visual elements. Specifically, we first transform long videos into temporal video tokens by leveraging a visual encoder alongside a pretrained causal transformer, then feed them into LLMs together with the video instructions. Subsequently, we integrate IVA, which contains a lightweight temporal frame selector and a spatial feature interactor, within the internal blocks of LLMs to capture instruction-aware and fine-grained visual signals. Consequently, the proposed video-LLM facilitates a comprehensive understanding of long video content through appropriate long video modeling and precise visual interactions. We conducted extensive experiments on nine video understanding benchmarks, and the experimental results show that our interactive visual adapter significantly improves the performance of video LLMs on long video QA tasks. Ablation studies further verify the effectiveness of IVA in understanding both long and short videos.
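
The first two issues named in the abstract, token count and loss of detail under aggregation, can be illustrated with a small sketch. The frame and patch counts below are hypothetical numbers chosen for illustration, not values from the paper, and mean pooling stands in for whatever aggregation a given video LLM uses.

```python
def token_count(num_frames, patches_per_frame, pooled=False):
    """Number of visual tokens handed to the LLM.

    With pooling, each frame is collapsed to a single token;
    without it, every patch of every frame becomes a token.
    """
    return num_frames if pooled else num_frames * patches_per_frame


def mean_pool(tokens):
    """Average a list of equal-length feature vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


# Hypothetical setting: 256 sampled frames, 256 patch tokens per frame.
full = token_count(256, 256)                  # every patch kept
pooled = token_count(256, 256, pooled=True)   # one token per frame

# Two patches with clearly different features...
frame = [[1.0, 0.0], [0.0, 1.0]]
# ...are merged into one vector that no longer distinguishes them.
pooled_frame = mean_pool(frame)

print(full, pooled, pooled_frame)
```

Keeping every patch makes the sequence far longer, while pooling shortens it at the price of blending distinct patch features together, which is the trade-off the abstract describes.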

What problem does this paper attempt to address?

The main problems this paper attempts to solve are several key challenges in long-video understanding, especially those encountered when using large language models (LLMs) to process long videos. Specifically, these problems include:

1. **High computational cost**: Because long videos contain a large number of frames, processing them with LLMs incurs high computational costs.
2. **Decreased visual clarity**: Aggregating video frames (for example, by average or max pooling of their representations) reduces visual clarity.
3. **Irrelevant visual information leading to wrong answers**: When the information relevant to the question is embedded in long-range temporal cues, irrelevant visual information may lead to wrong answers.

To alleviate these problems, the authors propose a method named **Interactive Visual Adapter (IVA)**, which aims to enhance the interaction between LLMs and fine-grained visual elements. Specifically, IVA contains a lightweight temporal frame selector and a spatial feature interactor. Through these components, LLMs can more effectively capture instruction-related, fine-grained visual signals, thereby improving performance on long-video question-answering tasks.

### Main contributions:

- **Analyzed the challenges of long-video modeling and proposed the Interactive Visual Adapter (IVA)**, enabling LLMs to interact deeply with long videos based on efficient video tokens and the IVA mechanism.
- **Designed a parameter-sharing IVA architecture** that contains an instruction-aware temporal frame selector and a spatial feature interactor, which can select relevant frames and interact with their fine-grained spatial features.
- **Experimental results show** that LLMs with IVA exhibit strong performance on long-video question-answering tasks, and ablation studies further verify the key role and effectiveness of IVA.

### Experimental verification:

- **Dataset**: The authors conducted extensive experiments on four long-video question-answering benchmarks and five short-video understanding benchmarks.
- **Performance improvement**: The experimental results show that the model with IVA significantly outperforms the baseline model and other strong video LLMs on multiple long-video and short-video benchmarks.
- **Ablation study**: Through ablation studies, the authors verified the effectiveness of the IVA module, with especially significant performance improvements on long-video datasets.

In conclusion, by introducing the Interactive Visual Adapter (IVA), this paper addresses key challenges in long-video understanding and significantly improves the performance of LLMs on long-video question-answering tasks.
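
The two IVA components described above, a temporal frame selector followed by a spatial feature interactor, can be sketched roughly as follows. This is an illustrative toy and not the authors' implementation: the summary does not say how frames are scored or how features are fused, so dot-product scoring with top-k selection, single-head attention without learned projections, and residual fusion are all assumptions, as are the function names `select_frames`, `spatial_interact`, and `iva_step`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_frames(query, frame_feats, k):
    """Assumed selector: keep the k frames whose global feature has the
    highest dot product with the instruction-aware query, returned in
    temporal order."""
    scores = [dot(query, f) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def spatial_interact(query, patch_feats):
    """Assumed interactor: single-head scaled dot-product attention of the
    query over patch features (no learned projections in this toy)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, p) / scale for p in patch_feats])
    dim = len(patch_feats[0])
    return [sum(w * p[i] for w, p in zip(weights, patch_feats)) for i in range(dim)]


def iva_step(query, frame_feats, frame_patches, k=2):
    """Select relevant frames, attend over their patches, and add the
    result back to the query (residual fusion is an assumption)."""
    selected = select_frames(query, frame_feats, k)
    patches = [p for idx in selected for p in frame_patches[idx]]
    visual = spatial_interact(query, patches)
    return selected, [q + v for q, v in zip(query, visual)]


# Toy inputs: 4 frames, 2 patches each, feature dimension 2.
query = [1.0, 0.0]
frame_feats = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2], [0.1, 0.9]]
frame_patches = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9]],
    [[0.9, 0.1], [0.7, 0.3]],
    [[0.2, 0.8], [0.0, 1.0]],
]
selected, fused = iva_step(query, frame_feats, frame_patches, k=2)
print(selected, fused)
```

In this toy, only the patches of the selected frames reach the attention step, which mirrors the stated goal of avoiding irrelevant visual tokens while still exposing fine-grained spatial features of the relevant frames.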

LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs

LLMs Meet Long Video: Advancing Long Video Comprehension with an Interactive Visual Adapter in LLMs.

LongVLM: Efficient Long Video Understanding via Large Language Models

Retrieval-based Video Language Model for Efficient Long Video Question Answering

Streaming Long Video Understanding with Large Language Models

Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner

ST-LLM: Large Language Models Are Effective Temporal Learners

VideoQA in the Era of LLMs: An Empirical Study

A Simple LLM Framework for Long-Range Video Question-Answering

Audio-Visual LLM for Video Understanding

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Visual Context Window Extension: A New Perspective for Long Video Understanding

Understanding Long Videos with Multimodal Language Models

Long Context Transfer from Language to Vision

Koala: Key frame-conditioned long video-LLM

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

VideoLLM: Modeling Video Sequence with Large Language Models

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

Video Understanding with Large Language Models: A Survey