Abstract: A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., the key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach that reduces vision compute by having redundant vision tokens "skip layers" rather than by decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, saving roughly 42% of time and 30% of memory over the entire training. Moreover, our method reduces the computation in the context and avoids decreasing the number of vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks on the COIN, Ego4D, and Ego-Exo4D datasets.

What problem does this paper attempt to address?

The paper addresses a trade-off in large vision-language models (such as GPT-4 and LLaVA): increasing the number of visual tokens improves visual understanding, but it also significantly increases memory and computational costs, especially for long, dense video frame streams. Existing methods (such as Q-Former and Perceiver Resampler) lighten the load by reducing the number of visual tokens, but they may ignore the context causally modeled by the large language model (LLM), i.e., the key-value cache, and so lose important visual cues when processing user queries.

To solve this problem, the authors propose **VideoLLM-MoD**. Instead of reducing the number of visual tokens, the method reduces visual computation by letting redundant visual tokens skip layers. Specifically, for each Transformer layer, it learns to skip the computation for a high proportion (for example, 80%) of visual tokens and pass them directly to the next layer. This lowers the demand for computational resources while avoiding the performance degradation caused by removing visual tokens, yielding about 42% time savings and 30% memory savings during training while maintaining or improving performance. The paper verifies the effectiveness of VideoLLM-MoD through extensive experiments and reports state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks on the COIN, Ego4D, and Ego-Exo4D datasets.

### Main contributions

1. **Propose VideoLLM-MoD**: An efficient online video large language model that maintains or even improves performance while reducing computational costs.
2. **Introduce the LayerExpert module**: Determines which visual tokens need to be processed in a specific layer, enabling the model to adaptively allocate computation to key areas.
3. **Experimental proof of effectiveness**: Experiments on the Ego4D, Ego-Exo4D, and COIN benchmarks show that VideoLLM-MoD is effective and generalizes well.

### Key points of the solution

- **Skip redundant visual tokens**: Skipping the computation for redundant visual tokens in certain layers removes unnecessary computational overhead.
- **Maintain context integrity**: Compared with methods that directly discard or merge visual tokens, the skipping mechanism keeps every token in the context, thus avoiding performance degradation.
- **Dynamically allocate computational resources**: The LayerExpert module dynamically selects key visual tokens for processing and optimizes the use of computational resources.

This method provides an efficient and effective solution for online video processing, especially for application scenarios that require real-time responses.
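The layer-skipping idea summarized above can be illustrated with a minimal, dependency-free sketch. It is a generic mixture-of-depths style routing step, not the authors' LayerExpert implementation: the names `select_tokens`, `mod_layer`, `router_w`, and `layer_fn` are hypothetical, the router is assumed to be a plain linear scorer, and text tokens are assumed to be always processed. A trainable version would typically also scale the processed output by the router score so that the router receives gradients; that step is omitted here.

```python
def select_tokens(scores, is_vision, keep_ratio):
    """Pick the token positions that a layer will actually process.

    Assumption: text tokens are always processed; among vision tokens,
    only the top `keep_ratio` fraction by router score is processed.
    """
    vision_idx = [i for i, v in enumerate(is_vision) if v]
    text_idx = [i for i, v in enumerate(is_vision) if not v]
    k = max(1, round(keep_ratio * len(vision_idx))) if vision_idx else 0
    top_vision = sorted(vision_idx, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(text_idx + top_vision)


def mod_layer(hidden, is_vision, router_w, layer_fn, keep_ratio=0.2):
    """Apply `layer_fn` to the selected tokens only.

    hidden:    list of token vectors (lists of floats)
    router_w:  weights of a hypothetical linear router (one per feature)
    layer_fn:  stand-in for a transformer layer; maps a list of vectors
               to a list of updated vectors of the same length
    """
    # Router score per token: dot product with the router weights.
    scores = [sum(w * x for w, x in zip(router_w, h)) for h in hidden]
    selected = select_tokens(scores, is_vision, keep_ratio)

    # Only the selected tokens go through the (expensive) layer.
    updated = layer_fn([hidden[i] for i in selected])

    # Skipped tokens are copied to the output unchanged, so the sequence
    # length, and therefore the context, stays intact.
    out = [list(h) for h in hidden]
    for i, new_h in zip(selected, updated):
        out[i] = new_h
    return out, selected


# Toy example: 10 vision tokens followed by 2 text tokens, 2 features each.
hidden = [[float(i), 0.0] for i in range(10)] + [[0.0, 0.0], [0.0, 0.0]]
is_vision = [True] * 10 + [False] * 2
toy_layer = lambda xs: [[v + 1.0 for v in x] for x in xs]

out, selected = mod_layer(hidden, is_vision, [1.0, 0.0], toy_layer, keep_ratio=0.2)
print(selected)  # positions processed by the layer
```

With `keep_ratio=0.2`, 80% of the vision tokens bypass the layer, yet all 12 tokens are still present in the output. This is the property that distinguishes skipping from token pruning or merging: the number of tokens in the context does not shrink.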

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

VideoLLM-online: Online Video Large Language Model for Streaming Video

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Streaming Long Video Understanding with Large Language Models

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

Efficient Large Multi-modal Models via Visual Context Compression

EVLM: An Efficient Vision-Language Model for Visual Understanding

Efficient Multi-modal Large Language Models via Visual Token Grouping

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Visual Context Window Extension: A New Perspective for Long Video Understanding

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

LongVLM: Efficient Long Video Understanding via Large Language Models

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding