VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou
2024-08-30
Abstract: A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, achieving approximately 42% time and 30% memory savings for the entire training. Moreover, our method reduces the computation in the context and avoids decreasing the vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing its state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks in COIN, Ego4D, and Ego-Exo4D datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a trade-off in large vision-language models (such as GPT-4 and LLaVA): increasing the number of visual tokens improves visual understanding, but it also significantly raises memory and computational costs, especially for long-term, dense video frame streaming. Existing methods such as Q-Former and Perceiver Resampler lighten the load by reducing the number of visual tokens, but they may ignore the context causally modeled by the large language model (LLM), i.e., the key-value cache, and so can lose important visual cues when processing user queries.

To solve this problem, the authors propose a new method, **VideoLLM-MoD**. Instead of reducing the number of visual tokens, it reduces visual computation by letting redundant visual tokens skip layers. Specifically, for each Transformer layer, the model learns to skip the computation for a high proportion (for example, 80%) of visual tokens and passes them directly to the next layer. This lowers the demand for computational resources while avoiding the performance degradation caused by removing visual tokens, yielding about 42% time savings and 30% memory savings over the entire training while maintaining or even improving performance. The paper verifies the effectiveness of VideoLLM-MoD through extensive experiments and reports state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks on the COIN, Ego4D, and Ego-Exo4D datasets.

### Main contributions

1. **Propose VideoLLM-MoD**: An efficient large language model for online video that maintains or even improves performance while reducing computational costs.
2. **Introduce the LayerExpert module**: Determines which visual tokens need to be processed in a specific layer, enabling the model to adaptively allocate computation to key areas.
3. **Experimental validation of effectiveness**: Experiments on the Ego4D, Ego-Exo4D, and COIN benchmarks show that VideoLLM-MoD is effective and generalizes well.

### Key points of the solution

- **Skip redundant visual tokens**: Skipping the computation for redundant visual tokens in certain layers removes unnecessary computational overhead.
- **Maintain context integrity**: Unlike methods that directly discard or merge visual tokens, the skipping mechanism keeps every token in the context, thus avoiding performance degradation.
- **Dynamically allocate computational resources**: The LayerExpert module dynamically selects key visual tokens for processing, optimizing the use of computational resources.

This method provides an efficient and effective solution for online video processing, especially for application scenarios that require real-time responses.
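The layer-skipping idea described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the names `router_scores`, `mod_layer`, `layer_fn`, and `keep_ratio` are hypothetical, the router is assumed to be a simple linear scorer standing in for LayerExpert, and `layer_fn` is a per-token stand-in for a full attention-plus-MLP block. What it shows is the routing pattern: within one layer, only the top-scoring fraction of vision tokens is processed, the remaining vision tokens pass through unchanged, and no token is dropped from the sequence.

```python
def router_scores(tokens, router_weights):
    """Score each token with a linear router (assumed form, for illustration)."""
    return [sum(x * w for x, w in zip(tok, router_weights)) for tok in tokens]


def mod_layer(tokens, is_vision, router_weights, layer_fn, keep_ratio=0.2):
    """Apply one layer with mixture-of-depths style routing over vision tokens.

    Non-vision tokens are always processed. Among vision tokens, only the
    top `keep_ratio` fraction by router score is processed; the rest skip
    the layer and are copied through unchanged.
    Returns (output_tokens, set_of_processed_vision_indices).
    """
    vision_idx = [i for i, flag in enumerate(is_vision) if flag]
    scores = router_scores(tokens, router_weights)
    # Number of vision tokens to process in this layer (at least one if any exist).
    k = max(1, round(keep_ratio * len(vision_idx))) if vision_idx else 0
    selected = set(sorted(vision_idx, key=lambda i: scores[i], reverse=True)[:k])

    out = []
    for i, tok in enumerate(tokens):
        if not is_vision[i] or i in selected:
            out.append(layer_fn(tok))   # processed by the layer
        else:
            out.append(list(tok))       # skipped: passed to the next layer as-is
    return out, selected


# Toy usage: 10 vision tokens followed by 2 text tokens, 2-dimensional features.
tokens = [[float(i), 1.0] for i in range(12)]
is_vision = [True] * 10 + [False] * 2
out, selected = mod_layer(
    tokens,
    is_vision,
    router_weights=[1.0, 0.0],               # hypothetical router parameters
    layer_fn=lambda t: [x + 100.0 for x in t],  # stand-in for the real layer
    keep_ratio=0.2,                           # process 20% of vision tokens, skip 80%
)
print(sorted(selected), len(out))
```

Because skipped tokens are copied through rather than removed, the output sequence has the same length as the input, which is the property the summary contrasts with token-reduction approaches such as Q-Former.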