LinVT: Empower Your Image-level Large Language Model to Understand Videos

Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao
2024-12-07
Abstract: Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
Computer Vision and Pattern Recognition, Machine Learning, Multimedia
What problem does this paper attempt to address?
The core problem this paper addresses is how to turn existing, already-trained image-level large language models (image-LLMs) into video-level large language models (video-LLMs) that can understand and process video content, without training those models from scratch. This saves substantial computing resources and time, and it reuses the image-understanding capabilities the existing models already have.

Specifically, the paper proposes a module named **Linear Video Tokenizer (LinVT)**, which can be plugged into existing image-LLMs so that they can process videos. To achieve this, the paper introduces two design principles:

1. **Linear transformation**: The video-LLM should retain the original visual-language alignment. To that end, the module's output is a linear combination of its input.
2. **Representative information condensation**: The module extracts representative information from lengthy, redundant video content, which reduces unnecessary computation and adapts to video events of different lengths.

Guided by these two principles, LinVT adds video understanding to the original image-LLM without undermining its existing performance. Experimental results show that LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, which supports the effectiveness and efficiency of the method.

### Formula Explanation

Some key formulas involved in the description are as follows:

- **Linear transformation**: Let the input visual token sequence be \(\mathbf{V}=\{V_1, V_2, \dots, V_T\}\), where \(V_i\) represents the visual feature of the \(i\)-th frame. After the linear transformation, the output visual token is:
  \[ V_{\text{out}} = \sum_{i=1}^{T} w_i V_i \]
  where \(w_i\) is the weight coefficient of the \(i\)-th frame, so the output is a linear combination of the input.
- **Multi-scale token pooling**: To capture information at different time scales, multi-scale token pooling (MTP) is used. For each scale \(l\), the output token is:
  \[ T_l = \text{AvgPool}_l(V_{\text{selected}}) \]
  where \(V_{\text{selected}}\) is the visual token sequence after top-k selection and \(\text{AvgPool}_l\) denotes average pooling applied at scale \(l\).

Together, these formulas show how the LinVT design follows the two principles above: the output stays a linear combination of the input tokens, while top-k selection and pooling condense the video into fewer tokens.
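
The two formulas above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the per-token `scores` used for top-k selection, and the choice of non-overlapping pooling windows are all assumptions made only for illustration, since the summary does not specify them.

```python
import numpy as np


def linear_combine(tokens, weights):
    """V_out = sum_i w_i * V_i for tokens of shape (T, D) and weights of shape (T,)."""
    tokens = np.asarray(tokens, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ tokens


def top_k_select(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order.

    `scores` is a hypothetical per-token importance score; the summary does
    not say how the selection score is computed.
    """
    tokens = np.asarray(tokens, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep]


def multi_scale_avg_pool(selected, window_sizes):
    """Average-pool the selected tokens once per scale.

    Assumption: scale l means non-overlapping windows of l tokens; trailing
    tokens that do not fill a whole window are dropped.
    """
    selected = np.asarray(selected, dtype=float)
    dim = selected.shape[1]
    pooled = {}
    for size in window_sizes:
        n_windows = len(selected) // size
        trimmed = selected[: n_windows * size]
        pooled[size] = trimmed.reshape(n_windows, size, dim).mean(axis=1)
    return pooled


# Toy input: 6 frame tokens with 2 feature dimensions each.
tokens = np.arange(12, dtype=float).reshape(6, 2)
uniform = np.full(6, 1.0 / 6.0)
v_out = linear_combine(tokens, uniform)

scores = [0.1, 0.9, 0.3, 0.8, 0.7, 0.2]
selected = top_k_select(tokens, scores, k=4)
multi_scale = multi_scale_avg_pool(selected, window_sizes=(1, 2, 4))
```

Each step is a linear map of the input tokens once the weights and the selected indices are fixed, which is the property the first design principle asks for. Larger windows yield fewer tokens per scale, which is one simple way to picture the condensation described by the second principle.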