Abstract: Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.

What problem does this paper attempt to address?

The core problem this paper addresses is how to turn existing, already-trained image-level large language models (image-LLMs) into video-level large language models (video-LLMs) that can understand and process video content, without training them from scratch. This saves considerable compute and time, and it reuses the image-understanding capability that the existing models already have.

Specifically, the paper proposes a module named **Linear Video Tokenizer (LinVT)**, which can be inserted into an existing image-LLM to let it process videos. To achieve this, the paper introduces two design principles:

1. **Linear transformation**: The output of the module is a linear combination of its input, so the video-LLM retains the original visual-language alignment of the image-LLM.
2. **Representative information condensation**: The module extracts representative information from long, redundant video content, which reduces unnecessary computation and adapts to video events of different lengths.

Guided by these two principles, LinVT adds video understanding to the original image-LLM without undermining its existing performance. Experimental results show that LinVT-based LLMs achieve state-of-the-art performance on multiple video-understanding benchmarks, which the paper presents as evidence of the method's effectiveness and efficiency.

### Formula Explanation

Some key formulas involved in the description are as follows:

- **Linear transformation**: Let the input visual token sequence be \(\mathbf{V}=\{V_1, V_2,\dots, V_T\}\), where \(V_i\) represents the visual feature of the \(i\)-th frame. After linear transformation, the output visual token is: \[ V_{\text{out}}=\sum_{i = 1}^{T}w_iV_i \] where \(w_i\) is the weight coefficient of the \(i\)-th input, so the output is a linear combination of the input.
- **Multi-scale token pooling**: To capture information at different time scales, multi-scale token pooling (MTP) is used. For each scale \(l\), the output token is obtained by average pooling at that scale: \[ T_l=\text{AvgPool}(V_{\text{selected}}) \] where \(V_{\text{selected}}\) is the visual token sequence after top-k selection.

Selection and average pooling are themselves linear operations on the input tokens, so these formulas keep the LinVT module consistent with the two design principles above.
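The two formulas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: the function names (`linear_combine`, `top_k_select`, `avg_pool`, `multi_scale_pool`), the use of externally supplied scores for top-k selection, and the reading of "scale" as a non-overlapping pooling window size are all hypothetical choices, since the summary does not specify them.

```python
# Hypothetical sketch; names, scores, and window sizes are assumptions,
# not taken from the LinVT paper or its code.

def linear_combine(tokens, weights):
    """Return sum_i weights[i] * tokens[i] for equal-length feature vectors."""
    if len(tokens) != len(weights):
        raise ValueError("need exactly one weight per token")
    dim = len(tokens[0])
    out = [0.0] * dim
    for w, tok in zip(weights, tokens):
        for d in range(dim):
            out[d] += w * tok[d]
    return out


def top_k_select(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


def avg_pool(tokens, window):
    """Non-overlapping average pooling along the token axis.

    Each pooled token is a linear combination with weights 1/len(chunk),
    so pooling never leaves the span of the input tokens.
    """
    pooled = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        pooled.append(linear_combine(chunk, [1.0 / len(chunk)] * len(chunk)))
    return pooled


def multi_scale_pool(tokens, windows):
    """Pool the same selected tokens at several window sizes (one per scale)."""
    return {w: avg_pool(tokens, w) for w in windows}


if __name__ == "__main__":
    # Four toy "frame" tokens with two feature dimensions each.
    frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    selected = top_k_select(frames, scores=[0.1, 0.9, 0.5, 0.7], k=2)
    print(selected)                          # [[0.0, 1.0], [4.0, 0.0]]
    print(multi_scale_pool(frames, [2, 4]))  # {2: [[0.5, 0.5], [3.0, 1.0]], 4: [[1.75, 0.75]]}
```

Because every step only selects or averages existing tokens, each output is a weighted sum of input tokens, which is the property the first design principle asks for. In a real model the weights and scores would presumably be learned rather than fixed as they are here.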

LinVT: Empower Your Image-level Large Language Model to Understand Videos
