InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Yi Wang,Kunchang Li,Xinhao Li,Jiashuo Yu,Yinan He,Chenting Wang,Guo Chen,Baoqi Pei,Ziang Yan,Rongkun Zheng,Jilan Xu,Zun Wang,Yansong Shi,Tianxiang Jiang,Songze Li,Hongjie Zhang,Yifei Huang,Yu Qiao,Yali Wang,Limin Wang
2024-08-14
Abstract: We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of building a video foundation model (ViFM) that can effectively understand and process multimodal video. Specifically, it introduces InternVideo2, a new family of video foundation models that achieves state-of-the-art performance on video recognition, video-text tasks, and video-centric dialogue.

### Main Issues

1. **Video Representation Learning**: How to learn transferable spatiotemporal representations that perform well across different downstream tasks.
2. **Multimodal Alignment**: How to better align video representations with information from other modalities, such as text and audio, through methods like cross-modal contrastive learning and masked video modeling.
3. **Long Video Understanding**: How to improve the model's ability to understand long videos, so that it can handle and reason about complex sequential actions.
4. **Open-Ended Dialogue Support**: How to strengthen the model's open-ended dialogue capabilities, so that it can generate high-quality video descriptions and answer video-related questions.

### Solutions

The paper addresses these issues through a three-stage training strategy:

1. **Unmasked Video Token Reconstruction**: In the first stage, the model learns basic spatiotemporal perception by reconstructing unmasked video tokens.
2. **Multimodal Contrastive Learning**: In the second stage, the model deepens its semantic understanding by aligning video with other modalities, such as text and audio, through cross-modal contrastive learning.
3. **Video-Based Context Prediction**: In the third stage, the video encoder is connected to a large language model (LLM) and trained jointly with next-token prediction, which further improves open-ended dialogue and the understanding of complex scenarios.
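
To make the second stage more concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired video and text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all hypothetical, and the summary above does not specify the exact loss used in InternVideo2.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of video_embs matches pair i of text_embs.

    The temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    videos = [l2_normalize(e) for e in video_embs]
    texts = [l2_normalize(e) for e in text_embs]
    n = len(videos)
    # sim[i][j]: temperature-scaled cosine similarity of video i and text j.
    sim = [
        [sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: row i should assign high probability to column i.
    video_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: the same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_video = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


# Toy check: correctly paired embeddings should score a lower loss than
# the same embeddings with the text side swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is averaged over both retrieval directions so that neither modality dominates the alignment. In practice the same idea would be applied to encoder outputs on large batches with a tensor library rather than Python lists.
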
### Dataset

To train InternVideo2, the authors constructed a large-scale multimodal video dataset containing 402 million data entries: 2 million videos, 50 million video-text pairs, 50 million video-audio-speech-text pairs, and 300 million image-text pairs. The authors place particular emphasis on how the data is constructed and on annotation quality, with the aim of improving the model's performance and generalization ability.

### Experimental Results

Experimental results show that InternVideo2 achieves state-of-the-art performance on multiple video-related tasks, including action recognition, video retrieval, and question answering. Notably, InternVideo2 shows clear advantages on long video understanding and video-centric dialogue tasks, which indicates that it handles complex scenarios and long-term sequential actions well.

### Contributions

1. **Proposed InternVideo2, a new family of video foundation models**, which gains perception, semantic understanding, and reasoning capabilities through masked reconstruction, multimodal contrastive learning, and context prediction.
2. **Achieved state-of-the-art performance on over 60 video/audio tasks**, particularly excelling in video-related dialogue and long video understanding tasks.
3. **Provided an enhanced dataset** for training InternVideo2, including validation of audio data and improved annotation methods, which boosts the model's performance and generalization ability.
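
As a quick sanity check on the dataset composition quoted in the Dataset section, the short tally below sums the four components and prints each one's share of the total. The counts are taken directly from the text (in millions); the dictionary layout and variable names are only illustrative.

```python
# Counts in millions, as quoted in the Dataset section.
composition = {
    "videos": 2,
    "video-text pairs": 50,
    "video-audio-speech-text pairs": 50,
    "image-text pairs": 300,
}

total = sum(composition.values())
shares = {name: count / total for name, count in composition.items()}

for name, count in composition.items():
    print(f"{name}: {count}M ({shares[name]:.1%})")
print(f"total: {total}M")
```

The components sum to the stated 402 million entries, and image-text pairs make up roughly three quarters of the total.
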