Abstract: We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.

What problem does this paper attempt to address?

The paper addresses the problem of building a video foundation model (ViFM) that can effectively understand and process multimodal video. It introduces InternVideo2, a new family of video foundation models that achieves state-of-the-art performance on video recognition, video-text tasks, and video-centric dialogue.

### Main Issues

1. **Video Representation Learning**: How to learn transferable spatiotemporal representations that perform well across different downstream tasks.
2. **Multimodal Alignment**: How to better align video representations with other modalities such as text and audio, using methods like cross-modal contrastive learning and masked video modeling.
3. **Long Video Understanding**: How to improve the model's ability to understand long videos, so that it can handle and reason about complex sequences of actions.
4. **Open-Ended Dialogue Support**: How to strengthen the model's open-ended dialogue capabilities, so that it can generate high-quality video descriptions and answer video-related questions.

### Solutions

The paper addresses these issues with a three-stage training strategy:

1. **Unmasked Video Token Reconstruction**: In the first stage, the model learns basic spatiotemporal perception by reconstructing unmasked video tokens.
2. **Multimodal Contrastive Learning**: In the second stage, the model improves its semantic understanding by aligning video with other modalities such as text and audio through cross-modal contrastive learning.
3. **Video-Based Context Prediction**: In the third stage, the video encoder is connected to a large language model (LLM) and trained jointly with next token prediction, which further improves open-ended dialogue and the understanding of complex scenarios.
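
The first two training stages can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that stage 1 can be approximated by a mean squared error computed only over the kept (unmasked) tokens, and that stage 2 uses a symmetric InfoNCE-style contrastive loss over paired video and text embeddings. All function names, the `keep_mask` argument, and the temperature value of 0.07 are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def unmasked_token_loss(student_tokens, target_tokens, keep_mask):
    """Stage-1-style loss (assumed form): MSE over the kept (unmasked) tokens only."""
    total, count = 0.0, 0
    for student, target, keep in zip(student_tokens, target_tokens, keep_mask):
        if keep:
            total += sum((a - b) ** 2 for a, b in zip(student, target))
            count += len(student)
    return total / count if count else 0.0


def _cross_entropy_diagonal(logits):
    """Average cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Stage-2-style loss (assumed form): symmetric InfoNCE over matched pairs.

    video_embs[i] and text_embs[i] are treated as a positive pair; every other
    combination in the batch is treated as a negative.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    # Cosine similarity matrix scaled by the temperature (video rows, text columns).
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    video_to_text = _cross_entropy_diagonal(logits)
    text_to_video = _cross_entropy_diagonal(logits_transposed)
    return 0.5 * (video_to_text + text_to_video)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", symmetric_contrastive_loss(videos, matched_texts))
    print("swapped pairs:", symmetric_contrastive_loss(videos, swapped_texts))
```

In this toy setup the loss is lower when each video embedding is paired with its matching text embedding than when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward. The third stage, next token prediction with an LLM, is not sketched here because it depends on a language model.
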

### Dataset

To train InternVideo2, the authors constructed a large-scale multimodal video dataset containing 402 million data entries: 2 million videos, 50 million video-text pairs, 50 million video-audio-speech-text pairs, and 300 million image-text pairs. The authors emphasize the construction and annotation quality of this data as a factor in the model's performance and generalization ability.

### Experimental Results

Experimental results show that InternVideo2 achieves state-of-the-art performance on multiple video-related tasks, including action recognition, video retrieval, and question answering. Notably, InternVideo2 shows clear advantages in long video understanding and video-centric dialogue tasks, which indicates that it can handle complex scenarios and long sequences of actions.

### Contributions

1. **Proposed a new family of video foundation models, InternVideo2**, which gives the model perception, semantic understanding, and reasoning capabilities through masked reconstruction, multimodal contrastive learning, and context prediction.
2. **Achieved state-of-the-art performance on over 60 video and audio tasks**, excelling in particular at video-related dialogue and long video understanding.
3. **Provided an enhanced dataset** for training InternVideo2, including validation of audio data and improved annotation methods, which improves the model's performance and generalization ability.

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
