Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
2024-10-03
Abstract: We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses several key limitations of existing large vision-language models (LVLMs) when processing images and videos:
1. **Fixed image input size**: Current LVLMs are usually constrained to a fixed image input size. A standard LVLM resizes the input image to a fixed resolution (e.g., 224 × 224), usually by down-sampling or up-sampling. This "one-size-fits-all" strategy keeps the input resolution consistent, but it limits the model's ability to capture information at different scales and loses a significant amount of detail in high-resolution images. As a result, these models are less sensitive to scale and fine detail than human vision.
2. **Static visual encoder**: Most LVLMs rely on a static, frozen CLIP-style visual encoder, which raises the question of whether the visual representations produced by such pre-trained models are sufficient, especially for complex reasoning tasks and for subtle details in images. Recent research has attempted to address this by fine-tuning the Vision Transformer (ViT) during LVLM training.
3. **Video content processing**: Video content is essentially a series of frames, but many existing models still treat it as an independent modality. Understanding the dynamic nature of the real world, as embodied in videos, is crucial for models that aim to capture its complexity. However, current models have limited ability to handle three-dimensional space and temporal dynamics because they rely on traditional one-dimensional position embeddings.
To overcome these limitations, the Qwen2-VL series introduces the following innovations:
- **Dynamic resolution mechanism**: Qwen2-VL introduces the "Naive Dynamic Resolution" mechanism, which lets the model process images of different resolutions and convert them into different numbers of visual tokens. This allows the model to generate visual representations more efficiently and accurately, in a way that is closer to human perception.
- **Multimodal Rotary Position Embedding (M-RoPE)**: Qwen2-VL integrates M-RoPE to fuse positional information across text, images, and videos. M-RoPE decomposes the rotary embedding into three components (time, height, and width), which improves the model's ability to represent the positions of multimodal inputs.
- **Unified image and video processing paradigm**: Qwen2-VL adopts a hybrid training method that combines image and video data, improving both image understanding and video understanding. By dynamically adjusting the resolution of each video frame, the model can maintain training efficiency while processing long videos.

These innovations enable the Qwen2-VL series to achieve significant performance improvements on various multimodal benchmarks, especially in document understanding, video understanding, multilingual support, and agent capabilities.
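
The two mechanisms above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that this summary does not state: a patch size of 14 pixels, an adapter that merges each 2 × 2 block of patch features into one token, rounding each image side up to a whole merged patch, and the hypothetical function names `visual_token_count`, `mrope_position_ids`, and `text_position_ids`. The sketch shows how the visual token count could grow with image resolution, and how each token could receive a (time, height, width) position triple instead of a single 1D index.

```python
import math


def visual_token_count(height, width, patch_size=14, merge_size=2):
    """Visual tokens produced for one image under a dynamic-resolution scheme.

    Assumed, not taken from the summary: square patches of `patch_size`
    pixels, and a merge of each `merge_size` x `merge_size` block of
    patch features into a single token.
    """
    unit = patch_size * merge_size  # pixels covered by one merged token per side
    grid_h = math.ceil(height / unit)  # assumed rounding: pad up to a whole unit
    grid_w = math.ceil(width / unit)
    return grid_h * grid_w


def mrope_position_ids(num_frames, grid_h, grid_w, start=0):
    """One (time, height, width) triple per visual token.

    A single image is treated as a video with one frame, so its time
    component stays constant while height and width vary.
    """
    return [
        (start + t, start + h, start + w)
        for t in range(num_frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]


def text_position_ids(num_tokens, start=0):
    """Text tokens reuse one index for all three components,
    which reduces to ordinary 1D positions."""
    return [(start + i,) * 3 for i in range(num_tokens)]


if __name__ == "__main__":
    # Larger images yield more tokens instead of being resized to one shape.
    print(visual_token_count(224, 224))   # 64
    print(visual_token_count(448, 448))   # 256
    # A 2 x 2 token grid from a single image.
    print(mrope_position_ids(1, 2, 2))
    # [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
```

Under these assumptions, quadrupling the pixel count quadruples the token count, which is why long videos need a reduced per-frame resolution to keep the sequence length manageable.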