Abstract: We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.

What problem does this paper attempt to address?

The paper targets several key limitations of existing large vision-language models (LVLMs) when processing images and videos:

1. **Fixed image input size**: Standard LVLMs resize every input image to a fixed resolution (e.g., 224 × 224), usually by down-sampling or up-sampling. This "one-size-fits-all" strategy gives the model a consistent input, but it limits the model's ability to capture information at different scales. High-resolution images in particular lose a significant amount of detail, so these models cannot perceive scale and fine detail as sensitively as human vision does.
2. **Static visual encoder**: Most LVLMs rely on a static, frozen CLIP-style visual encoder. This raises the question of whether the visual representations produced by such pre-trained encoders are sufficient, especially for complex reasoning tasks and for subtle details in images. Recent work has addressed this by fine-tuning the Vision Transformer (ViT) during LVLM training.
3. **Video content processing**: A video is essentially a sequence of frames, yet many existing models still treat video as an independent modality. Understanding the dynamics captured in video is crucial for models that aim to handle the complexity of the real world. Current models, however, have limited ability to represent three-dimensional space and temporal dynamics because they rely on conventional one-dimensional position embeddings.

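To make the first limitation concrete, the following minimal sketch compares how many source pixels stand behind each visual token when an image is first resized to a fixed 224 × 224 input and when it is split into patches at its own size. The patch size of 14, the 2 × 2 merging of adjacent patch tokens, and the rounding up to whole cells are assumptions chosen for illustration; the summary above does not state them, and the helper names are hypothetical.

```python
import math

PATCH = 14  # assumed ViT patch size (not stated in the summary)
MERGE = 2   # assumed 2x2 merging of adjacent patch tokens


def tokens_fixed(fixed_side: int = 224) -> int:
    """Token count when every image is resized to fixed_side x fixed_side."""
    per_side = fixed_side // (PATCH * MERGE)
    return per_side * per_side


def tokens_native(height: int, width: int) -> int:
    """Token count when the image keeps its own size.

    Each side is rounded up to a whole number of merged-patch cells;
    this rounding rule is a simplification made for this sketch.
    """
    cell = PATCH * MERGE
    return math.ceil(height / cell) * math.ceil(width / cell)


def pixels_per_token(height: int, width: int, n_tokens: int) -> float:
    """Average number of source pixels summarized by one visual token."""
    return height * width / n_tokens
```

Under these assumptions, a 1080 × 1920 frame yields 64 tokens after fixed resizing but 2691 tokens at its own size, which is about 32,400 versus about 771 source pixels per token.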
To overcome these limitations, the Qwen2-VL series introduces the following innovations:

- **Dynamic resolution mechanism**: Qwen2-VL introduces the "Naive Dynamic Resolution" mechanism, which lets the model process images of different resolutions and convert them into different numbers of visual tokens. This produces more efficient and accurate visual representations that are closer to human perception.
- **Multimodal Rotary Position Embedding (M-RoPE)**: Qwen2-VL integrates M-RoPE to fuse positional information across text, images, and videos. M-RoPE decomposes the rotary embedding into three components (time, height, and width), which improves the model's ability to represent position in multimodal input.
- **Unified image and video processing paradigm**: Qwen2-VL is trained on a mixture of image and video data, which strengthens both image understanding and video understanding. By dynamically adjusting the resolution of each video frame, the model maintains training efficiency while processing long videos.

These innovations enable the Qwen2-VL series to achieve significant performance improvements on a range of multimodal benchmarks, especially in document understanding, video understanding, multilingual support, and agent capabilities.
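The three-component decomposition used by M-RoPE can be illustrated with a small sketch that assigns a (time, height, width) index triple to every token in a mixed sequence. The assignment rules used here are assumptions not spelled out in the summary above: text tokens share one index across all three components, an image keeps a constant temporal index, video frames increment it, and each new segment starts one past the largest index used so far. The function names and the segment format are hypothetical, and the sketch covers only index assignment, not the rotary embedding itself.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width) index triple


def text_positions(start: int, length: int) -> List[Position]:
    """Text tokens: all three components carry the same 1D index."""
    return [(start + i, start + i, start + i) for i in range(length)]


def visual_positions(start: int, frames: int, grid_h: int, grid_w: int) -> List[Position]:
    """Image (frames=1) or video: one triple per visual token.

    The temporal index changes per frame; height and width indices
    follow the token's row and column in the frame's token grid.
    """
    return [
        (start + t, start + h, start + w)
        for t in range(frames)
        for h in range(grid_h)
        for w in range(grid_w)
    ]


def build_positions(segments: Sequence[tuple]) -> List[Position]:
    """Assign triples to a sequence of ("text", n) and
    ("visual", frames, grid_h, grid_w) segments."""
    positions: List[Position] = []
    start = 0
    for seg in segments:
        if seg[0] == "text":
            new = text_positions(start, seg[1])
        else:
            new = visual_positions(start, seg[1], seg[2], seg[3])
        positions.extend(new)
        if new:
            # Assumed rule: the next segment starts one past the largest index so far.
            start = max(max(p) for p in new) + 1
    return positions
```

For example, `build_positions([("text", 2), ("visual", 1, 2, 2), ("text", 1)])` gives two text triples, four image triples that share the temporal index 2, and a final text token at index 4 in all three components. For pure text the three components coincide, so the scheme reduces to ordinary one-dimensional positions.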

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

DeepSeek-VL: Towards Real-World Vision-Language Understanding

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Qwen Technical Report

MBQ: Modality-Balanced Quantization for Large Vision-Language Models

Qwen2.5 Technical Report

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

EVLM: An Efficient Vision-Language Model for Visual Understanding

CogVLM2: Visual Language Models for Image and Video Understanding

Qwen2 Technical Report

COGVLM: VISUAL EXPERT FOR LARGE LANGUAGE MODELS

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

Small Language Model Meets with Reinforced Vision Vocabulary