Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

2023-10-13

Abstract: In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Enhancing the capabilities of large vision-language models (LVLMs)**: The Qwen-VL series extends a large language model (Qwen-LM) with the ability to perceive and understand visual signals, so that it can process images alongside text.
2. **Addressing the shortcomings of existing open-source LVLMs**: Existing open-source LVLMs suffer from inadequate training and optimization and lag far behind proprietary models. Qwen-VL addresses this with a carefully designed visual receptor, an input-output interface, a three-stage training pipeline, and a multilingual multimodal cleaned corpus.
3. **Achieving fine-grained visual understanding**: Many existing models perceive images only at a coarse level and lack fine-grained abilities such as object localization (grounding) and text reading. Qwen-VL improves fine-grained visual understanding through high-resolution input and training on a fine-grained corpus of aligned image-caption-box tuples.
4. **Multimodal dialogue and task execution**: Beyond conventional tasks such as image description and question answering, the instruction-tuned Qwen-VL-Chat outperforms existing vision-language chatbots on real-world dialogue benchmarks and supports multi-turn dialogue and multilingual communication.

In summary, the paper develops a capable and flexible vision-language model for vision-centric tasks in practical applications, with the aim of improving both multimodal processing and interactivity.

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Qwen Technical Report

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

CogVLM: Visual Expert for Large Language Models

Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

From Images to Textual Prompts: Zero-Shot Visual Question Answering with Frozen Large Language Models

Unified Vision-Language Pre-Training for Image Captioning and VQA

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

HumanVLM: Foundation for Human-Scene Vision-Language Model

Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare

CogVLM: Visual Expert for Pretrained Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

VLIS: Unimodal Language Models Guide Multimodal Language Generation

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences