CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang
2024-08-29
Abstract: Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the following key issues:

1. **Enhancing Visual-Language Fusion**: A "visual expert" architecture achieves deeper fusion between the visual and language modalities, overcoming the limitations of shallow alignment techniques.
2. **Improving Input Resolution**: An efficient high-resolution cross module allows the model to handle image inputs up to 1344×1344 pixels without sacrificing performance.
3. **Extending Modalities and Application Scope**: The application of visual-language models is expanded to areas such as graphical user interface (GUI) agents and video understanding, and an automated method for generating temporal annotation data is proposed.

Specifically, the paper introduces the CogVLM2 series of models:

- **CogVLM2**: A model for image understanding that inherits the visual expert architecture, with improvements in both the pre-training and fine-tuning stages.
- **CogVLM2-Video**: A model for video understanding that integrates multi-frame input with timestamp information and proposes an automated method for constructing temporal annotation data.
- **GLM-4V**: A bilingual visual-language model aimed at enhancing image understanding in both Chinese and English.

These models achieve state-of-the-art performance on multiple benchmarks, including MMBench, MM-Vet, and TextVQA. The paper also details the data processing methods and the pre-training and fine-tuning settings, which are intended to support strong performance across a wide range of visual tasks.
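
To make "multi-frame input with timestamps" more concrete, here is a minimal sketch of one way to pick frames evenly across a clip and pair each with its timestamp. It is an illustration only and does not reproduce the paper's implementation. The function names, the uniform sampling strategy, and the `[1.0s] <frame 2>` prompt layout are all assumptions made for this example.

```python
from __future__ import annotations


def sample_frames_with_timestamps(
    duration_s: float, fps: float, num_frames: int
) -> list[tuple[int, float]]:
    """Return (frame_index, timestamp_seconds) pairs spread evenly over a clip.

    Hypothetical helper: uniform sampling is an assumption for illustration,
    not the sampling strategy described in the paper.
    """
    total = int(duration_s * fps)  # number of frames available in the clip
    if total <= 0 or num_frames <= 0:
        return []
    n = min(num_frames, total)  # cannot sample more frames than exist
    # Take the frame at the centre of each of n equal-length segments.
    indices = [int((i + 0.5) * total / n) for i in range(n)]
    return [(idx, idx / fps) for idx in indices]


def build_prompt(samples: list[tuple[int, float]], question: str) -> str:
    """Interleave a timestamp tag with a placeholder for each sampled frame.

    The textual layout is hypothetical; a real model would insert image
    embeddings where the "<frame k>" placeholders appear.
    """
    lines = [f"[{t:.1f}s] <frame {idx}>" for idx, t in samples]
    return "\n".join(lines + [question])


if __name__ == "__main__":
    samples = sample_frames_with_timestamps(duration_s=10.0, fps=2.0, num_frames=4)
    print(build_prompt(samples, "When does the person open the door?"))
```

Pairing each frame with an explicit time tag is what would let a model answer temporal grounding questions ("when does X happen?") in seconds rather than in frame order alone.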