CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

2024-08-29

Abstract: Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper attempts to address the following key issues:

1. **Enhancing Visual-Language Fusion**: It designs a "visual expert" architecture to achieve deeper fusion between the visual and language modalities, overcoming the limitations of shallow alignment techniques.
2. **Improving Input Resolution**: It proposes an efficient high-resolution cross module that allows the model to handle image inputs up to 1344×1344 pixels without sacrificing performance.
3. **Extending Modalities and Application Scope**: It expands the application of visual-language models to areas such as graphical user interface (GUI) agents and video understanding, and proposes an automated method for generating temporal annotation data.

Specifically, the paper introduces the CogVLM2 series of models, including:

- **CogVLM2**: A model for image understanding that inherits the visual expert architecture, with improvements in both the pre-training and fine-tuning stages.
- **CogVLM2-Video**: A model for video understanding that integrates multi-frame input with timestamp information and proposes an automated method for constructing temporal annotation data.
- **GLM-4V**: A bilingual visual-language model aimed at enhancing image understanding in both Chinese and English.

These models achieve state-of-the-art performance on multiple benchmarks, such as MMBench, MM-Vet, and TextVQA. The paper also details the data processing methods and the settings used for pre-training and fine-tuning.
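The summary above does not say how frames and timestamps are combined, so the following is only a minimal sketch of one plausible reading of "multi-frame input with timestamp information": sample frames evenly across the clip and pair each with its time in seconds. The function names, the `<frame>` placeholder token, and the `[1.5s]` text format are assumptions made for illustration, not the CogVLM2-Video interface.

```python
from typing import List, Tuple


def sample_frames_with_timestamps(
    num_total_frames: int, fps: float, num_samples: int
) -> List[Tuple[int, float]]:
    """Pick evenly spaced frame indices and their timestamps in seconds.

    Hypothetical helper: the paper's actual sampling strategy may differ.
    """
    if num_total_frames <= 0 or fps <= 0 or num_samples <= 0:
        raise ValueError("all arguments must be positive")
    # Never ask for more frames than the clip contains.
    num_samples = min(num_samples, num_total_frames)
    # Split the clip into equal segments and take the centre frame of each.
    step = num_total_frames / num_samples
    indices = [int(step * i + step / 2) for i in range(num_samples)]
    return [(idx, idx / fps) for idx in indices]


def build_interleaved_prompt(
    samples: List[Tuple[int, float]], frame_token: str = "<frame>"
) -> str:
    """Render each sampled frame as a timestamp followed by a placeholder token.

    `frame_token` is an assumed stand-in for wherever image features would go.
    """
    return " ".join(f"[{t:.1f}s] {frame_token}" for _, t in samples)


if __name__ == "__main__":
    samples = sample_frames_with_timestamps(num_total_frames=100, fps=25.0, num_samples=4)
    print(samples)
    print(build_interleaved_prompt(samples))
```

For a 100-frame clip at 25 fps with four samples, this picks frames 12, 37, 62 and 87 and renders the prompt `[0.5s] <frame> [1.5s] <frame> [2.5s] <frame> [3.5s] <frame>`. Placing the timestamp as plain text next to each frame keeps the temporal information visible to the language model without changing the tokenizer, which is one simple way such a design could work.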

CogVLM: Visual Expert for Pretrained Language Models

COGVLM: VISUAL EXPERT FOR LARGE LANGUAGE MODELS

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

An Introduction to Vision-Language Modeling

Audio-Visual LLM for Video Understanding

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Visually-Augmented Language Modeling

EVLM: An Efficient Vision-Language Model for Visual Understanding

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

LongVLM: Efficient Long Video Understanding via Large Language Models

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

VoCo-LLaMA: Towards Vision Compression with Large Language Models

VIGC: Visual Instruction Generation and Correction

LLM4VG: Large Language Models Evaluation for Video Grounding

CogAgent: A Visual Language Model for GUI Agents

InfMLLM: A Unified Framework for Visual-Language Tasks

Rethinking VLMs and LLMs for Image Classification