Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan

2024-04-05

Abstract: Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi consistently outperforms even existing methods exclusively designed for either images or videos. Code is available at

Computer Science

What problem does this paper attempt to address?

The paper addresses the difficulty that existing large language models (LLMs) have in handling both image and video understanding. Although current methods have made progress in multimodal dialogue, they usually focus on either image or video input and cannot handle both types of visual information efficiently. Specifically:

1. **Differences in Image and Video Understanding**: Existing methods typically use more visual tokens for images to achieve finer spatial understanding, while for videos they sacrifice spatial understanding to accommodate more frames for modeling temporal relationships.
2. **Fixed Number of Visual Tokens**: Some methods extract a fixed number of tokens for each image and video, but these methods mainly focus on image understanding and lack effective modeling of video temporal understanding.
3. **Challenges of Joint Training**: Existing methods usually require separate pre-training of image and video encoders, leading to model redundancy and difficulty in joint training.

To overcome these issues, the paper proposes a unified vision-language model, **Chat-UniVi**, which represents images and videos through unified dynamic visual tokens, capturing both the spatial details of images and the temporal relationships of videos with a limited number of visual tokens. The model also employs multi-scale representations, enabling it to perceive both high-level semantic concepts and low-level visual details. Because it is trained on a mixed dataset containing images and videos, Chat-UniVi can be applied directly to tasks involving both media without any modifications.
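To make the idea of dynamic visual tokens and a multi-scale representation more concrete, here is a minimal sketch of how a large set of visual tokens might be merged into a smaller one, and how several merge levels might be concatenated. The text above does not say which merging algorithm Chat-UniVi uses, so plain k-means stands in purely for illustration; the function names `merge_tokens` and `multi_scale_tokens`, the `scales` values, and the toy 2-D tokens are all hypothetical.

```python
from math import dist


def merge_tokens(tokens, k, iters=10):
    """Merge token vectors into at most k averaged tokens.

    Assumption: plain k-means is used as a stand-in merging rule;
    the source does not specify the actual algorithm.
    """
    if k >= len(tokens):
        return [list(t) for t in tokens]
    # Deterministic init: evenly spaced tokens serve as initial centers.
    step = len(tokens) / k
    centers = [list(tokens[int(i * step)]) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for t in tokens:
            nearest = min(range(k), key=lambda c: dist(t, centers[c]))
            groups[nearest].append(t)
        # Each merged token is the mean of its group; an empty group
        # keeps its previous center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def multi_scale_tokens(tokens, scales=(8, 4, 2)):
    """Merge progressively and concatenate the tokens from every scale.

    Earlier (larger) scales keep finer detail; later (smaller) scales
    summarize broader regions. The scale sizes are illustrative only.
    """
    out, current = [], tokens
    for k in scales:
        current = merge_tokens(current, k)
        out.extend(current)
    return out


# Toy example: four 2-D "tokens" forming two obvious groups.
toy = [[0, 0], [0, 1], [10, 10], [10, 11]]
merged = merge_tokens(toy, 2)
```

In this sketch an image and a video would both pass through the same merging function, differing only in how many tokens go in; that is the sense in which the representation is unified, though the real model may partition and merge video frames differently.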

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Universal Multimodal Representation for Language Understanding

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

EVLM: An Efficient Vision-Language Model for Visual Understanding

Valley: Video Assistant with Large Language model Enhanced abilitY

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Uni3DL: Unified Model for 3D and Language Understanding

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention