Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, Li Yuan
2024-04-05
Abstract: Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However, existing methods encounter challenges in effectively handling both image and video understanding, particularly with limited visual tokens. In this work, we introduce Chat-UniVi, a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications. Extensive experimental results demonstrate that Chat-UniVi consistently outperforms even existing methods exclusively designed for either images or videos. Code is available at
Computer Science
What problem does this paper attempt to address?
The paper attempts to address the challenges faced by existing large language models (LLMs) in handling image and video understanding. Although current methods have made some progress in multimodal dialogue, they usually focus on either image or video input and cannot efficiently handle both types of visual information simultaneously. Specifically:

1. **Differences in Image and Video Understanding**: Existing methods typically use more visual tokens for images to achieve finer spatial understanding, while they sacrifice spatial understanding for videos to accommodate more frames for modeling temporal relationships.
2. **Fixed Number of Visual Tokens**: Some methods can extract a fixed number of tokens for each image and video, but these methods mainly focus on image understanding and lack effective modeling of video temporal understanding.
3. **Challenges of Joint Training**: Existing methods usually require separate pre-training of image and video encoders, leading to model redundancy and difficulty in joint training.

To overcome these issues, the paper proposes a unified vision-language model, **Chat-UniVi**, which represents images and videos through unified dynamic visual tokens, thereby capturing both the spatial details of images and the temporal relationships of videos under a limited number of visual tokens. Additionally, the model employs multi-scale representations, enabling it to perceive both high-level semantic concepts and low-level visual details. By training on a mixed dataset containing images and videos, Chat-UniVi can be directly applied to tasks involving both media without any modifications.
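
To make the idea of a shared token budget concrete, the sketch below reduces the patch tokens of one image and of a short video to the same number of merged tokens. It is only an illustration under stated assumptions: the summary above does not say how Chat-UniVi forms its dynamic tokens, so plain k-means is swapped in as a stand-in, and the function name `merge_tokens`, the token counts, the feature dimension, and the budget are all hypothetical.

```python
import numpy as np


def merge_tokens(tokens: np.ndarray, k: int, iters: int = 10) -> np.ndarray:
    """Reduce (n, d) tokens to (k, d) with plain k-means.

    Each output token is the mean of one cluster of input tokens.
    This is a stand-in for illustration, not the paper's algorithm.
    """
    n = tokens.shape[0]
    if k >= n:
        # Nothing to merge: the input already fits the budget.
        return tokens.copy()
    # Deterministic init: use k evenly spaced tokens as starting centroids.
    centroids = tokens[np.linspace(0, n - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every token to every centroid: (n, k).
        dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(64, 16))      # one image: 64 patch tokens (hypothetical)
video_tokens = rng.normal(size=(4 * 64, 16))  # four frames: 256 patch tokens (hypothetical)
budget = 32                                   # shared visual-token budget (hypothetical)

image_repr = merge_tokens(image_tokens, budget)
video_repr = merge_tokens(video_tokens, budget)
print(image_repr.shape, video_repr.shape)
```

Both inputs end up with the same number of tokens, so a downstream language model would see an identically shaped visual input whether the source was an image or a video. Treating all frame tokens as one pool is a simplification; it ignores frame order and so says nothing about how temporal relationships are actually preserved.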