EVLM: An Efficient Vision-Language Model for Visual Understanding

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang
2024-07-19
Abstract: In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two problems that arise when Vision-Language Models (VLMs) process long visual sequences such as video: the computational overhead of self-attention over many visual tokens, and the difficulty large language models have in fully perceiving visual signals from single-layer ViT features. To tackle these problems, it proposes an Efficient Vision-Language Model (EVLM) with the following characteristics:

1. **Adopts cross-attention mechanisms**: Similar to Flamingo, EVLM uses cross-attention for the interaction between visual and textual inputs.
2. **Uses hierarchical ViT features**: Features are extracted from different levels of the visual encoder, so the language model can perceive visual signals at multiple levels.
3. **Introduces a Mixture of Experts (MoE) mechanism**: MoE is applied in the cross-attention layers to enhance model performance, and performance improves further as the number of trainable parameters is scaled up.

The paper also describes the model architecture, the training strategy (multimodal pre-training, multi-task continual pre-training, and supervised fine-tuning), and evaluation results on public benchmarks, where EVLM reports competitive scores on tasks such as image captioning and visual question answering.
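
The three ingredients named above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tensor sizes, the random weights, the choice to combine hierarchical features by concatenating tokens from three stand-in "ViT layers", and the top-1 routing rule are all assumptions made for illustration, and Flamingo-style gating and multi-head attention are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, visual, w_q, w_k, w_v):
    """Text tokens act as queries; visual tokens act as keys and values."""
    q, k, v = text @ w_q, visual @ w_k, visual @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # shape (T, V_total)
    return softmax(scores) @ v               # shape (T, d)

def moe_ffn(x, w_gate, experts):
    """Top-1 routed mixture of experts: each token is sent to one expert MLP."""
    gate = softmax(x @ w_gate)               # (T, E) routing probabilities
    choice = gate.argmax(axis=-1)            # (T,) index of the selected expert
    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = choice == e
        if mask.any():
            hidden = np.maximum(x[mask] @ w1, 0.0)           # ReLU
            out[mask] = (hidden @ w2) * gate[mask, e:e + 1]  # scale by gate weight
    return out

# Hypothetical sizes: T text tokens, V visual tokens per layer, width d, E experts.
T, V, d, E, hidden_dim = 4, 6, 8, 2, 16
text = rng.normal(size=(T, d))

# Assumption: "hierarchical features" are modeled as outputs of three ViT
# layers, concatenated along the token axis. Random arrays stand in for them.
vit_layers = [rng.normal(size=(V, d)) for _ in range(3)]
visual = np.concatenate(vit_layers, axis=0)  # shape (3 * V, d)

w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
w_gate = rng.normal(size=(d, E)) * 0.1
experts = [
    (rng.normal(size=(d, hidden_dim)) * 0.1, rng.normal(size=(hidden_dim, d)) * 0.1)
    for _ in range(E)
]

attended = cross_attention(text, visual, w_q, w_k, w_v)
fused = text + attended                           # residual connection
output = fused + moe_ffn(fused, w_gate, experts)  # residual around the MoE block
print(output.shape)  # (4, 8)
```

In this sketch the attention score matrix has shape `(T, 3 * V)`. Feeding the visual tokens into the language model's own sequence would instead produce a self-attention score matrix of shape `(T + 3 * V, T + 3 * V)` in every layer, which is the overhead the cross-attention design is meant to avoid when the visual sequence is long.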