EVLM: An Efficient Vision-Language Model for Visual Understanding

Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang

2024-07-19

Abstract: In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of the language model can lead to significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing the Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image captioning and video captioning.
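The overhead argument in the abstract can be illustrated with a rough count of query-key score pairs per attention layer. The sketch below compares a LLaVA-style layout, where visual tokens are concatenated with text tokens and share one self-attention, against a Flamingo-style layout, where text tokens self-attend and separately cross-attend to the visual tokens. The function names and the token counts (16 frames, 576 visual tokens per frame, 128 text tokens) are illustrative assumptions, not figures from the paper. The count ignores projections, feed-forward layers, and causal masking, and it assumes a cross-attention block at every layer.

```python
def self_attention_pairs(num_visual: int, num_text: int) -> int:
    """Score pairs when visual and text tokens share one self-attention."""
    n = num_visual + num_text
    return n * n


def cross_attention_pairs(num_visual: int, num_text: int) -> int:
    """Score pairs when text self-attends and cross-attends to visual tokens."""
    return num_text * num_text + num_text * num_visual


# Hypothetical video input: 16 frames x 576 visual tokens, plus 128 text tokens.
num_visual = 16 * 576
num_text = 128

dense = self_attention_pairs(num_visual, num_text)
cross = cross_attention_pairs(num_visual, num_text)
ratio = dense / cross

print(dense)  # 87310336
print(cross)  # 1196032
print(ratio)  # 73.0
```

Under these assumptions the ratio simplifies to (V + T) / T, so the gap widens as the visual sequence grows relative to the text.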

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two issues that arise when Vision-Language Models (VLMs) process long-sequence visual signals such as video: excessive computational overhead, and the difficulty large language models have in fully perceiving visual signals from single-layer ViT features. To tackle these problems, the paper proposes an Efficient Vision-Language Model (EVLM) with the following characteristics:

1. **Adopts cross-attention mechanisms**: Similar to Flamingo, EVLM uses cross-attention to facilitate interaction between visual and textual inputs.
2. **Uses hierarchical ViT features**: Extracts features from different levels of the visual encoder, enabling the large language model to perceive visual signals at various levels.
3. **Introduces a Mixture of Experts (MoE) mechanism**: Applies MoE in the cross-attention layers to enhance model performance, which further improves results by increasing the number of trainable parameters.

The paper also describes the model architecture, the training strategy (multimodal pre-training, multi-task continual pre-training, and supervised fine-tuning), and evaluation results on various benchmarks, which show EVLM to be competitive in tasks such as image captioning and visual question answering.
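The three characteristics above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the attention is single-head with no learned projections, tokens from different ViT levels are simply concatenated into one key/value set (the summary does not say how EVLM combines levels), and the experts and router are hand-written stand-ins for learned parameters. All names (`cross_attend`, `moe`, `shallow`, `deep`) are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query: Vector, visual_tokens: List[Vector]) -> Vector:
    """One text query attends over visual tokens (used as both keys and values)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in visual_tokens])
    return [
        sum(w * v[i] for w, v in zip(weights, visual_tokens))
        for i in range(len(query))
    ]


def moe(x: Vector, router: List[Vector],
        experts: List[Callable[[Vector], Vector]], top_k: int = 1) -> Vector:
    """Route x to the top_k experts and mix their outputs by softmax gate."""
    scores = [dot(x, r) for r in router]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in top])
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        y = experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Assumption: features from two ViT levels, concatenated into one token set.
shallow = [[1.0, 0.0], [0.0, 1.0]]
deep = [[0.5, 0.5]]
query = [1.0, 0.0]

attended = cross_attend(query, shallow + deep)

# Hypothetical experts and router rows standing in for learned weights.
experts = [lambda v: [2 * a for a in v], lambda v: [-a for a in v]]
router = [[1.0, 0.0], [0.0, 1.0]]
fused = moe(attended, router, experts, top_k=1)
```

Here the query aligns with the first shallow token, so the attended vector leans toward its first component and the router selects the first expert. In a real model the router, experts, and attention projections would all be trained.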
