Unveiling Encoder-Free Vision-Language Models

Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
2024-10-29
Abstract: Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses three limitations of existing encoder-based vision-language models (VLMs):

1. **Image Resolution/Aspect Ratio**: Vision encoders are typically pre-trained on fixed-size square images, so images of other shapes must be rescaled, padded, or cropped before they can be processed. This distorts layout, breaks the connection between image slices, and adds computational cost, especially for high-resolution images.
2. **Deployment Overhead**: The vision encoder and the large language model (LLM) usually run sequentially, so inference efficiency degrades as VLMs scale up, particularly when a high-resolution image is split into slices and each slice is passed through the encoder.
3. **Model Capacity Matching**: Vision encoders and LLMs are pre-trained separately, which makes it hard to match their capacities and capabilities. As LLMs grow, it remains unclear how to choose a vision encoder that lets both components perform at their best.

To overcome these limitations, the paper proposes EVE, a pure decoder-only vision-language model that does not rely on a vision encoder and instead takes image and text inputs directly. As a result, EVE can handle images of any resolution and aspect ratio, and the authors report that it rivals encoder-based VLMs of similar capacity across multiple vision-language benchmarks.
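To make the encoder-free input path concrete, here is a minimal, dependency-free sketch of how an image of arbitrary shape can be turned into tokens and placed in the same sequence as text tokens for a single decoder. It is a generic illustration, not EVE's actual implementation: the patch size, hidden width, zero-padding of edges, the single linear projection, and all function names (`patchify`, `project`, `build_sequence`) are assumptions made for this example, and the weights are random.

```python
import random

PATCH = 4     # hypothetical patch side length in pixels
D_MODEL = 8   # hypothetical decoder hidden width


def patchify(image, patch=PATCH):
    """Split an H x W x C image (nested lists) into flattened patches.

    Edges are zero-padded up to a multiple of `patch`, so the image is never
    resized or cropped to a fixed square. Returns patches in row-major order,
    each a flat list of length patch * patch * C.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    grid_h, grid_w = -(-h // patch), -(-w // patch)  # ceiling division
    zero = [0.0] * c
    patches = []
    for gy in range(grid_h):
        for gx in range(grid_w):
            vec = []
            for y in range(gy * patch, (gy + 1) * patch):
                for x in range(gx * patch, (gx + 1) * patch):
                    vec.extend(image[y][x] if y < h and x < w else zero)
            patches.append(vec)
    return patches


def project(vec, weight):
    """Linear projection of a flat patch; weight has shape (len(vec), D_MODEL)."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]


def build_sequence(image, text_ids, w_patch, text_table):
    """Concatenate projected image patches and text embeddings into one sequence."""
    visual = [project(p, w_patch) for p in patchify(image)]
    textual = [text_table[i] for i in text_ids]
    return visual + textual


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def rand_image(h, w, c=3):
    return [[[rng.random() for _ in range(c)] for _ in range(w)] for _ in range(h)]


w_patch = rand_matrix(PATCH * PATCH * 3, D_MODEL)  # random stand-in weights
text_table = rand_matrix(100, D_MODEL)             # random stand-in embeddings
text_ids = [1, 2, 3, 4, 5]

# Two images with very different aspect ratios go through the same path.
seq_wide = build_sequence(rand_image(16, 32), text_ids, w_patch, text_table)
seq_tall = build_sequence(rand_image(30, 10), text_ids, w_patch, text_table)
print(len(seq_wide), len(seq_tall))  # prints "37 29"
```

The sequence length simply follows the image shape (32 or 24 patch tokens plus 5 text tokens here), which is the flexibility the summary attributes to dropping the fixed-resolution encoder. A real model would learn the projection and add positional or row-separator information, which this sketch omits.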