Optimizing Vision-Language Interactions Through Decoder-Only Models

Kaito Tanaka, Benjamin Tan, Brian Wong
2024-12-14
Abstract: Vision-Language Models (VLMs) have emerged as key enablers for multimodal tasks, but their reliance on separate visual encoders introduces challenges in efficiency, scalability, and modality alignment. To address these limitations, we propose MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model that seamlessly integrates visual and textual inputs through a novel Vision-Token Adapter (VTA) and adaptive co-attention mechanism. By eliminating the need for a visual encoder, MUDAIF achieves enhanced efficiency, flexibility, and cross-modal understanding. Trained on a large-scale dataset of 45M image-text pairs, MUDAIF consistently outperforms state-of-the-art methods across multiple benchmarks, including VQA, image captioning, and multimodal reasoning tasks. Extensive analyses and human evaluations demonstrate MUDAIF's robustness, generalization capabilities, and practical usability, establishing it as a new standard in encoder-free vision-language models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the efficiency, scalability, and modality-alignment problems that arise when Vision-Language Models (VLMs) rely on a separate visual encoder. Traditional VLMs use a two-stage framework: a visual encoder first extracts image features, and a language model then processes those features to generate output. This architecture is effective, but it has three limitations:

1. **Restricted input resolution and aspect ratio**: The visual encoder is pre-trained on a specific image distribution, which limits its ability to handle images of arbitrary resolution and aspect ratio.
2. **Increased computational overhead**: The visual encoder adds computational cost during both training and inference, which complicates deployment in resource-constrained environments.
3. **Insufficient cross-modal alignment**: The output of the visual encoder lacks fine-grained alignment with the latent space of the language model, which weakens cross-modal reasoning and integration.

To solve these problems, the paper proposes MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model. MUDAIF eliminates the visual encoder and integrates visual and textual inputs through the **Vision-Token Adapter (VTA)** and an **adaptive co-attention mechanism**. The specific contributions are:

- Proposing MUDAIF, a new decoder-only vision-language model in which the VTA converts raw visual features into pseudo-text tokens that the language model can process directly.
- Introducing an adaptive co-attention mechanism that provides bidirectional interaction between visual and textual information and improves cross-modal fusion.
- Pre-training on a large-scale dataset (45 million image-text pairs) and combining this with multimodal instruction fine-tuning, so that the model performs strongly on multiple benchmarks, including Visual Question Answering (VQA), image captioning, and multimodal reasoning tasks.

In conclusion, MUDAIF aims to improve efficiency, flexibility, and cross-modal understanding by eliminating the visual encoder, and the authors position it as a new standard for encoder-free vision-language models.
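
The idea of turning image content directly into pseudo-text tokens, without a pre-trained visual encoder, can be illustrated with a minimal sketch. This is not the paper's VTA: the summary above does not specify its internals, so the example assumes the simplest possible adapter, namely non-overlapping patches passed through a single linear projection into the language model's embedding width. The names `patchify`, `ToyVisionTokenAdapter`, and `build_decoder_input`, the patch size of 16, and the embedding width of 64 are all hypothetical choices made for illustration. The adaptive co-attention mechanism is not modeled here.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, _ = image.shape
    # Zero-pad the bottom and right edges so any resolution divides evenly.
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    image = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    grid_h, grid_w = image.shape[0] // patch, image.shape[1] // patch
    c = image.shape[2]
    # (grid_h, patch, grid_w, patch, C) -> (grid_h, grid_w, patch, patch, C)
    patches = image.reshape(grid_h, patch, grid_w, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid_h * grid_w, patch * patch * c)


class ToyVisionTokenAdapter:
    """Hypothetical stand-in for a VTA: one linear map from patch to token."""

    def __init__(self, patch, channels, d_model, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = patch * patch * channels
        self.patch = patch
        # Randomly initialized weights; a real model would learn these.
        self.weight = rng.normal(0.0, in_dim ** -0.5, size=(in_dim, d_model))
        self.bias = np.zeros(d_model)

    def __call__(self, image):
        # One pseudo-text token per patch, in the decoder's embedding width.
        return patchify(image, self.patch) @ self.weight + self.bias


def build_decoder_input(visual_tokens, text_embeddings):
    """Concatenate pseudo-text tokens and text embeddings into one sequence."""
    return np.concatenate([visual_tokens, text_embeddings], axis=0)


# An image whose size is not a multiple of the patch size still works.
adapter = ToyVisionTokenAdapter(patch=16, channels=3, d_model=64)
image = np.random.default_rng(1).random((100, 150, 3))
text = np.zeros((5, 64))  # placeholder embeddings for 5 text tokens
sequence = build_decoder_input(adapter(image), text)
print(sequence.shape)  # (75, 64): 7 x 10 padded patches plus 5 text tokens
```

Because the number of tokens simply follows the padded patch grid, the sketch accepts arbitrary resolutions and aspect ratios, which is the first limitation of encoder-based designs listed above. Zero-padding is only one possible way to handle sizes that are not multiples of the patch size.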