Abstract: Vision-Language Models (VLMs) have emerged as key enablers for multimodal tasks, but their reliance on separate visual encoders introduces challenges in efficiency, scalability, and modality alignment. To address these limitations, we propose MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model that seamlessly integrates visual and textual inputs through a novel Vision-Token Adapter (VTA) and adaptive co-attention mechanism. By eliminating the need for a visual encoder, MUDAIF achieves enhanced efficiency, flexibility, and cross-modal understanding. Trained on a large-scale dataset of 45M image-text pairs, MUDAIF consistently outperforms state-of-the-art methods across multiple benchmarks, including VQA, image captioning, and multimodal reasoning tasks. Extensive analyses and human evaluations demonstrate MUDAIF's robustness, generalization capabilities, and practical usability, establishing it as a new standard in encoder-free vision-language models.

What problem does this paper attempt to address?

This paper addresses the efficiency, scalability, and modality-alignment problems that arise when Vision-Language Models (VLMs) rely on a separate visual encoder. Traditional VLMs follow a two-stage framework: a visual encoder first extracts image features, and a language model then processes those features to generate output. Although effective, this architecture has three limitations:

1. **Limited input resolution and aspect ratio**: The visual encoder is pre-trained on a specific image distribution, which limits its ability to handle images of arbitrary resolution and aspect ratio.
2. **Increased computational overhead**: The visual encoder adds computational cost during both training and inference, which complicates deployment in resource-constrained environments.
3. **Insufficient cross-modal alignment**: The output of the visual encoder lacks fine-grained alignment with the latent space of the language model, which weakens cross-modal reasoning and integration.

To solve these problems, the paper proposes MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model. MUDAIF eliminates the visual encoder and integrates visual and textual inputs through the **Vision-Token Adapter (VTA)** and the **adaptive co-attention mechanism**.

Specific contributions include:

- Proposing MUDAIF, a new decoder-only vision-language model in which the VTA converts raw visual features into pseudo-text tokens that the language model can process directly.
- Introducing an adaptive co-attention mechanism that provides bidirectional interaction between visual and textual information and improves cross-modal fusion.
- Pre-training on a large-scale dataset of 45 million image-text pairs, combined with multimodal instruction fine-tuning, so that the model performs strongly on multiple benchmarks, including Visual Question Answering (VQA), image captioning, and multimodal reasoning tasks.

In conclusion, MUDAIF aims to improve efficiency, flexibility, and cross-modal understanding by eliminating the visual encoder, with the goal of setting a new standard for encoder-free vision-language models.
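
The idea of turning an image into pseudo-text tokens and letting the two modalities attend to each other can be illustrated with a small NumPy sketch. This is not the paper's implementation: the summary does not specify how the VTA or the co-attention is built, so the patch-plus-linear-projection adapter, the fixed scalar `gate`, and all function names and sizes below are assumptions chosen only to show the data flow.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    # Crop to a multiple of the patch size, so any resolution or aspect ratio works.
    image = image[: gh * patch, : gw * patch]
    patches = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch * patch * c)

def vision_token_adapter(image, w_proj, patch):
    """Hypothetical VTA: project each patch into the language model's embedding space."""
    return patchify(image, patch) @ w_proj

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

def co_attention(vis, txt, gate=0.5):
    """Bidirectional fusion: each modality attends to the other.

    The scalar gate is an assumed stand-in for whatever adaptive
    weighting the paper actually uses.
    """
    vis_out = vis + gate * cross_attend(vis, txt)
    txt_out = txt + gate * cross_attend(txt, vis)
    return vis_out, txt_out

rng = np.random.default_rng(0)
d_model, patch = 32, 8
image = rng.random((48, 80, 3))                      # non-square input image
w_proj = rng.normal(scale=0.02, size=(patch * patch * 3, d_model))
text_emb = rng.normal(size=(5, d_model))             # 5 placeholder text-token embeddings

vis_tokens = vision_token_adapter(image, w_proj, patch)   # one pseudo-token per patch
fused_vis, fused_txt = co_attention(vis_tokens, text_emb)
sequence = np.concatenate([fused_vis, fused_txt], axis=0) # single input sequence for the decoder
print(vis_tokens.shape, sequence.shape)
```

Because the adapter works patch by patch, the number of pseudo-tokens simply follows the image size, which is one way an encoder-free design could avoid a fixed input resolution.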

Optimizing Vision-Language Interactions Through Decoder-Only Models
