Abstract: Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size); (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. Our code is available at https://github.com/SaraGhazanfari/EMMA

What problem does this paper attempt to address?

This paper addresses the insufficiently optimized fusion of visual and text encodings in multimodal large language models (MLLMs). Specifically, current state-of-the-art multimodal models face the following challenges when fusing visual feature encodings with the language model:

1. **Static visual encoding**: Existing multimodal models usually rely on fixed visual feature encodings extracted from vision foundation models. These encodings are generated without considering the specific instruction, making it difficult for the model to adapt dynamically to a given task or context.
2. **Increased complexity**: To improve the fusion of visual and text encodings, existing methods introduce complex modality adaptation modules, which significantly increase both model complexity and the amount of training data required.
3. **Limited performance improvement**: Despite these complex modules, the performance gains are marginal on many benchmarks, and the resulting models sometimes even underperform the baseline.

To address these problems, the paper proposes EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modal module that efficiently fuses visual and text encodings to generate instruction-aware visual representations. The main contributions of EMMA include:

1. **Efficient early-fusion mechanism**: Fuses visual and language representations early, with a minimal parameter increase (less than 0.2% growth in model size).
2. **In-depth interpretability analysis**: Provides a detailed analysis of the internal mechanisms of the proposed method, revealing how visual and text tokens are integrated.
3. **Comprehensive experimental verification**: Shows, through a series of benchmark tests, significant improvements on specialized and general-purpose benchmarks, especially in reducing hallucinations.

In summary, the goal of this paper is to improve how multimodal large language models fuse visual and text information, and to reduce the computational cost and the need for training data by proposing a more efficient and lightweight modality adaptation mechanism.
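To make the idea of instruction-aware visual representations concrete, here is a minimal pure-Python sketch of one possible early-fusion module. It is an illustration under stated assumptions, not the authors' implementation: the class `EarlyFusion`, the mean-pooling of instruction tokens, the single linear layer, and the identity-style initialisation are all hypothetical choices made for this example. The helper `param_overhead` only shows how a relative parameter increase such as the "less than 0.2%" figure could be computed.

```python
def linear(weights, bias, x):
    """Apply y = W x + b to a single vector x (plain matrix-vector product)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


class EarlyFusion:
    """Hypothetical fusion module (an assumption, not the paper's design):
    one linear layer applied to [visual token ; pooled instruction]."""

    def __init__(self, vis_dim, txt_dim, txt_init=0.01):
        in_dim = vis_dim + txt_dim
        # Visual block starts as the identity, so the module initially passes
        # visual tokens through; the text block starts at a small constant.
        self.weights = [
            [(1.0 if j == i else 0.0) if j < vis_dim else txt_init
             for j in range(in_dim)]
            for i in range(vis_dim)
        ]
        self.bias = [0.0] * vis_dim

    def num_params(self):
        """Number of weights plus biases in the single linear layer."""
        return len(self.weights) * len(self.weights[0]) + len(self.bias)

    def __call__(self, visual_tokens, text_tokens):
        # Summarise the instruction once, then condition every visual token on it.
        instruction = mean_pool(text_tokens)
        return [linear(self.weights, self.bias, v + instruction)
                for v in visual_tokens]


def param_overhead(added_params, base_params):
    """Relative parameter increase, in percent."""
    return 100.0 * added_params / base_params
```

With this sketch, the same image tokens yield different outputs for different instructions, which is the behaviour that "instruction-aware" describes. Calling `param_overhead(EarlyFusion(1024, 768).num_params(), 7_000_000_000)` with these purely illustrative dimensions and an assumed 7B-parameter language model gives an overhead well below 0.2%.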

EMMA: Efficient Visual Alignment in Multi-Modal LLMs

EMMA: Empowering Multi-modal Mamba with Structural and Hierarchical Alignment

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

InfMLLM: A Unified Framework for Visual-Language Tasks

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs

eP-ALM: Efficient Perceptual Augmentation of Language Models

Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

CROME: Cross-Modal Adapters for Efficient Multimodal LLM

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Semantic Alignment for Multimodal Large Language Models

Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

OneLLM: One Framework to Align All Modalities with Language