EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami
2024-10-03
Abstract: Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size); (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations. Our code is available at https://github.com/SaraGhazanfari/EMMA
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the suboptimal fusion of visual and text encodings in multimodal large language models (MLLMs). Specifically, current state-of-the-art multimodal models face the following challenges when fusing visual feature encodings with the language model:

1. **Static visual encoding**: Existing multimodal models usually rely on fixed visual feature encodings extracted from vision foundation models. These encodings are generated without considering the specific instruction, making it difficult for the model to adapt dynamically to a given task or context.
2. **Increased complexity**: To improve the fusion of visual and text encodings, existing methods introduce complex modality adaptation modules, which significantly increase model complexity and the amount of training data required.
3. **Limited performance improvement**: Despite the added complexity, the performance gains are not evident on many benchmarks, and these methods sometimes fall below the baseline model.

To solve these problems, the paper proposes EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modal module that efficiently fuses visual and text encodings to generate instruction-aware visual representations. The main contributions of EMMA include:

1. **Efficient early-fusion mechanism**: Fuses visual and language representations early, with a minimal parameter increase (less than 0.2% growth in model size).
2. **In-depth interpretability analysis**: Provides a detailed analysis of the internal mechanisms of the proposed method, revealing how visual and text tokens are integrated.
3. **Comprehensive experimental verification**: Demonstrates, across a series of benchmarks, notable improvements on both specialized and general-purpose tasks (up to 9.3%), along with greater robustness against hallucinations.
In summary, the paper aims to improve how multimodal large language models fuse visual and text encodings, while reducing computational cost and training-data requirements through a more efficient, lightweight modality adaptation mechanism.
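To make the early-fusion idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the fusion is a single linear map applied along the token axis to the concatenated visual and instruction tokens; the summary above does not specify the actual layer. The function names, the identity initialisation, and the token counts in the usage note are all hypothetical. Plain Python lists keep the example dependency-free.

```python
def init_fusion_weights(n_vis, n_txt):
    """Hypothetical init: identity on visual tokens, zeros on text tokens.

    With this choice the module initially passes the visual tokens through
    unchanged, so training starts from the behaviour of the base model.
    """
    weight = [[1.0 if j == i else 0.0 for j in range(n_vis + n_txt)]
              for i in range(n_vis)]
    bias = [0.0] * n_vis
    return weight, bias


def early_fusion(visual_tokens, text_tokens, weight, bias):
    """Mix visual and instruction tokens with one linear map over the token axis.

    visual_tokens: n_vis rows of length d; text_tokens: n_txt rows of length d.
    weight: n_vis rows of length n_vis + n_txt; bias: n_vis scalars.
    Returns n_vis rows of length d (instruction-aware visual tokens).
    """
    stacked = visual_tokens + text_tokens  # concatenate along the token axis
    dim = len(stacked[0])
    return [
        [sum(w * tok[k] for w, tok in zip(row, stacked)) + b for k in range(dim)]
        for row, b in zip(weight, bias)
    ]


def added_parameter_fraction(n_vis, n_txt, base_params):
    """Parameters added by the fusion layer, relative to the base model size."""
    added = n_vis * (n_vis + n_txt) + n_vis  # weight matrix plus bias
    return added / base_params


# Tiny demo: two visual tokens and one instruction token, each of dimension 2.
visual = [[1.0, 2.0], [3.0, 4.0]]
text = [[5.0, 6.0]]
weight, bias = init_fusion_weights(n_vis=2, n_txt=1)
fused = early_fusion(visual, text, weight, bias)
```

As an illustration only, with assumed sizes of 576 visual tokens, 77 instruction tokens, and a 7-billion-parameter language model (none of which are taken from the text above), `added_parameter_fraction(576, 77, 7_000_000_000)` comes out far below the 0.2% budget quoted in the abstract. This shows how a token-axis linear map can stay lightweight; it is not a reconstruction of the paper's parameter count.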