Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang
2024-06-05
Abstract: In recent years, the application of multimodal large language models (MLLM) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are composed of the well-known Transformer network, which has a less efficient quadratic computation complexity. To improve the efficiency of such basic models, we propose Cobra, a linear computational complexity MLLM. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due to Cobra's linear sequential modeling. (2) Interestingly, the results of closed-set challenging prediction benchmarks show that Cobra performs well in overcoming visual illusions and spatial relationship judgments. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all codes of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLM. Our project page is available at: https://sites.google.com/view/cobravlm
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to improve the computational efficiency of multi-modal large language models (MLLMs). Current MLLMs are usually built on Transformer networks, whose computational complexity is quadratic, which makes them inefficient when processing large-scale data. To overcome this, the paper proposes Cobra, an MLLM with linear computational complexity. Cobra integrates the efficient Mamba language model into the visual modality and explores different modality fusion schemes to create an effective multi-modal Mamba. Specifically, Cobra uses a state-space model (SSM) as its core architecture instead of the attention-based Transformer, which lets it significantly reduce computational cost while maintaining high performance.

The key contributions of the paper are:

1. **Proposing the Cobra model**: Cobra is a multi-modal large language model with linear computational complexity, aimed at improving the computational efficiency of existing MLLMs.
2. **Research on modality fusion**: The paper studies multiple modality fusion schemes for combining visual and linguistic information and identifies the most effective of those it tested.
3. **Experimental verification**: Across multiple benchmarks, the experiments show that Cobra is comparable in performance to existing efficient methods, performs better on certain specific tasks, and is faster because of its linear sequence modeling.

Together, these contributions improve the computational efficiency of multi-modal large language models and suggest new directions for future research.
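To make the linear-versus-quadratic contrast above concrete, here is a minimal sketch in plain Python. It is not Cobra's or Mamba's implementation: it uses a scalar state-space recurrence with fixed, hypothetical parameters `a`, `b`, and `c`, whereas Mamba uses a selective SSM whose parameters depend on the input. The helper names (`ssm_scan`, `attention_pair_count`, `ssm_step_count`) are illustrative assumptions. The point is only that a recurrent scan does one constant-cost update per token, while full self-attention scores every token against every other token.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Each token triggers one constant-cost state update, so the total
    work grows linearly with the sequence length.
    NOTE: a, b, c are fixed here for illustration (an assumption);
    Mamba makes its SSM parameters input-dependent.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # fold the new token into the running state
        ys.append(c * h)   # read the output from the state
    return ys


def attention_pair_count(seq_len):
    """Query-key scores computed by full self-attention: L * L."""
    return seq_len * seq_len


def ssm_step_count(seq_len):
    """State updates performed by a recurrent scan: L."""
    return seq_len


if __name__ == "__main__":
    # Impulse response of the toy recurrence with a = 0.5.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))

    # Compare how the two operation counts grow with sequence length.
    for length in (256, 1024, 4096):
        print(length, ssm_step_count(length), attention_pair_count(length))
```

Quadrupling the sequence length quadruples the scan's step count but multiplies the attention pair count by sixteen, which is the efficiency gap the paper targets.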