Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, Donglin Wang

2024-06-05

Abstract: In recent years, the application of multimodal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation model for many downstream tasks, current MLLMs are composed of the well-known Transformer network, which has a less efficient quadratic computation complexity. To improve the efficiency of such basic models, we propose Cobra, a linear computational complexity MLLM. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and has faster speed due to Cobra's linear sequential modeling. (2) Interestingly, the results of closed-set challenging prediction benchmarks show that Cobra performs well in overcoming visual illusions and spatial relationship judgments. (3) Notably, Cobra even achieves comparable performance to LLaVA with about 43% of the number of parameters. We will make all code of Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to improve the computational efficiency of multi-modal large language models (MLLMs). Current MLLMs are usually built on Transformer networks, whose attention mechanism has a cost that grows quadratically with sequence length and therefore becomes inefficient on long inputs. To address this, the paper proposes Cobra, an MLLM with linear computational complexity. Cobra integrates the efficient Mamba language model into the visual modality and explores different modality fusion schemes to create an effective multi-modal Mamba. Specifically, Cobra uses a state-space model (SSM) as its core architecture instead of the attention-based Transformer, which lets it reduce the consumption of computational resources while maintaining competitive performance.

The key contributions of the paper include:

1. **Proposing the Cobra model**: Cobra is a multi-modal large language model with linear computational complexity, aiming to improve the computational efficiency of existing MLLMs.
2. **Research on modality fusion**: Multiple modality fusion schemes are studied to find an effective way of combining visual and linguistic information.
3. **Experimental verification**: Across multiple benchmarks, Cobra is shown to be competitive with existing efficient methods (LLaVA-Phi, TinyLLaVA, and MobileVLM v2), to perform well on tasks involving visual illusions and spatial relationship judgments, and to run faster because of its linear sequence modeling.

These contributions improve the computational efficiency of multi-modal large language models and suggest directions for future research.
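The gap between linear and quadratic complexity can be illustrated with a toy comparison. The sketch below is not Cobra's or Mamba's implementation: `ssm_scan` is a hypothetical scalar state-space recurrence with fixed, made-up coefficients `a`, `b`, `c` (Mamba uses input-dependent, multi-dimensional parameters), and `causal_attention_pairs` only counts the query-key scores a causal attention layer would compute. It is meant to show how the work scales with sequence length, nothing more.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar state-space recurrence (illustrative, not Mamba's selective scan).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Returns the outputs and the number of state updates performed.
    """
    h = 0.0
    ys = []
    steps = 0
    for x in xs:
        h = a * h + b * x      # one constant-cost state update per token
        ys.append(c * h)
        steps += 1
    return ys, steps


def causal_attention_pairs(n):
    """Count (query, key) score computations in causal self-attention over n tokens."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        _, steps = ssm_scan([1.0] * n)
        print(f"n={n}: ssm updates={steps}, attention pairs={causal_attention_pairs(n)}")
```

Doubling the sequence length doubles the number of recurrence updates, while the number of attention score computations roughly quadruples. The recurrence also carries a fixed-size state from one token to the next, so per-token generation cost does not grow with the length of the context.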


ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

VL-Mamba: Exploring State Space Models for Multimodal Learning

InfMLLM: A Unified Framework for Visual-Language Tasks

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

CoBa: Convergence Balancer for Multitask Finetuning of Large Language Models

Demystify Mamba in Vision: A Linear Attention Perspective

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models