Mamba in Vision: A Comprehensive Survey of Techniques and Applications

Md Maklachur Rahman, Abdullah Aman Tutul, Ankur Nath, Lamyanba Laishram, Soon Ki Jung, Tracy Hammond

2024-10-04

Abstract: Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local features, they often struggle to capture long-range dependencies without complex architectural modifications. In contrast, ViTs effectively model global relationships but suffer from high computational costs due to the quadratic complexity of their self-attention mechanisms. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity. This survey analyzes the unique contributions, computational benefits, and applications of Mamba models while also identifying challenges and potential future research directions. We provide a foundational resource for advancing the understanding and growth of Mamba models in computer vision. An overview of this work is available at https://github.com/maklachur/Mamba-in-Computer-Vision.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper addresses the challenges that convolutional neural networks (CNNs) and vision transformers (ViTs) face in computer vision. Specifically:

1. **Limitations of CNNs**:
   - **Local feature extraction**: CNNs extract local features well, but their local receptive fields make it difficult to capture long-range dependencies.
   - **Complex architecture requirements**: Overcoming this limitation usually requires deeper and more complex architectures, which raise computational cost and reduce efficiency.
2. **Limitations of ViTs**:
   - **High computational cost**: ViTs model global relationships effectively through self-attention, but the mechanism has quadratic complexity in the number of tokens, which makes it inefficient for high-resolution and real-time applications.
3. **Proposal of the Mamba model**:
   - **Linear computational complexity**: By using Selective Structured State Space Models, Mamba captures long-range dependencies while keeping computational complexity linear.
   - **Balancing performance and efficiency**: Mamba aims to combine the advantages of CNNs and ViTs while overcoming their shortcomings, striking a balance between performance and computational efficiency.

In summary, the main objective of this paper is to introduce and analyze Mamba models, present their unique contributions, computational advantages, and application prospects in computer vision tasks, and identify existing challenges and future research directions.
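The contrast between a linear-time selective scan and quadratic self-attention can be illustrated with a small sketch. This is not the authors' implementation or any library's API: it assumes a single feature channel, a diagonal state matrix, and a simplified discretization, and the names `selective_scan`, `cost_estimate`, `w_delta`, `w_b`, and `w_c` are hypothetical choices made for illustration only.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size delta positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(x, a, w_delta, w_b, w_c):
    """Run a single-channel selective state space recurrence.

    x: list of floats, one scalar per token (e.g. flattened image patches)
    a: list of negative floats, the diagonal of the state matrix A
    w_delta, w_b, w_c: hypothetical weights (assumptions, not from the paper)
        that make the step size, input matrix, and output matrix depend on
        the current token, which is the "selective" part.
    """
    n = len(a)
    h = [0.0] * n  # hidden state carried across tokens
    ys = []
    for x_t in x:  # single pass over the sequence: O(L * N) work
        delta = softplus(w_delta * x_t)
        b_t = [w * x_t for w in w_b]
        c_t = [w * x_t for w in w_c]
        for i in range(n):
            a_bar = math.exp(delta * a[i])  # discretized state transition
            b_bar = delta * b_t[i]          # simplified (Euler) input term
            h[i] = a_bar * h[i] + b_bar * x_t
        ys.append(sum(c_t[i] * h[i] for i in range(n)))
    return ys


def cost_estimate(seq_len, state_dim):
    # Rough counts only: state updates for the scan versus token pairs
    # scored by full self-attention.
    return {
        "scan_steps": seq_len * state_dim,
        "attention_pairs": seq_len * seq_len,
    }


if __name__ == "__main__":
    tokens = [1.0, 0.5, -0.2, 0.8]
    print(selective_scan(tokens, [-1.0, -2.0], 0.5, [1.0, 0.5], [0.3, -0.1]))
    print(cost_estimate(196, 16))  # 196 tokens = 14 x 14 patches
```

Under these assumptions, doubling the number of tokens doubles the scan's state updates but quadruples the number of attention pairs. The counts are rough indicators of scaling, not measured runtimes.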

A Survey on Vision Mamba: Models, Applications and Challenges

Visual Mamba: A Survey and New Outlooks

A Survey on Visual Mamba

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Vision Mamba: A Comprehensive Survey and Taxonomy

VMamba: Visual State Space Model

Vision Mamba for Classification of Breast Ultrasound Images

A Survey of Mamba

A Comprehensive Survey of Mamba Architectures for Medical Image Analysis: Classification, Segmentation, Restoration and Beyond

Demystify Mamba in Vision: A Linear Attention Perspective

MedMamba: Vision Mamba for Medical Image Classification

QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model

SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

LocalMamba: Visual State Space Model with Windowed Selective Scan

VideoMamba: State Space Model for Efficient Video Understanding

MambaOut: Do We Really Need Mamba for Vision?

TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

MambaVC: Learned Visual Compression with Selective State Spaces