Mamba in Vision: A Comprehensive Survey of Techniques and Applications

Md Maklachur Rahman, Abdullah Aman Tutul, Ankur Nath, Lamyanba Laishram, Soon Ki Jung, Tracy Hammond
2024-10-04
Abstract: Mamba is emerging as a novel approach to overcome the challenges faced by Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in computer vision. While CNNs excel at extracting local features, they often struggle to capture long-range dependencies without complex architectural modifications. In contrast, ViTs effectively model global relationships but suffer from high computational costs due to the quadratic complexity of their self-attention mechanisms. Mamba addresses these limitations by leveraging Selective Structured State Space Models to effectively capture long-range dependencies with linear computational complexity. This survey analyzes the unique contributions, computational benefits, and applications of Mamba models while also identifying challenges and potential future research directions. We provide a foundational resource for advancing the understanding and growth of Mamba models in computer vision. An overview of this work is available at https://github.com/maklachur/Mamba-in-Computer-Vision.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenges that convolutional neural networks (CNNs) and vision transformers (ViTs) face in computer vision. Specifically:

1. **Limitations of CNNs**:
   - **Local feature extraction**: CNNs perform well at extracting local features, but their local receptive fields make it difficult to capture long-range dependencies.
   - **Complex architecture requirements**: Overcoming this limitation usually requires deeper and more complex architectures, which increase computational cost and reduce efficiency.
2. **Limitations of ViTs**:
   - **High computational cost**: ViTs model global relationships effectively through self-attention, but this mechanism has quadratic complexity, which makes it inefficient for high-resolution and real-time applications.
3. **Proposal of the Mamba model**:
   - **Linear computational complexity**: By using Selective Structured State Space Models, Mamba captures long-range dependencies while maintaining linear computational complexity.
   - **Balancing performance and efficiency**: Mamba aims to combine the advantages of CNNs and ViTs while overcoming their shortcomings, balancing performance against computational efficiency.

In summary, the main objective of this paper is to introduce and analyze Mamba models, present their unique contributions, computational advantages, and applications in computer vision tasks, and identify existing challenges and future research directions.
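
The linear-versus-quadratic contrast described above can be illustrated with a toy sketch. This is not the paper's or Mamba's actual implementation: it assumes a scalar hidden state, a simplified discretization, and hypothetical weights (`w_delta`, `w_b`, `w_c`, `a`) chosen only for illustration. It shows that a selective state space recurrence visits each token once, whereas self-attention compares every pair of tokens.

```python
import math


def selective_scan(xs, w_delta=1.0, w_b=1.0, w_c=1.0, a=-1.0):
    """Toy scalar selective SSM (illustrative assumption, not real Mamba).

    The step size, input projection, and output projection all depend on
    the current input x, which is what "selective" refers to here.
    Returns the outputs and the number of recurrence steps performed.
    """
    h = 0.0          # hidden state carried across the sequence
    ys = []
    steps = 0
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus step size
        a_bar = math.exp(delta * a)                # discretized transition
        b_bar = delta * (w_b * x)                  # simplified input term
        h = a_bar * h + b_bar * x                  # one state update per token
        ys.append((w_c * x) * h)                   # input-dependent readout
        steps += 1
    return ys, steps


def attention_pair_count(length):
    """Number of token pairs scored by full self-attention."""
    return length * length


if __name__ == "__main__":
    for length in (256, 1024, 4096):
        _, steps = selective_scan([0.1] * length)
        print(length, steps, attention_pair_count(length))
```

Under these assumptions, quadrupling the sequence length quadruples the number of recurrence steps but multiplies the number of attention pairs by sixteen, which is the scaling gap that matters for high-resolution inputs.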