Abstract: For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses how to efficiently capture global dependencies in images when deploying neural networks in resource-constrained environments. Specifically, the authors propose a new architecture, **Efficient Vision Mamba (EfficientViM)**, built on Hidden State Mixer-based State Space Duality (HSM-SSD), which captures global dependencies at a lower computational cost.

#### Main problems and background

1. **Requirement for lightweight architectures**:
   - In resource-constrained environments such as mobile and edge devices, traditional Convolutional Neural Networks (CNNs) and attention mechanisms can capture local and global dependencies, respectively, but their computational cost is high, which makes it difficult to meet the requirements of practical applications.
   - In particular, the quadratic computational complexity of the self-attention mechanism (\(O(L^2 D)\), where \(L\) is the number of tokens and \(D\) is the number of channels) makes it inefficient when processing large-scale data.
2. **Limitations of existing methods**:
   - Although some works attempt to reduce the computational cost by approximating self-attention or limiting the number of tokens, these methods still have bottlenecks, especially when processing high-resolution images.
   - State Space Models (SSMs) have become a promising alternative due to their linear computational complexity (\(O(LD)\)), but their exploration in visual tasks is relatively limited.

#### Innovations of EfficientViM

1. **Design of HSM-SSD layers**:
   - The authors redesign the standard SSD layer, moving the channel-mixing operation from the image feature space to the hidden-state space, thereby alleviating the main bottleneck while maintaining the generalization ability of the model.
   - The specific formula is as follows:
     \[ h = (a\mathbf{1}_N^\top \odot B)^\top x_{in} W_{in} \]
     where \(a\) is the importance weight, \(B\) is the projection matrix, \(x_{in}\) is the input feature, and \(W_{in}\) is a learnable matrix.
2. **Multi-stage hidden-state fusion**:
   - The authors introduce a Multi-stage Hidden State Fusion (MSF) mechanism, which enhances the representational ability of the model by combining the logits predicted from hidden states at different stages.
   - The calculation formula is as follows:
     \[ z = \sum_{s = 0}^{S} \hat{\beta}(s) z(s) \]
     where \(\hat{\beta}(s)\) is the normalized weight and \(z(s)\) is the logit at the \(s\)-th stage.
3. **Single-head HSM-SSD**:
   - Eliminating the tensor operations (such as reshape and copy) required by the multi-head configuration further reduces the bottleneck caused by memory access and increases throughput.

#### Experimental results

- **ImageNet-1K classification task**:
  - EfficientViM outperforms previous efficient networks, such as MobileNetV3-L 0.75 and EfficientViT-M3, in both speed and accuracy, achieving approximately 90% and 30% speed improvements, respectively.
  - EfficientViM-M2 is about 4 times faster than MobileViTV2 0.75 and has a 0.2% performance improvement.
- **High-resolution image scalability**:
  - On high-resolution images (such as 384x384 and 512x512), the throughput advantage of EfficientViM is more pronounced, being more than 15% faster than SHViT.
- **Distillation training**:
  - After distillation training, EfficientViM still has a good performance in terms of speed -
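
The two formulas quoted in the summary above (the HSM-SSD hidden state and the multi-stage fusion of logits) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The tensor shapes (\(a \in \mathbb{R}^{L}\), \(B \in \mathbb{R}^{L \times N}\), \(x_{in} \in \mathbb{R}^{L \times D}\), \(W_{in} \in \mathbb{R}^{D \times D}\)), the function names `matmul`, `hidden_state`, and `fuse_logits`, and the use of a softmax to obtain the normalized weights \(\hat{\beta}(s)\) are all assumptions made for illustration; the summary only states that the weights are normalized.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def hidden_state(a, B, x_in, W_in):
    """Sketch of h = (a 1_N^T ⊙ B)^T x_in W_in under the assumed shapes.

    a: length-L importance weights, B: (L, N), x_in: (L, D), W_in: (D, D).
    Returns h with shape (N, D).
    """
    # (a 1_N^T) ⊙ B simply scales row l of B by a[l]; the result is (L, N).
    weighted_B = [[a_l * b for b in row] for a_l, row in zip(a, B)]
    # Transpose to (N, L) so the L tokens are summarized into N hidden states.
    weighted_B_T = [list(col) for col in zip(*weighted_B)]
    # Channel mixing with W_in acts on the (N, D) hidden states.
    return matmul(matmul(weighted_B_T, x_in), W_in)


def fuse_logits(stage_logits, beta):
    """Sketch of z = sum_s beta_hat(s) * z(s).

    stage_logits: one logit vector per stage; beta: one raw weight per stage.
    Assumption: beta_hat is obtained with a softmax over the raw weights.
    """
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]  # subtract max for numerical stability
    total = sum(exps)
    beta_hat = [e / total for e in exps]
    num_classes = len(stage_logits[0])
    return [
        sum(w * z[c] for w, z in zip(beta_hat, stage_logits))
        for c in range(num_classes)
    ]


if __name__ == "__main__":
    a = [1.0, 0.5, 2.0]                # L = 3 tokens
    B = [[1, 0], [0, 2], [1, 1]]       # N = 2 hidden states
    x_in = [[1, 2], [3, 4], [5, 6]]    # D = 2 channels
    W_in = [[1, 0], [0, 1]]            # identity, so h equals the unmixed state
    print(hidden_state(a, B, x_in, W_in))
    print(fuse_logits([[2.0, 0.0], [4.0, 2.0]], [0.0, 0.0]))
```

Under these assumed shapes, the expensive \(D \times D\) projection is applied to \(N\) hidden states rather than to \(L\) tokens, which is where a saving would come from when \(N\) is much smaller than \(L\).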

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

VSSD: Vision Mamba with Non-Causal State Space Duality

VMamba: Visual State Space Model

Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba

MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

GhostViT: Expediting Vision Transformers Via Cheap Operations

Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model

LocalMamba: Visual State Space Model with Windowed Selective Scan

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

MambaVC: Learned Visual Compression with Selective State Spaces

DVMSR: Distillated Vision Mamba for Efficient Super-Resolution

MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

A Novel State Space Model with Local Enhancement and State Sharing for Image Fusion