EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim
2024-11-22
Abstract: For the deployment of neural networks in resource-constrained environments, prior works have built lightweight architectures with convolution and attention for capturing local and global dependencies, respectively. Recently, the state space model has emerged as an effective global token interaction with its favorable linear computational cost in the number of tokens. Yet, efficient vision backbones built with SSM have been explored less. In this paper, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. In the HSM-SSD layer, we redesign the previous SSD layer to enable the channel mixing operation within hidden states. Additionally, we propose multi-stage hidden state fusion to further reinforce the representation power of hidden states, and provide the design alleviating the bottleneck caused by the memory-bound operations. As a result, the EfficientViM family achieves a new state-of-the-art speed-accuracy trade-off on ImageNet-1k, offering up to a 0.7% performance improvement over the second-best model SHViT with faster speed. Further, we observe significant improvements in throughput and accuracy compared to prior works, when scaling images or employing distillation training. Code is available at https://github.com/mlvlab/EfficientViM.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses how to efficiently capture global dependencies in images when deploying neural networks in resource-constrained environments. The authors propose **Efficient Vision Mamba (EfficientViM)**, an architecture built on Hidden State Mixer-based State Space Duality (HSM-SSD) that captures global dependencies at a lower computational cost.

#### Main problems and background

1. **Requirement for lightweight architectures**:
   - In resource-constrained environments such as mobile and edge devices, traditional Convolutional Neural Networks (CNNs) and attention mechanisms capture local and global dependencies respectively, but their computational cost is high, which makes it difficult to meet the requirements of practical applications.
   - In particular, the quadratic computational complexity of self-attention, \(O(L^2 D)\), where \(L\) is the number of tokens and \(D\) is the number of channels, makes it inefficient when processing large-scale data.
2. **Limitations of existing methods**:
   - Some works reduce the computational cost by approximating self-attention or limiting the number of tokens, but these methods still have bottlenecks, especially on high-resolution images.
   - State Space Models (SSMs) are a promising alternative because of their linear computational complexity, \(O(LD)\), but they have been explored relatively little in vision tasks.

#### Innovations of EfficientViM

1. **Design of HSM-SSD layers**:
   - The authors redesign the standard SSD layer, moving the channel-mixing operation from the image feature space to the hidden-state space, thereby alleviating the main bottleneck while maintaining the generalization ability of the model.
   - The specific formula is as follows:
     \[ h = (a\mathbf{1}_N^\top \odot B)^\top x_{in} W_{in} \]
     where \(a\) is the importance weight, \(B\) is the projection matrix, \(x_{in}\) is the input feature, and \(W_{in}\) is a learnable matrix.
2. **Multi-stage hidden-state fusion**:
   - A Multi-stage Hidden State Fusion (MSF) mechanism enhances the representational ability of the model by combining the logits predicted from hidden states at different stages.
   - The calculation formula is as follows:
     \[ z = \sum_{s = 0}^{S} \hat{\beta}(s) z(s) \]
     where \(\hat{\beta}(s)\) is the normalized weight and \(z(s)\) is the logit at the \(s\)-th stage.
3. **Single-head HSM-SSD**:
   - Eliminating the tensor operations (such as reshape and copy) required by the multi-head configuration further reduces the bottleneck caused by memory access and increases throughput.

#### Experimental results

- **ImageNet-1K classification task**:
  - EfficientViM outperforms previous efficient networks such as MobileNetV3-L 0.75 and EfficientViT-M3 in both speed and accuracy, with approximately 90% and 30% speed improvements respectively.
  - EfficientViM-M2 is about 4 times faster than MobileViTV2 0.75 and has a 0.2% performance improvement.
- **High-resolution image scalability**:
  - On high-resolution images (such as 384x384 and 512x512), the throughput advantage of EfficientViM is more pronounced: it is more than 15% faster than SHViT.
- **Distillation training**:
  - After distillation training, EfficientViM still performs well in terms of speed -
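
The hidden-state formula and the multi-stage fusion sum given under "Innovations of EfficientViM" can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `hidden_state` and `fuse_logits`, the tensor shapes, and the use of a softmax to obtain the normalized weights \(\hat{\beta}(s)\) are all assumptions made for illustration, since the summary only states that the weights are normalized.

```python
import numpy as np


def hidden_state(a, B, x_in, W_in):
    """Compute h = (a 1_N^T ⊙ B)^T x_in W_in.

    Assumed shapes (not stated in the summary):
      a: (L,)  B: (L, N)  x_in: (L, D)  W_in: (D, D)  ->  h: (N, D)
    """
    L, N = B.shape
    # a 1_N^T repeats each token's importance weight across the N states.
    weighted_B = np.outer(a, np.ones(N)) * B      # (L, N)
    # Project the L tokens onto N hidden states.
    return weighted_B.T @ (x_in @ W_in)           # (N, D)


def fuse_logits(beta, stage_logits):
    """Compute z = sum_s beta_hat(s) * z(s).

    Assumption: beta_hat is a softmax over raw per-stage weights `beta`.
    """
    beta = np.asarray(beta, dtype=float)
    beta_hat = np.exp(beta - beta.max())
    beta_hat /= beta_hat.sum()
    return sum(w * z for w, z in zip(beta_hat, stage_logits))


# Toy sizes: L tokens, N hidden states, D channels (hypothetical values).
rng = np.random.default_rng(0)
L, N, D = 196, 8, 32
a = rng.random(L)
B = rng.standard_normal((L, N))
x_in = rng.standard_normal((L, D))
W_in = rng.standard_normal((D, D))

h = hidden_state(a, B, x_in, W_in)
print(h.shape)  # (8, 32)
```

In this sketch `h` has N rows rather than L, so any channel-mixing layer applied to `h` operates on N·D entries instead of L·D. With N much smaller than L, that is the sense in which moving channel mixing into the hidden-state space lowers cost.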