MambaVC: Learned Visual Compression with Selective State Spaces

Shiyu Qin,Jinpeng Wang,Yimin Zhou,Bin Chen,Tianci Luo,Baoyi An,Tao Dai,Shutao Xia,Yaowei Wang
2024-05-28
Abstract:Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, where balancing efficacy (i.e., rate-distortion trade-off) and efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, we take the first step to explore SSMs for visual compression. We introduce MambaVC, a simple, strong and efficient compression network based on SSM. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, which helps to capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC shows even greater improvements with high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages. Code is available at <a class="link-external link-https" href="https://github.com/QinSY123/2024-MambaVC" rel="external noopener nofollow">this https URL</a>.
Image and Video Processing,Computer Vision and Pattern Recognition,Information Theory
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the long - standing trade - off between efficiency and effectiveness in the field of visual compression. Specifically, traditional visual compression methods based on CNN (Convolutional Neural Network) and Transformer have difficulty achieving both efficient computational performance and excellent compression quality when processing high - resolution images. Due to the limitation of its local receptive field, CNN performs poorly in capturing global context information; while Transformer can capture global information well, but its computational complexity and memory consumption are too high, resulting in low efficiency. To solve these problems, the authors propose a new visual compression network based on State Space Models (SSMs) - MambaVC. MambaVC enhances the ability to capture global context information by introducing Selective State Spaces, especially by designing the 2D Selective Scanning (2DSS) module, thereby improving compression performance and reducing computational and memory costs. ### Main contributions 1. **Innovative network structure**: Developed MambaVC, which is the first network to use selective state spaces for visual compression. The introduced VSS block (Visual State Space block) and 2DSS module significantly improve the ability of global context modeling. 2. **Superior performance**: On multiple benchmark datasets, MambaVC achieves better rate - distortion performance than existing CNN and Transformer methods, and reduces the amount of computation and memory usage. 3. **Advantages in high - resolution image compression**: Especially in high - resolution image compression, MambaVC shows a stronger performance improvement, demonstrating its potential and scalability in practical applications. 4. **Comprehensive comparative analysis**: Through a detailed comparison of different network designs, the advantages of MambaVC in various aspects (such as spatial redundancy, effective receptive field, information loss, etc.) are verified. These improvements not only enhance the effect of visual compression, but also provide a new direction for future research, especially in application scenarios that require efficient processing of high - resolution images, such as high - definition medical image compression and high - resolution satellite image transmission.