Abstract: Learned visual compression is an important and active task in multimedia. Existing approaches have explored various CNN- and Transformer-based designs to model content distribution and eliminate redundancy, where balancing efficacy (i.e., rate-distortion trade-off) and efficiency remains a challenge. Recently, state-space models (SSMs) have shown promise due to their long-range modeling capacity and efficiency. Inspired by this, we take the first step to explore SSMs for visual compression. We introduce MambaVC, a simple, strong and efficient compression network based on SSMs. MambaVC develops a visual state space (VSS) block with a 2D selective scanning (2DSS) module as the nonlinear activation function after each downsampling, which helps to capture informative global contexts and enhances compression. On compression benchmark datasets, MambaVC achieves superior rate-distortion performance with lower computational and memory overheads. Specifically, it outperforms CNN and Transformer variants by 9.3% and 15.6% on Kodak, respectively, while reducing computation by 42% and 24%, and saving 12% and 71% of memory. MambaVC shows even greater improvements with high-resolution images, highlighting its potential and scalability in real-world applications. We also provide a comprehensive comparison of different network designs, underscoring MambaVC's advantages. Code is available at https://github.com/QinSY123/2024-MambaVC.

What problem does this paper attempt to address?

The paper addresses the long-standing trade-off between efficiency and efficacy in learned visual compression. Existing CNN- and Transformer-based methods struggle to deliver both low computational cost and strong compression quality, particularly on high-resolution images. CNNs have limited local receptive fields and therefore capture global context poorly. Transformers capture global context well, but their computational complexity and memory consumption are high, which makes them inefficient. To address this, the authors propose MambaVC, a visual compression network based on state-space models (SSMs). MambaVC uses selective state spaces, in particular a 2D selective scanning (2DSS) module, to capture global context, improving compression performance while reducing computational and memory costs.

### Main contributions

1. **Innovative network structure**: The authors develop MambaVC, the first network to use selective state spaces for visual compression. Its visual state space (VSS) block and 2DSS module significantly improve global context modeling.
2. **Superior performance**: On multiple benchmark datasets, MambaVC achieves better rate-distortion performance than existing CNN and Transformer methods while reducing computation and memory usage.
3. **Advantages in high-resolution image compression**: MambaVC's gains are larger on high-resolution images, demonstrating its potential and scalability in practical applications.
4. **Comprehensive comparative analysis**: A detailed comparison of different network designs verifies MambaVC's advantages in several respects, including spatial redundancy, effective receptive field, and information loss.

Beyond improving compression performance, these results suggest a direction for future research, especially for applications that require efficient processing of high-resolution images, such as high-definition medical image compression and high-resolution satellite image transmission.
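To make the idea of scanning a 2D feature map concrete, here is a minimal NumPy sketch. It assumes that 2DSS follows the four-direction scan pattern used in VMamba-style visual SSMs, which the summary above does not confirm. It also replaces the selective (input-dependent) state-space update with a fixed scalar recurrence, so it illustrates only how global context propagates, not MambaVC's actual module. The function names `linear_scan` and `four_direction_scan` and the decay parameter `a` are hypothetical.

```python
import numpy as np


def linear_scan(seq, a=0.9):
    """Fixed scalar recurrence h_t = a * h_{t-1} + x_t along axis 0.

    Simplification: a real selective SSM makes its parameters depend on
    the input; here `a` is a constant chosen for illustration.
    """
    h = np.zeros(seq.shape[1:], dtype=float)
    out = np.empty_like(seq, dtype=float)
    for t in range(seq.shape[0]):
        h = a * h + seq[t]
        out[t] = h
    return out


def four_direction_scan(feat, a=0.9):
    """Scan an (H, W, C) feature map along four paths and sum the results.

    Paths (assumed, VMamba-style): row-major forward and backward,
    column-major forward and backward.
    """
    H, W, C = feat.shape
    row_major = feat.reshape(H * W, C)
    col_major = feat.transpose(1, 0, 2).reshape(H * W, C)

    # Reverse the sequence, scan, then reverse back to get a backward pass.
    rows = linear_scan(row_major, a) + linear_scan(row_major[::-1], a)[::-1]
    cols = linear_scan(col_major, a) + linear_scan(col_major[::-1], a)[::-1]

    # Undo the flattening so both results are back in (H, W, C) layout.
    rows = rows.reshape(H, W, C)
    cols = cols.reshape(W, H, C).transpose(1, 0, 2)
    return rows + cols
```

Feeding in a feature map that is zero except for one position shows why such a scan gives a global receptive field: every output position receives a non-zero contribution from that single input, with a weight that decays with distance along each scan path.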

MambaVC: Learned Visual Compression with Selective State Spaces

VMamba: Visual State Space Model

EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model

MambaSCI: Efficient Mamba-UNet for Quad-Bayer Patterned Video Snapshot Compressive Imaging

Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model

V2M: Visual 2-Dimensional Mamba for Image Representation Learning

LocalMamba: Visual State Space Model with Windowed Selective Scan

VSSD: Vision Mamba with Non-Causal State Space Duality

A Survey on Visual Mamba

Mamba in Vision: A Comprehensive Survey of Techniques and Applications

VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation

EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Efficient Image Compression Using Advanced State Space Models

SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

A Survey on Vision Mamba: Models, Applications and Challenges