Efficient Visual State Space Model for Image Deblurring

Lingshun Kong, Jiangxin Dong, Ming-Hsuan Yang, Jinshan Pan
2024-05-23
Abstract: Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration. ViTs typically yield superior results in image restoration compared to CNNs due to their ability to capture long-range dependencies and input-dependent characteristics. However, the computational complexity of Transformer-based models grows quadratically with the image resolution, limiting their practical appeal in high-resolution image restoration tasks. In this paper, we propose a simple yet effective visual state space model (EVSSM) for image deblurring, bringing the benefits of state space models (SSMs) to visual data. In contrast to existing methods that employ several fixed-direction scans for feature extraction, which significantly increases the computational cost, we develop an efficient visual scan block that applies various geometric transformations before each SSM-based module, capturing useful non-local information and maintaining high efficiency. Extensive experimental results show that the proposed EVSSM performs favorably against state-of-the-art image deblurring methods on benchmark datasets and real-captured images.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three related problems in image deblurring:

1. **Limitations of existing methods**: Convolution is spatially invariant and local, so CNN-based methods struggle to model the spatially varying properties of image content and to capture the non-local information that benefits deblurring.
2. **Trade-off between efficiency and performance**: Transformers capture global information through self-attention and perform well in image restoration, but their computational complexity grows quadratically with input resolution, which limits their use on high-resolution images. Methods that reduce this cost (e.g., local window attention or transposed attention) do so by sacrificing the ability to model non-local or spatial information, which degrades the quality of the restored image.
3. **Need to explore non-local information**: An efficient method is therefore needed that can exploit non-local information without significantly increasing computational cost, while still achieving high-quality deblurring.

To address these issues, the paper proposes a simple yet effective Efficient Visual State Space Model (EVSSM), which applies State Space Models (SSMs) to visual data. EVSSM relies on the ability of SSMs to capture long-range dependencies and employs an Efficient Visual Scan (EVS) strategy: rather than scanning the features in several fixed directions, it applies a geometric transformation before each SSM-based module, which captures non-local spatial information while keeping the computational cost low.
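The scanning idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy recurrence `linear_scan`, its constants `a` and `b`, the `TRANSFORMS` list, and the single-channel 2D feature map are all assumptions made for illustration. The point it shows is that each block performs one scan after a cheap geometric transformation (then undoes the transformation), instead of running several directional scans per block.

```python
import numpy as np

def linear_scan(x, a=0.9, b=0.1):
    """Toy linear recurrence h_t = a * h_{t-1} + b * x_t over a 1D sequence.

    A stand-in for an SSM scan (assumption); real selective SSMs use
    learned, input-dependent parameters instead of fixed scalars.
    """
    h = np.zeros(len(x), dtype=float)
    state = 0.0
    for t, x_t in enumerate(x):
        state = a * state + b * x_t
        h[t] = state
    return h

# Hypothetical (forward, inverse) pairs of geometric transformations.
TRANSFORMS = [
    (lambda f: f, lambda f: f),                            # identity
    (lambda f: f.T, lambda f: f.T),                        # transpose
    (lambda f: f[::-1, ::-1], lambda f: f[::-1, ::-1]),    # flip both axes
    (lambda f: f.T[::-1, ::-1], lambda f: f[::-1, ::-1].T) # transpose + flip
]

def scan_block(feat, block_index):
    """Apply one transformation, run a single scan, then undo the transformation."""
    forward, inverse = TRANSFORMS[block_index % len(TRANSFORMS)]
    g = forward(feat)
    # Flatten row by row, scan once, and restore the 2D layout.
    out = linear_scan(g.reshape(-1)).reshape(g.shape)
    return inverse(out)

feat = np.arange(12.0).reshape(3, 4)
outputs = [scan_block(feat, i) for i in range(4)]
```

Because consecutive blocks use different transformations, a stack of such blocks sees the image in several scan orders while each block pays for only one scan.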
Additionally, the paper introduces an Efficient Discriminative Frequency Domain-based Feedforward Network (EDFFN) module to further enhance the efficiency of feature transformation. Experimental results show that the proposed EVSSM method achieves competitive or even better performance compared to existing state-of-the-art methods on multiple benchmark datasets, especially in the quantitative and qualitative evaluations on datasets such as GoPro, HIDE, and RealBlur. These results demonstrate the effectiveness and efficiency of EVSSM in handling the image deblurring task.
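The summary above does not describe the internals of EDFFN, so the following is only a generic sketch of what frequency-domain feature reweighting can look like, not the paper's module. The function name `frequency_reweight` and the `weight` array (which would be a learned parameter in a network) are hypothetical.

```python
import numpy as np

def frequency_reweight(feat, weight):
    """Scale the frequency components of a 2D feature map.

    feat:   real array of shape (H, W)
    weight: array of shape (H, W // 2 + 1), one multiplier per frequency
            of the real 2D FFT (hypothetical stand-in for a learned weight)
    """
    spectrum = np.fft.rfft2(feat)          # to the frequency domain
    filtered = spectrum * weight           # keep, damp, or boost each frequency
    return np.fft.irfft2(filtered, s=feat.shape)  # back to the spatial domain

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))

# All-ones weight leaves the features unchanged.
identity_weight = np.ones((8, 5))
unchanged = frequency_reweight(feat, identity_weight)

# Keeping only the zero-frequency term yields a constant map equal to the mean.
dc_only = np.zeros((8, 5))
dc_only[0, 0] = 1.0
smoothed = frequency_reweight(feat, dc_only)
```

A single elementwise multiplication in the frequency domain affects every spatial position at once, which is one reason such filtering can be an inexpensive way to decide which frequency content to keep.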