How Powerful Potential of Attention on Image Restoration?

Cong Wang, Jinshan Pan, Yeying Jin, Liyan Wang, Wei Wang, Gang Fu, Wenqi Ren, Xiaochun Cao
2024-03-15
Abstract: Transformers have demonstrated their effectiveness in image restoration tasks. Existing Transformer architectures typically comprise two essential components: multi-head self-attention and feed-forward network (FFN). The former captures long-range pixel dependencies, while the latter enables the model to learn complex patterns and relationships in the data. Previous studies have demonstrated that FFNs are key-value memories \cite{geva2020transformer}, which are vital in modern Transformer architectures. In this paper, we conduct an empirical study to explore the potential of attention mechanisms without using FFN and provide novel structures to demonstrate that removing FFN is flexible for image restoration. Specifically, we propose Continuous Scaling Attention (CSAttn), a method that computes attention continuously in three stages without using FFN. To achieve competitive performance, we propose a series of key components within the attention. Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect the model performance. We apply our CSAttn to several image restoration tasks and show that our model can outperform CNN-based and Transformer-based image restoration approaches.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the question of whether using only the attention mechanism (without using the Feed-Forward Network, FFN) can achieve performance comparable to or better than existing CNN and Transformer methods in image restoration tasks. Specifically, the paper explores the potential of the attention mechanism without FFN and proposes the Continuous Scaling Attention (CSAttn) method to achieve this goal.

### Background of the Paper

- **Image Restoration**: Image restoration aims to recover a clear image from a given degraded image, which is very beneficial for practical applications such as video surveillance and autonomous driving.
- **Existing Methods**:
  - **CNN Methods**: Convolutional Neural Networks (CNNs) perform well in image restoration tasks but are limited by local receptive fields, making it difficult to model long-range pixel dependencies.
  - **Transformer Methods**: Transformers capture long-range pixel dependencies through multi-head self-attention mechanisms and improve model performance through feed-forward networks (FFNs) for nonlinear transformations.

### Research Motivation

- **Role of FFN**: Existing Transformer architectures typically include multi-head self-attention mechanisms and FFNs. FFNs are used for nonlinear transformations and feature enhancement and are considered key components in modern Transformer architectures.
- **Research Question**: Can high performance in image restoration tasks be achieved solely through the attention mechanism without using FFNs?

### Proposed Method

- **Continuous Scaling Attention (CSAttn)**: The paper proposes a new attention mechanism that includes three consecutive attention computations without the need for FFNs.
- **Key Designs**:
  - **Continuous Attention Learning**: Gradually improve model performance through a series of effective designs.
  - **Nonlinear Activation Functions**: Introduce nonlinear activation functions in attention modeling to activate more useful features.
  - **Value Nonlinear Transformation Adjustment**: Adaptively adjust value features to generate more representative information for subsequent attention computations.
  - **Internal Attention Aggregation**: Fuse attention features from different levels to learn better attention representations.
  - **Internal Progressive Multi-Head**: Gradually increase the number of attention heads to implicitly enhance attention representations.
  - **Internal Residual Connections**: Provide more useful features for the next attention computation, further improving restoration quality.
  - **Spatial Scaling Learning**: Save training budget through spatial scaling operations while maintaining superior performance.

### Experimental Results

- **De-raining**: On the Rain100H dataset, CSAttn improved the PSNR by an average of 0.41 dB compared to state-of-the-art methods.
- **De-snowing**: On the CSD and Snow100K datasets, CSAttn achieved the best PSNR and SSIM metrics.
- **Low-Light Image Enhancement**: On the LOL dataset, CSAttn significantly outperformed existing methods, improving PSNR by at least 4.22 dB.
- **Real Image Dehazing**: On the Dense-Haze and NH-Haze datasets, CSAttn performed excellently, especially on the Dense-Haze dataset, where SSIM improved by 6.03%.

### Conclusion

The paper demonstrates through experiments that CSAttn can achieve or even surpass the performance of existing CNN and Transformer methods in multiple image restoration tasks without using FFNs. This indicates that the attention mechanism itself has strong potential and can achieve high-performance image restoration through appropriate design.
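
### Illustrative Sketch

To make the key designs listed under Proposed Method more concrete, the following is a minimal NumPy sketch of an FFN-free block built from three consecutive attention stages. It is not the authors' implementation: the function names (`attention_stage`, `ffn_free_block`, `init_params`), the head schedule `(1, 2, 4)`, the GELU on the value branch, and the mean used for aggregation are all assumptions chosen for illustration. Spatial scaling is omitted, and image features are treated as a flat sequence of tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gelu(x):
    """Tanh approximation of GELU, used here as the nonlinear activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def attention_stage(x, w_q, w_k, w_v, w_o, num_heads):
    """One multi-head self-attention pass over tokens x of shape (n, d)."""
    n, d = x.shape
    hd = d // num_heads  # per-head channel width

    def split(t):
        # (n, d) -> (heads, n, hd)
        return t.reshape(n, num_heads, hd).transpose(1, 0, 2)

    q = split(x @ w_q)
    k = split(x @ w_k)
    # Assumed form of the "value nonlinear transformation": activate the values.
    v = split(gelu(x @ w_v))

    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(hd))  # (heads, n, n)
    out = (weights @ v).transpose(1, 0, 2).reshape(n, d)       # merge heads
    return out @ w_o


def ffn_free_block(x, params, heads=(1, 2, 4)):
    """Three consecutive attention stages with no feed-forward network."""
    h = x
    stage_outputs = []
    for (w_q, w_k, w_v, w_o), num_heads in zip(params, heads):
        # Internal residual connection feeds the next attention stage.
        h = h + attention_stage(h, w_q, w_k, w_v, w_o, num_heads)
        stage_outputs.append(h)
    # Assumed aggregation: a plain mean over the stage outputs.
    return np.mean(stage_outputs, axis=0)


def init_params(d, num_stages, rng):
    """Hypothetical helper: random projection matrices for each stage."""
    scale = 1.0 / np.sqrt(d)
    return [
        tuple(rng.standard_normal((d, d)) * scale for _ in range(4))
        for _ in range(num_stages)
    ]


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens, 8 channels
params = init_params(8, 3, rng)
print(ffn_free_block(tokens, params).shape)  # (16, 8)
```

The head counts grow from stage to stage to mirror the progressive multi-head idea, and each stage adds its output back to its input before the next stage runs. The channel width must be divisible by every head count in the schedule.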