Abstract: Transformers have demonstrated their effectiveness in image restoration tasks. Existing Transformer architectures typically comprise two essential components: multi-head self-attention and feed-forward network (FFN). The former captures long-range pixel dependencies, while the latter enables the model to learn complex patterns and relationships in the data. Previous studies have demonstrated that FFNs are key-value memories \cite{geva2020transformer}, which are vital in modern Transformer architectures. In this paper, we conduct an empirical study to explore the potential of attention mechanisms without using FFN and provide novel structures to demonstrate that removing FFN is flexible for image restoration. Specifically, we propose Continuous Scaling Attention (\textbf{CSAttn}), a method that computes attention continuously in three stages without using FFN. To achieve competitive performance, we propose a series of key components within the attention. Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect the model performance. We apply our \textbf{CSAttn} to several image restoration tasks and show that our model can outperform CNN-based and Transformer-based image restoration approaches.

What problem does this paper attempt to address?

The paper attempts to address the question of whether using only the attention mechanism (without using the Feed-Forward Network, FFN) can achieve performance comparable to or better than existing CNN and Transformer methods in image restoration tasks. Specifically, the paper explores the potential of the attention mechanism without FFN and proposes the Continuous Scaling Attention (CSAttn) method to achieve this goal.

### Background of the Paper

- **Image Restoration**: Image restoration aims to recover a clear image from a given degraded image, which is very beneficial for practical applications such as video surveillance and autonomous driving.
- **Existing Methods**:
  - **CNN Methods**: Convolutional Neural Networks (CNNs) perform well in image restoration tasks but are limited by local receptive fields, making it difficult to model long-range pixel dependencies.
  - **Transformer Methods**: Transformers capture long-range pixel dependencies through multi-head self-attention mechanisms and improve model performance through feed-forward networks (FFNs) for nonlinear transformations.

### Research Motivation

- **Role of FFN**: Existing Transformer architectures typically include multi-head self-attention mechanisms and FFNs. FFNs are used for nonlinear transformations and feature enhancement and are considered key components in modern Transformer architectures.
- **Research Question**: Can high performance in image restoration tasks be achieved solely through the attention mechanism without using FFNs?

### Proposed Method

- **Continuous Scaling Attention (CSAttn)**: The paper proposes a new attention mechanism that includes three consecutive attention computations without the need for FFNs.
- **Key Designs**:
  - **Continuous Attention Learning**: Gradually improve model performance through a series of effective designs.
  - **Nonlinear Activation Functions**: Introduce nonlinear activation functions in attention modeling to activate more useful features.
  - **Value Nonlinear Transformation Adjustment**: Adaptively adjust value features to generate more representative information for subsequent attention computations.
  - **Internal Attention Aggregation**: Fuse attention features from different levels to learn better attention representations.
  - **Internal Progressive Multi-Head**: Gradually increase the number of attention heads to implicitly enhance attention representations.
  - **Internal Residual Connections**: Provide more useful features for the next attention computation, further improving restoration quality.
  - **Spatial Scaling Learning**: Save training budget through spatial scaling operations while maintaining superior performance.

### Experimental Results

- **De-raining**: On the Rain100H dataset, CSAttn improved the PSNR by an average of 0.41 dB compared to state-of-the-art methods.
- **De-snowing**: On the CSD and Snow100K datasets, CSAttn achieved the best PSNR and SSIM metrics.
- **Low-Light Image Enhancement**: On the LOL dataset, CSAttn significantly outperformed existing methods, improving PSNR by at least 4.22 dB.
- **Real Image Dehazing**: On the Dense-Haze and NH-Haze datasets, CSAttn performed excellently, especially on the Dense-Haze dataset, where SSIM improved by 6.03%.

### Conclusion

The paper demonstrates through experiments that CSAttn can achieve or even surpass the performance of existing CNN and Transformer methods in multiple image restoration tasks without using FFNs. This indicates that the attention mechanism itself has strong potential and can achieve high-performance image restoration through appropriate design.
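
### Illustrative Sketch

To make the "Proposed Method" points above more concrete, the following is a minimal NumPy sketch of an attention-only block with no FFN. It is not the authors' implementation: the function names, the head schedule `(1, 2, 4)`, the ReLU applied to the value projection, and the mean used to fuse stage outputs are all assumptions chosen for illustration, and spatial scaling is left out entirely.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    """Scaled dot-product self-attention over tokens x of shape (n, d).

    Assumption: a ReLU on the value projection stands in for the
    "value nonlinear transformation"; the summary does not say which
    activation the paper uses.
    """
    n, d = x.shape
    head_dim = d // num_heads

    def split(t):
        # (n, d) -> (heads, n, head_dim)
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q = split(x @ w_q)
    k = split(x @ w_k)
    v = split(np.maximum(x @ w_v, 0.0))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    out = softmax(scores, axis=-1) @ v                      # (heads, n, head_dim)
    return out.transpose(1, 0, 2).reshape(n, d)             # merge heads


def ffn_free_block(x, params, heads=(1, 2, 4)):
    """Three consecutive attention stages and no feed-forward network.

    Assumptions: the head count grows per stage (progressive multi-head),
    each stage adds a residual connection, and the stage outputs are
    fused by a plain mean (a stand-in for internal aggregation).
    """
    stage_outputs = []
    h = x
    for (w_q, w_k, w_v), num_heads in zip(params, heads):
        h = h + multi_head_attention(h, w_q, w_k, w_v, num_heads)
        stage_outputs.append(h)
    return np.mean(stage_outputs, axis=0)


def init_params(dim, stages=3, seed=0):
    """Random projection matrices (hypothetical helper for the demo)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)
    return [
        tuple(rng.normal(0.0, scale, (dim, dim)) for _ in range(3))
        for _ in range(stages)
    ]


# Demo: 16 tokens (for example, flattened pixels) with 8 channels each.
tokens = np.random.default_rng(1).normal(size=(16, 8))
out = ffn_free_block(tokens, init_params(8))
print(out.shape)
```

The block keeps the input shape, so in principle it could replace a conventional attention-plus-FFN layer; whether that matches the restoration quality reported in the paper depends on design details this sketch does not reproduce.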

How Powerful Potential of Attention on Image Restoration?

Accurate Image Restoration with Attention Retractable Transformer

HAT: Hybrid Attention Transformer for Image Restoration

Dual-former: Hybrid Self-attention Transformer for Efficient Image Restoration

Dilated Strip Attention Network for Image Restoration

Empowering Image Recovery: A Multi-Attention Approach

Joint multi-dimensional dynamic attention and transformer for general image restoration

Decomformer: Decompose Self-Attention of Transformer for Efficient Image Restoration

Key-Graph Transformer for Image Restoration

iiTransformer: A Unified Approach to Exploiting Local and Non-local Information for Image Restoration

Restormer: Efficient Transformer for High-Resolution Image Restoration

An efficient multi‐scale transformer for satellite image dehazing

Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration

CascadedGaze: Efficiency in Global Context Extraction for Image Restoration

Restorer: Removing Multi-Degradation with All-Axis Attention and Prompt Guidance

Look-Around Before You Leap: High-Frequency Injected Transformer for Image Restoration

Radiologic differences between ileocecal tuberculosis and Crohn's disease

Remote Sensing Image Classification Based on Non-Linear Enhanced Attention Mechanism

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

Neuromorphic Vision Restoration Network for Advanced Driver Assistance System

Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning