Joint multi-dimensional dynamic attention and transformer for general image restoration

Huan Zhang, Xu Zhang, Nian Cai, Jianglei Di, Yun Zhang
2024-11-12
Abstract: Outdoor images often suffer from severe degradation due to rain, haze, and noise, impairing image quality and challenging high-level tasks. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we integrate sole CNNs in the encoder-decoder and sole transformers in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks: deraining, deblurring, denoising, dehazing, and enhancement, as well as superior performance for high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.
Computer Vision and Pattern Recognition
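The dynamic-attention kernels mentioned in the abstract can be illustrated with a small sketch. It follows the MDConv formulation quoted in this summary, \(W_d = W\odot\alpha_s\odot\alpha_c\odot\alpha_f\) and \(Y = W_d\ast X\), but it is not the authors' implementation. Several things are assumptions made for illustration: the function names `modulate_kernel` and `conv2d_valid` are hypothetical, tensors are stored channel-first as nested lists, the attention weights are fixed by hand rather than predicted from the input by \(\pi(X)\), and the convolution uses no padding, so the output is smaller than the input.

```python
def modulate_kernel(W, alpha_s, alpha_c, alpha_f):
    """Rescale a static kernel W[o][i][u][v] by three attention factors.

    alpha_s[u][v] weights kernel positions (spatial, k x k),
    alpha_c[i]    weights input channels  (length C_in),
    alpha_f[o]    weights output filters  (length C_out).
    """
    return [
        [
            [
                [
                    W[o][i][u][v] * alpha_s[u][v] * alpha_c[i] * alpha_f[o]
                    for v in range(len(W[o][i][u]))
                ]
                for u in range(len(W[o][i]))
            ]
            for i in range(len(W[o]))
        ]
        for o in range(len(W))
    ]


def conv2d_valid(X, W):
    """Naive sliding-window convolution without padding.

    X[i][y][x] is the input, W[o][i][u][v] the kernel, result is Y[o][y][x].
    """
    c_in, h, w = len(X), len(X[0]), len(X[0][0])
    k = len(W[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    Y = []
    for o in range(len(W)):
        plane = [[0.0] * out_w for _ in range(out_h)]
        for y in range(out_h):
            for x in range(out_w):
                plane[y][x] = sum(
                    W[o][i][u][v] * X[i][y + u][x + v]
                    for i in range(c_in)
                    for u in range(k)
                    for v in range(k)
                )
        Y.append(plane)
    return Y


# Toy setup: C_out = 2, C_in = 2, k = 2, input of size 3 x 3.
W = [[[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)] for _ in range(2)]
X = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]

alpha_s = [[1.0, 0.0], [0.0, 1.0]]  # keep only the kernel diagonal
alpha_c = [1.0, 0.5]                # down-weight the second input channel
alpha_f = [2.0, 1.0]                # amplify the first output filter

W_d = modulate_kernel(W, alpha_s, alpha_c, alpha_f)
Y = conv2d_valid(X, W_d)
print(Y)
```

In this toy setup the three factors add only \(k^2 + C_{\text{in}} + C_{\text{out}}\) scalars on top of the static kernel, which is one way to see why attention of this kind can stay inexpensive relative to the convolution itself.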
What problem does this paper attempt to address?
This paper addresses how to handle complex degradations in image restoration while remaining efficient. Specifically, it seeks a good balance between performance and computational complexity in general-purpose image restoration by combining a multi-dimensional dynamic attention mechanism with a Transformer. The problems it targets include:

1. **Complex Degradation Processing**: Outdoor images are often severely degraded by rain, haze, and noise, which lowers image quality and hampers subsequent high-level tasks. Current image restoration methods struggle to handle these complex degradations, especially while maintaining efficiency.
2. **Balance between Performance and Computational Complexity**: Existing image restoration methods either underperform or are too computationally expensive, which makes them hard to deploy widely in practice. A new architecture is therefore needed that improves performance while reducing computational complexity.
3. **Multi-task Processing Ability**: The paper also considers how to handle multiple image degradation problems, such as deraining, deblurring, denoising, dehazing, and enhancement, within one model.

### Main Contributions of the Paper

1. **Proposed a New Image Restoration Architecture, MDDA-former**: This architecture exploits the multi-scale structural differences of the U-Net architecture by using CNN-based modules in the encoder-decoder and Transformer blocks in the latent layer, achieving a good balance between performance and efficiency.
2. **Designed the Multi-Dimensional Dynamic Attention Block (MDAB)**: This block learns complementary dynamic attention along the spatial, channel, and filter dimensions of the convolution kernel at acceptable computational cost, and thereby extracts rich local context information.
3. **Proposed an Effective Transformer Block (ETB)**: This block captures global context information through transposed self-attention and depth-wise convolution with linear complexity, while keeping the parameter count and FLOPs low.
4. **Experimental Verification**: Extensive experiments show that the proposed method achieves a better trade-off between performance and complexity on five image restoration tasks (deraining, deblurring, denoising, dehazing, and enhancement) across 18 benchmark datasets, and that it also performs well on high-level vision tasks.

### Formula Presentation

- **Multi-Dimensional Dynamic Convolution (MDConv)**:

  \[ Y = W_d\ast X \]
  \[ W_d = W\odot\alpha_s\odot\alpha_c\odot\alpha_f \]
  \[ \alpha_s, \alpha_c, \alpha_f=\pi(X) \]

  where \(X\in\mathbb{R}^{h\times w\times C_{\text{in}}}\) is the input, \(Y\in\mathbb{R}^{h\times w\times C_{\text{out}}}\) is the output, \(W\) is the regular (static) convolution kernel, and \(\alpha_s\in\mathbb{R}^{k\times k}\), \(\alpha_c\in\mathbb{R}^{C_{\text{in}}}\), \(\alpha_f\in\mathbb{R}^{C_{\text{out}}}\) are the attention weights along the spatial, channel, and filter dimensions, respectively. \(\odot\) denotes element-wise multiplication and \(\ast\) denotes convolution.

- **Effective Transformer Block (ETB)**:

  \[ Q, K, V = f_{dw}^{3\times3}(f_1^{1\times1}(\text{LN}(X_e))) \]
  \[ \hat{Q}, \hat{K}, \hat{V}=R(Q, K, V) \]
  \[ \text{FTSA}=\text{SoftMax}(\hat{K}\otimes\hat{Q}/\alpha)