Channel-Partitioned Windowed Attention And Frequency Learning for Single Image Super-Resolution

Dinh Phu Tran, Dao Duy Hung, Daeyoung Kim
2024-08-27
Abstract: Recently, window-based attention methods have shown great potential for computer vision tasks, particularly in Single Image Super-Resolution (SISR). However, they may fall short in capturing long-range dependencies and relationships between distant tokens. Additionally, we find that learning in the spatial domain does not convey the frequency content of the image, which is a crucial aspect of SISR. To tackle these issues, we propose a new Channel-Partitioned Attention Transformer (CPAT) that better captures long-range dependencies by sequentially expanding windows along the height and width of feature maps. In addition, we propose a novel Spatial-Frequency Interaction Module (SFIM), which incorporates information from the spatial and frequency domains to provide more comprehensive information from feature maps. This includes information about the frequency content and enhances the receptive field across the entire image. Experimental findings show the effectiveness of our proposed modules and architecture. In particular, CPAT surpasses current state-of-the-art methods by up to 0.31 dB at x2 SR on Urban100.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems in the Single Image Super-Resolution (SISR) task:

1. **Capturing long-range dependencies**: Existing window-based attention mechanisms have shown great potential in SISR, but they may fall short in capturing long-range dependencies and the relationships between distant tokens. This limits how well the model can use global information.
2. **Lack of frequency-domain information**: Existing SISR methods mainly extract features in the spatial domain and ignore the frequency domain. Frequency-domain information is crucial for reconstructing the high-resolution (HR) image because it carries the image's detail and structural content.

To solve these problems, the authors propose two modules:

- **Channel-Partitioned Windowed Self-Attention (CPWin-SA)**: Attention windows are expanded along the height and width of the input feature map, so that the model better captures long-range dependencies and the relationships between distant tokens.
- **Spatial-Frequency Interaction Module (SFIM)**: Information from the spatial and frequency domains is combined to make fuller use of the feature map and improve the quality of the output image.

Together, these two modules allow the proposed model, CPAT, to outperform existing state-of-the-art methods on SISR. Specifically, CPAT improves on HAT by 0.31 dB in x2 super-resolution on the Urban100 dataset.

In summary, the paper addresses the weaknesses of existing SISR methods in capturing long-range dependencies and in using frequency-domain information by introducing a new attention mechanism together with a module that fuses spatial and frequency-domain information.
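
The two ideas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: the channels are split into just two groups, one attending within full-height strips and one within full-width strips; attention uses identity query/key/value projections; and the frequency branch multiplies the 2-D spectrum by a `gain` array standing in for learned weights. All function names are hypothetical.

```python
import numpy as np


def partition_windows(x, win_h, win_w):
    """Split a (C, H, W) map into non-overlapping windows of tokens.

    Returns an array of shape (num_windows, win_h * win_w, C).
    """
    c, h, w = x.shape
    assert h % win_h == 0 and w % win_w == 0, "window must tile the map"
    x = x.reshape(c, h // win_h, win_h, w // win_w, win_w)
    x = x.transpose(1, 3, 2, 4, 0)  # (nH, nW, win_h, win_w, C)
    return x.reshape(-1, win_h * win_w, c)


def merge_windows(tokens, c, h, w, win_h, win_w):
    """Inverse of partition_windows: rebuild the (C, H, W) map."""
    x = tokens.reshape(h // win_h, w // win_w, win_h, win_w, c)
    x = x.transpose(4, 0, 2, 1, 3)  # (C, nH, win_h, nW, win_w)
    return x.reshape(c, h, w)


def self_attention(tokens):
    """Softmax self-attention inside each window.

    Assumption: identity Q/K/V projections, single head.
    """
    scale = tokens.shape[-1] ** -0.5
    scores = tokens @ tokens.transpose(0, 2, 1) * scale
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


def channel_partitioned_attention(x, win=4):
    """Attend within windows whose shape depends on the channel group.

    Assumption: two groups only. The first uses (H x win) strips that
    span the full height, the second uses (win x W) strips that span
    the full width.
    """
    c, h, w = x.shape
    half = c // 2
    out = np.empty_like(x)
    tall = partition_windows(x[:half], h, win)
    out[:half] = merge_windows(self_attention(tall), half, h, w, h, win)
    wide = partition_windows(x[half:], win, w)
    out[half:] = merge_windows(self_attention(wide), c - half, h, w, win, w)
    return out


def frequency_branch(x, gain):
    """Reweight the 2-D spectrum of each channel and transform back.

    `gain` has shape (H, W // 2 + 1) and is a stand-in for learned
    frequency-domain weights (an assumption of this sketch).
    """
    spec = np.fft.rfft2(x, axes=(-2, -1))
    return np.fft.irfft2(spec * gain, s=x.shape[-2:], axes=(-2, -1))


# Example: an 8x8 feature map with 4 channels.
rng = np.random.default_rng(0)
features = rng.random((4, 8, 8))
attended = channel_partitioned_attention(features, win=4)
fused = attended + frequency_branch(features, np.ones((8, 5)))
```

In this sketch, a token at the top of a full-height strip can attend directly to a token at the bottom of the same strip, which a square 4x4 window would not allow. The frequency branch has a global view for a different reason: every coefficient of the 2-D discrete Fourier transform depends on every pixel of the channel, so reweighting the spectrum mixes information across the whole map. The final sum is just one simple way to combine the two branches and is not taken from the paper.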