HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution

Masoomeh Aslahishahri, Jordan Ubbens, Ian Stavness
2024-08-30
Abstract: In this paper, we propose HiTSR, a hierarchical transformer model for reference-based image super-resolution, which enhances low-resolution input images by learning matching correspondences from high-resolution reference images. Diverging from existing multi-network, multi-stage approaches, we streamline the architecture and training pipeline by incorporating the double attention block from GAN literature. Processing two visual streams independently, we fuse self-attention and cross-attention blocks through a gating attention strategy. The model integrates a squeeze-and-excitation module to capture global context from the input images, facilitating long-range spatial interactions within window-based attention blocks. Long skip connections between shallow and deep layers further enhance information flow. Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109. Specifically, on the SUN80 dataset, our model achieves PSNR/SSIM values of 30.24/0.821. These results underscore the effectiveness of attention mechanisms in reference-based image super-resolution. The transformer-based model attains state-of-the-art results without the need for purpose-built subnetworks, knowledge distillation, or multi-stage training, emphasizing the potency of attention in meeting reference-based image super-resolution requirements.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper proposes HiTSR (Hierarchical Transformer for Reference-based Super-Resolution), a method that aims to simplify the complex multi-stage, multi-network models used for reference-based image super-resolution (Ref-SR). Specifically, HiTSR enhances low-resolution images by combining two attention mechanisms (self-attention and cross-attention), thereby addressing the blurring and artifacts produced by traditional single image super-resolution (SISR) methods.

#### Main Contributions Include:

1. **Dual Attention Module**: Introduces a hierarchical Swin Transformer network that uses dual attention mechanisms to learn joint representations between two image distributions and predict correspondences. This enables the model to transfer fine textures from high-resolution reference images to the corresponding low-resolution input images while remaining robust to changes in object shape, position, and scale.
2. **Global Context Information Enhancement**: Uses the squeeze-and-excitation (SE) module from convolutional neural networks (CNNs) to enhance global context information, encoding spatial features at multiple resolutions to generate global query representations.
3. **Long-Range Skip Connections (LSCs)**: Adds long-range skip connections between the shallow and deep layers of the transformer blocks to facilitate information flow across network hierarchies.
4. **Gated Attention Strategy**: Employs a gating parameter that balances the outputs of the self-attention and cross-attention blocks, so that the feature combination can be adjusted within each transformer block.
5. **Performance**: On the SUN80, Urban100, and Manga109 datasets, HiTSR achieves results comparable to or better than existing state-of-the-art methods. Specifically, on the SUN80 dataset, its PSNR/SSIM values reach 30.24/0.821, surpassing current methods.
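
The gated combination of self-attention and cross-attention described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The function names (`attention`, `gated_fusion`), the sigmoid gate, and the convex-combination form are assumptions made for illustration, and the learned query/key/value projections and window partitioning of a real Swin-style block are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def gated_fusion(lr_tokens, ref_tokens, gate_logit):
    """Blend self-attention and cross-attention outputs (hypothetical form).

    Assumption: the gate is a sigmoid of a scalar, so g lies in (0, 1);
    g near 0 favours self-attention, g near 1 favours cross-attention.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    # Self-attention: low-resolution tokens attend to themselves.
    self_out = attention(lr_tokens, lr_tokens, lr_tokens)
    # Cross-attention: low-resolution queries attend to reference tokens.
    cross_out = attention(lr_tokens, ref_tokens, ref_tokens)
    return [
        [(1.0 - g) * s + g * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(self_out, cross_out)
    ]
```

In this sketch, a gate logit of 0 weights both streams equally, while large negative or positive values recover pure self-attention or pure cross-attention. In a trained model the gate would be a learned parameter rather than a hand-set value.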
In summary, HiTSR aims to improve reference-based image super-resolution by simplifying the architecture and training process while still achieving efficient, high-quality image reconstruction.
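
For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), applied to flat lists of pixel values. The function name `psnr` and the 8-bit default `max_value=255.0` are assumptions for illustration. Evaluation protocols differ between papers (for example in colour space and border cropping), so this sketch is not expected to reproduce the reported numbers.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the ground truth and the reconstruction.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a reconstruction closer to the ground truth. Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB.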