Abstract: In this paper, we propose HiTSR, a hierarchical transformer model for reference-based image super-resolution, which enhances low-resolution input images by learning matching correspondences from high-resolution reference images. Diverging from existing multi-network, multi-stage approaches, we streamline the architecture and training pipeline by incorporating the double attention block from GAN literature. Processing two visual streams independently, we fuse self-attention and cross-attention blocks through a gating attention strategy. The model integrates a squeeze-and-excitation module to capture global context from the input images, facilitating long-range spatial interactions within window-based attention blocks. Long skip connections between shallow and deep layers further enhance information flow. Our model demonstrates superior performance across three datasets including SUN80, Urban100, and Manga109. Specifically, on the SUN80 dataset, our model achieves PSNR/SSIM values of 30.24/0.821. These results underscore the effectiveness of attention mechanisms in reference-based image super-resolution. The transformer-based model attains state-of-the-art results without the need for purpose-built subnetworks, knowledge distillation, or multi-stage training, emphasizing the potency of attention in meeting reference-based image super-resolution requirements.
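
For readers unfamiliar with the PSNR figure quoted in the abstract, the standard definition is 10·log10(MAX² / MSE), in decibels. The sketch below is only an illustration of that formula. The function names `psnr` and `mse_from_psnr` are hypothetical, the 8-bit peak value of 255 is an assumption, and none of it reflects the paper's actual evaluation protocol (colour space, cropping, and so on are not stated here).

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the assumed peak intensity (255 for 8-bit images).
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_from_psnr(psnr_db, max_value=255.0):
    """Invert the PSNR formula to recover the mean squared error."""
    return max_value ** 2 / (10.0 ** (psnr_db / 10.0))
```

Under the 8-bit assumption, `mse_from_psnr(30.24)` gives a mean squared error of roughly 61.5, which is one way to get an intuition for the reported number.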

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper proposes HiTSR (Hierarchical Transformer for Reference-based Super-Resolution), a method that aims to simplify existing complex multi-stage, multi-network reference-based image super-resolution (Ref-SR) models. Specifically, HiTSR enhances the quality of low-resolution images by combining two attention mechanisms (self-attention and cross-attention), with the goal of reducing the blurring and artifacts seen in traditional single image super-resolution (SISR) methods.

#### Main Contributions

1. **Dual Attention Module**: Introduces a hierarchical Swin Transformer network that uses dual attention to learn joint representations between two image distributions and predict correspondences. This lets the model transfer fine textures from high-resolution reference images to the corresponding low-resolution input images while remaining robust to changes in object shape, position, and scale.
2. **Global Context Information Enhancement**: Uses the squeeze-and-excitation (SE) module from convolutional neural networks (CNNs) to enhance global context information, encoding spatial features at multiple resolutions to generate global query representations.
3. **Long-Range Skip Connections (LSCs)**: Adds long-range skip connections between the shallow and deep layers of the transformer blocks to facilitate information flow across network hierarchies.
4. **Gated Attention Strategy**: Employs a gating parameter that balances the outputs of the self-attention and cross-attention blocks, so the feature combination can be adjusted flexibly within each transformer block.
5. **Performance**: On the SUN80, Urban100, and Manga109 datasets, HiTSR achieves results comparable to or better than existing state-of-the-art methods. On SUN80 it reaches PSNR/SSIM values of 30.24/0.821, reported as surpassing current methods.

In summary, HiTSR aims to improve reference-based image super-resolution by simplifying the architecture and training process while still achieving efficient, high-quality image reconstruction.
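
The gated fusion of self-attention and cross-attention described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation. The gate is assumed to be a single scalar passed through a sigmoid and used as a convex weight. Learned query/key/value projections, multiple heads, and window partitioning are all omitted. The names `attention`, `gated_dual_attention`, and `gate_logit` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for token matrices of shape (n, d)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def gated_dual_attention(lr_tokens, ref_tokens, gate_logit):
    """Blend self-attention on the LR stream with cross-attention to the reference.

    Assumption: the gate is one scalar; sigmoid(gate_logit) weights the
    cross-attention output and its complement weights the self-attention output.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid keeps the gate in (0, 1)
    self_out = attention(lr_tokens, lr_tokens, lr_tokens)
    cross_out = attention(lr_tokens, ref_tokens, ref_tokens)  # queries from LR, keys/values from reference
    return (1.0 - g) * self_out + g * cross_out


# Toy usage: 16 low-resolution tokens and 64 reference tokens, 8 channels each.
rng = np.random.default_rng(0)
lr_tokens = rng.standard_normal((16, 8))
ref_tokens = rng.standard_normal((64, 8))
fused = gated_dual_attention(lr_tokens, ref_tokens, gate_logit=0.0)
```

With `gate_logit=0.0` the two branches are weighted equally. A strongly negative value reduces the block to plain self-attention, and a strongly positive value to plain cross-attention, which is the sense in which a single parameter adjusts the feature combination.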

HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution

Hybrid-Scale Hierarchical Transformer for Remote Sensing Image Super-Resolution

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Attention-guided video super-resolution with recurrent multi-scale spatial–temporal transformer

Enhanced Window-Based Self-Attention with Global and Multi-Scale Representations for Remote Sensing Image Super-Resolution

Multi-attention fusion transformer for single-image super-resolution

MaxSR: Image Super-Resolution Using Improved MaxViT

Single Image Super-Resolution Using Deep Hierarchical Attention Network

Vision Transformers with Hierarchical Attention

HAAT: Hybrid Attention Aggregation Transformer for Image Super-Resolution

HMANet: Hybrid Multi-Axis Aggregation Network for Image Super-Resolution

Reference-Based Image Super-Resolution with Deformable Attention Transformer

Attention-based Multi-Reference Learning for Image Super-Resolution

MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution

From Coarse to Fine: Hierarchical Pixel Integration for Lightweight Image Super-resolution

Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network for Remote Sensing Image Super-Resolution

ML-CrAIST: Multi-scale Low-high Frequency Information-based Cross black Attention with Image Super-resolving Transformer

Remote Sensing Image Super-Resolution via Residual-Dense Hybrid Attention Network

An Efficient Hybrid CNN-Transformer Approach for Remote Sensing Super-Resolution

Hyperspectral Image Super-Resolution via Deep Spatiospectral Attention Convolutional Neural Networks

Fully Cross-Attention Transformer for Guided Depth Super-Resolution