Abstract: Vision Transformers have achieved impressive performance in image super-resolution (SR). However, they suffer from low inference speed, mainly because of the quadratic complexity of multi-head self-attention (MHSA), which is the key to learning long-range dependencies. In contrast, most CNN-based methods neglect the important effect of global contextual information, resulting in inaccurate and blurry details. A model that combines the strengths of both Transformers and CNNs should achieve a better trade-off between image quality and inference speed. Based on this observation, we first hypothesize that the main factor affecting the performance of Transformer-based SR models is the general architecture design, not the specific MHSA component. To verify this, we conduct ablation studies that replace MHSA with large-kernel convolutions, alongside other essential module replacements. Surprisingly, the derived models achieve competitive performance. Therefore, a general architecture design, GlobalSR, is extracted by leaving unspecified the core modules of Transformer-based SR models, including the blocks and domain embeddings. It also comes with three practical guidelines for designing a lightweight SR network that uses image-level global contextual information to reconstruct SR images. Following the guidelines, the blocks and domain embeddings of GlobalSR are instantiated via the Deformable Convolution Attention Block (DCAB) and the Fast Fourier Convolution Domain Embedding (FCDE), respectively. This instantiation of the general architecture, termed GlobalSR-DF, introduces a DCA that extracts global contextual features at the block level by using deformable convolution and a Hadamard product as the attention map. Meanwhile, the FCDE uses the fast Fourier transform to map the input spatial features into frequency space and then extracts image-level global information from them with convolutions.
Extensive experiments demonstrate that GlobalSR is key to achieving a superior trade-off between SR quality and efficiency. Specifically, the proposed GlobalSR-DF outperforms state-of-the-art CNN-based and ViT-based single image super-resolution (SISR) models in accuracy-speed trade-off while producing sharp and natural details.
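
The idea behind extracting global information in frequency space can be illustrated with a small sketch. It is not the FCDE from the paper: it assumes a 1D signal instead of a 2D feature map, fixed real weights instead of learned convolutions, and a naive DFT from the standard library. The helper names (`frequency_mix`, `local_conv3`, `affected_positions`) are hypothetical. The sketch only shows that a pointwise operation on the spectrum lets every output position depend on every input position, whereas a small spatial kernel stays local.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a 1D sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def frequency_mix(x, weights):
    """Scale each frequency bin independently, then return to the signal domain.

    The weights are assumed real and symmetric (weights[k] == weights[n - k]),
    so the result is real up to rounding error.
    """
    mixed = [w * s for w, s in zip(weights, dft(x))]
    return [v.real for v in idft(mixed)]


def local_conv3(x, kernel):
    """3-tap spatial convolution with zero padding (purely local)."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - 1
            if 0 <= idx < n:
                acc += k * x[idx]
        out.append(acc)
    return out


def affected_positions(op, x, index, eps=1e-9):
    """Output positions that change when x[index] is perturbed by 1."""
    base = op(x)
    bumped = list(x)
    bumped[index] += 1.0
    changed = op(bumped)
    return [i for i, (a, b) in enumerate(zip(base, changed)) if abs(a - b) > eps]


n = 8
x = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0]
# Illustrative symmetric low-pass weights; a real model would learn these.
weights = [1.0 / (1 + min(k, n - k)) for k in range(n)]

print(affected_positions(lambda s: local_conv3(s, [0.25, 0.5, 0.25]), x, 3))
print(affected_positions(lambda s: frequency_mix(s, weights), x, 3))
```

Perturbing one input sample changes only its three neighbouring outputs under the local kernel, but changes all eight outputs under the frequency-domain operation. A practical implementation would use an FFT rather than this quadratic-time DFT.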

S²R: Exploring a Double-Win Transformer-Based Framework for Ideal and Blind Super-Resolution

CSwT-SR: Conv-Swin Transformer for Blind Remote Sensing Image Super-Resolution with Amplitude-Phase Learning and Structural Detail Alternating Learning

Self-Reference Image Super-Resolution via Pre-trained Diffusion Large Model and Window Adjustable Transformer

End-to-end Alternating Optimization for Real-World Blind Super Resolution

Blind Image Super-Resolution: A Survey and Beyond

Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

Spectrum-to-Kernel Translation for Accurate Blind Image Super-Resolution

Deep learning techniques for blind image super-resolution: A high-scale multi-domain perspective evaluation

GlobalSR: Global context network for single image super-resolution via deformable convolution attention and fast Fourier convolution

Blind Quality Assessment for Image Superresolution Using Deep Two-Stream Convolutional Networks

Various Degradation: Dual Cross-Refinement Transformer for Blind Sonar Image Super-Resolution

Unsupervised Real-World Image Super-Resolution via Dual Synthetic-to-Realistic and Realistic-to-Synthetic Translations

M2TSR: Multi-Range and Mix-Grained Transformer for Single Image Super-Resolution

Deep Blind Super-Resolution for Satellite Video

Feature Modulation Transformer: Cross-Refinement of Global Representation Via High-Frequency Prior for Image Super-Resolution

HiT-SR: Hierarchical Transformer for Efficient Image Super-Resolution

Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach

ITSRN++: Stronger and Better Implicit Transformer Network for Continuous Screen Content Image Super-Resolution

Blind Super-Resolution With Iterative Kernel Correction