Abstract: In real-world video super-resolution (VSR), videos suffer from in-the-wild degradations and artifacts. VSR methods, especially recurrent ones, tend to propagate artifacts over time in the real-world setting and are more vulnerable than image super-resolution. This paper investigates the influence of artifacts on commonly used covariance-based attention mechanisms in VSR. Comparing the widely-used spatial attention, which computes covariance over space, versus the channel attention, we observe that the latter is less sensitive to artifacts. However, channel attention leads to feature redundancy, as evidenced by the higher covariance among output channels. As such, we explore simple techniques such as the squeeze-excite mechanism and covariance-based rescaling to counter the effects of high channel covariance. Based on our findings, we propose RealViformer. This channel-attention-based real-world VSR framework surpasses state-of-the-art on two real-world VSR datasets with fewer parameters and faster runtimes. The source code is available at https://github.com/Yuehan717/RealViformer.

What problem does this paper attempt to address?

The paper primarily focuses on the issue of Real-World Video Super-Resolution (RWVSR). In real-world environments, videos often suffer from various complex degradations and artifacts, making video super-resolution processing very challenging. Traditional video super-resolution methods often assume that low-resolution frames are obtained by simply downsampling high-resolution frames through a known downsampling kernel, which is not the case in real-world scenarios.

The main contributions of the paper are as follows:

1. **Exploring the differences between spatial attention and channel attention in real-world video super-resolution**: The authors compared two covariance-based attention mechanisms, spatial attention and channel attention, in the task of real-world video super-resolution. Experimental results show that although spatial attention is widely used in standard video super-resolution, it is very sensitive to noise and degradation; in contrast, channel attention is more robust.
2. **Revealing the issues with channel attention and their solutions**: The paper points out that directly applying channel attention increases the covariance between output feature channels, indicating higher feature redundancy, which is detrimental to the learning process. To address this issue, the authors explored techniques such as the Squeeze-and-Excite mechanism and covariance-based channel rescaling to reduce feature redundancy.
3. **Proposing the RealViformer model**: Based on the above findings, the authors developed a new real-world video super-resolution model, RealViformer. This model uses an improved channel attention module to limit artifacts generated by the model and enhances performance by introducing the Squeeze-and-Excite mechanism and covariance-based channel rescaling techniques.

Experimental results show that RealViformer achieves state-of-the-art performance on challenging synthetic video datasets as well as two real-world video datasets, with fewer parameters and faster runtimes. In summary, the paper analyzes how spatial attention and channel attention behave differently in real-world video super-resolution, proposes remedies for the feature redundancy that channel attention introduces, and builds an efficient, high-performing real-world video super-resolution model on those findings.
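
To make the distinction between the two attention types concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes features flattened to shape `(N, C)` with `N = H * W` positions and `C` channels. The function names, the L2 normalisation inside `channel_attention`, the correlation-based redundancy measure, and the weights `w1`/`w2` are all illustrative assumptions. Spatial attention builds an `(N, N)` map between positions, channel attention builds a `(C, C)` map between channels, and the last two helpers show one way to quantify channel redundancy and apply a squeeze-and-excite style gate.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(q, k, v):
    """Attention over positions. q, k, v: (N, C). The attention map is (N, N)."""
    scores = q @ k.T / np.sqrt(q.shape[1])    # similarity between positions
    return softmax(scores, axis=-1) @ v       # (N, C)


def channel_attention(q, k, v):
    """Attention over channels. q, k, v: (N, C). The attention map is (C, C)."""
    # Assumption of this sketch: L2-normalise each channel over space first.
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    scores = qn.T @ kn                        # similarity between channels
    return v @ softmax(scores, axis=-1).T     # (N, C)


def mean_abs_channel_correlation(x):
    """Redundancy proxy: mean absolute off-diagonal correlation of channels."""
    corr = np.corrcoef(x, rowvar=False)       # (C, C), columns are variables
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.abs(off_diag).mean())


def squeeze_excite(x, w1, w2):
    """Squeeze-and-excite style gate. x: (N, C), w1: (C, R), w2: (R, C)."""
    squeezed = x.mean(axis=0)                     # squeeze: global average, (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)       # bottleneck with ReLU, (R,)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # sigmoid gate in (0, 1), (C,)
    return x * gate                               # rescale every channel


# Demo on random features (hypothetical sizes; weights are random, not trained).
rng = np.random.default_rng(0)
N, C, R = 64, 8, 2
q, k, v = (rng.standard_normal((N, C)) for _ in range(3))
out_spatial = spatial_attention(q, k, v)
out_channel = channel_attention(q, k, v)
w1 = rng.standard_normal((C, R))
w2 = rng.standard_normal((R, C))
out_gated = squeeze_excite(out_channel, w1, w2)

print("output shapes:", out_spatial.shape, out_channel.shape, out_gated.shape)
print("channel correlation (spatial):", mean_abs_channel_correlation(out_spatial))
print("channel correlation (channel):", mean_abs_channel_correlation(out_channel))
```

Random inputs only illustrate the shapes and the measurement; they are not expected to reproduce the paper's finding that channel attention raises channel covariance, which concerns trained networks on degraded video.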

RealViformer: Investigating Attention for Real-World Video Super-Resolution

Video super-resolution with phase-aided deformable alignment network

Benchmark Dataset and Effective Inter-Frame Alignment for Real-World Video Super-Resolution

S2A: Scale-Attention-Aware Networks for Video Super-Resolution

Real-World Video Super-Resolution with a Degradation-Adaptive Model

Enhanced Video Super-Resolution Network Towards Compressed Data

Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

A Lightweight Recurrent Grouping Attention Network for Video Super-Resolution

Learning Degradation-Robust Spatiotemporal Frequency-Transformer for Video Super-Resolution

Omniscient Video Super-Resolution with Explicit-Implicit Alignment

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

Real-Time Video Super-Resolution with Spatio-Temporal Modeling and Redundancy-Aware Inference

Attention-guided video super-resolution with recurrent multi-scale spatial–temporal transformer

Attention-guided dual spatial-temporal non-local network for video super-resolution

Asymmetric Event-Guided Video Super-Resolution

3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution

STDAN: Deformable Attention Network for Space-Time Video Super-Resolution

Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution

RTSR: A Real-Time Super-Resolution Model for AV1 Compressed Content

Video Super-Resolution Via a Spatio-Temporal Alignment Network

NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution