RealViformer: Investigating Attention for Real-World Video Super-Resolution

Yuehan Zhang, Angela Yao
2024-07-19
Abstract: In real-world video super-resolution (VSR), videos suffer from in-the-wild degradations and artifacts. VSR methods, especially recurrent ones, tend to propagate artifacts over time in the real-world setting and are more vulnerable than image super-resolution. This paper investigates the influence of artifacts on commonly used covariance-based attention mechanisms in VSR. Comparing the widely used spatial attention, which computes covariance over space, with channel attention, we observe that the latter is less sensitive to artifacts. However, channel attention leads to feature redundancy, as evidenced by the higher covariance among output channels. As such, we explore simple techniques such as the squeeze-excite mechanism and covariance-based rescaling to counter the effects of high channel covariance. Based on our findings, we propose RealViformer. This channel-attention-based real-world VSR framework surpasses the state of the art on two real-world VSR datasets with fewer parameters and faster runtimes. The source code is available at https://github.com/Yuehan717/RealViformer.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses real-world video super-resolution (RWVSR). Real-world videos suffer from complex, unknown degradations and artifacts, which makes super-resolution difficult. Traditional VSR methods often assume that low-resolution frames are produced by downsampling high-resolution frames with a known kernel, an assumption that does not hold in real-world scenarios. The main contributions of the paper are as follows:
1. **Comparing spatial attention and channel attention in real-world video super-resolution**: The authors compare two covariance-based attention mechanisms, spatial attention and channel attention, on the real-world VSR task. Their experiments show that spatial attention, although widely used in standard VSR, is sensitive to noise and degradation, whereas channel attention is more robust.
2. **Identifying a weakness of channel attention and addressing it**: Applying channel attention directly increases the covariance among output feature channels. This indicates higher feature redundancy, which is detrimental to learning. To reduce the redundancy, the authors explore the Squeeze-and-Excite mechanism and covariance-based channel rescaling.
3. **Proposing the RealViformer model**: Based on these findings, the authors build RealViformer, a real-world VSR model. It uses an improved channel attention module to limit the artifacts the model produces, and it adds the Squeeze-and-Excite mechanism and covariance-based channel rescaling to improve performance.

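The difference between the two attention types in item 1 and the gating idea in item 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the RealViformer implementation. The function names, the scaling factors, the random weights `w1` and `w2`, and the correlation-based redundancy measure are all hypothetical choices made for this example. The paper's covariance-based rescaling is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(q, k, v):
    """q, k, v: (C, N) features, N = H*W flattened positions.
    The attention map is (N, N): similarity between spatial positions."""
    c = q.shape[0]
    attn = softmax(q.T @ k / np.sqrt(c), axis=-1)  # (N, N), rows sum to 1
    return v @ attn.T                              # (C, N)

def channel_attention(q, k, v):
    """Same inputs. The attention map is (C, C): similarity between channels.
    Scaling by sqrt(N) is an assumption made for this sketch."""
    n = q.shape[1]
    attn = softmax(q @ k.T / np.sqrt(n), axis=-1)  # (C, C), rows sum to 1
    return attn @ v                                # (C, N)

def squeeze_excite(x, w1, w2):
    """Squeeze-excite style gate on (C, N) features.
    w1: (C_hidden, C) and w2: (C, C_hidden) are hypothetical gate weights."""
    s = x.mean(axis=1)                    # squeeze: one statistic per channel
    h = np.maximum(w1 @ s, 0.0)           # ReLU bottleneck
    g = 1.0 / (1.0 + np.exp(-(w2 @ h)))   # sigmoid gate in (0, 1), shape (C,)
    return x * g[:, None]                 # rescale each channel

def mean_abs_channel_corr(x):
    """Mean absolute off-diagonal correlation between the channels of x (C, N).
    Used here as a simple stand-in for 'covariance among output channels'."""
    corr = np.corrcoef(x)                          # (C, C), rows are variables
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.abs(off_diag).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, N = 8, 64                                   # 8 channels, 8x8 positions
    q, k, v = (rng.standard_normal((C, N)) for _ in range(3))
    out_spatial = spatial_attention(q, k, v)
    out_channel = channel_attention(q, k, v)
    print("spatial output redundancy:", mean_abs_channel_corr(out_spatial))
    print("channel output redundancy:", mean_abs_channel_corr(out_channel))
```

The spatial attention map has N × N entries and mixes positions, while the channel attention map has only C × C entries and mixes channels. Because each output channel of channel attention is a weighted average of the same set of input channels, its outputs can become correlated, which is the kind of redundancy the summary describes. The values printed on random inputs are only illustrative and say nothing about the paper's measurements.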
Experimental results show that RealViformer achieves state-of-the-art performance on challenging synthetic video datasets as well as two real-world video datasets, with fewer parameters and faster runtimes. In summary, the paper analyzes how spatial and channel attention behave under real-world degradations, proposes remedies for the feature redundancy that channel attention introduces, and builds an efficient real-world video super-resolution model on those findings.