Abstract: Video super-resolution (VSR) reconstructs high-definition image sequences from consecutive, highly correlated video frames, and it is an important task in video processing. However, existing VSR methods are limited in how they fuse spatial-temporal information. Some methods fuse spatial-temporal information over only a limited portion of the input sequence, while others adopt a recurrent strategy that gradually attenuates the spatial information. Recent VSR methods use Transformers to improve the quality of the upscaled videos, but modeling long-range dependencies requires significant computational resources, which dramatically increases model complexity. To address these issues, we propose a Collaborative Transformer for Video Super-Resolution (CTVSR). The proposed method integrates the strengths of Transformer-based and recurrent-based models by concurrently assimilating the spatial information derived from multi-scale receptive fields and the temporal information acquired from temporal trajectories. In particular, we propose a Spatial Enhanced Network (SEN) with two key components: Token Dropout Attention (TDA) and Deformable Multi-head Cross Attention (DMCA). TDA focuses on the key regions to extract more informative features, and DMCA employs deformable cross attention to gather information from adjacent frames. Moreover, we introduce a Temporal-trajectory Enhanced Network (TEN) that computes the similarity of a given token only with the temporally related tokens on its temporal trajectory, unlike previous methods that evaluate all tokens along the temporal dimension. In comprehensive quantitative and qualitative experiments on four widely used VSR benchmarks, the proposed CTVSR achieves competitive performance with relatively low computational cost and high inference speed.
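To make the contrast drawn for TEN concrete, the following is a minimal sketch, not the authors' implementation. It assumes each frame is a list of token vectors and that a trajectory (one token index per frame) is already given; how CTVSR actually builds trajectories is not stated in the abstract. The names `attend`, `trajectory_attention`, and `full_temporal_attention` are hypothetical, and plain scaled dot-product attention is used as a stand-in for whatever similarity the paper employs.

```python
import math


def attend(query, keys):
    """Scaled dot-product attention of one query over a list of key vectors.

    Keys double as values here to keep the sketch short (an assumption).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the similarity scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the keys, one output value per feature dimension.
    output = [
        sum(w * key[d] for w, key in zip(weights, keys))
        for d in range(len(query))
    ]
    return output, weights


def trajectory_attention(query, frames, trajectory):
    """Compare the query only with the token at trajectory[t] in frame t."""
    keys = [frames[t][index] for t, index in enumerate(trajectory)]
    return attend(query, keys)


def full_temporal_attention(query, frames):
    """Baseline for comparison: the query is compared with every token of every frame."""
    keys = [token for frame in frames for token in frame]
    return attend(query, keys)


# Toy data: 3 frames, 4 tokens per frame, 2 features per token.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]],
    [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.4, 0.6]],
    [[0.3, 0.7], [0.8, 0.2], [1.0, 0.0], [0.0, 1.0]],
]
query = [1.0, 0.0]
trajectory = [0, 1, 2]  # hypothetical trajectory: token index followed in each frame

traj_out, traj_weights = trajectory_attention(query, frames, trajectory)
full_out, full_weights = full_temporal_attention(query, frames)
print(len(traj_weights), "similarities along the trajectory")
print(len(full_weights), "similarities over all tokens")
```

In this toy setting the trajectory variant evaluates one similarity per frame (T in total), whereas the baseline evaluates one per token per frame (T × N in total), which is the kind of saving the abstract attributes to restricting attention to the temporal trajectory.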

CVIformer: Cross-View Interactive Transformer for Efficient Stereoscopic Image Super-Resolution

Strong-Weak Cross-View Interaction Network for Stereo Image Super-Resolution

CVGSR: Stereo image Super-Resolution with Cross-View guidance

CTVSR: Collaborative Spatial-Temporal Transformer for Video Super-Resolution

Cross View Capture for Stereo Image Super-Resolution

Cross-View Hierarchy Network for Stereo Image Super-Resolution

Efficient Hybrid Feature Interaction Network for Stereo Image Super-Resolution

Attention-guided video super-resolution with recurrent multi-scale spatial–temporal transformer

Video Super-Resolution with Spatial-Temporal Transformer Encoder

CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion

EViTIB: Efficient Vision Transformer Via Inductive Bias Exploration for Image Super-Resolution

Feature Modulation Transformer: Cross-Refinement of Global Representation Via High-Frequency Prior for Image Super-Resolution

SIR-Former: Stereo Image Restoration Using Transformer

Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention

PFT-SSR: Parallax Fusion Transformer for Stereo Image Super-Resolution

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

Stereo Video Super-Resolution Via Exploiting View-Temporal Correlations

A New Dataset and Transformer for Stereoscopic Video Super-Resolution

Deep Plug-and-Play Video Super-Resolution

Image Super-Resolution Using a Simple Transformer Without Pretraining

Transformer with Hybrid Attention Mechanism for Stereo Endoscopic Video Super Resolution