Abstract: Video super-resolution (VSR) is important in video processing for reconstructing high-definition image sequences from corresponding continuous and highly related video frames. However, existing VSR methods are limited in how they fuse spatial-temporal information. Some methods fuse spatial-temporal information over only a limited range of the input sequence, while others adopt a recurrent strategy that gradually attenuates the spatial information. Recent advances in VSR use Transformer-based methods to improve the quality of the upscaled videos, but these methods require significant computational resources to model long-range dependencies, which dramatically increases model complexity. To address these issues, we propose a Collaborative Transformer for Video Super-Resolution (CTVSR). The proposed method integrates the strengths of Transformer-based and recurrent-based models by concurrently assimilating the spatial information derived from multi-scale receptive fields and the temporal information acquired from temporal trajectories. In particular, we propose a Spatial Enhanced Network (SEN) with two key components: Token Dropout Attention (TDA) and Deformable Multi-head Cross Attention (DMCA). TDA focuses on key regions to extract more informative features, and DMCA employs deformable cross attention to gather information from adjacent frames. Moreover, we introduce a Temporal-trajectory Enhanced Network (TEN) that computes the similarity of a given token only with the temporally related tokens along its temporal trajectory, unlike previous methods that evaluate all tokens within the temporal dimension. In comprehensive quantitative and qualitative experiments on four widely used VSR benchmarks, the proposed CTVSR achieves competitive performance with relatively low computational cost and high forward speed.
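The trajectory-restricted similarity described for TEN can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `trajectory_attention`, the nested-list data layout, the precomputed `trajectory` indices, and the scaled dot-product softmax are all assumptions made for illustration. The abstract only states that a token is compared with the temporally related tokens on its trajectory rather than with every token in the temporal dimension.

```python
import math


def trajectory_attention(query, frames, trajectory):
    """Fuse features for one token along its temporal trajectory (hypothetical sketch).

    query:      list of C floats, the feature of the token being enhanced.
    frames:     list of T frames; each frame is a list of N tokens (C floats each).
    trajectory: list of T ints; trajectory[t] is the index of the tracked token
                in frame t (assumed to be given, e.g. by some motion estimate).
    """
    scale = 1.0 / math.sqrt(len(query))

    # Gather one key per frame: T tokens instead of all T * N tokens.
    keys = [frames[t][trajectory[t]] for t in range(len(frames))]

    # Scaled dot-product similarity between the query and each trajectory token.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]

    # Numerically stable softmax over the T trajectory positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the trajectory tokens gives the fused feature.
    fused = [
        sum(w * key[c] for w, key in zip(weights, keys))
        for c in range(len(query))
    ]
    return fused, weights
```

Under these assumptions, each token needs T similarity evaluations rather than T × N, which is the kind of saving the abstract attributes to restricting attention to the temporal trajectory.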

Event-Adapted Video Super-Resolution

Asymmetric Event-Guided Video Super-Resolution

Turning Frequency to Resolution: Video Super-resolution via Event Cameras

Video super-resolution via event-driven temporal alignment

Video super-resolution with phase-aided deformable alignment network

Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution

Enhanced Video Super-Resolution Network Towards Compressed Data

EvTexture: Event-driven Texture Enhancement for Video Super-Resolution

Adapting Single-Image Super-Resolution Models to Video Super-Resolution: A Plug-and-Play Approach

Event Stream Super-Resolution Via Spatiotemporal Constraint Learning

HR-INR: Continuous Space-Time Video Super-Resolution via Event Camera

CTVSR: Collaborative Spatial-Temporal Transformer for Video Super-Resolution

E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning

Cascaded Temporal Updating Network for Efficient Video Super-Resolution

Accelerating the Training of Video Super-Resolution Models

Multi-Stage Feature Fusion Network for Video Super-Resolution

FM-VSR: Feature Multiplexing Video Super-Resolution for Compressed Video

IAA-VSR: an Iterative Alignment Algorithm for Video Super-Resolution

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Space-time video super-resolution using long-term temporal feature aggregation

Event-based Video Frame Interpolation with Edge Guided Motion Refinement