Abstract: Existing video super-resolution (VSR) methods generally adopt a recurrent propagation network to extract spatio-temporal information from entire video sequences, exhibiting impressive performance. However, the key components in recurrent-based VSR networks significantly impact model efficiency, e.g., the alignment module occupies a substantial portion of model parameters, while the bidirectional propagation mechanism significantly increases the inference time. Consequently, developing a compact and efficient VSR method that can be deployed on resource-constrained devices, e.g., smartphones, remains challenging. To this end, we propose a cascaded temporal updating network (CTUN) for efficient VSR. We first develop an implicit cascaded alignment module to explore spatio-temporal correspondences from adjacent frames. Moreover, we propose a unidirectional propagation updating network to efficiently explore long-range temporal information, which is crucial for high-quality video reconstruction. Specifically, we develop a simple yet effective hidden updater that can leverage future information to update hidden features during forward propagation, significantly reducing inference time while maintaining performance. Finally, we formulate all of these components into an end-to-end trainable VSR network. Extensive experimental results show that our CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods. Notably, compared with BasicVSR, our method obtains better results while employing only about 30% of the parameters and running time. The source code and pre-trained models will be available at https://github.com/House-Leo/CTUN.


### What problem does this paper attempt to solve?

This paper addresses the trade-off between efficiency and performance when video super-resolution (VSR) methods are deployed on resource-constrained devices such as smartphones. Existing VSR methods usually rely on recurrent propagation networks to extract spatio-temporal information from the entire video sequence. Although they perform well, they have two main problems:

1. **High model complexity**: Alignment modules (such as optical flow and deformable convolution) account for a large share of the model parameters.
2. **Long inference time**: The bidirectional propagation mechanism significantly increases inference time.

Developing a compact and efficient VSR method that achieves high-quality video reconstruction on resource-constrained devices is therefore an important challenge. To this end, the authors propose a Cascaded Temporal Updating Network (CTUN) that significantly improves model efficiency while maintaining performance.

### Main contributions of CTUN

1. **Implicit Cascaded Alignment Module (ICAM)**: Implicitly explores the spatio-temporal correspondences among past, current and future features, making the model more parameter-efficient and easier to train.
2. **Hidden Updater (HU)**: Uses future information to update hidden features, significantly reducing the memory consumption and inference time of recurrent-based VSR models while maintaining performance.
3. **End-to-end trained unidirectional propagation network**: In experiments, CTUN achieves a good trade-off between performance and model complexity on multiple VSR benchmark datasets, with notably fewer parameters and a shorter running time than BasicVSR.

### Experimental results

The experimental results show that CTUN achieves performance comparable to or better than existing methods on multiple benchmark datasets, with clear advantages in parameter count and inference time. For example, on the Vid4 dataset, CTUN improves the PSNR by 0.24 dB compared to BasicVSR, while using only about 30% of BasicVSR's parameters.

### Summary

The core problem of the paper is to develop a VSR method that can be efficiently deployed on resource-constrained devices. By introducing components such as ICAM and HU, CTUN significantly improves model efficiency while maintaining high-quality video reconstruction.
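The idea of unidirectional propagation with a hidden updater can be illustrated with a toy sketch. This is not the authors' implementation: the names `fuse` and `propagate_forward` are hypothetical, features are plain lists of floats, and the element-wise mean is an assumed stand-in for the learned alignment and update modules. It only shows the control flow: one forward pass in which the hidden state is refreshed with the next frame before being combined with the current one.

```python
from typing import List

Feature = List[float]  # toy stand-in for a feature map


def fuse(*feats: Feature) -> Feature:
    """Toy stand-in for a learned module: element-wise mean of its inputs."""
    n = len(feats)
    return [sum(vals) / n for vals in zip(*feats)]


def propagate_forward(frames: List[Feature]) -> List[Feature]:
    """Run a single forward pass over the sequence (no backward branch).

    At step t the hidden state is first updated with the next frame
    (the "future" information) and then fused with the current frame.
    """
    hidden = [0.0] * len(frames[0])
    outputs = []
    for t, current in enumerate(frames):
        # The last frame has no successor, so reuse it as its own "future".
        future = frames[min(t + 1, len(frames) - 1)]
        hidden = fuse(hidden, future)   # hypothetical hidden update
        hidden = fuse(hidden, current)  # hypothetical alignment and fusion
        outputs.append(hidden)
    return outputs


if __name__ == "__main__":
    video = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for step, feat in enumerate(propagate_forward(video)):
        print(step, feat)
```

In this sketch each frame is visited once, whereas a bidirectional scheme would need a second, backward pass over the sequence to see future frames. That difference is the intuition behind the reduced inference time described above.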

Cascaded Temporal Updating Network for Efficient Video Super-Resolution

Video super-resolution with phase-aided deformable alignment network

NoUCSR: Efficient Super-Resolution Network Without Upsampling Convolution.

CTVSR: Collaborative Spatial-Temporal Transformer for Video Super-Resolution

Video Super-Resolution Based on Multiple Networks Merging

Enhanced Video Super-Resolution Network Towards Compressed Data

Accelerating the Training of Video Super-Resolution Models

Video Super-Resolution Via a Spatio-Temporal Alignment Network.

Revisiting Temporal Modeling for Video Super-resolution.

Video super-resolution via mixed spatial-temporal convolution and selective fusion

Collaborative Feedback Discriminative Propagation for Video Super-Resolution

Attention-guided video super-resolution with recurrent multi-scale spatial–temporal transformer

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

You Only Align Once: Bidirectional Interaction for Spatial-Temporal Video Super-Resolution

AsConvSR: Fast and Lightweight Super-Resolution Network with Assembled Convolutions

A Lightweight Recurrent Grouping Attention Network for Video Super-Resolution

Dual feature enhanced video super-resolution network based on low-light scenarios

Learning for Unconstrained Space-Time Video Super-Resolution

Global Spatial-Temporal Information-based Residual ConvLSTM for Video Space-Time Super-Resolution

Temporal Modulation Network for Controllable Space-Time Video Super-Resolution

Video Super-Resolution Reconstruction Based on Deep Convolutional Neural Network and Spatio-Temporal Similarity