Cascaded Temporal Updating Network for Efficient Video Super-Resolution

Hao Li, Jiangxin Dong, Jinshan Pan
2024-08-26
Abstract: Existing video super-resolution (VSR) methods generally adopt a recurrent propagation network to extract spatio-temporal information from entire video sequences, exhibiting impressive performance. However, the key components in recurrent-based VSR networks significantly impact model efficiency, e.g., the alignment module occupies a substantial portion of model parameters, while the bidirectional propagation mechanism significantly amplifies the inference time. Consequently, developing a compact and efficient VSR method that can be deployed on resource-constrained devices, e.g., smartphones, remains challenging. To this end, we propose a cascaded temporal updating network (CTUN) for efficient VSR. We first develop an implicit cascaded alignment module to explore spatio-temporal correspondences from adjacent frames. Moreover, we propose a unidirectional propagation updating network to efficiently explore long-range temporal information, which is crucial for high-quality video reconstruction. Specifically, we develop a simple yet effective hidden updater that can leverage future information to update hidden features during forward propagation, significantly reducing inference time while maintaining performance. Finally, we formulate all of these components into an end-to-end trainable VSR network. Extensive experimental results show that our CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods. Notably, compared with BasicVSR, our method obtains better results while employing only about 30% of the parameters and running time. The source code and pre-trained models will be available at https://github.com/House-Leo/CTUN.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the trade-off between efficiency and performance when video super-resolution (VSR) methods are deployed on resource-constrained devices such as smartphones. Existing VSR methods usually rely on recurrent propagation networks to extract spatio-temporal information from the entire video sequence. Although they perform well, they have two main problems:

1. **High model complexity**: Alignment modules (such as optical flow and deformable convolution) account for a large share of the model parameters.
2. **Long inference time**: The bidirectional propagation mechanism significantly increases the inference time.

Developing a compact and efficient VSR method that achieves high-quality video reconstruction on resource-constrained devices therefore remains an important challenge. To this end, the authors propose a Cascaded Temporal Updating Network (CTUN) that significantly improves model efficiency while maintaining performance.

### Main contributions of CTUN

1. **Implicit Cascaded Alignment Module (ICAM)**: Explores the spatio-temporal correspondences among past, current, and future features implicitly, which makes the model more parameter-efficient and easier to train.
2. **Hidden Updater (HU)**: Uses future information to update hidden features, significantly reducing the memory consumption and inference time of recurrent-based VSR models while maintaining performance.
3. **End-to-end trainable unidirectional propagation network**: In experiments, CTUN achieves a favorable trade-off between performance and model complexity on multiple VSR benchmark datasets, with notably fewer parameters and a shorter running time than BasicVSR.
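The data flow behind the unidirectional propagation and the hidden updater can be illustrated with a toy sketch. This is not the authors' implementation: `fuse` and `update_hidden` are hypothetical stand-ins (plain averaging and blending over lists of floats) for the learned ICAM and HU modules, and the edge-replication padding at sequence boundaries is an assumption. The sketch only shows that a single forward pass can still see one frame of future context, which is the idea that removes the need for a separate backward pass.

```python
from typing import List, Sequence

Feature = List[float]  # toy stand-in for a feature map


def fuse(prev_f: Feature, cur_f: Feature, next_f: Feature, hidden: Feature) -> Feature:
    """Hypothetical stand-in for implicit alignment: average all four inputs."""
    return [(p + c + n + h) / 4.0 for p, c, n, h in zip(prev_f, cur_f, next_f, hidden)]


def update_hidden(fused: Feature, next_f: Feature) -> Feature:
    """Hypothetical hidden updater: blend fused features with the future frame."""
    return [0.5 * f + 0.5 * n for f, n in zip(fused, next_f)]


def forward_only_propagation(frames: Sequence[Feature]) -> List[Feature]:
    """Process the sequence in one forward pass with one-frame lookahead."""
    if not frames:
        return []
    hidden: Feature = [0.0] * len(frames[0])  # initial hidden state
    last = len(frames) - 1
    outputs: List[Feature] = []
    for t, cur_f in enumerate(frames):
        # Assumption: replicate the first/last frame at the sequence edges.
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, last)]
        fused = fuse(prev_f, cur_f, next_f, hidden)
        outputs.append(fused)
        # The hidden state carried to step t+1 already contains future information.
        hidden = update_hidden(fused, next_f)
    return outputs


if __name__ == "__main__":
    toy_frames = [[0.0], [4.0], [8.0]]
    print(forward_only_propagation(toy_frames))
```

In a bidirectional design, the backward pass must finish over the whole clip before any output frame can be produced. In this sketch each output needs only the next frame, so both latency and the amount of stored state stay bounded.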
### Experimental results

The experimental results show that CTUN achieves performance comparable to or better than existing methods on multiple benchmark datasets, with clear advantages in parameter count and inference time. For example, on the Vid4 dataset, CTUN improves the PSNR by 0.24 dB compared to BasicVSR, while using only about 30% of BasicVSR's parameters.

### Summary

The core problem of the paper is to develop a VSR method that can be efficiently deployed on resource-constrained devices. By introducing components such as ICAM and HU, CTUN significantly improves model efficiency while maintaining high-quality video reconstruction.
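To put the PSNR figure quoted in the experimental results in context, the following minimal sketch computes PSNR from the mean squared error and converts a dB gain into the equivalent MSE ratio. It assumes 8-bit pixel values (peak 255) stored as flat lists. The function names are illustrative, and the paper's exact evaluation protocol (color channel, border cropping) is not stated in this summary.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """MSE of the better method divided by MSE of the baseline for a given PSNR gain."""
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    print(psnr([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
    print(mse_ratio_for_gain(0.24))
```

Because PSNR is logarithmic in the MSE, a 0.24 dB gain corresponds to roughly 5% lower mean squared error at the same peak value.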