Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

Yutong Wang, Jiajie Teng, Jiajiong Cao, Yuming Li, Chenguang Ma, Hongteng Xu, Dixin Luo
2024-11-25
Abstract: As a very common type of video, face videos often appear in movies, talk shows, live broadcasts, and other scenes. Real-world online videos are often plagued by degradations such as blurring and quantization noise, due to the high compression ratio caused by high communication costs and limited transmission bandwidth. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite the significant advancement in video face enhancement, current methods still suffer from (i) long processing time and (ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome the above two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method develops upon a 3D-VQGAN backbone associated with spatial-temporal codebooks recording high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer mitigating the codebook collapse problem. In Stage II, we learn two transformers to look up codes from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness. Code is available at https://github.com/Dixin-Lab/BFVR-STC.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two main problems in video face enhancement:

1. **Long processing time**: Existing video face enhancement methods often require a long time when processing videos, which limits their efficiency in practical applications.
2. **Poor spatio-temporal consistency**: Because existing methods lack effective spatio-temporal constraints, they are prone to inconsistent visual effects between frames, such as flickering.

To overcome these problems, the paper proposes a new efficient blind video face enhancement method that restores high-quality videos from compressed low-quality versions and improves the spatio-temporal consistency of videos through an effective anti-flicker mechanism. Specifically, the method is based on a 3D-VQGAN backbone network and combines it with spatio-temporal codebooks that record high-quality portrait features and residual-based temporal information. The paper also proposes a two-stage learning framework to train the model:

- **First stage**: Use high-quality videos to train the spatio-temporal codebooks and the high-quality auto-encoder.
- **Second stage**: Use paired high-quality and low-quality videos, predict the spatio-temporal codebook indices of low-quality inputs through two lookup transformers, and further update the encoder of low-quality videos.

Experimental results show that this method outperforms the current state-of-the-art blind face video restoration and anti-flicker methods on the VFHQ-Test dataset, in terms of both efficiency and effectiveness.
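To make the codebook idea more concrete, the following is a minimal sketch of generic vector quantization, not the authors' implementation. It assumes that a codebook lookup means nearest-neighbor matching under squared Euclidean distance, and that "residual-based temporal information" can be illustrated by plain frame-to-frame feature differences; both are assumptions for illustration only. All function names (`nearest_code`, `quantize`, `temporal_residuals`, `codebook_usage`) and the toy values are hypothetical. In the paper, features come from a 3D-VQGAN encoder and Stage II predicts the indices with transformers rather than by distance search.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to `feature` (assumed lookup rule)."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))


def quantize(features: Sequence[Vector],
             codebook: Sequence[Vector]) -> Tuple[List[int], List[Vector]]:
    """Replace each feature by its nearest code; return indices and codes."""
    indices = [nearest_code(f, codebook) for f in features]
    return indices, [codebook[i] for i in indices]


def temporal_residuals(frames: Sequence[Vector]) -> List[List[float]]:
    """Differences between consecutive per-frame features (illustrative only)."""
    return [[c - p for c, p in zip(cur, prev)]
            for prev, cur in zip(frames, frames[1:])]


def codebook_usage(indices: Sequence[int], codebook_size: int) -> float:
    """Fraction of codes that were selected at least once.

    A value near 0 is one simple symptom of codebook collapse.
    """
    return len(set(indices)) / codebook_size


# Toy data: three 2-D "frame features" and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.0, 0.25], [1.0, 0.75], [3.5, 0.5]]

indices, quantized = quantize(frames, codebook)
residuals = temporal_residuals(frames)
usage = codebook_usage(indices, len(codebook))
```

On this toy input each frame maps to a different code, so the usage fraction is 1.0. A collapsed codebook would instead map most features to the same few indices, which is the failure mode that the Stage I regularizer is described as mitigating.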