Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency

Yutong Wang, Jiajie Teng, Jiajiong Cao, Yuming Li, Chenguang Ma, Hongteng Xu, Dixin Luo

2024-11-25

Abstract: Face videos are a very common type of video, appearing in movies, talk shows, live broadcasts, and other settings. Real-world online videos are often plagued by degradations such as blurring and quantization noise, because high communication costs and limited transmission bandwidth force high compression ratios. These degradations have a particularly serious impact on face videos because the human visual system is highly sensitive to facial details. Despite significant advances in video face enhancement, current methods still suffer from (i) long processing time and (ii) inconsistent spatial-temporal visual effects (e.g., flickering). This study proposes a novel and efficient blind video face enhancement method to overcome these two challenges, restoring high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism. In particular, the proposed method builds on a 3D-VQGAN backbone associated with spatial-temporal codebooks that record high-quality portrait features and residual-based temporal information. We develop a two-stage learning framework for the model. In Stage I, we learn the model with a regularizer that mitigates the codebook collapse problem. In Stage II, we learn two transformers to look up codes from the codebooks and further update the encoder of low-quality videos. Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods in both efficiency and effectiveness. Code is available at https://github.com/Dixin-Lab/BFVR-STC.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper attempts to solve two main problems in video face enhancement:

1. **Long processing time**: Existing video face enhancement methods take a long time to process a video, which limits their usefulness in practical applications.
2. **Poor spatio-temporal consistency**: Existing methods lack effective spatio-temporal constraints, so they tend to produce visual effects that are inconsistent between frames, such as flickering.

To overcome these problems, the paper proposes a new, efficient blind video face enhancement method that restores high-quality videos from compressed low-quality versions and improves their spatio-temporal consistency through an effective de-flickering mechanism. Specifically, the method is built on a 3D-VQGAN backbone network combined with spatio-temporal codebooks that record high-quality portrait features and residual-based temporal information.

The paper also proposes a two-stage learning framework to train the model:

- **First stage**: Use high-quality videos to train the spatio-temporal codebooks and the high-quality auto-encoder.
- **Second stage**: Use paired high-quality and low-quality videos to train two lookup transformers that predict the spatio-temporal codebook indices of low-quality inputs, and further update the encoder of low-quality videos.

Experimental results on the VFHQ-Test dataset show that this method outperforms the current state-of-the-art blind face video restoration and de-flickering methods in both efficiency and effectiveness.
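To make the codebook idea concrete, here is a minimal sketch of the generic vector-quantization step that VQGAN-style models rely on: each encoder feature is replaced by the index of its nearest codebook entry. It is not the authors' implementation. The function names `quantize` and `usage_perplexity` and the toy 2-D vectors are hypothetical. In the paper, Stage II predicts the indices with lookup transformers rather than by nearest-neighbor search. The perplexity value is a common diagnostic for codebook collapse, not the regularizer used in Stage I.

```python
import math
from collections import Counter


def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    indices = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(min(range(len(codebook)), key=dists.__getitem__))
    return indices


def usage_perplexity(indices):
    """Exponentiated entropy of the empirical code-usage distribution.

    A value close to 1 means almost all features map to a single code
    (a symptom of codebook collapse); the maximum is the codebook size.
    """
    counts = Counter(indices)
    total = len(indices)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)


# Toy data (hypothetical): three 2-D codes and four encoder features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.1, -0.2], [0.9, 1.2], [3.5, 4.1], [0.0, 0.3]]

indices = quantize(features, codebook)
perplexity = usage_perplexity(indices)
```

With this toy data the four features map to three distinct codes. A collapsed codebook would instead send every feature to the same index and give a perplexity of 1.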

Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement

Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

Efficiently Exploiting Spatially Variant Knowledge for Video Deblurring

Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer

Learning Degradation-Robust Spatiotemporal Frequency-Transformer for Video Super-Resolution

Learning an Occlusion-Aware Network for Video Deblurring

Unified Video and Image Representation for Boosted Video Face Forgery Detection

Perceptual Quality Assessment of Face Video Compression: A Benchmark and An Effective Method

Flow-Guided Sparse Transformer for Video Deblurring

PixRevive: Latent Feature Diffusion Model for Compressed Video Quality Enhancement

Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting

VEnhancer: Generative Space-Time Enhancement for Video Generation

Beyond GFVC: A Progressive Face Video Compression Framework with Adaptive Visual Tokens

Spatio-Temporal Filter Adaptive Network for Video Deblurring

An Efficient Network Design for Face Video Super-resolution

Efficient conditioned face animation using frontally-viewed embedding

Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition

Capturing Co-existing Distortions in User-Generated Content for No-reference Video Quality Assessment

Disentangle Propagation and Restoration for Efficient Video Recovery