SRL: Separation-and-Recombination Learning for Video Facial Landmark Detection with Limited Data

Wenyan Wu, Yici Cai, Qiang Zhou
DOI: https://doi.org/10.1109/fg52635.2021.9667064
2021-01-01
Abstract: Recent video facial landmark detection methods rely heavily on supervised learning with large amounts of annotated data. However, annotating video data is labor-intensive and time-consuming. Moreover, supervised training of networks with massive numbers of parameters is prone to overfitting and poor generalization. In this work, we propose the Separation-and-Recombination Learning (SRL) framework to tackle this problem; its central idea is to fully exploit the inherent information in limited labeled data in a semi-supervised manner. Specifically, the SRL framework consists of two stages: a separation stage and a recombination stage. First, in the separation stage, we train an auto-encoder network, the disentangling-net, which takes multi-frame temporal cues as input and is constrained by reconstruction and KL-divergence losses. This stage disentangles the face into two weakly coupled latent spaces: a structure space and an appearance space. Then, in the recombination stage, the trained disentangling-net greatly expands the limited labeled data into pseudo-paired data by recombining structure and appearance codes. Finally, we train a replaceable landmark detection network, the predicting-net, under the supervision of both labeled and pseudo-labeled data. In experiments, we demonstrate state-of-the-art performance on several well-known benchmarks: the 300VW [56], blurred-300VW [60], and RWMB [60] datasets. Most importantly, our method maintains impressive accuracy on extremely small training sets, down to as few as 50% of the samples.
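The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the network architectures, the form of the reconstruction loss, or the prior used in the KL term. The sketch assumes a mean-squared-error reconstruction loss, a diagonal Gaussian posterior regularized toward a standard normal prior, and that a recombined sample inherits the landmarks of its structure code. All function names (`kl_to_standard_normal`, `reconstruction_loss`, `recombine`) are hypothetical.

```python
import math
from itertools import product


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumption: the latent posterior is a diagonal Gaussian and the
    prior is a standard normal; the abstract only names a KL-divergence loss.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reconstruction_loss(x, x_hat):
    """Mean squared error between an input and its reconstruction (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def recombine(labeled, appearance_codes):
    """Pair every labeled structure code with every appearance code.

    labeled: list of (structure_code, landmarks) from annotated frames.
    appearance_codes: list of appearance codes taken from any frames.
    Returns (structure_code, appearance_code, landmarks) triples. A decoder
    (not shown) would render each code pair into a pseudo-labeled face image;
    here the landmarks are assumed to follow the structure code.
    """
    return [(s, a, lm) for (s, lm), a in product(labeled, appearance_codes)]


# Toy example: 2 labeled frames and 3 appearance codes give 6 pseudo pairs.
labeled = [
    ((0.1, 0.2), [(10, 12), (30, 12)]),
    ((0.4, -0.3), [(11, 14), (29, 15)]),
]
appearances = [(1.0,), (0.5,), (-0.2,)]
pseudo = recombine(labeled, appearances)
print(len(pseudo))
```

Under these assumptions, the number of pseudo-paired samples grows with the product of the number of labeled structure codes and the number of available appearance codes, which is one way to read the abstract's claim that limited labeled data can be greatly expanded.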