Learning Transferable Compound Expressions from Masked AutoEncoder Pretraining

Feng Qiu, Heming Du, Wei Zhang, Chen Liu, Lincheng Li, Tianchen Guo, Xin Yu
DOI: https://doi.org/10.1109/cvprw63382.2024.00476
2024-01-01
Computer Vision and Pattern Recognition
Abstract: Video-based Compound Expression Recognition (CER) aims to identify, frame by frame, the compound expressions that occur in everyday interactions. While Facial Expression Recognition (FER) for basic emotions (e.g., surprised, sad, and fearful) has progressed rapidly, CER for compound emotions (e.g., fearfully surprised and sadly fearful) remains under-explored, with an evident lack of substantial datasets. In this paper, we design a framework to demonstrate the feasibility of predicting compound expressions in-the-wild without relying on domain-specific supervision. Specifically, we first train a model on a large-scale facial dataset using the Masked Autoencoder (MAE) approach to learn comprehensive facial features. Then, to tailor it for facial expression analysis, we fine-tune the ViT encoder on an Action Unit (AU) detection task. To address the issue of insufficient data, we recast compound emotion recognition as a multi-label recognition task over basic emotions. We train a network by fine-tuning the pretrained ViT encoder to predict the probability of each basic emotion, and then combine these probabilities to obtain the final prediction for the compound emotions. Experiments on the C-EXPR-DB dataset demonstrate the effectiveness of our framework for frame-by-frame prediction of compound expressions in-the-wild. Our framework was recognized as the leading solution in the Compound Expression (CE) Recognition Challenge of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). More information about the competition can be found at: 6th ABAW.
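
The final step of the abstract, combining per-frame basic-emotion probabilities into a compound prediction, can be illustrated with a minimal sketch. The abstract does not state the combination rule, so everything below is an assumption for illustration: the product of the two component probabilities is used as the compound score, the list of compound classes and their basic-emotion pairs is illustrative, and the function name `combine_basic_probabilities` is hypothetical. It is not the authors' implementation.

```python
import numpy as np

# Basic emotions predicted by the multi-label head (order is an assumption).
BASIC = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Illustrative compound classes, each mapped to a pair of basic emotions.
# The exact class list and pairing are assumptions, not taken from the abstract.
COMPOUND = {
    "fearfully surprised": ("fear", "surprise"),
    "happily surprised": ("happiness", "surprise"),
    "sadly surprised": ("sadness", "surprise"),
    "disgustedly surprised": ("disgust", "surprise"),
    "angrily surprised": ("anger", "surprise"),
    "sadly fearful": ("sadness", "fear"),
    "sadly angry": ("sadness", "anger"),
}


def combine_basic_probabilities(probs):
    """Score compound classes from per-frame basic-emotion probabilities.

    probs: array-like of shape (num_frames, 6) holding independent
    (multi-label) probabilities, one column per entry of BASIC.
    Returns (labels, scores): the top compound label per frame and the
    (num_frames, 7) score matrix.
    """
    probs = np.asarray(probs, dtype=float)
    index = {name: i for i, name in enumerate(BASIC)}
    names = list(COMPOUND)
    # Assumed rule: compound score = product of its two component probabilities.
    scores = np.stack(
        [probs[:, index[a]] * probs[:, index[b]] for a, b in COMPOUND.values()],
        axis=1,
    )
    labels = [names[k] for k in scores.argmax(axis=1)]
    return labels, scores


if __name__ == "__main__":
    frames = [
        [0.1, 0.1, 0.9, 0.1, 0.1, 0.8],  # high fear and surprise
        [0.1, 0.1, 0.7, 0.1, 0.9, 0.1],  # high sadness and fear
    ]
    labels, _ = combine_basic_probabilities(frames)
    print(labels)
```

Other rules (for example, summing the two probabilities or learning a small classifier on top of them) would fit the same interface; the product is chosen here only because it rewards frames where both component emotions are likely at once.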