Abstract: Facial expression recognition (FER) has significant practical value in real-world scenarios such as human–computer interaction, fatigue driving detection, and learning engagement analysis. However, acquiring large-scale, high-quality annotated facial expression datasets is difficult because facial images are inherently ambiguous and raise privacy concerns. This paper therefore introduces a self-supervised facial expression recognition method based on masked image modeling. The method learns multi-level facial feature representations without expensive labels and achieves strong recognition performance through further fine-grained feature selection. Specifically, we propose the multi-level feature selector (MFS), which comprises two components: a multi-level feature combiner and a feature selector. During the pre-training stage, the multi-level feature combiner integrates multi-level features, addressing the vision transformer's weakness in capturing high-frequency facial semantics. In the fine-tuning stage, the feature selector automatically identifies highly discriminative regions and extracts fine-grained features from them. We then use graph convolutional networks to mine the latent connections among these fine-grained features, deriving an integrated feature with enhanced discriminative capability. This fine-grained facial feature selection mitigates the performance degradation caused by inter-class similarities and intra-class variations. Experimental results on the RAF-DB, AffectNet, and FER+ datasets demonstrate that our approach significantly outperforms other self-supervised methods in recognition performance and closely approaches state-of-the-art supervised methods. The code is available at https://github.com/Greysahy/MFS
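To make the masked-image-modeling pre-training idea concrete, the following is a minimal sketch of random patch masking, the generic step in which a subset of image patches is hidden and only the visible ones are passed to the encoder. It is not taken from the paper or its repository: the function name `random_patch_mask`, the 196-patch grid (a 224x224 image cut into 16x16 patches), and the 0.75 mask ratio are illustrative assumptions.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into visible and masked sets.

    Hypothetical helper for illustration only; the paper's actual
    masking strategy and ratio are not stated in the abstract.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")

    rng = random.Random(seed)          # local RNG so results are reproducible
    order = list(range(num_patches))
    rng.shuffle(order)                 # random permutation of patch indices

    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # patches hidden from the encoder
    visible = sorted(order[num_masked:])   # patches the encoder actually sees
    return visible, masked


# Assumed setting: 14 x 14 = 196 patches, 75% of them masked.
visible, masked = random_patch_mask(196, 0.75, seed=0)
print(len(visible), len(masked))
```

In a full pipeline the encoder would process only the `visible` patches and a decoder would be trained to reconstruct the `masked` ones; how the multi-level feature combiner attaches to that encoder is specific to the paper and is not sketched here.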

Self-supervised facial expression recognition with fine-grained feature selection
