Abstract: Learning discriminative and robust representations is important for facial expression recognition (FER) because emotional faces differ only subtly and their annotations are subjective. Previous works usually address only one of these two representation goals, because the goals seem contradictory to optimize, and their performance inevitably suffers from the challenge they leave unaddressed. In this article, by considering this problem from two novel perspectives, we demonstrate that discriminative and robust representations can be learned in a unified approach, DR-FER, and can mutually benefit each other. Moreover, we achieve this with supervision from only the original annotations. Specifically, to learn discriminative representations, we propose performing masked image modeling (MIM) as an auxiliary task to force our network to discover expression-related facial areas. This is the first attempt to employ MIM to explore discriminative patterns in a self-supervised manner. To extract robust representations, we present a category-aware self-paced learning schedule to mine high-quality annotated (easy) expressions and incorrectly annotated (hard) counterparts. We further introduce a retrieval similarity-based relabeling strategy to correct the annotations of hard expressions so that they can be exploited more effectively. With the enhanced discrimination ability of the FER classifier acting as a bridge, these two learning goals significantly strengthen each other. Extensive experiments on several popular benchmarks demonstrate the superior performance of our DR-FER. Moreover, thorough visualizations and extra experiments on datasets with manually corrupted annotations show that our approach successfully learns discriminative and robust representations simultaneously.
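
The robust-representation part of the abstract (a category-aware easy/hard split followed by retrieval similarity-based relabeling) can be illustrated with a minimal sketch. The abstract does not specify the threshold rule, the similarity measure, or how retrieved neighbors are combined, so everything below is an assumption made for illustration: the hypothetical `split_easy_hard` treats a sample as easy when its loss is at most `pace` times the mean loss of its own category, and the hypothetical `relabel_hard` assigns each hard sample the majority label of its `k` most cosine-similar easy samples. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def split_easy_hard(losses, labels, pace=1.0):
    """Split sample indices into easy and hard, one threshold per category.

    Assumed rule (not from the paper): a sample is easy if its loss is at
    most pace * (mean loss of the samples sharing its label).
    """
    per_class = defaultdict(list)
    for loss, label in zip(losses, labels):
        per_class[label].append(loss)
    thresholds = {c: pace * sum(v) / len(v) for c, v in per_class.items()}

    easy, hard = [], []
    for i, (loss, label) in enumerate(zip(losses, labels)):
        (easy if loss <= thresholds[label] else hard).append(i)
    return easy, hard


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relabel_hard(features, labels, easy, hard, k=3):
    """Relabel each hard sample by majority vote over its k nearest easy samples.

    Assumed rule (not from the paper): retrieval uses cosine similarity on
    feature vectors, and the most common label among the top-k wins.
    """
    new_labels = list(labels)
    for h in hard:
        ranked = sorted(
            easy, key=lambda e: cosine(features[h], features[e]), reverse=True
        )
        votes = Counter(labels[e] for e in ranked[:k])
        if votes:  # leave the label unchanged if there are no easy samples
            new_labels[h] = votes.most_common(1)[0][0]
    return new_labels
```

In this sketch, raising `pace` over the course of training would admit more samples as easy, which is the usual self-paced idea; the per-category threshold keeps categories with inherently higher loss from being flagged as hard wholesale.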

Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition

DR-FER: Discriminative and Robust Representation Learning for Facial Expression Recognition

Landmarks-assisted Collaborative Deep Framework for Automatic 4D Facial Expression Recognition

NR-DFERNet: Noise-Robust Network for Dynamic Facial Expression Recognition

A$^{3}$lign-DFER: Pioneering Comprehensive Dynamic Affective Alignment for Dynamic Facial Expression Recognition with CLIP

Automatic 4D Facial Expression Recognition via Collaborative Cross-domain Dynamic Image Network

Joint Structured Sparsity Regularized Multiview Dimension Reduction for Video-Based Facial Expression Recognition

Combining 2D Gabor and Local Binary Pattern for Facial Expression Recognition Using Extreme Learning Machine

FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs

MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild

MAE-DFER: Efficient Masked Autoencoder for Self-supervised Dynamic Facial Expression Recognition

2D+3D Facial Expression Recognition via Discriminative Dynamic Range Enhancement and Multi-Scale Learning

From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos

Towards Reading Beyond Faces for Sparsity-aware 3D/4D Affect Recognition

Multi-Attention Module for Dynamic Facial Emotion Recognition

Automatic 4D Facial Expression Recognition Using Dynamic Geometrical Image Network

Multi-View Exclusive Unsupervised Dimension Reduction for Video-Based Facial Expression Recognition

Enhanced Dual-Level Representations for Facial Expression Recognition

Intensity-Aware Loss for Dynamic Facial Expression Recognition in the Wild

Fine-Grained Temporal-Enhanced Transformer for Dynamic Facial Expression Recognition