Abstract: Existing facial expression recognition (FER) methods typically fine-tune a pre-trained visual encoder using discrete labels. However, this form of supervision is limited in its ability to specify the emotional concepts behind different facial expressions. In this paper, we observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations. Inspired by this, we propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation. Specifically, we formulate the FER problem as a process of matching the similarity between a facial expression representation and text embeddings. Then, we transform the facial expression representation into a neutral representation by simulating the difference in text embeddings from textual facial expression to textual neutral. Finally, a self-contrast objective is introduced to pull the facial expression representation closer to the textual facial expression, while pushing it farther from the neutral representation. We evaluate the method with diverse pre-trained visual encoders, including ResNet-18 and Swin-T, on four challenging facial expression datasets. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art FER methods. The code will be publicly available.

What problem does this paper attempt to address?

The paper addresses a limitation of existing facial expression recognition (FER) methods: when supervised only with discrete labels, they cannot fully express the emotional concepts behind different facial expressions. Most existing FER methods learn facial expression representations by fine-tuning a pre-trained visual encoder, then train a classifier on discrete labels to map these representations to confidence scores. This setup ignores how the emotional concepts of different expressions differ from one another, for example how fear differs from anger.

To address this, the authors propose a knowledge-enhanced FER method that uses text embeddings generated by vision-language models (VLMs) as external knowledge to guide the learning of facial expression representations. The main contributions are:

1. **Text embeddings as supervision signals**: Instead of relying on discrete labels alone, the method matches the similarity between facial expression representations and text embeddings, which better captures the emotional concepts of different facial expressions.
2. **Emotional-to-neutral transformation**: Inspired by Russell's circumplex model of emotion, the authors design a transformation that converts a facial expression representation into a neutral representation by simulating the difference between the corresponding text embeddings.
3. **Self-contrast objective**: To make facial expression representations more discriminative, a self-contrast objective pulls each representation closer to its corresponding text embedding and pushes it farther from the neutral representation.

With these changes, the authors aim to improve FER performance, especially on complex and challenging datasets.
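The similarity-matching idea in the first contribution can be illustrated with a minimal sketch. It assumes the class text embeddings and the facial expression representation are already available as plain vectors; the function names `cosine` and `predict`, the label set, and the 3-dimensional toy vectors are all hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, 1e-12)  # guard against zero vectors

def predict(face_repr, text_embeddings):
    """Return the label whose text embedding is most similar to face_repr."""
    scores = {label: cosine(t, face_repr) for label, t in text_embeddings.items()}
    return max(scores, key=scores.get), scores

# Made-up 3-d vectors standing in for frozen VLM text embeddings (assumption).
text_embeddings = {
    "happy": [1.0, 0.0, 0.0],
    "angry": [0.0, 1.0, 0.0],
    "neutral": [0.0, 0.0, 1.0],
}
face_repr = [0.9, 0.2, 0.1]  # hypothetical visual-encoder output

label, scores = predict(face_repr, text_embeddings)
print(label, scores)
```

In this toy setup the prediction is simply the class whose text embedding has the highest cosine similarity with the face representation; a real system would obtain both sides from trained encoders.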
### Formula summary

- Text embedding generation: \[ T = \{t_c = F_t(\text{Prompt}_c),\ c = 1, 2, \ldots, C\} \] where \(F_t\) is the frozen VLM text encoder and \(\text{Prompt}_c\) is the prompt template for the \(c\)-th category.
- Similarity between a text embedding \(t_c\) and the facial expression representation \(v_i\) of sample \(i\): \[ \text{sim}(t_c, v_i)=\frac{t_c\cdot v_i}{\|t_c\|\|v_i\|} \]
- Cross-entropy loss over the similarities: \[ L_s = -\frac{1}{N}\sum_{i = 1}^{N}\log\left(\frac{\exp(\text{sim}(t_i, v_i)/\tau)}{\sum_{c = 1}^{C}\exp(\text{sim}(t_c, v_i)/\tau)}\right) \] where \(t_i\) is the text embedding of the ground-truth class of sample \(i\) and \(\tau\) is a temperature.
- Transformation loss: \[ L_t=\frac{1}{N_e}\sum_{i = 1}^{N_e}\left(1-\frac{\Delta t\cdot\Delta v}{\|\Delta t\|\|\Delta v\|}\right) \] where \(\Delta v = v_i - n_i\) and \(\Delta t = t_i - t_n\), with \(n_i\) the neutral representation obtained from \(v_i\) and \(t_n\) the text embedding of the neutral class.
- Self-contrast loss: \[ L_c=\frac{1}{N_e}\sum_{i = 1}^{N_e}(\text{sim}(t_i, n_i)-\text{sim}(t_i, v_i)+\gamma) \] where \(\gamma\) is a margin.
- Overall loss: \[ L_{\text{total}}=\lambda_sL_s+\lambda_tL_t+\lambda_cL_c \]

According to the abstract, a model trained with these objectives outperforms state-of-the-art FER methods on four facial expression datasets, using ResNet-18 and Swin-T as visual encoders.
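The losses above can be sketched in plain Python to make the indexing concrete. This is an illustrative reading of the summarized formulas, not the authors' implementation. Several points are assumptions: \(N_e\) is taken to be the number of non-neutral samples, the values `tau=0.07`, `gamma=0.2`, and unit loss weights are placeholders, and the self-contrast term is implemented exactly as written in the summary, without clamping at zero. All function names and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity, i.e. sim(., .) in the formula summary."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, 1e-12)  # guard against zero vectors

def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def similarity_loss(V, labels, T, tau):
    """L_s: cross-entropy over temperature-scaled cosine similarities."""
    total = 0.0
    for v, y in zip(V, labels):
        logits = [cosine(t, v) / tau for t in T]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[y]
    return total / len(V)

def emotional(V, N, labels, neutral):
    """Keep (v_i, n_i, label) triples for non-neutral samples (assumed meaning of N_e)."""
    return [(v, n, y) for v, n, y in zip(V, N, labels) if y != neutral]

def transformation_loss(V, N, labels, T, neutral):
    """L_t: 1 - cos(delta_t, delta_v), with delta_v = v_i - n_i and delta_t = t_i - t_n."""
    items = emotional(V, N, labels, neutral)
    terms = [1.0 - cosine(sub(T[y], T[neutral]), sub(v, n)) for v, n, y in items]
    return sum(terms) / len(terms) if terms else 0.0

def self_contrast_loss(V, N, labels, T, neutral, gamma):
    """L_c: sim(t_i, n_i) - sim(t_i, v_i) + gamma, as written (no clamp)."""
    items = emotional(V, N, labels, neutral)
    terms = [cosine(T[y], n) - cosine(T[y], v) + gamma for v, n, y in items]
    return sum(terms) / len(terms) if terms else 0.0

def total_loss(V, N, labels, T, neutral, tau=0.07, gamma=0.2, weights=(1.0, 1.0, 1.0)):
    """Weighted sum lambda_s * L_s + lambda_t * L_t + lambda_c * L_c."""
    l_s = similarity_loss(V, labels, T, tau)
    l_t = transformation_loss(V, N, labels, T, neutral)
    l_c = self_contrast_loss(V, N, labels, T, neutral, gamma)
    w_s, w_t, w_c = weights
    return w_s * l_s + w_t * l_t + w_c * l_c, (l_s, l_t, l_c)

# Toy data: three classes, index 2 plays the role of "neutral" (assumption).
T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # text embeddings t_c
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                    # face representations v_i
N = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]                    # neutral representations n_i
labels = [0, 1]

total, (l_s, l_t, l_c) = total_loss(V, N, labels, T, neutral=2)
print(total, l_s, l_t, l_c)
```

In this idealized toy case each face representation coincides with its class text embedding and each neutral representation coincides with the neutral text embedding, so the transformation term is essentially zero and the self-contrast term sits at its most favorable value for the chosen margin.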

Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation
