Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation

Hangyu Li, Yihan Xu, Jiangchao Yao, Nannan Wang, Xinbo Gao, Bo Han
2024-09-13
Abstract: Existing facial expression recognition (FER) methods typically fine-tune a pre-trained visual encoder using discrete labels. However, this form of supervision is limited in its ability to specify the emotional concept of different facial expressions. In this paper, we observe that the rich knowledge in text embeddings, generated by vision-language models, is a promising alternative for learning discriminative facial expression representations. Inspired by this, we propose a novel knowledge-enhanced FER method with an emotional-to-neutral transformation. Specifically, we formulate the FER problem as a process to match the similarity between a facial expression representation and text embeddings. Then, we transform the facial expression representation to a neutral representation by simulating the difference in text embeddings from textual facial expression to textual neutral. Finally, a self-contrast objective is introduced to pull the facial expression representation closer to the textual facial expression, while pushing it farther from the neutral representation. We conduct evaluation with diverse pre-trained visual encoders including ResNet-18 and Swin-T on four challenging facial expression datasets. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art FER methods. The code will be publicly available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a limitation of existing facial expression recognition (FER) methods: supervision with discrete labels cannot fully express the emotional concepts behind different facial expressions. Most existing FER methods learn facial expression representations by fine-tuning a pre-trained visual encoder and train a classifier on discrete labels to map these representations to confidence scores. This setup ignores how the emotional concepts of different expressions differ, such as the difference between fear and anger.

To address this, the authors propose a knowledge-enhanced FER method that uses text embeddings generated by vision-language models (VLMs) as external knowledge to guide the learning of facial expression representations. The main contributions are:

1. **Text embeddings as supervision signals**: Recognition is performed by matching the similarity between facial expression representations and text embeddings rather than relying on discrete labels alone, so the emotional concepts of different facial expressions are better captured.
2. **Emotional-to-neutral transformation**: Inspired by Russell's circumplex model of emotion, the authors transform a facial expression representation into a neutral representation by simulating the difference between the text embedding of the expression and the text embedding of neutral.
3. **Self-contrast objective**: To further improve the discriminative ability of facial expression representations, a self-contrast objective pulls each facial expression representation closer to its corresponding text embedding and pushes it farther from the neutral representation.

With these components, the authors aim to improve recognition performance, especially on complex and challenging datasets.
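The similarity-matching idea can be illustrated with a small sketch. It assumes cosine similarity as the matching score and uses made-up 3-dimensional vectors in place of real encoder outputs; the names `cosine`, `predict`, and `text_embeddings` are hypothetical and do not come from the paper's code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict(face_vec, text_embeddings):
    """Return the label whose text embedding is most similar to face_vec."""
    return max(text_embeddings,
               key=lambda label: cosine(face_vec, text_embeddings[label]))

# Toy embeddings, invented for illustration; a real system would obtain
# them from a frozen text encoder and a fine-tuned visual encoder.
text_embeddings = {
    "happy": [1.0, 0.0, 0.0],
    "angry": [0.0, 1.0, 0.0],
    "neutral": [0.0, 0.0, 1.0],
}
face_vec = [0.2, 0.9, 0.1]
predicted = predict(face_vec, text_embeddings)
print(predicted)
```

In this toy setup the face vector points mostly along the "angry" axis, so that label receives the highest similarity. The classifier head is replaced by a lookup over text embeddings, which is what lets the text side carry the emotional concept.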
### Formula summary

- Text embedding generation:
\[ T = \{t_c = F_t(\text{Prompt}_c),\ c = 1, 2, \ldots, C\} \]
where \(F_t\) is the frozen VLM text encoder and \(\text{Prompt}_c\) is the prompt template for the \(c\)-th category.
- Similarity between a text embedding \(t_c\) and a facial expression representation \(v_i\):
\[ \text{sim}(t_c, v_i) = \frac{t_c \cdot v_i}{\|t_c\| \|v_i\|} \]
- Cross-entropy loss:
\[ L_s = -\frac{1}{N} \sum_{i=1}^{N} \log\left(\frac{\exp(\text{sim}(t_i, v_i)/\tau)}{\sum_{c=1}^{C} \exp(\text{sim}(t_c, v_i)/\tau)}\right) \]
where \(t_i\) is the text embedding matched to sample \(i\) and \(\tau\) is a temperature.
- Transformation loss:
\[ L_t = \frac{1}{N_e} \sum_{i=1}^{N_e} \left(1 - \frac{\Delta t_i \cdot \Delta v_i}{\|\Delta t_i\| \|\Delta v_i\|}\right) \]
where \(\Delta v_i = v_i - n_i\) and \(\Delta t_i = t_i - t_n\), with \(n_i\) the neutral representation obtained from \(v_i\) and \(t_n\) the text embedding of neutral.
- Self-contrast loss:
\[ L_c = \frac{1}{N_e} \sum_{i=1}^{N_e} \left(\text{sim}(t_i, n_i) - \text{sim}(t_i, v_i) + \gamma\right) \]
- Overall loss:
\[ L_{\text{total}} = \lambda_s L_s + \lambda_t L_t + \lambda_c L_c \]

According to the paper, training with these objectives outperforms state-of-the-art FER methods on the four evaluated datasets.
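To make the three loss terms concrete, here is a minimal per-sample sketch in plain Python that follows the formulas as written in this summary. It is not the authors' implementation: the helper names, the 2-dimensional toy vectors, and the values chosen for \(\tau\), \(\gamma\), and the \(\lambda\) weights are all assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def matching_loss(v, label, text, tau):
    """Cross-entropy over sim(t_c, v) / tau for one sample."""
    logits = {c: cosine(t, v) / tau for c, t in text.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits.values()))
    return log_denominator - logits[label]

def transformation_loss(v, n, t, t_neutral):
    """1 - cos(t - t_neutral, v - n) for one sample."""
    return 1.0 - cosine(sub(t, t_neutral), sub(v, n))

def self_contrast_loss(v, n, t, gamma):
    """sim(t, n) - sim(t, v) + gamma for one sample, with no clamp at zero."""
    return cosine(t, n) - cosine(t, v) + gamma

# Toy data (assumed): two classes and one "happy" sample.
text = {"happy": [1.0, 0.0], "neutral": [0.0, 1.0]}
v = [1.0, 0.0]   # facial expression representation
n = [0.0, 1.0]   # its neutral representation
tau, gamma = 0.1, 0.5            # illustrative hyperparameters
lam_s, lam_t, lam_c = 1.0, 1.0, 1.0  # illustrative loss weights

l_s = matching_loss(v, "happy", text, tau)
l_t = transformation_loss(v, n, text["happy"], text["neutral"])
l_c = self_contrast_loss(v, n, text["happy"], gamma)
total = lam_s * l_s + lam_t * l_t + lam_c * l_c
```

In this idealized case the visual difference \(v - n\) is parallel to the textual difference \(t - t_n\), so the transformation term vanishes, and the matching term is close to zero because the sample aligns with its own text embedding. The self-contrast term evaluates to \(\gamma - 1\), which is negative here because the formula as summarized has no lower bound.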