Context-aware Emotion Recognition Based on Vision-Language Pre-trained Model

XingLin Li, Xinde Li, Chuanfei Hu, Huaping Liu
DOI: https://doi.org/10.1109/icarm62033.2024.10715789
2024-01-01
Abstract: Given the difficulty of recognizing ambiguous emotions in facial expression recognition tasks, we propose a vision-language model named CAER-CLIP to address this challenge. CAER-CLIP stands for Context-Aware Emotion Recognition (CAER) and incorporates the structure of the Contrastive Language–Image Pre-training (CLIP) model as a promising alternative to a conventional classifier. The CAER-CLIP model has two parts. In the visual part, facial expressions and contextual information of the image are extracted simultaneously to obtain the final feature embeddings, which are then used as a learnable “class” token for text-image pairing with the desired module. In the textual part, we use the text labels of the emotion recognition classes as input. The outputs of the two parts are then combined in contrastive learning to train the parameters of the model. The experiments demonstrate the effectiveness of the proposed method and show that CAER-CLIP outperforms the state-of-the-art results on the CAER benchmark. The ablation experiments evaluate both the classifier-based and the text-based (ours, without a classifier) models, demonstrating that our method with the CAER-CLIP structure performs better and that incorporating a text encoder into the deep network architecture effectively enhances recognition accuracy.
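The text-image pairing described in the abstract can be illustrated with a minimal sketch of CLIP-style matching. It is not the authors' implementation. The function names (`fuse`, `classify`), the averaging used to fuse face and context features, the temperature value, the three toy labels, and the hand-written embeddings are all assumptions made for illustration. In the real model the embeddings would come from trained visual and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def fuse(face_feat, context_feat):
    # Assumption: element-wise average. The abstract does not specify
    # how face and context features are combined.
    return [(f + c) / 2.0 for f, c in zip(face_feat, context_feat)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(face_feat, context_feat, text_embeds, temperature=0.07):
    """Pick the label whose text embedding is most similar to the fused visual embedding."""
    visual = l2_normalize(fuse(face_feat, context_feat))
    labels = list(text_embeds)
    logits = []
    for name in labels:
        text = l2_normalize(text_embeds[name])
        cosine = sum(v * t for v, t in zip(visual, text))
        logits.append(cosine / temperature)  # temperature sharpens the distribution
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Toy, hand-written embeddings standing in for encoder outputs (hypothetical values).
text_embeds = {
    "happy": [1.0, 0.0, 0.0],
    "sad": [0.0, 1.0, 0.0],
    "angry": [0.0, 0.0, 1.0],
}
face = [0.9, 0.2, 0.1]
context = [0.7, 0.1, 0.3]

label, probs = classify(face, context, text_embeds)
print(label, probs)
```

In this sketch the text embeddings take the place of a classifier head: adding or renaming a class only requires a new label embedding, which is one reason a text encoder can stand in for a fixed classification layer.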