Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

Yuanyuan Liu, Yuxuan Huang, Shuyang Liu, Yibing Zhan, Zijing Chen, Zhe Chen
2024-08-01
Abstract: In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper primarily addresses video-based facial expression recognition (V-FER) in open-set environments. Specifically:

1. **Open-Set Video-based Facial Expression Recognition (OV-FER) Task**:
   - Existing V-FER models are typically trained on closed-set datasets and can only recognize a fixed set of predefined expression categories.
   - These models perform poorly when they encounter unknown expressions in real-world scenarios.
   - To address this, the authors introduce the OV-FER task, which aims to recognize both known expressions and new, unseen ones.
2. **Proposed New Method**:
   - Existing CLIP-based approaches may not adequately capture subtle human expressions, so the paper proposes a novel Human Expression-Sensitive Prompting (HESP) mechanism.
   - HESP consists of three parts: a textual prompting module, a visual prompting module, and an open-set multi-task learning scheme.
3. **Objectives**:
   - Enhance the ability of the CLIP model to capture subtle expression changes in videos, thereby improving recognition of both known and unknown expressions.
   - Experiments on four OV-FER task settings show that HESP significantly improves the performance of CLIP, with a relative improvement of 17.93% on AUROC and 106.18% on OSCR.

Through these improvements, the paper aims to establish a more robust model capable of recognizing various facial expressions in diverse environments, applicable to fields such as intelligent healthcare and human-computer interaction.
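To make the reported metrics more concrete, here is a minimal Python sketch of how an open-set AUROC and a relative improvement could be computed. It is not the paper's evaluation code: the function names `auroc` and `relative_improvement` are hypothetical, the scores are toy values invented for illustration, and the sketch assumes each sample receives a single confidence score where higher means "more likely a known expression".

```python
def auroc(known_scores, unknown_scores):
    """Pairwise (Mann-Whitney) AUROC for known-vs-unknown detection.

    Returns the fraction of (known, unknown) pairs in which the known
    sample receives the higher confidence score; ties count as half.
    """
    wins = 0.0
    for k in known_scores:
        for u in unknown_scores:
            if k > u:
                wins += 1.0
            elif k == u:
                wins += 0.5
    return wins / (len(known_scores) * len(unknown_scores))


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Toy confidence scores, invented for illustration only (not from the paper).
known = [0.9, 0.8, 0.7, 0.4]
unknown = [0.6, 0.3, 0.2]
toy_auroc = auroc(known, unknown)

# A metric rising from 0.5 to 0.75 is a 50% relative improvement.
toy_gain = relative_improvement(0.5, 0.75)
```

Under this definition, a relative improvement above 100%, such as the 106.18% reported for OSCR, means the metric more than doubled relative to the CLIP baseline. The absolute baseline values are not given in this section, so they are not reproduced here.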