Abstract: Context-aware emotion recognition (CAER) leverages comprehensive scene information, including facial expressions, body postures, and contextual background. However, current studies predominantly rely on facial expressions, body postures, and global contextual features; the interaction between the agents (target individuals) and other objects in the scene is usually absent or incomplete. In this article, a three-dimensional view relationship-based CAER (TDRCer) method is proposed, which comprises two branches: the personal emotional branch (PEB) and the contextual emotional branch (CEB). First, PEB is designed to extract facial expression features and body posture features from the agent. A vision transformer (ViT), pretrained by contrastive learning with a novel loss function combining Euclidean distance and cosine similarity, is applied to enhance the robustness of facial expression features. Meanwhile, the human body contour images extracted by semantic segmentation are fed into another ViT to extract body posture features. Second, CEB is constructed to extract global contextual features and the interactive relationships among objects in the scene. Images in which the agents' bodies are masked out are fed into a ViT to extract global contextual features. By leveraging both the gaze angle and the depth map, a three-dimensional view graph (3DVG) is constructed to represent the interactive relationships between agents and objects in the scene. Then, a graph convolutional network is employed to extract interactive relationship features from the 3DVG. Finally, a multiplicative fusion strategy is applied to fuse the features of the two branches, and the fused features are used to classify the emotions. TDRCer achieves an accuracy of 89.90% on the CAER-S dataset and a mean average precision (mAP) of 36.02% on the EMOTIons in Context (EMOTIC) dataset. The code can be accessed at https://github.com/mengTender/TDRCer.
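The final fusion step can be illustrated with a minimal sketch. The abstract states only that a multiplicative fusion strategy combines the features of the two branches; it does not give the exact formulation. The code below therefore assumes the simplest reading, an element-wise (Hadamard) product of two equal-length feature vectors followed by a linear classifier with softmax. All names (`multiplicative_fusion`, `classify`, `DIM`, `NUM_CLASSES`) and the random feature values are hypothetical and are not taken from the TDRCer code.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_fusion(peb_feat, ceb_feat):
    """Assumed fusion: element-wise product of the two branch features."""
    if len(peb_feat) != len(ceb_feat):
        raise ValueError("branch features must share one dimension before fusion")
    return [p * c for p, c in zip(peb_feat, ceb_feat)]


def classify(peb_feat, ceb_feat, weights, bias):
    """Fuse the two branch features, then map them to class probabilities."""
    fused = multiplicative_fusion(peb_feat, ceb_feat)
    return softmax(linear(fused, weights, bias))


# Hypothetical sizes and random stand-ins for the real branch outputs.
rng = random.Random(0)
DIM, NUM_CLASSES = 8, 7
peb = [rng.uniform(-1, 1) for _ in range(DIM)]   # stand-in for PEB features
ceb = [rng.uniform(-1, 1) for _ in range(DIM)]   # stand-in for CEB features
W = [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

probs = classify(peb, ceb, W, b)
predicted_class = max(range(NUM_CLASSES), key=lambda i: probs[i])
```

Under this assumption, a feature dimension contributes to the fused vector only when both branches respond on it, which is one common motivation for multiplicative over additive or concatenation-based fusion. Whether TDRCer applies the product to raw features, projected features, or per-branch scores is not stated in the abstract.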

Context-aware Emotion Recognition Based on Vision-Language Pre-trained Model

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition

Large Vision-Language Models as Emotion Recognizers in Context Awareness

Learning Emotion Representations from Verbal and Nonverbal Communication

CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition

Prompting Visual-Language Models for Dynamic Facial Expression Recognition

EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition

Context De-confounded Emotion Recognition

Contextual Emotion Recognition using Large Vision Language Models

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Robust Light-Weight Facial Affective Behavior Recognition with CLIP

GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

Clip-aware expressive feature learning for video-based facial expression recognition

CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model

Cluster-Level Contrastive Learning for Emotion Recognition in Conversations

Three-Dimensional View Relationship-Based Context-Aware Emotion Recognition

Knowledge-Enhanced Facial Expression Recognition with Emotional-to-Neutral Transformation

Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Supervised Adversarial Contrastive Learning for Emotion Recognition in Conversations