Abstract: Emotion understanding is an essential but highly challenging component of artificial general intelligence. The absence of extensively annotated datasets has significantly impeded advancements in this field. We present EmotionCLIP, the first pre-training paradigm to extract visual emotion representations from verbal and nonverbal communication using only uncurated data. Compared to numerical labels or descriptions used in previous methods, communication naturally contains emotion information. Furthermore, acquiring emotion representations from communication is more congruent with the human learning process. We guide EmotionCLIP to attend to nonverbal emotion cues through subject-aware context encoding and verbal emotion cues using sentiment-guided contrastive learning. Extensive experiments validate the effectiveness and transferability of EmotionCLIP. Using merely linear-probe evaluation protocol, EmotionCLIP outperforms the state-of-the-art supervised visual emotion recognition methods and rivals many multimodal approaches across various benchmarks. We anticipate that the advent of EmotionCLIP will address the prevailing issue of data scarcity in emotion understanding, thereby fostering progress in related domains. The code and pre-trained models are available at https://github.com/Xeaver/EmotionCLIP.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses data scarcity in emotion understanding, a key challenge in Artificial Emotional Intelligence (AEI) research. Existing emotion understanding datasets mostly rely on crowdsourced annotations, which are subjective and inconsistent, so accurate emotion annotations are difficult to collect at scale. In addition, existing methods often train models from scratch or reuse models from less relevant domains, which limits progress in emotion understanding.

To solve these problems, the authors propose **EmotionCLIP**, a pre-training paradigm that extracts visual emotion representations from everyday communication without extensive manual annotation. By learning from uncurated data, EmotionCLIP captures emotional information more naturally and in a way that is more consistent with the human learning process. This approach avoids the difficulty of collecting annotations, retains fine-grained emotional semantics, and directly models expressed emotions rather than perceived emotions.

### Main Contributions

1. **Introduction of EmotionCLIP**: The first vision-language pre-training paradigm for visual emotion understanding that uses only uncurated data.
2. **Two Guidance Techniques**: Subject-aware context encoding and sentiment-guided contrastive learning guide the model to capture salient emotional expressions from human verbal and nonverbal communication.
3. **Extensive Experiments and Analysis**: Validation of the method's effectiveness and transferability across various downstream datasets.

### Method Overview

- **Data Collection**: The authors collected a large-scale video-text paired dataset of 3,613 TV shows and their corresponding subtitles, obtained from YouTube.
- **Model Architecture**: EmotionCLIP has two branches, one encoding visual input (nonverbal expressions) and one encoding text input (verbal expressions). The visual branch uses a subject-aware frame encoder and a temporal encoder, while the text branch uses a text encoder and a sentiment analysis model.
- **Training Objective**: The model learns the correspondence between visual input and text input by minimizing the sentiment-guided contrastive loss (SNCE).

### Experimental Results

- **Ablation Study**: The contribution of each component is validated by adding different subject-aware context encoding strategies, namely Subject-Aware Attention Mask (SAAM) and Subject-Aware Prompt (SAP), and the sentiment-guided contrastive learning objective (SNCE).
- **Performance Comparison**: With linear-probe evaluation, EmotionCLIP outperforms existing supervised visual emotion recognition methods on multiple benchmark datasets and is comparable to many multimodal methods.

### Conclusion

EmotionCLIP addresses the data scarcity issue in emotion understanding by learning emotion representations from everyday communication, providing new ideas and tools for further work in related fields.
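To make the SNCE training objective from the Method Overview more concrete, the following is a minimal sketch of a contrastive loss whose negative pairs are reweighted by sentiment. It is not the authors' implementation: the function names, the total-variation weighting in `sentiment_weight`, and the temperature value are all assumptions chosen for illustration. The idea shown is only that a negative pair whose texts carry the same sentiment as the anchor should be penalized less than one with a different sentiment.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def sentiment_weight(p, q):
    """Hypothetical negative-pair weight in [0, 1] (an assumption, not from the paper).

    Uses the total variation distance between two sentiment distributions:
    identical sentiments give 0 (the negative is ignored),
    disjoint sentiments give 1 (the negative counts fully).
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def weighted_contrastive_loss(video_emb, text_emb, sentiments, temperature=0.07):
    """Video-to-text InfoNCE-style loss with sentiment-reweighted negatives.

    video_emb[i] and text_emb[i] form the positive pair; sentiments[i] is the
    sentiment distribution predicted for text i by some sentiment model.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    total = 0.0
    for i in range(n):
        logits = [dot(v[i], t[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        pos = math.exp(logits[i] - m)
        neg = sum(
            sentiment_weight(sentiments[i], sentiments[j]) * math.exp(logits[j] - m)
            for j in range(n)
            if j != i
        )
        total += -math.log(pos / (pos + neg))
    return total / n


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    # Three-way sentiment distributions (e.g. negative, neutral, positive).
    different = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(weighted_contrastive_loss(video, text, different))
```

With this weighting, a batch in which every text has the same sentiment contributes no negative terms at all, which is an extreme case of the down-weighting idea; a practical variant would likely keep a nonzero floor on the weights.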

Learning Emotion Representations from Verbal and Nonverbal Communication

Context-aware Emotion Recognition Based on Vision-Language Pre-trained Model

EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition

UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

Learning to Compose Diversified Prompts for Image Emotion Classification

CLIPER: A Unified Vision-Language Framework for In-the-Wild Facial Expression Recognition

GEmo-CLAP: Gender-Attribute-Enhanced Contrastive Language-Audio Pretraining for Accurate Speech Emotion Recognition

ERNetCL: A novel emotion recognition network in textual conversation based on curriculum learning strategy

Cluster-Level Contrastive Learning for Emotion Recognition in Conversations

Robust Light-Weight Facial Affective Behavior Recognition with CLIP

Decoding Emotions in Abstract Art: Cognitive Plausibility of CLIP in Recognizing Color-Emotion Associations

Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout

On the use of Vision-Language models for Visual Sentiment Analysis: a study on CLIP

Clip-aware expressive feature learning for video-based facial expression recognition

Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning

EMO-LLaMA: Enhancing Facial Emotion Understanding with Instruction Tuning

Emo-DNA: Emotion Decoupling and Alignment Learning for Cross-Corpus Speech Emotion Recognition

Emotion recognition of EEG signals based on contrastive learning graph convolutional model

VEMOCLAP: A video emotion classification web application

ExpCLIP: Bridging Text and Facial Expressions via Semantic Alignment