Learning Emotion Representations from Verbal and Nonverbal Communication

Sitao Zhang, Yimu Pan, James Z. Wang
2023-05-23
Abstract: Emotion understanding is an essential but highly challenging component of artificial general intelligence. The absence of extensively annotated datasets has significantly impeded advancements in this field. We present EmotionCLIP, the first pre-training paradigm to extract visual emotion representations from verbal and nonverbal communication using only uncurated data. Compared to numerical labels or descriptions used in previous methods, communication naturally contains emotion information. Furthermore, acquiring emotion representations from communication is more congruent with the human learning process. We guide EmotionCLIP to attend to nonverbal emotion cues through subject-aware context encoding and verbal emotion cues using sentiment-guided contrastive learning. Extensive experiments validate the effectiveness and transferability of EmotionCLIP. Using merely linear-probe evaluation protocol, EmotionCLIP outperforms the state-of-the-art supervised visual emotion recognition methods and rivals many multimodal approaches across various benchmarks. We anticipate that the advent of EmotionCLIP will address the prevailing issue of data scarcity in emotion understanding, thereby fostering progress in related domains. The code and pre-trained models are available at https://github.com/Xeaver/EmotionCLIP.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper aims to address data scarcity in emotion understanding, a key challenge in Artificial Emotional Intelligence (AEI) research. Existing emotion understanding datasets mostly rely on crowdsourced annotations, which are subjective and inconsistent, so accurate emotion labels are difficult to collect at scale. In addition, existing methods often train models from scratch or reuse models from other, less relevant domains, which limits progress in emotion understanding.

To solve these problems, the authors propose **EmotionCLIP**, a new pre-training paradigm that extracts visual emotion representations from everyday communication without extensive manual annotation. By learning from uncurated data, EmotionCLIP captures emotional information more naturally and in a way that is more consistent with the human learning process. The approach avoids the need for large-scale annotation, retains fine-grained emotional semantics, and directly models expressed emotions rather than perceived emotions.

### Main Contributions

1. **Introduction of EmotionCLIP**: The first vision-language pre-training paradigm for visual emotion understanding that uses uncurated data.
2. **Two Guiding Techniques**: Subject-aware context encoding and sentiment-guided contrastive learning, which steer the model toward the emotional cues in human nonverbal and verbal communication.
3. **Extensive Experiments and Analysis**: Validation of the method's effectiveness and transferability across various downstream datasets.

### Method Overview

- **Data Collection**: The authors collected a large-scale video-text paired dataset, including 3,613 TV shows and their corresponding subtitles obtained from YouTube.
- **Model Architecture**: EmotionCLIP has two branches, one encoding visual input (nonverbal expressions) and one encoding text input (verbal expressions). The visual branch uses a subject-aware frame encoder and a temporal encoder, while the text branch uses a text encoder and a sentiment analysis model.
- **Training Objective**: The model learns the correspondence between visual input and text input by minimizing the sentiment-guided contrastive loss (SNCE).

### Experimental Results

- **Ablation Study**: The contribution of each component is validated by comparing different subject-aware context encoding strategies (Subject-Aware Attention Mask, SAAM, and Subject-Aware Prompt, SAP) and by adding the sentiment-guided contrastive learning framework (SNCE).
- **Performance Comparison**: Under linear-probe evaluation, EmotionCLIP outperforms existing supervised visual emotion recognition methods and is comparable to many multimodal methods on multiple benchmark datasets.

### Conclusion

EmotionCLIP addresses the data scarcity issue in emotion understanding by learning emotion representations from everyday communication, providing new ideas and tools for further development in related fields.
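To make the training objective from the Method Overview more concrete, here is a minimal sketch of a contrastive loss in which sentiment scales how strongly each negative pair is pushed apart. It is an illustration under stated assumptions, not the authors' SNCE formula: the function names, the total-variation weighting, the temperature of 0.07, and the toy data are all hypothetical, and only the video-to-text direction is computed.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def total_variation(p, q):
    """Distance in [0, 1] between two sentiment probability distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def sentiment_weighted_info_nce(video_emb, text_emb, sentiment, temperature=0.07):
    """Video-to-text InfoNCE in which each negative is scaled by a sentiment weight.

    ASSUMPTION: the weight is the total-variation distance between the
    sentiment distributions of two samples. The paper's weighting may differ.
    """
    videos = [l2_normalize(v) for v in video_emb]
    texts = [l2_normalize(t) for t in text_emb]
    n = len(videos)
    total = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
            for j in range(n)
        ]
        # The matching pair keeps weight 1. Negatives whose text carries a
        # similar sentiment get a weight near 0, so they are barely penalized.
        weights = [
            1.0 if j == i else total_variation(sentiment[i], sentiment[j])
            for j in range(n)
        ]
        peak = max(logits)  # subtract the row maximum for numerical stability
        denom = sum(w * math.exp(l - peak) for w, l in zip(weights, logits))
        total += math.log(denom) - (logits[i] - peak)
    return total / n


# Toy batch (hypothetical values): three clip embeddings, three subtitle
# embeddings, and a (negative, neutral, positive) sentiment distribution
# for each subtitle.
video = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
text = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
sentiment = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
loss = sentiment_weighted_info_nce(video, text, sentiment)
```

With this weighting, two subtitles that express nearly the same sentiment are treated as likely false negatives and contribute little to the denominator, while pairs with opposite sentiment behave as in an ordinary InfoNCE loss.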