Abstract: Dynamic expression recognition in the wild is challenging because of obstacles such as low-light conditions, non-frontal faces, and facial occlusion. Purely vision-based approaches may not suffice to capture the complexity of human emotions. To address this issue, we propose a Transformer-based Multimodal Emotional Perception (T-MEP) framework that effectively extracts multimodal information and uses it for cross-modal augmentation. Specifically, we design three transformer-based encoders to extract modality-specific features from audio, image, and text sequences, respectively. Each encoder is tailored to its corresponding modality. In addition, we design a transformer-based multimodal information fusion module to model cross-modal representations among these modalities. The combination of self-attention and cross-attention in this module makes the fused output features more robust for encoding emotion. By mapping audio and textual features into the latent space of the visual features, this module aligns the semantics of the three modalities for cross-modal information augmentation. Finally, we evaluate our method through extensive experiments on three popular datasets (MAFW, DFEW, and AFEW), which demonstrate its state-of-the-art performance. This research offers a promising direction for future studies to improve emotion recognition accuracy by exploiting multimodal features.
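The fusion idea described in the abstract (projecting audio and text features into the visual latent space, then combining self-attention with cross-attention) can be illustrated with a small, dependency-free sketch. This is not the T-MEP implementation: the function names (`fuse`, `attention`, `random_matrix`), the single attention head, the absence of learned query/key/value projections, the random inputs, and the additive fusion are all assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


def random_matrix(rows, cols, rng):
    """Stand-in for encoder outputs or projection weights (hypothetical data)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def fuse(visual, audio, text, proj_audio, proj_text):
    """Fuse three modalities around the visual token sequence."""
    # Map audio and text features into the visual latent width.
    audio_v = matmul(audio, proj_audio)
    text_v = matmul(text, proj_text)
    # Self-attention: visual tokens attend to each other.
    self_out, _ = attention(visual, visual, visual)
    # Cross-attention: visual tokens query the projected audio and text tokens.
    audio_out, audio_w = attention(visual, audio_v, audio_v)
    text_out, text_w = attention(visual, text_v, text_v)
    # Additive fusion is an assumption; the abstract does not specify the rule.
    fused = [
        [s + a + t for s, a, t in zip(rs, ra, rt)]
        for rs, ra, rt in zip(self_out, audio_out, text_out)
    ]
    return fused, audio_w, text_w


rng = random.Random(0)
visual = random_matrix(4, 8, rng)      # 4 visual tokens, width 8
audio = random_matrix(6, 5, rng)       # 6 audio tokens, width 5
text = random_matrix(3, 7, rng)        # 3 text tokens, width 7
proj_audio = random_matrix(5, 8, rng)  # audio width -> visual width
proj_text = random_matrix(7, 8, rng)   # text width -> visual width

fused, audio_w, text_w = fuse(visual, audio, text, proj_audio, proj_text)
print(len(fused), len(fused[0]))
```

In this sketch the fused output keeps one row per visual token, and each row of `audio_w` and `text_w` is a probability distribution showing how strongly that visual token draws on each audio or text token.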

Topic and Style-aware Transformer for Multimodal Emotion Recognition

Bi-Modal Bi-Task Emotion Recognition Based on Transformer Architecture

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention

Multimodal transformer augmented fusion for speech emotion recognition

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Multimodal Transformer with Learnable Frontend and Self Attention for Emotion Recognition

Multimodal Neurophysiological Transformer for Emotion Recognition

A Unified Transformer-based Network for multimodal Emotion Recognition

Multilevel Transformer For Multimodal Emotion Recognition

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

Multimodal Emotion Recognition Based on Deep Temporal Features Using Cross-Modal Transformer and Self-Attention

SS-Trans (Single-Stream Transformer for Multimodal Sentiment Analysis and Emotion Recognition): The Emotion Whisperer—A Single-Stream Transformer for Multimodal Sentiment Analysis

Transformer Based Multimodal Speech Emotion Recognition with Improved Neural Networks

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers

Towards Learning a Joint Representation from Transformer in Multimodal Emotion Recognition

Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition

Multimodal Speech Emotion Recognition Using Modality-specific Self-Supervised Frameworks

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis