Multimodal Transformer Learning for Continuous Emotion Recognition

Jian Huang,Jianhua Tao,Bin Liu,Zheng Lian,Mingyue Niu

2020-01-01

Affiliation: National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China


Multimodal Transformer Fusion for Continuous Emotion Recognition

Jian Huang,Jianhua Tao,Bin Liu,Zheng Lian,Mingyue Niu

DOI: https://doi.org/10.1109/icassp40776.2020.9053762

2020-01-01

Abstract:Multimodal fusion increases the performance of emotion recognition because of the complementarity of different modalities. Compared with decision-level and feature-level fusion, model-level fusion makes better use of the advantages of deep neural networks. In this work, we utilize the Transformer model to fuse audio-visual modalities at the model level. Specifically, the multi-head attention produces multimodal emotional intermediate representations from a common semantic feature space after encoding the audio and visual modalities. Meanwhile, it can also learn long-term temporal dependencies effectively with the self-attention mechanism. Experiments on the AVEC 2017 database show the superiority of model-level fusion over other fusion strategies. Moreover, we combine the Transformer model and LSTM to further improve the performance, which achieves better results than other methods.
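
The attention-based fusion described in this abstract can be illustrated with a minimal sketch. It assumes a single head with no learned projections, so it is not the authors' model; the function names and the toy `audio` and `visual` frames are hypothetical. Each audio frame acts as a query and attends over all visual frames, which yields one fused vector per audio frame.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned weights).

    queries: frames of one modality, each a list of d floats.
    keys, values: frames of the other modality (keys also have d floats).
    Returns one attention-weighted value vector per query frame.
    """
    d = len(keys[0])
    dim_v = len(values[0])
    fused = []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value frames.
        fused.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return fused


# Toy data (hypothetical): 2 audio frames attend over 3 visual frames.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_modal_attention(audio, visual, visual)
```

A full multi-head layer would add learned query, key, and value projections per head and concatenate the head outputs; those parts are omitted here.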
Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

Xiaoqin Zhang,Min Li,Sheng Lin,Guobao Xiao,Hang Xu

DOI: https://doi.org/10.1109/TCSVT.2023.3312858

2024-05-01

Abstract:Dynamic expression recognition in the wild is a challenging task due to various obstacles, including low light condition, non-positive face, and face occlusion. Purely vision-based approaches may not suffice to accurately capture the complexity of human emotions. To address this issue, we propose a Transformer-based Multimodal Emotional Perception (T-MEP) framework capable of effectively extracting multimodal information and achieving significant augmentation. Specifically, we design three transformer-based encoders to extract modality-specific features from audio, image, and text sequences, respectively. Each encoder is carefully designed to maximize its adaptation to the corresponding modality. In addition, we design a transformer-based multimodal information fusion module to model cross-modal representation among these modalities. The unique combination of self-attention and cross-attention in this module enhances the robustness of output-integrated features in encoding emotion. By mapping the information from audio and textual features to the latent space of visual features, this module aligns the semantics of the three modalities for cross-modal information augmentation. Finally, we evaluate our method on three popular datasets (MAFW, DFEW, and AFEW) through extensive experiments, which demonstrate its state-of-the-art performance. This research offers a promising direction for future studies to improve emotion recognition accuracy by exploiting the power of multimodal features.

Computer Science
Multilevel Transformer For Multimodal Emotion Recognition

Junyi He,Meimei Wu,Meng Li,Xiaobo Zhu,Feng Ye

DOI: https://doi.org/10.48550/arXiv.2211.07711

2022-10-26

Computation and Language

Abstract:Multimodal emotion recognition has attracted much attention recently. Fusing multiple modalities effectively with limited labeled data is a challenging task. Considering the success of pre-trained model and fine-grained nature of emotion expression, it is reasonable to take these two aspects into consideration. Unlike previous methods that mainly focus on one aspect, we introduce a novel multi-granularity framework, which combines fine-grained representation with pre-trained utterance-level representation. Inspired by Transformer TTS, we propose a multilevel transformer model to perform fine-grained multimodal emotion recognition. Specifically, we explore different methods to incorporate phoneme-level embedding with word-level embedding. To perform multi-granularity learning, we simply combine multilevel transformer model with Albert. Extensive experimental results show that both our multilevel transformer model and multi-granularity model outperform previous state-of-the-art approaches on IEMOCAP dataset with text transcripts and speech signal.
Towards Learning a Joint Representation from Transformer in Multimodal Emotion Recognition

James J. Deng,Clement H. C. Leung

DOI: https://doi.org/10.1007/978-3-030-86993-9_17

2021-01-01

Brain Informatics

Abstract:Emotion recognition has been extensively studied in a single modality in the last decade. However, humans express their emotions usually through multiple modalities like voice, facial expressions, or text. This paper proposes a new method to learn a joint emotion representation for multimodal emotion recognition. Emotion-based feature for speech audio is learned by an unsupervised triplet-loss objective, and a text-to-text transformer network is used to extract text embedding for latent emotional meaning. Transfer learning provides a powerful and reusable technique to help fine-tune emotion recognition models trained on mega audio and text datasets respectively. The extracted emotional information from speech audio and text embedding are processed by dedicated transformer networks. The alternating co-attention mechanism is used to construct a deep transformer network. Multimodal fusion is implemented by a deep co-attention transformer network. Experimental results show the proposed method for learning a joint emotion representation achieves good performance in multimodal emotion recognition.
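
For readers unfamiliar with the triplet-loss objective mentioned in this abstract, the standard margin-based form can be sketched as follows. The Euclidean distance, the margin of 1.0, and the function names are assumptions for illustration; the abstract does not state which variant the authors use.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: the anchor should be closer to the positive
    than to the negative by at least `margin` (assumed value)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# The positive is already much closer than the negative: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# The roles are swapped: the loss grows with the violation.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```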
Residual multimodal Transformer for expression‐EEG fusion continuous emotion recognition

Xiaofang Jin,Jieyu Xiao,Libiao Jin,Xinruo Zhang

DOI: https://doi.org/10.1049/cit2.12346

IF: 7.985

2024-05-10

CAAI Transactions on Intelligence Technology

Abstract:Continuous emotion recognition aims to predict emotion states from affective information, with a focus on the continuous variation of emotion. Fusion of electroencephalography (EEG) and facial expression videos has been used in this field, but current research has some limitations, such as hand‐engineered features and simple approaches to integration. Hence, a new continuous emotion recognition model based on the fusion of EEG and facial expression videos is proposed, named residual multimodal Transformer (RMMT). Firstly, the Resnet50 and temporal convolutional network (TCN) are utilised to extract spatiotemporal features from videos, and the TCN is also applied to process the computed EEG frequency power to acquire spatiotemporal features of EEG. Then, a multimodal Transformer is used to fuse the spatiotemporal features from the two modalities. Furthermore, a residual connection is introduced to fuse shallow features with deep features, which is verified through experiments to be effective for continuous emotion recognition. Inspired by knowledge distillation, the authors incorporate feature‐level loss into the loss function to further enhance the network performance. Experimental results show that the RMMT achieves superior performance over other methods on the MAHNOB‐HCI dataset. Ablation studies on the residual connection and loss function in the RMMT demonstrate that both are effective.
Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

Chengxin Chen,Pengyuan Zhang

DOI: https://doi.org/10.1145/3640343

2024-02-07

Abstract:As a vital aspect of affective computing, Multimodal Emotion Recognition has been an active research area in the multimedia community. Despite recent progress, this field still confronts two major challenges in real-world applications: (1) improving the efficiency of constructing joint representations from unaligned multimodal features and (2) relieving the performance decline caused by random modality feature missing. In this article, we propose a unified framework, Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR), to address these issues. The crucial component of MCT is a novel attention-based encoder that concurrently extracts and dynamically balances the intra- and inter-modality relations for all associated modalities. With additional modality-wise parameter sharing, a more compact representation can be encoded with less time and space complexity. To improve the robustness of MCT, we further introduce HFR, which consists of two modules: Local Feature Imagination (LFI) and Global Feature Alignment (GFA). During model training, LFI leverages complete features as supervisory signals to recover local missing features, while GFA is designed to reduce the global semantic gap between pairwise-complete and -incomplete representations. Experimental evaluations on two popular benchmark datasets demonstrate that our proposed method consistently outperforms advanced baselines in both complete and incomplete data scenarios.

computer science, information systems, theory & methods, software engineering
Accommodating Missing Modalities in Time-Continuous Multimodal Emotion Recognition

Juan Vazquez-Rodriguez,Grégoire Lefebvre,Julien Cumin,James L. Crowley

2023-11-16

Abstract:Decades of research indicate that emotion recognition is more effective when drawing information from multiple modalities. But what if some modalities are sometimes missing? To address this problem, we propose a novel Transformer-based architecture for recognizing valence and arousal in a time-continuous manner even with missing input modalities. We use a coupling of cross-attention and self-attention mechanisms to emphasize relationships between modalities over time and enhance the learning process on weak salient inputs. Experimental results on the Ulm-TSST dataset show that our model improves the concordance correlation coefficient by 37% when predicting arousal values and by 30% when predicting valence values, compared to a late-fusion baseline approach.
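
The concordance correlation coefficient (CCC) reported here is the usual metric for time-continuous valence and arousal prediction. A minimal sketch of the standard definition follows; the function name and the toy sequences are hypothetical, and population (divide-by-n) statistics are assumed.

```python
def concordance_cc(predictions, targets):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
    Undefined (division by zero) when both sequences are constant and equal.
    """
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


gold = [1.0, 2.0, 3.0]
perfect = concordance_cc([1.0, 2.0, 3.0], gold)   # identical traces
biased = concordance_cc([2.0, 3.0, 4.0], gold)    # same shape, constant offset
```

Unlike Pearson correlation, CCC penalizes a constant offset: the `biased` trace is perfectly correlated with `gold` but scores 4/7 rather than 1.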

Machine Learning,Artificial Intelligence
A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos

Congbao Shi,Yuanyuan Zhang,Baolin Liu

DOI: https://doi.org/10.1007/s10489-024-05329-w

IF: 5.3

2024-02-22

Applied Intelligence

Abstract:Continuous emotion recognition plays a crucial role in developing friendly and natural human-computer interaction applications. However, two significant challenges remain unresolved in this field: how to effectively fuse complementary information from multiple modalities and how to capture long-range contextual dependencies during emotional evolution. In this paper, a novel multimodal continuous emotion recognition framework is proposed to address these challenges. For the multimodal fusion challenge, the Multimodal Attention Fusion (MAF) method is proposed to fully utilize complementarity and redundancy between multiple modalities. To tackle temporal context dependencies, the Local Contextual Temporal Convolutional Network (LC-TCN) and the Global Contextual Temporal Convolutional Network (GC-TCN) are presented. These networks have the ability to progressively integrate multi-scale temporal contextual information from input streams of different modalities. Comprehensive experiments are conducted on the RECOLA and SEWA datasets to assess the effectiveness of our proposed framework. The experimental results demonstrate superior recognition performance compared to state-of-the-art approaches, achieving 0.834 and 0.671 on RECOLA, 0.573 and 0.533 on SEWA in terms of arousal and valence, respectively. These findings indicate a novel direction for continuous emotion recognition by exploring temporal multi-scale information.
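
Temporal convolutional networks typically gather multi-scale context by stacking dilated causal convolutions. The sketch below shows that generic building block only; it is not the LC-TCN or GC-TCN design, whose details the abstract does not give, and the function name and toy signal are hypothetical.

```python
def dilated_causal_conv1d(sequence, kernel, dilation):
    """Causal 1-D convolution with dilation.

    y[t] = sum_k kernel[k] * x[t - k * dilation], treating samples before
    the start of the sequence as zero, so y[t] never sees future frames.
    """
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * sequence[idx]
        out.append(acc)
    return out


signal = [1, 2, 3, 4, 5]
# Each output sums the current frame and the frame two steps earlier.
context = dilated_causal_conv1d(signal, kernel=[1, 1], dilation=2)
```

Increasing the dilation from layer to layer widens the receptive field without adding parameters, which is how such networks reach long-range context.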

computer science, artificial intelligence
Emotion Recognition with Pre-Trained Transformers Using Multimodal Signals

Juan Vazquez-Rodriguez,Grégoire Lefebvre,Julien Cumin,James L Crowley

DOI: https://doi.org/10.48550/arXiv.2212.13885

2022-12-22

Abstract:In this paper, we address the problem of multimodal emotion recognition from multiple physiological signals. We demonstrate that a Transformer-based approach is suitable for this task. In addition, we present how such models may be pretrained in a multimodal scenario to improve emotion recognition performances. We evaluate the benefits of using multimodal inputs and pre-training with our approach on a state-of-the-art dataset.

Signal Processing,Artificial Intelligence,Machine Learning
TMFER: Multimodal Fusion Emotion Recognition Algorithm Based on Transformer

Jingru Cui,Zunying Qin,Xingyuan Chen,Guodong Li

DOI: https://doi.org/10.1109/ICHCI58871.2023.10277906

2023-08-04

Abstract:To address the problems of existing emotion recognition algorithms, namely limited emotion information, weak feature representation, and low recognition accuracy, this paper proposes a multimodal fusion emotion recognition algorithm based on Transformer (TMFER), which fuses three modalities of text, speech and image information for emotion recognition. To suit the different characteristics of each modality, BERT pre-training, MFCC feature extraction and a CNN feature extractor are used to extract features for each modality respectively, to explore deeper features. To address the problem of unreasonable combination of multi-modal features, the multi-head attention mechanism of the Transformer encoder is used to build a feature fusion module that extracts and combines potential feature information from different modalities in parallel. The fused data are fed into the classification module for emotion recognition, and a joint supervised loss function based on large margin learning is customized to solve the problems of unbalanced classification and feature confounding in the baseline model. Finally, on the IEMOCAP and MELD multimodal datasets, the TMFER algorithm is experimentally compared with current algorithms in the field that are more effective in emotion recognition classification. The experimental results show that the TMFER algorithm outperforms other algorithms in all evaluation metrics.

Computer Science
CTNet: Conversational Transformer Network for Emotion Recognition

Zheng Lian,Bin Liu,Jianhua Tao

DOI: https://doi.org/10.1109/TASLP.2021.3049898

2021-01-01

Abstract:Emotion recognition in conversation is a crucial topic for its widespread applications in the field of human-computer interactions. Unlike vanilla emotion recognition of individual utterances, conversational emotion recognition requires modeling both context-sensitive and speaker-sensitive dependencies. Despite the promising results of recent works, they generally do not leverage advanced fusion techniques to generate the multimodal representations of an utterance. In this way, they have limitations in modeling the intra-modal and cross-modal interactions. In order to address these problems, we propose a multimodal learning framework for conversational emotion recognition, called conversational transformer network (CTNet). Specifically, we propose to use the transformer-based structure to model intra-modal and cross-modal interactions among multimodal features. Meanwhile, we utilize word-level lexical features and segment-level acoustic features as the inputs, thus enabling us to capture temporal information in the utterance. Additionally, to model context-sensitive and speaker-sensitive dependencies, we propose to use the multi-head attention based bi-directional GRU component and speaker embeddings. Experimental results on the IEMOCAP and MELD datasets demonstrate the effectiveness of the proposed method. Our method shows an absolute 2.1~6.2% performance improvement on weighted average F1 over state-of-the-art strategies.
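
The weighted average F1 quoted in this abstract averages per-class F1 scores, weighting each class by its number of true samples. A minimal sketch of that standard definition follows; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / count
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * count / total
    return score


labels = ["happy", "happy", "sad", "angry"]
guesses = ["happy", "sad", "sad", "angry"]
wf1 = weighted_f1(labels, guesses)
```

In this toy case the per-class F1 scores are 2/3 (happy), 2/3 (sad), and 1 (angry), with weights 1/2, 1/4, and 1/4, giving 0.75.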
Multimodal Neurophysiological Transformer for Emotion Recognition

Sharath Koorathota,Zain Khan,Pawan Lapborisuth,Paul Sajda

DOI: https://doi.org/10.1109/EMBC48229.2022.9871421

Abstract:Understanding neural function often requires multiple modalities of data, including electrophysiological data, imaging techniques, and demographic surveys. In this paper, we introduce a novel neurophysiological model to tackle major challenges in modeling multimodal data. First, we avoid non-alignment issues between raw signals and extracted, frequency-domain features by addressing the issue of variable sampling rates. Second, we encode modalities through "cross-attention" with other modalities. Lastly, we utilize properties of our parent transformer architecture to model long-range dependencies between segments across modalities and assess intermediary weights to better understand how source signals affect prediction. We apply our Multimodal Neurophysiological Transformer (MNT) to predict valence and arousal in an existing open-source dataset. Experiments on non-aligned multimodal time-series show that our model performs similarly and, in some cases, outperforms existing methods in classification tasks. In addition, qualitative analysis suggests that MNT is able to model neural influences on autonomic activity in predicting arousal. Our architecture has the potential to be fine-tuned to a variety of downstream tasks, including for BCI systems.
A Unified Transformer-based Network for multimodal Emotion Recognition

Kamran Ali,Charles E. Hughes

2023-08-28

Abstract:The development of transformer-based models has resulted in significant advances in addressing various vision and NLP-based research challenges. However, the progress made in transformer-based methods has not been effectively applied to biosensing research. This paper presents a novel Unified Biosensor-Vision Multi-modal Transformer-based (UBVMT) method to classify emotions in an arousal-valence space by combining a 2D representation of an ECG/PPG signal with the face information. To achieve this goal, we first investigate and compare the unimodal emotion recognition performance of three image-based representations of the ECG/PPG signal. We then present our UBVMT network which is trained to perform emotion recognition by combining the 2D image-based representation of the ECG/PPG signal and the facial expression features. Our unified transformer model consists of homogeneous transformer blocks that take as an input the 2D representation of the ECG/PPG signal and the corresponding face frame for emotion representation learning with minimal modality-specific design. Our UBVMT model is trained by reconstructing masked patches of video frames and 2D images of ECG/PPG signals, and contrastive modeling to align face and ECG/PPG data. Extensive experiments on the MAHNOB-HCI and DEAP datasets show that our Unified UBVMT-based model produces comparable results to the state-of-the-art techniques.

Computer Vision and Pattern Recognition,Artificial Intelligence
Transformer Based Multimodal Speech Emotion Recognition with Improved Neural Networks

Rutherford Agbeshi Patamia,Edwin Kwadwo Tenagyei,Kingsley Nketia Acheampong,K. Sarpong,Wu Jin

DOI: https://doi.org/10.1109/PRML52754.2021.9520692

2021-07-16

Abstract:With the advancement of technology, the human-machine interaction research field is in growing need of robust automatic emotion recognition systems. Building machines that interact with humans by comprehending emotions paves the way for developing systems equipped with human-like intelligence. Previous architectures in this field often rely on RNN models. However, these models are unable to learn in-depth contextual features intuitively. This paper proposes a transformer-based model that utilizes speech data instituted by previous works, alongside text and mocap data, to optimize our emotion recognition system’s performance. Our experimental results show that the proposed model outperforms the previous state-of-the-art. The IEMOCAP dataset supported the entire experiment.

Computer Science
TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Yuezhu Xu,Zheng Zhao,Yuhua Wang,Guang Shen,Jiayuan Zhang

DOI: https://doi.org/10.1109/TASLP.2023.3316458

IEEE/ACM Transactions on Audio Speech and Language Processing

Abstract:As deep learning technology research continues to progress, artificial intelligence technology is gradually empowering various fields. To achieve a more natural human-computer interaction experience, how to accurately recognize emotional state of speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning techniques have promoted the development of emotion recognition, but the mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this article, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition, solving the aforementioned problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate the data scarcity problem by providing a priori knowledge of multimodal information to the model with the help of a large amount of unlabeled data. Furthermore, a mutual transformer (MT) module is introduced to learn multimodal emotional commonality and speaker-related emotional features to improve contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method named the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To comparatively evaluate the performance of TDFNet, experiments are conducted with the IEMOCAP corpus under three reasonable data splitting strategies. The experimental results show that TDFNet achieves 82.08% WA and 82.57% UA in RA data splitting, which leads to 1.78% WA and 1.17% UA improvements over the previous state-of-the-art method, respectively. Benefiting from the attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet successfully achieves significant improvements in multimodal emotion recognition.
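
The WA and UA figures in this abstract can be read under the convention common in speech emotion recognition, which is assumed here because the abstract does not define them: weighted accuracy (WA) is the overall fraction of correct predictions, and unweighted accuracy (UA) is the mean of per-class recalls. The function names and toy labels below are hypothetical.

```python
def weighted_accuracy(y_true, y_pred):
    """WA: overall accuracy, so frequent classes count more."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Imbalanced toy set: a model that always answers "neutral".
truth = ["neutral", "neutral", "neutral", "angry"]
always_neutral = ["neutral"] * 4
wa = weighted_accuracy(truth, always_neutral)    # 0.75
ua = unweighted_accuracy(truth, always_neutral)  # 0.5
```

The gap between the two values shows why both are usually reported on class-imbalanced corpora.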

Computer Science
Multimodal Transformer with Learnable Frontend and Self Attention for Emotion Recognition

Soumya Dutta,Sriram Ganapathy

DOI: https://doi.org/10.1109/icassp43922.2022.9747723

2022-05-23

Abstract:In this work, we propose a novel approach for multi-modal emotion recognition from conversations using speech and text. The audio representations are learned jointly with a learnable audio front-end (LEAF) model feeding to a CNN based classifier. The text representations are derived from pre-trained bidirectional encoder representations from transformer (BERT) along with a gated recurrent network (GRU). Both the textual and audio representations are separately processed using a bidirectional GRU network with self-attention. Further, the multi-modal information extraction is achieved using a transformer that is input with the textual and audio embeddings at the utterance level. The experiments are performed on the IEMOCAP database, where we show that the proposed framework improves over the current state-of-the-art results under all the common test settings. This is primarily due to the improved emotion recognition performance achieved in the audio domain. Further, we also show that the model is more robust to textual errors caused by an automatic speech recognition (ASR) system.
Transformer Encoder With Multi-Modal Multi-Head Attention for Continuous Affect Recognition

Haifeng Chen,Dongmei Jiang,Hichem Sahli

DOI: https://doi.org/10.1109/tmm.2020.3037496

IF: 7.3

2021-01-01

IEEE Transactions on Multimedia

Abstract:Continuous affect recognition is becoming an increasingly attractive research topic in affective computing. Previous works mainly focused on modelling the temporal dependency within a sensor modality, or adopting early or late fusion for multi-modal affective state recognition. However, early fusion suffers from the curse of dimensionality, and late fusion ignores the complementarity and redundancy between multiple modal streams. In this paper, we first introduce the transformer-encoder with a self-attention mechanism and propose a Convolutional Neural Network-Transformer Encoder (CNN-TE) framework to model the temporal dependency for single modal affect recognition. Further, to effectively consider the complementarity and redundancy between multiple streams, we propose a Transformer Encoder with Multi-modal Multi-head Attention (TEMMA) for multi-modal affect recognition. TEMMA allows the inter-modality interactions and intra-modality temporal dependency to be refined progressively and simultaneously. The learned multi-modal representations are fed to an Inference Sub-network with fully connected layers to estimate the affective state. The proposed framework is trained in a nutshell and demonstrates its effectiveness on the AVEC2016 and AVEC2019 datasets. Compared to state-of-the-art models, our approach obtains remarkable improvements on both arousal and valence in terms of concordance correlation coefficient (CCC), reaching 0.583 for arousal and 0.564 for valence on the AVEC2019 test set.

computer science, information systems,telecommunications, software engineering
Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers

Minoo Shayaninasab, Bagher Babaali

2024-02-12

Abstract:Due to the complex nature of human emotions and the diversity of emotion representation methods in humans, emotion recognition is a challenging field. In this research, three input modalities, namely text, audio (speech), and video, are employed to generate multimodal feature vectors. For generating features for each of these modalities, pre-trained Transformer models with fine-tuning are utilized. In each modality, a Transformer model is used with transfer learning to extract feature and emotional structure. These features are then fused together, and emotion recognition is performed using a classifier. To select an appropriate fusion method and classifier, various feature-level and decision-level fusion techniques have been experimented with, and ultimately, the best model, which combines feature-level fusion by concatenating feature vectors and classification using a Support Vector Machine on the IEMOCAP multimodal dataset, achieves an accuracy of 75.42%. Keywords: Multimodal Emotion Recognition, IEMOCAP, Self-Supervised Learning, Transfer Learning, Transformer.
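
The two fusion families compared in this abstract can be contrasted with a minimal sketch. Feature-level fusion by concatenation follows the abstract; the decision-level rule shown (averaging class probabilities) is only one of many possible rules and is an assumption, as are the function names and toy vectors. The SVM classifier itself is not shown.

```python
def feature_level_fusion(text_vec, audio_vec, video_vec):
    """Concatenate per-modality feature vectors into one joint vector,
    which a single classifier would then consume."""
    return list(text_vec) + list(audio_vec) + list(video_vec)


def decision_level_fusion(prob_lists):
    """Average the class-probability vectors produced by separate
    per-modality classifiers (assumed rule, for illustration only)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


joint = feature_level_fusion([0.1, 0.2], [0.3], [0.4, 0.5])
averaged = decision_level_fusion([[0.8, 0.2], [0.4, 0.6]])
```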

Artificial Intelligence
Multimodal Transformer Fusion for Emotion Recognition: A Survey

R. Séguier,Amdjed Belaref

DOI: https://doi.org/10.1109/ICNLP60986.2024.10692953

2024-03-22

Abstract:Recently, Transformer-based models have gained popularity due to their ability to effectively model sequential data, handle long-term dependencies, and manage large amounts of data [1]. These models are at the forefront of advancements in many fields, notably in emotion recognition, the center of affective computing [2]. Transformers provide a powerful tool for the nuanced understanding of human emotions through the fusion of multiple modalities. This survey aims to explore and propose a classification scheme for the growing research field of multimodal emotion recognition using Transformer models. It presents an overview of the recent advancements in the application of Transformer-based architectures and their fusion techniques to analyze and interpret emotions from various modalities. The survey also covers different challenges faced in this domain and how they are tackled by the Transformers.

Computer Science
Continuous Multimodal Emotion Prediction Based on Long Short Term Memory Recurrent Neural Network

Jian Huang,Ya Li,Jianhua Tao,Zheng Lian,Zhengqi Wen,Minghao Yang,Jiangyan Yi

DOI: https://doi.org/10.1145/3133944.3133946

2017-01-01

Abstract:The continuous dimensional emotion can depict subtlety and complexity of emotional change, which is an inherently challenging problem with growing attention. This paper presents our automatic prediction of dimensional emotional state for Audio-Visual Emotion Challenge (AVEC 2017), which uses multi-features and fusion across all available modalities. Besides the baseline features provided by the organizers, we also extract other acoustic audio feature sets, appearance features and deep visual features as complementary features. Each type of feature is trained using Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) for every dimensional emotion prediction separately considering annotation delay and temporal pooling. To overcome the overfitting problem, robust models are chosen carefully for each individual model. Finally, multimodal emotion fusion is achieved by utilizing Support Vector Regression (SVR) with the estimates from different feature sets in decision level fusion. The experimental results indicate that our extracted features are beneficial to performance improvement and our system design achieves very promising results in terms of Concordance Correlation Coefficient (CCC), outperforming the baseline system on the test set: 0.599 vs 0.375 for arousal, 0.721 vs 0.466 for valence, and 0.295 vs 0.246 for liking.
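
Annotation delay arises because human raters react slightly after the audio-visual cue, so the label at frame t + D describes the signal at frame t. A common compensation is to shift the label track earlier by D frames before training. The sketch below assumes that approach, with the tail padded by the last label; the abstract does not say how the authors handle the delay, and the function name and delay value are hypothetical.

```python
def compensate_annotation_delay(labels, delay_frames):
    """Shift a label sequence earlier by `delay_frames`, so that the label
    at index t is the rating originally recorded at index t + delay_frames.
    The end is padded with the last label to keep the length unchanged."""
    if delay_frames <= 0:
        return list(labels)
    if delay_frames >= len(labels):
        return [labels[-1]] * len(labels)
    shifted = list(labels[delay_frames:])
    shifted += [labels[-1]] * delay_frames
    return shifted


ratings = [0.0, 0.0, 0.1, 0.2, 0.3]
aligned = compensate_annotation_delay(ratings, delay_frames=2)
```

In practice the delay is usually tuned on a development set, separately for each emotion dimension.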
