A novel transformer autoencoder for multi-modal emotion recognition with incomplete data

Cheng Cheng, Zhaoxin Fan, Lin Feng, Ziyu Jia

DOI: https://doi.org/10.1016/j.neunet.2024.106111

IF: 7.8

2024-01-08

Neural Networks

Abstract: Multi-modal signals have become essential data for emotion recognition since they can represent emotions more comprehensively. However, in real-world environments, it is often impossible to acquire complete data on multi-modal signals, and the problem of missing modalities causes severe performance degradation in emotion recognition. Therefore, this paper represents the first attempt to use a transformer-based architecture, aiming to fill the modality-incomplete data from partially observed data for multi-modal emotion recognition (MER). Concretely, this paper proposes a novel unified model called transformer autoencoder (TAE), comprising a modality-specific hybrid transformer encoder, an inter-modality transformer encoder, and a convolutional decoder. The modality-specific hybrid transformer encoder bridges a convolutional encoder and a transformer encoder, allowing the encoder to learn local and global context information within each particular modality. The inter-modality transformer encoder builds and aligns global cross-modal correlations and models long-range contextual information with different modalities. The convolutional decoder decodes the encoding features to produce more precise recognition. Besides, a regularization term is introduced into the convolutional decoder to force the decoder to fully leverage the complete and incomplete data for emotional recognition of missing data. 96.33%, 95.64%, and 92.69% accuracies are attained on the available data of the DEAP and SEED-IV datasets, and 93.25%, 92.23%, and 81.76% accuracies are obtained on the missing data. Particularly, the model acquires a 5.61% advantage with 70% missing data, demonstrating that the model outperforms some state-of-the-art approaches in incomplete multi-modal learning.

computer science, artificial intelligence, neurosciences
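
The three components named in the abstract can be pictured as a single forward pass. The NumPy sketch below is only an illustration of that data flow, not the authors' implementation: the shared feature width `D`, the single-head attention, the fixed smoothing kernels, the zero-filled placeholder for a missing modality, the linear classifier head, and every function and key name (`conv1d_same`, `self_attention`, `tae_forward`, `"eeg"`, `"other"`) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared feature width (assumed; each modality is projected to it beforehand)

def conv1d_same(x, kernel):
    """Slide one odd-length 1-D kernel along time with zero padding. x: (T, D) -> (T, D)."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + len(kernel)] * kernel[:, None]).sum(axis=0)
                     for t in range(x.shape[0])])

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a residual. x: (N, D) -> (N, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ v

def attention_weights():
    """Random query, key, and value projections (untrained, for illustration only)."""
    return [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)]

def tae_forward(inputs, params):
    # 1) Modality-specific hybrid encoder: convolution (local) then attention (global).
    encoded = {name: self_attention(conv1d_same(x, params["enc_kernel"]),
                                    *params["intra"][name])
               for name, x in inputs.items()}
    # 2) Inter-modality encoder: attention over the tokens of all modalities at once.
    fused = self_attention(np.concatenate(list(encoded.values())), *params["inter"])
    # 3) Convolutional decoder, then split the sequence back into per-modality pieces.
    decoded = conv1d_same(fused, params["dec_kernel"])
    lengths = [x.shape[0] for x in inputs.values()]
    recon = dict(zip(inputs, np.split(decoded, np.cumsum(lengths)[:-1])))
    # Assumed classifier head: mean-pool the fused tokens, then a linear layer.
    logits = fused.mean(axis=0) @ params["classifier"]
    return recon, logits

inputs = {"eeg": rng.normal(size=(16, D)),
          "other": np.zeros((10, D))}  # a missing modality enters as zeros (assumption)
params = {"enc_kernel": np.array([0.25, 0.5, 0.25]),
          "dec_kernel": np.array([0.25, 0.5, 0.25]),
          "intra": {name: attention_weights() for name in inputs},
          "inter": attention_weights(),
          "classifier": rng.normal(size=(D, 4))}  # 4 output classes chosen arbitrarily
recon, logits = tae_forward(inputs, params)
```

In this toy setup the zero-filled modality only receives content in step 2, where its tokens attend to the observed modality; that is the intuition behind filling missing data from partially observed data.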

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper primarily addresses the issue of performance degradation in multimodal emotion recognition (MER) due to missing data. Specifically, it proposes a new model called the "Transformer Autoencoder" (TAE), which aims to utilize partially observable data to fill in the missing data in multimodal emotion recognition.

#### Research Background

- **Importance of Multimodal Signals**: Multimodal signals (such as physiological and non-physiological signals) can more comprehensively represent emotional information.
- **Real-world Challenges**: In real environments, due to various factors (such as equipment failure, occlusion, etc.), multimodal signals often experience data loss, leading to significant performance degradation in emotion recognition.

#### Main Contributions

1. **Unified Model**: A unified deep learning framework called the Transformer Autoencoder (TAE) is proposed to handle incomplete multimodal data. This is the first attempt to combine EEG signals with other non-physiological signals to address the issue of missing data in multimodal datasets.
2. **Multimodal Feature Extraction**: The TAE model includes modality-specific hybrid transformer encoders and cross-modal transformer encoders to build long-range dependencies between different modalities.
3. **Regularization Term**: A regularization term is introduced to enable the encoder and decoder to learn more discriminative features during training, thereby improving the classification performance of incomplete data.

Through these methods, the TAE model can effectively capture the intrinsic relationships between missing and available data, thereby enhancing the performance of multimodal emotion recognition. Experimental results show that the model exhibits excellent performance under various degrees of data loss on the DEAP and SEED-IV datasets.
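
To make the missing-data setting and the regularization idea concrete, here is a minimal NumPy sketch. It assumes that missingness is simulated by zeroing one modality for a random fraction of samples, and that the regularizer is a mean-squared reconstruction error added to the cross-entropy loss with weight `lam`. The summary does not state the exact form of the paper's regularization term, so this form, along with the names `drop_modality`, `cross_entropy`, `total_loss`, and `lam`, is hypothetical.

```python
import numpy as np

def drop_modality(x, missing_rate, rng):
    """Zero one modality for a random fraction of samples.

    x: (N, T, D) array for a single modality.
    Returns the masked array and a boolean vector marking observed samples.
    """
    n = x.shape[0]
    n_missing = int(round(missing_rate * n))
    observed = np.ones(n, dtype=bool)
    observed[rng.choice(n, size=n_missing, replace=False)] = False
    return x * observed[:, None, None], observed

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (N, C), labels: (N,) integer classes."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(logits, labels, recon, target, lam=0.1):
    """Classification loss plus an assumed reconstruction regularizer (MSE)."""
    recon_err = np.mean((recon - target) ** 2)
    return cross_entropy(logits, labels) + lam * recon_err

rng = np.random.default_rng(0)
complete = rng.normal(size=(10, 5, 3))            # 10 samples of one modality
masked, observed = drop_modality(complete, 0.7, rng)  # 70% missing
```

Because the missingness is simulated here, the complete array is still available as the reconstruction target during training, which is what lets a decoder be penalized for how poorly it fills in the removed samples.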

An Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals

Transformer-Based Multimodal Emotional Perception for Dynamic Facial Expression Recognition in the Wild

TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

Modality-collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition

A twin disentanglement Transformer Network with Hierarchical-Level Feature Reconstruction for robust multimodal emotion recognition

Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation

Accommodating Missing Modalities in Time-Continuous Multimodal Emotion Recognition

TMFER: Multimodal Fusion Emotion Recognition Algorithm Based on Transformer

A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations

Bi-Modal Bi-Task Emotion Recognition Based on Transformer Architecture

Deep Emotional Arousal Network for Multimodal Sentiment Analysis and Emotion Recognition

Multimodal Neurophysiological Transformer for Emotion Recognition

Multilevel Transformer For Multimodal Emotion Recognition

MEDT: Using Multimodal Encoding-Decoding Network as in Transformer for Multimodal Sentiment Analysis

Multimodal transformer augmented fusion for speech emotion recognition

EEG-based Emotion Recognition Via Transformer Neural Architecture Search

Multimodal Transformer Fusion for Emotion Recognition: A Survey

Multimodal Emotion Recognition by Extracting Common and Modality-Specific Information

Emotion Recognition Using Transformers with Masked Learning