Abstract: Electroencephalography (EEG) objectively records the brain's electrical activity during emotional expression and is widely used for emotion recognition. Current methods depend heavily on datasets, and their performance is limited by dataset size and annotation accuracy. Unsupervised and contrastive learning methods, in turn, depend largely on the feature distribution within a dataset and therefore must be trained on each specific dataset to achieve optimal results. However, EEG acquisition is influenced by equipment, settings, individuals, and experimental procedures, which leads to significant variability across recordings. Consequently, model effectiveness depends heavily on datasets collected under stringent, controlled conditions. To address these challenges, we introduce a self‐supervised pre‐training model that processes data from different datasets and operates effectively across them. The model is pre‐trained without direct access to emotion category labels, so it can extract universally useful features without predefined downstream tasks. To tackle the issue of semantic expression confusion, we employ masked prediction, which guides the model to produce richer semantic information by learning bidirectional feature combinations within a sequence. To handle significant differences in data distribution, we introduce adaptive clustering, which generates pseudo‐labels across multiple categories. During self‐supervised training, the model strengthens the expression of hidden features in its intermediate layers, enabling it to learn hidden features common to different datasets.
By constructing a hybrid dataset and conducting extensive experiments, we demonstrate two key findings: (1) our model achieves the best results on multiple evaluation metrics; (2) the model effectively integrates critical features from different datasets, significantly improving the accuracy of emotion recognition.
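
As a rough illustration of how masked prediction and clustering-derived pseudo-labels can fit together, the sketch below may help. It is not the model described in the abstract: plain k-means with a fixed number of clusters stands in for the adaptive clustering, random arrays stand in for EEG feature frames and encoder outputs, and every function name and parameter (`kmeans_pseudo_labels`, `random_time_mask`, `masked_prediction_loss`, `mask_ratio`, `k`) is a hypothetical choice made for this example.

```python
import numpy as np


def kmeans_pseudo_labels(frames, k, iters=10, seed=0):
    """Assign one cluster id per frame with plain k-means (stand-in for adaptive clustering)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct frames (fancy indexing returns a copy).
    centers = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every frame to every center: shape (n, k).
        dists = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = frames[labels == c]
            if len(members):  # leave an empty cluster's center unchanged
                centers[c] = members.mean(axis=0)
    return labels


def random_time_mask(n_frames, mask_ratio, seed=0):
    """Boolean mask marking a random subset of time steps to hide from the model."""
    rng = np.random.default_rng(seed)
    n_masked = max(1, int(round(mask_ratio * n_frames)))
    mask = np.zeros(n_frames, dtype=bool)
    mask[rng.choice(n_frames, size=n_masked, replace=False)] = True
    return mask


def masked_prediction_loss(logits, pseudo_labels, mask):
    """Cross-entropy against the pseudo-labels, averaged over masked frames only."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(pseudo_labels)), pseudo_labels]
    return float(-picked[mask].mean())


# Illustrative run on synthetic data (no real EEG recordings involved).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 8))                   # stand-in for EEG feature frames
labels = kmeans_pseudo_labels(frames, k=4)           # pseudo-labels, no emotion labels used
mask = random_time_mask(len(frames), mask_ratio=0.3)
masked_input = np.where(mask[:, None], 0.0, frames)  # hidden frames are zeroed out
logits = rng.normal(size=(200, 4))                   # stand-in for an encoder's predictions
loss = masked_prediction_loss(logits, labels, mask)
```

In this sketch the loss is computed only at the masked positions, so a model trained this way would have to infer the hidden frames' cluster ids from the visible context on both sides, which is one common reading of learning bidirectional feature combinations in a sequence.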

emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition

Deep Spectrum Feature Representations for Speech Emotion Recognition

Self-attention Transfer Networks for Speech Emotion Recognition

Representation Learning Through Cross-Modal Conditional Teacher-Student Training For Speech Emotion Recognition

Temporal Shift Module with Pretrained Representations for Speech Emotion Recognition

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Masked self‐supervised pre‐training model for EEG‐based emotion recognition

Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition

Unsupervised Representations Improve Supervised Learning in Speech Emotion Recognition

Attention Based Fully Convolutional Network for Speech Emotion Recognition

Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora

Speech Emotion Recognition with Complementary Acoustic Representations

Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Speech Emotion Recognition Via Multi-Level Cross-Modal Distillation

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Self-supervised utterance order prediction for emotion recognition in conversations

Wav2vec2.0 and Context Emotional Information Compensation Based Dialogue Speech Emotion Recognition

Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition