Abstract: As human-computer interaction advances, emotion recognition has become increasingly important. Emotion recognition technology offers practical benefits across industries, including user experience enhancement, education, and organizational productivity. In educational settings, for instance, it enables real-time understanding of students' emotional states and so supports tailored feedback. In workplaces, monitoring employees' emotions can contribute to improved job performance and satisfaction. Emotion recognition has also recently gained attention in media applications such as automated movie dubbing, where it improves the naturalness of dubbed performances by synchronizing emotional expression across audio and visuals. Consequently, research on multimodal emotion recognition, which integrates text, speech, and video data, has gained momentum in diverse fields. In this study, we propose an emotion recognition approach that combines text and speech data and specifically accounts for the characteristics of the Korean language. For text data, we use KoELECTRA to generate embeddings; for speech data, we extract features using HuBERT embeddings. The proposed multimodal transformer model first processes text and speech independently, then learns interactions between the two modalities through a Cross-Modal Attention mechanism. This approach combines complementary information from text and speech and improves the accuracy of emotion recognition. Our experimental results show that the proposed model outperforms single-modality models, achieving an accuracy of 77.01% and an F1-Score of 0.7703 in emotion classification. This study contributes to emotion recognition technology by integrating language-specific and multimodal data, and it suggests that including additional modalities in future work could yield further improvements.
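To make the Cross-Modal Attention step concrete, the following is a minimal sketch of scaled dot-product attention in which one modality supplies the queries and the other supplies the keys and values. It is an illustration only, not the implementation used in this study: the function names `softmax` and `cross_modal_attention` are hypothetical, learned query/key/value projections and multiple heads are omitted, and the small hand-written vectors merely stand in for KoELECTRA and HuBERT embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_seq, context_seq):
    """Each vector in query_seq attends over every vector in context_seq.

    Assumption: learned Q/K/V projections are left out, so both sequences
    must share the same feature dimension.
    """
    dim = len(query_seq[0])
    attended = []
    for q in query_seq:
        # Scaled dot-product score between this query and every context vector.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context_seq
        ]
        weights = softmax(scores)
        # Weighted sum of the context vectors (used here as the values).
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context_seq)) for j in range(dim)]
        )
    return attended


# Toy stand-ins (assumed values) for text-token and speech-frame embeddings.
text_emb = [[1.0, 0.0], [0.0, 1.0]]                  # 2 tokens, dimension 2
speech_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 frames, dimension 2

text_attends_speech = cross_modal_attention(text_emb, speech_emb)
speech_attends_text = cross_modal_attention(speech_emb, text_emb)
```

Each output sequence keeps the length of its query modality, so `text_attends_speech` has one vector per text token and `speech_attends_text` has one per speech frame. In a full model these attended representations would typically pass through further transformer layers before classification; that part is not sketched here.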

KoHMT: A Multimodal Emotion Recognition Model Integrating KoELECTRA, HuBERT with Multimodal Transformer

An autoencoder-based feature level fusion for speech emotion recognition

A Combined CNN Architecture for Speech Emotion Recognition

Speech Emotion Recognition Based on Multi-feature and Multi-lingual Fusion

A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition

XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition Using Deep Learning

Combining Feature Selection And Representation For Speech Emotion Recognition

Multimodal Speech Emotion Recognition Using Audio and Text