Survey of deep emotion recognition in dynamic data using facial, speech and textual cues

Tao Zhang, Zhenhua Tan
DOI: https://doi.org/10.1007/s11042-023-17944-9
IF: 2.577
2024-01-23
Multimedia Tools and Applications
Abstract: With the advancement of multimedia and human-computer interaction, it has become increasingly crucial to perceive people's emotional states in dynamic data (e.g., video, audio, text streams) in order to serve them effectively. Emotion recognition has emerged as a prominent research area over the past decades. Traditional methods for emotion recognition rely heavily on manually crafted features and focus primarily on a single modality. However, these approaches struggle to extract sufficient discriminative information for complex emotion recognition tasks. To tackle this issue, methods based on deep neural models have gained significant popularity. These methods use deep neural models to automatically learn more discriminative emotional features, thereby addressing the poor discriminability of manually designed features. Deep neural models are also employed to integrate information across multiple modalities, further enhancing the extraction of discriminative information. In this paper, we provide a comprehensive review of studies published within the past five years on deep neural model-based emotion recognition in dynamic data using facial, speech, and textual cues. Specifically, we first explain discretized and continuous representations of emotions by introducing widely accepted emotion models. Subsequently, we elucidate how advanced methods integrate different neural models, organizing these methods by the popular deep neural models they build on (e.g., Transformer), along with the corresponding preprocessing mechanisms. In addition, we present development trends by surveying diverse datasets, metrics, and competitive performances. Finally, we discuss significant research challenges and opportunities.
Our survey bridges the gaps in the literature since existing surveys are narrow in focus, either exclusively covering single-modal methods, solely concentrating on multi-modal methods, overlooking certain aspects of face, speech, and text, or emphasizing outdated methodologies.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering