Lightly-supervised Utterance-Level Emotion Identification Using Latent Topic Modeling of Multimodal Words.

Zhaojun Yang, Shrikanth Narayanan
DOI: https://doi.org/10.1109/icassp.2016.7472181
2016-01-01
Abstract: Research on multimodal emotion recognition has recently drawn much attention across diverse disciplines. With the increasing amount of multimodal data, unsupervised or semi-supervised learning has become highly desirable for automatically discovering patterns of emotional expression in behavioral data. We present a novel approach to multimodal emotion learning that uses only a small number of labels. Our approach hinges on probabilistic latent semantic analysis (pLSA), with the latent variable defined as the emotion class; this is motivated by the view that human emotion acts as a latent control variable regulating external behavioral manifestations such as speech and body gesture. We represent the audio-visual information in an utterance as a bag of multimodal words. To exploit the interrelation between the speech and gesture modalities, we propose a vocabulary of multimodal words based on canonical correlation analysis (CCA). The approach achieves promising experimental results, and the CCA-based multimodal words outperform those derived directly from the original cues.
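For readers unfamiliar with pLSA, the following is a minimal sketch of the standard, fully unsupervised EM procedure on a bag-of-words count matrix, where each row plays the role of an utterance and each latent topic plays the role of an emotion class. It is not the authors' implementation: the abstract does not say how the small set of labels enters the model or how the CCA-based vocabulary is built, so neither is shown here. The function name `plsa`, its parameters, and the toy counts are all hypothetical.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit plain pLSA by EM on a (documents x words) count matrix.

    Returns P(w | z) with shape (topics, words) and
    P(z | d) with shape (documents, topics).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero

    # Random initialisation, normalised into valid distributions.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: posterior P(z | d, w), shape (docs, topics, words).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + eps

        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_w_z, p_z_d

# Toy data (invented): 4 "utterances" over a 4-word vocabulary.
toy_counts = np.array([
    [5, 4, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 5, 5],
    [0, 1, 4, 6],
], dtype=float)

p_w_z, p_z_d = plsa(toy_counts, n_topics=2)
# The most probable latent class for each utterance.
predicted = p_z_d.argmax(axis=1)
```

In a setting like the one the abstract describes, the rows of the count matrix would hold counts of multimodal words per utterance, and `p_z_d` would be read as a per-utterance distribution over emotion classes.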