Multimodal Knowledge Expansion Supplementary Materials

Zihui Xue, Sucheng Ren, Zhengqi Gao, Hang Zhao
2021-01-01
Abstract: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains video and audio recordings of 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements. It contains 1440 emotional utterances in 8 emotion classes: neutral, calm, happy, sad, angry, fearful, disgust and surprise. The dataset is randomly split 2:8 into D_l and D_u, and D_u is further split 8:1:1 into train / validation / test sets. To construct the labeled uni-modal dataset D_l, we sample one image every 0.5 seconds from each video clip as modality α and train a facial emotion recognition (FER) network as the uni-modal (UM) teacher, which classifies emotions from images. Image-audio pairs from the video clips constitute the unlabeled multimodal dataset D_u. We sample images as inputs from modality α in the same way, use "Kaiser best" sampling for the audio, and take Mel-frequency cepstral coefficients (MFCCs) as inputs from modality β.
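The split ratios and the 0.5-second image sampling described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the function names `split_ravdess` and `frame_timestamps` are hypothetical, the split is assumed to be made over the 1440 utterances (the abstract does not say whether it is per utterance or per actor), integer rounding of the ratios is a guess, and sampling is assumed to start at t = 0. The audio side (resampling and MFCC extraction) is omitted.

```python
import random


def split_ravdess(n_items=1440, seed=0):
    """Split item indices 2:8 into labeled / unlabeled sets,
    then split the unlabeled set 8:1:1 into train / val / test.

    Assumption: the split is over utterances, with floor rounding.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)

    # 2:8 split into labeled (d_l) and unlabeled (d_u) sets.
    n_labeled = n_items * 2 // 10
    d_l = indices[:n_labeled]
    d_u = indices[n_labeled:]

    # 8:1:1 split of the unlabeled set; the remainder goes to test.
    n_train = len(d_u) * 8 // 10
    n_val = len(d_u) // 10
    return {
        "d_l": d_l,
        "d_u_train": d_u[:n_train],
        "d_u_val": d_u[n_train:n_train + n_val],
        "d_u_test": d_u[n_train + n_val:],
    }


def frame_timestamps(duration_s, step_s=0.5):
    """Return the times (in seconds) at which an image is taken,
    one every `step_s` seconds, starting at 0 and ending before
    the clip duration (both are assumptions)."""
    timestamps = []
    i = 0
    while i * step_s < duration_s:
        timestamps.append(i * step_s)
        i += 1
    return timestamps


if __name__ == "__main__":
    splits = split_ravdess()
    for name, items in splits.items():
        print(name, len(items))
    print(frame_timestamps(3.0))
```

With these rounding choices, the 1440 utterances give 288 labeled items and 1152 unlabeled items, the latter divided into 921 / 115 / 116 for train / validation / test.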