Multimodal Continuous Prediction of Emotions in Movies using Long Short-Term Memory Networks

Sarath Sivaprasad, Tanmayee Joshi, Rishabh Agrawal, Niranjan Pedanekar
DOI: https://doi.org/10.1145/3206025.3206076
2018-06-05
Abstract: Predicting the emotions that movies are designed to evoke can be useful in entertainment applications such as content personalization, video summarization, and ad placement. Multimodal input, primarily audio and video, helps build the emotional content of a movie. Since emotion is built over time by audio and video, the temporal context of these modalities is an important aspect of modeling it. In this paper, we use Long Short-Term Memory networks (LSTMs) to model the temporal context in audio-video features of movies. We present continuous emotion prediction results using a multimodal fusion scheme on an annotated dataset of Academy Award-winning movies. We report a significant improvement over state-of-the-art results: the correlation between predicted and annotated values improves from 0.62 to 0.84 for arousal, and from 0.29 to 0.50 for valence.
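The reported metric is the correlation between predicted and annotated emotion values. The sketch below shows one way such a score could be computed, assuming it is the Pearson correlation coefficient (the abstract does not name the variant). The function name `pearson_correlation` and the sample values are hypothetical and chosen only for illustration; they are not data from the paper.

```python
import math


def pearson_correlation(predicted, annotated):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(predicted) != len(annotated):
        raise ValueError("sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_p = sum(predicted) / n
    mean_a = sum(annotated) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, annotated))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in annotated)

    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_a)


# Hypothetical arousal values over five time steps (made up for illustration).
annotated = [0.1, 0.3, 0.5, 0.4, 0.2]
predicted = [0.2, 0.6, 1.0, 0.8, 0.4]  # a scaled copy of the annotations

r = pearson_correlation(predicted, annotated)
```

Because Pearson correlation is invariant to scaling and shifting, a prediction that tracks the shape of the annotation scores close to 1 even when its absolute values are off, so it measures agreement in trend over time rather than in magnitude.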