Multimodal Deep Denoise Framework for Affective Video Content Analysis.

Yaochen Zhu, Zhenzhong Chen, Feng Wu
DOI: https://doi.org/10.1145/3343031.3350997
2019-01-01
Abstract: Affective video content analysis has attracted considerable attention recently. However, it faces several challenges, such as the gap between intrinsic visual-aural features and spontaneous human emotional responses, as well as the ubiquitous label noise in affective annotations. It is therefore difficult to obtain useful supervision signals for learning well-generalized patterns responsible for eliciting affective impact. Observing that label uncertainty severely hinders the progress of affective video content analysis, a deep denoising framework is proposed to infer true latent labels and annotation qualities from heavy label noise, fully utilizing the multimodal information contained in videos. Specifically, a quality embedding network is adopted in a multimodal fashion, and the corresponding stochastic gradient descent (SGD) optimization objective is derived using variational inference and a conditional independence assumption. To better reflect the effectiveness of affective models, new test sets are established based on the widely used LIRIS-ACCEDE dataset, with the training database kept unchanged, and a ranking-based evaluation metric is introduced accordingly. Experiments conducted on both the original LIRIS-ACCEDE test set and the refined one demonstrate the effectiveness of the proposed method.