Contextual Attention Network for Emotional Video Captioning

Peipei Song, Dan Guo, Jun Cheng, Meng Wang
DOI: https://doi.org/10.1109/tmm.2022.3183402
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract: This paper investigates an emerging and challenging task: emotional video captioning. Formally, given a video, the task aims not only to describe the factual content of the video, but also to discover the emotional clues in it. We propose a novel Contextual Attention Network (CANet), which recognizes and describes the facts and emotions in the video through semantic-rich context learning. Specifically, at each time step, we first extract visual and textual features from the input video and the previously generated words. Then, we apply the attention mechanism to these features to capture informative contexts for captioning. We train the CANet model by jointly optimizing a cross-entropy loss $\mathcal{L}_{CE}$ and a contrastive loss $\mathcal{L}_{CL}$, where $\mathcal{L}_{CE}$ constrains the semantics of the generated sentence to be close to the human annotation and $\mathcal{L}_{CL}$ encourages discriminative representation learning from positive and negative pairs of video and caption. Experiments on two emotional video captioning datasets (i.e., EmVidCap and EmVidCap-S) demonstrate the superiority of CANet over state-of-the-art approaches.
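
The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of $\mathcal{L}_{CL}$ or how the two terms are balanced, so the code below makes assumptions: it uses an InfoNCE-style contrastive loss in the video-to-caption direction only, and the names `temperature` and `weight` are hypothetical hyperparameters, not values from the paper.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def cross_entropy_loss(step_logits, target_ids):
    """L_CE: mean negative log-likelihood of the reference words.

    step_logits[t] holds the vocabulary scores at time step t,
    target_ids[t] is the index of the annotated word at that step.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


def contrastive_loss(similarity, temperature=0.1):
    """Assumed InfoNCE-style stand-in for L_CL (not the paper's exact form).

    similarity[i][j] scores video i against caption j. Diagonal entries
    are the positive (matched) pairs; off-diagonal entries are negatives.
    """
    total = 0.0
    for i, row in enumerate(similarity):
        scaled = [s / temperature for s in row]
        total -= log_softmax(scaled)[i]
    return total / len(similarity)


def joint_loss(step_logits, target_ids, similarity, weight=1.0):
    """L = L_CE + weight * L_CL, with `weight` a hypothetical trade-off."""
    ce = cross_entropy_loss(step_logits, target_ids)
    cl = contrastive_loss(similarity)
    return ce + weight * cl


if __name__ == "__main__":
    # Toy inputs: two decoding steps over a 3-word vocabulary,
    # and a 2x2 video-caption similarity matrix.
    logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
    targets = [0, 1]
    sim = [[0.9, 0.1], [0.2, 0.8]]
    print(joint_loss(logits, targets, sim))
```

In this sketch the contrastive term falls as matched video-caption pairs score higher than mismatched ones, which is the discriminative behavior the abstract attributes to $\mathcal{L}_{CL}$.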
computer science, information systems, telecommunications, software engineering