Learning topic emotion and logical semantic for video paragraph captioning

Qinyu Li, Hanli Wang, Xiaokai Yi
DOI: https://doi.org/10.1016/j.displa.2024.102706
IF: 3.074
2024-04-06
Displays
Abstract: Video paragraph captioning aims to generate multiple descriptive sentences for a video that approach human writing in accuracy, logicality, and richness. However, current research focuses on the accuracy and temporal order of events, ignoring emotion and other critical logical relations embedded in human language, such as causal and adversative relations. This omission impairs the transitions between generated event descriptions and restricts the vividness of expression, leaving a gap relative to human writing. To narrow this gap, a framework that integrates logic and emotion representation learning is proposed. Concretely, a large-scale inter-event relation corpus is constructed based on the EMVPC dataset. This corpus, named EMVPC-EvtRel (standing for "EMVPC-Event Relations"), contains six logical relations widely used in human writing, 127 explicit inter-sentence connectives, and over 20,000 pairs of event segments with newly annotated logical relations. A logical semantic representation learning method is developed to recognize the dependencies between visual events, thereby enhancing the representation of video content and improving the logicality of generated paragraphs. Moreover, a fine-grained emotion recognition module is designed to uncover emotion features embedded in videos. Finally, experimental results on the EMVPC dataset demonstrate the superiority of the proposed method over existing state-of-the-art approaches.
engineering, electrical & electronic; instruments & instrumentation; optics; computer science, hardware & architecture