Motion-Aware Video Paragraph Captioning Via Exploring Object-Centered Internal Knowledge

Yimin Hu, Guorui Yu, Yuejie Zhang, Rui Feng, Tao Zhang, Xuequan Lu, Shang Gao
DOI: https://doi.org/10.1109/icassp49357.2023.10096625
2023-01-01
Abstract: The video paragraph captioning task aims to generate a fine-grained, coherent, and relevant paragraph for a video. Unlike images, where objects are static, the states of objects in videos change over time. This dynamic information can contribute to understanding the whole video content. Existing works rarely focus on modeling the changing states of objects in videos, so the activities occurring in videos are poorly or wrongly depicted in the generated paragraphs. To address this problem, we propose a novel Object State Tracking Network, which captures the temporal state changes of objects. However, because consecutive frames in a video are similar, the video information is redundant and noisy. We therefore further propose a semantic alignment mechanism that enables sentence information to refine the visual information. Extensive experiments on ActivityNet Captions demonstrate the effectiveness of our method.