Video Captioning Via Relation-Aware Graph Learning

Yi Zheng, Heming Jing, Qiujie Xie, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao
DOI: https://doi.org/10.1109/icassp49357.2023.10094571
2023-01-01
Abstract: Recent neural models for video captioning usually employ an encoder-decoder framework. However, most approaches either neglect the spatial and temporal interactions between objects in a video or model them only implicitly, which leads to suboptimal performance. In this paper, we propose a novel relation-aware graph learning framework that explicitly models both spatial and temporal relations between objects. In particular, a relation-aware graph is designed to depict the spatial relations between different objects in a scene. In parallel, a temporal graph network is designed to perform relational reasoning over the same objects in adjacent frames. Features of both types of relations are learned and fused for the subsequent language decoder. Experiments on two benchmark datasets demonstrate the effectiveness of our framework, which achieves state-of-the-art CIDEr scores on MSVD and MSR-VTT.
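To make the two-graph idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not say how edges are built or how features are fused, so everything here is an assumption: objects in the same frame are fully connected (spatial graph), each object is linked to its most cosine-similar object in the next frame (temporal graph), one round of mean aggregation stands in for graph learning, and fusion is plain concatenation. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def aggregate(features, neighbors):
    """One round of mean aggregation: each node averages itself with its neighbors."""
    out = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in neighbors[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

def spatial_neighbors(frame_ids):
    """Assumed spatial graph: connect every pair of distinct objects in the same frame."""
    return [[j for j, g in enumerate(frame_ids) if g == f and j != i]
            for i, f in enumerate(frame_ids)]

def temporal_neighbors(features, frame_ids):
    """Assumed temporal graph: link each object to its most similar object
    in the next frame, and add the reverse edge as well."""
    n = len(features)
    nbrs = [[] for _ in range(n)]
    for i in range(n):
        candidates = [j for j in range(n) if frame_ids[j] == frame_ids[i] + 1]
        if candidates:
            best = max(candidates, key=lambda j: cosine(features[i], features[j]))
            nbrs[i].append(best)
            if i not in nbrs[best]:
                nbrs[best].append(i)
    return nbrs

def fuse(features, frame_ids):
    """Aggregate over both graphs, then concatenate the two views per object."""
    spatial = aggregate(features, spatial_neighbors(frame_ids))
    temporal = aggregate(features, temporal_neighbors(features, frame_ids))
    return [s + t for s, t in zip(spatial, temporal)]

# Toy input: two objects per frame, two frames, 2-d features.
feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
frames = [0, 0, 1, 1]
fused = fuse(feats, frames)  # one 4-d vector per object, ready for a decoder
```

In this toy input, object 0 in frame 0 is matched to object 2 in frame 1 and object 1 to object 3, so each fused vector combines a within-frame view with a cross-frame view of the same object. A real system would replace the mean aggregation with learned graph layers and feed the fused features to the language decoder.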