Stay in Grid: Improving Video Captioning Via Fully Grid-Level Representation.

Mingkang Tang, Zhanyu Wang, Zhaoyang Zeng, Xiu Li, Luping Zhou
DOI: https://doi.org/10.1109/tcsvt.2022.3232634
IF: 5.859
2022-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Video captioning is the challenging task of automatically generating natural and meaningful textual descriptions for given videos. State-of-the-art methods aggregate spatial information in the video encoder at an early stage, which has two drawbacks: 1) Early aggregation in the encoder can lose considerable spatial detail, which may in turn lead to incorrect word choices in the following text encoder. 2) The spatial attention learned in the video encoder may not be compelling enough without text guidance. To solve these problems, we propose SGCAP, a Stay-in-Grid video CAPtioning method that makes full use of grid-level spatial features and consists of a Bilinear Sequential Attention Encoder (BSAE) and a Cross-modal Sequential Attention Decoder (CSAD). The former explores and retains fully grid-level discriminative representations in the video encoder, while the latter performs late spatial aggregation in the decoder to attend to the most relevant regions under the supervision of the input words. Experimental results on three public datasets demonstrate the effectiveness of our method, which outperforms multiple state-of-the-art video captioning models. Source code and pre-trained models will be made available to the public.