Differentiate Visual Features with Guidance Signals for Video Captioning

Yifan Yang, Xiaoqiang Lu
DOI: https://doi.org/10.1145/3562007.3562052
2022-01-01
Abstract: The task of video captioning is to generate comprehensible, grammatically correct sentences that describe the main visual content of a video. Existing neural-module-based methods improve model interpretability by predicting words of different parts of speech with separate modules. However, separating the modules may lead to confusing semantics. This work proposes a video captioning method, Differentiate Visual Features with Guidance Signals (DVFGS), which uses guidance signals to enhance the semantic consistency of the neural-module-based approach. The process is analogous to cell differentiation: differences are produced, and the differentiated parts have different effects on the whole. Extensive experiments on MSVD and MSR-VTT show that DVFGS advances the performance of neural-module-based video captioning methods.