Video Captioning Via a Symmetric Bidirectional Decoder

Shanshan Qi, Luxi Yang
DOI: https://doi.org/10.1049/cvi2.12043
IF: 1.484
2021-01-01
IET Computer Vision
Abstract: The dominant video captioning methods employ an attentional encoder-decoder architecture in which the decoder is an autoregressive structure that generates sentences from left to right. However, these methods generally suffer from exposure bias and neglect the guidance of future output contexts that right-to-left decoding provides. Here, the authors propose a new symmetric bidirectional decoder for video captioning. The authors first integrate self-attentive multi-head attention with a bidirectional gated recurrent unit to capture long-term semantic dependencies in videos. The authors then apply a single decoder to generate accurate descriptions from left to right and from right to left simultaneously. In each decoding direction, the decoder applies two cross-attentive multi-head attention modules so that, at each time step, it considers both the past hidden states from the same decoding direction and the future hidden states from the reverse decoding direction. A symmetric semantic-guided gated attention module is devised to adaptively suppress irrelevant or misleading content in the past or future output contexts and to retain the useful content, thereby avoiding under-description. Experimental evaluations on two widely used benchmark datasets, Microsoft Research Video to Text (MSR-VTT) and Microsoft Video Description Corpus (MSVD), demonstrate that the proposed method achieves state-of-the-art performance, which validates the superiority of the bidirectional decoder.
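The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention in place of multi-head attention, an elementwise sigmoid gate, and randomly initialised weights. The names `attend`, `gated_fusion`, and `w_gate` are hypothetical and introduced only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Single-head scaled dot-product attention of one query over a list of states."""
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, state)) / scale for state in states]
    weights = softmax(scores)
    return [sum(w * state[i] for w, state in zip(weights, states))
            for i in range(len(query))]


def gated_fusion(query, past_ctx, future_ctx, w_gate):
    """Mix past and future contexts with an elementwise sigmoid gate (assumed form)."""
    gate_in = query + past_ctx + future_ctx  # list concatenation, length 3 * d
    fused = []
    for row, p, f in zip(w_gate, past_ctx, future_ctx):
        g = 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, gate_in))))
        # g near 1 keeps the past context, g near 0 keeps the future context
        fused.append(g * p + (1.0 - g) * f)
    return fused


rng = random.Random(0)
d = 8  # hidden size, chosen arbitrarily for the example
query = [rng.gauss(0, 1) for _ in range(d)]
past_states = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]    # same direction
future_states = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(4)]  # reverse direction
w_gate = [[rng.gauss(0, 0.1) for _ in range(3 * d)] for _ in range(d)]   # untrained weights

past_ctx = attend(query, past_states)
future_ctx = attend(query, future_states)
fused = gated_fusion(query, past_ctx, future_ctx, w_gate)
print(len(fused))
```

Because the gate lies strictly between 0 and 1, each fused element is a convex combination of the corresponding past and future context values. In a trained model the gate weights would be learned so that misleading context is down-weighted.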