Divided Caption Model with Global Attention

Yamin Chen, Hancong Dua, Zitian Zhao, Zhi Wang
DOI: https://doi.org/10.1145/3461353.3461386
2021-01-01
Abstract: Dense video captioning is an emerging task that aims to both localize and describe all events in a video. We identify and tackle two challenges in this task: 1) the limitation of attending only to local features; and 2) the severely degraded descriptions and increased training complexity caused by redundant information. In this paper, we propose a new divided caption model in which two different attention mechanisms are introduced to rectify the captioning process in a unified framework. First, we employ a global attention mechanism to encode video features in the proposal module, which yields better temporal boundaries. Second, we design a bidirectional long short-term memory (LSTM) network with a common-attention mechanism that effectively balances 3D convolutional neural network (C3D) features and globally attended video content in the caption module to generate coherent natural language descriptions. In addition, we divide the forward and backward video features of an event into segments to alleviate the degraded descriptions and increased complexity. Extensive experiments demonstrate the competitive performance of the proposed Divided Caption Model with Global Attention (DCM-GA) against state-of-the-art methods on the ActivityNet Captions dataset.
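To make the contrast between local and global attention concrete, here is a minimal sketch of one common form of global attention, in which every time step of a video is re-encoded as a weighted sum over all time steps. The abstract does not specify the formulation used in DCM-GA, so the scaled dot-product scoring, the function name `global_attention`, and the toy feature values below are all assumptions made for illustration, not the authors' implementation.

```python
import math

def global_attention(features):
    """Re-encode each time step as a softmax-weighted sum over ALL time steps.

    features: list of T feature vectors, each a list of D floats
    (hypothetical stand-ins for per-segment video features).
    Scaled dot-product scoring is an assumption, not taken from the paper.
    """
    d = len(features[0])
    context, weights = [], []
    for query in features:
        # Similarity of this time step to every time step in the video.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in features]
        # Numerically stable softmax over the whole sequence.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of all feature vectors gives the attended encoding.
        context.append([sum(row[t] * features[t][j] for t in range(len(features)))
                        for j in range(d)])
    return context, weights

# Toy input: 3 time steps with 2-dimensional features (made-up values).
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = global_attention(toy_features)
```

Because each row of `weights` spans the full sequence, the encoding of a time step can draw on distant context, which is the property the abstract credits for better temporal boundaries.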