Catching the Temporal Regions-of-Interest for Video Captioning.

Ziwei Yang, Yahong Han, Zheng Wang
DOI: https://doi.org/10.1145/3123266.3123327
2017-01-01
Abstract: As a crucial challenge for video understanding, exploiting the spatial-temporal structure of video has attracted much attention recently, especially in video captioning. Inspired by the insight that people focus on certain regions of interest in video content, we propose a novel approach that automatically focuses on regions-of-interest and captures their temporal structure. In our approach, we use a dedicated attention model to adaptively select regions-of-interest for each video frame. A Dual Memory Recurrent Model (DMRM) is then introduced to incorporate the temporal structure of global features and regions-of-interest features in parallel, yielding both a rough understanding of the video content and specific information about the regions-of-interest. Since the attention model cannot always attend to the right regions, we additionally adopt semantic supervision so that it attends to the regions of interest more accurately. We evaluate our method for video captioning on two public benchmarks: the Microsoft Video Description Corpus (MSVD) and the Montreal Video Annotation Dataset (M-VAD). The experiments demonstrate that capturing temporal regions-of-interest information enhances the representation of input videos, and our approach obtains state-of-the-art results on popular evaluation metrics such as BLEU-4, CIDEr, and METEOR.
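As a rough illustration of per-frame region selection by attention, the sketch below pools the region features of one frame into a single vector using softmax weights. It is a minimal sketch under stated assumptions, not the attention model used in the paper: the dot-product scoring, the `query` context vector, and the names `softmax` and `attend_regions` are all hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_regions(regions, query):
    """Soft attention over the regions of one frame.

    regions: list of region feature vectors (lists of floats, equal length).
    query:   hypothetical context vector of the same length (an assumption;
             the abstract does not say what the attention is conditioned on).
    Returns (weights, pooled), where pooled is the weighted sum of regions.
    """
    # Assumed scoring function: plain dot product between region and query.
    scores = [sum(r_i * q_i for r_i, q_i in zip(r, query)) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return weights, pooled


# Toy frame with three regions described by 2-D features.
frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, pooled = attend_regions(frame, query=[2.0, 0.0])
```

In this toy frame the first and third regions align equally with the query, so they receive equal and larger weights than the second. Applying such a step to every frame would produce a sequence of regions-of-interest features that a recurrent model could consume alongside global frame features.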