Hierarchical Context Encoding for Events Captioning in Videos

Dali Yang, Chun Yuan
DOI: https://doi.org/10.1109/icip.2018.8451740
2018-01-01
Abstract: In this paper, we tackle the task of captioning each event in a video (dense captioning in videos) and propose a novel pipeline. The task is challenging because of the uncertainty in measuring events and the need to generate context-aware sentences. Conventional video captioning methods encode events with context poorly; as a simple example, such models fail to use words such as “another” and “continue” correctly when describing multiple events. We address this issue directly with an encoder that works along the time axis, encoding videos and outputting features from different levels of hierarchical LSTMs. Our hierarchical LSTMs use different layers to retrieve intra-event (regional) and inter-event (global) descriptors. Moreover, we modify the language model to use attention. Unlike previous attention modules, ours handles regional and global information at different phases and integrates the output into the language LSTM. This results in better consistency with ground-truth natural language. In addition, our attention model makes the whole pipeline more robust to inaccurate event proposals. We evaluate our method on the dense captioning dataset with common metrics, reporting up to a 30% boost on a single metric.