Learning Deep Contextual Attention Network for Narrative Photo Stream Captioning

Hanqi Wang, Siliang Tang, Yin Zhang, Tao Mei, Yueting Zhuang, Fei Wu
DOI: https://doi.org/10.1145/3126686.3126715
2017-01-01
Abstract: While image captioning has been extensively studied, the problem of generating narrative descriptions for photo streams remains underexplored. Photo stream captioning is more challenging because of the large visual variance, the complicated object context, and the need for sentence-to-sentence coherence across an ordered collection of photos. To deal with these challenges, we propose a novel deep contextual attention network (CAN) that narratively describes photo streams by jointly exploring the rich context among attended regions and the coherence among sentences. The proposed CAN follows an encoder-decoder framework: the encoder models visual context via region-level bilinear similarity and selectively focuses on the attention areas with salient context, while a novel hierarchical gated recurrent unit (h-GRU) acts as the decoder to preserve the semantic coherence among the generated sentences. Because CAN is capable of exploiting visual attention and context in the photo stream, the generated story is more semantically coherent than a mere concatenation of isolated individual image captions. We conduct experiments on the SIND dataset and show that CAN outperforms the state-of-the-art methods by 3.1%, 8.9%, and 9.1% in terms of BLEU, METEOR, and CIDEr, respectively.
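The abstract does not spell out how the encoder turns region-level bilinear similarity into attention, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each region is a feature vector, scores a pair of regions as u^T W v, sums each region's similarity to the other regions as its context score, and applies a softmax to obtain attention weights. The names bilinear, context_attention, and the matrix W are hypothetical and chosen for illustration.

```python
import math


def bilinear(u, W, v):
    """Bilinear similarity u^T W v for vectors u, v and a square matrix W."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_attention(regions, W):
    """Assumed scheme: a region's context score is its summed bilinear
    similarity to every other region; softmax turns scores into weights."""
    n = len(regions)
    context = [
        sum(bilinear(regions[i], W, regions[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    weights = softmax(context)
    dim = len(regions[0])
    # Attended feature: attention-weighted sum of the region vectors.
    attended = [sum(weights[i] * regions[i][k] for i in range(n))
                for k in range(dim)]
    return weights, attended


# Toy input: two identical regions and one dissimilar region.
regions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix; a learned W is assumed in practice
weights, attended = context_attention(regions, W)
```

In this toy input, the two mutually similar regions receive equal weights that exceed the weight of the dissimilar one, which is the sense in which attention favors regions with salient context under this assumed scoring.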