Spatial-Temporal Attention for Image Captioning

Junwei Zhou, Xi Wang, Jizhong Han, Songlin Hu, Hongchao Gao
DOI: https://doi.org/10.1109/BigMM.2018.8499060
2018-09-01
Abstract: Our approach is inspired by human translation: when translating a sentence, one cannot generate a word without looking back at the previous words of the sentence. In addition, generating a sentence for an image requires spatial information. In this paper, we propose a novel spatial-temporal attention approach that combines previous, current, and visual information. To produce a more accurate sentence for an image, our model decides whether spatial or temporal information is more important during word generation. In our experiments, we evaluate our method on the most popular dataset, Microsoft COCO. The results show that our method performs well.
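
The idea of weighing spatial against temporal information at each word can be illustrated with a minimal sketch. This is not the authors' model: the abstract gives no equations, so the dot-product attention, the scalar sigmoid gate, and every name below (`attend`, `spatial_temporal_context`, `gate_w`, `gate_b`) are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, items):
    """Dot-product attention: return the weighted average of `items`
    and the attention weights computed from similarity to `query`."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    context = [sum(w * item[i] for w, item in zip(weights, items))
               for i in range(dim)]
    return context, weights


def spatial_temporal_context(query, region_feats, prev_states,
                             gate_w, gate_b=0.0):
    """Blend a spatial context (over image regions) with a temporal
    context (over states of previously generated words).

    The scalar gate is a hypothetical stand-in for the paper's decision
    about which source matters more for the current word.
    """
    spatial_ctx, _ = attend(query, region_feats)
    temporal_ctx, _ = attend(query, prev_states)
    # Assumed gate: sigmoid of a linear function of the current state.
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_w, query) + gate_b)))
    mixed = [gate * s + (1.0 - gate) * t
             for s, t in zip(spatial_ctx, temporal_ctx)]
    return mixed, gate
```

In this sketch a gate near 1 means the current word relies mostly on image regions, while a gate near 0 means it relies mostly on the words generated so far. In a trained model the gate parameters would be learned rather than set by hand.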
Computer Science