Time-Dependent Pre-Attention Model For Image Captioning

Fuwei Wang, Xiaolong Gong, Linpeng Huang
DOI: https://doi.org/10.1109/ICPR.2018.8545355
2018-01-01
Abstract: The task of automatically generating image captions has drawn a lot of attention in the past few years because it shows great potential in a wide range of application scenarios. The encoder-decoder structure with an attention mechanism has been extensively applied to this task. However, most studies use the attention mechanism only to attend to image features and neglect the relations between image features, which we think play an important role in scene understanding. To tackle this problem, we propose a novel attention mechanism named "attention to Time-Dependent Pre-Attention" (TDPA-attention), which is combined with a hierarchical LSTM decoder to compose our captioning model (TDPA-model). Within TDPA-attention, at every time step, every image feature attends to all image features according to a semantic context, and the attended feature is treated as an aggregated feature that contains the relations between this image feature and all image features. These aggregated features form a new feature set that the hierarchical LSTM decoder attends to. We evaluate our model on the public image captioning dataset Microsoft COCO and achieve state-of-the-art performance on most evaluation metrics.
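The two attention stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the scoring function, so scaled dot-product scoring is swapped in here as an assumption. The function names `pre_attention` and `decoder_attention`, the projection matrices `Wq`, `Wk`, `Wh`, `Wu`, `Wd`, and the use of a single vector `h` to stand for the semantic context at one time step are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pre_attention(V, h, Wq, Wk, Wh):
    """Stage 1 (assumed form): each image feature attends to all features.

    V: (n, d) image features; h: (c,) semantic context at this time step.
    Returns U: (n, d) aggregated features and A: (n, n) attention weights.
    """
    # Each feature's query is conditioned on the context, so the
    # weights change from one time step to the next.
    Q = V @ Wq + h @ Wh                       # (n, k), h @ Wh broadcasts over rows
    K = V @ Wk                                # (n, k)
    scores = Q @ K.T / np.sqrt(K.shape[1])    # (n, n), scaled dot-product (assumption)
    A = softmax(scores, axis=1)               # row i: weights of feature i over all features
    return A @ V, A

def decoder_attention(U, h, Wu, Wd):
    """Stage 2 (assumed form): the decoder attends to the aggregated features.

    U: (n, d) aggregated features; h: (c,) decoder context.
    Returns the attended vector (d,) and the weights alpha (n,).
    """
    scores = (U @ Wu) @ (h @ Wd)              # (n,)
    alpha = softmax(scores)
    return alpha @ U, alpha

# Toy example with random values: 5 image features of size 8, context of size 6.
rng = np.random.default_rng(0)
n, d, c, k = 5, 8, 6, 4
V = rng.normal(size=(n, d))
h = rng.normal(size=c)
Wq, Wk, Wu = (rng.normal(size=(d, k)) for _ in range(3))
Wh, Wd = (rng.normal(size=(c, k)) for _ in range(2))

U, A = pre_attention(V, h, Wq, Wk, Wh)
z, alpha = decoder_attention(U, h, Wu, Wd)
```

In a full captioning model, `h` would be recomputed at every decoding step, so the aggregated set `U` is rebuilt each time. That recomputation is what makes the pre-attention time-dependent.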