AttResNet: Attention-based ResNet for Image Captioning

Yunmeng Feng, Long Lan, Xiang Zhang, Chuanfu Xu, Zhenghua Wang, Zhigang Luo
DOI: https://doi.org/10.1145/3302425.3302464
2018-01-01
Abstract: Image captioning, which has been widely studied recently, learns a high-level semantic description of a single image, much as a human understands a scene. To achieve this goal, many recent methods divide the task into two stages, an encoder and a decoder, which correspond to feature extraction and semantic description, respectively. With the development of deep neural networks, the two stages can be realized with a convolutional neural network (CNN) followed by a recurrent neural network (RNN). Following this deep encoder-decoder framework, this paper refines the encoder with an attention-based ResNet model to provide better semantic features for the decoder. The attention mechanism has been broadly recognized as a useful strategy in image captioning: it highlights the image regions of interest and emphasizes the corresponding semantics to improve the captioning of image content. Specifically, we design an attention connection and couple it seamlessly with the well-known ResNet; we call the resulting model AttResNet. To the best of our knowledge, this is the first attempt to apply ResNet to image captioning. Experiments on the MSCOCO dataset show that the proposed model achieves favorable results.
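To make the idea of attention over image regions concrete, the following is a minimal NumPy sketch of generic additive soft attention. It is not the paper's attention connection, whose details the abstract does not give. The function name `soft_attention`, the weight matrices `W_f`, `W_h`, `v`, and all dimensions (49 regions, 2048-dimensional features, and so on) are assumptions chosen for illustration only.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """Generic additive soft attention (illustrative, not the paper's method).

    features: (R, D) array, one D-dim feature vector per image region
    hidden:   (H,) array, current decoder state
    Returns the attended context vector (D,) and the region weights (R,).
    """
    # Score each region against the decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # shape (R,)
    # Numerically stable softmax over regions.
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    # Weighted sum of region features: regions of interest dominate.
    context = weights @ features  # shape (D,)
    return context, weights

# Hypothetical sizes: a 7x7 feature map flattened to 49 regions.
rng = np.random.default_rng(0)
R, D, H, A = 49, 2048, 512, 256
features = rng.standard_normal((R, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, A)) * 0.01
W_h = rng.standard_normal((H, A)) * 0.01
v = rng.standard_normal(A)

context, weights = soft_attention(features, hidden, W_f, W_h, v)
```

In an encoder-decoder captioner of this general kind, a vector like `context` would be passed to the RNN decoder at each step, so that the regions with the largest weights contribute most to the next word.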