Bidirectional Multimodal Recurrent Neural Networks with Refined Visual Features for Image Captioning.

Yanwu Shu, Liyan Zhang, Zechao Li, Jinhui Tang
DOI: https://doi.org/10.1007/978-981-10-8530-7_8
2017-01-01
Abstract: Image captioning, which aims to automatically describe the content of an image in sentences, has become an attractive task in computer vision and natural language processing. Recently, neural network approaches have been proposed and have proved to be the most efficient methods for image captioning. However, most prior work considers only past semantic context when generating the words of a sentence and ignores future textual context. In this paper, we therefore propose a bidirectional multimodal Recurrent Neural Network (m-RNN) model that considers both past and future semantic context through a bidirectional recurrent layer. We first employ a pre-trained Convolutional Neural Network (CNN) to extract image features and then use the bidirectional m-RNN to generate sentences describing each input image. In addition, we refine the visual features by combining word embedding features with raw image features to further improve performance. Experiments on the MS-COCO dataset demonstrate the superiority of the proposed model over the original m-RNN model.
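The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no layer sizes, recurrence type, or fusion formula, so the plain tanh recurrence, the dimensions, the parameter names, and in particular the way the "refined" visual feature is formed (projected image feature plus projected mean word embedding) are all assumptions made for illustration. The image feature vector stands in for the output of a pre-trained CNN.

```python
import numpy as np

def init_params(seed=0, vocab=20, d_emb=8, d_hid=6, d_img=10, d_mm=12):
    """Random weights for the sketch; every size here is hypothetical."""
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(scale=0.3, size=shape)
    return {
        "E": w(vocab, d_emb),                               # word embedding table
        "W_in_f": w(d_emb, d_hid), "W_rec_f": w(d_hid, d_hid),  # forward recurrence
        "W_in_b": w(d_emb, d_hid), "W_rec_b": w(d_hid, d_hid),  # backward recurrence
        "W_img": w(d_img, d_mm), "W_ref": w(d_emb, d_mm),   # visual refinement (assumed form)
        "W_e": w(d_emb, d_mm), "W_f": w(d_hid, d_mm), "W_b": w(d_hid, d_mm),
        "W_out": w(d_mm, vocab),                            # multimodal layer -> vocabulary
    }

def rnn_pass(embeddings, W_in, W_rec, reverse=False):
    """Plain tanh recurrence over a sequence; returns one hidden state per position."""
    T, d_hid = embeddings.shape[0], W_rec.shape[0]
    h = np.zeros(d_hid)
    states = np.zeros((T, d_hid))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(embeddings[t] @ W_in + h @ W_rec)
        states[t] = h
    return states

def bidirectional_multimodal(word_ids, image_feat, params):
    """Per-position word distributions from word context (both directions) and image."""
    emb = params["E"][word_ids]                             # (T, d_emb)
    # Assumed refinement: fuse the raw image feature with the mean word embedding.
    refined = np.tanh(image_feat @ params["W_img"]
                      + emb.mean(axis=0) @ params["W_ref"])  # (d_mm,)
    fwd = rnn_pass(emb, params["W_in_f"], params["W_rec_f"])                # past context
    bwd = rnn_pass(emb, params["W_in_b"], params["W_rec_b"], reverse=True)  # future context
    # Multimodal layer: word embedding, both recurrent states, and refined visual feature.
    mm = np.tanh(emb @ params["W_e"] + fwd @ params["W_f"]
                 + bwd @ params["W_b"] + refined)           # (T, d_mm)
    logits = mm @ params["W_out"]
    logits = logits - logits.max(axis=1, keepdims=True)     # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    params = init_params()
    image_feat = np.ones(10)                                # placeholder for CNN features
    probs = bidirectional_multimodal(np.array([1, 2, 3, 4]), image_feat, params)
    print(probs.shape)
```

In this sketch the backward pass is what lets the state at an early position depend on later words, which is the property a unidirectional m-RNN lacks. How the two directions are trained and decoded at generation time is not specified in the abstract and is not modeled here.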