Image Captioning with an Intermediate Attributes Layer

Qi Wu, Chunhua Shen, Anton van den Hengel, Lingqiao Liu, Anthony R. Dick
2015-01-01
Abstract: Many recent studies in image captioning rely on an architecture that learns the mapping from images to sentences in an end-to-end fashion. However, generating an accurate and complete description requires identifying all entities, their mutual interactions, and the context of the image. In this work, we show that an intermediate image-to-attributes layer can dramatically improve captioning results over the current approach, which directly connects an RNN to a CNN. We propose a two-stage procedure for training such an attribute-based approach: in the first stage, we mine a number of keywords from the training sentences, which we use as semantic attributes for images, and learn the mapping from images to those attributes with a CNN; in the second stage, we learn the mapping from detected attribute occurrence likelihoods to sentence descriptions using an LSTM. We then demonstrate the effectiveness of our two-stage model with captioning experiments on three benchmark datasets: Flickr8k, Flickr30K, and MS COCO.
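
The keyword-mining part of the first stage can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how keywords are selected, so ranking by raw frequency, the small `STOPWORDS` list, the function names `mine_attributes` and `attribute_targets`, and the toy captions are all assumptions made for illustration. The resulting multi-hot vectors are the kind of targets a CNN attribute predictor could be trained against.

```python
from collections import Counter

# Hypothetical stopword list; the abstract does not describe keyword filtering.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "with", "and", "is", "are", "to"}


def mine_attributes(captions_per_image, k):
    """Return the k most frequent non-stopword words across all training captions.

    Frequency ranking is an assumption; the source only says keywords are mined.
    """
    counts = Counter()
    for captions in captions_per_image.values():
        for caption in captions:
            counts.update(w for w in caption.lower().split() if w not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def attribute_targets(captions_per_image, vocab):
    """Build one multi-hot vector per image: 1 if the attribute occurs in any caption."""
    targets = {}
    for image_id, captions in captions_per_image.items():
        words = {w for caption in captions for w in caption.lower().split()}
        targets[image_id] = [1 if attribute in words else 0 for attribute in vocab]
    return targets


# Toy data, invented for this example only.
captions = {
    "img1": ["a dog runs on the grass", "a brown dog playing"],
    "img2": ["a man rides a bike", "man on a bike in the street"],
}
vocab = mine_attributes(captions, k=3)
targets = attribute_targets(captions, vocab)
```

In the second stage described above, the input to the LSTM would be the CNN's predicted likelihoods for each entry of `vocab`, not these binary training targets.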