Describing Images by Feeding LSTM with Structural Words

Shubo Ma, Yahong Han
DOI: https://doi.org/10.1109/icme.2016.7552883
2016-01-01
Abstract: Generating semantic descriptions has drawn increasing attention recently. Describing objects with adaptive adjunct words makes a sentence more informative. In this paper, we focus on generating descriptions for images from structural words that we have generated, such as a tetrad of <object, attribute, activity, scene>. We propose to use a deep machine translation method to generate semantically meaningful descriptions. In particular, each description is composed of objects with appropriate adjunct words, the corresponding activities, and the scene. We propose a multi-task method to generate the structural words. Taking these word sequences as the source language, we train an LSTM encoder-decoder machine translation model to output the target language. Experiments on benchmark datasets demonstrate that our method performs better than state-of-the-art image captioning methods in terms of language generation metrics.
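To make the "structural words as source language" idea concrete, the following is a minimal sketch of how one tetrad and its caption might be turned into a source/target pair for an encoder-decoder translation model. It is an illustration only, not the authors' implementation: the names `StructuralWords`, `make_translation_pair`, and `build_vocab` are hypothetical, and the token order, the `<bos>`/`<eos>` markers, and the whitespace tokenization are assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StructuralWords:
    """One <object, attribute, activity, scene> tetrad (hypothetical container)."""
    obj: str
    attribute: str
    activity: str
    scene: str

    def to_source_tokens(self) -> List[str]:
        # Assumption: the tetrad is flattened in the order the abstract lists it.
        return [self.obj, self.attribute, self.activity, self.scene]


def make_translation_pair(words: StructuralWords, caption: str) -> Tuple[List[str], List[str]]:
    """Pair the structural-word sequence (source) with the caption tokens (target)."""
    source = words.to_source_tokens()
    # Assumption: lowercase whitespace tokenization with begin/end markers.
    target = ["<bos>"] + caption.lower().split() + ["<eos>"]
    return source, target


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab


# Made-up example values, not data from the paper.
words = StructuralWords(obj="dog", attribute="brown", activity="running", scene="beach")
source, target = make_translation_pair(words, "A brown dog is running on the beach")
source_vocab = build_vocab([source])
source_ids = [source_vocab[tok] for tok in source]
```

In this sketch, `source_ids` would be what an LSTM encoder consumes, while the decoder would be trained to emit the `target` tokens. The model itself is omitted because the abstract gives no architectural details beyond "LSTM encoder-decoder".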