LSTM-in-LSTM for Generating Long Descriptions of Images.

Jun Song, Siliang Tang, Jun Xiao, Fei Wu, Zhongfei (Mark) Zhang
DOI: https://doi.org/10.1007/s41095-016-0059-z
IF: 4.1268
2016-01-01
Computational Visual Media
Abstract: In this paper, we propose an approach for generating rich fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interaction between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM generally captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture is capable of producing a long description by predicting one word at every time step conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long rich fine-grained descriptions of given images in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
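The decoding loop described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the equations, so the way the inner LSTM is wired (it scans the region features, each concatenated with the outer hidden state, and its final hidden state serves as the context vector), the greedy argmax decoding, the random untrained weights, and all names (`lstm_step`, `inner_context`, `make_params`, `generate`) are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, b):
    """One standard LSTM step; W has shape (4*H, len(x) + H), b has shape (4*H,)."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:4 * H])  # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def init_lstm(rng, in_dim, hid_dim):
    """Random (untrained) LSTM weights, for illustration only."""
    W = rng.normal(0.0, 0.1, size=(4 * hid_dim, in_dim + hid_dim))
    b = np.zeros(4 * hid_dim)
    return W, b


def make_params(rng, vocab, embed_dim, region_dim, hid_dim):
    """Hypothetical parameter bundle: embeddings, inner LSTM, outer LSTM, output layer."""
    E = rng.normal(0.0, 0.1, size=(vocab, embed_dim))
    W_in, b_in = init_lstm(rng, region_dim + hid_dim, hid_dim)
    W_out, b_out = init_lstm(rng, embed_dim + hid_dim, hid_dim)
    W_vocab = rng.normal(0.0, 0.1, size=(vocab, hid_dim))
    return E, W_in, b_in, W_out, b_out, W_vocab


def inner_context(regions, h_outer, W_in, b_in):
    """ASSUMPTION: the inner LSTM scans region features, each concatenated with
    the outer hidden state, and its final hidden state is the context vector."""
    H = h_outer.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    for r in regions:
        h, c = lstm_step(np.concatenate([r, h_outer]), h, c, W_in, b_in)
    return h


def generate(regions, params, start_id, end_id, max_len=20):
    """Greedy decoding (an assumption): one word per time step, conditioned on the
    previous word, the outer hidden state, and the inner context vector."""
    E, W_in, b_in, W_out, b_out, W_vocab = params
    H = W_vocab.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    word, words = start_id, []
    for _ in range(max_len):
        ctx = inner_context(regions, h, W_in, b_in)    # inner LSTM
        x = np.concatenate([E[word], ctx])
        h, c = lstm_step(x, h, c, W_out, b_out)        # outer LSTM
        word = int(np.argmax(W_vocab @ h))             # most likely next word
        if word == end_id:
            break
        words.append(word)
    return words


rng = np.random.default_rng(0)
params = make_params(rng, vocab=12, embed_dim=8, region_dim=6, hid_dim=16)
regions = rng.normal(size=(5, 6))  # 5 hypothetical region feature vectors
print(generate(regions, params, start_id=0, end_id=1, max_len=10))
```

With untrained weights the printed word IDs are meaningless; the sketch only shows how the three conditioning signals named in the abstract could enter each prediction step.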