Multi-objective Layer-wise Optimization and Multi-level Probability Fusion for Image Description Generation Using LSTM
Peng-Jie TANG,Han-Li WANG,Kai-Sheng XU
DOI: https://doi.org/10.16383/j.aas.2017.c160733
2018-01-01
Abstract:The task of image automatic description by computer belongs to high-level visual understanding. Unlike image classification, object detection, etc., it usually requires that the model should not only have abilities of describing scene and objects, but also have capacities of expressing the relations between different objects and between objects and background in the image. In addition, it is required that the model should generate natural sentences which accord with correct grammars and appropriate structures. Nowadays, the approaches based on convolutional neural network (CNN) and long-short term memory network (LSTM) have been the popular solutions to this task, and a series of successes have been obtained. However, there are still several sticky problems, for instance, the LSTM network is not deep enough and the model is difficult to optimize, and as a result, performances cannot be improved and the sentences generated are of low quantity. To address these difficulties, inspired by the idea of deep learning, a model named MLO/MLPF-LSTM is proposed, in which the method of layer-wise optimization, multi-objective optimization and multi-layer probability fusion are employed. In details, an LSTM network with shallow depth is trained firstly, then, new LSTM layers and related objective functions are added to the optimized LSTM network. Meanwhile, the classification layers and objective functions in the original LSTM model are reserved and fine-tuned with the new layers. During the test, the probabilities of all the Softmax functions which are fed with the corresponding classification layers are fused for the final predicted probabilities by a weighted average method. Experimental results on MSCOCO and Flickr30K datasets demonstrate that our model is effective and outperforms other methods of same kinds on a number of evaluation metrics.