Fast Image Captioning Using LSTM

Meng Han, Wenyu Chen, Alemu Dagmawi Moges
DOI: https://doi.org/10.1007/s10586-018-1885-9
2018-01-01
Cluster Computing
Abstract: Computer vision and natural language processing are two long-standing challenges in artificial intelligence. In this paper, we explore a generative automatic image annotation model that draws on recent advances on both fronts. Our approach uses a deep convolutional neural network to detect image regions, which are then fed to a recurrent neural network trained to maximize the likelihood of the target sentence describing the given image. In our experiments, we found that accuracy and training improved when the image representation from our detection model was coupled with the input word embedding. We also found that most of the information from the last layer of the detection model vanishes when it is fed as the thought vector for our LSTM decoder. This is mainly because the last fully connected layer of the YOLO model represents the class probabilities of the detected objects and their bounding boxes, and this information is not rich enough. We trained our model on the COCO benchmark for 60 h, using 64,000 training and 12,800 validation samples, and achieved 23% accuracy. We also observed a significant drop in training speed when we increased the number of hidden units in the LSTM layer from 1470 to 4096.
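The coupling of image features with word embeddings described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names (`decode_inputs`, `lstm_step`, `lstm_param_count`), the gate ordering, and all dimensions in the usage section are assumptions made for illustration. The sketch concatenates one image feature vector with every input word embedding, runs a standard LSTM step over the result, and counts the parameters of a standard LSTM layer. The count grows roughly quadratically with the number of hidden units, which is one plausible (not source-confirmed) contributor to the slowdown reported when moving from 1470 to 4096 units.

```python
import math
import random


def lstm_param_count(input_dim, hidden_dim):
    """Weights plus biases of one standard LSTM layer (four gates)."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_inputs(image_features, word_embeddings):
    """Couple the image representation with every input word embedding
    by concatenation (an assumed way of 'coupling' the two)."""
    return [image_features + emb for emb in word_embeddings]


def lstm_step(x, h, c, W, b):
    """One standard LSTM step on plain Python lists.

    W has 4*H rows, each of length len(x) + H; b has 4*H entries.
    Rows are ordered as input, forget, output, candidate (an assumption).
    """
    H = len(h)
    z = x + h  # concatenation [x; h]
    pre = [sum(w * v for w, v in zip(row, z)) + bias
           for row, bias in zip(W, b)]
    i = [sigmoid(v) for v in pre[0:H]]
    f = [sigmoid(v) for v in pre[H:2 * H]]
    o = [sigmoid(v) for v in pre[2 * H:3 * H]]
    g = [math.tanh(v) for v in pre[3 * H:4 * H]]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


if __name__ == "__main__":
    # Tiny hypothetical sizes so the example runs instantly.
    rng = random.Random(0)
    image_dim, embed_dim, hidden = 6, 4, 5
    image = [rng.uniform(-1, 1) for _ in range(image_dim)]
    words = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(3)]
    W = [[rng.uniform(-0.1, 0.1) for _ in range(image_dim + embed_dim + hidden)]
         for _ in range(4 * hidden)]
    b = [0.0] * (4 * hidden)

    h, c = [0.0] * hidden, [0.0] * hidden
    for x in decode_inputs(image, words):
        h, c = lstm_step(x, h, c, W, b)
    print("final hidden state:", h)

    # Hypothetical input size; only the hidden sizes come from the abstract.
    for units in (1470, 4096):
        print(units, "hidden units ->", lstm_param_count(1982, units), "parameters")
```

Feeding the image vector at every step, rather than only as the initial state, keeps the visual signal available throughout decoding, which is consistent with the abstract's observation that information passed solely as the thought vector tends to vanish.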