Generating video description with Long-Short Term Memory

Shuohao Li, Jun Zhang, Qiang Guo, Jun Lei, D. Tu
DOI: https://doi.org/10.1109/ICIVC.2016.7571276
2016-08-01
Abstract: Connecting visual imagery with descriptive language is a challenge for computer vision and machine translation. Inspired by work on image description, which uses an "encoder-decoder" model to translate an image into a target sentence, we propose an approach that generates descriptions for video. Unlike an image, which records information at a single moment, a video has a time-series property, so generating a video description requires encoding dynamic temporal structure. Our model takes both global and local information into account. First, our approach extracts features from sampled frames with a Convolutional Neural Network (CNN) pre-trained for image classification. Second, we obtain the global feature of the video by max pooling the frame features. Third, we divide the Long-Short Term Memory (LSTM) into two parts: one encodes the frame features into a local feature, and the other decodes the features, which contain global and local information, into the target sentence. Finally, we compare two variants of our model with recent work using BLEU metrics on the YouTube dataset.
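The encoding steps in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random array standing in for CNN frame features, the dimensions `T`, `D`, `H`, the gate layout in `lstm_step`, and the use of concatenation to combine global and local features are all assumptions made for illustration, since the abstract does not specify them. The decoder LSTM is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, b):
    """One standard LSTM step.

    W has shape (4*H, D+H) and b has shape (4*H,); the gate order
    (input, forget, output, candidate) is an assumption of this sketch.
    """
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:4 * H])  # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def global_feature(frame_feats):
    """Max pool over the time axis: (T, D) -> (D,)."""
    return frame_feats.max(axis=0)


def encode_local(frame_feats, W, b, hidden_size):
    """Run the encoder LSTM over the frames; return the last hidden state."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x in frame_feats:
        h, c = lstm_step(x, h, c, W, b)
    return h


rng = np.random.default_rng(0)
T, D, H = 8, 16, 12  # frames, feature size, hidden size (hypothetical)

# Stand-in for CNN features of T sampled frames (random, not real data).
frame_feats = rng.standard_normal((T, D))

# Randomly initialised encoder weights; a real model would learn these.
W_enc = rng.standard_normal((4 * H, D + H)) * 0.1
b_enc = np.zeros(4 * H)

g_feat = global_feature(frame_feats)
l_feat = encode_local(frame_feats, W_enc, b_enc, H)

# Assumed combination: concatenate global and local features for the decoder.
decoder_input = np.concatenate([g_feat, l_feat])
print(decoder_input.shape)
```

In this sketch the global feature discards frame order, while the encoder's final hidden state depends on it, which is one way to read the abstract's distinction between global and local information.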
Computer Science