Richer Semantic Visual and Language Representation for Video Captioning

Pengjie Tang,Hanli Wang,Hanzhang Wang,Kaisheng Xu
DOI: https://doi.org/10.1145/3123266.3127895
2017-01-01
Abstract:Translating and summarizing a video into natural language is an interesting and challenging visual task. In this work, a novel framework is built to generate sentences for videos with more coherence and semantics. A long short term memory (LSTM) network with an improved factored way is first developed, which takes inspiration of the conventional factored way and a common practice of presenting multi-modal features at the first time in LSTM for video captioning. An LSTM network with the combination of improved factored and un-factored ways is exploited, and a voting strategy is employed to predict the words. Then, the residual is used to enhance the gradient signals which is learned from residual network (ResNet), and a deeper LSTM network is constructed. Furthermore, several convolutional neural network (CNN) features from deep models with different architectures are fused to catch more comprehensive and complementary visual information. Experiments are conducted on the MSR-VTT2016 and MSR-VTT2017 grand challenge datasets to demonstrate the effectiveness of each presented techniques as well as the superiority compared to other state-of-the-art methods.
What problem does this paper attempt to address?