CaptionNet: A Tailor-made Recurrent Neural Network for Generating Image Descriptions

Longyu Yang, Hanli Wang, Pengjie Tang, Qinyu Li
DOI: https://doi.org/10.1109/tmm.2020.2990074
IF: 7.3
2021-01-01
IEEE Transactions on Multimedia
Abstract: Image captioning is a challenging visual understanding task that has drawn increasing attention from researchers. In popular attention-based image captioning frameworks, the Long Short-Term Memory (LSTM) network generally requires two inputs at each time step: image features and the previously generated words. However, errors accumulate if the previous words are inaccurate and the related semantics are insufficient. To address these challenges, this work proposes a novel model named CaptionNet, an improved LSTM designed specifically for image captioning. Concretely, only attended image features are allowed to be fed into the memory of CaptionNet through the input gates. In this way, the dependency on the previously predicted words is reduced, forcing the model to focus on more visual clues of the image at the current time step. Moreover, a memory initialization method called image feature encoding is designed to capture richer semantics of the target image. Evaluation on the benchmark MSCOCO and Flickr30K datasets demonstrates the effectiveness of the proposed CaptionNet model, and extensive ablation studies are performed to verify each of the proposed methods. The project page can be found at https://mic.tongji.edu.cn/3f/9c/c9778a147356/page.htm.
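The gating idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's formulation: the abstract gives no equations, so the cell below simply assumes a standard LSTM in which the candidate memory content is computed from the attended image feature alone, while the gates still see the previous hidden state and the previous word embedding. All names (`init_params`, `cell_step`, `W_gates`, `W_cand`, `v_att`) and the dimensions are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(hidden, embed, feat, seed=0):
    """Random weights for the illustrative cell (hypothetical layout)."""
    rng = np.random.default_rng(seed)
    ctx = hidden + embed + feat  # size of the concatenated gate input
    return {
        "W_gates": rng.normal(0.0, 0.1, (3 * hidden, ctx)),
        "b_gates": np.zeros(3 * hidden),
        "W_cand": rng.normal(0.0, 0.1, (hidden, feat)),
        "b_cand": np.zeros(hidden),
    }


def cell_step(params, h_prev, c_prev, word_emb, v_att):
    """One time step of the assumed cell.

    Assumption: the input, forget, and output gates depend on the previous
    hidden state, the previous word embedding, and the attended feature,
    but the candidate written into memory depends on v_att only.
    """
    ctx = np.concatenate([h_prev, word_emb, v_att])
    gates = sigmoid(params["W_gates"] @ ctx + params["b_gates"])
    i, f, o = np.split(gates, 3)

    # Only the attended image feature contributes new memory content.
    cand = np.tanh(params["W_cand"] @ v_att + params["b_cand"])

    c = f * c_prev + i * cand
    h = o * np.tanh(c)
    return h, c
```

Under this assumption, the previous word can modulate how much visual content is written or kept, but it cannot itself be written into the memory: with an empty memory and a zero attended feature, the cell state stays at zero whatever the word embedding is.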