Past is Important: Improved Image Captioning by Looking Back in Time

Yiwei Wei, Chunlei Wu, ZhiYang Jia, XuFei Hu, Shuang Guo, Haitao Shi
DOI: https://doi.org/10.1016/j.image.2021.116183
IF: 3.453
2021-01-01
Signal Processing Image Communication
Abstract: A major development in image captioning has been the incorporation of visual attention into the design of generative language models. However, most previous studies emphasize only its role in enhancing visual composition at the current moment, while neglecting its role in global sequence reasoning. This problem appears not only in the captioning model but also in the reinforcement learning structure. To tackle this issue, we first propose a Visual Reserved model that enables previous visual context to be considered in the current sequence reasoning. Next, an Attentional-Fluctuation Supervised model is proposed for the reinforcement learning structure. In contrast to traditional strategies that take only non-differentiable Natural Language Processing (NLP) metrics as the incentive standard, the proposed model regards the fluctuation of the previous attention matrix as an important indicator for judging the convergence of the captioning model. The proposed methods have been tested on the MS-COCO captioning dataset and achieve competitive results as evaluated by the evaluation server of the MS COCO captioning challenge.