Context-Assisted Attention for Image Captioning

Zheng Lian, Rui Wang, Haichang Li, Xiaohui Hu
DOI: https://doi.org/10.1007/978-3-031-15919-0_60
2022-01-01
Abstract: Temporal attention has demonstrated a crucial role in modelling the relationships between semantic queries and image regions in image captioning. Nevertheless, most existing attention-based methods ignore the potential effect of previously attended information on the generation of the current attention context. In this paper, we propose a simple but effective Context-Assisted Attention (CA$^2$) for image captioning, which considers the temporal coherence of the attention contexts during sequence prediction. Specifically, CA$^2$ combines the attention contexts from previous time steps with the features of image regions to serve as the input key-value pairs of the attention module for current context generation, which enables the sentence decoder not only to attend to the image regions as in conventional attention but also to focus on the historical attention contexts when necessary. Furthermore, we present a regularization method tailored to CA$^2$, namely Weight Transferring Constraint (WTC), to restrict the total weight assigned to the historical contexts in each decoding step. Experiments on the popular MS COCO dataset demonstrate that our method consistently improves LSTM-based baselines and achieves competitive performance, with 38.7 BLEU-4 and 128.5 CIDEr-D scores.
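The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring function or the exact form of WTC, so the scaled dot-product scoring, the hinge-style penalty, the `budget` parameter, and all function names below are assumptions made purely for illustration. The only parts taken from the abstract are that past attention contexts join the image-region features as key-value pairs, and that the total weight placed on those past contexts is restricted at each decoding step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def context_assisted_attention(query, regions, history):
    """Attend over image regions plus previously generated contexts.

    Assumption: scaled dot-product scoring, with keys equal to values.
    Returns (context, weights, history_weight), where history_weight is
    the total attention mass placed on the historical contexts.
    """
    pool = regions + history  # regions first, then past contexts
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in pool])
    context = [
        sum(w * v[i] for w, v in zip(weights, pool))
        for i in range(len(query))
    ]
    history_weight = sum(weights[len(regions):])
    return context, weights, history_weight


def wtc_penalty(history_weight, budget=0.5):
    """Hypothetical stand-in for WTC: penalize only the attention mass
    on historical contexts that exceeds `budget` (an assumed parameter)."""
    return max(0.0, history_weight - budget)


def decode(queries, regions):
    """Run attention step by step, feeding each new context back in."""
    history, contexts, penalty = [], [], 0.0
    for q in queries:
        ctx, _, hw = context_assisted_attention(q, regions, history)
        penalty += wtc_penalty(hw)
        history.append(ctx)  # becomes a key-value pair for later steps
        contexts.append(ctx)
    return contexts, penalty
```

At the first decoding step the history is empty, so the sketch reduces to ordinary attention over image regions; from the second step onward the pool grows by one context per step, and the accumulated penalty would be added to the captioning loss during training.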