Cross modification attention-based deliberation model for image captioning

Zheng Lian, Yanan Zhang, Haichang Li, Rui Wang, Xiaohui Hu
DOI: https://doi.org/10.1007/s10489-022-03845-1
IF: 5.3
2022-07-05
Applied Intelligence
Abstract: The two-pass decoding framework has been shown to improve the performance of image captioning models considerably. However, most existing two-pass models bring the coarse captions into the refining process simply through a conventional attention module. Such insufficient interaction cannot provide satisfactory support for producing higher-quality image descriptions. In this paper, we propose a novel Cross Modification Attention (CMA) module that exploits the complementarity of images and their corresponding coarse captions to supply more reliable features for refinement. Specifically, our CMA extends conventional attention mechanisms with a hierarchical gating network, which mutually modifies the attended vectors of the visual and linguistic modalities. It can thus make the visual semantic representation less ambiguous and filter out misleading information from the coarse captions. To cooperate with CMA in feature interaction, we further explore a general two-pass decoding framework in which the drafting and deliberation models share only the image encoders, rather than the whole drafting network as in previous methods. Our framework provides visual features that tightly couple both decoding processes and ensures efficient joint optimization of the two-pass models. Moreover, we treat the coarse captions as a baseline when optimizing the deliberation model and employ a potential-oriented reward shaping strategy for reinforcement learning to improve the quality of refinement in a targeted way. Experiments on the Flickr30K and MS COCO datasets demonstrate that our Cross Modification Attention-based Deliberation Model (CMA-DM) obtains significant improvements over single-pass decoding baselines and achieves competitive performance on the MS COCO online test server.
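The abstract gives no equations for CMA, so the following is only a rough sketch of the general idea of two attended vectors gating each other. It assumes plain dot-product attention and a single sigmoid gate per modality, not the hierarchical gating network of the paper. The names `attend`, `cross_modify`, `W_v`, and `W_c` are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(query, feats):
    # Scaled dot-product attention: weighted sum of the rows of `feats`.
    scores = feats @ query / np.sqrt(query.shape[0])
    return softmax(scores) @ feats

def cross_modify(query, visual_feats, caption_feats, W_v, W_c):
    # Assumption: one gate per modality, computed from the OTHER
    # modality's attended vector together with the decoder query.
    v = attend(query, visual_feats)    # attended visual vector
    c = attend(query, caption_feats)   # attended coarse-caption vector
    gate_v = sigmoid(W_v @ np.concatenate([c, query]))  # caption modifies vision
    gate_c = sigmoid(W_c @ np.concatenate([v, query]))  # vision filters caption
    return gate_v * v, gate_c * c

# Toy usage with random features: 5 image regions, 7 draft-caption tokens.
rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)
V = rng.normal(size=(5, d))
C = rng.normal(size=(7, d))
W_v = rng.normal(size=(d, 2 * d))
W_c = rng.normal(size=(d, 2 * d))
v_mod, c_mod = cross_modify(q, V, C, W_v, W_c)
```

Because each gate lies strictly between 0 and 1, this sketch can only attenuate components of an attended vector, which loosely mirrors the "filter out misleading information" role described in the abstract.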
computer science, artificial intelligence