Cascade Attention: Multiple Feature Based Learning for Image Captioning

Jiahe Shi, Yali Li, Shengjin Wang
DOI: https://doi.org/10.1109/icip.2019.8803149
2019-01-01
Abstract: Most recent research in image captioning adopts an attention mechanism within the encoder-decoder framework, where the attention module aligns input features for the decoder and consequently boosts performance. A common defect of traditional attention methods is that they ignore the inequality among different types of inputs, which leads to under-exploitation of certain informative features. In this paper, we propose a novel cascade attention module, which processes different types of input sequentially. The cascade attention module lets inputs of higher priority affect the attention over other inputs, so as to emphasize this inequality. We implement our model by introducing the global feature of the image into the captioning process of R-CNN-based frameworks; this feature is rich in context information but has little effect under a traditional attention module. Experimental results demonstrate that the proposed method is able to exploit features of different types and achieves improvements on multiple automatic metrics.
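The abstract does not give the module's equations, so the following is only a minimal sketch of the cascade idea under stated assumptions. It assumes scaled dot-product attention, assumes that each group's context vector is simply added to the decoder query before the next group is attended, and assumes the global image feature is the higher-priority group placed ahead of the R-CNN region features. The helper names (`attend`, `cascade_attention`) are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    # Subtract the max score for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float],
           keys: Sequence[Sequence[float]]) -> Tuple[List[float], Vector]:
    """Scaled dot-product attention (an assumed form): returns (weights, context)."""
    d = len(query)
    weights = softmax([dot(query, k) / math.sqrt(d) for k in keys])
    context = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(d)]
    return weights, context


def cascade_attention(hidden: Sequence[float],
                      groups: Sequence[Sequence[Sequence[float]]]
                      ) -> Tuple[List[List[float]], List[Vector]]:
    """Attend over input groups in priority order (highest first).

    Hypothetical conditioning rule: each group's context vector is added to
    the query, so higher-priority inputs shift the attention over later groups.
    """
    query = list(hidden)
    all_weights: List[List[float]] = []
    contexts: List[Vector] = []
    for keys in groups:
        weights, ctx = attend(query, keys)
        all_weights.append(weights)
        contexts.append(ctx)
        query = [q + c for q, c in zip(query, ctx)]
    return all_weights, contexts


# Toy usage: one global feature (a single vector) followed by two region features.
hidden = [0.0, 0.0]
global_feature = [[1.0, 0.0]]
regions = [[1.0, 0.0], [0.0, 1.0]]

flat_weights, _ = attend(hidden, regions)
print(flat_weights)  # -> [0.5, 0.5]

weights, contexts = cascade_attention(hidden, [global_feature, regions])
print([round(w, 2) for w in weights[1]])  # -> [0.67, 0.33]
```

Because the global feature is a single vector, attending over it alone returns that vector unchanged; its effect appears only through the modified query, which moves the region weights away from the uniform split that plain attention gives in this toy case. This illustrates, under the assumptions above, how a higher-priority input can steer attention over other inputs instead of competing with them for weight in one softmax.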