Hybrid Attention Network for Image Captioning

Wenhui Jiang, Qin Li, Kun Zhan, Yuming Fang, Fei Shen
DOI: https://doi.org/10.1016/j.displa.2022.102238
IF: 3.074
2022-01-01
Displays
Abstract: Machine attention mechanisms are widely used in the task of image captioning. Such mechanisms dynamically focus on different regions to guide the word generation process. However, without explicit supervision, existing attention models may fail to concentrate on the correct regions and thus mislead word prediction. In this study, we exploit human captioning attention, which encodes the rich information that human beings perceive during captioning, and propose a novel Hybrid Attention Network (HAN) that combines the prevailing machine attention mechanisms with human captioning attention. The proposed HAN addresses the problem of "object hallucination" by re-weighting bottom-up attention, and improves the diversity of the generated captions by complementing top-down attention with human captioning attention. Extensive experiments are conducted on the Flickr30K and MS COCO datasets, demonstrating that the proposed method effectively improves the performance of current image captioning methods.