Show, Rethink, And Tell: Image Caption Generation With Hierarchical Topic Cues

Feng Chen, Songxian Xie, Xinyi Li, Jintao Tang, Kunyuan Pang, Shasha Li, Ting Wang
DOI: https://doi.org/10.1109/ICME51207.2021.9428353
2021-01-01
Abstract: Current state-of-the-art approaches to image captioning mainly apply the encoder-decoder framework with attention mechanisms. Most of them ignore interactions between different types of image features and perform the attention operation only once per word. These limitations restrict a captioning model’s capability to capture enough information to generate high-quality captions. By contrast, humans often rethink and polish their descriptions by refocusing on more accurate and important information that is hard to capture at first glance. In this paper, we introduce a novel topic-guided captioning model that imitates this human rethinking process by modeling interactions between visual features and hierarchical semantic features of topics. To the best of our knowledge, we are the first to effectively use hierarchical semantic features as guidance for visual attention, achieving human-like rethinking for captioning. Extensive experiments on the MS COCO dataset show that our proposed model achieves superior performance over state-of-the-art methods.