Improving Image Captioning with Better Use of Caption

Zhan Shi, Xu Zhou, Xipeng Qiu, Xiaodan Zhu
DOI: https://doi.org/10.18653/v1/2020.acl-main.664
2020-01-01
Abstract: Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce a beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with neighbouring and contextual nodes, using their textual and visual features. During generation, the model further incorporates visual relationships using multi-task learning to jointly predict word and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines and achieves state-of-the-art performance under a wide range of evaluation metrics. The code of our paper has been made publicly available.
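The multi-task objective mentioned in the abstract (jointly predicting words and object/predicate tags at each generation step) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the averaging over steps, and the `tag_weight` parameter are assumptions made only for illustration, and the abstract does not state how the two losses are combined.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under one probability distribution."""
    return -math.log(probs[target])


def joint_loss(word_probs, word_targets, tag_probs, tag_targets, tag_weight=0.5):
    """Hypothetical multi-task loss: mean word loss plus a weighted mean tag loss.

    word_probs / tag_probs: one probability distribution (list of floats) per step.
    word_targets / tag_targets: the gold index per step.
    tag_weight: assumed trade-off hyperparameter, not taken from the paper.
    """
    steps = len(word_targets)
    word_loss = sum(cross_entropy(p, t) for p, t in zip(word_probs, word_targets)) / steps
    tag_loss = sum(cross_entropy(p, t) for p, t in zip(tag_probs, tag_targets)) / steps
    return word_loss + tag_weight * tag_loss


# Usage: two decoding steps, a 3-word vocabulary and 2 tag classes.
loss = joint_loss(
    word_probs=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    word_targets=[0, 1],
    tag_probs=[[0.9, 0.1], [0.4, 0.6]],
    tag_targets=[0, 1],
)
```

In a setup like this, the tag term acts as an auxiliary signal: the shared decoder is penalised when it generates a word without also identifying whether that word is an object or a predicate.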