Improving Image Paragraph Captioning with Dual Relations

Yun Liu, Yihui Shi, Fangxiang Feng, Ruifan Li, Zhanyu Ma, Xiaojie Wang
DOI: https://doi.org/10.1109/icme52920.2022.9859701
2022-01-01
Abstract: Image paragraph captioning aims to generate multiple descriptive sentences for an image. However, most previous methods ignore the explicit relations among objects, resulting in unsatisfactory performance. In this paper, we propose a novel model (i.e., DualRel) to capture spatial and semantic relations among objects. Specifically, the spatial relation embedding is obtained solely from images using a predefined geometry pattern. With the help of captions, the semantic relation embedding is learned in a weakly supervised manner. These two relation embeddings then interact with the regional features of objects through a relation-aware attention interaction. This interaction first obtains a visual context vector from the regional features. Then, using the visual context vector as the query, it obtains the corresponding spatial and semantic relation-aware vectors through attention. These three vectors are fused with two gates for language decoding to generate a paragraph. Experimental results on the Stanford benchmark dataset show that DualRel achieves remarkable improvements. Code released at https://github.com/fuyunll07/DualRel.
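The relation-aware interaction described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the scaled dot-product attention, the additive gated fusion, the decoder state `h`, the weight matrices `W_sp` and `W_se`, and every function name here are assumptions made for illustration. The abstract states only that a visual context vector is computed first, that it is used to attend over the spatial and semantic relation embeddings, and that the three vectors are fused with two gates.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attend(query, keys):
    # Scaled dot-product attention (an assumed choice): weight each row
    # of `keys` by its similarity to `query`, then return the weighted sum.
    scores = keys @ query / np.sqrt(query.shape[0])
    return softmax(scores) @ keys

def relation_aware_fusion(h, regions, spatial_rel, semantic_rel, W_sp, W_se):
    # Step 1: visual context vector from regional features.
    v = attend(h, regions)
    # Step 2: relation-aware vectors, queried by the visual context.
    r_sp = attend(v, spatial_rel)
    r_se = attend(v, semantic_rel)
    # Step 3: two gates decide how much of each relation vector to add.
    # The additive form below is hypothetical; the paper may fuse differently.
    g_sp = sigmoid(W_sp @ np.concatenate([v, r_sp]))
    g_se = sigmoid(W_se @ np.concatenate([v, r_se]))
    return v + g_sp * r_sp + g_se * r_se

# Toy inputs: 5 regions, 10 object-pair relations, feature size 8.
rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                    # hypothetical decoder state
regions = rng.normal(size=(5, d))         # regional features
spatial_rel = rng.normal(size=(10, d))    # spatial relation embeddings
semantic_rel = rng.normal(size=(10, d))   # semantic relation embeddings
W_sp = rng.normal(size=(d, 2 * d))
W_se = rng.normal(size=(d, 2 * d))

fused = relation_aware_fusion(h, regions, spatial_rel, semantic_rel, W_sp, W_se)
print(fused.shape)
```

In this sketch the fused vector has the same dimensionality as the regional features, so it could be passed to a language decoder at each step in place of a plain visual context vector.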