Joint Common Sense and Relation Reasoning for Dense Relational Captioning

Shan Cao, Weiming Liu, Gaoyun An, Qiuqi Ruan
DOI: https://doi.org/10.1109/icsp48669.2020.9321009
2020-01-01
Abstract: Relation reasoning between objects plays a vital role in image captioning. We propose a joint common sense and relation reasoning model that generates more informative and diverse sentences for dense relational captioning. The model consists of two stages: region feature extraction and relational caption generation. The features of the object, union, and subject regions are encoded by a Region Proposal Network (RPN). Each of these features is then fed to its own stream of a triple-stream Long Short-Term Memory (LSTM) network, in which the object-predicate-subject categories serve as prior information that guides the word order of the generated descriptions. Moreover, we present a memory-augmented union operation in which the image features and union features are leveraged to learn common sense and relation reasoning. Experimental results on VG relationship captioning datasets demonstrate the validity of the joint common sense and relation reasoning model, which achieves competitive performance in dense relational captioning.
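As a rough illustration of the three regions named in the abstract, the sketch below builds a subject, object, and union box for one pair of detections. It assumes boxes in (x1, y1, x2, y2) pixel format and assumes the union region is the smallest box enclosing both subject and object; the abstract does not state either detail, and the names `union_box` and `region_triple` are hypothetical, not taken from the paper.

```python
from typing import Dict, Tuple

# Assumed box format: (x1, y1, x2, y2), top-left and bottom-right corners.
Box = Tuple[float, float, float, float]


def union_box(subject: Box, obj: Box) -> Box:
    """Return the smallest axis-aligned box that encloses both input boxes."""
    return (
        min(subject[0], obj[0]),
        min(subject[1], obj[1]),
        max(subject[2], obj[2]),
        max(subject[3], obj[3]),
    )


def region_triple(subject: Box, obj: Box) -> Dict[str, Box]:
    """Group the three regions that would feed the three streams (hypothetical helper)."""
    return {"subject": subject, "object": obj, "union": union_box(subject, obj)}


if __name__ == "__main__":
    triple = region_triple((10, 20, 50, 60), (40, 10, 90, 55))
    print(triple["union"])
```

In a full pipeline, each of the three boxes would be pooled into a feature vector before being passed to its own LSTM stream; that step is omitted here because the abstract gives no details about it.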