Aligned Visual Semantic Scene Graph for Image Captioning

Shanshan Zhao, Lixiang Li, Haipeng Peng
DOI: https://doi.org/10.1016/j.displa.2022.102210
IF: 3.074
2022-01-01
Displays
Abstract: Image captioning is a multi-modal task that describes an image in natural language. Many state-of-the-art methods adopt an encoder–decoder architecture and encode the image either with convolutional neural networks or with a structured semantic scene graph that contains object, relationship, and attribute information. The image scene graphs constructed by existing scene graph generation models are generally too noisy. To alleviate this problem, we propose a multi-level cross-modal alignment (MCA) module that aligns the image scene graph with the sentence scene graph at different levels. MCA filters out redundant information in the image scene graph according to the sentence scene graph and provides commonsense knowledge to the decoder. In addition to the semantic relationships, we use the bounding boxes of the visual objects to compute implicit spatial relationships among the detected objects. The decoder fuses the aligned scene graph features and the implicit spatial relationship information via dynamic mixture attention and translates them into descriptions. Extensive experiments on the MSCOCO dataset yield promising results compared with state-of-the-art methods, which verifies the effectiveness of our method.
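The abstract does not say how the implicit spatial relationships are derived from the bounding boxes. As a rough illustration only, the sketch below computes one common kind of pairwise box geometry (normalized center offsets, log size ratios, and intersection over union). The function name `box_geometry`, the `(x1, y1, x2, y2)` box format, and the choice of features are all assumptions made for this example and are not taken from the paper.

```python
import math


def box_geometry(box_a, box_b):
    """Relative geometry of box_b with respect to box_a.

    Hypothetical helper: boxes are (x1, y1, x2, y2) with positive width
    and height. The feature set is an assumption, not the paper's.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Widths, heights, and centers of both boxes.
    aw, ah = ax2 - ax1, ay2 - ay1
    bw, bh = bx2 - bx1, by2 - by1
    acx, acy = ax1 + aw / 2, ay1 + ah / 2
    bcx, bcy = bx1 + bw / 2, by1 + bh / 2

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter

    return {
        "dx": (bcx - acx) / aw,       # center offset, scaled by box_a width
        "dy": (bcy - acy) / ah,       # center offset, scaled by box_a height
        "log_w": math.log(bw / aw),   # log width ratio
        "log_h": math.log(bh / ah),   # log height ratio
        "iou": inter / union if union > 0 else 0.0,
    }
```

Calling `box_geometry((0, 0, 2, 2), (1, 1, 3, 3))` describes a same-sized box shifted down and to the right by half the reference box's width and height. A feature vector like this could then be fed to an attention module alongside the scene graph features, though how the paper combines them is not specified here.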