Multi-Modal Graph Aggregation Transformer for image captioning

Lizhi Chen, Kesen Li
DOI: https://doi.org/10.1016/j.neunet.2024.106813
2024-10-16
Abstract: Current image captioning methods directly encode detected target regions and recognize the objects in an image in order to describe it correctly. However, relying on regional features alone is unreliable, because they cannot convey contextual information such as the relationships between objects, and they lack object-predicate-level semantics. An effective model should incorporate multiple modalities and explore their interactions to help understand the image. We therefore introduce the Multi-Modal Graph Aggregation Transformer (MMGAT), which uses information from several image modalities to fill this gap. It first represents an image as a graph consisting of three sub-graphs, which depict the context grid, region, and semantic text modalities respectively. We then introduce three aggregators that guide message passing from one graph to another to exploit context across modalities and thereby refine the node features. The updated nodes provide better features for image captioning. We report strong CIDEr scores of 144.6% on MS-COCO and 80.3% on Flickr30k relative to state-of-the-art methods, and conduct a rigorous analysis to demonstrate the importance of each part of our design.
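To make the idea of message passing between sub-graphs concrete, the following is a minimal sketch and not the authors' implementation. It assumes each sub-graph is simply a list of node feature vectors and that an aggregator is a single dot-product attention step with a residual update. The function names, the node counts, the feature size, and the three aggregation directions are all hypothetical choices for illustration; the abstract does not specify them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(target, source):
    """Hypothetical aggregator: every target node attends over all source
    nodes and adds the attention-weighted message to its own features."""
    d = len(target[0])
    updated = []
    for t in target:
        # Scaled dot-product score between this target node and each source node.
        scores = [sum(a * b for a, b in zip(t, s)) / math.sqrt(d) for s in source]
        weights = softmax(scores)
        # Weighted sum of source features, one value per feature dimension.
        message = [sum(w * s[k] for w, s in zip(weights, source)) for k in range(d)]
        # Residual update keeps the original node information.
        updated.append([a + m for a, m in zip(t, message)])
    return updated


def random_nodes(n, d, rng):
    """Stand-in for real features: n nodes with d Gaussian features each."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d = 8                               # assumed feature size
grid = random_nodes(49, d, rng)     # assumed 7x7 context grid
region = random_nodes(10, d, rng)   # assumed 10 detected regions
semantic = random_nodes(5, d, rng)  # assumed 5 semantic text nodes

# Three illustrative aggregation directions (assumed, not from the paper).
region_ctx = aggregate(region, grid)                # grid -> region
region_refined = aggregate(region_ctx, semantic)    # semantic -> region
semantic_refined = aggregate(semantic, region_refined)  # region -> semantic
```

Each call leaves the number of target nodes and the feature size unchanged, so the refined nodes could be passed on to a caption decoder in place of the original features.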