Graph Alignment Transformer for More Grounded Image Captioning

Canwei Tian, Haiyang Hu, Zhongjin Li
DOI: https://doi.org/10.1109/iiotbdsc57192.2022.00028
2022-01-01
Abstract: The Industrial Internet of Things (IIoT) generates massive amounts of data, which are the cornerstone for companies seeking to increase productivity and provide reliable services. Predictive and in-depth analysis of these data can identify weaknesses and guide improvements, but how to analyze them efficiently, effectively, and safely remains to be explored. We aim to exploit these data with deep learning methods, and image captioning is one such meaningful task. Image captioning aims to describe a given image in natural language. It is widely believed that mining relationships between objects improves reasoning performance. Existing methods often extract general relational expressions from a separate visual relationship benchmark, which usually introduces redundant connections between region pairs. In this paper, we propose a novel Graph Alignment Transformer (GAT) that models visual relationships in an unsupervised way to build multimodal representations. Without pre-training to obtain explicit relational expressions, our model still achieves comparable results. Furthermore, we design a Graph Alignment (GA) module that explores semantic and visual alignment at the node level and the graph level, leading to more accurate captions. We evaluate our method on the benchmark MSCOCO image captioning dataset and conduct ablation studies to investigate its effectiveness both quantitatively and qualitatively. Compared with state-of-the-art methods, our proposed approach yields impressive results.