VGGAN: Visual Grounding GAN Using Panoptic Transformers

Fengnan Quan, Bo Lang
DOI: https://doi.org/10.1109/icivc58118.2023.10270121
2023-01-01
Abstract: Visual grounding is an important part of image annotation generation. Existing methods usually rely on data alignment based on similarity calculations between visual and text features for location inference and multi-modal fusion, which loses some visual and text information and makes the model more likely to overfit data from specific scenes. To solve this problem, we propose a Visual Grounding Generative Adversarial Network (VGGAN) that performs visual-text fusion using a panoptic transformer. We use the generative adversarial network to generate predictions and judge their accuracy, and we design the visual-text transformer according to panoptic theory. The model retains feature information and realizes full interaction between features, thereby better supporting the fusion of visual and text features. Experimental results on the COCO dataset of complex daily scenes verify the effectiveness of our model, which achieves the highest prediction accuracy among the state-of-the-art methods compared.