Phrase Grounding Algorithm Based on Transformer Multilevel Feature Fusion

Xiangdong Meng, Juxiang Zhou, Jianhou Gan, Jun Wang, Ken Chen
DOI: https://doi.org/10.2139/ssrn.4289646
2022-01-01
SSRN Electronic Journal
Abstract: Phrase grounding is the task of identifying the image target that a textual phrase refers to. At present, progress in phrase grounding is limited mainly by feature extraction and multimodal feature fusion. This study proposes a transformer-based multilevel feature fusion method. Fusing image and textual features at multiple levels helps the MFF (Multimodal Feature Fusion) model strengthen the correlation between image and text. In addition, position encoding is added to the visual feature extraction to tighten the association between images and texts and to improve the accuracy of phrase grounding. The proposed multimodal feature fusion method was experimentally verified on the Flickr30k Entities dataset. Compared with existing phrase grounding methods, the performance of our model improved significantly, by approximately 7%.
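The abstract names two ingredients: position encoding added to the visual features, and transformer-style fusion of image and text features. The sketch below illustrates one possible reading of a single fusion step, not the authors' MFF model. It assumes standard sinusoidal position encoding over a flattened feature map and one unparameterized self-attention pass over the concatenated visual and text tokens. The function names, the 7x7 feature map, the 64-dimensional features, and the random inputs are all hypothetical.

```python
import numpy as np


def sinusoidal_position_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Standard sine/cosine position encoding of shape (num_positions, dim).

    Assumes `dim` is even. This is the generic transformer encoding, used here
    as a stand-in because the abstract does not specify the exact scheme.
    """
    positions = np.arange(num_positions)[:, None]                    # (N, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # (dim/2,)
    angles = positions * freqs[None, :]                              # (N, dim/2)
    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles)  # even channels
    encoding[:, 1::2] = np.cos(angles)  # odd channels
    return encoding


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_once(visual: np.ndarray, text: np.ndarray):
    """One hypothetical fusion step over visual (Nv, d) and text (Nt, d) tokens.

    Adds position encoding to the visual tokens, concatenates both modalities,
    and applies scaled dot-product self-attention without learned projections.
    """
    visual = visual + sinusoidal_position_encoding(*visual.shape)
    tokens = np.concatenate([visual, text], axis=0)            # (Nv + Nt, d)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])      # pairwise similarity
    weights = softmax(scores, axis=-1)                         # each row sums to 1
    return weights @ tokens, weights


# Hypothetical inputs: a 7x7 visual feature map flattened to 49 tokens, 5 word tokens.
rng = np.random.default_rng(0)
visual_features = rng.normal(size=(49, 64))
text_features = rng.normal(size=(5, 64))

fused, attention = fuse_once(visual_features, text_features)
print(fused.shape, attention.shape)
```

A multilevel variant would repeat a step like this on features taken from several depths of the visual and textual encoders; how the paper combines those levels is not stated in the abstract, so that part is left out here.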