Text-Image Scene Graph Fusion for Multi-Modal Named Entity Recognition

Jian Cheng, Kaifang Long, Shuang Zhang, Tian Zhang, Lianbo Ma, Shi Cheng, Yinan Guo
DOI: https://doi.org/10.1109/tai.2023.3326416
2024-01-01
IEEE Transactions on Artificial Intelligence
Abstract: With the widespread use of social media platforms such as Twitter and Facebook, users post massive amounts of text and image content. As a result, multi-modal named entity recognition (MNER), the task of extracting named entities from multi-modal data, has become a research hotspot. Empirically, visual clues unrelated to the text may have an uncertain or even negative impact on named entity recognition, yet previous studies have ignored the relevance between the modalities. In this paper, to effectively measure the relationship between text data and visual cues and thereby improve the accuracy of named entity recognition, we propose a text-image scene graph fusion (TISGF) approach for MNER, with a text-image similarity assessment module (TISA) and a text-image fusion module (TIF). Specifically, we first construct two scene graphs (visual and textual) to exploit the joint features of objects and relations in text and image, and encode the two scene graphs separately using a specific encoder pair. In this way, we obtain both object-level and relationship-level cross-modal features. Subsequently, TISA computes the similarity of the image and text data and determines the proportion of visual information that will be retained for fusion. Finally, we use TIF to obtain a unified multi-modal representation of each word and predict the entity type using a CRF. Extensive experimental results on two public datasets demonstrate the effectiveness and competitiveness of the proposed method for the MNER task.
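The similarity-then-fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how TISA scores similarity or how TIF combines features, so the cosine similarity, the linear mapping to a gate in [0, 1], the additive fusion, and all function names below are assumptions chosen only to show the general idea of scaling visual features by text-image relevance.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retention_gate(text_vec, image_vec):
    """Hypothetical stand-in for TISA: map cosine similarity from [-1, 1] to [0, 1].

    The result plays the role of the proportion of visual information to keep.
    """
    return 0.5 * (cosine_similarity(text_vec, image_vec) + 1.0)


def fuse(word_vecs, image_vec, gate):
    """Hypothetical stand-in for TIF: add gate-scaled visual features to each word vector."""
    return [[w + gate * v for w, v in zip(word, image_vec)] for word in word_vecs]


# Toy example: two word vectors and one pooled image vector (made-up numbers).
words = [[1.0, 0.0], [0.0, 1.0]]
image = [0.0, 2.0]
sentence = [sum(col) / len(words) for col in zip(*words)]  # mean-pooled text vector

gate = retention_gate(sentence, image)
fused = fuse(words, image, gate)
```

With this design, an image whose features point away from the text receives a gate near 0 and contributes almost nothing to the fused word representations, while a closely related image is retained almost in full. In a full model the fused representations would then be passed to a sequence labeler such as a CRF.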