RIVA: A Pre-trained Tweet Multimodal Model Based on Text-image Relation for Multimodal NER.

Lin Sun, Jiquan Wang, Yindu Su, Fangsheng Weng, Yuxuan Sun, Zengwei Zheng, Yuanyi Chen
DOI: https://doi.org/10.18653/v1/2020.coling-main.168
2020-01-01
Abstract: Multimodal named entity recognition (MNER) for tweets has received increasing attention recently. Most multimodal methods use attention mechanisms to capture text-related visual information. However, unrelated or weakly related text-image pairs account for a large proportion of tweets, and visual clues unrelated to the text can have uncertain or even negative effects on multimodal model learning. In this paper, we propose a novel pre-trained multimodal model for tweets based on Relationship Inference and Visual Attention (RIVA). The RIVA model controls the attention-based visual clues with a gate that reflects the role of the image in the semantics of the text. We use a teacher-student semi-supervised paradigm to leverage a large unlabeled multimodal tweet corpus together with a labeled data set for text-image relation classification. In the multimodal NER task, the experimental results show the significance of text-related visual features for the visual-linguistic model, and our approach achieves state-of-the-art (SOTA) performance on the MNER datasets.
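The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `gated_visual_context`), the use of plain dot-product attention, and the treatment of the text-image relation as a single scalar score in [0, 1] are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_visual_context(text_vec, region_feats, relation_score):
    """Attend over image regions with a text query, then gate the result.

    text_vec:       text representation used as the attention query (assumed)
    region_feats:   one feature vector per image region (assumed)
    relation_score: hypothetical scalar in [0, 1]; 0 means the image is
                    unrelated to the text, 1 means it is fully related
    """
    # Dot-product attention score between the text query and each region.
    scores = [sum(t * r for t, r in zip(text_vec, region))
              for region in region_feats]
    weights = softmax(scores)

    # Attention-weighted sum of region features (the "visual clue").
    dim = len(region_feats[0])
    context = [sum(w * region[d] for w, region in zip(weights, region_feats))
               for d in range(dim)]

    # The gate scales the visual clue by how related the image is to the text.
    return [relation_score * c for c in context]
```

With a relation score of 0 the visual context is suppressed entirely, so an unrelated image contributes nothing to the downstream text model; with a score of 1 the attended visual features pass through unchanged.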