Multimodal Entity Linking with Mixed Fusion Mechanism.

Gongrui Zhang, Chenghuan Jiang, Zhongheng Guan, Peng Wang
DOI: https://doi.org/10.1007/978-3-031-30675-4_45
2023-01-01
Abstract: Many efficient multimodal entity linking (MEL) methods have been developed in recent years, but most still suffer from two drawbacks. First, inconsistent encoding across modalities creates a semantic gap between them in the feature space, which hinders multimodal fusion. Second, previous attention-based multimodal fusion methods cannot handle noise efficiently. To address these issues, we propose Multimodal Encoder Representation from Transformers for Multimodal Entity Linking (Mert-MEL). First, we concatenate the contexts of mentions with the Wikidata abstracts of candidate entities to form the inputs. We then use transformer encoders to extract features from both textual and visual information, and employ contrastive learning to better align the two feature spaces. We also incorporate phrase-level text embeddings to obtain richer textual representations. Next, we combine global fusion and bottleneck fusion to integrate multimodal information and extract key information rather than noise. Finally, we feed the fused embeddings to an MEL head that predicts matching scores between the mention and the candidate entities, and link the mention to the candidate with the highest score. Experiments demonstrate that Mert-MEL substantially outperforms strong baselines on two MEL datasets.
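
Two steps in the abstract can be illustrated with a small sketch: aligning text and image feature spaces with a contrastive objective, and linking a mention to its highest-scoring candidate. The abstract does not specify the loss or the scoring head, so this sketch assumes an InfoNCE-style text-to-image loss over cosine similarities; the function names, the temperature value, and the candidate IDs are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce(text_embs, image_embs, temperature=0.07):
    """Text-to-image InfoNCE loss (an assumed stand-in for the paper's loss).

    Pair i (text_embs[i], image_embs[i]) is the positive; every other image
    in the batch serves as a negative for text i.
    """
    losses = []
    for i, text in enumerate(text_embs):
        logits = [cosine(text, image) / temperature for image in image_embs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def link_mention(candidate_scores):
    """Return the candidate ID with the highest matching score."""
    return max(candidate_scores, key=candidate_scores.get)


# Toy embeddings: aligned pairs should yield a lower loss than shuffled pairs.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
shuffled_images = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(texts, aligned_images)
shuffled_loss = info_nce(texts, shuffled_images)

# Hypothetical candidate IDs and scores, as an MEL head might produce them.
best = link_mention({"Q1": 0.2, "Q2": 0.9, "Q3": 0.5})
```

In this toy setup the loss falls as matching text and image embeddings move closer together, which is the alignment effect the abstract attributes to contrastive learning. The fusion stages (global and bottleneck) are omitted because the abstract gives no detail on how they are built.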