Zhiwei Hu, Víctor Gutiérrez-Basulto, Ru Li, Jeff Z. Pan
Abstract: Multimodal entity linking (MEL) aims to link ambiguous mentions within multimodal contexts to corresponding entities in a multimodal knowledge base. Most existing approaches to MEL are based on representation learning or vision-and-language pre-training mechanisms for exploring the complementary effect among multiple modalities. However, these methods suffer from two limitations. On the one hand, they overlook the possibility of considering negative samples from the same modality. On the other hand, they lack mechanisms to capture bidirectional cross-modal interaction. To address these issues, we propose a Multi-level Matching network for Multimodal Entity Linking (M3EL). Specifically, M3EL is composed of three different modules: (i) a Multimodal Feature Extraction module, which extracts modality-specific representations with a multimodal encoder and introduces an intra-modal contrastive learning sub-module to obtain better discriminative embeddings based on uni-modal differences; (ii) an Intra-modal Matching Network module, which contains two levels of matching granularity: Coarse-grained Global-to-Global and Fine-grained Global-to-Local, to achieve local and global level intra-modal interaction; (iii) a Cross-modal Matching Network module, which applies bidirectional strategies, Textual-to-Visual and Visual-to-Textual matching, to implement bidirectional cross-modal interaction. Extensive experiments conducted on WikiMEL, RichpediaMEL, and WikiDiverse datasets demonstrate the outstanding performance of M3EL when compared to the state-of-the-art baselines.
What problem does this paper attempt to address?
This paper attempts to solve two main problems in Multimodal Entity Linking (MEL):
1. **Ignoring negative samples within the same modality**:
- Existing methods consider only cross-modal negative samples during contrastive learning (e.g., mismatched text-image pairs) and ignore negative samples within the same modality (e.g., differences between text descriptions or between images). As a result, the model cannot fully capture intra-modal semantic differences, which degrades the quality of the embeddings.
2. **Lack of a bidirectional cross-modal interaction mechanism**:
- Existing methods usually model only a one-way information flow (e.g., from text to image or from image to text) when processing cross-modal information. This limits how fully the model can exploit multimodal information and makes it difficult to capture the complex relationships between modalities.
To address these problems, the authors propose a Multi-level Matching network for Multimodal Entity Linking (M3EL), which consists of three modules:
- **Multimodal Feature Extraction module**:
- Uses the pre-trained CLIP model to extract modality-specific representations of text and images, and introduces an intra-modal contrastive learning sub-module to obtain more discriminative embeddings.
- **Intra-modal Matching Network module**:
- Contains two levels of matching granularity, Coarse-grained Global-to-Global and Fine-grained Global-to-Local matching, to achieve global- and local-level interaction within each modality.
- **Cross-modal Matching Network module**:
- Applies a bidirectional strategy (Textual-to-Visual and Visual-to-Textual matching) to achieve bidirectional cross-modal interaction and reduce the gap between the distributions of the different modalities.
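To make the matching granularities concrete, here is a minimal pure-Python sketch. It is illustrative only: this summary does not specify the scoring functions, so the cosine-based scores, the softmax-weighted pooling over local features, the averaging of the two directions, and all function names below are assumptions rather than the authors' actual design.

```python
import math


def cosine(x, y):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def global_to_global(query_global, target_global):
    """Coarse-grained score: compare two global (pooled) embeddings.

    Assumption: a plain cosine similarity stands in for the real module.
    """
    return cosine(query_global, target_global)


def global_to_local(query_global, target_locals):
    """Fine-grained score: compare one global embedding with local features.

    Assumption: local features are token or patch embeddings, and the
    per-local similarities are pooled with softmax weights so that the
    best-matching locals dominate the score.
    """
    sims = [cosine(query_global, v) for v in target_locals]
    top = max(sims)
    weights = [math.exp(s - top) for s in sims]  # numerically stable softmax
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def bidirectional_cross_modal(text_global, text_locals, visual_global, visual_locals):
    """Hypothetical bidirectional score: average both matching directions."""
    text_to_visual = global_to_local(text_global, visual_locals)
    visual_to_text = global_to_local(visual_global, text_locals)
    return 0.5 * (text_to_visual + visual_to_text)
```

The point of the sketch is the asymmetry it removes: a one-way model would compute only `text_to_visual` or only `visual_to_text`, whereas the bidirectional variant uses both.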
With these components, M3EL is designed to improve performance on the multimodal entity linking task. The authors report that it outperforms state-of-the-art baselines on the WikiMEL, RichpediaMEL, and WikiDiverse benchmark datasets.
### Formula Summary
1. **Intra-modal contrastive learning loss calculation**:
\[
L(T_i^e, T_i^m)=-\log\frac{\theta(T_i^e, T_i^m)}{\theta(T_i^e, T_i^m)+\beta\cdot\Phi_{inner}+\gamma\cdot\Phi_{inter}}
\]
where:
\[
\Phi_{inner}=\sum_{T_j^e\in N_e}\theta(T_i^e, T_j^e),\quad\Phi_{inter}=\sum_{T_j^m\in N_m}\theta(T_i^e, T_j^m)
\]
\[
\theta(x, y)=e^{\delta(x, y)/\tau},\quad\delta(x, y)\text{ is cosine similarity}
\]
2. **Final contrastive learning loss**:
\[
L_{cl}=\text{avg}\left(\sum_i [L(T_i^e, T_i^m)+L(T_i^m, T_i^e)+L(V_i^e, V_i^m)+L(V_i^m, V_i^e)]\right)
\]
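
The two formulas in this section can be sketched in pure Python as follows. Several details are assumptions, because this summary does not define them: the negative sets $N_e$ and $N_m$ are taken to be the other items in the same batch, $\tau$ is treated as a temperature, $\beta$ and $\gamma$ as scalar weights, the superscripts $e$ and $m$ as the entity and mention sides, and `avg` as the mean over the batch. The function names are hypothetical.

```python
import math


def cosine(x, y):
    """delta(x, y): cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def theta(x, y, tau):
    """theta(x, y) = exp(delta(x, y) / tau)."""
    return math.exp(cosine(x, y) / tau)


def pair_loss(anchors, positives, i, tau=0.1, beta=1.0, gamma=1.0):
    """L(anchors[i], positives[i]) from the first formula.

    Assumption: negatives are the other in-batch items. Phi_inner sums over
    the anchor's own side, Phi_inter over the positive's side.
    """
    n = len(anchors)
    pos = theta(anchors[i], positives[i], tau)
    phi_inner = sum(theta(anchors[i], anchors[j], tau) for j in range(n) if j != i)
    phi_inter = sum(theta(anchors[i], positives[j], tau) for j in range(n) if j != i)
    return -math.log(pos / (pos + beta * phi_inner + gamma * phi_inter))


def contrastive_loss(T_e, T_m, V_e, V_m, tau=0.1, beta=1.0, gamma=1.0):
    """L_cl: four symmetric terms per sample, averaged over the batch.

    Assumption: 'avg' in the second formula means the batch mean.
    """
    n = len(T_e)
    total = 0.0
    for i in range(n):
        total += pair_loss(T_e, T_m, i, tau, beta, gamma)
        total += pair_loss(T_m, T_e, i, tau, beta, gamma)
        total += pair_loss(V_e, V_m, i, tau, beta, gamma)
        total += pair_loss(V_m, V_e, i, tau, beta, gamma)
    return total / n
```

Under these assumptions, a batch whose entity and mention embeddings are correctly paired yields a lower loss than the same batch with the pairs swapped, which is the behaviour the $\Phi_{inner}$ and $\Phi_{inter}$ terms are meant to encourage.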