Tianshu Wang, Hongyu Lin, Cheng Fu, Xianpei Han, Le Sun, Feiyu Xiong, Hui Chen, Minlong Lu, Xiuwen Zhu
Abstract: Entity matching (EM) is the most critical step for entity resolution (ER). While current deep learning-based methods achieve very impressive performance on standard EM benchmarks, their real-world application performance is much more frustrating. In this paper, we highlight that this gap between reality and ideality stems from the unreasonable benchmark construction process, which is inconsistent with the nature of entity matching and therefore leads to biased evaluations of current EM approaches. To this end, we build a new EM corpus and re-construct EM benchmarks to challenge critical assumptions implicit in the previous benchmark construction process by changing, step by step, the restricted entities, balanced labels, and single-modal records in previous benchmarks into open entities, imbalanced labels, and multimodal records in an open environment. Experimental results demonstrate that the assumptions made in the previous benchmark construction process do not hold in the open environment; they conceal the main challenges of the task and therefore significantly overestimate the current progress of entity matching. The constructed benchmarks and code are publicly released.
What problem does this paper attempt to address?
This paper addresses the gap between existing benchmarks and practical application scenarios in entity matching (EM). The authors point out that current deep-learning-based methods achieve very impressive performance on standard EM benchmarks, yet perform disappointingly in real-world applications. They attribute this gap to unreasonable benchmark construction processes that are inconsistent with the nature of entity matching and therefore bias the evaluation of current EM methods. To bridge the gap, the authors reconstruct the EM benchmarks, gradually replacing the restricted entities, balanced labels, and single-modality records of existing benchmarks with open entities, imbalanced labels, and multi-modality records, so that the benchmarks better reflect an open environment.
### Main Problems
1. **Restricted Entity Assumption**:
   - Most of the entity clusters and records in existing benchmarks are covered by the training set, which is inconsistent with the real world, where a large number of entity clusters and records are unseen. Therefore, existing benchmarks cannot effectively evaluate the performance of entity matchers in an open environment.
2. **Balanced Label Assumption**:
   - In existing benchmarks, the numbers of matching and non-matching pairs are relatively close, whereas in real-world applications non-matching pairs vastly outnumber matching ones (for example, a matching-to-non-matching ratio of 1:100). This imbalance is one of the most critical challenges in entity matching, but existing benchmarks fail to reflect it.
3. **Single Modality Assumption**:
   - Existing benchmarks mainly focus on text attributes and ignore other modalities (such as images and audio). In an open environment, multi-modality information can significantly improve entity matching performance, but existing benchmarks cannot evaluate this.
### Solutions
- **Construction of a New Dataset**:
  - The authors constructed a multi-modality dataset containing more than 120,000 records, covering 10,000 products. Each record contains high-quality image attributes.
- **Construction of a New Benchmark**:
  - By gradually removing the above three assumptions, the authors constructed four benchmark settings:
    - **Open Matching (OM)**: The test set contains completely unseen entity clusters and records.
    - **Cluster-focused Matching (CFM)**: The test set contains unseen records in seen entity clusters.
    - **Record Linking (RL)**: Each test pair contains one seen record and one unseen record.
    - **Standard Setting (Vanilla)**: Follows the existing benchmark construction standards.
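
The settings above differ in whether the records and entity clusters of a test pair were seen during training. The following is a rough sketch of how a test pair could be bucketed under that reading; it is not the authors' construction code. `Record`, `categorize_pair`, and the returned labels are hypothetical names, and the `both_seen` and `mixed` buckets are assumptions added only so that every pair receives a label.

```python
from typing import NamedTuple, Set


class Record(NamedTuple):
    record_id: str   # hypothetical identifier of a single record
    cluster_id: str  # hypothetical identifier of the entity the record refers to


def categorize_pair(a: Record, b: Record,
                    seen_records: Set[str], seen_clusters: Set[str]) -> str:
    """Label a test pair by how much of it appeared in the training set."""
    a_seen = a.record_id in seen_records
    b_seen = b.record_id in seen_records

    if a_seen and b_seen:
        # Assumption: pairs built only from training records, which a
        # standard-style split does not exclude.
        return "both_seen"
    if a_seen != b_seen:
        # Exactly one record was seen during training.
        return "record_linking"

    # Both records are unseen; look at their entity clusters.
    a_cluster_seen = a.cluster_id in seen_clusters
    b_cluster_seen = b.cluster_id in seen_clusters
    if not a_cluster_seen and not b_cluster_seen:
        return "open_matching"
    if a_cluster_seen and b_cluster_seen:
        return "cluster_focused"
    # Assumption: one seen and one unseen cluster is not covered by the summary.
    return "mixed"


seen_records = {"r1", "r2"}
seen_clusters = {"c1"}
print(categorize_pair(Record("r9", "c9"), Record("r8", "c8"), seen_records, seen_clusters))
print(categorize_pair(Record("r7", "c1"), Record("r6", "c1"), seen_records, seen_clusters))
print(categorize_pair(Record("r1", "c1"), Record("r9", "c9"), seen_records, seen_clusters))
```

Under these assumptions the three calls fall into the open-matching, cluster-focused, and record-linking buckets respectively.
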
### Experimental Results
- **Restricted Entity Assumption**:
  - When the restricted entity assumption is removed, model performance in the open-matching scenario drops significantly, with the F1 score falling from nearly 90% to about 67%.
- **Balanced Label Assumption**:
  - When the ratio of matching to non-matching pairs in the test set is set to 1:100, model performance drops significantly, especially in the open-matching benchmark, where the F1 score falls to 14.52%.
- **Single Modality Assumption**:
  - After introducing visual attributes, model performance in open-cluster and unbalanced settings improves significantly, with the F1 score increasing by 7 to 11 percentage points. The multi-modality model even achieves an F1 score improvement of more than 40% in some categories.
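
One reason label imbalance hurts F1 can be seen with simple arithmetic: if a matcher keeps the same recall and the same false-positive rate, the number of false positives grows with the number of non-matching pairs, so precision falls. The sketch below illustrates this; the function name `f1_under_imbalance` and the recall and false-positive-rate values are illustrative assumptions, not figures from the paper.

```python
def f1_under_imbalance(recall: float, false_positive_rate: float,
                       negatives_per_positive: float) -> float:
    """F1 of a matcher with fixed recall and false-positive rate.

    Counts are expressed per one matching pair, so `negatives_per_positive`
    is the number of non-matching pairs for each matching pair.
    """
    true_pos = recall                                       # matches found
    false_pos = false_positive_rate * negatives_per_positive  # non-matches flagged
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (assumed, not taken from the paper).
for ratio in (1, 10, 100):
    f1 = f1_under_imbalance(recall=0.9, false_positive_rate=0.05,
                            negatives_per_positive=ratio)
    print(f"1:{ratio} -> F1 = {f1:.3f}")
```

With these assumed values, F1 goes from roughly 0.92 at 1:1 to roughly 0.26 at 1:100, even though the matcher itself is unchanged.
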
### Conclusion
The paper reveals three implicit assumptions in existing EM benchmarks and, by constructing a new multi-modality dataset and new benchmarks, shows how these assumptions mask the key challenges in entity matching. The experimental results show that existing benchmarks significantly overestimate the performance of current EM methods, and that the new benchmarks better reflect real-world entity-matching challenges.