Entity Matching by Pool-Based Active Learning

Youfang Han, Chunping Li
DOI: https://doi.org/10.3390/electronics13030559
IF: 2.9
2024-01-31
Electronics
Abstract: The goal of entity matching is to find the records that represent the same entity across different data sources. Among current mainstream methods, rule-based entity matching requires extensive domain knowledge. Machine-learning-based and deep-learning-based methods need a large number of labeled samples to build a model, which is difficult to obtain in some applications. In addition, learning-based methods are more prone to overfitting, so they place high demands on the quality of training samples. In this paper, we present an active learning method for entity matching tasks. The method requires manually labeling only a small number of valuable samples and uses these labeled samples to build a high-quality model. The paper proposes hybrid uncertainty as a query strategy to find those valuable samples for labeling, which minimizes the number of labeled training samples while still meeting the requirements of entity matching tasks. The proposed method is validated on seven data sets from different fields. The experiments show that it uses only a small number of labeled samples and achieves better results than existing approaches.
engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
This paper addresses several key issues in the entity matching task:

1. **Insufficient labeled samples**: In many real-world applications, a large number of labeled samples is hard to obtain. Manual labeling takes considerable human effort and time, so enough effective labels can rarely be collected in a short period.
2. **Data sample imbalance**: Entity matching usually exhibits extreme class imbalance: non-matching pairs far outnumber matching pairs. As a result, a binary classifier may see too few matching samples to learn them well.
3. **Low labeling efficiency**: Many entity pairs can be judged as matching or non-matching through simple comparison, so labeling them adds little value. Handing records from different data sets directly to experts for pairwise labeling would create a large amount of unnecessary work.
4. **Adaptability to new domains**: Deep learning methods based on pre-trained language models can achieve good matching results, but they struggle on new domains for which no suitable pre-trained model or domain-specific knowledge is available.

To address these issues, the authors propose a pool-based active learning method for training the entity matching model. The method aims to reach high accuracy while labeling only a small number of the most valuable samples. Specifically, it uses hybrid uncertainty as the query strategy to select the most valuable samples for labeling from the unlabeled data pool. The approach requires neither pre-trained language models nor complex preprocessing steps, and it can perform well with a small number of labeled samples.
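The pool-based loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not define hybrid uncertainty, so the sketch swaps in plain entropy-based uncertainty sampling as a stand-in query strategy. The names `train_logistic`, `binary_entropy`, `active_learning`, and `oracle`, as well as the two-feature similarity vectors and the synthetic labels, are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(model, x):
    """Probability that the record pair with similarity features x is a match."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logistic(X, y, epochs=300, lr=0.5):
    """Fit a small logistic regression with batch gradient descent."""
    model = ([0.0] * len(X[0]), 0.0)
    for _ in range(epochs):
        weights, bias = model
        errors = [predict_proba(model, xi) - yi for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / len(X)
                  for j in range(len(weights))]
        grad_b = sum(errors) / len(X)
        model = ([w - lr * g for w, g in zip(weights, grad_w)], bias - lr * grad_b)
    return model


def binary_entropy(p):
    """Uncertainty of a match probability p, in bits; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def active_learning(pool, oracle, seed_idx, rounds, batch_size):
    """Pool-based loop: train, score the unlabeled pool, query the oracle."""
    labeled = {i: oracle(i) for i in seed_idx}

    def fit():
        idx = sorted(labeled)
        return train_logistic([pool[i] for i in idx], [labeled[i] for i in idx])

    for _ in range(rounds):
        model = fit()
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        # Stand-in query strategy (assumption): most uncertain pairs first.
        unlabeled.sort(key=lambda i: binary_entropy(predict_proba(model, pool[i])),
                       reverse=True)
        for i in unlabeled[:batch_size]:
            labeled[i] = oracle(i)  # ask the human annotator for a label
    return fit(), labeled


if __name__ == "__main__":
    # Synthetic pool (assumption): each pair has two similarity scores in [0, 1].
    rng = random.Random(0)
    pool = [[rng.random(), rng.random()] for _ in range(300)]
    truth = [1 if (a + b) / 2 > 0.7 else 0 for a, b in pool]  # matches are the minority
    seeds = [truth.index(1), truth.index(0)]  # one match, one non-match
    model, labeled = active_learning(pool, lambda i: truth[i], seeds,
                                     rounds=10, batch_size=5)
    preds = [1 if predict_proba(model, x) >= 0.5 else 0 for x in pool]
    accuracy = sum(p == t for p, t in zip(preds, truth)) / len(pool)
    print(f"labeled {len(labeled)} of {len(pool)} pairs, pool accuracy {accuracy:.2f}")
```

In this sketch the oracle is a lookup into synthetic ground truth; in practice it would be a human annotator. Pairs the model already scores near 0 or 1 are never sent for labeling, which reflects the low-labeling-efficiency concern in item 3.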
In summary, the main goal of this paper is to effectively address issues such as insufficient labeled samples, data imbalance, and low labeling efficiency in the entity matching task through an active learning method, thereby improving the overall model performance.