Abstract: The goal of entity matching is to find records in different data sources that represent the same entity. Among current mainstream methods, rule-based entity matching requires extensive domain knowledge, while machine-learning-based and deep-learning-based methods need a large number of labeled samples to build a model, which is difficult to obtain in some applications. In addition, learning-based methods are prone to overfitting, so they place high demands on the quality of training samples. In this paper, we present an active learning method for entity matching tasks. The method requires manually labeling only a small number of valuable samples and uses these labeled samples to build a high-quality model. This paper proposes hybrid uncertainty as a query strategy to find the valuable samples for labeling, which minimizes the number of labeled training samples while still meeting the requirements of entity matching tasks. The proposed method is validated on seven data sets from different fields. The experiments show that the proposed method uses only a small number of labeled samples and achieves better results than existing approaches.

What problem does this paper attempt to address?

This paper primarily addresses several key issues in the entity matching task:

1. **Insufficient labeled samples**: In many real-world applications, obtaining a large number of labeled samples is very difficult. Manual labeling takes considerable human effort and time, so it is hard to acquire enough useful labels in a short period.
2. **Data sample imbalance**: The entity matching task usually exhibits extreme class imbalance, where non-matching samples far outnumber matching samples. In a binary classification setting, this imbalance can leave the matching class under-trained.
3. **Low labeling efficiency**: Many entity pairs can easily be judged as matching or non-matching through simple comparison, so labeling them adds little value. Handing records from different datasets directly to experts for pairwise labeling would create a large amount of unnecessary work.
4. **Adaptability to new domains**: Deep learning methods based on pre-trained language models can achieve good matching results, but they struggle in new domains where no suitable pre-trained model or domain-specific knowledge is available.

To address these issues, the authors propose a pool-based active learning method for training the entity matching model. The method aims to achieve higher accuracy while labeling only a small number of the most valuable samples. Specifically, it uses hybrid uncertainty as a query strategy to select the most valuable samples for labeling from the unlabeled data pool. This approach does not require pre-trained language models or complex preprocessing steps, and it can perform well with a small number of labeled samples.
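The pool-based loop described above can be illustrated with a minimal sketch. The hybrid uncertainty strategy is not specified in this summary, so the sketch substitutes plain entropy-based uncertainty sampling. It also assumes each candidate record pair has already been reduced to a single similarity score. The names `fit_logistic`, `predict`, `entropy`, `active_learning`, and the toy `pool` and `oracle` are all hypothetical and are not taken from the paper.

```python
import math


def fit_logistic(xs, ys, epochs=2000, lr=0.5):
    """Fit p(match | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def predict(model, x):
    """Predicted probability that a pair with similarity x is a match."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def entropy(p):
    """Binary entropy; it peaks at p = 0.5, where the model is least certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def active_learning(pool, oracle, seed_idx, budget):
    """Pool-based loop: train, score the unlabeled pool, query one pair."""
    labeled = {i: oracle(i) for i in seed_idx}

    def fit():
        idx = sorted(labeled)
        return fit_logistic([pool[i] for i in idx], [labeled[i] for i in idx])

    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        model = fit()
        # Ask the oracle about the pair the current model is least sure of.
        query = max(unlabeled, key=lambda i: entropy(predict(model, pool[i])))
        labeled[query] = oracle(query)
    return fit(), labeled


# Toy pool of similarity scores; the oracle stands in for a human annotator.
pool = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.90, 0.95]
oracle = lambda i: 1 if pool[i] >= 0.5 else 0
model, labeled = active_learning(pool, oracle, seed_idx=[0, 9], budget=4)
print(sorted(labeled))
```

Starting from one labeled match and one labeled non-match, the loop spends its labeling budget on pairs the current model finds ambiguous rather than on pairs that are easy to decide, which is the labeling-efficiency argument made in the list above.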
In summary, the main goal of this paper is to effectively address issues such as insufficient labeled samples, data imbalance, and low labeling efficiency in the entity matching task through an active learning method, thereby improving the overall model performance.

Entity Matching by Pool-Based Active Learning

Reserch of Entity Matching Based on Multiple Heterogenous Data

A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching

Deep entity matching with adversarial active learning

Enhancing Entity Resolution with a hybrid Active Machine Learning framework: Strategies for optimal learning in sparse datasets

Learning to Label with Active Learning and Reinforcement Learning.

Deep Reinforcement Learning for Entity Alignment

Low-resource Deep Entity Resolution with Transfer and Active Learning

ActiveEA: Active Learning for Neural Entity Alignment

Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching

Multi-Context Attention for Entity Matching.

Entity Matching Across Heterogeneous Sources

Active Deep Learning on Entity Resolution by Risk Sampling

Leveraging Large Language Models for Entity Matching

Liberal Entity Matching as a Compound AI Toolchain

On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach

GNEM: A Generic One-to-Set Neural Entity Matching Framework

Disambiguate Entity Matching using Large Language Models through Relation Discovery

Entity Alignment with Noisy Annotations from Large Language Models

Entity Matching using Large Language Models

Lambda: Learning Matchable Prior For Entity Alignment with Unlabeled Dangling Cases