Enhancing Entity Resolution with a hybrid Active Machine Learning framework: Strategies for optimal learning in sparse datasets

Mourad Jabrane, Hiba Tabbaa, Aissam Hadri, Imad Hafidi
DOI: https://doi.org/10.1016/j.is.2024.102410
IF: 3.18
2024-05-27
Information Systems
Abstract: When solving the problem of identifying similar records in different datasets (known as Entity Resolution or ER), one big challenge is the lack of enough labeled data. Such data is crucial for building strong machine learning models, but obtaining it can be expensive and time-consuming. Active Machine Learning (ActiveML) is a helpful approach because it selects the most useful data points to label and learn from. It relies on two main ideas: informativeness and representativeness. Typical ActiveML methods used in ER depend too heavily on just one of these ideas, which can make them less effective, especially when starting with very little data. Our research introduces a new hybrid method that uses both ideas together. We created two versions of this method, called DPQ and STQ, and tested them on eleven real-world datasets. The results showed that our new method improves ER by producing better scores, more stable models, and faster learning with less training data compared to existing methods.
computer science, information systems
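To make the two ideas in the abstract concrete, the following is a minimal sketch of a query strategy that weighs informativeness against representativeness. It is not the paper's DPQ or STQ method, whose details are not given here. It illustrates the generic density-weighted uncertainty idea under stated assumptions: informativeness is taken as the binary entropy of a model's predicted match probability, representativeness as the mean similarity of a candidate record pair to the rest of the unlabeled pool, and the names `uncertainty`, `density`, `select_query`, and `beta`, as well as the toy data, are all hypothetical.

```python
import math


def uncertainty(p):
    """Informativeness (assumed): binary entropy of match probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def density(i, vectors):
    """Representativeness (assumed): mean similarity of candidate i to the
    other candidates, with similarity = 1 / (1 + Euclidean distance)."""
    others = [v for j, v in enumerate(vectors) if j != i]
    if not others:
        return 0.0
    sims = [1.0 / (1.0 + math.dist(vectors[i], v)) for v in others]
    return sum(sims) / len(sims)


def select_query(probs, vectors, beta=1.0):
    """Return the index of the candidate pair to send for labeling.

    score = uncertainty * density ** beta
    beta = 0 reduces to pure uncertainty sampling.
    """
    scores = [
        uncertainty(p) * density(i, vectors) ** beta
        for i, p in enumerate(probs)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy pool (hypothetical): each vector holds attribute similarities of one
# record pair; each probability is a current model's match estimate.
vectors = [(0.9, 0.8), (0.5, 0.5), (0.52, 0.48), (0.1, 0.95)]
probs = [0.95, 0.55, 0.60, 0.50]

hybrid_choice = select_query(probs, vectors)               # index 1
uncertainty_only = select_query(probs, vectors, beta=0.0)  # index 3
print(hybrid_choice, uncertainty_only)
```

In this toy pool, pure uncertainty sampling picks the isolated pair at index 3 because its probability is exactly 0.5, while the combined score prefers index 1, which is almost as uncertain but lies in a denser region of the pool. This is the kind of trade-off the abstract refers to when it says that relying on a single criterion can hurt when little labeled data is available.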