Evaluating Blocking Biases in Entity Matching

Mohammad Hossein Moslemi, Harini Balamurugan, Mostafa Milani
2024-09-25
Abstract: Entity Matching (EM) is crucial for identifying equivalent data entities across different sources, a task that becomes increasingly challenging with the growth and heterogeneity of data. Blocking techniques, which reduce the computational complexity of EM, play a vital role in making this process scalable. Despite advancements in blocking methods, the issue of fairness, where blocking may inadvertently favor certain demographic groups, has been largely overlooked. This study extends traditional blocking metrics to incorporate fairness, providing a framework for assessing bias in blocking techniques. Through experimental analysis, we evaluate the effectiveness and fairness of various blocking methods, offering insights into their potential biases. Our findings highlight the importance of considering fairness in EM, particularly in the blocking phase, to ensure equitable outcomes in data integration tasks.
Machine Learning, Databases
What problem does this paper attempt to address?
The paper addresses fairness in the blocking methods used during Entity Matching (EM). Specifically:

- **Core Issue**: Entity matching identifies equivalent entities across different data sources, and its computational cost rises as the data grows. Blocking techniques are commonly used to reduce the number of comparisons and thereby improve scalability. However, existing blocking methods may unintentionally favor certain demographic groups, leading to unfair results.
- **Research Objective**: The paper extends traditional blocking evaluation metrics to include fairness considerations, providing a framework for assessing bias in blocking techniques. Its experiments analyze the effectiveness and fairness of various blocking methods, revealing potential biases and emphasizing the importance of considering fairness in EM tasks.
- **Main Contribution**: The paper proposes a new evaluation method that introduces disparity metrics (such as the RR gap and the PC gap) to quantify differences in blocking quality among demographic groups, thereby identifying potential unfairness. This supports the development of more accurate and fairer blocking methods.
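
To make the disparity metrics concrete, here is a minimal Python sketch of how an RR gap and a PC gap might be computed. The summary above names these metrics but does not define them, so the sketch relies on the standard blocking definitions of reduction ratio (RR, the share of all possible pairs that blocking discards) and pair completeness (PC, the share of true matches that survive blocking). Three choices are assumptions made for illustration and are not taken from the paper: a pair counts as a minority pair if either record belongs to the minority group, each gap is the majority value minus the minority value, and every name and record in the toy data is hypothetical.

```python
from itertools import product


def reduction_ratio(candidates, all_pairs):
    """Standard RR: share of all possible pairs that blocking discards."""
    return 1 - len(candidates) / len(all_pairs)


def pair_completeness(candidates, matches):
    """Standard PC: share of true matches kept among the candidates."""
    return len(candidates & matches) / len(matches)


def split_by_group(pairs, group_of, minority):
    """Assumed rule: a pair is a minority pair if either record is minority."""
    minority_pairs = {
        (a, b) for a, b in pairs
        if group_of[a] == minority or group_of[b] == minority
    }
    return pairs - minority_pairs, minority_pairs


def blocking_gaps(left, right, candidates, matches, group_of, minority):
    """Per-group RR and PC, plus gaps (assumed: majority minus minority)."""
    all_pairs = set(product(left, right))
    all_maj, all_min = split_by_group(all_pairs, group_of, minority)
    cand_maj, cand_min = split_by_group(candidates, group_of, minority)
    match_maj, match_min = split_by_group(matches, group_of, minority)

    rr_maj = reduction_ratio(cand_maj, all_maj)
    rr_min = reduction_ratio(cand_min, all_min)
    pc_maj = pair_completeness(cand_maj, match_maj)
    pc_min = pair_completeness(cand_min, match_min)
    return {
        "rr_majority": rr_maj, "rr_minority": rr_min, "rr_gap": rr_maj - rr_min,
        "pc_majority": pc_maj, "pc_minority": pc_min, "pc_gap": pc_maj - pc_min,
    }


# Hypothetical toy data: two sources with four records each.
left = ["a1", "a2", "a3", "a4"]
right = ["b1", "b2", "b3", "b4"]
group_of = {
    "a1": "maj", "a2": "maj", "b1": "maj", "b2": "maj",
    "a3": "min", "a4": "min", "b3": "min", "b4": "min",
}
matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
# Pairs a hypothetical blocker kept; it misses the minority match (a4, b4).
candidates = {("a1", "b1"), ("a2", "b2"), ("a1", "b2"), ("a3", "b3"), ("a2", "b4")}

gaps = blocking_gaps(left, right, candidates, matches, group_of, minority="min")
for name, value in gaps.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the blocker keeps every majority match but only one of the two minority matches, so the PC gap is positive. A gap near zero would suggest the blocker treats both groups similarly on that metric. The sketch assumes each group has at least one possible pair and one true match; otherwise the ratios are undefined.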