An Outlier-Detection Based Approach for Automatic Entity Matching
Feng-Feng FAN, Zhan-Huai LI, Qun CHEN, Hai-Long LIU
DOI: https://doi.org/10.11897/SP.J.1016.2017.02197
2017-01-01
Abstract: Entity matching, also known as record matching, is a key technique in data integration and data cleaning. Typical applications include matching commercial products across different websites and matching research paper records between the DBLP (Digital Bibliography & Library Project) and Scholar digital libraries. Widespread quality defects in real data, e.g., tuple errors, missing values, and representation diversity, make entity matching very challenging. Popular entity matching algorithms can be categorized into rule-based, probabilistic, and learning-based approaches. In e-commerce data, descriptions of the same product may vary greatly. For entity matching on datasets with such representation diversity, it is difficult to design effective matching rules, and training classification models remains challenging. To address this issue, this paper proposes an outlier-detection-based approach, denoted ODetec, for automatic entity matching. First, ODetec measures the similarity on the matching attributes for each record pair and maps the pairs to points in a feature space. Then it calculates the outlier distance of each record pair in that feature space. Finally, it ranks the pairs by their outlier distances and extracts the matching candidates that satisfy the matching constraints. In addition, ODetec can transform multiple correlated matching features into orthogonal principal components by Principal Component Analysis, which overcomes the conditional-independence assumption between attributes required by the Fellegi-Sunter model. It thus achieves better matching results and broader applicability. Extensive experiments on real datasets verify the effectiveness of the ODetec approach.
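The three steps described in the abstract (similarity features, outlier distances, constrained ranking) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the definition of outlier distance, or the matching constraints, so the sketch assumes token-level Jaccard similarity, a distance from the centroid in variance-normalized principal-component space, and a one-to-one matching constraint with a filter that keeps only outliers on the high-similarity side. All function names (`jaccard`, `feature_points`, `outlier_distances`, `extract_matches`) are hypothetical.

```python
import numpy as np


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0  # treat missing values as no evidence of similarity
    return len(ta & tb) / len(ta | tb)


def feature_points(left, right, attrs):
    """Map every cross-source record pair to a point of per-attribute similarities."""
    pairs = [(i, j) for i in range(len(left)) for j in range(len(right))]
    X = np.array([[jaccard(left[i][a], right[j][a]) for a in attrs]
                  for i, j in pairs])
    return pairs, X


def outlier_distances(X):
    """Distance from the centroid in whitened principal-component space.

    PCA is done with an SVD of the centered features; each component is
    scaled by its standard deviation so correlated features are not
    counted twice. This definition of outlier distance is an assumption.
    """
    Xc = X - X.mean(axis=0)
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = S > 1e-9 * S.max()            # drop degenerate components
    scores = Xc @ Vt[keep].T             # orthogonal principal components
    std = S[keep] / np.sqrt(max(len(X) - 1, 1))
    return np.sqrt(((scores / std) ** 2).sum(axis=1))


def extract_matches(pairs, dist, X):
    """Rank pairs by outlier distance and keep those meeting the constraints.

    Assumed constraints: each record matches at most once, and a pair must
    be more similar than average (outliers on the low side are skipped).
    """
    global_mean = X.mean()
    used_left, used_right, matches = set(), set(), []
    for idx in np.argsort(-dist):        # largest outlier distance first
        i, j = pairs[idx]
        if X[idx].mean() <= global_mean:
            continue
        if i in used_left or j in used_right:
            continue
        used_left.add(i)
        used_right.add(j)
        matches.append((i, j))
    return matches


if __name__ == "__main__":
    left = [{"title": "apple iphone 7 32gb black", "brand": "apple"},
            {"title": "samsung galaxy s8 64gb", "brand": "samsung"}]
    right = [{"title": "iphone 7 black 32gb", "brand": "apple"},
             {"title": "galaxy s8 samsung 64gb phone", "brand": "samsung"}]
    pairs, X = feature_points(left, right, ["title", "brand"])
    print(extract_matches(pairs, outlier_distances(X), X))
```

The sketch relies on the intuition that most record pairs are non-matches and cluster near zero similarity, so true matches appear as outliers. Whitening the principal components is one way to handle correlated similarity features without assuming they are conditionally independent.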