Abstract: Entity matching, also known as record matching, is a key technique in data integration and data cleaning. Typical applications include matching commercial products across different websites and matching research paper records between the DBLP (Digital Bibliography & Library Project) and Scholar digital libraries. Data quality defects that are widespread in real data, e.g., tuple errors, missing values, and representation diversity, make entity matching considerably more challenging. Popular entity matching algorithms can be categorized into rule-based, probabilistic, and learning-based approaches. In e-commerce data, descriptions of the same product may vary greatly. For entity matching on datasets with such representation diversity, it is difficult to design effective matching rules, and training classification models remains challenging. To address this issue, this paper proposes an outlier-detection-based approach, denoted ODetec, for automatic entity matching. First, ODetec measures the similarities on the matching attributes for each record pair and maps the pairs to points in a feature space. Then it calculates the outlier distance of each record pair in the feature space. Finally, it ranks the pairs by their outlier distances and extracts the matching candidates that meet the matching constraints. In addition, ODetec can transform multiple correlated matching features into orthogonal principal components by Principal Component Analysis, which removes the conditional-independence assumption between attributes required by the Fellegi-Sunter model. It thus achieves better effectiveness and broader applicability. Extensive experiments on real datasets verify the effectiveness of the ODetec approach.
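
The three steps named in the abstract (similarity features, outlier distance, ranked extraction under constraints) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the outlier distance, or the matching constraints, so the token-level Jaccard similarity, the standardized distance from the centroid, and the greedy one-to-one constraint below are all assumptions chosen for illustration. The Principal Component Analysis step is omitted, and the sample records are hypothetical.

```python
import math

def jaccard(a, b):
    """Token-level Jaccard similarity (assumed measure, not from the paper)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def feature_points(left, right, attrs):
    """Step 1: map every record pair to a point of per-attribute similarities."""
    return {
        (i, j): [jaccard(l[a], r[a]) for a in attrs]
        for i, l in enumerate(left)
        for j, r in enumerate(right)
    }

def outlier_distances(points):
    """Step 2: standardized Euclidean distance of each point from the centroid.

    Assumption: most pairs are non-matches, so the centroid sits near the
    non-matching cluster. Pairs that are not more similar than average
    (non-positive summed z-score) are given a score of 0.
    """
    vecs = list(points.values())
    n, d = len(vecs), len(vecs[0])
    mean = [sum(v[k] for v in vecs) / n for k in range(d)]
    # Fall back to 1.0 when a feature is constant, to avoid division by zero.
    std = [math.sqrt(sum((v[k] - mean[k]) ** 2 for v in vecs) / n) or 1.0
           for k in range(d)]
    scores = {}
    for pair, v in points.items():
        z = [(v[k] - mean[k]) / std[k] for k in range(d)]
        dist = math.sqrt(sum(x * x for x in z))
        scores[pair] = dist if sum(z) > 0 else 0.0
    return scores

def extract_matches(scores, min_score=0.0):
    """Step 3: rank pairs by outlier distance, then greedily keep pairs
    under an assumed one-to-one matching constraint."""
    used_left, used_right, matches = set(), set(), []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), score in ranked:
        if score > min_score and i not in used_left and j not in used_right:
            matches.append((i, j))
            used_left.add(i)
            used_right.add(j)
    return matches

# Hypothetical product records from two sources.
left = [
    {"name": "apple iphone 12 64gb black", "brand": "apple"},
    {"name": "samsung galaxy s21 128gb", "brand": "samsung"},
    {"name": "sony wh-1000xm4 headphones", "brand": "sony"},
]
right = [
    {"name": "sony wh-1000xm4 wireless headphones", "brand": "sony"},
    {"name": "iphone 12 black 64gb", "brand": "apple"},
    {"name": "galaxy s21 128gb phone", "brand": "samsung"},
]

points = feature_points(left, right, ["name", "brand"])
scores = outlier_distances(points)
matches = extract_matches(scores)
print(matches)  # list of (left_index, right_index) pairs, best-ranked first
```

The sketch needs no labeled training data and no hand-written matching rules, which is the property the abstract emphasizes. A fuller version would apply Principal Component Analysis to the similarity vectors before computing distances, so that correlated features such as name and brand similarity are not counted twice.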

Duplicate Record Detection For Data Integration

Efficient Duplicate Record Detection Based on Similarity Estimation

Automatic Web-based duplicate attribute resolution method

Assessing Data Quality Within Available Context

An Integrated Approach for Detecting Approximate Duplicate Records

Duplicate Identification in Deep Web Data Integration

Inconsistency Detection in Distributed Big Data

The Interaction Between Schema Matching and Record Matching in Data Integration

A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

Efficient Similarity Joins for Near-Duplicate Detection

Identification of Approximately Duplicate Material Records in ERP Systems

A Unified Record Linkage Strategy for Web Service Data

A statistical approach to instance-level schema matching

An Outlier-Detection Based Approach for Automatic Entity Matching

Research of Matching Technology in Data Integration

An n-gram-based approach for detecting approximately duplicate database records

Matching Heterogeneous Event Data

Automatic Web-based relational data imputation

Data De-duplication on Similar File Detection

Most Possible Partition: Utilizing Semantic Links for Duplicate Detection

Entity Resolution On Single Relation