A Duplicate Records Identification Model for Deep Web Data Sources

De-rong SHEN, Li-nan LIU, Yue KOU, Tie-zheng NIE, Ge YU
2010-01-01
Tien Tzu Hsueh Pao/Acta Electronica Sinica
Abstract: Duplicate records are distinct records that describe the same real-world entity. Because records extracted from different Deep Web sources in the same domain often contain duplicates, this paper focuses on duplicate record identification. It proposes an identification model built on a known global schema and the relationships between that global schema and the interface attributes of each Deep Web data source. For the semi-structured data extracted from Deep Web data sources, a query probing method annotates the attributes to which the data match, and the dominance of each global-schema attribute is determined by analyzing the extracted instance data. Multiple estimators and multiple similarity algorithms are then used to identify the duplicates. Experimental results show that the proposed duplicate record identification model is feasible and efficient.
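As a rough illustration of the general idea of combining several similarity algorithms under per-attribute weights, the following is a minimal sketch and not the model described in the paper. The function names, the choice of token Jaccard and `difflib` sequence similarity, the use of `max` to combine them, the example weights standing in for attribute dominance, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two attribute values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty values are treated as identical
    return len(ta & tb) / len(ta | tb)


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarities.

    `weights` is a hypothetical stand-in for attribute dominance:
    a larger weight means the attribute counts more toward the score.
    """
    total = sum(weights.values())
    score = 0.0
    for attr, w in weights.items():
        v1, v2 = r1.get(attr, ""), r2.get(attr, "")
        # Assumption: take the more favourable of the two similarity measures.
        score += w * max(jaccard(v1, v2), edit_ratio(v1, v2))
    return score / total


def is_duplicate(r1: dict, r2: dict, weights: dict, threshold: float = 0.8) -> bool:
    """Flag a pair as duplicates when the weighted score reaches the threshold."""
    return record_similarity(r1, r2, weights) >= threshold


if __name__ == "__main__":
    weights = {"title": 3, "author": 2, "year": 1}  # illustrative values only
    a = {"title": "Data Integration", "author": "Smith", "year": "2010"}
    b = {"title": "Data  Integration", "author": "smith", "year": "2010"}
    print(record_similarity(a, b, weights), is_duplicate(a, b, weights))
```

In practice the weights and threshold would have to be derived from the data rather than fixed by hand, which is the role the abstract assigns to instance analysis and the estimators.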