Duplicate Record Detection For Data Integration

Hongzhi Wang
DOI: https://doi.org/10.4018/978-1-4666-5198-2.ch014
2014-01-01
Abstract: In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, optimal bipartite graph matching is applied to their attributes, and the similarity is measured as the weight of that matching. Based on similarity estimation, the basic idea in this chapter is to estimate the range of the similarity between two records and to decide from that estimate whether they are duplicates. When data integration is performed on XML data, many problems arise because of the flexibility of XML. One current implementation uses Data Exchange to carry out the above operations. This chapter also proposes the concept of quality assurance mechanisms, in addition to data integrity and reliability.
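The matching-based similarity described in the abstract can be illustrated with a minimal sketch. This is not the chapter's algorithm: the attribute-level similarity (`difflib.SequenceMatcher`), the normalization by the larger attribute count, the brute-force search over permutations, and the names `value_similarity`, `record_similarity`, `is_duplicate`, and the `threshold` value are all assumptions made for illustration. The sketch also compares exact similarity against a threshold rather than estimating a similarity range as the chapter proposes.

```python
from difflib import SequenceMatcher
from itertools import permutations


def value_similarity(a: str, b: str) -> float:
    """Assumed attribute-level similarity in [0, 1] (not taken from the chapter)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a: dict, rec_b: dict) -> float:
    """Weight of the best one-to-one attribute matching, normalized to [0, 1].

    Attribute names are ignored, so records with different schemas can be
    compared. Brute force over permutations: suitable only for a handful
    of attributes per record.
    """
    vals_a = list(rec_a.values())
    vals_b = list(rec_b.values())
    # Make vals_a the shorter list so each of its values gets exactly one partner.
    if len(vals_a) > len(vals_b):
        vals_a, vals_b = vals_b, vals_a
    if not vals_a:
        return 0.0
    best = max(
        sum(value_similarity(a, b) for a, b in zip(vals_a, perm))
        for perm in permutations(vals_b, len(vals_a))
    )
    # Dividing by the larger attribute count lets unmatched attributes lower the score.
    return best / len(vals_b)


def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: duplicates if similarity reaches the threshold."""
    return record_similarity(rec_a, rec_b) >= threshold


if __name__ == "__main__":
    a = {"name": "John Smith", "city": "Boston"}
    b = {"town": "Boston", "full_name": "John Smith"}
    print(record_similarity(a, b), is_duplicate(a, b))
```

For records with many attributes, the permutation search would need to be replaced by a polynomial-time assignment algorithm such as the Hungarian method.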