Data Cleaning Based On Entity Resolution

Hongzhi Wang
DOI: https://doi.org/10.4018/978-1-4666-5198-2.ch012
2014-01-01
Abstract: Data quality is one of the most prevalent problems in data management. A traditional data management application typically concerns the creation, maintenance, and use of a large amount of data, focusing only on clean datasets. However, real-life data are often dirty: inconsistent, duplicated, inaccurate, incomplete, or out of date. As a result, the problem of reconciling facts drawn from a large amount of conflicting information, provided by various Web sites or by different data sources to be integrated, is receiving increasing attention. False data can produce misleading or biased analytical results and decisions, and can lead to loss of revenue, credibility, and customers. Building on the results of entity resolution, truth discovery plays an important role in modern data management applications. In this chapter, the authors review approaches to truth discovery as they relate to the central aspects of data quality (i.e., data consistency, data deduplication, data accuracy, data currency, and information completeness).