A Data Cleaning Method for CiteSeer Dataset

Yan Wang, Hao Zhang, Yaxin Li, Deyun Wang, Yanlin Ma, Tong Zhou, Jianguo Lu
DOI: https://doi.org/10.1007/978-3-319-48740-3_3
2016-01-01
Abstract: CiteSeer is considered the first academic search engine and has been serving data for almost twenty years. Recently, CiteSeer graciously made all of its data public, including raw PDF files, text converted from the PDFs, and metadata extracted from that text. Numerous efforts have been made to improve the accuracy of the metadata extraction, but the problem is inherently challenging and errors are abundant. In this paper, we propose an innovative record-linkage-based method for data cleaning, which uses two new matching algorithms to significantly improve cleaning performance on the CiteSeer dataset. One is an enhanced matching algorithm for local datasets; the other is developed for online datasets. Experimental results show that our method corrects 48.1% of the wrong metadata entries in total, an improvement of more than 539% over existing state-of-the-art data cleaning methods.
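The abstract does not describe the two matching algorithms, so the following is only a minimal sketch of the general idea behind record-linkage-based cleaning: link a noisy extracted record to a trusted reference record by title similarity, then take the metadata from the reference. The function names, the `title`/`year` fields, the 0.9 threshold, and the use of `difflib.SequenceMatcher` are all assumptions made for illustration, not the authors' method.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase the title and replace non-alphanumerics with single spaces."""
    kept = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(kept.split())


def best_match(noisy_title, reference_records, threshold=0.9):
    """Return the reference record with the most similar title, or None.

    The threshold is a hypothetical value; a real system would tune it.
    """
    target = normalize(noisy_title)
    best, best_score = None, 0.0
    for record in reference_records:
        candidate = normalize(record["title"])
        score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None


def clean_record(noisy, reference_records, threshold=0.9):
    """Overwrite noisy fields with those of the linked reference record.

    If no reference record is similar enough, the input is returned unchanged.
    """
    cleaned = dict(noisy)
    match = best_match(noisy["title"], reference_records, threshold)
    if match is not None:
        cleaned.update(match)
    return cleaned


if __name__ == "__main__":
    # Hypothetical reference collection and a record with extraction noise.
    reference = [{"title": "A Data Cleaning Method for CiteSeer Dataset", "year": 2016}]
    noisy = {"title": "A Data Cleaning Method For Citeseer Datase t", "year": 0}
    print(clean_record(noisy, reference))
```

A linear scan over all reference records is only workable for small collections. At the scale of the CiteSeer dataset, some form of indexing or blocking would be needed to limit the candidates compared for each record.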