Harvesting the Web for Linking and Imputing String Data in Databases
Zhixu Li
2013-01-01
Abstract:

The quality of data has become a crucial problem in data management. This PhD thesis focuses on two common research problems in processing low-quality string data. The first is the data linkage problem, which refers to combining data residing in different sources. If the data are clean and consistent everywhere, linking is trivial. However, data from different sources may conflict with each other, which makes linking a significant challenge. Previous approaches that rely on textual similarity functions sometimes fail, because strings referring to the same real-world entity can be syntactically far apart. The second is the data imputation problem, which refers to filling in missing attribute values in an incomplete data set. Data incompleteness is pervasive in data management, and missing values have to be filled in when the data are critical to a statistical analysis, business process, or scientific study. Previous work has been limited to obtaining the missing values from the complete part of the data set, which cannot solve the problem effectively.

Based on the premise that external sources on the World Wide Web hold information and knowledge that can help address these two data quality problems, we propose to extract the required information and knowledge from the Web for data linkage and imputation. However, harvesting this knowledge from the Web raises several challenging problems, which are what we tackle in this thesis. Our challenges and contributions can be summarized in four related parts:

First, relation extraction is important not only for data linkage, but also for traditional Question Answering systems and emerging big data applications.
A state-of-the-art relation extraction method that relies on syntactic patterns cannot achieve high recall because the patterns are too strict. We propose a novel Context-based Relation Extraction method, which learns context terms, instead of patterns, as the auxiliary information for retrieving an expression of a relation. With this method, we reach a much higher extraction recall than the conventional Pattern-based Relation Extraction method.

Second, we propose a web-based data imputation approach, WebPut, which formulates proper data imputation queries for each missing value in a local database and then automatically extracts the target missing value from the documents retrieved by those queries. By retrieving the missing attribute values from the Web, WebPut reaches a much higher imputation recall than previous table-based approaches.

Third, a well-studied challenge in web information extraction is extracting named entities from text. The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings in a given document, but its many redundancies make the AME process inefficient and degrade the performance of real-world applications that use the extracted substrings. We present a new type of dictionary-based entity recognition problem, Approximate Membership Localization (AML), which aims to locate non-overlapping substrings that better approximate the true matched substrings, without generating overlapping redundancies. To perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large part of the overlapping redundant matched substrings before generating them.
Compared with other algorithms for the AME problem, P-Prune outputs a better approximation to the true matched substrings without generating overlapping redundancies.

Fourth, semantic drift is a common problem in iterative web-based information extraction: as the iterations proceed, the extractions may shift from the target class to other classes. This phenomenon may also occur in our work, since we too rely on iterative extraction. Previous approaches to minimizing semantic drift have incurred a substantial loss in recall. We observe that most semantic drift is introduced by a small number of questionable extractions in the early rounds of the iterations. These extractions subsequently trigger a large number of questionable results, which lead to semantic drift. We call these questionable extractions Drifting Points (DPs), and we propose a method that minimizes semantic drift by identifying the DPs and removing the effect they introduce. In this way, we effectively cut off the propagation of errors in the iterative extraction process and clean a very large proportion of the semantic drift errors.
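
The idea of removing the effect a Drifting Point introduces can be illustrated with a minimal sketch. It is not the method developed in the thesis: it assumes that the DPs have already been identified by some other means, and that each extraction records the single earlier extraction that triggered it. The function name `remove_drift`, the `triggered_by` mapping, and the toy data are all hypothetical.

```python
from collections import deque


def remove_drift(triggered_by, drifting_points):
    """Drop each drifting point and everything it (transitively) triggered.

    triggered_by: dict mapping an extraction to the earlier extraction that
    triggered it, or None for a seed (an assumed, simplified provenance record).
    drifting_points: extractions already judged to be DPs (assumed given).
    Returns the surviving extractions in their original order.
    """
    # Invert the provenance record: parent -> extractions it triggered.
    children = {}
    for item, parent in triggered_by.items():
        children.setdefault(parent, []).append(item)

    # Breadth-first walk from each DP marks everything downstream as tainted.
    tainted = set()
    queue = deque(drifting_points)
    while queue:
        item = queue.popleft()
        if item in tainted:
            continue
        tainted.add(item)
        queue.extend(children.get(item, []))

    return [item for item in triggered_by if item not in tainted]


# Toy run: the target class is "cities"; the ambiguous "Victoria" pulls the
# later iterations toward Australian states.
triggered_by = {
    "Brisbane": None,
    "Sydney": "Brisbane",
    "Perth": "Sydney",
    "Victoria": "Sydney",
    "Tasmania": "Victoria",
    "Queensland": "Victoria",
}
kept = remove_drift(triggered_by, ["Victoria"])
```

In this toy run only the extractions downstream of the DP are discarded, while the correct extractions from the same rounds are kept, which is how cutting at the DPs can avoid the recall loss described above.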