Towards High Quality Semantic Web Data: Detecting Abnormal Data on the Semantic Web
Jeffery D. Heflin, Yang Yu
2012-01-01
Abstract: In an information system that spans multiple distributed, autonomous data sources, data quality is a problem intrinsic to any integration architecture, because the data providers control both their source content and how they describe it. Low-quality data is also a pressing problem for consumers of distributed information. Because recent developments in the Semantic Web suggest that it may be possible to rethink information integration, data quality research on the Semantic Web is a promising way to address quality issues in distributed information systems. Correctness is often used synonymously with data quality. The goal of this work is to design algorithms that detect erroneous Semantic Web data by identifying abnormality, because abnormal data is indicative of quality issues. Such algorithms would be useful in many scenarios, e.g., filtering query results derived from low-quality data, providing input for other assessments (e.g., trust), and improving the quality of the data in integrated systems. One means of assessing quality is finding corroborations: an axiom that can be entailed by other axioms is more likely to be correct, because the entailment can be seen as a corroboration. Similarly, if we have probabilistic rules that are valid for most data or for verified data, and a statement is entailed by these rules, then that statement is more likely to be correct than one that cannot be entailed. Based on these ideas, I developed the following algorithms. Using a reference data set that is assumed to contain few errors and for which the closed world assumption is valid, the first algorithm learns to classify data into categories, one for each type of error that an object property triple could have. The second algorithm relaxes the closed world assumption, i.e., statements that do not exist in the data cannot be assumed false.
Without learning from a reference data set in advance, the third algorithm discovers patterns similar to the ones used in the previous systems, but relaxes the assumption of “few errors”. It improves on the previous systems in three respects: (1) the pattern search is more efficient because it proceeds level-wise; (2) the probabilities of patterns are affected by data with different truth probabilities; (3) the system checks logical consistency among patterns to further differentiate them. The last algorithm discovers value-clustered graph functional dependencies, an extension of the concept of functional dependency in databases. These dependency rules have a more general form than the patterns in the other systems and can capture more latent semantics in the data. Using them, the system greatly improves the capability of detecting abnormality for all types of values and in situations where no explicit connections exist in the data. Experiments on a number of data sets from different domains validate these systems. All of these algorithms can be readily applied to common Semantic Web data in query answering systems, information integration systems, and semantic search systems.
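
To make the dependency-based idea concrete, the following is a minimal sketch of abnormality detection with a plain (approximate) functional dependency over triples. It is not the value-clustered graph functional dependency algorithm described above, only a simplified illustration of the underlying intuition: if one property value usually determines another, the rare subjects that disagree with the majority are candidates for abnormal data. The function name `find_abnormal`, the `min_support` threshold, the predicates `ex:zip` and `ex:city`, and the sample triples are all hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def find_abnormal(triples, determinant, dependent, min_support=0.75):
    """Flag subjects whose `dependent` value disagrees with the majority
    value observed for their `determinant` value.

    Simplified sketch: assumes each subject has at most one value per
    predicate (a later duplicate overwrites an earlier one).
    """
    # Index the triples as subject -> {predicate: object}.
    values = defaultdict(dict)
    for s, p, o in triples:
        values[s][p] = o

    # Group (subject, dependent value) pairs by determinant value.
    groups = defaultdict(list)
    for s, props in values.items():
        if determinant in props and dependent in props:
            groups[props[determinant]].append((s, props[dependent]))

    abnormal = []
    for members in groups.values():
        counts = Counter(v for _, v in members)
        majority, freq = counts.most_common(1)[0]
        # Only trust the dependency where the majority is strong enough.
        if freq / len(members) >= min_support:
            abnormal.extend(
                (s, v, majority) for s, v in members if v != majority
            )
    return abnormal


# Hypothetical data: the postal code is assumed to determine the city.
triples = [
    ("a", "ex:zip", "18015"), ("a", "ex:city", "Bethlehem"),
    ("b", "ex:zip", "18015"), ("b", "ex:city", "Bethlehem"),
    ("c", "ex:zip", "18015"), ("c", "ex:city", "Bethlehem"),
    ("d", "ex:zip", "18015"), ("d", "ex:city", "Bethlehem"),
    ("e", "ex:zip", "18015"), ("e", "ex:city", "Boston"),
    ("f", "ex:zip", "02108"), ("f", "ex:city", "Boston"),
]

# Each result is (subject, observed value, majority value).
result = find_abnormal(triples, "ex:zip", "ex:city")
print(result)
```

In this sketch, subject `e` is flagged because four of the five subjects sharing its postal code agree on a different city. The threshold keeps weak or noisy dependencies from producing flags, which loosely mirrors the abstract's point that rules valid for most of the data can serve as corroboration.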