Abstract: In an information system spanning multiple, distributed, and autonomous data sources, data quality is a problem intrinsic to any integration architecture, because the data providers control their source content and how they describe it. Low-quality data is also a pressing problem for consumers of distributed information. Because recent developments in the Semantic Web suggest that it may be possible to rethink information integration, data quality research on the Semantic Web is a promising way to address the quality issues in distributed information systems. Correctness is often used synonymously with data quality. The goal of this work is to design algorithms that detect erroneous Semantic Web data by identifying abnormality, because such abnormal data is indicative of quality issues. Such algorithms would be useful in many scenarios, e.g., filtering query results derived from low-quality data, providing input for other assessments (e.g., trust), and improving the quality of the data in integrated systems. One means of assessing quality is finding corroborations: an axiom that can be entailed by other axioms is more likely to be correct, because the entailment can be seen as a corroboration. Similarly, if we have probabilistic rules that are valid for most of the data or for verified data, and a statement is entailed by these rules, then that statement is more likely to be correct than statements that cannot be entailed. Based on these ideas, I developed the following algorithms. Using a reference data set that is assumed to contain few errors and for which the closed world assumption is valid, the first algorithm learns to classify data into categories, one for each type of error that an object property triple could have. The second algorithm focuses on relaxing the closed world assumption, i.e., statements that do not exist in the data cannot be assumed to be false.
Without learning from a reference data set in advance, the third algorithm discovers patterns similar to the ones used in the previous systems, but relaxes the assumption of "few errors". It improves on the previous systems in three respects: (1) the search for patterns is more efficient because it proceeds level-wise; (2) the probabilities of patterns are affected by data with different truth probabilities; (3) the system checks logical consistency among patterns to further differentiate them. The last algorithm discovers value-clustered graph functional dependencies, an extension of the concept of functional dependency in databases. These dependency rules have a more general form than the patterns in the other systems and can capture more of the latent semantics in the data. Using them, the system greatly improves its ability to detect abnormality in all types of values and in situations where no explicit connections exist in the data. Experiments on a number of data sets from different domains validate these systems. All of these algorithms can be readily applied to common Semantic Web data in query answering systems, information integration systems, and semantic search systems.
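To make the dependency idea concrete, the following is a minimal sketch of detecting abnormal values with a plain functional dependency over triples. It is not the value-clustered graph functional dependency algorithm of this work: the function name `fd_outliers`, the majority-vote criterion, and the toy triples are all assumptions made for illustration, and the dependency (`city` determines `state`) is given by hand rather than discovered.

```python
from collections import Counter, defaultdict


def fd_outliers(triples, determinant, dependent):
    """Flag subjects whose `dependent` value disagrees with the majority
    value seen for the same `determinant` value.

    Hypothetical helper: assumes each subject has at most one value
    per property (a later triple overwrites an earlier one).
    """
    # Collect the property values of every subject.
    props = defaultdict(dict)
    for subject, prop, value in triples:
        props[subject][prop] = value

    # Group the dependent values by determinant value.
    groups = defaultdict(list)
    for subject, values in props.items():
        if determinant in values and dependent in values:
            groups[values[determinant]].append((subject, values[dependent]))

    flagged = []
    for members in groups.values():
        counts = Counter(value for _, value in members)
        majority, support = counts.most_common(1)[0]
        for subject, value in members:
            if value != majority:
                # Report the abnormal value, the expected one, and the
                # fraction of the group that backs the expected value.
                flagged.append((subject, value, majority, support / len(members)))
    return flagged


# Made-up toy data: subject "c" contradicts the majority for its city.
triples = [
    ("a", "city", "Bethlehem"), ("a", "state", "PA"),
    ("b", "city", "Bethlehem"), ("b", "state", "PA"),
    ("c", "city", "Bethlehem"), ("c", "state", "NJ"),
    ("d", "city", "Boston"), ("d", "state", "MA"),
]

for subject, value, expected, support in fd_outliers(triples, "city", "state"):
    print(subject, value, expected, round(support, 2))
```

On this toy data the only flagged subject is `c`, whose state differs from the value that two of the three Bethlehem subjects agree on. The support fraction plays the role of a rough corroboration score; a dependency that is only approximately valid would simply yield lower support.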

Towards High Quality Semantic Web Data: Detecting Abnormal Data on the Semantic Web
