Abstract: As data collection methods have advanced, large volumes of data have been gathered in application fields such as medical science, meteorology, and electronic commerce. Analyzing these data requires integrating multiple heterogeneous data sets. For historical reasons, both technical and non-technical, the schemas of the data sets to be integrated are usually complex and differ from one another, so analysis of the integrated data may yield ambiguous results because of the non-uniform schemas. This paper targets the mining of such data. Its main contributions are as follows: (1) it proposes schema uncertainty to describe data with non-uniform schemas, and the couple correlation degree (Cor) to evaluate, with respect to the subject of analysis, the existence probabilities of records in data with schema uncertainty; (2) it designs a data structure, the B-correlation tree, that organizes uncertain data and their existence probabilities hierarchically, and discusses how selecting nodes at different levels of the B-correlation tree affects the distribution; (3) it proposes an efficient Monte Carlo algorithm for analyzing uncertain data, MonteCarlo-evaluate (MCE), based on the B-correlation tree, for data with schema uncertainty; (4) it analyzes the accuracy and convergence of MCE theoretically; (5) it implements a prototype system using the B-correlation tree and MCE on real medical data and synthetic TPC-H benchmark [20] data, and reports experiments that test the effectiveness and efficiency of the proposed methods. The experimental results show that the proposed methods evaluate schema uncertainty in data efficiently and are therefore suited to analyzing large-scale data with schema uncertainty.

Efficient subject-oriented evaluating and mining methods for data with schema uncertainty

Schema-Driven Performance Evaluation for Highly Concurrent Scenarios

An Easy-to-use Evaluation Framework for Benchmarking Entity Recognition and Disambiguation Systems

Assessing Data Quality Within Available Context

Mining Top-k Minimal Redundancy Frequent Patterns over Uncertain Databases

Efficient Cube Computing on an Extended Multidimensional Model over Uncertain Data

Spatial Data Mining with Uncertainty

A Survey of Uncertain Data Management

Uncertain Spatial Data Mining Algorithms

Accelerated Frequent Closed Sequential Pattern Mining for Uncertain Data

A statistical approach to instance-level schema matching

Mutual Enhancement of Schema Mapping and Data Mapping

Effectively Indexing the Uncertain Space

Uncertainty Processing and Measurement of Spatial Data Association Rules Mining

A framework for the uncertain spatial data mining

Effectively Indexing the Multidimensional Uncertain Objects

Interactive Mining of Schema for Semistructured Data

Correlated-Clustering Frame: A Holistic Method of Deep Web Schema Matching Based on Data Mining

Approximate Top-K Answering under Uncertain Schema Mappings

Uncertainty Handling in a Tabular Representation

Cleaning Uncertain Data with Crowdsourcing - a General Model with Diverse Accuracy Rates