Abstract: Similarity query is a fundamental problem in database, data mining, and information retrieval research. Recently, querying incomplete data has attracted extensive attention as it poses new challenges to traditional querying techniques. Existing work on querying incomplete data addresses the problem where the data values on certain dimensions are unknown. However, in many real-life applications, such as data collected by a sensor network in a noisy environment, not only the data values but also the dimension information may be missing. In this work, we investigate the problem of similarity search on dimension incomplete data. A probabilistic framework is developed to model this problem so that users can find objects in the database that are similar to the query with a probability guarantee. Missing dimension information poses a great computational challenge, since all possible combinations of missing dimensions need to be examined when evaluating the similarity between the query and the data objects. We derive lower and upper bounds on the probability that a data object is similar to the query. These bounds enable efficient filtering of irrelevant data objects without explicitly examining all missing dimension combinations. A probability triangle inequality is also employed to further prune the search space and speed up the query process. The proposed probabilistic framework and techniques can be applied to both whole and subsequence queries. Extensive experimental results on real-life data sets demonstrate the effectiveness and efficiency of our approach.
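To make the combinatorial cost concrete, the following is a minimal brute-force sketch, not the method of the paper. It assumes that an object with m observed values keeps their original order, that every order-preserving placement of those values into the n query dimensions is equally likely, and that missing dimensions contribute zero to the distance. Under these assumptions the result is only an optimistic estimate of the probability that the object lies within distance r of the query. The function names `partial_distance` and `match_probability_upper_bound` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def partial_distance(query, observed, positions):
    """Euclidean distance over the dimensions where observed values are placed."""
    return sqrt(sum((query[p] - v) ** 2 for p, v in zip(positions, observed)))


def match_probability_upper_bound(query, observed, r):
    """Fraction of order-preserving placements whose partial distance is <= r.

    Assumption: placements are uniformly likely, and missing dimensions add
    nothing to the distance, so the true distance can only be larger.
    """
    n, m = len(query), len(observed)
    # combinations() yields increasing index tuples, i.e. order-preserving placements.
    placements = list(combinations(range(n), m))
    hits = sum(
        1 for pos in placements if partial_distance(query, observed, pos) <= r
    )
    return hits / len(placements)


if __name__ == "__main__":
    query = [1.0, 2.0, 3.0]
    observed = [1.0, 3.0]  # one dimension was lost, and we do not know which
    print(match_probability_upper_bound(query, observed, r=0.5))
```

The number of placements is C(n, m), which grows quickly with n. This is the exhaustive enumeration that the bounds described in the abstract are meant to avoid.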

Mining Incomplete Data Using Global and Saturated Probabilistic Approximations Based on Characteristic Sets and Maximal Consistent Blocks

Processing Methods for Incomplete Information Systems Based on Rough Sets

Mining Top-k Minimal Redundancy Frequent Patterns over Uncertain Databases

A probability based approach for processing dimension missing data

Missing value imputation using unsupervised machine learning techniques

Data Mining in Incomplete Information

An Approach to Find Missing Values in Medical Datasets

An approach to dealing with missing values in heterogeneous data using k-nearest neighbors

A Novel Fuzzy Rough Clustering Parameter-based missing value imputation

Missing Data Exploration: Highlighting Graphical Presentation of Missing Pattern

On the consistency of supervised learning with missing values

Missing Value Estimation for Mixed-Attribute Data Sets

Statistical Data, Missing

Handling Missing Data in Decision Trees: A Probabilistic Approach

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

The Analysis of Social-Science Data with Missing Values

Handling Nonmonotone Missing Data with Available Complete-Case Missing Value Assumption

Combining data discretization and missing value imputation for incomplete medical datasets

A Novel Approach for Imputation of Missing Attribute Values for Efficient Mining of Medical Datasets - Class Based Cluster Approach

Searching Dimension Incomplete Databases

Missing Value Imputation With Unsupervised Backpropagation