Mp-Dissimilarity: A Data Dependent Dissimilarity Measure.

Sunil Aryal, Kai Ming Ting, Gholamreza Haffari, Takashi Washio
DOI: https://doi.org/10.1109/icdm.2014.33
2014-01-01
Abstract: Nearest neighbour search is a core process in many data mining algorithms. Finding reliable closest matches of a query in a high dimensional space is still a challenging task. This is because the effectiveness of many dissimilarity measures that are based on a geometric model, such as the lp-norm, decreases as the number of dimensions increases. In this paper, we examine how the data distribution can be exploited to measure dissimilarity between two instances and propose a new data dependent dissimilarity measure called 'mp-dissimilarity'. Rather than relying on geometric distance, it measures the dissimilarity between two instances in each dimension as the probability mass in a region that encloses the two instances. It deems two instances in a sparse region to be more similar than two instances in a dense region, even though both pairs have the same geometric distance. Our empirical results show that the proposed dissimilarity measure provides a reliable nearest neighbour search in high dimensional spaces, particularly in sparse data. Mp-dissimilarity produced better task specific performance than lp-norm and cosine distance in classification and information retrieval tasks.
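
The idea of replacing geometric distance with per-dimension probability mass can be illustrated with a minimal sketch. The abstract does not give the formula, so the details here are assumptions: the mass in each dimension is estimated as the fraction of data points whose value lies in the closed interval between the two instances, and the per-dimension masses are combined with a power mean of order p. The function name `mp_dissimilarity` and the toy data are hypothetical and chosen only for illustration.

```python
def mp_dissimilarity(x, y, data, p=2.0):
    """Sketch of a mass-based dissimilarity between instances x and y.

    Assumption: the probability mass in dimension i is estimated as the
    fraction of rows in `data` whose i-th value lies in the closed
    interval [min(x[i], y[i]), max(x[i], y[i])].
    Assumption: per-dimension masses are aggregated with a power mean
    of order p.
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for i in range(d):
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        # Count the data points that fall inside the enclosing region.
        count = sum(1 for row in data if lo <= row[i] <= hi)
        total += (count / n) ** p
    return (total / d) ** (1.0 / p)


# Toy one-dimensional data: six points in a dense cluster, two in a sparse area.
data = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [5.0], [5.5]]

# Both pairs are 0.5 apart geometrically.
dense_pair = mp_dissimilarity([0.0], [0.5], data)
sparse_pair = mp_dissimilarity([5.0], [5.5], data)
print(dense_pair, sparse_pair)
```

In this toy data the dense pair encloses 6 of the 8 points and the sparse pair encloses 2 of the 8, so the sparse pair is judged more similar despite the equal geometric distance, which matches the behaviour the abstract describes.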