Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters.

Haimei Zhao,Zhuo Chen,Qiuhui Tong,Yuan Bo
DOI: https://doi.org/10.1007/978-3-030-04212-7_14
2018-01-01
Abstract:Due to the restriction of computing resources, it is often inconvenient to directly conduct analysis on massive datasets. Instead, a set of representatives can be extracted to approximate the spatial distribution of data objects. Standard data mining algorithms are then performed on these selected points only, which typically account for a small fraction of the original data, reducing the computational time significantly. In practice, the boundary points of data clusters can be regarded as a compact and effective representation of the original data, with great potential in clustering, outlier or anomaly detection and classification. As a result, given a complex dataset, how to reliably identify a set of effective boundary points creates a new challenge in data mining. In this paper, we present a boundary extraction technique similar to the method in SCUBI (Scalable Clustering Using Boundary Information). The key difference is that our technique exploits the clustering information in a feedback loop to further refine the boundary. Experimental results show that our technique is more robust and can produce more representative boundary points than SCUBI, especially on complex datasets with large inhomogeneity in terms of cluster density.
What problem does this paper attempt to address?