MP-KMeans: K-Means with Missing Pattern for Data of Missing Not at Random

Ruifeng Zhou, Hong Yu
DOI: https://doi.org/10.1007/978-3-031-21244-4_18
2022-01-01
Abstract: K-Means is one of the most popular clustering algorithms. It aims to minimize the sum of pairwise distances within each cluster, and it has been widely used in data analysis, image recognition, and many other fields. However, traditional K-Means cannot handle missing values, which greatly limits its application scenarios. Missing values are ubiquitous in the real world due to sensor failure, high acquisition cost, and privacy protection. Their presence causes useful information to be lost and makes data mining difficult. Currently, improvements of K-Means for missing values are generally based on data completion or on the partial distance strategy. These methods perform satisfactorily when values are missing at random, but they fail when data are missing not at random (MNAR). Considering the effect of the missing mechanism, this paper proposes an improved version of traditional K-Means for MNAR data, which integrates the missing pattern into the distance measure to assist the clustering process. Experimental results on public datasets show that the proposed method outperforms both data completion-based K-Means and partial distance-based K-Means.
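To make the two distance ideas in the abstract concrete, here is a minimal Python sketch. The abstract does not give the paper's formula, so everything specific is an assumption: `partial_distance` follows the commonly used partial distance strategy (squared differences over observed features, rescaled by the fraction observed), while `pattern_aware_distance`, its `pattern_centroid` argument, and its `weight` parameter are hypothetical names showing one plausible way a missing pattern could enter a distance. It should not be read as the authors' method.

```python
import numpy as np


def partial_distance(x, c):
    """Squared partial distance between sample x (NaN = missing) and centroid c.

    Only observed features contribute; the sum is rescaled by d / n_observed
    so that samples with different numbers of observed features are comparable.
    """
    observed = ~np.isnan(x)
    n_obs = observed.sum()
    if n_obs == 0:
        # No observed feature: the distance is undefined, treat it as infinite.
        return np.inf
    diff = x[observed] - c[observed]
    return x.size / n_obs * np.sum(diff ** 2)


def pattern_aware_distance(x, c, pattern_centroid, weight=1.0):
    """Hypothetical illustration, not the paper's formula.

    Adds a penalty for the mismatch between the sample's missing indicator
    vector and an assumed per-cluster average missing pattern.
    """
    mask = np.isnan(x).astype(float)  # 1.0 where the feature is missing
    pattern_term = np.sum((mask - pattern_centroid) ** 2)
    return partial_distance(x, c) + weight * pattern_term


x = np.array([1.0, np.nan, 3.0])
c = np.array([1.0, 5.0, 1.0])

print(partial_distance(x, c))
print(pattern_aware_distance(x, c, np.array([0.0, 1.0, 0.0])))
print(pattern_aware_distance(x, c, np.array([0.0, 0.0, 0.0])))
```

In this sketch, a sample whose missing pattern matches the cluster's typical pattern pays no penalty, while a mismatching sample is pushed further away. That is the intuition behind using the missing pattern as a clustering signal when missingness itself is informative, as under MNAR.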