Internal Purity: A Differential Entropy based Internal Validation Index for Clustering Validation

Bin Cao, Chen Yang, Kaibo He, Jing Fan
2023-01-01
Abstract: In an effective cluster analysis, it is indispensable to validate the goodness of different partitions after clustering. Existing internal validation indices are based on distance, variance, or model selection. Indices based on distance or variance cannot capture the real "density" of a cluster, and the time complexity of distance-based indices is usually too high for large datasets. Moreover, indices based on model selection tend to overestimate the number of clusters in clustering validation. Therefore, we propose a novel internal validation index based on differential entropy, named internal purity (IP). The proposed IP index can effectively measure the purity of a cluster without using external cluster information, and it overcomes the drawbacks of existing internal indices. Relying on six powerful deep pre-trained models, without further fine-tuning on the experimental datasets, we use four different clustering algorithms to compare our index with thirteen other well-known internal indices on five text and five image datasets. The results show that, across 60 test cases in total, our IP index returns the optimal clustering result in 43 cases, while the second-best index reports the optimal partition in only 17 cases, which demonstrates the significant superiority of our IP index when validating the goodness of clustering results. Moreover, a theoretical analysis of the effectiveness and efficiency of the proposed index is also provided.
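The abstract does not give the formula for IP, so the following is only a rough illustration of how differential entropy can score a partition, not the authors' index. It assumes each cluster is approximated by a multivariate Gaussian, whose differential entropy has the closed form h = 0.5 * log det(2*pi*e*Sigma). The function names, the `ridge` regularizer, and the size-weighted averaging are all hypothetical choices made for this sketch.

```python
import numpy as np


def gaussian_differential_entropy(points, ridge=1e-6):
    """Differential entropy (in nats) of a Gaussian fitted to `points`.

    Uses the closed form h = 0.5 * log det(2*pi*e*Sigma).
    `ridge` is a hypothetical regularizer added to the diagonal so that
    the sample covariance stays positive definite.
    Assumes `points` has shape (n, d) with n >= 2.
    """
    points = np.asarray(points, dtype=float)
    d = points.shape[1]
    sigma = np.atleast_2d(np.cov(points, rowvar=False)) + ridge * np.eye(d)
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)


def mean_cluster_entropy(points, labels):
    """Size-weighted mean of per-cluster Gaussian entropies.

    Illustrative score only (lower suggests more compact clusters);
    this is NOT the IP index defined in the paper.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += len(members) * gaussian_differential_entropy(members)
    return total / len(points)


# Two well-separated synthetic blobs in 2D.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(200, 2))
X = np.vstack([blob_a, blob_b])

good_labels = np.repeat([0, 1], 200)          # matches the true blobs
bad_labels = rng.permutation(good_labels)     # same sizes, shuffled

print(mean_cluster_entropy(X, good_labels))
print(mean_cluster_entropy(X, bad_labels))
```

In this toy setting the partition that matches the blobs yields the lower score, because each of its clusters has a much smaller covariance determinant than a cluster that mixes points from both blobs. No pairwise distances are computed, which hints at why an entropy-based index can be cheaper than distance-based ones on large datasets.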