Internal Purity: A Differential Entropy Based Internal Validation Index for Crisp and Fuzzy Clustering Validation

Bin Cao, Chen Yang, Kaibo He, Jing Fan, Honghao Gao, Pengjiang Qian
DOI: https://doi.org/10.1109/tfuzz.2024.3424479
2024-01-01
Abstract: In an effective process of cluster analysis, it is indispensable to validate the goodness of different partitions after clustering. Existing internal validation indexes are implemented based on distance and variance, which cannot capture the real “density” of the cluster. Moreover, the time complexity of distance-based indexes is usually too high for them to be applied to large datasets. Therefore, we propose a novel internal validation index based on the differential entropy, named internal purity (IP). The proposed IP index can effectively measure the purity of a cluster without using external cluster information, and successfully overcomes the drawbacks of existing internal indexes. Based on deep representation settings, where six powerful deep pretrained representation models are used, and nondeep representation settings, we use five basic crisp and fuzzy clustering algorithms to compare our index with 17 other well-known internal indexes on five text, five image, and five tabular datasets. The results show that, for 105 test cases in total, our IP index returns the optimal clustering results in 61 cases, while the second-best index reports the optimal partition in only 20 cases, which demonstrates the significant superiority of our IP index when validating the goodness of the clustering results. Moreover, theoretical analyses of the effectiveness and efficiency of the proposed index are also provided.
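The abstract does not give the IP formula, so the following is only an illustrative sketch of the general idea of scoring a partition by differential entropy, not the authors' index. It assumes each cluster is modelled as a multivariate Gaussian, for which the differential entropy has the closed form 0.5 * (d * ln(2*pi*e) + ln det(Sigma)). The function names, the ridge term, and the size-weighted average are all hypothetical choices made for this example.

```python
import numpy as np


def gaussian_differential_entropy(points, ridge=1e-6):
    """Differential entropy (in nats) of a Gaussian fitted to `points`.

    Assumption for illustration: the cluster is Gaussian. `ridge` is a small
    hypothetical regulariser that keeps the covariance matrix invertible.
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    if n < 2:
        raise ValueError("need at least two points to estimate a covariance")
    cov = np.atleast_2d(np.cov(points, rowvar=False)) + ridge * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)


def weighted_cluster_entropy(X, labels):
    """Size-weighted mean of per-cluster Gaussian entropies (lower = tighter).

    This is a stand-in score, not the IP index defined in the paper.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += len(members) / len(X) * gaussian_differential_entropy(members)
    return total


# Two well-separated synthetic blobs: compare the true partition with a
# partition whose labels have been shuffled at random.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 2)),
])
true_labels = np.repeat([0, 1], 100)
shuffled_labels = rng.permutation(true_labels)

good = weighted_cluster_entropy(X, true_labels)
bad = weighted_cluster_entropy(X, shuffled_labels)
print(f"true partition: {good:.3f} nats, shuffled partition: {bad:.3f} nats")
```

In this sketch the partition that matches the blobs receives the lower entropy score, because each of its clusters has a compact covariance. Computing the score needs one covariance estimate per cluster rather than all pairwise distances, which is in the spirit of the efficiency argument in the abstract, although the paper's own estimator may differ.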