Clustering validation by distribution hypothesis learning

Ariel E. Bayá, Mónica G. Larese
DOI: https://doi.org/10.1007/s11222-024-10511-8
IF: 2.3241
2024-10-11
Statistics and Computing
Abstract: We present a new clustering validation technique named "Hypothesis Learning". We build our method on three concepts: (1) clustering cohesion, (2) clustering dispersion, and (3) hypothesis quality. The first two notions focus on individual cluster quality. We measure them using a classifier that estimates tightness and separation as likelihoods. The third notion evaluates the complexity of learning the clustering partition. As with cohesion and dispersion, it yields a likelihood value. Next, we aggregate these three measures into a single index reporting clustering quality. Previous methods from the literature have already used supervised and unsupervised algorithms and stability concepts to validate clustering solutions. Our motivation is not only to improve on these methods but also to use learning algorithms in a novel manner to learn key clustering concepts such as cohesion and dispersion. Furthermore, we include a technical discussion on how to regularize a classifier to handle overfitting, thus explaining the symbiosis between supervised and unsupervised algorithms. In our experimental setup, we tested "Hypothesis Learning" with a fast classifier, K Nearest Neighbour (KNN). However, in the discussion of the method, we explore other classifiers such as CART and Random Forest. The experimental results compare our approach with a similar method and many other well-known clustering indexes.
statistics & probability,computer science, theory & methods
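
The abstract describes scoring a clustering by how well a classifier can learn the partition. The sketch below illustrates only that general idea and is not the authors' "Hypothesis Learning" index: it treats cluster labels as class labels and reports the cross-validated accuracy of a small KNN classifier. The function names (knn_predict, learnability_score), the fold scheme, the choice of k, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = {}
    for i in nearest:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    return max(votes, key=votes.get)


def learnability_score(X, labels, k=3, folds=5, seed=0):
    """Cross-validated KNN accuracy when predicting the cluster labels.

    Hypothetical stand-in for a "how learnable is this partition" measure;
    it is NOT the index defined in the paper.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = idx[f::folds]                      # every folds-th shuffled index
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        train_X = [X[i] for i in train]
        train_y = [labels[i] for i in train]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == labels[i] for i in test)
    return correct / len(X)


# Synthetic example (assumed data): two well-separated 2-D blobs.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
blob_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
X = blob_a + blob_b

good_labels = [0] * 30 + [1] * 30                 # partition matching the blobs
bad_labels = [rng.randint(0, 1) for _ in X]       # arbitrary partition

good_score = learnability_score(X, good_labels)
bad_score = learnability_score(X, bad_labels)
print(good_score, bad_score)
```

In this toy setting the partition that follows the blob structure should be much easier for KNN to learn than the arbitrary one, which is the intuition behind using a classifier as a validation signal.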