Optimization of Inter-group Criteria for Clustering with Minimum Size Constraints

Eduardo S. Laber, Lucas Murtinho
2024-01-13
Abstract: Internal measures that are used to assess the quality of a clustering usually take into account intra-group and/or inter-group criteria. There are many papers in the literature that propose algorithms with provable approximation guarantees for optimizing the former. However, the optimization of inter-group criteria is much less understood.
Machine Learning, Data Structures and Algorithms
What problem does this paper attempt to address?
The paper addresses the optimization of inter-group criteria in clustering, specifically minimum spacing and minimum spanning tree spacing, under the constraint that each group contains at least a given number of points. The two criteria are:

1. **Minimum Spacing (Min-Sp)**: the minimum distance between two points that belong to different groups. It measures how well separated the groups are.
2. **Minimum Spanning Tree Spacing (MST-Sp)**: the cost of the minimum spanning tree connecting all groups. It measures the overall separation of the clustering.

The main contributions of the paper include:

- Proposing algorithms with theoretical guarantees for maximizing these two criteria, in both the unconstrained and the constrained settings.
- Demonstrating the effectiveness of the single-linkage method in maximizing MST-Sp, and showing that maximizing MST-Sp implicitly maximizes Min-Sp as well.
- Proposing algorithms that optimize these criteria under the constraint that each group contains at least a given number of points, with approximation guarantees for these algorithms.
- Demonstrating through experiments on real datasets that the proposed algorithms are effective, particularly at avoiding overly small groups.

Solving these problems matters for ensuring diverse, well-separated clusters, especially in applications such as selecting training data in machine learning and maintaining population diversity in genetic algorithms.
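To make the two criteria concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes Euclidean distance and assumes that the tree behind MST-Sp is built on a graph whose nodes are the groups and whose edge weights are the minimum point-to-point distances between groups. The function names (`spacing`, `min_spacing`, `mst_spacing`, `single_linkage`) are hypothetical, and the single-linkage routine is a naive unconstrained version that does not enforce any minimum group size.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)


def spacing(a, b):
    """Smallest distance between a point of group a and a point of group b."""
    return min(dist(p, q) for p in a for q in b)


def min_spacing(clusters):
    """Min-Sp: smallest spacing over all pairs of groups."""
    return min(spacing(a, b) for a, b in combinations(clusters, 2))


def mst_spacing(clusters):
    """MST-Sp (assumed definition): cost of a minimum spanning tree on the
    graph whose nodes are groups and whose edge weights are spacings."""
    k = len(clusters)
    edges = sorted(
        (spacing(clusters[i], clusters[j]), i, j)
        for i, j in combinations(range(k), 2)
    )
    parent = list(range(k))  # union-find forest for Kruskal's algorithm

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total = 0.0
    for weight, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two components, so it belongs to the tree
            parent[ri] = rj
            total += weight
    return total


def single_linkage(points, k):
    """Naive single linkage: repeatedly merge the two closest groups
    until only k groups remain. No minimum-size constraint is enforced."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: spacing(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters


# Illustrative data: five points on a line, grouped into three clusters.
points = [(0, 0), (1, 0), (5, 0), (6, 0), (12, 0)]
clusters = single_linkage(points, 3)
print(clusters)
print("Min-Sp:", min_spacing(clusters))
print("MST-Sp:", mst_spacing(clusters))
```

On this toy input, single linkage yields the groups {(0,0), (1,0)}, {(5,0), (6,0)}, and {(12,0)}, with Min-Sp 4.0 and MST-Sp 10.0. The third group is a singleton, which is the kind of overly small group that the minimum-size constraint is meant to rule out.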