New bounds on the cohesion of complete-link and other linkage methods for agglomeration clustering

Sanjoy Dasgupta, Eduardo Laber
2024-05-02
Abstract: Linkage methods are among the most popular algorithms for hierarchical clustering. Despite their relevance, the current knowledge regarding the quality of the clustering produced by these methods is limited. Here, we improve the currently available bounds on the maximum diameter of the clustering obtained by complete-link for metric spaces.
Machine Learning, Data Structures and Algorithms
What problem does this paper attempt to address?
The paper addresses how good the clusterings produced by linkage methods for agglomerative hierarchical clustering are. Specifically, the authors improve the existing bounds on the maximum diameter of the clusters produced by complete-link in metric spaces. Their new analytical techniques also separate complete-link from single-link in terms of approximation, which supports the common view that complete-link is the better choice for producing compact clusters.

### Main Contributions

1. **Improved Upper Bound**: For any \(k\), the paper proves that the maximum diameter of the \(k\)-clustering produced by complete-link is at most \(k^{1.59}\cdot \text{OPT}_{AV}(k)\), where \(\text{OPT}_{AV}(k)\) is the minimum average diameter of a \(k\)-clustering. The average diameter of any clustering is at most its maximum diameter, so \(\text{OPT}_{AV}(k)\leq \text{OPT}_{DM}(k)\), where \(\text{OPT}_{DM}(k)\) is the minimum achievable maximum diameter. The new bound is therefore at least as tight as the previously best-known upper bound \(O(k^{1.59}\cdot \text{OPT}_{DM}(k))\).
2. **Distinguishing Complete-Link and Single-Link**: By measuring against \(\text{OPT}_{AV}\) instead of \(\text{OPT}_{DM}\), the authors separate the worst-case approximation guarantees of the two methods. The maximum diameter of complete-link is \(O(k^{1.59}\cdot \text{OPT}_{AV}(k))\), whereas there are instances on which the maximum diameter of single-link is \(\Omega(k^{2}\cdot \text{OPT}_{AV}(k))\).
3. **Extension to Other Linkage Methods**: The paper also shows that the same techniques apply to other linkage methods, such as average-link, and proves upper bounds on the cohesion of the clusters these methods produce.

### Key Concepts

- **Diameter in Metric Space**: Let \((X, \text{dist})\) be a metric space, where \(X\) is a set of points and \(\text{dist}\) is a distance function. The diameter of a set of points \(S\subseteq X\) is defined as \(\text{diam}(S)=\max\{\text{dist}(x, y)\mid x, y\in S\}\).
- **Maximum Diameter and Average Diameter**: For a \(k\)-clustering \(C=(C_{1},\ldots,C_{k})\):
  - Maximum diameter \(\text{max-diam}(C):=\max\{\text{diam}(C_{i})\mid 1\leq i\leq k\}\)
  - Average diameter \(\text{avg-diam}(C):=\frac{1}{k}\sum_{i = 1}^{k}\text{diam}(C_{i})\)

### Technical Details

The authors obtain these results by carefully defining families of clusters built by complete-link, partitioning them over the course of the execution, and then bounding the diameter of the clusters in each family. This approach both simplifies the analysis and yields tighter bounds.

### Practical Applications

The results matter most for small \(k\), because partitions containing many groups are usually hard for people to analyze. They also show that bottom-up hierarchical clustering can perform well when \(k\) is small, which runs contrary to common intuition.

### Summary

Through new analytical techniques and rigorous proofs, the paper tightens the theoretical guarantees on how compact the clusters produced by complete-link and other linkage methods are, which gives firmer guidance for choosing among these methods in practice.
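
### Illustrative Example

The quantities defined under Key Concepts can be made concrete with a small sketch. The code below is not taken from the paper. It is a naive implementation of the textbook linkage rule (merge the two clusters with the smallest linkage distance), assuming Euclidean distance and arbitrary tie-breaking; the paper's exact formulation may differ. The helper names `diam`, `max_diam`, `avg_diam`, and `linkage`, as well as the sample points, are hypothetical choices made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def diam(cluster):
    """Largest pairwise distance in a cluster (0 for a singleton)."""
    return max((dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)


def max_diam(clustering):
    """Maximum diameter over the clusters of a clustering."""
    return max(diam(c) for c in clustering)


def avg_diam(clustering):
    """Average diameter over the clusters of a clustering."""
    return sum(diam(c) for c in clustering) / len(clustering)


def linkage(points, k, rule):
    """Naive agglomeration: start from singletons and repeatedly merge the
    pair of clusters with the smallest linkage distance until k remain.

    rule=max gives complete-link (largest cross-cluster distance),
    rule=min gives single-link (smallest cross-cluster distance).
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: rule(
                dist(x, y) for x in clusters[ij[0]] for y in clusters[ij[1]]
            ),
        )
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 1-D instance: a chain of points with slowly growing gaps.
points = [(0,), (10,), (21,), (33,), (46,)]

complete = linkage(points, k=2, rule=max)
single = linkage(points, k=2, rule=min)

print(complete, max_diam(complete), avg_diam(complete))
# [[(0,), (10,)], [(21,), (33,), (46,)]] 25.0 17.5
print(single, max_diam(single), avg_diam(single))
# [[(0,), (10,), (21,), (33,)], [(46,)]] 33.0 16.5
```

On this instance single-link chains four points into one long cluster, while complete-link keeps both clusters shorter. A single example only illustrates the definitions; it says nothing about the worst-case bounds proved in the paper.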