Data Clustering and Visualization with Recursive Max k-Cut Algorithm

An Ly, Raj Sawhney, Marina Chugunova
2024-08-15
Abstract: In this article, we continue our analysis for a novel recursive modification to the Max $k$-Cut algorithm using semidefinite programming as its basis, offering an improved performance in vectorized data clustering tasks. Using a dimension relaxation method, we use a recursion method to enhance density of clustering results. Our methods provide advantages in both computational efficiency and clustering accuracy for grouping datasets into three clusters, substantiated through comprehensive experiments.
Optimization and Control
What problem does this paper attempt to address?
The paper aims to improve clustering performance on vectorized data. Specifically, the authors propose a recursive modification of the Max k-Cut algorithm based on semidefinite programming (SDP), using dimension relaxation and recursion to increase the density of the resulting clusters. Comprehensive experiments support the method's advantages in computational efficiency and clustering accuracy, particularly when a dataset is divided into three clusters.

### Background and Motivation

As the number of biomedical articles published each year grows, researchers have begun to explore methods for clustering these articles by features such as citations, topics, and other similarity measures. Clustering these documents is crucial for information retrieval and for modern research projects in many fields, and many algorithms have been developed and tested for grouping articles accurately. The MaxCut and Max k-Cut algorithms have been widely studied for clustering vectorized datasets, through approaches such as semidefinite programming, randomized strategies, and adaptive search.

### Main Contributions of the Paper

1. **Recursive Application**: The authors introduce a recursive scheme that refines the initial clustering over multiple iterations.
2. **High-Dimensional Relaxation**: The authors propose a high-dimensional relaxation that improves the clustering results by increasing the dimension of the dataset.
3. **Experimental Verification**: Experiments on multiple datasets verify the method's advantages in computational efficiency and clustering accuracy.

### Specific Problems

- **Max k-Cut Problem**: Given a similarity weight matrix \(W = \{w_{ij}\}\), \(i, j = 1,\ldots,n\), where \(n\) is the number of data points, the goal is to partition the index set \(\{1,\ldots,n\}\) into \(k\) sets \(A_1, A_2,\ldots, A_k\) such that \(\sum_{i < j} w_{ij}(1 - \langle y_i, y_j\rangle)\) is maximized, where \(y_i\) is the vector assigned to data point \(i\) and \(\langle y_i, y_j\rangle\) is the inner product of \(y_i\) and \(y_j\).
- **Recursive Algorithm**: The clustering is refined over multiple iterations. After each iteration, the between-cluster and within-cluster dissimilarities are computed, and the best partition found so far is updated.
- **High-Dimensional Relaxation**: Mapping the original dataset into a higher-dimensional space further improves the clustering.

### Experimental Results

- **Moon-Shaped Dataset**: Over successive recursive iterations, the clustering improves step by step and eventually forms three clear clusters.
- **Brain Wave Dataset**: On the reduced dataset, the clusters produced by the proposed algorithm closely match the output of a k-nearest neighbor classifier.
- **Article Paragraph Clustering**: After vectorization and clustering, paragraphs discussing the side effects of amodiaquine are successfully separated from unrelated paragraphs.

### Conclusion

The recursive and high-dimensional relaxation methods proposed in the paper perform well across the experiments, especially on complex datasets, where they can significantly improve clustering accuracy and efficiency. Future research will further optimize the algorithm to handle more clusters and more complex datasets.
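
### Worked Example: Evaluating the Max k-Cut Objective

To make the Max k-Cut objective stated under Specific Problems concrete, the following is a minimal sketch in plain Python. It assumes the standard simplex encoding from the Max k-Cut literature, which this summary does not spell out: each cluster is represented by one of \(k\) unit vectors whose pairwise inner product is \(-1/(k-1)\). Under that assumption, the objective equals \(k/(k-1)\) times the total weight of pairs placed in different clusters. The matrix `W`, the label vector, and all function names are illustrative and are not taken from the paper.

```python
import math
from itertools import combinations


def simplex_vectors(k):
    """Return k unit vectors in R^k with pairwise inner product -1/(k-1).

    Assumed encoding: a_r = sqrt(k/(k-1)) * (e_r - (1/k) * ones).
    """
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if r == c else 0.0) - 1.0 / k) for c in range(k)]
        for r in range(k)
    ]


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def relaxed_objective(W, labels, k):
    """Compute sum_{i<j} w_ij * (1 - <y_i, y_j>),
    where y_i is the simplex vector of the cluster that point i belongs to."""
    a = simplex_vectors(k)
    return sum(
        W[i][j] * (1.0 - dot(a[labels[i]], a[labels[j]]))
        for i, j in combinations(range(len(W)), 2)
    )


def cut_weight(W, labels):
    """Total weight of pairs (i, j), i < j, assigned to different clusters."""
    return sum(
        W[i][j]
        for i, j in combinations(range(len(W)), 2)
        if labels[i] != labels[j]
    )


# Illustrative symmetric weight matrix for n = 4 points (made-up values).
W = [
    [0, 2, 1, 4],
    [2, 0, 3, 1],
    [1, 3, 0, 5],
    [4, 1, 5, 0],
]
labels = [0, 1, 2, 0]  # points 0 and 3 share a cluster
k = 3

obj = relaxed_objective(W, labels, k)
cut = cut_weight(W, labels)

# The rescaled objective and the cut weight agree up to floating-point rounding.
print((k - 1) / k * obj, cut)
```

Pairs in the same cluster contribute nothing because \(1 - \langle y_i, y_i\rangle = 0\), so maximizing the objective amounts to maximizing the weight that crosses cluster boundaries. An SDP relaxation replaces the fixed simplex vectors with arbitrary unit vectors subject to \(\langle y_i, y_j\rangle \ge -1/(k-1)\); that step needs an SDP solver and is not shown here.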