Inference with K-means

Alfred K. Adzika, Prudence Djagba
2024-10-04
Abstract: This thesis aims to develop new approaches for making inferences with the k-means algorithm. k-means is an iterative clustering algorithm that randomly initializes k centroids, then assigns data points to the nearest centroid, and updates centroids based on the mean of assigned points. This process continues until convergence, forming k clusters where each point belongs to the closest centroid. This research investigates the prediction of the last component of data points obtained from a distribution of clustered data using the online balanced k-means approach. Through extensive experimentation and analysis, key findings have emerged. A larger number of clusters or partitions tends to yield lower errors, while increasing the number of assigned data points does not significantly improve inference errors. Reducing losses in the learning process does not significantly impact overall inference errors, indicating that inference errors remain unchanged as learning proceeds. Recommendations include the need for specialized inference techniques to better estimate data points derived from multi-clustered data, and the exploration of methods that yield improved results with larger assigned datasets. By addressing these recommendations, this research advances the accuracy and reliability of inferences made with the k-means algorithm, bridging the gap between clustering and non-parametric density estimation and inference.
Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to develop new inference methods built on the k-means clustering algorithm, in particular its online balanced variant. Specifically, the research explores methods for predicting the last component of data points drawn from a distribution, and evaluates the accuracy of these methods through experiments and analysis.

### Research Background

1. **k-means Clustering Algorithm**:
   - k-means is an iterative clustering algorithm. It randomly initializes k centroids, assigns each data point to the nearest centroid, and updates each centroid to the mean of the points assigned to it, repeating until convergence.
   - The goal of this algorithm is to minimize the within-cluster sum of squares (WCSS), that is, the sum of the squared distances between data points and their assigned centroids.
2. **Online Balanced k-means Algorithm**:
   - Online k-means can handle real-time stream data. Each time a new data point is received, it is assigned to the nearest centroid and that centroid is updated.
   - Balanced k-means constrains the clusters to be of equal size and optimizes the mean squared error (MSE) under that constraint.
   - Online balanced k-means combines the advantages of both: it can handle real-time data while keeping the clusters balanced.
3. **Non-parametric Density Estimation**:
   - Non-parametric density estimation is a statistical method for estimating the probability density function (PDF) of a random variable without assuming a specific functional form. Commonly used methods include histogram estimation, kernel density estimation (KDE), k-nearest neighbor estimation (KNN), and Voronoi density estimation.
4. **Voronoi Diagram and Voronoi Density Estimation**:
   - A Voronoi diagram divides the plane into regions, one per point in a given set, such that every location within a region is closer to that region's point than to any other point in the set.
   - Voronoi density estimation estimates the density of a distribution by dividing the data space into Voronoi cells and assuming a locally constant probability density function within each cell.

### Research Objectives

The main objectives of this research are:

1. **Develop New Inference Methods**: Propose new inference methods for the online balanced k-means algorithm to predict the last component of data points.
2. **Evaluate the Accuracy of Inference Methods**: Evaluate the performance of these new methods through extensive experiments and analysis, especially under different hyper-parameter settings.
3. **Explore Influencing Factors**: Study the influence of hyper-parameters such as the number of clusters, the learning rate, and the balance factor on the inference error, and find the optimal configuration.

### Main Findings

1. **Influence of the Number of Clusters**: Increasing the number of clusters (k) usually reduces the inference error, while increasing the number of assigned data points does not significantly reduce it.
2. **Influence of the Learning Rate**: The learning rate (α) has an important influence on inference performance. The optimal learning rate is approximately 0.6. A learning rate that is too high (such as 0.9 or 1) leads to a large error.
3. **Influence of the Balance Factor**: The balance factor (β) helps to keep the clusters balanced. The optimal balance factor ranges from -0.21 to 0.7.
4. **Comparison of Inference Methods**: The method that combines normalized weights and cluster sizes performs the worst, while the other methods are relatively stable across different data instances.
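
To make the roles of the learning rate α and the balance factor β concrete, here is a minimal Python sketch of an online k-means loop with a balance penalty, followed by one simple way to predict the last component of a point from its other components. This summary does not give the thesis's exact assignment rule, update rule, or inference methods, so everything here is an assumption for illustration: the function names `fit_online_balanced_kmeans` and `predict_last_component`, the penalty `beta * share`, and the nearest-centroid prediction rule are all hypothetical, and `beta` in this sketch need not correspond to the β range reported above.

```python
import numpy as np


def fit_online_balanced_kmeans(points, k, alpha=0.6, beta=0.1, seed=0):
    """Stream the points once and return (centroids, counts).

    Assumed rule (not taken from the thesis): assign each point to the
    centroid minimizing squared distance plus beta times that cluster's
    share of the points seen so far, then move that centroid toward the
    point by a fraction alpha.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Initialize the centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    counts = np.zeros(k)
    for x in points:
        sq_dist = np.sum((centroids - x) ** 2, axis=1)
        # Fraction of already-seen points held by each cluster.
        share = counts / max(counts.sum(), 1.0)
        # Larger clusters are penalized, nudging points to smaller ones.
        j = int(np.argmin(sq_dist + beta * share))
        centroids[j] += alpha * (x - centroids[j])
        counts[j] += 1
    return centroids, counts


def predict_last_component(centroids, partial):
    """Predict the last coordinate from the first d-1 coordinates.

    Assumed rule: find the centroid nearest to `partial` in the first
    d-1 coordinates and return that centroid's last coordinate.
    """
    partial = np.asarray(partial, dtype=float)
    sq_dist = np.sum((centroids[:, :-1] - partial) ** 2, axis=1)
    return float(centroids[np.argmin(sq_dist), -1])


# Toy data: two well-separated 2-D blobs, presented in random order.
demo_rng = np.random.default_rng(1)
demo_data = np.vstack([
    demo_rng.normal([0.0, 0.0], 0.1, (200, 2)),
    demo_rng.normal([5.0, 5.0], 0.1, (200, 2)),
])
demo_rng.shuffle(demo_data)
demo_centroids, demo_counts = fit_online_balanced_kmeans(demo_data, k=2)
print(predict_last_component(demo_centroids, [5.0]))
```

With a constant α, each centroid is an exponentially weighted average of the recent points assigned to it, so later points count for more than earlier ones. That is at least consistent with the finding that adding more assigned data points does not noticeably reduce the inference error, though the sketch does not establish why the thesis observes it.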

### Conclusions

This research improves the accuracy and reliability of inference based on k-means by introducing the online balanced k-means algorithm and multiple inference methods. Although the inference performance is not affected by the number of training data points, the accuracy of inference can be significantly improved by selecting appropriate hyper-parameters. Future research can further explore specialized inference techniques to better estimate data points in multi-cluster data.