An Accurate and Efficient Large-scale Regression Method through Best Friend Clustering

Kun Li, Liang Yuan, Yunquan Zhang, Gongwei Chen
DOI: https://doi.org/10.48550/arXiv.2104.10819
2021-04-22
Abstract: As the data size in Machine Learning fields grows exponentially, it is inevitable to accelerate the computation by utilizing the ever-growing large number of available cores provided by high-performance computing hardware. However, existing parallel methods for clustering or regression often suffer from problems of low accuracy, slow convergence, and complex hyperparameter-tuning. Furthermore, the parallel efficiency is usually difficult to improve while striking a balance between preserving model properties and partitioning computing workloads on distributed systems. In this paper, we propose a novel and simple data structure capturing the most important information among data samples. It has several advantageous properties supporting a hierarchical clustering strategy that is irrelevant to the hardware parallelism, well-defined metrics for determining optimal clustering, balanced partition for maintaining the compactness property, and efficient parallelization for accelerating computation phases. Then we combine the clustering with regression techniques as a parallel library and utilize a hybrid structure of data and model parallelism to make predictions. Experiments illustrate that our library obtains remarkable performance on convergence, accuracy, and scalability.
Machine Learning; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses the shortcomings of existing parallel clustering and regression methods on large-scale data: low accuracy, slow convergence, and complex hyperparameter tuning. Specifically:

1. **Low accuracy and slow convergence**: Existing methods such as K-Means may require thousands of iterations to converge on large-scale datasets, and the resulting accuracy is still unsatisfactory.
2. **Complex hyperparameter tuning**: For example, the choice of K directly affects the clustering quality of K-Means, but the optimal value is difficult to determine in large-scale training.
3. **Balancing parallel efficiency against model properties**: In a distributed system, it is challenging to partition the computing workload efficiently while preserving model properties.

To solve these problems, the authors propose a new method called "Best Friend Clustering" and build an efficient parallel regression library by combining it with regression techniques. The method has the following characteristics:

- **Accuracy and speed**: By using a simplified version of the "best friend" concept from social relations, it reduces unnecessary computation and converges faster.
- **No hyperparameter tuning**: It avoids the dependence on hyperparameters such as K, which improves robustness.
- **Scalability**: An optimized hierarchical clustering structure improves the scalability of the algorithm on large-scale datasets.

In addition, the authors design a balanced partitioning algorithm based on a backtracking mechanism to achieve data parallelism, and use a hybrid structure of data and model parallelism to improve prediction performance. The experimental results show that the method performs well in terms of convergence, accuracy, and scalability.

### Formula Summary

In the formulas below, \(N\) is the number of training samples.

1. **Linear Regression**:
   \[ w_{\text{opt}}=\arg\min_{w}\frac{1}{2}\sum_{i = 1}^{N}(y_i - w^T x_i)^2 \]
2. **Ridge Regression**:
   \[ w_{\text{opt}}=\arg\min_{w}\left(\frac{1}{2}\sum_{i = 1}^{N}(y_i - w^T x_i)^2+\frac{1}{2}\lambda\|w\|^2\right) \]
3. **Kernel Ridge Regression**:
   \[ \alpha=(K + \lambda I_N)^{-1}y \]
   \[ y^*=\sum_{i = 1}^{N}\alpha_i k(x^*, x_i) \]
4. **Support Vector Regression**:
   \[ w_{\text{opt}}=\arg\min_{w,\,\xi,\,\xi^*}\left(\frac{1}{2}\|w\|^2+C\sum_{i = 1}^{N}(\xi_i+\xi_i^*)\right) \]
   subject to
   \[ \begin{aligned} &y_i - w^T x_i\leq\epsilon+\xi_i^*\\ &w^T x_i - y_i\leq\epsilon+\xi_i\\ &\xi_i,\xi_i^*\geq0 \end{aligned} \]

These are the standard formulations of the regression techniques involved in the paper.
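The ridge and kernel ridge formulas above can be checked numerically with a minimal NumPy sketch. It is not the paper's library or API: the function names (`ridge_fit`, `krr_fit`, `krr_predict`), the synthetic data, and the choice of kernels are all assumptions made for illustration, and the clustering and parallel parts of the method are not covered. The ridge solution uses the closed form obtained by setting the gradient of the ridge objective to zero, \((X^T X + \lambda I)w = X^T y\). The kernel ridge part follows \(\alpha=(K+\lambda I_N)^{-1}y\) with \(K_{ij}=k(x_i, x_j)\).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimizer of 1/2*sum_i (y_i - w^T x_i)^2 + 1/2*lam*||w||^2 (hypothetical helper)."""
    d = X.shape[1]
    # Zero gradient gives the linear system (X^T X + lam*I) w = X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def linear_kernel(A, B):
    """k(a, b) = a^T b for every pair of rows of A and B."""
    return A @ B.T

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel exp(-gamma*||a - b||^2); the kernel choice is an assumption."""
    sq = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def krr_fit(X, y, lam, kernel):
    """alpha = (K + lam*I_N)^{-1} y, solved as a linear system instead of an explicit inverse."""
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, kernel):
    """y* = sum_i alpha_i * k(x*, x_i) for every row x* of X_new."""
    return kernel(X_new, X_train) @ alpha

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
X_new = rng.normal(size=(5, 3))
lam = 0.1

w = ridge_fit(X, y, lam)
alpha_lin = krr_fit(X, y, lam, linear_kernel)
pred_ridge = X_new @ w
pred_krr = krr_predict(X, alpha_lin, X_new, linear_kernel)

# With a linear kernel, kernel ridge and ridge should give the same predictions.
print(np.allclose(pred_ridge, pred_krr))

# A nonlinear kernel uses the same two formulas with a different k.
alpha_rbf = krr_fit(X, y, lam, rbf_kernel)
pred_rbf = krr_predict(X, alpha_rbf, X_new, rbf_kernel)
```

Solving the \(N \times N\) system for \(\alpha\) costs roughly \(O(N^3)\) with a dense solver, which is one reason large-scale settings motivate partitioning the data before fitting.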