Abstract: As data sizes in machine learning grow exponentially, computation must be accelerated by exploiting the ever-growing number of cores available on high-performance computing hardware. However, existing parallel methods for clustering or regression often suffer from low accuracy, slow convergence, and complex hyperparameter tuning. Furthermore, parallel efficiency is usually difficult to improve while balancing the preservation of model properties against the partitioning of computing workloads on distributed systems. In this paper, we propose a novel and simple data structure that captures the most important information among data samples. It has several advantageous properties: it supports a hierarchical clustering strategy that is independent of hardware parallelism, well-defined metrics for determining the optimal clustering, balanced partitioning that maintains the compactness property, and efficient parallelization for accelerating the computation phases. We then combine the clustering with regression techniques in a parallel library and use a hybrid structure of data and model parallelism to make predictions. Experiments show that our library achieves remarkable performance in convergence, accuracy, and scalability.

What problem does this paper attempt to address?

The paper addresses the shortcomings of existing parallel clustering and regression methods in big-data settings: low accuracy, slow convergence, and complex hyperparameter tuning. Specifically:

1. **Low Accuracy and Slow Convergence**: Existing methods such as K-Means may require thousands of iterations to converge on large-scale datasets, and the resulting accuracy is unsatisfactory.
2. **Complex Hyperparameter Tuning**: For example, the choice of K directly affects the clustering quality of K-Means, but the optimal value is difficult to determine in large-scale training.
3. **Balance between Parallel Efficiency and Model Properties**: In a distributed system, it is challenging to partition computing tasks efficiently while preserving model properties.

To solve these problems, the authors propose a new method called "Best Friend Clustering" and build an efficient parallel regression library by combining it with regression techniques. The method has the following characteristics:

- **Accuracy and Speed**: By using a simplified notion of the "best friend" from social relationships, it reduces unnecessary computation and speeds up convergence.
- **No Hyperparameter Tuning**: It avoids dependence on hyperparameters such as K, which improves robustness.
- **Scalability**: An optimized hierarchical clustering structure improves the scalability of the algorithm on large-scale datasets.

In addition, the authors design a balanced partitioning algorithm based on a backtracking mechanism to achieve data parallelism, and use a hybrid data-and-model-parallel structure to improve prediction performance. The experimental results show that the method performs well in terms of convergence, accuracy, and scalability.

### Formula Summary

1. **Linear Regression**: \[ w_{\text{opt}}=\arg\min_{w}\frac{1}{2}\sum_{i = 1}^{n}(y_i - w^T x_i)^2 \]
2. **Ridge Regression**: \[ w_{\text{opt}}=\arg\min_{w}\left(\frac{1}{2}\sum_{i = 1}^{n}(y_i - w^T x_i)^2+\frac{1}{2}\lambda\|w\|^2\right) \]
3. **Kernel Ridge Regression**: \[ \alpha=(K + \lambda I_N)^{-1}y \] \[ y^*=\sum_{i = 1}^{N}\alpha_i k(x^*, x_i) \]
4. **Support Vector Regression**: \[ w_{\text{opt}}=\arg\min_{w}\left(\frac{1}{2}\|w\|^2+C\sum_{i = 1}^{N}(\xi_i+\xi_i^*)\right) \] subject to \[ \begin{aligned} &y_i - w^T x_i\leq\epsilon+\xi_i^*\\ &w^T x_i - y_i\leq\epsilon+\xi_i\\ &\xi_i,\xi_i^*\geq0 \end{aligned} \]

These formulas give the mathematical form of the regression techniques used in the paper.
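The ridge and kernel ridge formulas above have closed-form solutions, which the following minimal NumPy sketch illustrates. It is not the paper's parallel library: the function names, the Gaussian (RBF) kernel, and the `gamma` parameter are assumptions chosen for illustration, since the summary does not say which kernel the authors use.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y.

    With lam = 0 this reduces to ordinary linear regression
    (assuming X^T X is invertible).
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2).

    The kernel choice is an assumption made for this sketch.
    """
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives from round-off

def kernel_ridge_fit(X, y, lam, gamma=1.0):
    """Dual coefficients alpha = (K + lam * I_N)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """Prediction y* = sum_i alpha_i * k(x*, x_i) for each row x* of X_new."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

A possible usage, on synthetic data: draw `X` with `np.random.default_rng(0).normal(size=(30, 3))`, set `y = X @ w_true` for some chosen `w_true`, then call `ridge_fit(X, y, 0.1)` for the primal weights or `kernel_ridge_fit(X, y, 0.1)` followed by `kernel_ridge_predict(X, alpha, X_new)` for kernel predictions. Both fits use `np.linalg.solve` rather than an explicit matrix inverse, which is the usual numerically safer choice.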

An Accurate and Efficient Large-scale Regression Method through Best Friend Clustering
