Abstract: In the big data era, machine learning optimization algorithms usually need to be designed and implemented on widely used distributed computing platforms, such as Apache Hadoop, Spark, and Flink. However, these general-purpose platforms are not themselves designed for parallelizing machine learning optimization algorithms. In this paper, we present a parallel optimization algorithm framework for scalable machine learning, and empirically evaluate synchronous Elastic Averaging SGD (EASGD) and other distributed SGD-based optimization algorithms. First, we design a distributed machine learning optimization framework based on Apache Spark by adopting the parameter server architecture. Then, on top of the framework, we design and implement distributed synchronous EASGD and several other popular SGD-based optimization algorithms, such as Adadelta and Adam. In addition, we evaluate the performance of synchronous distributed EASGD against the other distributed optimization algorithms implemented on the same framework. Finally, to explore the optimal setting of the mini-batch size in large-scale distributed optimization, we further analyze the empirical linear scaling rule, which was originally proposed for the single-node environment. Experimental results show that our parallel optimization algorithm framework achieves good flexibility and scalability. Moreover, distributed synchronous EASGD running on the proposed framework achieves competitive convergence performance and is about 5.7% faster than the other distributed SGD-based optimization algorithms. Experimental results also verify that, on large-scale benchmarks in the distributed environment, the empirical linear scaling rule holds well only until the mini-batch size exceeds a certain threshold.
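To make the synchronous EASGD update mentioned in the abstract concrete, the following is a minimal single-process sketch. It is not the paper's Spark implementation: the workers and the parameter server are simulated with plain Python lists, the objective is a toy one-dimensional least-squares fit, and every name (`make_shards`, `easgd_sync`, `lr`, `rho`) is hypothetical. The update follows the standard synchronous EASGD formulation: each worker takes a gradient step plus an elastic pull toward the center variable, and the center variable moves toward the workers.

```python
import random


def make_shards(workers, samples_per_worker, true_w=3.0, seed=0):
    """Build one noise-free data shard per simulated worker (toy data, y = true_w * x)."""
    rng = random.Random(seed)
    shards = []
    for _ in range(workers):
        xs = [rng.uniform(0.5, 1.5) for _ in range(samples_per_worker)]
        shards.append([(x, true_w * x) for x in xs])
    return shards


def easgd_sync(shards, lr=0.05, rho=1.0, steps=300, seed=1):
    """Simulate synchronous EASGD for the model y = w * x with squared loss.

    Assumption: alpha = lr * rho is the elastic coefficient, and the center
    variable (held by the parameter server) is updated once per round using
    the worker parameters from the start of that round.
    """
    rng = random.Random(seed)
    alpha = lr * rho
    center = 0.0                      # parameter-server copy
    local = [0.0 for _ in shards]     # one local copy per worker

    for _ in range(steps):
        # Elastic differences are computed from the values at the start of the round.
        diffs = [w - center for w in local]

        # Worker step: stochastic gradient plus elastic pull toward the center.
        for i, shard in enumerate(shards):
            x, y = rng.choice(shard)
            grad = 2.0 * (local[i] * x - y) * x
            local[i] = local[i] - lr * grad - alpha * diffs[i]

        # Server step: move the center toward the workers.
        center = center + alpha * sum(diffs)

    return center, local


if __name__ == "__main__":
    center, local = easgd_sync(make_shards(workers=4, samples_per_worker=50))
    print("center:", center)
    print("workers:", local)
```

In a parameter-server deployment, the worker loop would run in parallel on data partitions and the server step would aggregate the elastic differences. With four workers and `alpha = 0.05`, the server step size `alpha * workers = 0.2` stays below 1, which keeps the center update stable in this toy setting.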

Parallelization of Classification Algorithms Based on SparkR

Distributed High-Dimension Matrix Operation Optimization on Spark

Parallelization of Machine Learning Algorithms Respectively on Single Machine and Spark

Parallel naive Bayes algorithm for large-scale Chinese text classification based on spark

Parallel spectral clustering algorithm

Study of ELM Algorithm Parallelization Based on Spark

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

The parallel algorithms for LIBSVM parameter optimization based on Spark

KunPeng: Parameter Server Based Distributed Learning Systems and Its Applications in Alibaba and Ant Financial

Research on Parallelized Sentiment Classification Algorithms

Vhadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Performance and Energy Consumption of Parallel Machine Learning Algorithms

Parallelizing Machine Learning Optimization Algorithms on Distributed Data-Parallel Platforms with Parameter Server

A Parallel Varied Density-Based Clustering Algorithm with Optimized Data Partition

Data Mining Algorithm for Cloud Network Information Based on Artificial Intelligence Decision Mechanism

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

A Parallel Multiclassification Algorithm for Big Data Using an Extreme Learning Machine

Optimizing and accelerating space-time Ripley's K function based on Apache Spark for distributed spatiotemporal point pattern analysis