Abstract: The parameter server architecture has shown promising performance advantages for deep learning (DL) applications. One crucial issue in this setting is the presence of stragglers, which significantly slow DL training progress. Our experiments show that previous straggler-mitigation solutions may not fully exploit the computation resources of the cluster, especially in heterogeneous environments. This motivates us to design a heterogeneity-aware parameter server paradigm that addresses stragglers and accelerates DL training from the perspective of computation parallelism. We introduce a novel methodology named straggler projection to give a comprehensive inspection of stragglers and derive practical guidelines for solving this problem in two respects: (1) controlling each worker's training speed via elastic training parallelism control and (2) transferring blocked tasks from stragglers to pioneers to fully utilize the computation resources. Following these guidelines, we propose the abstraction of parallelism as an infrastructure and design the Elastic-Parallelism Synchronous Parallel (EPSP) algorithm to handle distributed training and parameter synchronization, supporting both enforced- and slack-synchronization schemes. The whole idea has been implemented in a prototype called Falcon, which effectively accelerates DL training in the presence of stragglers. Evaluation on various benchmarks against baselines demonstrates the superiority of our system. Specifically, Falcon reduces training convergence time by up to 61.83, 55.19, 38.92, and 23.68 percent compared with FlexRR, Sync-opt, ConSGD, and DynSGD, respectively.
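
As a rough illustration of guideline (2), the sketch below redistributes a fixed pool of tasks across workers in proportion to their measured speed, so that slower workers (stragglers) hold fewer tasks and faster workers (pioneers) hold more. This is not the EPSP algorithm or Falcon's implementation; the function name `rebalance_tasks`, the `speeds` mapping, and the proportional-split rule are all assumptions made for illustration.

```python
from typing import Dict


def rebalance_tasks(speeds: Dict[str, float], total_tasks: int) -> Dict[str, int]:
    """Split total_tasks across workers in proportion to measured speed.

    Hypothetical helper, not taken from the paper: `speeds` maps a worker
    name to its observed throughput (for example, tasks per second).
    """
    if total_tasks < 0:
        raise ValueError("total_tasks must be non-negative")
    if not speeds or any(s <= 0 for s in speeds.values()):
        raise ValueError("speeds must be non-empty and strictly positive")

    total_speed = sum(speeds.values())
    # Ideal (fractional) share of the task pool for each worker.
    shares = {w: total_tasks * s / total_speed for w, s in speeds.items()}
    # Start from the rounded-down share.
    alloc = {w: int(share) for w, share in shares.items()}
    # Give the leftover tasks to the workers with the largest remainders.
    leftover = total_tasks - sum(alloc.values())
    by_remainder = sorted(speeds, key=lambda w: shares[w] - alloc[w], reverse=True)
    for w in by_remainder[:leftover]:
        alloc[w] += 1
    return alloc


if __name__ == "__main__":
    # A pioneer three times faster than a straggler receives three times the work.
    print(rebalance_tasks({"straggler": 1.0, "pioneer": 3.0}, 8))
```

A proportional split is only one possible policy; a real system would also have to account for the cost of moving tasks and for speeds that change during training.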

Falcon: Addressing Stragglers in Heterogeneous Parameter Server Via Multiple Parallelism
