Abstract: Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in the coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers, as they cannot utilize the incomplete computation results that partial stragglers return. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model, and focus on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generalizes most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, both of which are challenging discrete optimization problems. To reduce computational complexity, we first transform each problem into an equivalent but much simpler discrete problem with N << L variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of the expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions, both with computational complexity O(N). For the resulting maximization of the completion probability, we develop an iterative algorithm of computational complexity O(N^2) to obtain a stationary point and derive a closed-form approximate solution with computational complexity O(N) at a large threshold. Finally, numerical results show that the proposed solutions significantly outperform existing coded computation schemes and their extensions.
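The reduction from L per-coordinate coding parameters to N block sizes can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names are hypothetical, and the load model (tolerating n stragglers on a coordinate costs each worker n + 1 partial derivatives, as in standard gradient coding) is an assumption made for illustration only.

```python
import numpy as np

def blocks_from_coordinate_redundancy(s, num_workers):
    """Count how many coordinates share each redundancy level 0..N-1.

    s[l] is the (hypothetical) number of stragglers tolerated on coordinate l.
    The result x has length N, and x[n] is the size of the block with level n.
    """
    s = np.asarray(s, dtype=int)
    if s.size == 0 or s.min() < 0 or s.max() >= num_workers:
        raise ValueError("each redundancy level must lie in 0..N-1")
    return np.bincount(s, minlength=num_workers)

def per_worker_load(block_sizes):
    """Assumed load model: level n costs n + 1 partial derivatives per coordinate."""
    levels = np.arange(len(block_sizes))
    return int(np.sum(np.asarray(block_sizes) * (levels + 1)))

# Toy instance with L = 8 coordinates and N = 4 workers.
s = [0, 0, 1, 3, 1, 0, 2, 3]
x = blocks_from_coordinate_redundancy(s, num_workers=4)
load = per_worker_load(x)
print(x.tolist(), load)
```

Only the N block sizes matter under this load model, not which coordinate carries which level, which is why optimizing over N variables instead of L can be equivalent.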

Two-Stage Coded Distributed Learning: A Dynamic Partial Gradient Coding Perspective

Optimization-Based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning

Coded Parallelism for Distributed Deep Learning

Leveraging partial stragglers within gradient coding

Stochastic Gradient Coding for Straggler Mitigation in Distributed Learning

A Low-Complexity and Adaptive Distributed Source Coding Design for Model Aggregation in Distributed Learning

Design and Optimization of Hierarchical Gradient Coding for Distributed Learning at Edge Devices

Distributed Learning based on 1-Bit Gradient Coding in the Presence of Stragglers

Age-Based Coded Computation for Bias Reduction in Distributed Learning

Heterogeneity-Aware Gradient Coding for Tolerating and Leveraging Stragglers

Gradient Coding in Decentralized Learning for Evading Stragglers

Joint Dynamic Grouping and Gradient Coding for Time-Critical Distributed Machine Learning in Heterogeneous Edge Networks

Joint Coding and Scheduling Optimization for Distributed Learning Over Wireless Edge Networks

Sequential Gradient Coding For Straggler Mitigation

Gradient Coding from Cyclic MDS Codes and Expander Graphs

Coded Distributed Graph-Based Semi-Supervised Learning

Communication-Efficient Coded Distributed Multi-Task Learning

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Capacity of Hierarchical Secure Coded Gradient Aggregation with Straggling Communication Links

Heterogeneous Coded Computation Across Heterogeneous Workers