Distributed Matrix Computations with Low-weight Encodings

Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton

2023-08-23

Abstract: Straggler nodes are well-known bottlenecks of distributed matrix computations which induce reductions in computation/communication speeds. A common strategy for mitigating such stragglers is to incorporate Reed-Solomon based MDS (maximum distance separable) codes into the framework; this can achieve resilience against an optimal number of stragglers. However, these codes assign dense linear combinations of submatrices to the worker nodes. When the input matrices are sparse, these approaches increase the number of non-zero entries in the encoded matrices, which in turn adversely affects the worker computation time. In this work, we develop a distributed matrix computation approach where the assigned encoded submatrices are random linear combinations of a small number of submatrices. In addition to being well suited for sparse input matrices, our approach continues to have the optimal straggler resilience in a certain range of problem parameters. Moreover, compared to recent sparse matrix computation approaches, the search for a "good" set of random coefficients to promote numerical stability in our method is much more computationally efficient. We show that our approach can efficiently utilize partial computations done by slower worker nodes in a heterogeneous system, which can enhance the overall computation speed. Numerical experiments conducted through Amazon Web Services (AWS) demonstrate up to 30% reduction in per worker node computation time and 100x faster encoding compared to the available methods.
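The sparsity argument in the abstract can be illustrated numerically. The sketch below is not the paper's construction: the helper names (`random_sparse`, `encode`, `density`), the matrix sizes, the 2% input density, and the choice of weight 3 are all assumptions made for illustration. It compares the fraction of non-zero entries in an encoded block when all `k` block-columns are combined (as in a dense MDS-style code) against a combination of only a few blocks.

```python
import numpy as np

def random_sparse(rows, cols, density, rng):
    """Return an array in which roughly `density` of the entries are non-zero."""
    mask = rng.random((rows, cols)) < density
    return mask * rng.standard_normal((rows, cols))

def encode(blocks, weight, rng):
    """Random linear combination of `weight` distinct, randomly chosen blocks."""
    chosen = rng.choice(len(blocks), size=weight, replace=False)
    coeffs = rng.standard_normal(weight)
    return sum(c * blocks[i] for c, i in zip(coeffs, chosen))

def density(M):
    """Fraction of non-zero entries in M."""
    return np.count_nonzero(M) / M.size

rng = np.random.default_rng(0)
k = 10                                    # number of block-columns (assumed)
A = random_sparse(2000, 1000, 0.02, rng)  # sparse input, about 2% non-zeros (assumed)
blocks = np.hsplit(A, k)                  # k block-columns of 100 columns each

dense_enc = encode(blocks, weight=k, rng=rng)  # combines every block
low_enc = encode(blocks, weight=3, rng=rng)    # combines only 3 blocks

print(f"uncoded block density:    {density(blocks[0]):.3f}")
print(f"dense encoding density:   {density(dense_enc):.3f}")
print(f"low-weight (w=3) density: {density(low_enc):.3f}")
```

Because the non-zero positions of independent sparse blocks rarely coincide, the density of an encoded block grows roughly in proportion to the number of blocks combined, which is why a lower encoding weight keeps the per-worker computation closer to the uncoded cost.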

Information Theory

What problem does this paper attempt to address?

The paper addresses four main issues in distributed matrix computation:

1. **Straggler Problem**: In distributed computing, slow worker nodes (stragglers) can significantly reduce the overall computation speed. The authors propose a coding method that tolerates stragglers and, for a certain range of problem parameters, retains the optimal straggler resilience.
2. **Efficient Handling of Sparse Matrices**: Existing methods based on maximum distance separable (MDS) codes assign dense linear combinations of submatrices to the workers, which increases the number of non-zero entries when the input matrices are sparse and therefore increases worker computation time. The proposed method combines only a small number of submatrices and so better preserves the sparsity of the input.
3. **Numerical Stability**: Many existing methods suffer from numerical instability during encoding and decoding. The proposed method promotes numerical stability while requiring low computational overhead to find a suitable set of random coefficients.
4. **Adaptability to Heterogeneous Systems**: Most existing methods assume that all worker nodes have the same storage capacity and computation speed, whereas real systems are often heterogeneous. The proposed method adapts to worker nodes with different storage capacities and computation speeds, and can use partial results from slower nodes to improve overall computation speed.

Experiments on an Amazon Web Services (AWS) cluster show that, compared to existing methods, the proposed method reduces the per-worker computation time by up to 30% and makes encoding 100 times faster.
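The straggler-resilience and numerical-stability points above can be sketched for a matrix-vector product. The code below is a minimal illustration under stated assumptions, not the authors' scheme: the cyclic support pattern in `cyclic_generator`, the brute-force search in `pick_coefficients`, and the parameters `n = 6`, `k = 4`, `weight = 3` are all hypothetical choices. Each worker receives a random combination of `weight` block-columns, and the product is decoded from the `k` workers that finish first.

```python
import itertools
import numpy as np

def cyclic_generator(n, k, weight, rng):
    """k x n coefficient matrix; column j is non-zero on `weight`
    cyclically consecutive blocks (an assumed support pattern)."""
    G = np.zeros((k, n))
    for j in range(n):
        support = [(j + t) % k for t in range(weight)]
        G[support, j] = rng.standard_normal(weight)
    return G

def worst_condition_number(G):
    """Largest condition number over all k-column submatrices of G."""
    k, n = G.shape
    return max(np.linalg.cond(G[:, list(S)])
               for S in itertools.combinations(range(n), k))

def pick_coefficients(n, k, weight, trials, rng):
    """Draw several random coefficient sets and keep the best-conditioned one."""
    candidates = [cyclic_generator(n, k, weight, rng) for _ in range(trials)]
    return min(candidates, key=worst_condition_number)

rng = np.random.default_rng(1)
n, k, weight = 6, 4, 3            # 6 workers, 4 blocks, so 2 stragglers tolerated
G = pick_coefficients(n, k, weight, trials=20, rng=rng)

A = rng.standard_normal((50, 8 * k))
x = rng.standard_normal(50)
blocks = np.hsplit(A, k)          # k block-columns, each 50 x 8

# Worker j computes (sum_i G[i, j] * A_i)^T x on its encoded block.
worker_results = [sum(G[i, j] * blocks[i] for i in range(k)).T @ x
                  for j in range(n)]

# Assume workers 1 and 4 straggle; decode from the remaining k workers.
finished = [0, 2, 3, 5]
Y = np.stack([worker_results[j] for j in finished])  # k x 8
U = np.linalg.solve(G[:, finished].T, Y)             # row i equals A_i^T x
decoded = U.reshape(-1)                              # concatenation gives A^T x

print("max decoding error:", np.max(np.abs(decoded - A.T @ x)))
```

The exhaustive search over all `k`-subsets of workers is only practical for small `n`; it is used here to make the role of the condition number visible, not to reflect the more efficient coefficient search that the paper reports.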

Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge

Preserving Sparsity and Privacy in Straggler-Resilient Distributed Matrix Computations

Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

Low Complexity Distributed Computing via Binary Matrices with Extension to Stragglers

A New Coding Scheme for Matrix-Vector Multiplication Via Universal Decodable Matrices

Coded Sparse Matrix Multiplication

Coded Computing for Resilient, Secure, and Privacy-Preserving Distributed Matrix Multiplication

Network Coding Approaches for Distributed Computation over Lossy Wireless Networks

Distributed matrix multiplication with straggler tolerance using algebraic function fields

On the Optimal Recovery Threshold of Coded Matrix Multiplication

Distributed Matrix-Vector Multiplication with Sparsity and Privacy Guarantees

Distributed Matrix Multiplication with a Smaller Recovery Threshold through Modulo-based Approaches

Code Design and Latency Analysis of Distributed Matrix Multiplication with Straggling Servers in Fading Channels

Coded Computation across Shared Heterogeneous Workers with Communication Delay

Flexible Distributed Matrix Multiplication

Flexible Field Sizes in Secure Distributed Matrix Multiplication via Efficient Interference Cancellation

Distributed matrix multiplication with straggler tolerance over very small field

Distributed Cluster-Based Solution Techniques For Dense Linear Equations

"Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products