Distributed Matrix Computations with Low-weight Encodings

Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton
2023-08-23
Abstract: Straggler nodes are well-known bottlenecks of distributed matrix computations, which induce reductions in computation/communication speeds. A common strategy for mitigating such stragglers is to incorporate Reed-Solomon based MDS (maximum distance separable) codes into the framework; this can achieve resilience against an optimal number of stragglers. However, these codes assign dense linear combinations of submatrices to the worker nodes. When the input matrices are sparse, these approaches increase the number of non-zero entries in the encoded matrices, which in turn adversely affects the worker computation time. In this work, we develop a distributed matrix computation approach where the assigned encoded submatrices are random linear combinations of a small number of submatrices. In addition to being well suited for sparse input matrices, our approach continues to have the optimal straggler resilience in a certain range of problem parameters. Moreover, compared to recent sparse matrix computation approaches, the search for a "good" set of random coefficients to promote numerical stability in our method is much more computationally efficient. We show that our approach can efficiently utilize partial computations done by slower worker nodes in a heterogeneous system, which can enhance the overall computation speed. Numerical experiments conducted through Amazon Web Services (AWS) demonstrate up to 30% reduction in per worker node computation time and 100x faster encoding compared to the available methods.
Information Theory
What problem does this paper attempt to address?
The paper addresses four main issues in distributed matrix computation:

1. **Straggler Problem**: In distributed computing, slow worker nodes (stragglers) can significantly reduce the overall computation speed. The authors propose a coding method that retains optimal straggler resilience in a certain range of problem parameters.
2. **Efficient Handling of Sparse Matrices**: Existing methods based on Maximum Distance Separable (MDS) codes assign dense linear combinations of submatrices to the workers. For sparse inputs, this increases the number of non-zero entries in the encoded matrices and therefore the worker computation time. The proposed method combines only a small number of submatrices per worker, which better preserves the sparsity of the input.
3. **Numerical Stability**: Many existing methods suffer from numerical instability during encoding and decoding. The proposed method promotes numerical stability, and its search for a suitable set of random coefficients has low computational overhead.
4. **Adaptability to Heterogeneous Systems**: Most existing methods assume that all worker nodes have the same storage capacity and computation speed, but real systems are often heterogeneous. The proposed method adapts to worker nodes with different storage capacities and computation speeds, and it can use partial computation results from slower nodes to improve the overall computation speed.

Experiments on an Amazon Web Services (AWS) cluster show that, compared to the available methods, the proposed method reduces the per worker node computation time by up to 30% and achieves 100x faster encoding.
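The sparsity argument above can be illustrated with a small NumPy sketch. It is a toy example under stated assumptions, not the paper's scheme: the cyclic placement of coefficients, the matrix-vector use case, the problem sizes, and the names `encoding_matrix` and `encode` are all hypothetical. It compares the number of non-zero entries produced by a dense encoding with those produced by a weight-2 encoding, then recovers the product `A @ x` from four of six workers.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoding_matrix(n, k, weight, rng):
    """Return an n x k coefficient matrix with `weight` random nonzeros per row.

    Assumption: the nonzeros sit in cyclically consecutive columns. This toy
    layout is NOT the assignment used in the paper.
    """
    G = np.zeros((n, k))
    for i in range(n):
        cols = [(i + j) % k for j in range(weight)]
        G[i, cols] = rng.standard_normal(weight)
    return G

def encode(blocks, G):
    """Worker i receives sum_j G[i, j] * blocks[j] over its nonzero coefficients."""
    return [sum(G[i, j] * blocks[j] for j in np.flatnonzero(G[i]))
            for i in range(G.shape[0])]

n, k = 6, 4                                   # 6 workers, 4 block-rows of A
mask = rng.random((40, 30)) < 0.05            # roughly 5% of entries are nonzero
A = mask * rng.standard_normal((40, 30))
x = rng.standard_normal(30)
blocks = np.split(A, k, axis=0)               # four 10 x 30 submatrices

G_low = encoding_matrix(n, k, weight=2, rng=rng)    # low-weight encoding
G_dense = encoding_matrix(n, k, weight=k, rng=rng)  # dense (MDS-style) encoding

encoded = encode(blocks, G_low)
low_nnz = sum(np.count_nonzero(B) for B in encoded)
dense_nnz = sum(np.count_nonzero(B) for B in encode(blocks, G_dense))
print("nonzeros across workers: low-weight =", low_nnz, "dense =", dense_nnz)

# Treat workers 1 and 4 as stragglers and decode from the remaining four.
survivors = [0, 2, 3, 5]
G_s = G_low[survivors]
if np.linalg.matrix_rank(G_s) < k:
    raise ValueError("this set of workers does not determine all k blocks")

# Each surviving worker returns (its encoded block) @ x; solving the k x k
# system recovers every A_j @ x, and stacking them gives A @ x.
Y = np.stack([encoded[i] @ x for i in survivors])
recovered = np.linalg.solve(G_s, Y).reshape(-1)
print("recovered A @ x:", np.allclose(recovered, A @ x))
```

In this toy layout not every set of four workers is decodable (for example, workers 0, 1, 4, and 5 together never touch the fourth block), which is why the sketch checks the rank before solving. Guaranteeing resilience for every straggler pattern requires a carefully designed assignment, which is the kind of guarantee the paper claims for its own construction in a certain range of parameters.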