Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms
Rong Gu, Yun Tang, Chen Tian, Hucheng Zhou, Guanru Li, Xudong Zheng, Yihua Huang
DOI: https://doi.org/10.1109/tpds.2017.2686384
IF: 5.3
2017-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Matrix multiplication is a dominant but very time-consuming operation in many big data analytic applications, so optimizing its performance is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. With existing matrix multiplication execution strategies, once the execution concurrency scales above a threshold, performance deteriorates quickly because the increase in IO cost outweighs the decrease in computation cost. This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms. The CRMM strategy exploits higher execution concurrency for sub-block matrix multiplication at the same IO cost. To further improve the performance of Marlin, we also propose a number of novel system-level optimizations: increasing the concurrency of local data exchange by calling the native library in batches, reducing the overhead of block matrix transformation, and reducing disk-heavy shuffle operations by exploiting the semantics of matrix computation. We have implemented Marlin as a library, along with a set of related matrix operations, on Spark and have contributed it to the open-source community. For large-sized matrix multiplication, Marlin outperforms existing systems including Spark MLlib, SystemML and SciDB, with average speedups of about $1.29\times$, $3.53\times$ and $2.21\times$, respectively. An evaluation on a real-world DNN workload also indicates that Marlin outperforms these systems, with speedups of about $12.8\times$, $5.1\times$ and $27.2\times$, respectively.
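To make the notion of "execution concurrency for sub-block matrix multiplication" concrete, the following is a minimal single-machine sketch of generic blocked matrix multiplication in plain Python. It is not the paper's CRMM strategy or the Marlin implementation, and it does not model replication or IO cost; the function names and the split parameters `m`, `k`, `n` are assumptions introduced here for illustration only. It shows that splitting the two inputs into an `m x k` and a `k x n` grid of blocks yields `m * k * n` independent sub-block products, which is the number of tasks that could run concurrently.

```python
from itertools import product


def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def split_blocks(mat, row_parts, col_parts):
    """Split mat into a row_parts x col_parts grid of equal-sized blocks."""
    rows, cols = len(mat), len(mat[0])
    assert rows % row_parts == 0 and cols % col_parts == 0
    bh, bw = rows // row_parts, cols // col_parts
    return [[[row[c * bw:(c + 1) * bw] for row in mat[r * bh:(r + 1) * bh]]
             for c in range(col_parts)]
            for r in range(row_parts)]


def blocked_matmul(a, b, m, k, n):
    """Multiply a by b via sub-blocks (hypothetical helper, not Marlin's API).

    Returns the product and the number of independent sub-block tasks.
    """
    a_blocks = split_blocks(a, m, k)   # m x k grid of blocks
    b_blocks = split_blocks(b, k, n)   # k x n grid of blocks

    # Each (i, j, l) triple is one independent sub-block product; on a
    # data-parallel platform these could be scheduled as separate tasks.
    tasks = list(product(range(m), range(n), range(k)))
    partials = {(i, j, l): matmul(a_blocks[i][l], b_blocks[l][j])
                for i, j, l in tasks}

    # Sum the partial products over l into the matching output block.
    bh, bw = len(a) // m, len(b[0]) // n
    result = [[0] * len(b[0]) for _ in range(len(a))]
    for (i, j, _l), block in partials.items():
        for r in range(bh):
            for c in range(bw):
                result[i * bh + r][j * bw + c] += block[r][c]
    return result, len(tasks)


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 2, 1], [3, 1, 1, 0]]
c, task_count = blocked_matmul(a, b, 2, 2, 2)
```

Raising `m`, `k` or `n` in this sketch increases the task count, but in a distributed setting each extra task also needs its input blocks shipped to it, which is the computation-versus-IO trade-off the abstract describes.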