Efficient Large Scale Distributed Matrix Computation with Spark.
Rong Gu,Yun Tang,Zhaokang Wang,Shuai Wang,Xusen Yin,Chunfeng Yuan,Yihua Huang
DOI: https://doi.org/10.1109/bigdata.2015.7364023
2015-01-01
Abstract:Matrix computation is the core of many massive data-intensive analytical applications such mining social networks, recommendation systems and nature language processing. Due to the importance of matrix computation, it has been widely studied for many years. In the Big Data ear, as the scale of the matrix grows, traditional single-node matrix computation systems can hardly cope with such large data and computation. Existing distributed matrix computation solutions are still not efficient enough, or have poor fault tolerance and usability. In this paper, we propose Marlin, an efficient distributed matrix computation library which is built on top of Spark. Marlin contains several distributed matrix operation algorithms and provides high-level matrix computation primitives for users. In Marlin, we proposed three distributed matrix multiplication algorithms for different situations. Based on this, we designed an adaptive model to choose the best approach for different problems. Moreover, to improve the computation performance, instead of naively using Spark, we put forward some optimizations including taking advantage of the native linear algebra library, reducing shuffle communication and increasing parallelism. Experimental results show that Marlin is over an order of magnitude faster than R (a widely-used statistical computing system) and the existing distributed matrix operation algorithms based on MapReduce. Moreover, Marlin achieves comparable performance to the specialized MPI-based matrix multiplication algorithm SUMMA but uses a general dataflow engine and gains common dataflow features such as scalability and fault tolerance.