Distributed High-Dimension Matrix Operation Optimization on Spark
Qi She, Jingwei Zhang, Ya Zhou, Mingfei Qin, Qing Yang
DOI: https://doi.org/10.1109/icaci.2019.8778546
2019-01-01
Abstract: In the era of big data, mining valuable information from massive data is increasingly valued by industry, academia, and governments. Such mining relies on algorithms like principal component analysis, regression, and clustering, which often use large-scale matrix operations. When the matrix dimension is very large, these operations are difficult to perform on a single machine, but a distributed approach can address the computational scalability and complexity problems that high-dimensional matrices bring. On the distributed platform Spark, we propose RPMM, a distributed matrix operation execution strategy that improves the concurrency of matrix computation and reduces the overhead of data shuffling. In addition, a locality-sensitive hashing (LSH) algorithm is introduced to speed up row-vector similarity computation. Compared with matrix operations on a single machine, these distributed matrix operations effectively solve the scalability problem of large matrix operations.
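The abstract does not say which LSH family is used, so the following is only a minimal single-machine sketch of the general idea, assuming random-hyperplane (sign) hashing for cosine similarity. The function names `make_hyperplanes`, `signature`, `bucket_rows`, and `hamming` are hypothetical and are not taken from the paper. Rows whose signatures collide become candidates for an exact similarity check, which avoids comparing every pair of rows.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(row, hyperplanes):
    """One bit per hyperplane: 1 if the row lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(r * h for r, h in zip(row, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket_rows(matrix, hyperplanes):
    """Group row indices by signature; rows sharing a bucket are similarity candidates."""
    buckets = {}
    for idx, row in enumerate(matrix):
        buckets.setdefault(signature(row, hyperplanes), []).append(idx)
    return buckets


def hamming(sig_a, sig_b):
    """Number of differing bits; a smaller distance suggests a smaller angle between rows."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


if __name__ == "__main__":
    matrix = [
        [1.0, 2.0, 3.0],
        [2.0, 4.0, 6.0],     # same direction as row 0
        [-1.0, -2.0, -3.0],  # opposite direction to row 0
    ]
    planes = make_hyperplanes(dim=3, num_bits=8)
    for sig, rows in bucket_rows(matrix, planes).items():
        print(sig, rows)
```

In a distributed setting, the signature could serve as a grouping key so that only rows in the same bucket are compared on the same worker; that mapping onto Spark is an assumption here, not a description of the paper's implementation.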