Towards efficient canonical polyadic decomposition on Sunway many-core processor
Ming Dun,Yunchun Li,Qingxiao Sun,Hailong Yang,Wei Li,Zhongzhi Luan,Lin Gan,Guangwen Yang,Depei Qian
DOI: https://doi.org/10.1016/j.ins.2020.11.013
IF: 8.1
2021-03-01
Information Sciences
Abstract:<p>Canonical Polyadic Decomposition (CPD) is one of the most popular tensor decomposition methods and plays an important role in big data analysis. For sparse tensors, the major computation in CPD, known as the matricized tensor times Khatri-Rao product (MTTKRP), exhibits discontinuous memory access and becomes the performance bottleneck on emerging processor architectures. In this paper, we propose <em>swCPD</em>, an efficient CPD implementation on the many-core Sunway processor. <em>swCPD</em> accelerates the optimization algorithms whose performance is dominated by MTTKRP, including Alternating Least Squares (ALS), Gradient Descent (GD) and Randomized Block Sampling (RBS), as well as the more recent fast Levenberg-Marquardt (fLM++) and Generalized Canonical Polyadic Decomposition with Stochastic Gradient Descent (GCP-SGD). The main idea adopted in <em>swCPD</em> is a hierarchical partitioning mechanism. From the computation perspective, the 64 Computation Processing Elements (CPEs) in a Sunway processor are divided into eight <em>groups</em>, each containing seven <em>workers</em> and one <em>controller</em>. From the data perspective, we partition the sparse tensor at different granularities, namely <em>blocks</em>, <em>bands</em> and <em>tiles</em>. Moreover, we develop a communication mechanism based on register communication for cooperation between CPEs. We evaluate <em>swCPD</em> with both synthesized and real-world datasets. The experimental results show that each optimized algorithm in <em>swCPD</em> achieves better performance than the corresponding algorithm adopted in cutting-edge CPD implementations.</p>
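For readers unfamiliar with MTTKRP, the following is a minimal NumPy sketch of the operation for a sparse tensor stored in coordinate (COO) form. It is not the swCPD implementation and does not model the Sunway CPEs, the group/worker/controller layout, or the block/band/tile partitioning; the function name `mttkrp_coo` and its argument layout are assumptions made for illustration only. It is meant to show where the discontinuous memory access comes from: each nonzero gathers rows of the factor matrices at data-dependent indices and scatters its contribution into a data-dependent output row.

```python
import numpy as np


def mttkrp_coo(indices, values, factors, mode):
    """Mode-`mode` MTTKRP for a sparse tensor in COO form (illustrative sketch).

    indices: (nnz, N) integer array, one coordinate row per nonzero
    values:  (nnz,) array of nonzero values
    factors: list of N factor matrices; factors[n] has shape (I_n, R)
    Returns an (I_mode, R) matrix.
    """
    rank = factors[0].shape[1]
    out = np.zeros((factors[mode].shape[0], rank))

    # Start with each nonzero value repeated across the R columns.
    rows = np.repeat(values[:, None], rank, axis=1)

    for n, factor in enumerate(factors):
        if n == mode:
            continue
        # Gather factor rows at the nonzero's mode-n index. These indices are
        # data dependent, which is the source of the irregular memory access.
        rows = rows * factor[indices[:, n]]

    # Scatter-add into output rows; np.add.at accumulates repeated indices.
    np.add.at(out, indices[:, mode], rows)
    return out
```

In an ALS-style CPD, a routine of this shape would be called once per mode in every iteration, which is why its cost tends to dominate the overall run time.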
computer science, information systems