Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor

Lijuan Jiang, Chao Yang, Wenjing Ma
DOI: https://doi.org/10.1145/3378176
IF: 1.444
2020-01-01
ACM Transactions on Architecture and Code Optimization
Abstract: We present a systematic methodology for optimizing batched matrix multiplications on the SW26010 many-core processor of the Sunway TaihuLight supercomputer. Five surrogate algorithms and a machine learning–based algorithm selector are proposed to fully exploit the computing capability of the SW26010 and cope with the sophisticated algorithm characteristics of batched matrix multiplications. Experimental results show that the algorithm selector is able to adaptively choose the appropriate algorithm for various matrix shapes and batch sizes with low overhead and high accuracy. In particular, the optimized batched matrix multiplications can substantially outperform the non-batched version and reach around 84.8% of the performance upper bound.
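For readers unfamiliar with the operation, the following is a minimal sketch of what a batched matrix multiplication computes: a set of independent products C[i] = A[i] * B[i] over a batch of (typically small) matrices. It illustrates only the semantics of the operation in plain Python. It is not one of the paper's five algorithms, it does not model the SW26010, and the names `matmul` and `batched_matmul` are assumptions introduced here for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop product of an m x k matrix and a k x n matrix."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def batched_matmul(a_batch: List[Matrix], b_batch: List[Matrix]) -> List[Matrix]:
    """Compute C[i] = A[i] * B[i] for every pair in the batch.

    This reference version simply loops over the batch, which corresponds
    to the non-batched baseline; an optimized implementation would process
    the independent products together.
    """
    if len(a_batch) != len(b_batch):
        raise ValueError("batch sizes differ")
    return [matmul(a, b) for a, b in zip(a_batch, b_batch)]
```

Because every product in the batch is independent, an implementation is free to choose how the work is grouped and scheduled, which is presumably why the best strategy depends on matrix shape and batch size as the abstract notes.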