High-Performance Tensor Learning Primitives Using GPU Tensor Cores

Xiao-Yang Liu, Zeliang Zhang, Zhiyuan Wang, Han Lu, Xiaodong Wang, Anwar Walid
DOI: https://doi.org/10.1109/tc.2022.3222955
IF: 3.183
2023-05-12
IEEE Transactions on Computers
Abstract: Tensor learning is a powerful tool for big data analytics and machine learning, e.g., gene analysis and deep learning. However, tensor learning algorithms are compute-intensive since their time and space complexities grow exponentially with the order of tensors, which hinders their application. In this paper, we exploit the parallelism of tensor learning primitives using GPU tensor cores and develop high-performance tensor learning algorithms. First, we propose novel hardware-oriented optimization strategies for tensor learning primitives on GPU tensor cores. Second, for big data analytics, we employ the optimized tensor learning primitives to accelerate the CP tensor decomposition and then apply it to gene analysis. Third, we optimize the Tucker tensor decomposition and propose a novel Tucker tensor layer to compress deep neural networks. We employ natural gradients to train the neural networks, which involves only a forward pass without backpropagation and is thus suitable for GPU computations. Compared with the TensorLab and TensorLy libraries on an A100 GPU, our third-order CP tensor decomposition achieves up to 16.32× and 32.25× speedups, and 6.09× and 6.72× speedups for our third-order Tucker tensor decomposition. The proposed fourth-order CP and Tucker tensor decompositions achieve up to 30.65× and 5.41× speedups over TensorLab. Our CP tensor decomposition for gene analysis achieves up to 5.88× speedup over TensorLy. Compared with a conventional fully connected neural network, our Tucker tensor layer neural network achieves an accuracy of 97.9%, a speedup of
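To make the CP tensor decomposition mentioned in the abstract concrete, the following is a minimal CPU sketch in NumPy of a third-order CP fit by alternating least squares (ALS). It is an illustration under stated assumptions, not the paper's GPU tensor core implementation: the abstract does not name the fitting algorithm, so the use of ALS, the function name `cp_als`, and its parameters are all hypothetical choices made for this example.

```python
import numpy as np


def cp_als(X, rank, n_iter=50, seed=0):
    """Fit a rank-`rank` CP model to a third-order tensor X.

    Hypothetical helper for illustration: plain alternating least
    squares on the CPU, with no GPU or tensor core optimizations.
    Returns the factor matrices A, B, C and the relative
    reconstruction error after each sweep.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    x_norm = np.linalg.norm(X)
    errs = []
    for _ in range(n_iter):
        # Each update is the least-squares solution for one factor with
        # the other two held fixed. The einsum contracts X with the two
        # fixed factors; the Hadamard product of their Gram matrices is
        # the corresponding normal-equations matrix.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        # Rebuild the tensor as a sum of `rank` outer products.
        X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
        errs.append(np.linalg.norm(X - X_hat) / x_norm)
    return A, B, C, errs


if __name__ == "__main__":
    # Build a synthetic tensor with known CP rank 3 and fit it.
    rng = np.random.default_rng(1)
    A0, B0, C0 = (rng.standard_normal((n, 3)) for n in (6, 5, 4))
    X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
    _, _, _, errs = cp_als(X, rank=3, n_iter=100)
    print("final relative error:", errs[-1])
```

The three `einsum` contractions dominate the cost of each sweep, which suggests why dense-matrix hardware is a natural fit for this kind of primitive; how the paper actually maps its primitives onto tensor cores is not described in the abstract.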
engineering, electrical & electronic; computer science, hardware & architecture