Abstract: Tensor learning is a powerful tool for big data analytics and machine learning, e.g., gene analysis and deep learning. However, tensor learning algorithms are compute-intensive because their time and space complexities grow exponentially with the order of the tensors, which hinders their application. In this paper, we exploit the parallelism of tensor learning primitives using GPU tensor cores and develop high-performance tensor learning algorithms. First, we propose novel hardware-oriented optimization strategies for tensor learning primitives on GPU tensor cores. Second, for big data analytics, we employ the optimized tensor learning primitives to accelerate the CP tensor decomposition and then apply it to gene analysis. Third, we optimize the Tucker tensor decomposition and propose a novel Tucker tensor layer to compress deep neural networks. We train the neural networks with natural gradients, which involve only a forward pass without backpropagation and are thus suitable for GPU computations. Compared with the TensorLab and TensorLy libraries on an A100 GPU, our third-order CP tensor decomposition achieves up to 16.32× and 32.25× speedups, respectively, and our third-order Tucker tensor decomposition achieves 6.09× and 6.72× speedups. The proposed fourth-order CP and Tucker tensor decompositions achieve up to 30.65× and 5.41× speedups over TensorLab. Our CP tensor decomposition for gene analysis achieves up to a 5.88× speedup over TensorLy. Compared with a conventional fully connected neural network, our Tucker tensor layer neural network achieves an accuracy of 97.9%, a speedup of

A-Tucker: Fast Input-Adaptive and Matricization-Free Tucker Decomposition of Higher-Order Tensors on GPUs

a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs

cuFasterTucker: A Stochastic Optimization Strategy for Parallel Sparse FastTucker Decomposition on GPU Platform

cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores

Tucker Tensor Decomposition on FPGA

Scalable Tucker Factorization for Sparse Tensors - Algorithms and Discoveries

Parallel Randomized Tucker Decomposition Algorithms

Sparse Tucker Tensor Decomposition on a Hybrid FPGA-CPU Platform

An Iterative Reweighted Method for Tucker Decomposition of Incomplete Multiway Tensors

DinTucker: Scaling up Gaussian process models on multidimensional arrays with billions of elements

SGD_Tucker: A Novel Stochastic Optimization Strategy for Scalable Parallel Sparse Tucker Decomposition

DinTucker: Scaling Up Gaussian Process Models On Large Multidimensional Arrays

High-Performance Tensor Learning Primitives Using GPU Tensor Cores

Adaptive Regularizing Tucker Decomposition for Knowledge Graph Completion

Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis

Scalable Symmetric Tucker Tensor Decomposition

ADA-Tucker: Compressing Deep Neural Networks via Adaptive Dimension Adjustment Tucker Decomposition

Efficient algorithms for Tucker decomposition via approximate matrix multiplication

Tucker tensor factor models: matricization and mode-wise PCA estimation

TTDFT: A GPU accelerated Tucker tensor DFT code for large-scale Kohn-Sham DFT calculations