Abstract: In recent years, tensor computation has become a promising tool for big data analysis, machine learning, medical imaging, and EDA problems. To ease the memory and computation demands of tensor processing, decomposition techniques, especially tensor-train decomposition (TTD), are widely adopted to compress extremely high-dimensional tensor data. Despite TTD's potential to break the curse of dimensionality, researchers have not yet leveraged its full computational potential, mainly for two reasons: 1) executing TTD itself is time- and energy-consuming because of the singular value decomposition (SVD) operation inside each TTD iteration and 2) additional software/hardware optimizations are often required to process the resulting TT-format data in certain applications such as deep learning inference. In this article, we address these challenges with two approaches. First, we propose an algorithm-hardware co-design with a customized architecture, namely, TTD Engine, to accelerate TTD. We use MRI image compression as a demo application to illustrate the efficacy of the proposed accelerator. Second, we present a case study demonstrating the benefit of TT-format data processing and the efficacy of using TTD Engine. In the case study, we use the TT approach to realize the convolution operation, which is nontrivial for TT-format data. Experimental results show that TTD Engine achieves, on average, $14.9\times$ to $36.9\times$ speedup over CPU implementations and $4.1\times$ to $9.9\times$ speedup over the GPU baseline. Energy efficiency is also improved by at least $14.4\times$ and $5.4\times$ over CPU and GPU, respectively. Moreover, our hardware-enabled TT-format data processing further leads to more efficient implementations of complicated operations and applications.
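To make the compression claim concrete, the following minimal sketch counts how many entries a TT representation stores compared with the full tensor. It assumes the standard TT format, in which a $d$-way tensor with mode sizes $n_1,\dots,n_d$ is held as cores of shape $(r_{k-1}, n_k, r_k)$ with $r_0 = r_d = 1$. The function names and the example shape and ranks are illustrative assumptions, not values taken from the article.

```python
from math import prod


def full_storage(shape):
    """Entries in the uncompressed tensor: n_1 * n_2 * ... * n_d."""
    return prod(shape)


def tt_storage(shape, ranks):
    """Entries held by TT cores, where core k has shape (r_{k-1}, n_k, r_k).

    `ranks` lists r_0..r_d, so it has len(shape) + 1 entries and both
    boundary ranks must equal 1.
    """
    if len(ranks) != len(shape) + 1 or ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("ranks must have len(shape) + 1 entries with boundary ranks 1")
    return sum(ranks[k] * n * ranks[k + 1] for k, n in enumerate(shape))


# Hypothetical example: a 10-way tensor with every mode of size 4
# and all internal TT ranks fixed at 8 (assumed, not from the article).
shape = (4,) * 10
ranks = (1,) + (8,) * 9 + (1,)

full = full_storage(shape)
tt = tt_storage(shape, ranks)
print(f"full: {full} entries, TT: {tt} entries, ratio: {full / tt:.1f}x")
```

With bounded ranks, the TT entry count grows linearly in the number of modes while the full tensor grows exponentially, which is the sense in which TTD sidesteps the curse of dimensionality. The ranks actually reached in practice depend on the data and on the truncation tolerance used in each SVD step.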

Sparse Tucker Tensor Decomposition on a Hybrid FPGA-CPU Platform

Tucker Tensor Decomposition on FPGA

cuFasterTucker: A Stochastic Optimization Strategy for Parallel Sparse FastTucker Decomposition on GPU Platform

cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores

A-Tucker: Fast Input-Adaptive and Matricization-Free Tucker Decomposition of Higher-Order Tensors on GPUs

a-Tucker: Input-Adaptive and Matricization-Free Tucker Decomposition for Dense Tensors on CPUs and GPUs

A New Hybrid GPU-CPU Sparse LDLT Factorization Algorithm with GPU and CPU Factorizing Concurrently

Scalable Tucker Factorization for Sparse Tensors - Algorithms and Discoveries

Hardware-Efficient Mixed-Precision CP Tensor Decomposition

Efficient Processing of Sparse Tensor Decomposition via Unified Abstraction and PE-Interactive Architecture

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Hardware-Enabled Efficient Data Processing with Tensor-Train Decomposition

A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

High-Performance Tensor Learning Primitives Using GPU Tensor Cores

Performance Optimization for Sparse $A^{T}Ax$ in Parallel on Multicore CPU

High Performance Hardware Architecture for Singular Spectrum Analysis of Hankel Tensors

High-Performance Tensor-Train Primitives Using GPU Tensor Cores

SGD_Tucker: A Novel Stochastic Optimization Strategy for Scalable Parallel Sparse Tucker Decomposition

Efficient Computation of Tucker Decomposition for Streaming Scientific Data Compression