Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Mohammed Elbtity, Peyton Chandarana, Ramtin Zand

2024-07-12

Abstract: Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, like graphics processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, the current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer during run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75x compared to a conventional TPU, with only minor area and power overheads.

Hardware Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; Machine Learning; Performance

What problem does this paper attempt to address?

The paper addresses the performance limitations that current Tensor Processing Units (TPUs) face when running Deep Neural Network (DNN) workloads because of their single, static dataflow architecture. Existing TPU designs typically support only one fixed dataflow (input stationary, output stationary, or weight stationary), and no single choice is optimal for every DNN layer. To solve this, the paper proposes a new architecture called Flex-TPU, which can change the dataflow for each layer at runtime and thereby improve the overall performance of the TPU.

The main contributions of the paper are:

1. Modifying the microarchitecture of the Processing Element (PE) to support runtime reconfigurable dataflow.
2. Integrating the modified processing elements into a fully functional TPU.
3. Conducting extensive experimental validation to demonstrate the effectiveness and performance improvement of the Flex-TPU design.

With these changes, Flex-TPU achieves a performance improvement of up to 2.75x over a conventional TPU, with only minor area and power overheads. The result is relevant to the design and application of future TPUs.
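To make the benefit of per-layer dataflow selection concrete, here is a minimal Python sketch. It is not the paper's method or data: the cycle counts in `EXAMPLE_LAYERS` are hypothetical, the function names are invented for illustration, and the cost of switching dataflows between layers is assumed to be zero. It only shows the arithmetic of comparing the best single static dataflow against choosing the cheapest dataflow for each layer.

```python
from typing import Dict, List, Tuple

# Input stationary, output stationary, weight stationary.
DATAFLOWS = ("IS", "OS", "WS")

# HYPOTHETICAL cycle counts per layer and dataflow, for illustration only.
# These are not measurements from the paper.
EXAMPLE_LAYERS: List[Dict[str, int]] = [
    {"IS": 900, "OS": 400, "WS": 700},
    {"IS": 300, "OS": 800, "WS": 500},
    {"IS": 600, "OS": 650, "WS": 200},
]


def best_static(layers: List[Dict[str, int]]) -> Tuple[str, int]:
    """Return the single dataflow with the lowest total cycles, and that total."""
    totals = {df: sum(layer[df] for layer in layers) for df in DATAFLOWS}
    best = min(DATAFLOWS, key=totals.get)
    return best, totals[best]


def per_layer_schedule(layers: List[Dict[str, int]]) -> Tuple[List[str], int]:
    """Pick the cheapest dataflow for each layer independently.

    Assumes switching dataflows between layers costs nothing.
    """
    schedule = [min(DATAFLOWS, key=layer.get) for layer in layers]
    total = sum(layer[df] for layer, df in zip(layers, schedule))
    return schedule, total


if __name__ == "__main__":
    static_df, static_cycles = best_static(EXAMPLE_LAYERS)
    schedule, flex_cycles = per_layer_schedule(EXAMPLE_LAYERS)
    print(f"best static dataflow: {static_df} ({static_cycles} cycles)")
    print(f"per-layer schedule:   {schedule} ({flex_cycles} cycles)")
    print(f"speedup:              {static_cycles / flex_cycles:.2f}x")
```

With these made-up numbers, the best static choice is weight stationary at 1400 cycles, while the per-layer schedule takes 900 cycles. The per-layer total can never exceed the best static total under the zero-switching-cost assumption, since each layer takes the minimum over the same three options.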

FlexTensor

Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads

Heterogeneous Integration of In-Memory Analog Computing Architectures with Tensor Processing Units

GPTPU: Accelerating Applications using Edge Tensor Processing Units

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Towards Power Efficient DNN Accelerator Design on Reconfigurable Platform

High-resolution imaging on TPUs

A Carbon-Nanotube-based Tensor Processing Unit

Exploration of TPUs for AI Applications

FlexNN: A Dataflow-aware Flexible Deep Learning Accelerator for Energy-Efficient Edge Devices

FlexPDA: A Flexible Programming Framework for Deep Learning Accelerators

FlexiBit: Fully Flexible Precision Bit-parallel Accelerator Architecture for Arbitrary Mixed Precision AI

A Reconfigurable Processing Element for Multiple-Precision Floating/Fixed-Point HPC

H3D-Transformer: A Heterogeneous 3D (H3D) Computing Platform for Transformer Model Acceleration on Edge Devices

Hardware Acceleration of Explainable Machine Learning using Tensor Processing Units

A Reconfigurable Multiple-Precision Floating-Point Dot Product Unit for High-Performance Computing

Flextron: Many-in-One Flexible Large Language Model

FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor Contraction

TPE: A High-Performance Edge-Device Inference with Multi-level Transformational Mechanism

FlexBlock: A Flexible DNN Training Accelerator with Multi-Mode Block Floating Point Support