Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Mohammed Elbtity, Peyton Chandarana, Ramtin Zand
2024-07-12
Abstract: Tensor processing units (TPUs) are among the best-known machine learning (ML) accelerators, used at large scale in data centers as well as in tiny ML applications. TPUs offer several advantages over conventional ML accelerators such as graphics processing units (GPUs), because they are designed specifically to perform the multiply-accumulate (MAC) operations required by the matrix-matrix and matrix-vector multiplications that dominate the execution of deep neural networks (DNNs). These advantages include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow: input stationary, output stationary, or weight stationary. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. This work therefore develops a reconfigurable-dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer at run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that the Flex-TPU design achieves a significant performance increase of up to 2.75x compared to a conventional TPU, with only minor area and power overheads.
Hardware Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing; Machine Learning; Performance
What problem does this paper attempt to address?
The paper addresses a performance limitation of current Tensor Processing Units (TPUs) when executing Deep Neural Network (DNN) workloads: their single, static dataflow architecture. Existing TPU designs typically support only one fixed dataflow pattern (input stationary, output stationary, or weight stationary), and a single pattern may not be optimal for every DNN layer. To solve this problem, the paper proposes a new architecture called Flex-TPU, which can change the dataflow pattern for each layer at runtime, thereby improving the overall performance of the TPU.

The main contributions of the paper are:
1. Modifying the microarchitecture of the Processing Element (PE) to support runtime reconfigurable dataflow.
2. Integrating the modified processing elements into a fully functional TPU.
3. Conducting extensive experimental validation to demonstrate the effectiveness and performance improvement of the Flex-TPU design.

With these changes, Flex-TPU achieves up to a 2.75x performance improvement over a conventional TPU, with only minor area and power overheads.
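To make the idea of per-layer dataflow selection concrete, the following is a minimal sketch, not the paper's method or implementation. It assumes a hypothetical table of cycle counts per layer for each fixed dataflow (IS, OS, WS); the numbers are invented for illustration and are not results from the paper. It also assumes that switching dataflow between layers costs nothing, which the summary above does not state. The sketch compares the best single static dataflow against choosing the cheapest dataflow for each layer.

```python
# Hypothetical per-layer cycle counts for each fixed dataflow.
# These values are made up for illustration; they are NOT from the paper.
# IS = input stationary, OS = output stationary, WS = weight stationary.
LAYER_CYCLES = [
    {"IS": 120, "OS": 100, "WS": 150},
    {"IS": 90, "OS": 140, "WS": 80},
    {"IS": 60, "OS": 75, "WS": 110},
]


def static_cycles(layers, dataflow):
    """Total cycles when every layer runs with the same fixed dataflow."""
    return sum(layer[dataflow] for layer in layers)


def best_static(layers):
    """The single dataflow that minimizes total cycles across all layers."""
    return min(layers[0].keys(), key=lambda d: static_cycles(layers, d))


def flexible_schedule(layers):
    """Cheapest dataflow for each layer, chosen independently per layer."""
    return [min(layer, key=layer.get) for layer in layers]


def flexible_cycles(layers):
    """Total cycles with per-layer selection (assumes free reconfiguration)."""
    return sum(min(layer.values()) for layer in layers)


if __name__ == "__main__":
    fixed = best_static(LAYER_CYCLES)
    fixed_total = static_cycles(LAYER_CYCLES, fixed)
    flex_total = flexible_cycles(LAYER_CYCLES)
    print("best static dataflow:", fixed, fixed_total)
    print("per-layer schedule:", flexible_schedule(LAYER_CYCLES), flex_total)
    print("speedup:", fixed_total / flex_total)
```

Under this toy model, per-layer selection can never be slower than the best static choice, because each layer's minimum is at most that layer's cost under any one fixed dataflow. How large the gain is in practice depends on the real per-layer costs and on any reconfiguration overhead, both of which this sketch leaves out.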