An Intermediate-Centric Dataflow for Transposed Convolution Acceleration on FPGA

Zhengzheng Ma,Tuo Dai,Xuechao Wei,Guojie Luo
DOI: https://doi.org/10.1145/3561053
2023-01-01
ACM Transactions on Embedded Computing Systems
Abstract:Transposed convolution has been prevailing in convolutional neural networks (CNNs), playing an important role in multiple scenarios such as image segmentation and back-propagation process of training CNNs. This mainly benefits from the ability to up-sample the input feature maps by interpolating new information from the input feature pixels. However, the backward-stencil computation constrains its performance and hindered its wide application in diverse platforms. Moreover, in contrast to the efforts on accelerating the convolution, there is a rare investigation on the acceleration of transposed convolution that is identically compute-intensive as the former. For acceleration of transposed convolution, we propose an intermediate-centric dataflow scheme, in which we decouple the generation of the intermediate patch from its further process, aim at efficiently performing the backward-stencil computation . The intermediate-centric dataflow breaks the transposed convolution into several phases/stages, achieving feeding the input feature maps and performing the backward-stencil computation in a pipelining manner. It also provides four-degree computation parallelism and efficient data reuse of input feature maps/weights. Furthermore, we also theoretically analyze the irregular data dependence leveraging the polyhedral model, which constrains the parallel computing of transposed convolution. Additionally, we devise an optimization problem to explore the design space and automatically generate the optimal design configurations for different transposed convolutional layers and hardware platforms. By selecting the representative transposed convolutional layers from DCGAN, FSRCNN, and FCN, we generate the corresponding accelerator arrays of intermediate-centric dataflow on the Xilinx Alveo U200 platform and reach the performance of 3.92 TOPS, 2.72 TOPS, and 4.76 TOPS, respectively.
What problem does this paper attempt to address?