CNN-DMA: A Predictable and Scalable Direct Memory Access Engine for Convolutional Neural Network with Sliding-window Filtering.

Zheng Wang, Zhuo Wang, Jian Liao, Chao Chen, Yongkui Yang, Bo Dong, Weiguang Chen, Wenxuan Chen, Ming Lei, Weiyu Guo, Rui Chen, Yi Peng, Zhibin Yu
DOI: https://doi.org/10.1145/3453688.3461496
2021-01-01
Abstract: Memory bandwidth utilization has become the key performance bottleneck for state-of-the-art variants of neural network kernels. Current structures such as depth-wise, point-wise and atrous convolutions introduce diverse and discontinuous memory access patterns, which hinder efficient activation supply through more frequent cache misses and, consequently, high-penalty DRAM pre-charging. To handle this, GPUs achieve efficient parallelization through sophisticated optimization of CUDA programs to reduce memory footprints, which demands high engineering effort. In this work, we instead propose a programmable direct memory access engine for convolutional neural networks (CNN-DMA) that supports a fast supply of activations to independent and scalable computing units. The CNN-DMA favours a predictable activation streaming approach which completely avoids penalties caused by bus contention, cache misses and less carefully designed low-level programs. Furthermore, we enhance the baseline DMA with the capability of out-of-order data supply to filter out unique sliding-windows, boosting the performance of the computing infrastructure. Experiments on state-of-the-art neural networks show that CNN-DMA achieves optimal DRAM access efficiency for point-wise convolution layers, while reducing the rounds of computation by 30% to 70% with sliding-window filtering.
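The sliding-window filtering idea mentioned in the abstract can be pictured in software with a small sketch. It rests on an assumption the abstract does not spell out: that "filtering out unique sliding-windows" means forwarding only distinct windows to the computing units, so that repeated windows cost no extra round of computation. The function name `unique_window_stats` and its parameters are hypothetical, and the code is only an illustration of the counting argument, not the CNN-DMA hardware design.

```python
def unique_window_stats(activation, k, stride=1):
    """Count total and distinct k x k sliding windows in a 2-D activation map.

    `activation` is a list of equal-length rows. This is an illustrative
    software model (an assumption), not the paper's DMA engine.
    """
    h, w = len(activation), len(activation[0])
    seen = set()   # distinct windows encountered so far
    total = 0      # every window position, duplicates included
    for r in range(0, h - k + 1, stride):
        for c in range(0, w - k + 1, stride):
            # Freeze the window into a hashable tuple of tuples.
            window = tuple(tuple(activation[r + i][c:c + k]) for i in range(k))
            seen.add(window)
            total += 1
    return total, len(seen)


# A sparse 6 x 6 map, loosely resembling a post-ReLU activation with many zeros.
sparse_map = [[0] * 6 for _ in range(6)]
sparse_map[0][0] = 1

total, unique = unique_window_stats(sparse_map, k=3)
print(f"{total} windows, {unique} unique, "
      f"{100 * (1 - unique / total):.1f}% of rounds avoidable")
```

In this toy input most windows are identical, so the saving is far larger than on real layers; the 30% to 70% range reported in the abstract comes from the authors' experiments, not from this sketch.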