High performance dilated convolutions on multi-core DSPs

Yang Wang, Qinglin Wang, Xiangdong Pei, Songzhu Mei, Rongchun Li, Jie Liu
DOI: https://doi.org/10.1007/s42514-023-00166-8
2023-09-09
CCF Transactions on High Performance Computing
Abstract: Dilated convolutions are widely used in deep learning applications such as semantic segmentation and object detection to obtain wide receptive fields while preserving the resolution of feature maps. However, data locality in dilated convolutions deteriorates rapidly as the dilation rate increases, which poses a great challenge to high-performance direct implementations of convolutions. Multi-core digital signal processors (DSPs) with software-controlled on-chip memories allow programmers to move data between on-chip and off-chip memories explicitly, which may make them well suited to the direct implementation of dilated convolutions. In this paper, we introduce a high-performance parallel direct implementation of dilated convolutions on multi-core DSPs in a CPU-DSP heterogeneous prototype processor, which can effectively capture the data locality in dilated convolutions. The experimental results demonstrate that the direct implementation achieves much better performance than GEMM-based ones on multi-core DSPs for all the tested layers, and much higher efficiency than the high-performance libraries on three other architectures in cases with large feature maps. In addition, the direct implementation exhibits good scalability.
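To make the locality issue concrete, the following is a minimal reference sketch of a direct 2D dilated convolution in plain Python. It is not the authors' DSP implementation: the function names `effective_kernel_size` and `dilated_conv2d`, the single-channel layout, stride 1, and the absence of padding are all assumptions chosen for illustration. It shows that a K x K kernel with dilation rate d covers a span of d(K - 1) + 1 input elements per axis, and that neighbouring kernel taps read inputs d elements apart.

```python
def effective_kernel_size(k, dilation):
    """Span of input (per axis) covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1


def dilated_conv2d(x, w, dilation=1):
    """Naive direct 2D dilated convolution (cross-correlation form).

    Assumptions for this sketch: single channel, stride 1, no padding.
    x is an H x W input and w a K x K kernel, both lists of lists.
    """
    height, width = len(x), len(x[0])
    k = len(w)
    span = effective_kernel_size(k, dilation)
    out_h, out_w = height - span + 1, width - span + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for r in range(k):
                for s in range(k):
                    # Adjacent kernel taps read inputs `dilation` elements apart,
                    # so the accessed addresses spread out as the rate grows.
                    acc += x[i + r * dilation][j + s * dilation] * w[r][s]
            out[i][j] = acc
    return out


# Usage: a 5 x 5 input whose value at (i, j) is 5 * i + j, with a 3 x 3 kernel of ones.
x = [[float(5 * i + j) for j in range(5)] for i in range(5)]
w = [[1.0] * 3 for _ in range(3)]
dense = dilated_conv2d(x, w, dilation=1)    # 3 x 3 output
dilated = dilated_conv2d(x, w, dilation=2)  # 1 x 1 output, taps at rows/cols 0, 2, 4
```

With dilation 2, the same nine weights cover the whole 5 x 5 input, which is how the receptive field widens without extra parameters; the cost is that each tap touches a different, non-adjacent part of the input.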