Coordinated DMA: Improving the DRAM Access Efficiency for Matrix Multiplication.

Sheng Ma, Zhong Liu, Shenggang Chen, Libo Huang, Yang Guo, Zhiying Wang, Meidi Zhang
DOI: https://doi.org/10.1109/tpds.2019.2906891
IF: 5.3
2019-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: High-performance implementation of matrix multiplication is essential for scientific computing. Memory access is likely to be the bottleneck of matrix multiplication. The widely used GotoBLAS GEMM implementation divides the entire matrix into several partitions that are assigned to different cores for parallelization. Traditionally, each core deploys a DMA transfer to access its own partition in DRAM. However, deploying an independent DMA transfer for each core cannot efficiently exploit the inter-core locality. In addition, multiple concurrent DMA transfers interfere with each other, further reducing the DRAM access efficiency. We observe that the same row of neighboring partitions lies in the same DRAM page, which means that there is significant locality inherent in the address layout. We propose the coordinated DMA to exploit this locality efficiently. It invokes one transfer to serve all cores and moves data in a row-major manner to improve the DRAM access efficiency. Compared with a baseline design, the coordinated DMA improves the bandwidth by 84.8 percent and reduces DRAM energy consumption by 43.1 percent for micro-benchmarks. It also achieves higher performance for the GEMM and Linpack benchmarks. With much lower hardware cost, the coordinated DMA significantly outperforms an out-of-order memory controller.
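The locality argument in the abstract can be illustrated with a toy model. The sketch below is not the authors' design or simulator. It assumes a single DRAM bank with an open-page policy, a row-major matrix split column-wise across cores, and per-core transfers served back to back rather than truly concurrently. All function names and parameters are hypothetical and chosen only for illustration.

```python
def count_page_activations(addresses, page_bytes):
    """Count page openings for an address stream (toy single-bank, open-page model)."""
    activations = 0
    open_page = None
    for addr in addresses:
        page = addr // page_bytes
        if page != open_page:
            # The requested page differs from the open one, so it must be activated.
            activations += 1
            open_page = page
    return activations


def per_core_stream(rows, cols, cores, elem_bytes):
    """Byte addresses when each core's transfer fetches its own column partition.

    Assumption: the transfers are served one after another, which is a
    simplification of the concurrent transfers described in the abstract.
    """
    width = cols // cores
    for core in range(cores):
        for r in range(rows):
            for c in range(core * width, (core + 1) * width):
                yield (r * cols + c) * elem_bytes


def coordinated_stream(rows, cols, elem_bytes):
    """Byte addresses when one transfer walks the whole matrix in row-major order."""
    for r in range(rows):
        for c in range(cols):
            yield (r * cols + c) * elem_bytes


# Hypothetical sizes: one matrix row (256 * 8 bytes) fills exactly one 2 KiB page.
ROWS, COLS, CORES, ELEM_BYTES, PAGE_BYTES = 64, 256, 4, 8, 2048

per_core = count_page_activations(
    per_core_stream(ROWS, COLS, CORES, ELEM_BYTES), PAGE_BYTES)
coordinated = count_page_activations(
    coordinated_stream(ROWS, COLS, ELEM_BYTES), PAGE_BYTES)
print(per_core, coordinated)
```

Under these assumptions, the per-core order reopens every page once per core (256 activations), while the row-major order opens each page once (64 activations). Real DRAM has multiple banks, and a memory controller may reorder requests, so the numbers only indicate the direction of the effect.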