Improving the DRAM Access Efficiency for Matrix Multiplication on Multicore Accelerators.

Sheng Ma, Yang Guo, Shenggang Chen, Libo Huang, Zhiying Wang
DOI: https://doi.org/10.23919/date.2019.8714915
2019-01-01
Abstract: The parallelization of matrix multiplication on multicore accelerators divides a matrix into several partitions. The existing design deploys an independent DMA transfer for each core to access its own partition from DRAM. This design has poor memory access efficiency, since memory access streams of multiple concurrent DMA transfers interfere with each other. We propose Distributed-DMA (D-DMA), which invokes one transfer to serve all cores. D-DMA accesses data in a row-major manner to efficiently exploit inter-partition locality to improve the DRAM access efficiency. Compared with a baseline design, D-DMA improves the bandwidth by 84.8% and reduces DRAM energy consumption by 43.1% for micro-benchmarks. It achieves higher performance for the GEMM benchmark. With much lower hardware cost, D-DMA significantly outperforms an out-of-order memory controller.
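The interference described in the abstract can be illustrated with a toy model. The sketch below is not the paper's evaluation setup: it assumes a single DRAM bank with one open row, partitions formed as column blocks of a row-major matrix, and per-core DMA bursts interleaved round-robin at the memory controller. All function names and parameters are hypothetical and chosen only for illustration.

```python
def row_activations(addresses, row_buffer_elems):
    """Count row activations for an address stream on a single-bank,
    open-row DRAM model (an assumed simplification)."""
    activations = 0
    open_row = None
    for addr in addresses:
        row = addr // row_buffer_elems
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def per_core_dma_stream(n_rows, n_cols, n_cores, burst):
    """Baseline sketch: each core's DMA walks its own column block, and the
    controller is assumed to serve the streams round-robin, one burst each."""
    width = n_cols // n_cores
    streams = [
        [r * n_cols + core * width + c
         for r in range(n_rows) for c in range(width)]
        for core in range(n_cores)
    ]
    merged = []
    for start in range(0, len(streams[0]), burst):
        for stream in streams:
            merged.extend(stream[start:start + burst])
    return merged


def single_row_major_stream(n_rows, n_cols):
    """D-DMA-style sketch: one transfer walks the whole matrix in row-major
    order; delivery of each element to its owning core is not modeled."""
    return list(range(n_rows * n_cols))
```

With a 4 x 256 matrix, 4 cores, 16-element bursts, and a 64-element row buffer, this model yields 64 row activations for the interleaved per-core streams and 16 for the single row-major stream, although both touch exactly the same addresses. The gap comes purely from access order, which is the inter-partition locality the abstract refers to.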