A configurable multiplex data transfer model for asynchronous and heterogeneous FPGA accelerators on single DMA device

Zhangqin Huang, Shuo Zhang, Han Gao, Xiaobo Zhang, Shengqi Yang
DOI: https://doi.org/10.1016/j.micpro.2020.103174
IF: 3.503
2020-09-01
Microprocessors and Microsystems
Abstract: To reduce DMA utilization for multiple algorithm IPs on an FPGA, a channel-configurable, multiplexed DMA device (CMDMA) is proposed for asynchronous and heterogeneous algorithm IPs. First, we abstract the entities and data flow in the CMDMA system with a formal description for function definition and workflow analysis. Then, based on the functions and workflow, we design and implement a prototype of CMDMA, which includes the CMDMA software driver (SW) and the hardware circuits (HW) of one DMA IP, a configurable input switch (CISwitch), algorithm IPs, and an asynchronous output switch (AOSwitch). The configurable function of CMDMA is implemented at the HW level by CISwitch through a configuration port, and a configurable Round-Robin (CRR) algorithm is proposed to schedule channels and input data at the SW level. For output, a channel-distinguishable output buffer (ChnDistBuf) is proposed, which can deliver the channel ID and data size to SW before an algorithm IP finishes. With a double-interrupt coordination method involving both ChnDistBuf and the algorithm IPs, CMDMA is able to successively store complete output data from different algorithm IPs. Experiments with 4 heterogeneous matrix multiplication algorithm IPs on a Xilinx Zynq platform show that CMDMA improves the average acceleration rate of a single algorithm IP by about 8%-29% compared with the exclusive method, in which one DMA serves only one algorithm IP, and that it increases DMA input and output data throughput by about 10-40 MB/s and 5-15 MB/s, respectively, with multiple algorithm IPs running in parallel. Moreover, the additional LUT and FF resources used by CMDMA are 756 and 1219, each about 1% of the Zynq platform's resources.
In addition, in a test with two CNN algorithm IPs on an MNIST application, an enhanced data-broadcasting function in CMDMA is 4 s faster than a system with 4 exclusive DMAs running in parallel, while using 3 fewer DMAs and 0.03 W less power.
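The abstract names a configurable Round-Robin (CRR) algorithm for channel and input-data scheduling but does not give its details. The following is only a minimal sketch of the general idea, assuming a generic round-robin scheduler with a configurable channel order and per-channel quota; it is not the authors' CRR. The function name, the queue layout, and the `order` and `quota` parameters are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterator, List, Tuple


def configurable_round_robin(
    queues: Dict[int, Deque[bytes]],
    order: List[int],
    quota: Dict[int, int],
) -> Iterator[Tuple[int, bytes]]:
    """Yield (channel_id, chunk) pairs in round-robin order.

    queues: pending input chunks per channel (hypothetical layout).
    order:  enabled channels, in the order they are visited.
    quota:  max chunks a channel may send per turn (defaults to 1).
    """
    # Keep cycling until every enabled channel has drained its queue.
    while any(queues[ch] for ch in order):
        for ch in order:
            # A quota below 1 is clamped to 1 so the loop always makes progress.
            for _ in range(max(1, quota.get(ch, 1))):
                if not queues[ch]:
                    break  # nothing pending on this channel, move on
                yield ch, queues[ch].popleft()


# Three channels share one (simulated) DMA input path.
queues = {
    0: deque([b"a0", b"a1", b"a2"]),
    1: deque([b"b0"]),
    2: deque([b"c0", b"c1"]),
}
schedule = list(configurable_round_robin(queues, order=[0, 2, 1], quota={0: 2}))
```

Here channel 0 is configured to send two chunks per turn and the visiting order is 0, 2, 1, so changing `order` or `quota` reconfigures the schedule without touching the scheduler itself.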
computer science, theory & methods,engineering, electrical & electronic, hardware & architecture
What problem does this paper attempt to address?