DNNVM - End-to-End Compiler Leveraging Operation Fusion on FPGA-based CNN Accelerators.

Yu Xing, Shuang Liang, Lingzhi Sui, Zhen Zhang, Jiantao Qiu, Xijie Jia, Xin Liu, Yushun Wang, Yi Shan, Yu Wang
DOI: https://doi.org/10.1145/3289602.3293972
2019-01-01
Abstract: In recent years, Convolutional Neural Networks (CNNs) have become the state-of-the-art method in a wide range of Artificial Intelligence (AI) domains. Increasingly large and complex CNN models are both computation bound and I/O bound. FPGA-based accelerators driven by a custom Instruction Set Architecture (ISA) achieve a balance between generality and efficiency, and leave much room for optimization. Operation fusion, which fuses adjacent operations without saving intermediate results back to off-chip DDR, can greatly alleviate bandwidth pressure, and operations can be executed by different computation engines concurrently to hide latency. To leverage these optimizations, especially operation fusion, on custom instruction-based accelerators, we propose a full-stack compiler, DNNVM (Deep Neural Network Virtual Machine). DNNVM integrates optimizers for framework-independent computing graphs, loops, and data layouts, an assembler, a runtime supporter, and a validation environment. DNNVM works in the context of deep learning frameworks and transforms CNN models into a directed acyclic graph, XGraph. After analyzing the interaction among fusion depth, tiling across multiple stages, and on-chip memory capacity, DNNVM uses a subgraph isomorphism algorithm to enumerate all potentially profitable fusion opportunities on XGraph according to custom fusion templates. In addition, DNNVM searches for the optimal execution strategies with a heuristic shortest-path algorithm. On Xilinx ZU2@330MHz, we achieve up to 1.26x speedup over naïve implementations without fusion on GoogLeNet. On Xilinx ZU9@330MHz, we achieve a throughput of 2.82 TOPs/s for VGG and 1.38 TOPs/s for ResNet50, the fastest ever reported on comparable FPGAs.
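The two-step idea in the abstract (enumerate fusion candidates from templates, then pick an execution strategy by shortest path) can be illustrated with a toy sketch. This is not DNNVM's implementation: it assumes a linear chain of operations rather than a general XGraph, so plain sequence matching stands in for subgraph isomorphism, and the templates, cost table, and per-edge saving are all hypothetical values chosen for illustration.

```python
from math import inf
from typing import Dict, List, Tuple

# Hypothetical fusion templates: op-type sequences one fused kernel could run.
TEMPLATES: List[Tuple[str, ...]] = [
    ("conv", "relu"),
    ("conv", "relu", "pool"),
    ("conv", "eltwise"),
]

# Hypothetical cost (arbitrary units) of running each op on its own,
# including writing its result back to off-chip memory.
UNFUSED_COST: Dict[str, float] = {
    "conv": 10.0, "relu": 3.0, "pool": 4.0, "eltwise": 3.0,
}

# Hypothetical saving for each intermediate tensor that stays on chip.
SAVING_PER_FUSED_EDGE = 2.5


def enumerate_fusions(ops: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans where ops[start:end] matches a template."""
    spans = []
    for start in range(len(ops)):
        for template in TEMPLATES:
            end = start + len(template)
            if tuple(ops[start:end]) == template:
                spans.append((start, end))
    return spans


def span_cost(ops: List[str], start: int, end: int) -> float:
    """Cost of running ops[start:end] as one group (fused if longer than 1)."""
    base = sum(UNFUSED_COST[op] for op in ops[start:end])
    return base - SAVING_PER_FUSED_EDGE * (end - start - 1)


def best_schedule(ops: List[str]) -> Tuple[float, List[List[str]]]:
    """Shortest path over chain positions 0..n; each edge is one group."""
    n = len(ops)
    candidates = [(i, i + 1) for i in range(n)] + enumerate_fusions(ops)
    best = [inf] * (n + 1)
    prev = [-1] * (n + 1)
    best[0] = 0.0
    # All edges point forward, so visiting end positions in order suffices.
    for end in range(1, n + 1):
        for start, stop in candidates:
            if stop != end:
                continue
            cost = best[start] + span_cost(ops, start, stop)
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    # Walk the predecessor links back to position 0 to recover the groups.
    groups, pos = [], n
    while pos > 0:
        groups.append(ops[prev[pos]:pos])
        pos = prev[pos]
    groups.reverse()
    return best[n], groups


if __name__ == "__main__":
    chain = ["conv", "relu", "pool", "conv", "eltwise"]
    total, groups = best_schedule(chain)
    print(total, groups)
```

With these made-up numbers, the chain `conv, relu, pool, conv, eltwise` is scheduled as two fused groups. A real compiler would also have to check that each fused group's tiles fit in on-chip memory, which this sketch omits.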