Abstract: In recent years, Convolutional Neural Networks (CNNs) have become the state-of-the-art method in a wide range of Artificial Intelligence (AI) domains. Increasingly large and complex CNN models are both computation bound and I/O bound. FPGA-based accelerators driven by a custom Instruction Set Architecture (ISA) strike a balance between generality and efficiency, and leave much room for optimization. Operation fusion, which fuses adjacent operations without saving intermediate results back to off-chip DDR, can greatly alleviate bandwidth pressure, and operations can be executed concurrently by different computation engines to hide latency. To leverage these optimizations, especially operation fusion, on custom instruction-based accelerators, we propose a full-stack compiler, DNNVM (Deep Neural Network Virtual Machine). DNNVM integrates optimizers for framework-independent computing graphs, loops, and data layouts with an assembler, a runtime supporter, and a validation environment. DNNVM works in the context of deep learning frameworks and transforms CNN models into a directed acyclic graph, XGraph. After analyzing the interaction among fusion depth, tiling across multiple stages, and on-chip memory capacity, DNNVM uses a subgraph isomorphism algorithm to enumerate all potentially profitable fusion opportunities on XGraph according to custom fusion templates. In addition, DNNVM searches for the optimal execution strategies with a heuristic shortest-path algorithm. On Xilinx [email protected], we achieve up to a 1.26x speedup on GoogLeNet over naïve implementations without fusion. On Xilinx [email protected], we achieve a throughput of 2.82 TOPs/s for VGG and 1.38 TOPs/s for ResNet50, the fastest ever reported on comparable FPGAs.
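To make the two search steps in the abstract concrete, the following is a minimal sketch and not DNNVM's implementation. It assumes a linear chain of operations rather than a general XGraph, replaces subgraph isomorphism with plain sequence matching against templates, and replaces the heuristic shortest-path search with an exact shortest path over a small DAG. The template set, the cost model (off-chip bytes read plus written per fused group), and all function names are hypothetical.

```python
# Hypothetical fusion templates: op-type sequences that one fused kernel could run.
TEMPLATES = {
    ("conv",), ("relu",), ("pool",),      # unfused fallbacks
    ("conv", "relu"),
    ("conv", "relu", "pool"),
}

def fusion_candidates(ops):
    """Return every (start, end) slice of the op chain that matches a template."""
    return [
        (i, i + len(t))
        for i in range(len(ops))
        for t in TEMPLATES
        if tuple(ops[i:i + len(t)]) == t
    ]

def group_cost(sizes, start, end):
    """Assumed cost model: a fused group reads its input tensor from DDR and
    writes only its final output; intermediate tensors stay on-chip."""
    return sizes[start] + sizes[end]

def best_schedule(ops, sizes):
    """Shortest path from position 0 to len(ops), where each candidate group
    (start, end) is an edge weighted by its off-chip traffic."""
    n = len(ops)
    best = [float("inf")] * (n + 1)
    prev = [None] * (n + 1)
    best[0] = 0
    # Edges only go forward, so visiting them sorted by start is a topological order.
    for start, end in sorted(fusion_candidates(ops)):
        cost = best[start] + group_cost(sizes, start, end)
        if cost < best[end]:
            best[end], prev[end] = cost, start
    if best[n] == float("inf"):
        raise ValueError("no template covers the whole chain")
    groups, pos = [], n
    while pos > 0:                        # walk back along the chosen edges
        groups.append((prev[pos], pos))
        pos = prev[pos]
    return best[n], groups[::-1]

if __name__ == "__main__":
    ops = ["conv", "relu", "pool", "conv", "relu"]
    # sizes[i] is the tensor feeding op i; sizes[-1] is the final output (arbitrary units).
    sizes = [6, 8, 8, 2, 4, 4]
    print(best_schedule(ops, sizes))
```

In this toy setting, running every operation unfused moves 54 units off-chip, while the selected schedule fuses the chain into two groups and moves 14. A real compiler would also have to check that each fused group's tiles fit in on-chip memory, which this sketch ignores.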

DNNVM - End-to-End Compiler Leveraging Operation Fusion on FPGA-based CNN Accelerators.
