Abstract: Current deep neural network (DNN) architectures pay increasing attention to reducing the number of parameters and operations so that they can be deployed on embedded and IoT platforms. As a consequence, the intermediate feature maps of such lightweight networks grow relative to the weights and often exceed the on-chip memory, becoming the new bottleneck and causing considerable power-consuming off-chip memory accesses. To reduce these feature-induced memory accesses, operator fusion has been proposed to parallelize the execution of multiple convolutional layers, and it has been shown to significantly reduce off-chip memory accesses. However, how to fuse the neural operators remains a challenging problem that depends heavily on both the neural network (NN) topology and the specific DNN accelerator configuration. In this work, we observe that prior operator fusion approaches fail to guarantee memory-level optimality because they search only a constrained operator fusion design space. Considering the complexity of NN topologies and the constrained resources of DNN accelerators, we develop a novel operator fusion framework, Optimus. Optimus includes an accurate memory cost model, dedicated to the scheduler, for evaluating candidate operator-fusion schemes, and a directed acyclic graph (DAG) based operator fusion algorithm for both off-line and on-line workload deployment scenarios. Together, these generate high-efficiency operator-fusion solutions for arbitrary network models running on DNN accelerators. The experimental results show that, compared to the baselines, Optimus reduces off-chip memory accesses by 17% to 75% and obtains 1.86× to 3.66× the energy efficiency on state-of-the-art DNN workloads, and that it brings a significant power-efficiency boost to DNN accelerators of different architectures and dataflows.
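The idea of scoring fusion schemes with a memory cost model can be illustrated with a deliberately small sketch. It is not the Optimus cost model or algorithm, which the abstract does not specify. The code assumes a plain chain of layers rather than a general DAG, and a toy rule that a fused group keeps all of its weights and intermediate feature maps on chip. The names `group_traffic`, `group_fits`, and `best_partition` are hypothetical, and all sizes are in arbitrary byte units.

```python
from math import inf

def group_traffic(fmaps, weights, i, j):
    """Off-chip bytes moved when layers i..j (inclusive) run as one fused group.

    Toy assumption: only the group's input map, its weights, and its final
    output map cross the chip boundary; intermediates stay on chip.
    """
    return fmaps[i] + sum(weights[i:j + 1]) + fmaps[j + 1]

def group_fits(fmaps, weights, i, j, buffer_bytes):
    """Toy feasibility check: weights plus intermediate maps must fit on chip."""
    return sum(fmaps[i + 1:j + 1]) + sum(weights[i:j + 1]) <= buffer_bytes

def best_partition(fmaps, weights, buffer_bytes):
    """Split a layer chain into fused groups with minimum off-chip traffic.

    fmaps[k] is the input of layer k and fmaps[k + 1] its output, so
    len(fmaps) == len(weights) + 1. Returns (total_traffic, groups).
    """
    n = len(weights)
    best = [inf] * n + [0]   # best[i]: minimum traffic for layers i..n-1
    cut = [n] * (n + 1)      # cut[i]: first layer after the group starting at i
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            # A single layer is always allowed (assumed to be tiled if needed).
            if j > i and not group_fits(fmaps, weights, i, j, buffer_bytes):
                continue
            cost = group_traffic(fmaps, weights, i, j) + best[j + 1]
            if cost < best[i]:
                best[i], cut[i] = cost, j + 1
    groups, i = [], 0
    while i < n:
        groups.append((i, cut[i] - 1))
        i = cut[i]
    return best[0], groups

# Three layers: a 100-byte input, 400- and 300-byte intermediates, 50-byte output.
fmaps, weights = [100, 400, 300, 50], [10, 10, 10]
print(best_partition(fmaps, weights, buffer_bytes=0))    # (1580, [(0, 0), (1, 1), (2, 2)])
print(best_partition(fmaps, weights, buffer_bytes=500))  # (780, [(0, 1), (2, 2)])
```

With no usable buffer, every layer runs alone and all feature maps travel off chip. With a 500-byte buffer, fusing the first two layers keeps the 400-byte intermediate on chip and roughly halves the traffic in this toy setting. An exhaustive search over a chain is cheap, but the same search over a branching DAG is much larger, which is the design-space issue the abstract raises.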

Optimus: An Operator Fusion Framework for Deep Neural Networks