Abstract: Current deep neural network (DNN) architectures pay increasing attention to reducing the number of parameters and operations so that they can run on embedded and IoT platforms. As a consequence, the intermediate feature maps of such lightweight networks have grown relative to the weights and often exceed the on-chip memory, becoming the new bottleneck and causing a considerable number of power-hungry off-chip memory accesses. To reduce these feature-induced memory accesses, operator fusion has been proposed to parallelize the execution of multiple convolutional layers, and it has been shown to significantly reduce off-chip memory accesses. However, deciding how to fuse the neural operators remains challenging, because the choice depends heavily on both the neural network (NN) topology and the specific DNN accelerator configuration. In this work, we observe that prior operator fusion approaches fail to guarantee memory-level optimality because they search only a constrained operator fusion design space. Considering the complexity of NN topologies and the constrained resources of DNN accelerators, we develop a novel operator fusion framework, Optimus. Optimus includes an accurate memory cost model, which the scheduler uses to evaluate candidate operator-fusion schemes, and a directed acyclic graph (DAG) based operator fusion algorithm for both off-line and on-line workload deployment scenarios. Together, these generate high-efficiency operator-fusion solutions for arbitrary network models running on DNN accelerators. The experimental results show that, compared to the baselines, Optimus reduces off-chip memory accesses by 17% to 75% and improves energy efficiency by 1.86× to 3.66× on state-of-the-art DNN workloads, and that it brings a significant power-efficiency boost to DNN accelerators of different architectures and dataflows.
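To make the idea of cost-model-driven fusion concrete, the following is a minimal sketch and not the Optimus cost model or algorithm. It assumes a linear chain of layers (the simplest DAG), a hypothetical `Layer` record with sizes in KiB, a crude feasibility rule (all intermediate feature maps of a fused group must fit in the on-chip buffer at once, ignoring tiling), and a simple dynamic program that picks the partition with the least off-chip traffic. All names and numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    in_size: int      # input feature map size (KiB), hypothetical
    out_size: int     # output feature map size (KiB), hypothetical
    weight_size: int  # weight size (KiB), hypothetical


def group_traffic(group):
    """Off-chip traffic of one fused group: the first input, all weights,
    and the last output. Intermediate feature maps stay on chip."""
    return group[0].in_size + sum(l.weight_size for l in group) + group[-1].out_size


def group_fits(group, capacity):
    """Crude feasibility rule (an assumption): every intermediate feature
    map inside the group must be held on chip simultaneously."""
    return sum(l.out_size for l in group[:-1]) <= capacity


def best_partition(chain, capacity):
    """Dynamic program over a layer chain.

    best[i] holds (minimum off-chip traffic for chain[i:], fused groups).
    """
    n = len(chain)
    best = [None] * (n + 1)
    best[n] = (0, [])
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            group = chain[i:j]
            if not group_fits(group, capacity):
                break  # on-chip demand only grows as the group gets longer
            tail_cost, tail_groups = best[j]
            names = [l.name for l in group]
            candidates.append((group_traffic(group) + tail_cost, [names] + tail_groups))
        best[i] = min(candidates, key=lambda c: c[0])
    return best[0]


# Illustrative three-layer chain; the sizes are made up for the example.
chain = [
    Layer("conv1", in_size=150, out_size=400, weight_size=5),
    Layer("conv2", in_size=400, out_size=300, weight_size=10),
    Layer("conv3", in_size=300, out_size=100, weight_size=20),
]

if __name__ == "__main__":
    for capacity in (0, 512, 1024):
        traffic, groups = best_partition(chain, capacity)
        print(f"buffer={capacity:5d} KiB  traffic={traffic:5d} KiB  groups={groups}")
```

With a zero-size buffer no fusion is possible and every feature map crosses the off-chip boundary; as the buffer grows, longer fused groups become feasible and traffic drops. A real framework would additionally handle branching DAGs, tiling, and the accelerator's dataflow, none of which this sketch models.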

Optimus: An Operator Fusion Framework for Deep Neural Networks
