Abstract: Training convolutional neural networks (CNNs) consumes considerable time and resources. While most previous work has focused on accelerating convolutional (CONV) layers, the proportion of non-convolutional (non-CONV) layers, such as batch normalization, in training is gradually increasing. Non-CONV layers have low cache reuse and low arithmetic intensity, so their performance is limited by memory bandwidth. Processing-in-memory (PIM) can exploit wide memory bandwidth, making it well suited to accelerating non-CONV layers. It therefore makes sense to run the computationally complex CONV layers on the host and to handle the memory-bound non-CONV layers on PIM. Performance can be expected to improve further if the two run simultaneously. However, memory access conflicts between the host and PIM are the main obstacle to this improvement. Prior studies proposed bank partitioning to alleviate memory conflicts, but it is not effective here because CNN training involves significant data sharing between CONV and non-CONV layers. In this paper, we propose a memory scheduling scheme and a CNN training flow for the pipelined execution of CONV layers on the host and non-CONV layers on PIM. First, instead of applying bank partitioning, the host and PIM each access memory exclusively for a certain period, which avoids moving shared data between host memory and PIM memory. The conditions for switching memory access authority between the host and PIM are set per layer, taking into account memory access characteristics and the number of queued memory requests. Second, in the training flow, CONV and non-CONV layers are pipelined in units of output feature map channels. Specifically, for the backward pass, the non-CONV tasks of the feature map gradient calculation phase and the weight gradient update phase are rearranged so that they can be easily performed within CONV layers. Experimental results show that the proposed pipelined execution achieves an average network-level speedup of 18.1% compared to serial operation of the host and PIM.
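
The benefit of pipelining in units of output feature map channels can be illustrated with a small timing model. The following is a minimal sketch under stated assumptions, not the scheduler proposed in the paper: it treats the host and PIM as an idealized two-stage pipeline, ignores the cost of switching memory access authority, and uses per-channel latencies that are made up purely for illustration. The function names `serial_time` and `pipelined_time` are hypothetical.

```python
from typing import Sequence


def serial_time(conv: Sequence[float], nonconv: Sequence[float]) -> float:
    """Serial operation: the host and PIM never overlap, so latencies simply add up."""
    return sum(conv) + sum(nonconv)


def pipelined_time(conv: Sequence[float], nonconv: Sequence[float]) -> float:
    """Idealized two-stage pipeline over output channels.

    PIM starts the non-CONV work for a channel as soon as the host has
    produced that channel and PIM has finished the previous one.
    """
    host_done = 0.0  # time at which the host finishes the current channel
    pim_done = 0.0   # time at which PIM finishes the current channel
    for c, n in zip(conv, nonconv):
        host_done += c
        pim_done = max(pim_done, host_done) + n
    return pim_done


# Hypothetical per-channel latencies (arbitrary units), not measured values.
conv_latency = [4.0] * 8      # CONV work on the host, per output channel
nonconv_latency = [1.0] * 8   # non-CONV work on PIM, per output channel

serial = serial_time(conv_latency, nonconv_latency)
piped = pipelined_time(conv_latency, nonconv_latency)
print(f"serial={serial}, pipelined={piped}, speedup={serial / piped:.3f}x")
```

In this toy setting only the non-CONV work for the last channel remains exposed, so the gain is bounded by the share of time spent in non-CONV layers. A real system would also have to account for the periods in which one side waits for memory access authority, which this model deliberately leaves out.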

ILP-based Multi-Branch CNNs Mapping on Processing-in-Memory Architecture

DaDianNao: A Machine-Learning Supercomputer

DDAM: Data Distribution-Aware Mapping of CNNs on Processing-In-Memory Systems

NicePIM: Design Space Exploration for Processing-In-Memory DNN Accelerators with 3D-Stacked-DRAM

An Efficient Racetrack Memory-Based Processing-in-Memory Architecture for Convolutional Neural Networks

A Collaborative PIM Computing Optimization Framework for Multi-Tenant DNN

Accelerating Neural Network Training with Processing-in-Memory GPU

A Spatial-Designed Computing-In-Memory Architecture Based on Monolithic 3D Integration for High-Performance Systems

Functionality-Based Processing-in-Memory Accelerator for Deep Convolutional Neural Networks

Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

pPIM: A Programmable Processor-in-Memory Architecture With Precision-Scaling for Deep Learning

Accelerating CNN Training With Concurrent Execution of GPU and Processing-in-Memory

Fast-OverlaPIM: A Fast Overlap-driven Mapping Framework for Processing In-Memory Neural Network Acceleration

PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation

An Energy-Efficient Quantized and Regularized Training Framework for Processing-In-Memory Accelerators

Exploiting Parallelism with Vertex-Clustering in Processing-In-Memory-based GCN Accelerators

A Configurable Multi-Precision CNN Computing Framework Based on Single Bit RRAM

DDC-PIM: Efficient Algorithm/Architecture Co-design for Doubling Data Capacity of SRAM-based Processing-In-Memory

Low Power Driven Loop Tiling for RRAM Crossbar-Based CNN

DATIC: A Data-Aware Time-Domain Computing-in-Memory-Based CNN Processor with Dynamic Channel Skipping and Mapping