Abstract: Training convolutional neural networks (CNNs) consumes considerable time and resources. While most previous work has focused on accelerating convolutional (CONV) layers, the proportion of training taken up by non-convolutional (non-CONV) layers, such as batch normalization, is gradually increasing. Non-CONV layers have low cache reuse and low arithmetic intensity, so their performance is limited by memory bandwidth. Processing-in-memory (PIM) can exploit wide memory bandwidth, making it well suited to accelerating non-CONV layers. It therefore makes sense to run the compute-intensive CONV layers on the host and to handle the memory-bound non-CONV layers on PIM, and further gains can be expected if the two run concurrently. However, memory access conflicts between the host and PIM are the main obstacle to this improvement. Prior studies proposed bank partitioning to alleviate memory conflicts, but it is not effective here because CNN training involves significant data sharing between CONV and non-CONV layers. In this paper, we propose a memory scheduling scheme and a CNN training flow for the pipelined execution of CONV layers on the host and non-CONV layers on PIM. First, instead of applying bank partitioning, the host and PIM each access memory exclusively for a certain period, which avoids moving shared data between host memory and PIM memory. The conditions for switching memory access authority between the host and PIM are set per layer, taking into account memory access characteristics and the number of queued memory requests. Second, in the training flow, CONV and non-CONV layers are pipelined in units of output feature map channels. Specifically, for the backward pass, the non-CONV tasks of the feature map gradient calculation phase and the weight gradient update phase are rearranged so that they can be easily performed within CONV layers.
Experimental results show that the proposed pipelined execution achieves an average speedup of 18.1% at the network level compared to the serial operation of the host and PIM.
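
The channel-level pipelining described in the abstract can be illustrated with a toy timing model. The sketch below is not the paper's scheduler: the function names and the per-channel times are hypothetical, and it deliberately ignores memory access conflicts and the cost of switching access authority, which the abstract identifies as the main practical obstacles. It only shows why overlapping the non-CONV work for one output channel with the CONV work for the next can shorten a layer compared with serial execution.

```python
from typing import List


def serial_time(conv: List[float], non_conv: List[float]) -> float:
    """Host computes every CONV output channel, then PIM handles every non-CONV channel."""
    return sum(conv) + sum(non_conv)


def pipelined_time(conv: List[float], non_conv: List[float]) -> float:
    """Two-stage pipeline over output channels.

    PIM starts the non-CONV work for a channel once the host has produced
    that channel and PIM has finished the previous channel.
    """
    host_done = 0.0  # time at which the host finishes the current channel
    pim_done = 0.0   # time at which PIM finishes the current channel
    for t_conv, t_non_conv in zip(conv, non_conv):
        host_done += t_conv
        pim_done = max(host_done, pim_done) + t_non_conv
    return pim_done


# Hypothetical per-channel times in arbitrary units (assumed, not measured).
conv_times = [4.0, 4.0, 4.0, 4.0]
non_conv_times = [1.0, 1.0, 1.0, 1.0]

print(serial_time(conv_times, non_conv_times))     # 20.0
print(pipelined_time(conv_times, non_conv_times))  # 17.0
```

In this idealized model only the non-CONV work for the last channel remains exposed when CONV dominates; any real gain depends on how well the memory scheduler keeps the host and PIM from stalling each other.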
