Abstract: Training convolutional neural networks (CNNs) consumes considerable time and resources. While most previous work has focused on accelerating the convolutional (CONV) layer, the proportion of non-convolutional (non-CONV) layers, such as batch normalization, in training is gradually increasing. Non-CONV layers have low cache reuse and low arithmetic intensity, so their performance is limited by memory bandwidth. Processing-in-memory (PIM) can utilize wide memory bandwidth, making it suitable for accelerating non-CONV layers. It therefore makes sense to run the computationally complex CONV layers on the host and to handle the memory-bound non-CONV layers on the PIM. Performance can improve further if the two run simultaneously. However, memory access conflicts between the host and PIM are the main factor hindering this improvement. Prior studies proposed bank partitioning to alleviate memory conflicts, but it is not effective here because CNN training involves significant data sharing between CONV and non-CONV layers. In this paper, we propose a memory scheduling scheme and a CNN training flow for the pipelined execution of CONV layers on the host and non-CONV layers on the PIM. First, instead of applying bank partitioning, the host and PIM each access memory exclusively for a certain period, which avoids moving shared data between host memory and PIM memory. The conditions for switching memory access authority between the host and PIM are set per layer, taking into account memory access characteristics and the number of queued memory requests. Second, in the training flow, CONV and non-CONV layers are pipelined in units of output feature map channels. Specifically, for the backward pass, the non-CONV tasks of the feature map gradient calculation phase and the weight gradient update phase are rearranged so that they can be easily performed within CONV layers. Experimental results show that the proposed pipelined execution achieves an average speedup of 18.1% at the network level compared to the serial operation of the host and PIM.
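The benefit of pipelining in units of output feature map channels can be illustrated with a minimal timing sketch. This is an idealized two-stage model and not the scheduler proposed in the paper: the function names and the per-channel-group latencies are hypothetical, and the model ignores memory access conflicts and the cost of switching memory access authority, both of which the proposed scheme is designed to manage.

```python
def serial_time(conv_times, nonconv_times):
    """Host finishes every CONV chunk before the PIM starts any non-CONV work."""
    return sum(conv_times) + sum(nonconv_times)


def pipelined_time(conv_times, nonconv_times):
    """Idealized two-stage pipeline over channel groups.

    Assumption (not from the paper): the PIM may start the non-CONV work for
    channel group i as soon as the host has produced group i and the PIM has
    finished group i-1. Memory conflicts and switching overhead are ignored.
    """
    host_done = 0.0  # time at which the host finishes the current group
    pim_done = 0.0   # time at which the PIM finishes the current group
    for conv, nonconv in zip(conv_times, nonconv_times):
        host_done += conv
        pim_done = max(pim_done, host_done) + nonconv
    return pim_done


# Hypothetical latencies (arbitrary units) for four channel groups.
conv = [4, 4, 4, 4]
nonconv = [1, 1, 1, 1]

print(serial_time(conv, nonconv))     # 20
print(pipelined_time(conv, nonconv))  # 17.0
```

In this toy setting only the non-CONV work of the last channel group remains exposed after the host finishes. The figures are illustrative and unrelated to the 18.1% average speedup reported above.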

Latency-Based Inter-Operator Scheduling for CNN Inference Acceleration on GPU

IOS: Inter-Operator Scheduler for CNN Acceleration

DeepSlicing: Collaborative and Adaptive CNN Inference with Low Latency

Dynamic Space-Time Scheduling for GPU Inference

CPU-Accelerator Co-Scheduling for CNN Acceleration at the Edge

Cooperative Inference with Interleaved Operator Partitioning for CNNs

CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system

Accelerating CNN Training With Concurrent Execution of GPU and Processing-in-Memory

A Unified Optimization Approach for CNN Model Inference on Integrated GPUs

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

Systolic-CNN: An OpenCL-defined Scalable Run-time-flexible FPGA Accelerator Architecture for Accelerating Convolutional Neural Network Inference in Cloud/Edge Computing

High Throughput CNN Inference and Training with In-Cache Computation

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

Efficient Scheduling of Irregular Network Structures on CNN Accelerators

Collaborative edge computing for distributed CNN inference acceleration using receptive field-based segmentation

Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing

Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks

Accelerating convolutional neural network by exploiting sparsity on GPUs

GAAS: An Efficient Group Associated Architecture and Scheduler Module for Sparse CNN Accelerators