Abstract: Over the last decade, Convolutional Neural Networks with multi-layer architectures have advanced rapidly. However, training such a complex network is very space-consuming, since a large amount of intermediate data is preserved across layers, especially when processing high-dimensional inputs with a large batch size. This poses great challenges to the limited memory capacity of current accelerators (e.g., GPUs). Existing efforts mitigate this bottleneck either through external auxiliary solutions, which incur additional hardware costs, or through internal modifications, which carry a potential accuracy penalty. In contrast, our analysis reveals that computations within and across layers exhibit weak spatial-temporal dependency and, in some cases, complete independence. This inspires us to break the traditional layer-by-layer (column) dataflow rule: operations are instead reorganized into rows throughout all convolution layers. This lightweight design allows a majority of intermediate data to be removed without any loss of accuracy. We particularly study the weak dependency between two consecutive rows. For the resulting skewed memory consumption, we give two solutions suited to different scenarios. Evaluations on two representative networks confirm the effectiveness. We also validate that our middle dataflow optimization can be smoothly integrated into existing works for further memory reduction.


### What problem does this paper attempt to solve?

This paper aims to solve the problem of excessive memory usage when training convolutional neural networks (CNNs). Specifically:

1. **Huge memory consumption**: With high-dimensional inputs and large batch sizes, a large amount of intermediate data (such as feature maps) must be kept during forward propagation (FP) and backward propagation (BP), which leads to extremely high memory consumption, especially in deeper networks.
2. **Limitations of existing solutions**:
   - **External auxiliary schemes**: These alleviate the memory bottleneck through distributed training or by offloading data from the GPU to the CPU, but they increase hardware costs, and frequent data migration inevitably hurts performance.
   - **Internal modification schemes**: Examples include checkpointing, quantization and compression, and network pruning. Although they reduce memory usage to some extent, they usually bring problems such as runtime delay, increased hardware investment, or loss of model accuracy.
3. **New optimization direction**: The authors found that current CNN training assumes a complex many-to-many relationship between kernel parameters and matrices. This assumption forces computation to proceed layer by layer, so a large number of feature maps accumulate. In fact, convolution operations exhibit weak dependency, or even complete independence, in space and time. Based on this insight, the authors proposed a new row-centric CNN training method (LR-CNN), which reduces memory usage by reorganizing convolution operations into rows, without losing accuracy or adding hardware costs.

### Main contributions

1. **Proposed a lightweight row-centric CNN training method (LR-CNN)**: The method significantly reduces memory usage by reorganizing convolution operations into rows during forward and backward propagation, without losing accuracy and without requiring additional hardware.
2. **Proposed two different row-partitioning solutions**: The two solutions suit different scenarios; the better-performing one can be selected according to the hardware configuration, while the underlying details are hidden from the user.
3. **Conducted extensive experimental research**: Experiments were carried out on two representative CNNs (VGG-16 and ResNet-50). The results show that, using only GPU memory, LR-CNN reduces memory usage by up to 78% compared with the latest competitors (whether or not they use CPU memory).

Through these contributions, the paper provides an effective way to deal with the memory bottleneck in deep-learning training, especially in resource-limited environments.
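The row-centric idea summarized above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it covers forward propagation only, uses single-channel "valid" convolutions, and handles the dependency between consecutive row bands by simply recomputing the overlapping rows, which is an assumption made here for simplicity. The function names `conv2d_valid` and `two_layers_by_rows` and the parameter `rows_per_band` are hypothetical. The sketch only shows that each band of final output rows depends on a bounded band of input rows, so the full intermediate feature map never has to be held at once.

```python
import numpy as np


def conv2d_valid(x, k):
    """Plain 2D 'valid' cross-correlation of a single-channel input."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out


def two_layers_by_rows(x, k1, k2, rows_per_band):
    """Run two stacked conv layers band by band instead of layer by layer.

    Returns the final output and the largest intermediate (layer-1) slice
    that had to be alive at any one time, measured in elements.
    """
    kh1, kh2 = k1.shape[0], k2.shape[0]
    final_h = x.shape[0] - kh1 - kh2 + 2
    bands = []
    peak_intermediate = 0
    for r0 in range(0, final_h, rows_per_band):
        r1 = min(r0 + rows_per_band, final_h)
        # Final rows [r0, r1) need layer-1 rows [r0, r1 + kh2 - 1),
        # which in turn need input rows [r0, r1 + kh1 + kh2 - 2).
        x_band = x[r0:r1 + kh1 + kh2 - 2]
        mid = conv2d_valid(x_band, k1)  # only a slice of the layer-1 map
        peak_intermediate = max(peak_intermediate, mid.size)
        bands.append(conv2d_valid(mid, k2))
    return np.vstack(bands), peak_intermediate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 12))
    k1 = rng.standard_normal((3, 3))
    k2 = rng.standard_normal((3, 3))

    layer1_full = conv2d_valid(x, k1)          # layer-by-layer baseline
    baseline = conv2d_valid(layer1_full, k2)
    by_rows, peak = two_layers_by_rows(x, k1, k2, rows_per_band=4)

    print("outputs match:", np.allclose(baseline, by_rows))
    print("full layer-1 map elements:", layer1_full.size)
    print("peak layer-1 slice elements:", peak)
```

The overlap between adjacent bands (here `kh1 + kh2 - 2` input rows) is where the weak dependency between consecutive rows shows up; this sketch pays for it with redundant computation, whereas the summary above indicates the paper proposes two dedicated row-partitioning solutions for it.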

LR-CNN: Lightweight Row-centric Convolutional Neural Network Training for Memory Reduction

A Novel Memory-Scheduling Strategy for Large Convolutional Neural Network on Memory-Limited Devices

Layer-Wise Training To Create Efficient Convolutional Neural Networks

Layup: Layer-adaptive and Multi-type Intermediate-oriented Memory Optimization for GPU-based CNNs

Accelerating Low Bit-Width Deep Convolution Neural Network in MRAM

Learning Efficient Convolutional Networks Through Network Slimming

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

Low-Memory Neural Network Training: A Technical Report

Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks

A ReRAM-Based Row-Column-Oriented Memory Architecture for Convolutional Neural Networks

Reducing SRAM Reading Power with Column Data Segment and Weights Correlation Enhancement for CNN Processing

Accelerating Recurrent Neural Networks: A Memory-Efficient Approach

A Fully Pipelined Hardware Architecture for Convolutional Neural Network with Low Memory Usage and DRAM Bandwidth

CNN with large memory layers

Mini-batch Serialization: CNN Training with Inter-layer Data Reuse

Smart-DNN: Efficiently Reducing the Memory Requirements of Running Deep Neural Networks on Resource-constrained Platforms

A Computing Efficient Hardware Architecture for Sparse Deep Neural Network Computing

Training Deep Nets with Sublinear Memory Cost

Communication Minimized Model-Architecture Co-design for Efficient Convolution Acceleration

A Reconfigurable Spatial Architecture for Energy-Efficient Inception Neural Networks

NeuroFlux: Memory-Efficient CNN Training Using Adaptive Local Learning