LR-CNN: Lightweight Row-centric Convolutional Neural Network Training for Memory Reduction

Zhigang Wang, Hangyu Yang, Ning Wang, Chuanfei Xu, Jie Nie, Zhiqiang Wei, Yu Gu, Ge Yu
2024-01-21
Abstract: In the last decade, Convolutional Neural Network with a multi-layer architecture has advanced rapidly. However, training its complex network is very space-consuming, since a lot of intermediate data are preserved across layers, especially when processing high-dimension inputs with a big batch size. That poses great challenges to the limited memory capacity of current accelerators (e.g., GPUs). Existing efforts mitigate such bottleneck by external auxiliary solutions with additional hardware costs, and internal modifications with potential accuracy penalty. Differently, our analysis reveals that computations intra- and inter-layers exhibit the spatial-temporal weak dependency and even complete independency features. That inspires us to break the traditional layer-by-layer (column) dataflow rule. Now operations are novelly re-organized into rows throughout all convolution layers. This lightweight design allows a majority of intermediate data to be removed without any loss of accuracy. We particularly study the weak dependency between two consecutive rows. For the resulting skewed memory consumption, we give two solutions with different favorite scenarios. Evaluations on two representative networks confirm the effectiveness. We also validate that our middle dataflow optimization can be smoothly embraced by existing works for better memory reduction.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence
### What problem does this paper attempt to address?

This paper addresses the excessive memory consumption of convolutional neural network (CNN) training. Specifically:

1. **Huge memory consumption**: With high-dimensional inputs and large batch sizes, a large amount of intermediate data (such as feature maps) must be kept during forward propagation (FP) and backward propagation (BP). This makes memory consumption extremely high, especially in deeper networks.
2. **Limitations of existing solutions**:
   - **External auxiliary schemes**: These relieve the memory bottleneck through distributed training or by offloading data from the GPU to the CPU. They increase hardware costs, and the frequent data migration inevitably hurts performance.
   - **Internal modification schemes**: Examples include checkpointing, quantization and compression, and network pruning. They can reduce memory consumption to some extent, but usually at the price of runtime delay, increased hardware investment, or loss of model accuracy.
3. **New optimization direction**: The authors observe that current CNN training assumes a complex many-to-many relationship between kernel parameters and matrices. Under this assumption, computation has to proceed layer by layer, so a large number of feature maps accumulate. In fact, convolution operations are only weakly dependent, or even completely independent, in space and time. Based on this insight, the authors propose a row-centric CNN training method (LR-CNN) that reduces memory consumption by reorganizing convolution operations into rows, without losing accuracy or adding hardware costs.

### Main contributions

1. **A lightweight row-centric CNN training method (LR-CNN)**: The method reorganizes convolution operations into rows during forward and backward propagation. This significantly reduces memory consumption without losing accuracy and without additional hardware costs.
2. **Two row-partitioning solutions**: The two solutions are designed differently, and the better-performing one can be selected according to the hardware configuration, while the underlying details stay hidden from the user.
3. **Extensive experiments**: Experiments were carried out on two representative CNNs (VGG-16 and ResNet-50). The results show that, using only GPU memory, LR-CNN reduces memory consumption by up to 78% compared with the latest competitors (whether or not they use CPU memory).

Through these contributions, the paper offers an effective way to deal with the memory bottleneck in deep-learning training, especially in resource-limited environments.
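The row-centric idea summarized above can be illustrated with a small forward-pass sketch. It is not the authors' implementation. It assumes single-channel inputs, stride 1, no padding, and plain Python lists, and the names `conv2d_valid`, `layerwise_forward`, `rowwise_forward`, and `band_rows` are hypothetical. The sketch pushes one band of rows at a time through a stack of convolution layers and checks that the result matches the usual layer-by-layer computation. Adjacent bands overlap by a few "halo" rows, which reflects the weak dependency between consecutive rows. Here those rows are simply recomputed, which is one possible way to handle the overlap and not necessarily what the paper does.

```python
def conv2d_valid(x, k):
    """2D 'valid' cross-correlation, stride 1, single channel, on nested lists."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def layerwise_forward(x, kernels):
    """Layer-by-layer (column) dataflow: every full feature map is materialized."""
    peak = 0  # element count of the largest intermediate feature map
    for k in kernels:
        x = conv2d_valid(x, k)
        peak = max(peak, len(x) * len(x[0]))
    return x, peak


def rowwise_forward(x, kernels, band_rows):
    """Row-centric dataflow: push one band of rows through all layers at a time."""
    halo = sum(len(k) - 1 for k in kernels)  # extra input rows each band needs
    out_h = len(x) - halo
    out, peak = [], 0
    for start in range(0, out_h, band_rows):
        stop = min(start + band_rows, out_h)
        # Output rows [start, stop) depend only on input rows [start, stop + halo).
        t = x[start:stop + halo]
        for k in kernels:
            t = conv2d_valid(t, k)
            peak = max(peak, len(t) * len(t[0]))
        out.extend(t)
    return out, peak


# Illustrative data only: a 16x12 integer input and three identical 3x3 kernels.
x = [[(3 * r + 5 * c) % 7 for c in range(12)] for r in range(16)]
kernels = [[[1, 0, -1], [2, 0, -2], [1, 0, -1]]] * 3

full, full_peak = layerwise_forward(x, kernels)
rows, rows_peak = rowwise_forward(x, kernels, band_rows=4)

assert rows == full          # same result, different dataflow
print(full_peak, rows_peak)  # 140 80
```

The peak values count only the largest single intermediate map in this toy setting. They say nothing about the memory figures reported in the paper, and the sketch does not cover backward propagation.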