Abstract: In real applications, it is common to train deep neural networks (DNNs) on modest clusters. As model sizes and batch sizes continue to grow, training DNNs under a restricted memory budget becomes challenging. Tensor partition and tensor rematerialization are two major memory optimization techniques that enable larger model and batch sizes under a limited memory constraint. However, existing algorithms fail to fully exploit the opportunities for memory reduction because they ignore the invariant characteristics of dynamic computational graphs and the variation among same-sized tensors at different memory locations. In this work, we propose MegTaiChi, a dynamic tensor-based memory management optimization module for DNN training, which is the first to achieve efficient coordination of tensor partition and tensor rematerialization. The key feature of MegTaiChi is that it makes memory management decisions based on dynamic tensor access patterns tracked at runtime. This design is motivated by the observation that the access pattern to tensors is regular across training iterations. Based on the identified patterns, MegTaiChi exploits the full memory optimization space and achieves heuristic, adaptive, and fine-grained memory management. Experimental results show that MegTaiChi reduces the memory footprint by up to 11% for ResNet-50 and 10.5% for GL-base compared with DTR. For the training of 6 representative DNNs, MegTaiChi supports maximum batch sizes 5x and 2.4x larger than those of MegEngine and Sublinear, respectively. Compared with FlexFlow, GShard, and ZeRO-3, MegTaiChi achieves average speedups of 1.2x, 1.8x, and 1.5x, respectively. For a million-scale face recognition application, MegTaiChi achieves a 1.8x speedup over the optimal empirical parallelism strategy on 256 GPUs.
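
To make the idea of tensor rematerialization concrete, the following is a minimal toy sketch in Python. It is not MegTaiChi's algorithm and not DTR's eviction heuristic: a plain least-recently-used policy is swapped in as a stand-in, and the names `RematCache`, `build_chain`, and the unit-sized "tensors" are all hypothetical. It only illustrates the trade the abstract refers to: values evicted under a memory budget are recomputed from their parents when accessed again.

```python
from collections import OrderedDict


class RematCache:
    """Toy tensor store (hypothetical): evicts least-recently-used values
    under a size budget and recomputes them from their parents on demand."""

    def __init__(self, budget):
        self.budget = budget            # maximum resident size (abstract units)
        self.used = 0                   # size currently resident
        self.peak = 0                   # largest resident size observed
        self.computes = 0               # total number of (re)computations
        self.resident = OrderedDict()   # name -> value, oldest access first
        self.recipes = {}               # name -> (size, fn, parent names)

    def register(self, name, size, fn, parents=()):
        """Record how to compute `name`; nothing is computed yet."""
        self.recipes[name] = (size, fn, tuple(parents))

    def get(self, name):
        """Return the value of `name`, rematerializing it if it was evicted."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name]
        size, fn, parents = self.recipes[name]
        args = [self.get(p) for p in parents]  # may trigger recursive recompute
        value = fn(*args)
        self.computes += 1
        # Evict least-recently-used entries until the new value fits.
        while self.used + size > self.budget and self.resident:
            victim, _ = self.resident.popitem(last=False)
            self.used -= self.recipes[victim][0]
        self.resident[name] = value
        self.used += size
        self.peak = max(self.peak, self.used)
        return value


def build_chain(budget, length=4):
    """Build x0 -> x1 -> ... where x0 = 1 and each later value is parent + 1."""
    cache = RematCache(budget)
    cache.register("x0", 1, lambda: 1)
    for i in range(1, length):
        cache.register(f"x{i}", 1, lambda v: v + 1, parents=(f"x{i - 1}",))
    return cache


if __name__ == "__main__":
    cache = build_chain(budget=2)
    print(cache.get("x3"))   # forward pass through the chain
    print(cache.get("x1"))   # later access to an evicted value
    print(cache.computes, cache.peak)
```

With a budget of 2 units, the later access to `x1` forces `x0` and `x1` to be recomputed, so the sketch spends extra computation to stay within the budget. With a budget of 4 units nothing is evicted and no recomputation occurs.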

Memory Relevant Hyperparameters Optimization for DNN Training at Edge

Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks

Condense: A Framework for Device and Frequency Adaptive Neural Network Models on the Edge.

Adaptive ensemble optimization for memory-related hyperparameters in retraining DNN at edge

FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices.

3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

Efficient Memory Management for Deep Neural Net Inference

Overcoming Memory Constraint for Improved Target Classification Performance on Embedded Deep Learning Systems

Memory-efficient Deep Learning Inference with Incremental Weight Loading and Data Layout Reorganization on Edge Systems.

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Intelligent Measurement on Edge Devices Using Hardware Memory-Aware Joint Compression Enabled Neural Networks

COMB-MCM: Computing-on-Memory-Boundary NN Processor with Bipolar Bitwise Sparsity Optimization for Scalable Multi-Chiplet-Module Edge Machine Learning.

Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms

Low-Rank Training of Deep Neural Networks for Emerging Memory Technology

Memory-Computing Decoupling: A DNN Multitasking Accelerator with Adaptive Data Arrangement.

MACA: Memory-aware Convolution Accelerating for CNN Inference on Edge Devices

Pinpointing the Memory Behaviors of DNN Training

DNN Memory Footprint Reduction via Post-Training Intra-Layer Multi-Precision Quantization

Optimizing for In-memory Deep Learning with Emerging Memory Technology

MegTaiChi: Dynamic Tensor-based Memory Management Optimization for DNN Training

Efficient Neural Network Deployment for Microcontroller