Abstract: Wide models, such as generalized linear models and factorization-based models, are used extensively in predictive applications such as recommendation, CTR prediction, and image recognition. Because these models are memory-bound, performance improvements on CPUs are approaching their limit. GPUs offer many computation units and high memory bandwidth, which makes them a promising platform for training machine learning models. However, GPU training of wide models is far from optimal because of the sparsity and irregularity of these models; existing GPU-based implementations of wide models are even slower than their CPU counterparts. The classical training schema for wide models is not optimized for the GPU architecture and suffers from a large number of random memory accesses and redundant reads and writes of intermediate values. In this paper, we propose cuWide, an efficient GPU training framework for large-scale wide models. To fully benefit from the GPU memory hierarchy, cuWide applies a new flow-based training schema that leverages the spatial and temporal locality of wide models to drastically reduce communication with GPU global memory. To do so, we adopt a bigraph computation model that realizes the flow-based schema efficiently, with three flexible interfaces for programming. Further, we use a 2D partition of the mini-batch (in the sample and feature dimensions) together with the proposed graph abstraction to optimize GPU memory access for sparse data, and we apply several spatial-temporal caching mechanisms (importance-based model caching and cross-stage accumulation caching) to achieve a high-performance kernel. To implement cuWide efficiently, we also propose several GPU-oriented optimizations: a feature-oriented data layout to enhance data locality, a replication mechanism to reduce update conflicts in shared memory, and multi-stream scheduling to overlap data transfer and kernel computation. We show that cuWide can be more than 20× faster than state-of-the-art GPU solutions and multi-core CPU solutions.
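
The 2D partition of a mini-batch mentioned in the abstract can be illustrated with a small CPU-side sketch. This is not cuWide's implementation: the function names `partition_2d` and `tile_partial_scores`, the block sizes, and the toy data are all assumptions made for illustration. The sketch only shows the general idea of grouping sparse entries into tiles along both the sample and the feature dimension, so that each tile touches a bounded range of model weights.

```python
from collections import defaultdict


def partition_2d(entries, sample_block, feature_block):
    """Group sparse (sample, feature, value) entries into 2D tiles.

    Hypothetical helper: the tile key is
    (sample index // sample_block, feature index // feature_block).
    """
    tiles = defaultdict(list)
    for sample, feature, value in entries:
        key = (sample // sample_block, feature // feature_block)
        tiles[key].append((sample, feature, value))
    return dict(tiles)


def tile_partial_scores(tiles, weights):
    """Accumulate per-sample dot products one tile at a time.

    Each tile reads only the weights in its own feature block, which is
    the locality that a 2D partition is meant to expose.
    """
    scores = defaultdict(float)
    for key in sorted(tiles):
        for sample, feature, value in tiles[key]:
            scores[sample] += weights[feature] * value
    return dict(scores)


# Toy mini-batch (assumed data): 4 samples, 6 features, coordinate form.
batch = [
    (0, 0, 1.0), (0, 4, 2.0),
    (1, 1, 1.0),
    (2, 4, 1.0), (2, 5, 3.0),
    (3, 0, 2.0), (3, 2, 1.0),
]
weights = [0.5, -1.0, 2.0, 0.0, 1.0, 0.25]

tiles = partition_2d(batch, sample_block=2, feature_block=3)
scores = tile_partial_scores(tiles, weights)
```

On a GPU, each tile would plausibly map to a thread block that stages its slice of the weights in shared memory; the sketch above leaves that mapping out and only demonstrates that tiling preserves the per-sample results.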

CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs
