Abstract: Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends, or multilayer perceptrons, to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher percentage errors when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems and bounds each prediction through fundamental performance laws. Specifically, NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. Compared to state-of-the-art prior works, it reduces the percentage error in predicting the latency of the GPT3 model for training and inference on H100 from 198% and 19.7% to 3.8%, even though neither GPT3 nor H100 was used to train the framework.
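
The tile-based decomposition described in the abstract can be illustrated with a minimal sketch. This is not NeuSight's implementation: the abstract states that tile-level predictions come from a machine learning model, whereas the sketch below swaps in a fixed `utilization` factor applied to a roofline-style bound. The `GPU` fields, the 128x128 tile shape, the one-tile-per-SM scheduling, the even split of bandwidth across SMs, and all hardware numbers are assumptions chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class GPU:
    """Hypothetical device description (not taken from any real datasheet)."""
    num_sms: int          # number of streaming multiprocessors
    peak_flops: float     # peak device throughput in FLOP/s
    mem_bandwidth: float  # peak off-chip memory bandwidth in bytes/s


def tile_latency(tile_flops, tile_bytes, gpu, utilization=1.0):
    """Roofline-style latency of one tile on one SM.

    Assumes each SM receives an equal share of compute and bandwidth.
    `utilization` (0 < u <= 1) stands in for a learned tile-level predictor.
    """
    compute_time = tile_flops / (gpu.peak_flops / gpu.num_sms)
    memory_time = tile_bytes / (gpu.mem_bandwidth / gpu.num_sms)
    return max(compute_time, memory_time) / utilization


def matmul_latency(m, n, k, gpu, tile_m=128, tile_n=128,
                   bytes_per_elem=2, utilization=1.0):
    """Estimate latency of an (m x k) @ (k x n) matmul by aggregating tiles."""
    # Split the output matrix into tiles; each tile is an independent work unit.
    num_tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    # Tiles run in waves, assuming one tile per SM at a time.
    num_waves = math.ceil(num_tiles / gpu.num_sms)
    # Work and traffic for one tile: a multiply-add counts as 2 FLOPs.
    tile_flops = 2 * tile_m * tile_n * k
    tile_bytes = bytes_per_elem * (tile_m * k + k * tile_n + tile_m * tile_n)
    return num_waves * tile_latency(tile_flops, tile_bytes, gpu, utilization)


def percentage_error(predicted, measured):
    """Absolute percentage error of a prediction against a measurement."""
    return abs(predicted - measured) / measured * 100.0


# Hypothetical GPU: 100 SMs, 100 TFLOP/s, 1 TB/s.
gpu = GPU(num_sms=100, peak_flops=1e14, mem_bandwidth=1e12)
one_wave = matmul_latency(1280, 1280, 1024, gpu)   # 100 tiles, 1 wave
two_waves = matmul_latency(1281, 1280, 1024, gpu)  # 110 tiles, 2 waves
print(f"one wave: {one_wave:.3e} s, two waves: {two_waves:.3e} s")
```

Under these assumptions, adding a single row to the matrix pushes the tile count past the number of SMs and doubles the estimate, which shows why a tile-and-wave view can capture behavior that a smooth regression over kernel size would miss.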

A Topology-Aware Performance Prediction Model for Distributed Deep Learning on GPU Clusters

PerfTop: Towards Performance Prediction of Distributed Learning over General Topology

Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

Topology-aware GPU scheduling for learning workloads in cloud environments

Interference-aware parallelization for deep learning workload in GPU cluster

Prediction of GPU Failures Under Deep Learning Workloads

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Energy-Efficient GPU Clusters Scheduling for Deep Learning

A Generic Performance Model for Deep Learning in a Distributed Environment

Forecasting GPU Performance for Deep Learning Training and Inference

A Survey on Performance Modeling and Prediction for Distributed DNN Training

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

HetHub: A Heterogeneous Distributed Hybrid Training System for Large-Scale Models

GPU Cluster Scheduling for Network-Sensitive Deep Learning

An Optimal Network-Aware Scheduling Technique for Distributed Deep Learning in Distributed HPC Platforms

Proteus: Simulating the Performance of Distributed DNN Training