Improving Oversubscribed GPU Memory Performance in the PyTorch Framework

Jake Choi, Heon Young Yeom, Yoonhee Kim
DOI: https://doi.org/10.1007/s10586-022-03805-x
2022-11-11
Cluster Computing
Abstract: Popular deep learning frameworks such as PyTorch rely heavily on GPUs for training and suffer from out-of-memory (OOM) errors if memory is not managed properly. CUDA Unified Memory (UM) allows tensor objects to oversubscribe GPU memory, but incurs heavy performance penalties. In this paper, we build upon our UM implementation and create a minimal-overhead CUPTI dynamic profiler to trace unified memory page fault and memory transfer statistics in PyTorch applications. We also implement the CUDA memory prefetch and advise APIs so that they can be called directly from the PyTorch application, based on the dynamically profiled statistics, to improve oversubscription performance in various PyTorch models, including ResNet and BERT.
computer science, information systems, theory & methods
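
The abstract describes using profiled page-fault statistics to decide where to apply prefetch and advise calls. A minimal sketch of that idea follows. It is not the authors' implementation: the `TensorStats` record, the `plan_prefetch` function, the greedy faults-per-byte policy, and the sample numbers are all hypothetical, chosen only for illustration. In a real system, the selected tensors would presumably be passed to the CUDA runtime calls `cudaMemPrefetchAsync` and `cudaMemAdvise`; here the plan is only computed, so the code runs without a GPU.

```python
from dataclasses import dataclass


@dataclass
class TensorStats:
    """Hypothetical per-tensor record that a page-fault profiler might produce."""
    name: str
    size_bytes: int   # assumed to be greater than zero
    page_faults: int  # unified memory page faults observed for this tensor


def plan_prefetch(stats, budget_bytes):
    """Pick tensors to prefetch, greedily by faults per byte, within a byte budget.

    This greedy policy is an assumption for illustration, not the paper's method.
    """
    ranked = sorted(stats, key=lambda s: s.page_faults / s.size_bytes, reverse=True)
    chosen, used = [], 0
    for s in ranked:
        if s.page_faults == 0:
            continue  # no faults were observed, so prefetching gains nothing
        if used + s.size_bytes <= budget_bytes:
            chosen.append(s.name)
            used += s.size_bytes
    return chosen


MIB = 2**20

# Made-up statistics, standing in for one profiled training iteration.
stats = [
    TensorStats("conv1.weight", 4 * MIB, 120),
    TensorStats("fc.weight", 8 * MIB, 40),
    TensorStats("embeddings", 64 * MIB, 900),
    TensorStats("bias", 1 * MIB, 0),
]

plan = plan_prefetch(stats, budget_bytes=16 * MIB)
print(plan)
```

With these made-up numbers, the tensor with the most faults in total ("embeddings") is not selected because it exceeds the 16 MiB budget; a real policy would likely need to prefetch such tensors in smaller ranges.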