gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Erik D. Huckvale, Hunter N.B. Moseley
2024-06-25
Abstract: Determining the maximum usage of random-access memory (RAM) on both the motherboard and a graphics processing unit (GPU) over the lifetime of a computing task can be extremely useful for troubleshooting points of failure and for optimizing memory utilization, especially within a high-performance computing (HPC) setting. While there are tools for tracking compute time and RAM, including the job management tools themselves, tracking of GPU usage does not, to our knowledge, currently have sufficient solutions. We present gpu_tracker, a Python package that runs in the background and tracks the computational resource usage of a task, including the real compute time the task takes to complete, its maximum RAM usage, and its maximum GPU RAM usage, specifically for Nvidia GPUs. We demonstrate that gpu_tracker can seamlessly track computational resource usage with minimal overhead within both desktop and HPC execution environments.
What problem does this paper attempt to address?
The paper addresses the problem of monitoring and optimizing computational resource usage, including random-access memory (RAM), central processing unit (CPU) utilization, and graphics processing unit (GPU) utilization, over the lifetime of a computing task, especially in high-performance computing (HPC) environments. Many tools already track compute time, CPU utilization, and RAM usage, but tools for tracking GPU usage on Unix/Linux operating systems are still lacking, particularly in HPC settings. The paper introduces `gpu_tracker`, a Python package that tracks resource usage in the background with minimal overhead and is suitable for both desktop and HPC environments. With `gpu_tracker`, users can better optimize resource allocation, avoiding task failures or inefficient execution caused by improper allocation. The tool also helps users estimate the resources to request when submitting jobs to HPC systems, which improves job scheduling efficiency.
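The general idea of background tracking described above can be illustrated with a minimal sketch. This is not the `gpu_tracker` API: the names `PeakSampler` and `gpu_memory_mib` are hypothetical, and the sketch assumes that per-process GPU memory can be read by parsing `nvidia-smi --query-compute-apps` output. A background thread polls a sampling function at a fixed interval and keeps the largest value seen, while the wall-clock time of the tracked block is measured alongside it.

```python
import subprocess
import threading
import time


def gpu_memory_mib(pid):
    """Return GPU memory (MiB) used by `pid` per nvidia-smi, or None if unavailable.

    Hypothetical helper for illustration; not part of gpu_tracker.
    """
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-compute-apps=pid,used_memory",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=5,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None  # nvidia-smi is missing, failed, or timed out
    total = 0
    for line in out.splitlines():
        fields = [f.strip() for f in line.split(",")]
        # Sum memory over every GPU on which this process appears.
        if len(fields) == 2 and fields[0] == str(pid) and fields[1].isdigit():
            total += int(fields[1])
    return total


class PeakSampler:
    """Poll `sample_fn` in a background thread and keep the maximum value seen."""

    def __init__(self, sample_fn, interval=0.1):
        self.sample_fn = sample_fn
        self.interval = interval
        self.peak = None      # largest sample observed so far
        self.elapsed = None   # wall-clock seconds, set on exit
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _sample_once(self):
        value = self.sample_fn()
        if value is not None and (self.peak is None or value > self.peak):
            self.peak = value

    def _run(self):
        # Event.wait returns False on timeout, so this loops once per interval
        # until the stop event is set.
        while not self._stop.wait(self.interval):
            self._sample_once()

    def __enter__(self):
        self._start = time.perf_counter()
        self._sample_once()
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        self._sample_once()  # take one final reading after the thread stops
        self.elapsed = time.perf_counter() - self._start
        return False
```

A caller would wrap the task in `with PeakSampler(lambda: gpu_memory_mib(os.getpid())) as s:` (after `import os`) and read `s.peak` and `s.elapsed` afterwards. Because the sketch samples at intervals, a spike shorter than `interval` can be missed, so the polling interval trades accuracy against overhead.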