HPC-Coder: Modeling Parallel Programs using Large Language Models

Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele
DOI: https://doi.org/10.23919/ISC.2024.10528929
2024-05-14
Abstract: Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models makes developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. Until recently, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform, especially for parallel programs. However, with recent advancements in language modeling, and the availability of large amounts of open-source code-related data, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. We compare several pre-trained LLMs on HPC-related tasks and introduce a new model, HPC-Coder, fine-tuned on parallel codes. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot, decorate for loops with OpenMP pragmas, and model performance changes in scientific application repositories as well as programming competition solutions.
Distributed, Parallel, and Cluster Computing; Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the growing complexity and scale of parallel programs in high-performance computing (HPC), especially in the exascale era. The diversity of hardware and parallel programming models makes parallel software more burdensome to develop, optimize, and maintain. To ease this burden, the paper proposes using large language models (LLMs) to automate some development and analysis tasks. Its main contributions are:

1. **A new dataset**: The paper introduces a dataset of HPC and scientific code collected from popular open-source repositories.
2. **The HPC-Coder model**: The paper shows how to fine-tune pre-trained large language models so that they better handle HPC-related code tasks. Experiments show that the resulting model outperforms other models on HPC-specific tasks.
3. **Code generation and OpenMP pragma annotation**: The model can auto-complete HPC functions and decorate for loops with OpenMP pragmas.
4. **Performance modeling**: The model is used to predict relative performance after source code changes, in scientific application repositories as well as programming competition solutions, with a reported accuracy of up to 92%.

In summary, the paper aims to improve developer productivity and reduce the likelihood of errors by introducing language models specifically tailored to the HPC domain.
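
To make the OpenMP pragma annotation task concrete, the following is a minimal sketch in C of what a decorated loop looks like. It is an illustration only, not an example taken from the paper or its dataset: the function name `dot_product` and the choice of a reduction kernel are assumptions made here. The task described above amounts to producing the `#pragma` line given the loop that follows it.

```c
/* Hypothetical kernel (not from the paper): dot product of two vectors. */
double dot_product(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;

    /* The annotation a model would be asked to generate for this loop:
     * iterations are divided among threads, and each thread's partial
     * sum is combined into `sum` when the loop ends.
     * A compiler invoked without OpenMP support ignores the pragma and
     * runs the loop serially, so the result is the same either way
     * (up to floating-point rounding order). */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }

    return sum;
}
```

With GCC or Clang, OpenMP is typically enabled with the `-fopenmp` flag. The `reduction(+:sum)` clause matters here: without it, all threads would update `sum` concurrently, which is a data race. Choosing the correct clauses is part of what makes the annotation task non-trivial.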