HPC-Coder: Modeling Parallel Programs using Large Language Models

Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

DOI: https://doi.org/10.23919/ISC.2024.10528929

2024-05-14

Abstract: Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models makes developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. Until recently, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform, especially for parallel programs. However, with recent advancements in language modeling, and the availability of large amounts of open-source code-related data, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. We compare several pre-trained LLMs on HPC-related tasks and introduce a new model, HPC-Coder, fine-tuned on parallel codes. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot, decorate for loops with OpenMP pragmas, and model performance changes in scientific application repositories as well as programming competition solutions.

Distributed, Parallel, and Cluster Computing; Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the growing complexity and scale of parallel programs in high-performance computing (HPC), especially in the exascale era. The diversity of hardware and parallel programming models makes developing, optimizing, and maintaining parallel software more burdensome for developers. To ease this burden, the paper applies large language models (LLMs) to automate some development and analysis tasks. The main contributions of the paper are:

1. **A new dataset**: The paper introduces a dataset of HPC and scientific code, collected from popular open-source repositories.
2. **The HPC-Coder model**: The paper fine-tunes several pre-trained LLMs on this dataset to better handle HPC-related code tasks. Experiments show that the resulting model outperforms the other models compared on HPC-specific tasks.
3. **Code generation and OpenMP pragma annotation**: The model can auto-complete HPC functions where generic models cannot, and it can decorate for loops with OpenMP pragmas.
4. **Performance modeling**: The model predicts the relative performance change after a source code edit, in scientific application repositories as well as programming competition solutions, with an accuracy of up to 92%.

In summary, the goal of the paper is to improve developer productivity and reduce the likelihood of errors by introducing language models specifically tailored to the HPC domain.
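To make the OpenMP pragma annotation task concrete, the sketch below shows the kind of input and output involved: a serial loop, and the same loop decorated with a pragma. It is an illustration only, not an example taken from the paper or its dataset; the function names `dot_serial` and `dot_omp` are hypothetical, and the pragma shown is simply one valid annotation for this reduction loop.

```c
/* Serial input: a loop that a model could be asked to annotate.
 * (Hypothetical example, not taken from the paper.) */
double dot_serial(const double *x, const double *y, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Desired output: the same loop decorated with an OpenMP pragma.
 * The reduction clause gives each thread a private copy of `sum`
 * and combines the copies when the loop finishes. */
double dot_omp(const double *x, const double *y, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang), the second loop runs across threads; without that flag the pragma is ignored and the loop runs serially, so the two functions compute the same sum up to floating-point rounding.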

HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages

Can Large Language Models Write Parallel Code?

MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

Performance-Aligned LLMs for Generating Fast Code

Large Language Models as Code Executors: An Exploratory Study

Scope is all you need: Transforming LLMs for HPC Code

LM4HPC: Towards Effective Language Model Application in High-Performance Computing

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

The Landscape and Challenges of HPC Research and LLMs

BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Multi-Programming Language Ensemble for Code Generation in Large Language Model

HPC-GPT: Integrating Large Language Model for High-Performance Computing

OMPGPT: A Generative Pre-trained Transformer Model for OpenMP

AUTOPARLLM: GNN-Guided Automatic Code Parallelization using Large Language Models

Planning-Driven Programming: A Large Language Model Programming Workflow

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

A Systematic Evaluation of Large Language Models of Code