Portability of Fortran's `do concurrent` on GPUs

Ronald M. Caplan, Miko M. Stulajter, Jon A. Linker, Jeff Larkin, Henry A. Gabb, Shiquan Su, Ivan Rodriguez, Zachary Tschirhart, Nicholas Malaya
2024-08-15
Abstract: There is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the `do concurrent` (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC GPU offload support to its compiler, as has HPE for AMD GPUs. In this paper, we explore the current portability of using DC across GPU vendors using the in-production solar surface flux evolution code, HipFT. We discuss implementation and compilation details, including when/where using directive APIs for data movement is needed/desired compared to using a unified memory system. The performance achieved on both data center and consumer platforms is shown.
Programming Languages; Solar and Stellar Astrophysics; Computational Engineering, Finance, and Science; Mathematical Software; Performance
### What problems does this paper attempt to solve?

This paper explores the portability of Fortran's `do concurrent` (DC) construct across GPU platforms from different vendors. Specifically, it focuses on the following points:

1. **Cross-platform support**: The paper evaluates the implementation and support of `do concurrent` on NVIDIA, Intel, and AMD GPUs. Although `do concurrent` has been demonstrated successfully on the NVIDIA platform, support on other vendors' platforms arrived later and is not yet fully consistent.
2. **Performance comparison**: Using a production application, the HipFT (High-performance Flux Transport) code for solar surface magnetic flux evolution, the paper compares the performance of `do concurrent` on different vendors' GPUs and contrasts it with traditional external APIs (such as OpenACC and OpenMP).
3. **Data management strategies**: The paper examines automatic memory management and manual memory management (through OpenMP or OpenACC directives) on each platform to determine which approach provides better performance and portability.
4. **Optimization techniques**: The paper studies how to improve the performance of some difficult-to-parallelize nested loops by adding compiler directives (for example, `!$omp parallel loop` on the Intel platform).
5. **Unified programming model**: One goal of the paper is to verify whether high performance can be achieved with standard Fortran code (that is, without relying on vendor-specific APIs), thereby improving the portability and longevity of the code.

### Specific content

- **Background introduction**: The paper first introduces the recent trend of using standard language features (such as C++ parallel algorithms and Fortran's `do concurrent`) for accelerated computing. These features can reduce the dependence on external APIs and improve the portability and longevity of the code.
- **Related work**: Reviews the history of `do concurrent` support on different vendor platforms, including the progress made by NVIDIA, Intel, and HPE.
- **Test code**: Selects the High-performance Flux Transport (HipFT) code as the test case. HipFT is a high-performance computing code that simulates the evolution of solar surface magnetic flux.
- **Experimental setup**: Describes in detail how to compile and run HipFT on different vendors' GPU platforms, including the required compiler flags, environment variable settings, and the choice between manual and automatic memory management.
- **Result analysis**: Presents the performance of HipFT on the different platforms and discusses the factors that affect it, such as the memory management method and compiler optimization.

Through these studies, the paper provides reference points and guidance for future use of `do concurrent` on different GPU platforms.
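For readers unfamiliar with the construct, the following is a minimal sketch of what a `do concurrent` loop looks like in standard Fortran. It is not taken from HipFT: the module name `dc_sketch`, the routine `smooth_step`, and the five-point smoothing stencil are hypothetical, chosen only to illustrate a nested loop whose iterations are independent. The loop itself contains no vendor API. Whether it runs on a GPU depends on the compiler and its offload options (for example, `-stdpar=gpu` with NVIDIA's `nvfortran`), and, as the summary notes, some platforms may additionally want OpenMP or OpenACC data directives around such loops when unified memory is not used.

```fortran
! Hypothetical illustration of a do concurrent loop; not HipFT source code.
module dc_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Apply one explicit five-point smoothing step to the interior of an
  ! n-by-m grid. Boundary values are copied through unchanged.
  subroutine smooth_step(f_old, f_new, n, m, alpha)
    integer,  intent(in)  :: n, m
    real(dp), intent(in)  :: f_old(n, m)
    real(dp), intent(out) :: f_new(n, m)
    real(dp), intent(in)  :: alpha
    integer :: i, j

    ! Start from a full copy so the boundary rows and columns are defined.
    f_new = f_old

    ! Each (i, j) iteration writes only f_new(i, j) and reads only f_old,
    ! so the iterations are independent and may execute in any order.
    ! That independence is what lets a compiler parallelize or offload them.
    do concurrent (j = 2:m-1, i = 2:n-1)
      f_new(i, j) = f_old(i, j) + alpha * (            &
                      f_old(i+1, j) + f_old(i-1, j)    &
                    + f_old(i, j+1) + f_old(i, j-1)    &
                    - 4.0_dp * f_old(i, j) )
    end do
  end subroutine smooth_step

end module dc_sketch
```

A caller would `use dc_sketch` and invoke `call smooth_step(f_old, f_new, n, m, alpha)` once per step. Compiled without any offload option, the same source runs as ordinary host code, which is the portability property the paper sets out to test across vendors.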