Abstract: There is continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the `do concurrent` (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC GPU offload support to its compiler, as has HPE for AMD GPUs. In this paper, we explore the current portability of DC across GPU vendors using the in-production solar surface flux evolution code HipFT. We discuss implementation and compilation details, including when and where directive APIs for data movement are needed or desirable compared with relying on a unified memory system. We show the performance achieved on both data center and consumer platforms.


### What problems does this paper attempt to solve?

This paper explores the portability of Fortran's `do concurrent` (DC) construct across GPU platforms from different vendors. Specifically, it focuses on the following points:

1. **Cross-platform support**: The paper evaluates the implementation and support of `do concurrent` on NVIDIA, Intel, and AMD GPUs. Although `do concurrent` has been demonstrated successfully on the NVIDIA platform, support on other vendors' platforms arrived later and is not yet fully consistent.
2. **Performance comparison**: Using a production application, the HipFT (High-performance Flux Transport) code for solar surface magnetic flux evolution, the paper compares the performance of `do concurrent` on different vendors' GPUs and contrasts it with traditional external APIs (such as OpenACC and OpenMP).
3. **Data management strategies**: The paper examines automatic memory management versus manual memory management (through OpenMP or OpenACC directives) on each platform to determine which approach gives better performance and portability.
4. **Optimization techniques**: The paper studies how to improve the performance of some difficult-to-parallelize nested loops by adding extra compiler directives (for example, `!$omp parallel loop` on the Intel platform).
5. **Unified programming model**: One goal of the paper is to verify whether high-performance computing can be achieved with standard Fortran code (that is, without relying on vendor-specific APIs), thereby improving the portability and longevity of the code.

### Specific content

- **Background introduction**: The paper first introduces the recent trend of using standard language features (such as C++ parallel algorithms and Fortran's `do concurrent`) for accelerated computing. These features can reduce the dependence on external APIs and improve the portability and longevity of the code.
- **Related work**: Reviews the history of `do concurrent` support on different vendor platforms, including the progress made by NVIDIA, Intel, and HPE.
- **Test code**: Selects the High-performance Flux Transport (HipFT) code as the test case, a high-performance computing code that simulates the evolution of solar surface magnetic flux.
- **Experimental setup**: Describes in detail how to compile and run the HipFT code on different vendors' GPU platforms, including the required compiler flags, environment variable settings, and the choice between manual and automatic memory management.
- **Result analysis**: Presents the performance results of running the HipFT code on different platforms and discusses the factors that affect performance, such as the memory management method and compiler optimization.

Through these studies, the paper provides a reference and guidance for future use of `do concurrent` on different GPU platforms.
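To make the construct under discussion concrete, the following is a minimal sketch of a `do concurrent` loop with optional OpenACC data directives around it. It is not taken from HipFT: the program name, array names, and sizes are hypothetical, and whether a given compiler offloads the loop or honors the directives depends on vendor-specific flags that are not shown here.

```fortran
program dc_sketch
  implicit none
  integer, parameter :: n = 100000   ! hypothetical problem size
  integer :: i
  real :: a
  real, allocatable :: x(:), y(:)

  allocate(x(n), y(n))
  a = 2.0
  x = 1.0
  y = 3.0

  ! Optional manual data movement. Without OpenACC enabled this line is an
  ! ordinary comment, and data movement is left to the compiler/runtime
  ! (for example through unified or managed memory).
  !$acc enter data copyin(x, y)

  ! Standard Fortran: the iterations are declared independent, so the
  ! compiler may run them in any order, in parallel, or offloaded to a GPU.
  do concurrent (i = 1:n)
    y(i) = a * x(i) + y(i)
  end do

  ! Copy the result back to the host and release the device copies.
  !$acc exit data copyout(y) delete(x)

  ! Self-check: every element should equal 2*1 + 3 = 5.
  if (any(abs(y - 5.0) > 1.0e-6)) error stop "unexpected result"
  print *, "ok"
end program dc_sketch
```

Because the directives are comments to a compiler that does not enable OpenACC, the same source builds as plain standard Fortran. This is the sense in which directive-based data movement can be layered on top of DC only where it is needed.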

Portability of Fortran's `do concurrent` on GPUs
