Experiences Readying Applications for Exascale

Paul T. Bauman, Reuben D. Budiardja, Dmytro Bykov, Noel Chalmers, Jacqueline Chen, Nicholas Curtis, Marc Day, Markus Eisenbach, Lucas Esclapez, Alessandro Fanfarillo, William Freitag, Nicholas Frontiere, Antigoni Georgiadou, Joseph Glenski, Kalyana Gottiparthi, Marc T. Henry de Frahan, Gustav R. Jansen, Wayne Joubert, Justin G. Lietz, Jakub Kurzak, Nicholas Malaya, Bronson Messer, Damon McDougall, Paul Mullowney, Stephen Nichols, Matthew Norman, Thomas Papatheodore, Jon Rood, Philip C. Roth, Sarat Sreepathi, James White III, Noah Wolfe
2023-10-03
Abstract: The advent of exascale computing invites an assessment of existing best practices for developing application readiness on the world's largest supercomputers. This work details observations from the last four years in preparing scientific applications to run on the Oak Ridge Leadership Computing Facility's (OLCF) Frontier system. This paper addresses a range of topics in software including programmability, tuning, and portability considerations that are key to moving applications from existing systems to future installations. A set of representative workloads provides case studies for general system and software testing. We evaluate the use of early access systems for development across several generations of hardware. Finally, we discuss how best practices were identified and disseminated to the community through a wide range of activities including user-guides and trainings. We conclude with recommendations for ensuring application readiness on future leadership computing systems.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper examines the application readiness process and best practices for exascale (10^18 operations per second) supercomputers. It focuses on the Oak Ridge Leadership Computing Facility (OLCF) Frontier system, a supercomputer built in collaboration with AMD and HPE and designed to sustain double-precision floating-point performance exceeding 1 exaflop. The main objectives of the paper are:

1. **Evaluating the readiness of existing applications on exascale systems**: When the Frontier project was announced in 2019, almost no applications could fully use exascale-level computing power, so existing software development best practices had to be evaluated and adjusted.
2. **Detailing the experience of preparing scientific applications over the past four years**: This covers programmability, tuning, and portability considerations, which are key to moving applications from existing systems to future installations.
3. **Providing representative workloads as case studies for general system and software testing**: These case studies show how applications can be adapted to new hardware environments.
4. **Evaluating the use of early access systems for development**: The paper discusses development across several generations of hardware.
5. **Sharing how best practices were identified and disseminated**: This knowledge was passed to the community through activities such as user guides and training.
6. **Offering recommendations for application readiness on future leadership-class computing systems**: Drawing on the experience above, the paper recommends practices intended to help applications run well on the next generation of supercomputers.
The paper also specifically mentions an organization called the Frontier Center of Excellence (COE), which brings together key personnel from HPE, AMD, and ORNL to serve as a center of knowledge and expertise on application readiness and optimization, and as a focal point for application "co-design," coordinating efforts across different domains. Additionally, the paper details work in software testing and preparation, including programming strategies such as AMD's HIP (Heterogeneous-compute Interface for Portability) and OpenMP target offloading, as well as specific optimization cases for multiple scientific applications.
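As an illustration of what OpenMP target offloading looks like in practice, the following is a minimal sketch and is not taken from the paper. The function name `daxpy_offload` and the choice of a DAXPY kernel are assumptions made for this example. The `target teams distribute parallel for` directive and the `map` clauses are standard OpenMP. When the code is compiled without OpenMP support, or run without an available device, the loop simply executes on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): computes y[i] = a * x[i] + y[i].
// The loop is marked for offload to an accelerator with OpenMP target directives.
void daxpy_offload(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());  // both vectors must have the same length
    const std::size_t n = x.size();
    if (n == 0) {
        return;  // nothing to compute
    }

    // Raw pointers are used so the map clauses can name array sections.
    const double* xp = x.data();
    double* yp = y.data();

    // map(to: ...) copies x to the device; map(tofrom: ...) copies y to the
    // device before the loop and back to the host afterwards.
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

A caller would pass two equally sized vectors, for example `daxpy_offload(2.0, x, y)`, and read the updated values from `y` afterwards. The same loop written in HIP would need an explicit kernel, device allocations, and memory copies; the directive-based form keeps a single source that also runs on the host.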