Abstract: Today, heterogeneous computing has reshaped the way scientists think about and approach high-performance computing (HPC). Hardware accelerators such as general-purpose graphics processing units (GPUs) and the Intel Many Integrated Core (MIC) architecture continue to make inroads in accelerating large-scale scientific applications. These advancements, however, introduce new challenges for the scientific community, such as selecting the best processor for an application, devising effective performance optimization strategies, and maintaining performance portability across architectures. In this thesis, we present our techniques and approach to address some of these issues. First, we present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run a given kernel faster. Use cases of this technique include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs. We then present our approach to accelerating a seismology modeling application based on the finite difference method (FDM), using MPI and CUDA over a hybrid CPU+GPU cluster. We describe the generic computational complexities involved in porting such applications to GPUs and present our strategy for efficient performance optimization and characterization. We also show how performance modeling can be used to reason about and drive hardware-specific optimizations on the GPU.
The performance evaluation of our approach shows a maximum speedup of 23-fold with a single GPU and 33-fold with dual GPUs per node over the serial version of the application, which in turn results in a many-fold speedup when coupled with the MPI distribution of the computation across the cluster. We also study the efficacy of GPU-integrated MPI, with MPI-ACC as an example implementation, on a seismology modeling application and discuss the lessons learned.

Dedication

I dedicate this thesis to my mom, dad, and sister.

Acknowledgments

First and foremost, I would like to express my heartfelt gratitude to my advisor, Dr. Wu-chun Feng, for his invaluable guidance and support throughout my M.S. studies. His belief in his students and his encouragement to think independently to develop solutions to the research problems made the overall research experience greatly rewarding. His diligent work ethic and dynamic personality have been a wonderful source of inspiration to me. It has been a pleasure …

Performance Modeling and Tuning for DFT Calculations on Heterogeneous Architectures

Large Scale Numerical Simulation Via Parallelization and Reconfigurable Computing Hardware

Hybrid Performance Modeling And Analyzing Of Parallel Systems

Performance Modeling, Optimization, and Characterization on Heterogeneous Architectures

Performance modeling of microsecond scale biological molecular dynamics simulations on heterogeneous architectures

Multi-GPU Hybrid Programming Accelerated Three-Dimensional Phase-Field Model in Binary Alloy

Performance analysis and modeling for quantum computing simulation on distributed GPU platforms

Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality

Achieving Performance Portability in Gaussian Basis Set Density Functional Theory on Accelerator Based Architectures in NWChemEx

Using Hardware Counter-Based Performance Model to Diagnose Scaling Issues of HPC Applications

Prediction models for multi-dimensional power-performance optimization on many cores

Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor

Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies

High Performance Optimization at the Door of the Exascale

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers

Performance and Power Efficient Massive Parallel Computational Model for HPC Heterogeneous Exascale Systems

Hybrid programming-model strategies for GPU offloading of electronic structure calculation kernels

Exploration of Performance and Energy Trade-offs for Heterogeneous Multicore Architectures

Scalability of high-performance PDE solvers

Performance Optimization using Multimodal Modeling and Heterogeneous GNN