A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Xinyao Yi
2024-09-17
Abstract: Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2) incorporating powerful parallel computing devices such as GPUs, FPGAs, and other accelerators; and 3) utilizing special parallel architectures like Single Instruction/Multiple Data (SIMD). Many researchers have made efforts using different parallel technologies, including developing applications, conducting performance analyses, identifying performance bottlenecks, and proposing feasible solutions. However, balancing and optimizing parallel programs remain challenging due to the complexity of parallel algorithms and hardware architectures. Issues such as data transfer between hosts and devices in heterogeneous systems continue to be bottlenecks that limit performance. This work summarizes a vast amount of information on various parallel programming techniques, aiming to present the current state and future development trends of parallel programming, performance issues, and solutions. It seeks to give readers an overall picture and provide background knowledge to support subsequent research.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses the application and optimization of parallel computing technologies in high-performance computing (HPC). Specifically, it examines how to achieve efficient computing performance through three parallel computing approaches:

1. **CPU multi-threading technology**: Applying multi-threading on single-core or multi-core CPUs to improve computing efficiency. The paper discusses the basic concepts of multi-threading, scheduling strategies, and practical applications on multi-core CPUs.
2. **Use of accelerator devices**: Integrating powerful parallel computing devices such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). The GPU is a key focus of the paper because its multi-core architecture is well suited to data-parallel computing. The paper explores the CUDA programming model and its application on NVIDIA GPUs.
3. **Special parallel architectures**: Architectures such as Single Instruction/Multiple Data (SIMD). The paper analyzes the characteristics of SIMD architectures and their applications in different devices, especially their role in dealing with the stagnation of Moore's Law.

The main objective of the paper is to summarize the current status and development trends of various parallel programming techniques, analyze existing performance problems, and propose feasible solutions. In doing so, it aims to give readers a comprehensive overview that supports subsequent research. The paper discusses GPU-based parallel optimization in particular detail because it is currently one of the most popular parallel computing solutions.
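The data-parallel pattern that underlies all three approaches (split the data, process the pieces independently, combine the results) can be illustrated with a minimal sketch. This is not code from the paper: the function names `chunk_bounds`, `partial_sum_of_squares`, and `parallel_sum_of_squares` are hypothetical, and Python is used only for readability. In standard CPython builds the global interpreter lock prevents pure-Python threads from running CPU-bound work simultaneously, so the sketch shows the decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, workers):
    """Split range(n) into `workers` contiguous, near-equal [start, end) slices."""
    base, extra = divmod(n, workers)
    bounds = []
    start = 0
    for i in range(workers):
        # The first `extra` slices take one additional element each.
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def partial_sum_of_squares(data, start, end):
    """Work done by one thread: reduce its own slice only."""
    total = 0
    for x in data[start:end]:
        total += x * x
    return total


def parallel_sum_of_squares(data, workers=4):
    """Fork one task per slice, then join by adding the partial results."""
    bounds = chunk_bounds(len(data), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(partial_sum_of_squares, data, s, e) for s, e in bounds
        ]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    values = list(range(1, 101))
    print(parallel_sum_of_squares(values, workers=4))
```

Each slice is reduced without touching any other slice, so the only synchronization point is the final sum. The same structure maps onto CPU threads, GPU thread blocks, or SIMD lanes, with the slice size and the combine step chosen to fit the hardware.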
### Main problem summary

- **Balancing and optimizing parallel programs**: Due to the complexity of parallel algorithms and hardware architectures, balancing and optimizing parallel programs remains a challenge.
- **Data transfer bottleneck**: In heterogeneous systems, data transfer between the host and the device is still a key bottleneck that limits performance.
- **Trade-off between automatic and manual parallelization**: Automatic parallelization is simple and easy to use but usually cannot achieve the best performance, while manual parallelization is flexible but requires strong programming skills and a significant time investment.

Through these discussions, the paper hopes to provide a valuable reference for researchers and developers, helping them make better use of parallel computing technologies in high-performance computing.
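The data transfer bottleneck can be made concrete with a simplified cost model. The sketch below is an illustration, not a model taken from the paper: it assumes that transfers and device computation do not overlap, and the function names and all parameters (`bytes_moved`, `link_bandwidth`, `device_compute_time`, `host_compute_time`) are hypothetical inputs the reader would have to measure.

```python
def offload_time(bytes_moved, link_bandwidth, device_compute_time):
    """Total offload time in seconds, assuming transfer and compute do not overlap.

    bytes_moved: total bytes copied between host and device (both directions).
    link_bandwidth: sustained host-device bandwidth in bytes per second.
    device_compute_time: time the device spends computing, in seconds.
    """
    return bytes_moved / link_bandwidth + device_compute_time


def offload_pays_off(host_compute_time, bytes_moved, link_bandwidth,
                     device_compute_time):
    """True if running on the device, transfers included, beats the host alone."""
    return offload_time(bytes_moved, link_bandwidth,
                        device_compute_time) < host_compute_time
```

Under this model a device that computes much faster than the host can still lose overall when `bytes_moved / link_bandwidth` dominates, which is why reducing transfer volume or overlapping transfers with computation is a common optimization target.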