Abstract: Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2) incorporating powerful parallel computing devices such as GPUs, FPGAs, and other accelerators; and 3) utilizing special parallel architectures like Single Instruction/Multiple Data (SIMD). Many researchers have made efforts using different parallel technologies, including developing applications, conducting performance analyses, identifying performance bottlenecks, and proposing feasible solutions. However, balancing and optimizing parallel programs remain challenging due to the complexity of parallel algorithms and hardware architectures. Issues such as data transfer between hosts and devices in heterogeneous systems continue to be bottlenecks that limit performance. This work summarizes a vast amount of information on various parallel programming techniques, aiming to present the current state and future development trends of parallel programming, performance issues, and solutions. It seeks to give readers an overall picture and provide background knowledge to support subsequent research.

What problem does this paper attempt to address?

This paper addresses the application and optimization of parallel computing technologies in high-performance computing (HPC). Specifically, it examines how to achieve efficient computing performance through three parallel computing methods:

1. **CPU multithreading technology**: Using multithreading on single-core or multi-core CPUs to improve computing efficiency. The paper discusses the basic concepts of multithreading, scheduling strategies, and practical applications on multi-core CPUs.
2. **Use of accelerator devices**: Integrating powerful parallel computing devices such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). The GPU is a key focus of the paper because its many-core architecture is well suited to data-parallel computing. The paper explores the CUDA programming model and its application on NVIDIA GPUs.
3. **Special parallel architectures**: Such as Single Instruction/Multiple Data (SIMD) architectures. The paper analyzes the characteristics of SIMD architectures and their applications on different devices, especially their role in coping with the slowdown of Moore's Law.

The main objective of the paper is to summarize the current status and development trends of various parallel programming techniques, analyze existing performance problems, and propose feasible solutions. In doing so, it aims to give readers a comprehensive overview that supports subsequent research. The paper discusses GPU-based parallel optimization in particular detail because it is currently one of the most popular parallel computing solutions.
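To make the first method concrete, the following is a minimal sketch of manual CPU multithreading in C++, not code taken from the paper. The function name `parallel_sum` and the chunking scheme are illustrative assumptions: each worker thread sums one contiguous slice of the input and writes to its own slot, and the main thread combines the partial results after joining.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not from the paper): sums `data` using
// `num_threads` worker threads, one contiguous chunk per thread.
long long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);

    // One result slot per thread, so no two threads write the same element.
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Ceiling division so every element lands in exactly one chunk.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }

    // Wait for all workers before reading their results.
    for (auto& w : workers) {
        w.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Giving each thread a private slot avoids locks entirely, which keeps the sketch short. A tuned version would likely also pad the slots to avoid false sharing and size the thread count from `std::thread::hardware_concurrency()`.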
### Main problem summary:

- **Balancing and optimizing parallel programs**: Due to the complexity of parallel algorithms and hardware architectures, balancing and optimizing parallel programs remains a challenge.
- **Data transfer bottleneck**: In heterogeneous systems, data transfer between the host and the device is still a key bottleneck that limits performance.
- **Trade-off between automatic and manual parallelization**: Automatic parallelization is simple and easy to use but usually cannot achieve the best performance, while manual parallelization is flexible but requires strong programming skills and a significant time investment.

Through these discussions, the paper hopes to provide valuable references for researchers and developers to help them better utilize parallel computing technologies in the high-performance computing field.
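The data transfer bottleneck can be illustrated with a back-of-the-envelope cost model. This is a sketch under stated assumptions, not a model from the paper: the names `OffloadEstimate`, `estimate_offload`, and `offload_pays_off` are hypothetical, transfers are assumed not to overlap with computation, and all inputs are figures the caller must supply.

```cpp
#include <cassert>

// Hypothetical cost model (illustrative only, not from the paper).
struct OffloadEstimate {
    double transfer_s;  // time spent moving data between host and device
    double device_s;    // compute time on the device
    double host_s;      // compute time if the work stays on the host
};

// Assumes transfers and computation do not overlap.
// bytes_moved:      total bytes copied to and from the device
// link_bytes_per_s: sustained host-device link bandwidth
// host_s:           measured or estimated host-only compute time
// device_speedup:   how many times faster the device computes
OffloadEstimate estimate_offload(double bytes_moved, double link_bytes_per_s,
                                 double host_s, double device_speedup) {
    assert(link_bytes_per_s > 0.0 && device_speedup > 0.0);
    return {bytes_moved / link_bytes_per_s, host_s / device_speedup, host_s};
}

// Offloading helps only if transfer plus device time beats host time.
bool offload_pays_off(const OffloadEstimate& e) {
    return e.transfer_s + e.device_s < e.host_s;
}
```

Under this model, a device that computes ten times faster can still lose to the host when the transfer time alone approaches the host compute time. This is why heterogeneous code often tries to keep data resident on the device or to overlap transfers with computation.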

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

A parallel computing method for irregular work

Hybrid Performance Modeling And Analyzing Of Parallel Systems

Implementing Performance Portability of High Performance Computing Programs in the New Golden Age of Chip Architecture

On the Parallelization Optimization Strategy for High Performance Computing Software

Performance Evaluation of Parallel Programming in Virtual Machine Environment

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

CPU GPU computing: Overview, optimization, and applications

Parallel Programming Models and Languages

A Survey of Accelerating Parallel Sparse Linear Algebra

Performance Optimization Strategies of High Performance Computing on GPU

High-performance computing: Transitioning from Instruction-Level Parallelism to heterogeneous hybrid architectures

Survey of CPU/GPU Synergetic Parallel Computing

Power-aware Programming with GPU Accelerators

Parallel Model Research on the Heterogeneous Computer System

ON PARALLEL PROGRAMMING AND OPTIMISATION FOR MULTI-CORE

MIMD Programs Execution Support on SIMD Machines: A Holistic Survey

Performance Evaluation of Parallel Algorithms

Energy Cost Evaluation of Parallel Algorithms for Multiprocessor Systems

Research on Application-driven Parallel Program Performance Tuning

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores