A Path-Based Method of Parallelizing C++ Programs

GJ ZHU,L XIE,ZX SUN

DOI: https://doi.org/10.1145/181748.181751

1994-01-01

Abstract:This paper presents a path-based method for identifying parallelism in C++ programs and proposes an execution model that trades off efficiency against parallelism. Based on path information provided at compile time or computed at run time, the model reduces rollbacks while increasing parallelism.

A parallel computing method for irregular work

Yang Xin,Xu Duanqing,Yang Bing

DOI: https://doi.org/10.3785/j.issn.1008-973X.2013.11.026

2013-01-01

Abstract:To make effective use of the computing power of heterogeneous multi-core architectures, data must be reorganized and task execution scheduled sensibly according to the characteristics of the hardware. This paper presents a parallel computing method for irregular workloads. The method integrates data parallelism, task parallelism, and pipeline parallelism. It is particularly suited to algorithms with dynamic behavior and complex, irregular data structures. It runs programs according to storage locality and SIMD characteristics and uses priority-based dynamic scheduling and data management to make the most efficient use of CPU and GPU compute and storage resources. Experimental results show that the approach improves the performance of parallel rendering algorithms that involve dynamic execution and the construction and maintenance of irregular data structures.
Design, Implementation of the Parallel C Language Based on C/S Mode in Distributed Systems

Xiaohui Zou

DOI: https://doi.org/10.1109/tmee.2011.6199258

2011-01-01

Abstract:In this paper, we designed Parallel C Language, a parallel programming language for implementing parallel computing in distributed systems. It adds some special identification statements to ANSI C. Parallel computing is based on the C/S (client/server) mode and combines multithreading with RPC (remote procedure call). Multithreading is used to split tasks and run the subtasks concurrently, and RPC is used to deploy and execute the subtasks on remote server nodes. Remote server nodes are allocated by a load-balancing mechanism. We also implemented a pre-compiler that translates a Parallel C Language program into several RPC application files; the remaining RPC application files are generated by RPCGEN. Test results show that programming distributed systems in this Parallel C Language reduces code size, makes full use of system resources, and improves efficiency.
Hybrid Performance Modeling And Analyzing Of Parallel Systems

Bin Cheng,Weiqin Tong,Xingang Wang

2012-01-01

International Journal of Numerical Analysis and Modeling

Abstract:Performance is a key feature of parallel systems. However, there is a large gap between peak performance and the performance a practical application can attain. Model-based performance evaluation can support performance-oriented program development for parallel systems. In this paper, a hybrid TCPN model is proposed that describes the parallel program and the resources separately, so that changes in the running environment require fewer modifications to the program structure. Performance engineering activities based on this model range from performance prediction in early development stages and performance analysis in the coding phase to locating and removing performance bottlenecks. After the correctness of the TCPN model is verified, a reachability graph can be obtained. Further performance tuning can then be done by summing the execution times of the corresponding actions along the critical path.
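The final tuning step sums action execution times along the critical path. That amounts to a longest-path computation over a directed acyclic graph of actions.

The Python sketch below makes two assumptions:
- The reachability graph has already been reduced to an acyclic action graph.
- Each action has a fixed execution time.

The function `critical_path` and the action names are hypothetical. They are not part of the paper's TCPN tooling.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path(durations, deps):
    """Return (length, path) of the longest-duration path through an acyclic action graph.

    durations: action -> execution time (assumed fixed per action)
    deps: action -> actions that must finish before it starts
    """
    graph = {action: deps.get(action, ()) for action in durations}
    finish = {}  # earliest finish time of each action
    prev = {}    # predecessor of each action on its longest incoming path
    for action in TopologicalSorter(graph).static_order():
        # The latest-finishing predecessor determines when this action can start.
        best = max(graph[action], key=finish.__getitem__, default=None)
        start = finish[best] if best is not None else 0
        finish[action] = start + durations[action]
        prev[action] = best
    # Walk back from the action that finishes last.
    node = max(finish, key=finish.__getitem__)
    length, path = finish[node], []
    while node is not None:
        path.append(node)
        node = prev[node]
    return length, path[::-1]


if __name__ == "__main__":
    # Hypothetical actions and times.
    times = {"read": 2, "compute_a": 5, "compute_b": 3, "reduce": 1}
    needs = {"compute_a": ["read"], "compute_b": ["read"], "reduce": ["compute_a", "compute_b"]}
    print(critical_path(times, needs))
```

In the example, `read` feeds both `compute_a` and `compute_b`, which both feed `reduce`. The critical path is therefore read, compute_a, reduce, with length 8. Tuning `compute_b` would not shorten the run, because it finishes at time 5 while `reduce` must wait until time 7.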
Scalable-Grain Pipeline Parallelization Method For Multi-Core Systems

Peng Liu,Chunming Huang,Jun Guo,Yang Geng,Weidong Wang,Mei Yang

DOI: https://doi.org/10.1007/978-3-642-40820-5_23

2013-01-01

Abstract:Parallelizing the large body of legacy sequential programs is the most difficult challenge facing multi-core designers. Because data dependences in C are obscured, existing compile-time parallelization methods are not suitable for exploiting the parallelism of streaming applications. In this paper, a software pipelining method for multi-layer loops is proposed that exploits the coarse-grained pipeline parallelism hidden in the multi-layer loops of streaming applications. The proposed method consists of three major steps: 1) transform the task dependence graph of a streaming application to resolve intricate dependences; 2) schedule tasks onto a multiprocessor system-on-chip with the objective of minimizing the maximum execution time over all pipeline stages; and 3) adjust the granularity of pipeline stages to balance the workload among all stages. The efficiency of the method is validated by case studies of typical streaming applications on a multi-core embedded system.
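Step 3 balances stage granularity so that the slowest stage, which bounds pipeline throughput, is as fast as possible. It can be illustrated in a simplified setting: a linear chain of tasks with known integer costs, cut into at most `k` contiguous stages.

The Python sketch below uses a standard linear-partitioning technique: binary search over the bottleneck value, with a greedy feasibility check. It is not the paper's algorithm, which works on a transformed task dependence graph and an MPSoC. The name `balance_stages` is hypothetical.

```python
def balance_stages(costs, k):
    """Split a non-empty chain of integer task costs into at most k contiguous
    pipeline stages, minimizing the largest stage cost (the bottleneck).

    Returns (bottleneck, stages), where stages is a list of lists of costs.
    """

    def pack(limit):
        # Greedy: keep adding tasks to the current stage until the next one
        # would exceed `limit`. This yields the fewest stages for that limit.
        stages, current = [[]], 0
        for c in costs:
            if current + c > limit and stages[-1]:
                stages.append([])
                current = 0
            stages[-1].append(c)
            current += c
        return stages if len(stages) <= k else None

    # The bottleneck lies between the largest single task and the total cost.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if pack(mid) is not None:
            hi = mid  # feasible: try a tighter bottleneck
        else:
            lo = mid + 1  # infeasible: needs more than k stages
    return lo, pack(lo)
```

Example: `balance_stages([4, 2, 7, 3, 1, 5], 3)` returns `(9, [[4, 2], [7], [3, 1, 5]])`. No split into three stages gets the slowest stage below 9.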
NUAPC: A Parallelizing Compiler for C++

Zhu Genjiang,Xie Li,Sun Zhongxiu

DOI: https://doi.org/10.1007/bf02943177

1997-01-01

Abstract:This paper presents a model for an automatically parallelizing C++ compiler that consists of compile-time and run-time parallelizing facilities. The paper also describes a method for finding both intra-object and inter-object parallelism. Parallelism detection is completely transparent to users.
On the Parallelization Optimization Strategy for High Performance Computing Software

Weile Jia,Xiaodong Shi,Haifeng Lyu

2013-01-01

Abstract:With the arrival of the multi-core era, traditional software can barely approach the peak performance of the hardware. Parallelizing and optimizing industrial code is a tough problem facing the HPC community. In this paper, we present our parallel implementation using the MPI, OpenMP, and CUDA programming models, and introduce different methods, implementation techniques, and optimization strategies. Finally, the challenges of our optimization strategy and our vision for future work are discussed.
A New Methodology of Data Dependence Analysis for Parallelizing C++

GJ ZHU,L XIE,ZX SUN

DOI: https://doi.org/10.1145/219726.219744

1995-01-01

Abstract:This paper details parallelism detection in NUAPC[4], an automatically parallelizing compiler for C++. NUAPC exploits two kinds of parallelism: inter-object and intra-object parallelism. The paper presents a method for finding this parallelism based on two key structures, the Conflict Set and the method Reference Vector. It also proposes a new theoretical formulation for expressing and performing data dependence analysis in the object-oriented programming paradigm. Data dependences involving inherited data and function members are also discussed.
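The question behind both kinds of parallelism is whether two method invocations touch conflicting data members. The sketch below is a simplified, hypothetical stand-in for the paper's Conflict Set and Reference Vector, not a reconstruction of them:
- Each method is summarized by the members it may read and write.
- Bernstein's conditions then decide whether two calls may run concurrently.
- Calls on different objects are assumed to share no state (no aliasing, no static members). Real C++ analysis cannot take that for granted.

The `Account`-style method names are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MethodEffects:
    """Data members a method may read and write (a hypothetical summary)."""
    reads: frozenset = field(default_factory=frozenset)
    writes: frozenset = field(default_factory=frozenset)


def conflicts(a: MethodEffects, b: MethodEffects) -> bool:
    """Bernstein's conditions: a conflict exists if either call writes a member
    that the other reads or writes."""
    return bool(a.writes & (b.reads | b.writes) or b.writes & a.reads)


def can_run_in_parallel(call_a, call_b, effects) -> bool:
    """call_x is (object_id, method_name).

    Assumes calls on different objects never share members (no aliasing).
    """
    obj_a, method_a = call_a
    obj_b, method_b = call_b
    if obj_a != obj_b:
        return True  # inter-object case under the no-aliasing assumption
    return not conflicts(effects[method_a], effects[method_b])  # intra-object case


# Hypothetical summaries for an Account-like class.
ACCOUNT_EFFECTS = {
    "deposit": MethodEffects(reads=frozenset({"balance"}), writes=frozenset({"balance"})),
    "owner": MethodEffects(reads=frozenset({"name"})),
    "balance": MethodEffects(reads=frozenset({"balance"})),
}
```

With these summaries, on the same account:
- `deposit` and `owner` may overlap.
- `deposit` and `balance` may not.
- Two `balance` reads may overlap.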
Object-oriented parallel execution model

Genjiang Zhu,Li Xie,ZhongXiu Sun

1998-01-01

Tien Tzu Hsueh Pao/Acta Electronica Sinica

Abstract:This paper proposes EM, an object-oriented parallel execution model that aims to find more parallelism at run time. EM uses reference analysis and path analysis information to control the sending and receiving of messages. Experimental results show that EM performs better than the rollback mechanism.
A Parallel Method With Hybrid Algorithms For Mixed Integer Nonlinear Programming

Kai Zhou,Wei Wan,Xi Chen,Zhijiang Shao,Lorenz T. Biegler

DOI: https://doi.org/10.1016/B978-0-444-63234-0.50046-4

2013-01-01

Abstract:This study aims to improve the solution efficiency of Mixed Integer Nonlinear Programming (MINLP) through parallelism. Most conventional parallel MINLP solvers use multiple threads to share the workload of a serial algorithm. The proposed method instead combines different algorithms running on different threads. Two algorithms are combined in a parallel structure: Quesada and Grossmann's LP/NLP-based branch and bound algorithm (QG), and Tabu Search (TS). The proposed method attempts to reduce the search space through continuous communication and exchange of intermediate results between the threads. Three kinds of information are exchanged between the two threads. First, the best solution found by TS, if feasible, serves as a valid upper bound for QG. Second, new approximations that further tighten the lower bound of QG can be generated at nodes provided by TS. Third, strong branching in QG may fix some integer variables, which helps reduce the search space of TS. Both threads thus benefit from the exchanged information. Numerical results show that solution time can be greatly reduced for the tested MINLP problems. In addition, complexity analysis of the parallel approach suggests that the proposed method has the potential for superlinear speedup.
A Parallel Function Evaluation Approach for Solution to Large-Scale Equation-Oriented Models

Yannan Ma,Zhijiang Shao,Xi Chen,Lorenz T. Biegler

DOI: https://doi.org/10.1016/j.compchemeng.2016.07.015

IF: 4.13

2016-01-01

Computers & Chemical Engineering

Abstract:The equation-oriented (EO) approach is widely used for process simulation and optimization. However, large-scale EO models consist of a huge number of nonlinear equations, which makes the solution procedure challenging and time-consuming. For most gradient-based numerical algorithms, function evaluations are the dominant step of the solution procedure. Here, a parallel computation method is developed for function evaluations within EO optimization strategies. The equations are divided into several groups, and the function evaluations for those groups are computed simultaneously by multiple threads on a parallel hardware platform. A theoretical analysis of the speedup ratio is conducted. Implementations of the proposed method on a multi-core processor platform and on a graphics processing unit (GPU) platform are then presented with several case studies. The numerical results show that the multi-core processor implementation has good computational performance, whereas the GPU implementation achieves acceleration only under relatively specific conditions.
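The grouping idea can be sketched as follows: the residual functions are split into contiguous groups, and each group is evaluated by one worker thread. This is a minimal Python illustration under stated assumptions, not the authors' implementation. The name `evaluate_residuals` and the toy equations are hypothetical.

In CPython, threads only run truly in parallel when each evaluation releases the GIL, for example inside NumPy or native code. The sketch therefore shows the structure rather than a guaranteed speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def evaluate_residuals(equations, x, n_groups=4):
    """Evaluate residuals f_i(x) for a list of callables, splitting them into
    contiguous groups that are evaluated concurrently (one task per group)."""
    if not equations:
        return []
    size = math.ceil(len(equations) / n_groups)
    groups = [range(i, min(i + size, len(equations)))
              for i in range(0, len(equations), size)]
    residuals = [0.0] * len(equations)

    def eval_group(indices):
        # Each group writes only its own slots, so no locking is needed.
        for i in indices:
            residuals[i] = equations[i](x)

    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        list(pool.map(eval_group, groups))  # list() re-raises any worker exception
    return residuals


if __name__ == "__main__":
    # Toy system: f_i(x) = x_i^2 - (i + 1), for i = 0..9.
    eqs = [lambda x, i=i: x[i] ** 2 - (i + 1) for i in range(10)]
    print(evaluate_residuals(eqs, [1.0] * 10, n_groups=3))
```

Contiguous groups keep each thread's work simple to schedule. With equations of very uneven cost, a finer split (more groups than threads) would balance load better.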
Parallel Computing of Shared Memory Multiprocessors Based on JOMP

Zhang Hong,Cao Jie,Wang Xiaoming,Zhu Changsheng

DOI: https://doi.org/10.2991/meic-15.2015.346

2015-01-01

Abstract:Focusing on JOMP-based parallel computing on shared-memory multiprocessors, this paper examines in detail how a *.jomp file is written, compiled, and run, and describes a complete specification and process for parallel programming. It explains the format of the parallel directive set and the main runtime libraries and their functions, and further illustrates variable attributes in parallel regions. In an example that computes pi on a Lenovo machine with an Intel(R) Core(TM) i3-2120, the system obtained a speedup of 1.73 and an efficiency of 86.5%, demonstrating its feasibility and effectiveness.
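The two reported figures are related by the standard definitions of speedup and parallel efficiency. In these, $T_1$ is the sequential run time and $T_p$ is the run time on $p$ processors.

The abstract does not state $p$. This worked check assumes $p = 2$, one thread per physical core of the dual-core Core i3-2120. Under that assumption the figures are consistent:

```latex
\[
  S_p = \frac{T_1}{T_p}, \qquad
  E_p = \frac{S_p}{p}, \qquad
  E_2 = \frac{1.73}{2} = 0.865 = 86.5\%
\]
```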
Source-code-level Transformation and APT-Driven Parallelism Pre-processes for Embedded System Automated Design

Kang Zhao,Jinian Bian,Qiang Wu,Xianlong Hong

DOI: https://doi.org/10.1109/cesa.2006.4281702

2006-01-01

Abstract:This paper presents a pre-processing framework for embedded system design automation. The main motivation of the framework is to construct a unified internal platform that bridges the gap between the original system application input and the intermediate kernel representation used in hardware/software (HW/SW) co-design. To this end, the paper employs novel algorithms for transforming a C specification into a hierarchical control data flow graph (HCDFG) and for parallelism optimization. These algorithms satisfy the front-end requirements of HW/SW partitioning in the overall design. In particular, a novel model named the abstract parallel tree (APT) is presented in detail to provide theoretical support for implementing parallelism optimization. Finally, the experimental implementations are summarized, and the feasibility of the framework is validated.
Enhanced Parallelization via Constraints

Wei-Ngan Chin,Zhenjiang Hu,Masato Takeichi,Akihiko Takano

1997-01-01

Abstract:Systematic parallelization of sequential programs remains a major challenge in parallel computing. Traditional approaches using program schemes are somewhat narrow in scope, as the properties that enable parallelism are difficult to capture via ad-hoc schemes. We propose a more systematic approach to parallelization based on the notion of preserving the context of recursive sub-terms. This approach can be used to derive a class of divide-and-conquer programs. To enhance the methodology further, we advocate the use of required constraints to widen the class of programs that can be handled. A unique feature of our approach is that it supports both reusability and efficiency. In particular, both general and specialised constraints are gathered to make this marriage possible.
A Language of Suggestions for Program Parallelization

Chao Zhang,Chen Ding,Kirk Kelsey,Tongxin Bai,Xiaoming Gu,Xiaobing Feng

2009-01-01

Abstract:Coarse-grained task parallelism exists in sequential code and can be leveraged to make better use of chip multiprocessors. However, large tasks may execute thousands of lines of code and are often too complex to analyze and manage statically. This report describes a programming system called suggestible parallelization. It consists of a programming interface and a support system. The interface is a small language with three primitives for marking possibly parallel tasks and their possible dependences. The support system is implemented in software and ensures correct parallel execution through speculative parallelization, speculative communication, and speculative memory allocation. It manages parallelism dynamically to tolerate unevenness in task size, inter-task delay, and hardware speed. When evaluated on four full-size benchmark applications, suggestible parallelization obtains up to a 6-times speedup on 10 processors for sequential legacy applications of up to 35 thousand lines. The overhead of software speculation is not excessively high compared to unprotected parallel execution.
Integrated optimization scheme in parallelizing compilers

Sun Tong,Li Sanli,Li Xiaoming

1996-01-01

Ruan Jian Xue Bao/Journal of Software

Abstract:This paper presents a complete suite of systematic optimization methods for parallelizing compilers that target multicomputers or computer clusters. The compilation scheme adopts two strategies: trading off parallelism against communication cost, and reducing and hiding communication overhead. The authors analyze the data communication required by the program partitioning approach based on affine functions. From this analysis, they derive a method for exploiting parallelism in serial programs that satisfies the special requirements of distributed-memory machines. To minimize the total amount of data to be communicated, they devise a global program partitioning optimization method based on solving linear equations. To optimize the organization of communication code and generate more efficient node programs, they devise a more practical method based on linear inequalities for communication optimization and node program generation.
Parallelism analysis based on generalized method invocation model

Meng Yu,XueLin Yang,Wanyu Zang,Li Xie,ZhongXiu Sun

2002-01-01

Jisuanji Xuebao/Chinese Journal of Computers

Abstract:This paper introduces a parallelism analysis method that takes into account the polymorphism and reference aliasing of object-oriented languages. First, a generalized model of method invocation is introduced, and a parallelism analysis algorithm is proposed based on it. The algorithm consists of three steps (variable expression reduction, intraprocedural analysis, and interprocedural analysis), which together compute define-use sets. In the interprocedural analysis, the references of functions are computed to obtain the set of objects they may access, which gives more precise results than earlier work. Recursive procedures are also handled. The computational complexity of each algorithm is given. Finally, a simple example is used to compare the authors' work with earlier work. JAPS-II, a Java parallelizing compiler, exploits intra-object and inter-object parallelism in serial Java programs. Its target architecture is a NOW (network of workstations) based distributed-memory computer system. The optimizations in JAPS-II have been implemented.
Characterizing Fine-Grain Parallelism on Modern Multicore Platform

Xuhao Chen,Wei Chen,Jiawen Li,Zhong Zheng,Li Shen,Zhiying Wang

DOI: https://doi.org/10.1109/ICPADS.2011.41

2011-01-01

Abstract:Since chip multiprocessors have come to dominate the processor market, developing a parallel programming model with a proper trade-off between productivity and efficiency has become increasingly important. As a typical fine-grain parallelism model, Intel Threading Building Blocks (TBB) simplifies parallel programming through runtime scheduling. Despite its simplicity, it incurs non-trivial runtime overhead, which may grow as the thread count increases. In this work, we conduct experiments on real commodity hardware to evaluate the performance scalability of TBB using the PARSEC benchmark suite. We first compare TBB with Pthreads to show that TBB applications can achieve performance comparable to Pthreads applications. To find the performance bottlenecks of TBB applications, we then measure the runtime overhead of TBB, focusing on three basic TBB runtime activities. The results provide valuable guidance for developing scalable runtime libraries and architectural support that alleviate these bottlenecks.
Speculative Parallelization Using State Separation and Multiple Value Prediction

Chen Tian,Min Feng,Rajiv Gupta

DOI: https://doi.org/10.1145/1837855.1806663

2010-01-01

Abstract:With the availability of chip multiprocessor (CMP) and simultaneous multithreading (SMT) machines, extracting thread-level parallelism from sequential programs has become crucial for improving performance. However, many sequential programs cannot be easily parallelized because of dependences. Different solutions have been proposed to address this problem. Some make the optimistic assumption that such dependences rarely manifest themselves at run time; when this assumption is violated, however, recovery incurs very large overhead. Other approaches incur large synchronization or computation overhead when resolving the dependences. Consequently, previous techniques cannot speed up loops with frequently arising cross-iteration dependences. In this paper we propose a compiler technique that uses state separation and multiple value prediction to speculatively parallelize loops in sequential programs that contain frequently arising cross-iteration dependences. The key idea is to generate multiple versions of a loop iteration based on multiple predictions of the values of variables involved in cross-iteration dependences (i.e., live-in variables). These speculative versions and the preceding loop iteration are executed simultaneously in separate memory states. After execution, if one of these versions is correct (i.e., its predicted values are found to be correct), its state is merged with the state of the preceding iteration, because the dependence between the two iterations has been correctly resolved. The memory states of the other, incorrect versions are completely discarded. Building on this idea, we further propose a runtime adaptive scheme that not only delivers good performance but also achieves better CPU utilization. We conducted experiments on 10 benchmark programs on a real machine. The results show that our technique achieves an average speedup of 1.7x across all benchmarks.
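The core mechanism can be sketched in Python:
- While the current iteration executes, run the next iteration under several predicted live-in values.
- Keep only the version whose prediction turns out to be correct.

This is a simplified illustration, not the paper's compiler technique. `speculative_pair` and the loop body `count_positive` are hypothetical. State separation is modeled by requiring the loop body to be a pure function, so discarding a wrong version simply means ignoring its result.

```python
from concurrent.futures import ThreadPoolExecutor


def speculative_pair(body, live_in, item_a, item_b, predictions):
    """Run iteration A together with speculative versions of the next
    iteration B, one per predicted live-in value.

    body(live_in, item) -> (live_out, result) must be side-effect free,
    standing in for the paper's separate memory states.
    Returns (live_out_after_B, [result_A, result_B], prediction_hit).
    """
    with ThreadPoolExecutor() as pool:
        fut_a = pool.submit(body, live_in, item_a)
        versions = {p: pool.submit(body, p, item_b) for p in predictions}
        live_a, res_a = fut_a.result()
        if live_a in versions:
            # A prediction matched A's actual live-out: commit that version.
            live_b, res_b = versions[live_a].result()
            hit = True
        else:
            # Misprediction: discard all versions and re-execute B normally.
            live_b, res_b = body(live_a, item_b)
            hit = False
    return live_b, [res_a, res_b], hit


def count_positive(count, x):
    """Toy loop body: the live-in `count` is carried across iterations."""
    return count + (1 if x > 0 else 0), count * x
```

For `count_positive`, the live-in counter either stays the same or grows by one. Predicting both `live_in` and `live_in + 1` therefore always yields a hit: `speculative_pair(count_positive, 0, 5, 7, [0, 1])` returns `(2, [0, 7], True)`. With a poor prediction such as `[5]`, the second iteration is simply re-executed and the result is the same, with `hit` set to `False`.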
A New Parallel Skeleton for General Accumulative Computations

Hideya Iwasaki,Zhenjiang Hu

DOI: https://doi.org/10.1023/B:IJPP.0000038069.80050.74

2004-01-01

International Journal of Parallel Programming

Abstract:Skeletal parallel programming enables programmers to build a parallel program from ready-made components (parallel primitives) for which efficient implementations are known to exist, making both the parallel program development and the parallelization process easier. Constructing efficient parallel programs is often difficult, however, due to difficulties in selecting a proper combination of parallel primitives and in implementing this combination without having unnecessary creations and exchanges of data among parallel primitives and processors. To overcome these difficulties, we propose a powerful and general parallel skeleton, accumulate, which can be used to naturally code efficient solutions to problems as well as be efficiently implemented in parallel using Message Passing Interface (MPI).
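The accumulate skeleton is more general than a short example can show. The idea of a skeleton backed by a known efficient parallel implementation can still be illustrated with a simpler, widely used one: an inclusive prefix scan over an associative operator.

The Python sketch below uses the usual two-phase block scheme:
1. Scan each block independently.
2. Combine each block with the running total of all earlier blocks.

It is not the paper's accumulate skeleton. `parallel_scan` is hypothetical, and threads stand in for MPI processes.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def parallel_scan(xs, op=operator.add, n_blocks=4):
    """Inclusive prefix scan of xs under an associative op, computed blockwise.

    Phase 1 scans each block in parallel; a short sequential pass scans the
    block totals; phase 2 combines each block with the total of all earlier
    blocks, again in parallel.
    """
    if not xs:
        return []
    size = -(-len(xs) // n_blocks)  # ceiling division
    blocks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor() as pool:
        local = list(pool.map(lambda b: list(accumulate(b, op)), blocks))  # phase 1
        # totals[j] is the combined value of blocks 0..j (only n_blocks items).
        totals = list(accumulate((b[-1] for b in local[:-1]), op))

        def fix(j):
            # Earlier-blocks total goes on the left, so non-commutative ops work.
            return local[0] if j == 0 else [op(totals[j - 1], v) for v in local[j]]

        fixed = list(pool.map(fix, range(len(local))))  # phase 2
    return [v for block in fixed for v in block]
```

Example: `parallel_scan([1, 2, 3, 4])` returns `[1, 3, 6, 10]`. `parallel_scan(list("abc"), operator.add, 2)` returns `["a", "ab", "abc"]`, which shows that only associativity, not commutativity, is required.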
ON PARALLEL PROGRAMMING AND OPTIMISATION FOR MULTI-CORE

Chen Dai,Peng Chen,Donglei Yang,Weihua Zhang

DOI: https://doi.org/10.3969/j.issn.1000-386x.2013.12.052

2013-01-01

Abstract:With the popularity of multi-core and even many-core platforms, parallel programming and optimisation for multi-core have become a focus of research in computer science. However, most programmers still follow traditional serial programming habits. How to parallelise serial programs effectively and compile multi-core programs efficiently have therefore become urgent issues. In this paper, we comprehensively study and analyse the status quo of multi-core programming and optimisation technologies. While describing ways to parallelise serial programs, we also analyse the mainstream tools and models for multi-core parallel programming. On that basis, we further discuss the factors in the multi-core programming process that may affect program performance, and elaborate on optimisations for multi-core programming in both software and hardware. Finally, based on an analysis and appraisal of various research projects, we present prospects for possible development directions of parallel programming and optimisation technologies for multi-core.
