Abstract: Although multi-core processors have become the mainstream processor architecture, many serial programs still cannot take advantage of their parallel computing power because they lack efficient parallelization. Manually re-engineering and refactoring such legacy software is time-consuming and costly, so automatic parallelization techniques have become a focus of attention in academia and industry. This article proposes a novel semi-automatic parallelization approach that targets optimization of regular for-loops and coarse-grained parallelism for irregular code sections in general programs. The approach employs a dynamic program analyzer to obtain the control and data dependences of a program. The gathered dependence information is used to form Computational Unit (CU) graphs, from which task graphs are then created, and coarse-grained task parallelism of code sections is extracted from the task graphs. For for-loop code, a series of optimizations is applied as code transformations. A profitable tiling model is proposed to select the loops that are suitable for further optimization. The model is based on a large body of statistical data from locality analysis of loop iterations, and it determines whether a loop should be tiled by invoking a loop transformation optimizer. Tile size selection (TSS) has an important impact on the performance of tiled code, and a uniform-mapping-in-cache-based tile size selection (UMC-TSS) is proposed to generate better-performing tiled code. UMC-TSS improves on a state-of-the-art TSS method to achieve better cache utilization and loop parallelism. Finally, a source-to-source transformation framework based on the LLVM frontend Clang is developed to transform sequential C/C++ code into Intel TBB parallel code. The framework integrates dynamic program analysis, coarse-grained parallelism extraction, loop optimizations (including the proposed profitable tiling model and UMC-TSS), and code transformations. It performs high-level code restructuring on the program's abstract syntax tree. According to the task graphs, the Intel TBB parallel_for and flow graph templates are used to package the for-loops and the irregular code sections, respectively, into parallel code. The code transformation is semi-automatic in that only a little manual effort and intervention is involved. A series of experiments was conducted to evaluate the performance of the transformed parallel code on 18 representative benchmarks selected from four different benchmark suites. The experimental results show that the parallel code generated by the semi-automatic approach achieves good parallelism compared with parallel code written by experts, especially for code with optimized for-loops. The average speedups of for-loop parallelization and task parallelization are 10.95 and 4.45, respectively, on an Intel Xeon multi-core server. The evaluation also validates the correctness of the profitable tiling model. The results further show that UMC-TSS improves the performance of tiled loop code by 4% on average compared with a state-of-the-art tile size selection algorithm, and that the generated Intel TBB parallel code scales well as the number of threads varies, which demonstrates the effectiveness of the parallelization approach and the source-to-source transformation framework presented in this paper.
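
To make the loop tiling discussed in the abstract concrete, the following is a minimal sketch of a tiled loop nest in standard C++. It is not the transformation produced by the paper's framework and it does not implement UMC-TSS: the function names and the tile parameter are hypothetical, and the tile size is simply passed in by the caller rather than selected by a cache model. The example transposes a square row-major matrix, first with a plain loop nest and then with the same iteration space split into tile-by-tile blocks so that each block of the input and output is reused while it is likely still in cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Untiled reference: b = transpose(a) for an n x n row-major matrix.
void transpose_naive(const std::vector<double>& a, std::vector<double>& b,
                     std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            b[j * n + i] = a[i * n + j];
}

// Tiled version of the same loop nest. The outer two loops walk over
// blocks of at most tile x tile elements; the inner two loops visit the
// elements of one block. tile is a hypothetical, caller-chosen parameter
// and must be greater than zero.
void transpose_tiled(const std::vector<double>& a, std::vector<double>& b,
                     std::size_t n, std::size_t tile) {
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t jj = 0; jj < n; jj += tile) {
            // Clamp the block bounds so n need not be a multiple of tile.
            const std::size_t i_end = std::min(ii + tile, n);
            const std::size_t j_end = std::min(jj + tile, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    b[j * n + i] = a[i * n + j];
        }
    }
}
```

Both functions compute the same result; only the order in which elements are visited differs. In this particular loop nest the blocks write to disjoint parts of the output, so the outer block loops are the kind of loop that could be handed to a parallel loop template such as the Intel TBB parallel_for mentioned above, although that step is not shown here.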

Loop-Oriented Pointer Analysis for Automatic SIMD Vectorization.

Loop-oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization

A Compiler Approach for Exploiting Partial SIMD Parallelism.

Exploiting Mixed SIMD Parallelism by Reducing Data Reorganization Overhead

LLM-Vectorizer: LLM-based Verified Loop Vectorizer

A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation

Accelerating Lattice Boltzmann Method By Fully Exposing Vectorizable Loops

A Specialized Low-Cost Vectorized Loop Buffer for Embedded Processors

goSLP: Globally Optimized Superword Level Parallelism Framework

Rethinking Incremental and Parallel Pointer Analysis

A Semi-Automatic Coarse-Grained Parallelization Approach for Loop Optimization And Irregular Code Sections

Pointer Analysis Algorithm in Static Buffer Overflow Analysis

Lasa: Abstraction and Specialization for Productive and Performant Linear Algebra on FPGAs

An Approach to Enhance Loop Performance for Multicluster VLIW DSP Processor.

A New Approach to Pointer Analysis for Assignments.

Mis-speculation-Driven Compiler Framework for Aggressive Loop Automatic Parallelization

Automatically harnessing sparse acceleration

Improving SIMD Parallelism via Dynamic Binary Translation

Automated Compiler Optimization of Multiple Vector Loads/Stores

LUAEMA: A Loop Unrolling Approach Extending Memory Accessing for Vector Very-Long-Instruction-Word Digital Signal Processor with Multiple Register Files

Automatic parallelization of fine-grained metafunctions on a chip multiprocessor