NestedMP: Taming Complex Configuration Space of Degree of Parallelism for Nested-Parallel Programs

Abstract: Exploiting multiple levels of parallelism benefits a wide range of applications, because a typical server now has tens of processor cores. As core counts continue to grow rapidly, efficient support for nested parallelism will become more important. However, nested parallelism is much harder to program than single-level parallelism, because its configuration space of degree of parallelism is far more complex. Current parallel programming models such as OpenMP offer only naive support for nested parallelism, and programmers must explicitly specify the number of threads for each parallel task to get reasonable performance. This approach has two drawbacks. First, writing code that determines appropriate configurations for different environments and contexts is complicated. Second, the runtime system lacks sufficient global information about thread allocation to make optimal task-to-core mapping decisions, which easily causes significant performance loss. To address these problems, we propose NestedMP, a set of directives that extends OpenMP. NestedMP adopts a model that propagates the available threads down the task tree in a top-down manner. This gives the runtime system global information about thread allocation when high-level parallel tasks are launched, helping it make locality-aware task-to-core mapping decisions. In addition, instead of configuring the number of threads explicitly, programmers control it through policies defined in NestedMP. We have written several benchmarks with NestedMP, which show that NestedMP makes the code more concise in most cases. We implemented NestedMP in GCC 4.8.2 and tested the performance of these benchmarks on a 4-way 8-core Sandy Bridge server. The results show that NestedMP significantly improves performance over GCC's OpenMP implementation.
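
The top-down propagation of available threads over a task tree can be illustrated with a minimal sketch in plain C. It is not NestedMP's implementation or directive syntax, which the abstract does not give. The `task_node` type, the `propagate_budget` function, and the even-split policy are all hypothetical, chosen only to show how a parent's thread budget could reach its nested child tasks before they launch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task-tree node; not a NestedMP data structure. */
typedef struct task_node {
    int budget;                 /* threads available to this task (filled in) */
    size_t n_children;          /* number of nested parallel child tasks */
    struct task_node *children; /* array of n_children nodes, or NULL */
} task_node;

/*
 * Assumed policy (for illustration only): split the parent's budget evenly
 * among its children, hand the remainder to the first children, and never
 * give a child fewer than one thread.
 */
void propagate_budget(task_node *node, int budget)
{
    assert(node != NULL && budget >= 1);
    node->budget = budget;
    if (node->n_children == 0)
        return;                 /* leaf task: nothing further to propagate */

    int n = (int)node->n_children;
    int share = budget / n;     /* even share per child */
    int extra = budget % n;     /* leftover threads, one each to first children */
    for (int i = 0; i < n; ++i) {
        int child_budget = share + (i < extra ? 1 : 0);
        if (child_budget < 1)
            child_budget = 1;   /* more children than threads: child runs serially */
        propagate_budget(&node->children[i], child_budget);
    }
}
```

Calling `propagate_budget(&root, 32)` on a root with three leaf children, with 32 matching the core count of the 4-way 8-core server mentioned above, assigns the children budgets of 11, 11, and 10. Because every node knows its budget before its children start, a runtime could in principle use that information for placement decisions. How NestedMP actually does so is not described here.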

Hybrid Parallel Programming Model for Hierarchical NoC

Hybrid Performance Modeling And Analyzing Of Parallel Systems

A hybrid hierarchical architecture for 3D multi-cluster NoC

Application-level pipelining on Hierarchical NoC

Implementation and Simulation of a Cluster-Based Hierarchical NoC Architecture for Multi-Processor SoC

Parallel approach of three-dimensional phase-field model of binary alloy in MPI+OpenMP environment

Multi-GPU Hybrid Programming Accelerated Three-Dimensional Phase-Field Model in Binary Alloy

NestedMP: Enabling Cache-Aware Thread Mapping for Nested Parallel Shared Memory Applications

Hierarchical Network-on-Chip Design Method

Study on MPI/OpenMP hybrid parallelism for Monte Carlo neutron transport code

A Multi-Phase Based Multi-Application Mapping Approach for Many-Core Networks-on-Chip

Customized Network-on-Chip Oriented to MPI Collective Operations

Software/Hardware Hybrid Network-on-Chip Simulation on FPGA

A Hierarchical Grid Algorithm for Accelerating High-Performance Conjugate Gradient Benchmark on Sunway Many-Core Processor

HPH: Hybrid Parallelism on Heterogeneous Clusters for Accelerating Large-scale DNNs Training

On the Parallelization Optimization Strategy for High Performance Computing Software

Mapping of Embedded Applications on Hybrid Networks-on-Chip with Multiple Switching Mechanisms

High Performance Network-on-Chips (NoCs) Design: Performance Modeling, Routing Algorithm and Architecture Optimization

Comparison of distributed parallel scheduling schemes for crop growth model