Abstract: NestedMP: Taming Complex Configuration Space of Degree of Parallelism for Nested-Parallel Programs. Exploiting multiple levels of parallelism benefits a wide range of applications, because a typical server now has tens of processor cores. As the number of cores per machine increases rapidly, efficient support for nested parallelism will become more important. However, nested parallelism is much harder to program than single-level parallelism, since its configuration space of degree of parallelism is more complicated. Current parallel programming models such as OpenMP offer only naive support for nested parallelism, and programmers need to specify the number of threads for each parallel task explicitly to get reasonable performance. This approach has two drawbacks. First, it is complicated to write code that figures out appropriate configurations for different environments and contexts. Second, the runtime system lacks sufficient global information about thread allocation to make optimal decisions on task-core mapping, which easily causes significant performance loss. To deal with these problems, we propose NestedMP, a set of directives that extends OpenMP. NestedMP adopts a model that propagates available threads down the task tree in a top-down way. This gives the runtime system global information about thread allocation when high-level parallel tasks are launched, which helps it make locality-aware task-core mapping decisions. In addition, instead of configuring the number of threads explicitly, programmers control it through policies defined in NestedMP. We have written several benchmarks with NestedMP, which show that NestedMP makes the code more concise in most cases. We have implemented NestedMP in GCC 4.8.2 and tested the performance of these benchmarks on a 4-way 8-core SandyBridge server. The results show that NestedMP improves performance significantly over GCC's OpenMP implementation.
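To make the manual-configuration problem concrete, the following is a minimal sketch of two-level parallelism in standard OpenMP, where the caller must supply the thread count for each level. It does not show NestedMP directives, which the abstract does not spell out. The names `nested_sum` and `inner_threads_for` are hypothetical, and the even-split heuristic is an assumption chosen for illustration, not a policy taken from the paper.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical heuristic (an assumption, not NestedMP's policy):
 * split the available cores evenly across the outer-level tasks. */
int inner_threads_for(int cores, int outer_threads)
{
    assert(cores > 0 && outer_threads > 0);
    int n = cores / outer_threads;
    return n > 0 ? n : 1;
}

/* Sum a rows x cols row-major matrix with two levels of parallelism.
 * The caller picks both thread counts explicitly, which is the manual
 * configuration burden the abstract describes. */
double nested_sum(const double *m, int rows, int cols,
                  int outer_threads, int inner_threads)
{
    assert(outer_threads > 0 && inner_threads > 0);
    double total = 0.0;
#ifdef _OPENMP
    omp_set_max_active_levels(2); /* permit two active nested levels */
#endif
    /* Outer level: one task per row. */
    #pragma omp parallel for num_threads(outer_threads) reduction(+:total)
    for (int r = 0; r < rows; ++r) {
        double row_sum = 0.0;
        /* Inner level: each outer thread opens its own nested team. */
        #pragma omp parallel for num_threads(inner_threads) reduction(+:row_sum)
        for (int c = 0; c < cols; ++c) {
            row_sum += m[(size_t)r * cols + c];
        }
        total += row_sum;
    }
    return total;
}
```

On a 32-core machine such as the 4-way 8-core server mentioned above, a caller might use `nested_sum(m, rows, cols, 4, inner_threads_for(32, 4))`. The right split depends on the machine and on where in the task tree the call occurs, and the runtime sees each `num_threads` request in isolation. Without OpenMP enabled at compile time (for example, without `-fopenmp` in GCC), the pragmas are ignored and the function runs serially with the same result.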

Parallelization of Module Network Structure Learning and Performance Tuning on SMP

An Implement of Parallel Module Network Learning Algorithm on Distributed Memory Multiprocessors

Parallel Module Network Learning on Distributed Memory Multiprocessors

Hybrid Performance Modeling And Analyzing Of Parallel Systems

Parallelization of Bayesian Network Based SNPs Pattern Analysis and Performance Characterization on SMP/HT

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

A Prediction Model For Parallel Back Propagation Neural Network On Smp-Cluster

Performance Optimization using Multimodal Modeling and Heterogeneous GNN

Coded Parallelism for Distributed Deep Learning.

Performance and Energy Consumption of Parallel Machine Learning Algorithms

Hybrid Parallel Programming Model for Hierarchical NoC

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration

A Stage-Level Network Parallelization Method Based on Depth Decomposition

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

Accelerating Deep Neural Network guided MCTS using Adaptive Parallelism

Core Placement Optimization of Many-core Brain-Inspired Near-Storage Systems for Spiking Neural Network Training

Parallel Network RAM: Effectively Utilizing Global Cluster Memory for Large Data-Intensive Parallel Programs

AUTOPARLLM: GNN-Guided Automatic Code Parallelization using Large Language Models

Proteus: Simulating the Performance of Distributed DNN Training