Abstract: In the era of heterogeneous computing, a new paradigm called accelerator level parallelism (ALP) has emerged. In ALP, accelerators are used concurrently to provide unprecedented levels of performance and energy efficiency. Reaching that goal requires solving many problems, one of the most challenging being co-execution. In this paper, we present a new scheduling framework called POAS, a general method for providing co-execution to applications. Our proposal consists of four steps: predict, optimize, adapt and schedule. With POAS, an unseen application can be executed concurrently in ALP with little effort. We evaluate POAS in a heterogeneous environment consisting of CPUs, GPUs (CUDA cores), and XPUs (Tensor cores) in two different fields, namely linear algebra (matrix multiplication benchmark) and deep learning (convolution benchmark). Our experiments show that POAS provides excellent performance and completes the tasks within a time very close to the optimal time for the hardware and applications used, with a negligible execution time overhead. Moreover, the POAS predictor performed exceptionally well, achieving very low RMSE values for both use cases. Therefore, POAS can be a valuable tool for fully exploiting ALP and improving overall performance over offloading in heterogeneous settings.

What problem does this paper attempt to address?

The problem this paper addresses is the challenge of achieving accelerator-level parallelism (ALP) in a heterogeneous computing environment, especially the co-execution problem. Specifically, the authors propose a new scheduling framework, POAS (Predict, Optimize, Adapt, and Schedule), which aims to minimize the execution time of a given workload by concurrently executing a single task on multiple accelerators. The main objective of the paper is to show how POAS can effectively utilize ALP in a heterogeneous environment, improve overall performance, and have a low execution overhead.

### Main Contributions

1. **Defined a new framework** for utilizing accelerator-level parallelism (ALP) in a heterogeneous environment. This framework is based on a new scheduling model that uses a performance predictor and combines the definition and optimization of a mathematical model.
2. **Detailed how the framework works in practical applications**, especially in the two practical cases of matrix multiplication and convolution.
3. **Provided an experimental evaluation of the framework in an ALP environment**, including experimental results for CPU, GPU, and XPU (tensor cores).

### Specific Problems Solved

- **Challenges of co-execution**: In a heterogeneous environment, it is very challenging for software to break down work into multiple parts and schedule these parts to different devices.
- **Scheduling objectives**: Minimize execution time, energy consumption, or both.
- **Hardware dependence**: The effectiveness of scheduling is highly dependent on the target hardware platform.

### Key Technologies

- **Predict**: Develop a performance prediction model to estimate the execution time and memory transfer cost of the CPU and accelerators.
- **Optimize**: Use the performance prediction model to construct a constraint satisfaction problem (CSP), and find the values that minimize the objective function through optimization.
- **Adapt**: Adjust the optimization results according to the requirements of the application so that they are suitable for the scheduler.
- **Schedule**: Use the optimized results to assign tasks to different accelerators and manage the communication between the CPU and accelerators.

### Experimental Verification

- **Application cases**: Matrix multiplication and convolution.
- **Hardware platforms**: Multi-core CPU, GPU (CUDA cores), and XPU (tensor cores).
- **Experimental results**: POAS performed excellently in the experiments, completing tasks in nearly optimal time with negligible execution overhead.

Through these technologies and experiments, the paper shows the potential of POAS to achieve efficient ALP in a heterogeneous computing environment, providing a valuable tool for future computing systems.
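
The Predict, Optimize, and Adapt steps listed under Key Technologies can be illustrated with a minimal sketch. It is not the paper's actual model. It assumes a linear cost per device (a fixed overhead plus a per-unit time), minimizes the finish time of the slowest device in closed form, and then rounds each share to a tile size. The names `Device`, `optimize_split`, `adapt_split`, and all numeric coefficients are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical linear performance predictor for one device."""
    name: str
    secs_per_unit: float   # assumed fitted slope (compute + transfer per work unit)
    fixed_overhead: float  # assumed fixed cost (launch, transfer setup)

    def predict(self, units: float) -> float:
        # Predict step: an idle device costs nothing.
        return 0.0 if units <= 0 else self.fixed_overhead + self.secs_per_unit * units


def optimize_split(devices, total_units):
    """Optimize step: pick shares so all used devices finish at the same time."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    active = list(devices)
    while True:
        inv = sum(1.0 / d.secs_per_unit for d in active)
        offset = sum(d.fixed_overhead / d.secs_per_unit for d in active)
        finish = (total_units + offset) / inv  # common finish time
        useful = [d for d in active if finish > d.fixed_overhead]
        if len(useful) == len(active):
            break
        active = useful  # drop devices whose overhead exceeds the finish time
    split = {d.name: 0.0 for d in devices}
    for d in active:
        split[d.name] = (finish - d.fixed_overhead) / d.secs_per_unit
    return split


def adapt_split(split, devices, total_units, granularity):
    """Adapt step: round shares down to a tile multiple, give the rest to the fastest device."""
    shares = {name: int(units // granularity) * granularity for name, units in split.items()}
    fastest = min(devices, key=lambda d: d.secs_per_unit)
    shares[fastest.name] += total_units - sum(shares.values())
    return shares


def makespan(shares, devices):
    """Predicted completion time of the slowest device."""
    return max(d.predict(shares[d.name]) for d in devices)


if __name__ == "__main__":
    # Made-up coefficients, for illustration only.
    devs = [Device("cpu", 4e-3, 0.0), Device("gpu", 1e-3, 0.1), Device("xpu", 5e-4, 0.1)]
    rows = 4096
    shares = adapt_split(optimize_split(devs, rows), devs, rows, granularity=8)
    print(shares, makespan(shares, devs))
```

The Schedule step would then launch each share on its device and handle the transfers, which is platform specific and left out here. Equalizing finish times is one simple way to minimize the makespan under a linear model; a nonlinear predictor would need a numerical solver instead.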

POAS: a framework for exploiting accelerator level parallelism in heterogeneous environments
