Abstract: Calculating many-body correlation functions is a critical kernel in many areas of scientific computing, especially Lattice Quantum Chromodynamics (Lattice QCD). The calculation is formalized as a sum over a large number of contraction terms, each of which can be represented by a graph whose vertices describe the quarks inside a hadron node and whose edges designate quark propagations at specific time intervals. Because the calculation is both computation- and memory-intensive, the real-world physics systems explored by Lattice QCD (e.g., multi-meson or multi-baryon systems) are best served by multiple GPUs. Unlike general graph processing, many-body correlation function calculations have two distinctive features: a large number of computation- and data-intensive kernels, and frequent reappearance of original and intermediate data. The former leads to expensive memory operations such as tensor movements and evictions. The latter offers data reuse opportunities that can mitigate the data-intensive nature of the calculations. However, existing graph-based multi-GPU schedulers cannot capture these data-centric features and therefore deliver sub-optimal performance for many-body correlation function calculations. To address this issue, this paper presents MICCO, a multi-GPU scheduling framework that accelerates contractions for correlation functions by taking the data dimension (e.g., data reuse and data eviction) into account. This work first performs a comprehensive study of the interplay between data reuse and load balance, and introduces two new concepts, local reuse pattern and reuse bound, to study the opportunity of achieving the optimal trade-off between them. Based on this study, MICCO proposes a heuristic scheduling algorithm and a machine-learning-based regression model that generates the optimal setting of reuse bounds.
MICCO is integrated into a real-world Lattice QCD system, Redstar, enabling it to run on multiple GPUs for the first time. The evaluation demonstrates that MICCO outperforms other state-of-the-art works, achieving up to 2.25× speedup on synthesized datasets and 1.49× speedup on real-world correlation functions.
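
The trade-off between data reuse and load balance described in the abstract can be illustrated with a small greedy scheduler. This is only a sketch under stated assumptions, not MICCO's algorithm: the abstract does not define "reuse bound", so the code interprets it hypothetically as the maximum amount by which a GPU's task count may exceed that of the least-loaded GPU while still being eligible to receive a task for reuse reasons. The names `Gpu`, `schedule`, and `reuse_bound`, and the model of a task as a tuple of input tensor ids, are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    resident: set = field(default_factory=set)  # tensor ids held on this GPU
    load: int = 0                               # number of tasks assigned so far


def schedule(tasks, num_gpus, reuse_bound):
    """Greedily place each task; return the chosen GPU index per task.

    ASSUMPTION: reuse_bound is interpreted here as the allowed load
    imbalance (in tasks) relative to the least-loaded GPU. This is a
    hypothetical reading, not the paper's definition.
    """
    gpus = [Gpu() for _ in range(num_gpus)]
    placement = []
    for inputs in tasks:
        needed = set(inputs)
        min_load = min(g.load for g in gpus)
        # Only GPUs within the allowed imbalance are eligible.
        candidates = [i for i, g in enumerate(gpus)
                      if g.load - min_load <= reuse_bound]
        # Prefer the most already-resident inputs, then lower load, then lower index.
        best = max(candidates,
                   key=lambda i: (len(gpus[i].resident & needed),
                                  -gpus[i].load, -i))
        gpus[best].resident |= needed
        gpus[best].load += 1
        placement.append(best)
    return placement


tasks = [("a", "b"), ("a", "c"), ("a", "d"), ("e", "f")]
print(schedule(tasks, num_gpus=2, reuse_bound=0))  # [0, 1, 0, 1]
print(schedule(tasks, num_gpus=2, reuse_bound=2))  # [0, 0, 0, 1]
```

With a bound of 0 the sketch degenerates to strict load balancing and tensor `a` ends up on both GPUs; with a bound of 2 the three tasks sharing `a` are co-located at the cost of an uneven load. Choosing the bound is the tuning problem the abstract assigns to a regression model.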

Interference-aware execution framework with Co-scheML on GPU clusters

Interference-aware parallelization for deep learning workload in GPU cluster

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving

Horus: An Interference-Aware Resource Manager for Deep Learning Systems

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

Characterizing the Performance of Emerging Deep Learning, Graph, and High Performance Computing Workloads Under Interference

Runtime Monitoring of ML-Based Scheduling Algorithms Toward Robust Domain-Specific SoCs

Exploring the Diversity of Multiple Job Deployments over GPUs for Efficient Resource Sharing

Unleashing the Power of Preemptive Priority-based Scheduling for Real-Time GPU Tasks

A HPC Co-Scheduler with Reinforcement Learning

Themis: Fair and Efficient GPU Cluster Scheduling

FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification

Work-in-Progress: Scheduler for Collaborated FPGA-GPU-CPU Based on Intermediate Language

MICCO: an Enhanced Multi-GPU Scheduling Framework for Many-Body Correlation Functions

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

Scheduling Distributed Deep Learning Jobs in Heterogeneous Cluster with Placement Awareness

PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

GCAPS: GPU Context-Aware Preemptive Priority-based Scheduling for Real-Time Tasks

Run-Time Performance Estimation and Fairness-Oriented Scheduling Policy for Concurrent GPGPU Applications