Programming Bare-Metal Accelerators with Heterogeneous Threading Models: A Case Study of Matrix-3000

Jianbin Fang, Peng Zhang, Chun Huang, Tao Tang, Kai Lu, Ruibo Wang, Zheng Wang

DOI: https://doi.org/10.48550/arXiv.2210.12230

2022-10-22

Abstract: As the hardware industry moves towards using specialized heterogeneous many-cores to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. This article shares our experience when developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing systems software to enable the programming of bare-metal accelerators. Our programming models have been deployed to the production environment of an exascale prototype system.

Programming Languages; Distributed, Parallel, and Cluster Computing; Performance

What problem does this paper attempt to address?

The paper addresses the lack of a programming model, compiler, and runtime system for the Matrix-3000 heterogeneous many-core accelerator. Matrix-3000 is designed for next-generation exascale supercomputers and has a complex memory hierarchy and processor organization. The authors built a software stack from scratch, consisting of a low-level programming interface (hthreads) and a high-level OpenCL compiler (MOCL3), with the goals of performance, programmability, and portability. Specifically, the paper addresses the following issues:

1. **Programming Complexity**: Matrix-3000's hardware architecture and programming model differ significantly from those of traditional multi-core processors, which makes code hard to write and optimize. The paper reduces this difficulty by introducing the low-level programming interface hthreads and the high-level programming interface MOCL3.
2. **Memory Management**: Matrix-3000 has a complex memory hierarchy with multiple memory levels and distributed memory. The paper addresses this by designing efficient memory management and data transfer mechanisms.
3. **Thread Management**: Matrix-3000 is a bare-metal device without operating system support, so debugging and managing parallel threads is challenging. The paper manages threads through a heterogeneous threading model and optimized synchronization mechanisms.
4. **Performance Optimization**: To exploit the computational potential of Matrix-3000, the paper implements several optimizations in the compiler and runtime system, such as vector extensions and atomic operation implementations.

Overall, the paper aims to provide a comprehensive programming model and toolchain that lets developers use the high-performance computing capabilities of Matrix-3000 more efficiently.
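To make the offload pattern behind items 2 and 3 more concrete, the sketch below imitates it in plain Python: a host-side launcher starts a group of workers, each worker owns a contiguous slice of the data, and each slice is processed tile by tile through a small local buffer that stands in for on-chip memory. This is only an illustration under stated assumptions. It does not use or reproduce the hthreads or MOCL3 interfaces, and every name in it (`run_on_accelerator`, `vector_add_kernel`, `vector_add`, `tile_size`) is hypothetical.

```python
import threading


def run_on_accelerator(kernel, num_threads, *args):
    """Hypothetical stand-in for a host-side launch.

    Starts num_threads workers, passes each its thread id, and waits
    for all of them to finish (the join acts as a final barrier).
    """
    workers = [
        threading.Thread(target=kernel, args=(tid, num_threads, *args))
        for tid in range(num_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


def vector_add_kernel(tid, num_threads, a, b, out, tile_size):
    """Each worker adds its own contiguous slice of a and b into out.

    The slice is processed in tiles. Each tile is first copied into
    local lists, which stand in for a small on-chip scratchpad.
    """
    n = len(a)
    chunk = (n + num_threads - 1) // num_threads  # ceiling division
    start = min(tid * chunk, n)
    end = min(start + chunk, n)
    for lo in range(start, end, tile_size):
        hi = min(lo + tile_size, end)
        local_a = a[lo:hi]  # "copy in" from large memory
        local_b = b[lo:hi]
        local_out = [x + y for x, y in zip(local_a, local_b)]
        for i, value in enumerate(local_out):  # "copy out" the results
            out[lo + i] = value


def vector_add(a, b, num_threads=4, tile_size=8):
    """Host-side wrapper: allocate the output, launch, return the result."""
    out = [0] * len(a)
    run_on_accelerator(vector_add_kernel, num_threads, a, b, out, tile_size)
    return out
```

Workers write to disjoint index ranges, so no locking is needed here. On a real bare-metal accelerator the copy-in and copy-out steps would be explicit data transfers between memory levels, and the tile size would be bounded by the capacity of the local memory.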


High Performance Matrix Multiplication on Many Cores

OpenH: A Novel Programming Model and API for Developing Portable Parallel Programs on Heterogeneous Hybrid Servers

A Unified Programming Model for Heterogeneous Computing with CPU and Accelerator Technologies

AHA: An Agile Approach to the Design of Coarse-Grained Reconfigurable Accelerators and Compilers

Performance and Power Efficient Massive Parallel Computational Model for HPC Heterogeneous Exascale Systems

The Feature, Programming Model and Performance Optimization Strategy of Heterogeneous Many-Core System: A Review

Simultaneous and Heterogenous Multithreading: Exploiting Simultaneous and Heterogeneous Parallelism in Accelerator-Rich Architectures

Performance Evaluation of Hybrid Programming Patterns for Large CPU/GPU Heterogeneous Clusters.

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator

A Heterogeneous Accelerated Matrix Multiplication: OpenCL + APU + GPU + Fast Matrix Multiply

Allo: A Programming Model for Composable Accelerator Design

Parallel Model Research on the Heterogeneous Computer System

Characterizing Fine-Grain Parallelism on Modern Multicore Platform

Performance Evaluation and Analysis of Linear Algebra Kernels in the Prototype Tianhe-3 Cluster.

Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

High Level Programming for Heterogeneous Architectures

A homegrown many-core processor architecture for high-performance computing

Exploring the Architecture of Multiple GEMM Accelerators in Heterogeneous Systems

Accelerating and Tuning Small Matrix Multiplications on Sunway TaihuLight: A Case Study of Spectral Element CFD Code Nek5000

Matrix-free approaches for GPU acceleration of a high-order finite element hydrodynamics application using MFEM, Umpire, and RAJA