Programming Bare-Metal Accelerators with Heterogeneous Threading Models: A Case Study of Matrix-3000

Jianbin Fang, Peng Zhang, Chun Huang, Tao Tang, Kai Lu, Ruibo Wang, Zheng Wang
DOI: https://doi.org/10.48550/arXiv.2210.12230
2022-10-22
Abstract: As the hardware industry moves towards using specialized heterogeneous many-cores to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. This article shares our experience when developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing systems software to enable the programming of bare-metal accelerators. Our programming models have been deployed to the production environment of an exascale prototype system.
Programming Languages; Distributed, Parallel, and Cluster Computing; Performance
What problem does this paper attempt to address?
The paper addresses the development of a programming model, together with its supporting compiler and runtime system, for the Matrix-3000 heterogeneous many-core accelerator. Matrix-3000 is designed for next-generation exascale supercomputers and has a complex memory hierarchy and processor organization. The authors built a software stack from scratch, comprising a low-level programming interface (hthreads) and a high-level OpenCL compiler (MOCL3), with the goals of performance, programmability, and portability. Specifically, the paper addresses the following issues:

1. **Programming Complexity**: The hardware architecture and its programming model differ significantly from those of traditional multi-core processors, which makes code difficult to write and optimize. The paper lowers this barrier with the low-level interface hthreads and the high-level interface MOCL3.
2. **Memory Management**: Matrix-3000 has a complex memory hierarchy with multiple memory levels and distributed memory. The paper addresses this by designing efficient memory management and data transfer mechanisms.
3. **Thread Management**: Matrix-3000 is a bare-metal device without operating system support, so debugging and managing parallel threads is very challenging. The paper handles this with a heterogeneous threading model and optimized synchronization mechanisms.
4. **Performance Optimization**: To exploit the computational potential of Matrix-3000, the paper implements optimizations in the compiler and runtime system, such as vector extensions and atomic operation implementations.

Overall, the paper aims to provide a comprehensive programming model and toolchain that lets developers use the high-performance computing capabilities of Matrix-3000 more efficiently.