Productivity, Portability, Performance: Data-Centric Python

Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Tiziano De Matteis, Johannes de Fine Licht, Luca Lavarini, Torsten Hoefler
DOI: https://doi.org/10.1145/1122445.1122456
2021-08-23
Abstract: Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python's high productivity while achieving portable performance across different architectures. The workflow's key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes.
Programming Languages; Distributed, Parallel, and Cluster Computing; Performance
What problem does this paper attempt to address?
The paper addresses the tension among productivity, portability, and performance when Python is used for scientific computing in high-performance computing (HPC). Specifically:

1. **Productivity**: Python has become the preferred language for scientific computing because of its ease of use and its rich scientific libraries (such as NumPy), but the language itself does not necessarily deliver high performance. The paper proposes a workflow that retains Python's high productivity while achieving portable performance across architectures.
2. **Portability**: As hardware grows more diverse, running the same code efficiently on different platforms (such as CPU, GPU, and FPGA) becomes a challenge. The paper provides a set of automatic optimizations that let the same Python code run efficiently on multiple targets.
3. **Performance**: Despite Python's popularity in scientific computing, its interpreted execution creates performance bottlenecks. The paper introduces a data-centric intermediate representation (DCIR) and a set of automatic optimizations to improve the performance of Python code in HPC environments.

The main contributions of the paper include:

- Defining a methodology for high-performance Python, with extensions based on explicit annotations that improve the translation of Python code to the DCIR.
- Providing a set of automatic optimizations for CPU, GPU, and FPGA, with speedups of 2.47x on CPU and 3.75x on GPU over previous-best solutions.
- Implementing automatic implicit MPI conversion and communication optimization, as well as explicit distribution management. The former reaches a scaling efficiency of up to 93.16% on 512 nodes.
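To make the idea of explicit annotations more concrete, here is a minimal sketch in plain Python and NumPy. The `program` decorator and the `signature_info` attribute are hypothetical names introduced only for illustration; they are not the paper's API, and this stand-in performs no compilation or optimization. It only shows how type annotations on a NumPy function expose static information that a compiler front end could use when translating the function to an intermediate representation.

```python
import inspect

import numpy as np


def program(func):
    """Hypothetical stand-in decorator (not the paper's API).

    It records the parameter annotations of the decorated function and
    returns the function unchanged. A real framework would use such
    information to build an intermediate representation.
    """
    func.signature_info = {
        name: param.annotation
        for name, param in inspect.signature(func).parameters.items()
    }
    return func


@program
def axpy(alpha: float, x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Ordinary NumPy code: the annotations add information, not behavior.
    return alpha * x + y


result = axpy(2.0, np.array([1.0, 2.0]), np.array([10.0, 20.0]))
```

The function body stays ordinary NumPy code, which is the productivity argument: the annotations add information without changing how the computation is written.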
Through these contributions, the paper addresses productivity, portability, and performance together in a single system.
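The speedup and scaling-efficiency figures quoted above follow standard definitions, sketched below. This is a minimal sketch under stated assumptions: the function names are illustrative, the timings in the usage lines are made-up values rather than measurements from the paper, and the summary does not say whether the 93.16% figure refers to weak or strong scaling, so both definitions are shown.

```python
def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Speedup of an optimized run relative to a baseline run."""
    return baseline_seconds / optimized_seconds


def weak_scaling_efficiency(t_one_node: float, t_n_nodes: float) -> float:
    """Weak scaling: the problem size grows with the node count,
    so ideal runtime stays constant and efficiency is T(1) / T(N)."""
    return t_one_node / t_n_nodes


def strong_scaling_efficiency(
    t_one_node: float, t_n_nodes: float, nodes: int
) -> float:
    """Strong scaling: the problem size is fixed,
    so ideal runtime is T(1) / N and efficiency is T(1) / (N * T(N))."""
    return t_one_node / (nodes * t_n_nodes)


# Hypothetical timings, chosen only to exercise the formulas.
example_speedup = speedup(baseline_seconds=10.0, optimized_seconds=4.0)
example_weak = weak_scaling_efficiency(t_one_node=1.0, t_n_nodes=1.25)
example_strong = strong_scaling_efficiency(
    t_one_node=64.0, t_n_nodes=0.25, nodes=512
)
```

Under either definition, an efficiency of 1.0 means perfect scaling, and a value such as 0.9316 means the run achieves about 93% of the ideal.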