Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC

Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Muneo Hori, Maddegedara Lalith, Kengo Nakajima
DOI: https://doi.org/10.48550/arXiv.1710.08679
2017-10-24
Abstract: In this paper, we develop a low-order three-dimensional finite-element solver for fast multiple-case crust deformation analysis on GPU-based systems. Based on a high-performance solver designed for massively parallel CPU-based systems, we modify the algorithm to reduce random data access, and then insert OpenACC directives. The developed solver on ten Reedbush-H nodes (20 P100 GPUs) attained speedup of 14.2 times from 20 K computer nodes, which is high considering the peak memory bandwidth ratio of 11.4 between the two systems. On the newest Volta generation V100 GPUs, the solver attained a further 2.45 times speedup from P100 GPUs. As a demonstrative example, we computed 368 cases of crustal deformation analyses of northeast Japan with 400 million degrees of freedom. The total procedure of algorithm modification and porting implementation took only two weeks; we can see that high performance improvement was achieved with low development cost. With the developed solver, we can expect improvement in reliability of crust-deformation analyses by many-case analyses on a wide range of GPU-based systems.
Distributed, Parallel, and Cluster Computing; Mathematical Software
What problem does this paper attempt to address?
The paper addresses how to run many crustal deformation analyses quickly on GPU-based systems. The authors develop a low-order three-dimensional finite-element solver: starting from a high-performance solver designed for massively parallel CPU-based systems, they modify the algorithm to reduce random data access and then insert OpenACC directives. On ten Reedbush-H nodes (20 P100 GPUs), the solver ran 14.2 times faster than on 20 K computer nodes, and on Volta-generation V100 GPUs it achieved a further 2.45-fold speedup over P100 GPUs. The paper also shows the solver being used to estimate the slip distribution of the 2011 Tohoku-Oki earthquake in Japan, demonstrating its usefulness in a practical application.

### Main contributions of the paper

1. **Algorithm optimization**: The algorithm is modified to reduce random memory access, and OpenACC directives are inserted so that the solver runs efficiently on GPUs.
2. **Performance improvement**: On the Reedbush-H system, 20 P100 GPUs delivered a 14.2-fold speedup over 20 K computer nodes; V100 GPUs delivered a further 2.45-fold speedup over P100 GPUs.
3. **Practical application**: The solver was used to compute 368 crustal deformation analysis cases for northeast Japan (Tohoku) with 400 million degrees of freedom, demonstrating its effectiveness in a realistic setting.

### Key technologies

- **Adaptive conjugate gradient method**: Adaptive preconditioning increases the convergence speed of the iterative solver.
- **Mixed-precision arithmetic**: Double-precision variables are used in the outer loop and single-precision variables in the inner loop to reduce memory usage and communication volume.
- **Geometric / algebraic multigrid method**: The problem is progressively coarsened through a multigrid hierarchy to reduce computational cost.
- **Element-by-element method**: Sparse matrix-vector products are computed element by element, without storing the global matrix, to reduce the memory bandwidth load and improve computational performance.

### Performance tests

- **Benchmark test**: A finite-element model with 125,177,217 degrees of freedom and 30,720,000 second-order tetrahedral elements was used for testing.
- **Weak scalability test**: Tests on the Reedbush-H system using up to 240 P100 GPUs showed approximately constant computation time, supporting the parallel efficiency of the algorithm.
- **Latest GPU architecture test**: P100 and V100 GPUs were compared on the DGX-1 system, and the V100 GPUs showed a significant performance improvement.

### Conclusion

Through algorithm modification and the use of OpenACC directives, the paper achieves efficient multi-case crustal deformation computation on GPU-based systems. This significantly improves computation speed and supports earthquake disaster simulation and prediction through many-case analyses.
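To make the element-by-element method and the multi-case idea listed under the key technologies more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy 1D mesh of 2-node elements rather than the second-order tetrahedra used in the paper, and plain Python rather than OpenACC. The names `ebe_matvec`, `assembled_matvec`, `conn`, and `stiffness` are hypothetical. The sketch computes y = K u for several cases at once without assembling K; each element's indirect node lookup is done once and reused across all cases, which is one plausible way to read "reducing random data access" for multiple simulations.

```python
def ebe_matvec(conn, stiffness, u):
    """Compute y = K u for all cases without assembling the global matrix K.

    conn[e]      -> (i, j) node indices of 2-node element e (assumed toy mesh)
    stiffness[e] -> scalar k_e; the element matrix is k_e * [[1, -1], [-1, 1]]
    u[n]         -> list of values at node n, one entry per case
    """
    n_case = len(u[0])
    y = [[0.0] * n_case for _ in u]
    for (i, j), k in zip(conn, stiffness):
        # Indirect (random) access happens once per element node ...
        ui, uj = u[i], u[j]
        yi, yj = y[i], y[j]
        # ... and the contiguous per-case values are then processed densely.
        for c in range(n_case):
            f = k * (ui[c] - uj[c])
            yi[c] += f
            yj[c] -= f
    return y


def assembled_matvec(conn, stiffness, u):
    """Reference result: assemble a dense global K, then multiply."""
    n = len(u)
    K = [[0.0] * n for _ in range(n)]
    for (i, j), k in zip(conn, stiffness):
        K[i][i] += k
        K[j][j] += k
        K[i][j] -= k
        K[j][i] -= k
    n_case = len(u[0])
    return [
        [sum(K[r][s] * u[s][c] for s in range(n)) for c in range(n_case)]
        for r in range(n)
    ]


# Toy problem: 4 nodes, 3 elements, 2 cases stored per node.
conn = [(0, 1), (1, 2), (2, 3)]
stiffness = [1.0, 2.0, 3.0]
u = [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [6.0, 1.0]]

y = ebe_matvec(conn, stiffness, u)
print(y)
```

The second case is a constant field, so its product with the stiffness matrix is zero at every node, which gives a quick sanity check. In a GPU port along the lines the summary describes, the element loop would be the natural candidate for an OpenACC parallel loop, with the scatter-add into `y` needing atomic updates or element coloring; that detail is an assumption here and is not taken from the source.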