GPU-accelerated MART and Concurrent Cross-Correlation for Tomographic PIV

Zeng Xin, He Chuangxin, Liu Yingzheng
DOI: https://doi.org/10.1007/s00348-022-03444-3
IF: 2.797
2022-01-01
Experiments in Fluids
Abstract: This paper presents a novel Graphics Processing Unit (GPU)-accelerated method for large-scale data processing in tomographic particle image velocimetry. The multiplicative algebraic reconstruction technique (MART) is used to reconstruct three-dimensional (3D) particle fields, and cross-correlation based on the fast Fourier transform is used to generate the displacement vectors. The Compute Unified Device Architecture (CUDA) C programming model is used to port the velocity field reconstruction from CPU code to GPU code to improve efficiency. For similar reconstruction tasks, a particular thread grid hierarchy is designed to construct the corresponding computational kernel functions, and each task is launched in a single thread. A modified pixel batch processing strategy is then used to manage GPU memory access. Subsequently, asynchronous stream concurrency is used to generate the velocity field with the GPU cuFFT library. A synthetic 3D experiment with a ring vortex is carried out to verify the accuracy and efficiency of the developed method. The parallel results agree well with the generated data and with other results reported in the literature. The speed-up ratio of the multi-core CPU (Intel® Xeon® Platinum 8168) parallel implementation with OpenMP converges to 2.5× in MFG-MART and 3.0× in cross-correlation. Compared with a 24-core CPU implementation, a GPU (NVIDIA Tesla V100S, 32 GB) under maximum memory usage achieves a speed-up ratio of over 20× in parallel MFG-MART and 4× in concurrent cross-correlation. The measurement of turbulent flow in a circular jet at a Reynolds number of 3,000 is used to examine the efficiency gain of the parallelized framework in real experimental settings.
For the synthetic volume reconstruction of 700 × 700 × 140 voxels and cross-correlation with a 41³-voxel window at 75% overlap, and the experimental volume reconstruction of 550 × 1100 × 550 voxels and cross-correlation with a 32³-voxel window at 50% overlap, one frame of the velocity field can be completed within 2 min in each case.
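
To make the reconstruction step concrete, the following is a minimal CPU sketch of the standard MART update, in which each voxel intensity is multiplied by the ratio of the recorded pixel intensity to the current projection, raised to the power of the relaxation factor times the weight. It is not the paper's MFG-MART variant or its CUDA kernels: the function names `mart_sweep` and `reconstruct`, the dense weight matrix, the uniform first guess, and the default relaxation factor are all assumptions made for illustration.

```python
def mart_sweep(intensity, weights, pixel_values, mu=1.0, eps=1e-12):
    """Apply one MART sweep over all pixels (hypothetical helper).

    intensity    : list of voxel intensities E_j, updated in place
    weights      : weights[i][j] is the contribution w_ij of voxel j to pixel i
    pixel_values : recorded pixel intensities I_i
    mu           : relaxation factor (assumed default, not from the paper)
    """
    for w_row, pixel in zip(weights, pixel_values):
        # Project the current voxel field onto this pixel.
        projection = sum(w * e for w, e in zip(w_row, intensity))
        if projection <= eps:
            continue  # no intensity along this line of sight; avoid dividing by zero
        ratio = pixel / projection
        # Multiplicative correction, applied only to voxels seen by this pixel.
        for j, w in enumerate(w_row):
            if w > 0.0:
                intensity[j] *= ratio ** (mu * w)
    return intensity


def reconstruct(weights, pixel_values, n_voxels, sweeps=5, mu=1.0):
    """Run several MART sweeps starting from a uniform first guess (assumption)."""
    intensity = [1.0] * n_voxels
    for _ in range(sweeps):
        mart_sweep(intensity, weights, pixel_values, mu)
    return intensity
```

In this sketch the correction for one pixel touches only the voxels with non-zero weight for that pixel, and a dark pixel drives those voxels to zero. How the paper actually maps this work onto threads and pixel batches is described only at a high level in the abstract and is not reproduced here.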