Abstract: Spherical Harmonic Transforms (SHT) are at the heart of many scientific and practical applications ranging from climate modelling to cosmological observations. In many of these areas, new, cutting-edge science goals have recently been proposed that require simulations and analyses of experimental or observational data at very high resolutions and in unprecedented volumes. Both aspects pose a formidable challenge for existing implementations of the transforms. This paper describes parallel algorithms for computing SHT with two variants of intra-node parallelism appropriate for novel supercomputer architectures: multi-core processors and Graphics Processing Units (GPUs). It also discusses their performance, alone and embedded within a top-level, MPI-based parallelisation layer ported from the S2HAT library, in terms of accuracy, overall efficiency and scalability. We show that our inverse SHT run on GeForce 400 Series GPUs equipped with the latest CUDA architecture ("Fermi") outperforms the state-of-the-art implementation for a multi-core processor executed on a current Intel Core i7-2600K. Furthermore, we show that an MPI/CUDA version of the inverse transform run on a cluster of 128 Nvidia Tesla S1070 is as much as 3 times faster than the hybrid MPI/OpenMP version executed on the same number of quad-core Intel Nehalem processors for problem sizes motivated by our target applications. Performance of the direct transforms is, however, found to be at best comparable in these cases. We discuss in detail the algorithmic solutions devised for the major steps involved in the transform calculation, emphasising those with a major impact on overall performance, and elucidate the sources of the dichotomy between the direct and the inverse operations.

A Parallel Implementation of the 2D Wavelet Transform Using CUDA

A generic parallel computational framework of lifting wavelet transform for online engineering surface filtration

On the Use of Small 2D Convolutions on GPUs

Accelerating Fast Fourier Transforms Using Hadoop and CUDA

High performance multi-dimensional (2D/3D) FFT-Shift implementation on Graphics Processing Units (GPUs)

Parallel data mining techniques on Graphics Processing Unit with Compute Unified Device Architecture (CUDA)

Massively parallel non-stationary EEG data processing on GPGPU platforms with Morlet continuous wavelet transform

tcFFT: Accelerating Half-Precision FFT through Tensor Cores

A Parallel Approach for Contour Extraction Based on CUDA Platform

On Parallel Solution of Sparse Triangular Linear Systems in CUDA

Acceleration of Tensor-Product Operations with Tensor Cores

Implementation of Scale and Rotation Invariant On-Line Object Tracking Based on CUDA

CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms

A Parallel H.264 Encoder with CUDA: Mapping and Evaluation

GPU parallel simulation algorithm of Brownian particles with excluded volume using Delaunay triangulations

Improving Barnes-Hut t-SNE Algorithm in Modern GPU Architectures with Random Forest KNN and Simulated Wide-Warp

Accelerating Genome-Wide Association Studies Using CUDA Compatible Graphics Processing Units

Parallel Spherical Harmonic Transforms on heterogeneous architectures (GPUs/multi-core CPUs)

Efficient Parallel Video Processing Techniques on GPU: from Framework to Implementation

TensorFlow as a DSL for stencil-based computation on the Cerebras Wafer Scale Engine