Abstract: It is shown that micromagnetic and atomistic spin dynamics simulations can use multiple GPUs not only to reduce computation time, but also to allow for a larger simulation size than is possible on a single GPU. Interactions which depend on neighbouring spins, such as exchange interactions, may be implemented efficiently by transferring data between GPUs using halo regions, or alternatively using direct memory accesses. The main difficulty in achieving good performance scaling is the long-range demagnetizing interaction, for which the data transfer rate between GPUs is a significant bottleneck. A multi-GPU convolution algorithm is developed here, which relies on single-GPU FFTs executed in parallel. It is shown that even for micromagnetic simulations where the demagnetizing interaction computation time dominates, good performance scaling may be achieved, with speedup factors up to 1.8, 2.5, and 3.1, for 2, 3, and 4 GPUs respectively. The code developed here can be used for any number of GPUs in parallel, with performance scaling strongly dependent on inter-GPU data transfer rate and connection topology. Scaling improves further in micromagnetic simulations which include a spin transport solver, with speedup factors up to 1.96, 2.8, and 3.7, for 2, 3, and 4 GPUs respectively. The best-case scenario is obtained for atomistic spin dynamics simulations, where the demagnetizing interaction is implemented with spin-averaged cells. Using a single workstation with 4 GPUs, it is shown that atomistic spin dynamics simulations with up to 1 billion spins, and atomistic Monte Carlo simulations with up to 2 billion spins, are possible with near-ideal performance scaling.
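The halo-region idea mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation: plain Python lists stand in for per-GPU memory, the function names (`neighbour_sum`, `split_domain`, `neighbour_sum_with_halos`) are hypothetical, and a 1D nearest-neighbour sum stands in for an exchange-like interaction. It assumes every device receives at least one cell.

```python
# Illustrative only: plain Python lists stand in for per-GPU memory.

def neighbour_sum(values):
    """Reference: sum of left and right neighbours per cell (missing neighbour = 0)."""
    n = len(values)
    return [
        (values[i - 1] if i > 0 else 0.0) + (values[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]

def split_domain(values, num_devices):
    """Split a 1D array into contiguous, nearly equal chunks, one per device."""
    base, extra = divmod(len(values), num_devices)
    chunks, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks

def neighbour_sum_with_halos(chunks):
    """Same stencil per chunk, after copying one halo cell from each adjacent chunk."""
    results = []
    for d, chunk in enumerate(chunks):
        # "Transfer" step: one boundary value from each neighbouring device.
        left_halo = chunks[d - 1][-1] if d > 0 else 0.0
        right_halo = chunks[d + 1][0] if d < len(chunks) - 1 else 0.0
        padded = [left_halo] + chunk + [right_halo]
        # Local compute step: only the cells owned by this device are updated.
        results.append(
            [padded[i - 1] + padded[i + 1] for i in range(1, len(padded) - 1)]
        )
    return results

spins = [float(i * i % 7) for i in range(10)]   # arbitrary test data
reference = neighbour_sum(spins)                # single-domain result
chunks = split_domain(spins, 3)                 # three hypothetical devices
stitched = [x for part in neighbour_sum_with_halos(chunks) for x in part]
assert stitched == reference
```

Each device only needs one boundary value from each neighbour, so the transferred data scales with the interface size rather than the domain size. This is why short-range interactions partition well, whereas the long-range demagnetizing interaction needs the dedicated convolution algorithm described in the abstract.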

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses how to use multiple GPUs to accelerate micromagnetic and atomistic spin dynamics simulations. Specifically, the author develops a multi-GPU convolution algorithm to compute the long-range demagnetizing interaction, so that simulations scale well across GPUs.

#### Main Issues:

1. **Efficient Implementation of the Long-Range Demagnetizing Interaction**: The demagnetizing interaction is the main obstacle to good multi-GPU performance scaling, especially in large-scale simulations. The paper develops a multi-GPU convolution algorithm to address this.
2. **Increasing Simulation Size**: Using multiple GPUs allows larger simulations than fit on a single GPU.
3. **Data Transfer Bottleneck**: The data transfer rate between GPUs is a key factor limiting performance. The paper proposes a mixed-precision method to reduce the amount of data transferred.

#### Specific Methods:

- **Multi-GPU Convolution Algorithm**: Relies on single-GPU fast Fourier transforms (FFTs) executed in parallel to compute the long-range demagnetizing interaction.
- **Data Transfer Optimization**: Transfers data between GPUs at reduced precision (mixed precision) to lower the transfer volume.
- **Performance Testing under Different Connection Topologies**: Compares performance under point-to-point connections (such as NVSwitch) and bus connections.

#### Experimental Results:

- With 4 GPUs in a single workstation, atomistic spin dynamics simulations with up to 1 billion spins, and atomistic Monte Carlo simulations with up to 2 billion spins, achieve near-ideal performance scaling.
- In micromagnetic simulations dominated by the demagnetizing interaction, speedup factors reach up to 1.8, 2.5, and 3.1 (for 2, 3, and 4 GPUs, respectively).
- In micromagnetic simulations that include a spin transport solver, speedup factors reach up to 1.96, 2.8, and 3.7 (for 2, 3, and 4 GPUs, respectively).

### Summary

The paper develops a multi-GPU convolution algorithm that addresses the computational bottleneck of the long-range demagnetizing interaction. Combined with optimized inter-GPU data transfer, it improves the performance of large-scale micromagnetic and atomistic spin dynamics simulations.
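The speedup factors quoted above are easier to compare once converted to parallel efficiency, defined here as speedup divided by GPU count (1.0 would be ideal scaling). The short sketch below does that arithmetic on the reported figures only. The function name `parallel_efficiency` and the dictionary labels are chosen for illustration and do not come from the paper.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved: speedup / number of GPUs."""
    return speedup / num_gpus

# Maximum speedup factors as quoted in the abstract (GPU count -> speedup).
reported = {
    "demag-dominated micromagnetics": {2: 1.8, 3: 2.5, 4: 3.1},
    "micromagnetics with spin transport solver": {2: 1.96, 3: 2.8, 4: 3.7},
}

efficiency = {
    case: {n: parallel_efficiency(s, n) for n, s in by_gpu.items()}
    for case, by_gpu in reported.items()
}

for case, by_gpu in efficiency.items():
    for n, e in by_gpu.items():
        print(f"{case}: {n} GPUs -> {e:.1%} of ideal")
```

On these figures, efficiency for the demagnetizing-dominated case falls from about 90% on 2 GPUs to about 78% on 4 GPUs, while the spin transport case stays above 92% throughout. This matches the abstract's point that scaling is limited mainly by the inter-GPU transfers needed for the demagnetizing interaction.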

Accelerating micromagnetic and atomistic simulations using multiple GPUs
