Abstract: Gravitational $N$-body simulations calculate numerous interactions between particles. The tree algorithm reduces these calculations by constructing a hierarchical oct-tree structure and approximating gravitational forces on particles. Over the last three decades, the tree algorithm has been extensively used in large-scale simulations, and its parallelization in distributed memory environments has been well studied. However, recent supercomputers are equipped with many CPU cores per node, and optimizing the tree construction in shared memory environments is becoming crucial. We propose a novel tree construction method that departs from the conventional top-down approach. It first creates all leaf cells without traversing the tree and then constructs the remaining cells by a bottom-up approach. We evaluated the performance of our method on the supercomputer Fugaku and an Intel machine. On a single thread, our method accelerates one of the most time-consuming processes of the conventional tree construction method by a factor of more than 3.0 on Fugaku and 2.2 on the Intel machine. Furthermore, as the number of threads increases, our parallel tree construction time decreases considerably. Compared to the conventional sequential tree construction method, we achieve a speedup of over 45 on 48 threads of Fugaku and more than 56 on 112 threads of the Intel machine. In stark contrast to the conventional method, tree construction with our method no longer constitutes a bottleneck in the tree algorithm, even when using many threads.
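The two-phase idea in the abstract (create all leaf cells without traversing the tree, then build the remaining cells bottom-up) can be illustrated with a minimal sketch. The abstract does not say how leaf cells are identified, so this sketch assumes a common technique: each particle is assigned a Morton (Z-order) key at a fixed depth, particles sharing a key form a leaf, and a parent's key is the child's key shifted right by three bits. The function names `morton_key` and `build_tree_bottom_up`, the fixed `depth` parameter, and the unit-cube domain are all hypothetical choices for illustration, not the authors' implementation.

```python
from collections import defaultdict


def morton_key(x, y, z, depth):
    """Interleave `depth` bits per coordinate (coordinates assumed in [0, 1))."""
    n = 1 << depth
    ix, iy, iz = (min(int(c * n), n - 1) for c in (x, y, z))
    key = 0
    for b in range(depth):
        key |= ((ix >> b) & 1) << (3 * b + 2)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b)
    return key


def build_tree_bottom_up(particles, depth):
    """Build cells level by level, starting from the leaves.

    particles: list of (x, y, z, mass) tuples.
    Returns levels, where levels[d] maps a cell key at depth d to
    [mass, sum(m*x), sum(m*y), sum(m*z)]; dividing the sums by the
    mass gives the cell's center of mass.
    """
    levels = [dict() for _ in range(depth + 1)]

    # Phase 1: create every leaf cell directly from the particle keys,
    # with no descent from the root.
    leaves = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for x, y, z, m in particles:
        cell = leaves[morton_key(x, y, z, depth)]
        cell[0] += m
        cell[1] += m * x
        cell[2] += m * y
        cell[3] += m * z
    levels[depth] = dict(leaves)

    # Phase 2: build each parent level from its children (bottom-up).
    # Dropping the lowest three key bits yields the parent's key.
    for d in range(depth, 0, -1):
        parents = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
        for key, child in levels[d].items():
            parent = parents[key >> 3]
            for i in range(4):
                parent[i] += child[i]
        levels[d - 1] = dict(parents)
    return levels
```

In this sketch the leaf pass touches each particle once and each parent level depends only on the level below it, which is the property that makes a bottom-up build attractive for shared-memory parallelization. The sketch itself is sequential and uses a fixed depth rather than an adaptive one.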

Implementing O(N) N-Body Algorithms Efficiently in Data-Parallel Languages.

Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions

Performance Analysis of Direct N-Body Algorithms on Special-Purpose Supercomputers

Parallelization of the Symplectic Massive Body Algorithm (SyMBA) $N$-body Code

Interfacing Interpreted and Compiled Languages to Support Applications on a Massively Parallel Network of Workstations (MP-NOW)

A hierarchical O(N log N) force-calculation algorithm

A Numerical Model Oriented Large-scale Parallel I/O Optimization Method.

Direct N-body application on low-power and energy-efficient parallel architectures

Efficiency of parallel computations of gravitational forces by TreeCode method in N-body models

N-Body Simulations on GPUs

High Performance Optimizations for Nuclear Physics Code MFDn on KNL

Molecular Dynamics Simulation On Commodity Shared-Memory Multiprocessor Systems With Lightweight Multithreading

Acceleration of Hybrid MPI Parallel NBODY6++ for Large N-Body Globular Cluster Simulations

Optimization of Computation-Intensive Applications in cc-NUMA Architecture

Hierarchical Parallelisation of Functional Renormalisation Group Calculations -- hp-fRG

A Work- and Data-Sharing Parallel Tree N-body Code

Modeling Data Movement Performance on Heterogeneous Architectures

The Chamomile Scheme: An Optimized Algorithm for N-body simulations on Programmable Graphics Processing Units

Optimizing the Gravitational Tree Algorithm for Many-Core Processors

Parallel N-body algorithm on some parallel computational models

A Hierarchical Grid Algorithm for Accelerating High-Performance Conjugate Gradient Benchmark on Sunway Many-Core Processor