Study of Molecular Dynamics Simulation: Multi-core vs. Many-core
Liu Peng, Guangming Tan, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta
2011-01-01
Abstract: Molecular dynamics (MD) simulation has broad applications, and increasing computing power is needed to reach the large spatiotemporal scales of real-world simulations. The advent of the multi-core and many-core paradigm brings unprecedented computing power; however, harvesting that power remains a great challenge because of MD's irregular memory-access pattern. To address the challenge, this paper presents a joint application/architecture study to enhance the scalability of MD on multi-core and many-core architectures. First, a hierarchical scheme is designed to explore multi-level parallelism in mapping the application to hardware. Then, to further harvest many-core computing power, three incremental optimization strategies are proposed: a novel data layout to improve data locality, an on-chip locality-aware parallel algorithm to enhance data reuse, and a pipelining algorithm to hide latency to shared memory. Experiments show that the hierarchical framework achieves an inter-node weak-scaling parallel efficiency of 0.985 on 106,496 BlueGene/L nodes (0.975 on 32,768 BlueGene/P nodes) and a strong-scaling parallel efficiency of 0.99 in a 64-core Godson-T simulation, which is further confirmed by an FPGA emulator. Detailed analysis shows that optimizations that use architectural features to maximize data locality and enhance data reuse benefit scalability most, and that certain architectural features are essential for these optimizations, which could guide future hardware development. Furthermore, a simple performance model suggests that the optimization scheme is likely to scale well toward exascale.

I. HIERARCHICAL PARALLELIZATION FRAMEWORK FOR MOLECULAR DYNAMICS

MD simulation follows the phase-space trajectories of an N-atom system, where the force fields describing the atomic force laws between atoms are spatial derivatives of a potential energy function E(r) (r = {r1, r2, ..., rN} are the positions of all atoms).
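The statement that forces are spatial derivatives of E(r) can be illustrated with a minimal sketch. It assumes a Lennard-Jones pair potential in reduced units purely as a stand-in; the paper's actual silica potential (with two-body and three-body terms) is not reproduced here, and the function names `lj_energy`, `lj_force_magnitude`, `total_energy`, and `forces` are hypothetical.

```python
import math


def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Hypothetical two-body potential (Lennard-Jones); a stand-in only."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force_magnitude(r, epsilon=1.0, sigma=1.0):
    """Analytic -dE/dr of the stand-in potential above."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def total_energy(positions):
    """E(r): sum of pair energies over all distinct atom pairs."""
    energy = 0.0
    n = len(positions)
    for i in range(n - 1):
        for j in range(i + 1, n):
            energy += lj_energy(math.dist(positions[i], positions[j]))
    return energy


def forces(positions):
    """F_i = -dE/dr_i, accumulated pair by pair (Newton's third law)."""
    n = len(positions)
    f = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            rij = [a - b for a, b in zip(positions[i], positions[j])]
            r = math.sqrt(sum(c * c for c in rij))
            fmag = lj_force_magnitude(r)
            for k in range(3):
                fk = fmag * rij[k] / r  # component along r_i - r_j
                f[i][k] += fk
                f[j][k] -= fk
    return f


if __name__ == "__main__":
    atoms = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [0.3, 1.2, 0.2]]
    for atom_force in forces(atoms):
        print(atom_force)
```

This sketch loops over all pairs, so it costs O(N^2); it is meant only to show the energy-to-force relation, not an efficient evaluation scheme.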
We have previously proposed a space-time multiresolution MD (MRMD) algorithm to reduce the O(N^2) time complexity of potential evaluation to O(N) [1]. In MRMD, E(r) consists of two-body E2{rij} and three-body E3{rijk} terms within a cutoff radius rc.

Although the emergence of the multi-core/many-core paradigm has provided unprecedented computing power, it remains a challenge to develop efficient parallel applications on multi-core/many-core clusters. To address the challenge, we first propose a hierarchical framework for scalable parallelization that takes advantage of the multilevel structure of multi-core/many-core clusters: (1) inter-node (message-passing) parallelization via an embedded divide-and-conquer (EDC) scheme based on spatial decomposition; (2) inter-core parallelization via cellular decomposition using critical-section-free multithreading with a master-worker paradigm.

Fig. 1. Scalability analysis: (a) weak scalability on BG/L (parallel efficiency vs. number of nodes, 2,048 to 106,496 nodes, N/P = 4,088,832); (b) strong scalability on multi-core SMP (parallel efficiency vs. number of cores, 1 to 8 cores, Intel quad-core SMP).

Figure 1 shows the scalability results, where inter-node weak-scalability tests are conducted on the Blue Gene/L cluster (each node with two IBM PowerPC 440 processors at a 700 MHz clock) at Lawrence Livermore National Laboratory and inter-core strong-scalability tests are conducted on an 8-core Intel Nehalem SMP.
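For readers less familiar with the two scaling metrics reported here, the sketch below uses the standard textbook definitions of weak-scaling and strong-scaling parallel efficiency. The function names and all timing values are hypothetical and are not measurements from this study.

```python
def strong_scaling_efficiency(t_ref, p_ref, t_p, p):
    """Fixed total problem size.

    Speedup relative to the reference run, divided by the ideal
    speedup p / p_ref.
    """
    speedup = t_ref / t_p
    return speedup / (p / p_ref)


def weak_scaling_efficiency(t_ref, t_p):
    """Fixed work per processor (constant N/P).

    Ideally the run time stays constant as processors are added, so the
    efficiency is the reference time divided by the measured time.
    """
    return t_ref / t_p


if __name__ == "__main__":
    # Made-up timings in seconds, chosen only to exercise the formulas.
    print(strong_scaling_efficiency(t_ref=100.0, p_ref=1, t_p=20.0, p=8))
    print(weak_scaling_efficiency(t_ref=10.0, t_p=12.5))
```

Under these definitions, a weak-scaling test grows the total atom count with the number of nodes, whereas a strong-scaling test keeps the atom count fixed while adding cores, which is why the two numbers are not directly comparable.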
Although our hierarchical framework has achieved an inter-node weak-scaling parallel efficiency well over 0.95 for 218-billion-atom silica MD simulations on 106,496 Blue Gene/L nodes, based on the speedup over 2,048 nodes, its strong-scaling inter-core parallel efficiency is only 0.65 for 8 threads on a dual Intel Nehalem symmetric multiprocessor (SMP) platform. The scalability analysis indicates that the hierarchical scheme on multi-core can achieve excellent weak scalability but only fair strong scalability. The following sections explore how to enhance scalability on the many-core platform Godson-T.

II. Godson-T MANY-CORE ARCHITECTURE

Godson-T is a low-power many-core architecture developed by the Institute of Computing Technology, Chinese Academy of Sciences, to serve as a dedicated petaflops computing engine. As shown in Fig. 2, Godson-T has 64 homogeneous, dual-issue, in-order processing cores running at 1 GHz, where a floating-point multiply-accumulate operation can be issued to a fully pipelined function unit in each cycle, resulting in a peak floating-point performance of 128 Gflops (64 cores × 2 flops per cycle × 1 GHz). The 8-pipeline processing core supports 32-bit MIPS ISA (64-bit ISA will be