High Performance First Principles Method for Complex Magnetic Properties

B. Ujfalussy, Xindong Wang, Xiaoguang Zhang, D. M. C. Nicholson, W. A. Shelton, G. M. Stocks, A. Canning, Yang Wang, B. L. Gyorffy
DOI: https://doi.org/10.5555/509058.509129
1998-01-01
Abstract: The understanding of metallic magnetism is of fundamental importance for a wide range of technological applications, from thin-film disc-drive read heads to bulk magnets used in motors and power generation. In this submission for the Gordon Bell Prize, we use the power of massively parallel processing (MPP) computers to perform first-principles calculations of large system models of non-equilibrium magnetic states in metallic magnets. The calculations are based on a new constrained local moment (CLM) model that places the recently proposed spin dynamics of Antropov et al. [1] on firm theoretical foundations. The equations of the constrained local spin density approximation (constrained LSDA) are solved using the massively parallel locally self-consistent multiple scattering (LSMS) method [3], extended to treat general non-collinear arrangements of the magnetic moments [4]. A general algorithm has been developed for self-consistently finding the constraining fields that are introduced into the LSDA in order to maintain a prescribed configuration of magnetic moment orientations. The existence of CLM states is demonstrated for models of iron above its Curie temperature with 1024 atoms per unit cell. The constrained LSMS method we have developed exploits the locality in the physics of the problem to produce an algorithm that has only local and limited communications on parallel computers, leading to very good scale-up to large processor counts and linear scaling of the number of operations with the number of atoms in the system. The computationally intensive step, the inversion of a dense complex matrix, is largely reduced to matrix-matrix multiplies, which are implemented in BLAS. Throughout the code, attention is paid to minimizing both the total operation count and the total execution time, with primacy given to the latter. Full 64-bit arithmetic is used throughout.
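As an illustration of how a dense complex inversion can be recast mostly as matrix-matrix products, the sketch below computes only the leading block of an inverse by repeated Schur-complement elimination. This is one standard technique and is not claimed to be the authors' scheme; the function name `first_block_of_inverse`, the block size, and the test matrix are hypothetical. It assumes every trailing block is well conditioned, since no pivoting across blocks is done.

```python
import numpy as np


def first_block_of_inverse(m, block):
    """Leading (block x block) part of inv(m) via Schur-complement elimination.

    Hypothetical helper for illustration; assumes the trailing blocks
    encountered during elimination are nonsingular.
    """
    a = np.array(m, dtype=np.complex128)  # 64-bit real and imaginary parts
    n = a.shape[0]
    while n > block:
        lo = max(block, n - block)  # start of the trailing block to eliminate
        # Small solve on the trailing block: D^{-1} C
        dinv_c = np.linalg.solve(a[lo:n, lo:n], a[lo:n, :lo])
        # Dominant cost: one matrix-matrix multiply (a BLAS GEMM under NumPy)
        a[:lo, :lo] -= a[:lo, lo:n] @ dinv_c
        n = lo
    # The inverse of the final Schur complement is the leading block of inv(m)
    return np.linalg.inv(a[:block, :block])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    size = 12
    m = rng.standard_normal((size, size)) + 1j * rng.standard_normal((size, size))
    m += 4 * size * np.eye(size)  # diagonally dominant, so elimination is stable
    tau00 = first_block_of_inverse(m, 4)
    print(np.allclose(tau00, np.linalg.inv(m)[:4, :4]))
```

Only small blocks are ever factorized directly; the bulk of the arithmetic sits in the `@` product, which is the kind of operation a tuned BLAS executes near peak speed.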
The code shows near-linear scale-up to 1024 processing elements (PEs) and attains a performance of 657 Gflops on a Cray T3E1200 LC1024 at a US Government site. Performance figures of 276 Gflops and 329 Gflops have also been obtained on T3E900 and T3E1200 LC512 machines at the National Energy Research Scientific Computing Center (NERSC) and at Cray Research, respectively. All performance figures include necessary I/O.
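The scale-up claim can be checked roughly from the two T3E1200 figures quoted above. The sketch below is plain arithmetic on those numbers; the helper names are hypothetical, and it assumes the 512-PE and 1024-PE runs are comparable, which the abstract does not state (they were made at different sites).

```python
# Sustained rates quoted in the abstract: label -> (processing elements, Gflops)
RUNS = {
    "T3E1200 LC512": (512, 329.0),
    "T3E1200 LC1024": (1024, 657.0),
}


def gflops_per_pe(pes, gflops):
    """Average sustained rate per processing element."""
    return gflops / pes


def scaling_efficiency(small, large):
    """Achieved speed-up divided by ideal speed-up between two runs."""
    (p0, g0), (p1, g1) = small, large
    return (g1 / g0) / (p1 / p0)


if __name__ == "__main__":
    for label, (pes, gflops) in RUNS.items():
        print(f"{label}: {gflops_per_pe(pes, gflops):.3f} Gflops per PE")
    eff = scaling_efficiency(RUNS["T3E1200 LC512"], RUNS["T3E1200 LC1024"])
    print(f"efficiency going from 512 to 1024 PEs: {eff:.4f}")
```

Under that comparability assumption, doubling the PE count roughly doubles the sustained rate, with the per-PE rate staying near 0.64 Gflops in both runs.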