Abstract: Graphics processing units (GPUs) have become a promising architecture for tackling many computational bottlenecks in quantum chemistry calculations. Herein, we present our new development on using GPUs to accelerate Hartree-Fock (HF) and density functional theory (DFT) calculations in the Beijing Density Functional (BDF) package. Our program uses the OpenCL platform and can therefore run on a variety of computing devices from different vendors, such as NVIDIA and AMD. All time-consuming steps in HF/DFT, such as the calculation of electron repulsion integrals (ERIs), the formation of the Fock matrix, and the exchange-correlation (XC) functional related work, have been implemented on the GPU. In our algorithm, the Coulomb and exchange matrices are calculated directly on the GPU by contracting the primitive ERIs with the density matrix. The 1T1PI (1 thread 1 primitive integral) algorithm, in which each thread evaluates one primitive ERI, is used to schedule the computational tasks on the GPU. To this end, the primitive Gaussian basis shell pairs μν are first prescreened and sorted according to their values. The Gaussian product theorem (GPT) is applied to each shell pair, and the intermediate values are calculated and copied into GPU memory for further use. Then, a one-dimensional mapping that assigns 32 work items (threads) to one workgroup is used to calculate the J matrix elements, and the permutation symmetry of the primitive ERIs is fully utilized. To calculate the K matrix, a two-dimensional mapping is used and every 64 work items are packed into one workgroup. The permutation symmetry of exchanging the bra pair μλ and the ket pair νσ is ignored to reduce the expensive communication between different workgroups on the GPU. After a batch of Coulomb or exchange matrix elements is computed on the GPU, the results are copied back to the CPU and accumulated into the Fock matrix.
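The contraction described above can be illustrated with a minimal CPU-side sketch in plain Python. It is not the BDF/OpenCL implementation: `toy_eri` is a hypothetical synthetic stand-in for a primitive ERI that only shares the usual permutation symmetry, the Schwarz-style bound used for prescreening and sorting is an assumed choice (the abstract says only that pairs are prescreened and sorted by value), and each loop iteration plays the role of one work item. The J update uses the bra-ket exchange symmetry, whereas the K update treats every (bra, ket) combination independently, mirroring the trade-off stated in the text.

```python
from math import sqrt


def toy_eri(m, n, l, s):
    """Hypothetical stand-in for (mn|ls); symmetric in m<->n, l<->s and bra<->ket."""
    def pair(p, q):
        return 1.0 / (1.0 + abs(p - q)) ** 2
    return pair(m, n) * pair(l, s)


def sorted_pairs(nbf, threshold):
    """Prescreen unique pairs (m >= n) by sqrt((mn|mn)) and sort them in descending order."""
    pairs = []
    for m in range(nbf):
        for n in range(m + 1):
            bound = sqrt(toy_eri(m, n, m, n))
            if bound >= threshold:
                pairs.append((bound, m, n))
    pairs.sort(reverse=True)
    return pairs


def build_jk(nbf, dens, threshold=0.0):
    """Contract integrals with a symmetric density matrix to form J and K."""
    pairs = sorted_pairs(nbf, threshold)
    J = [[0.0] * nbf for _ in range(nbf)]
    K = [[0.0] * nbf for _ in range(nbf)]
    for i, (qb, m, n) in enumerate(pairs):
        for j, (qk, l, s) in enumerate(pairs):
            if qb * qk < threshold:  # Schwarz-style estimate of |(mn|ls)|
                continue
            v = toy_eri(m, n, l, s)  # one "work item" = one integral
            if j <= i:
                # J: each unique quartet is visited once and updates both pairs.
                J[m][n] += v * dens[l][s] * (1.0 if l == s else 2.0)
                if j != i:
                    J[l][s] += v * dens[m][n] * (1.0 if m == n else 2.0)
            # K: bra-ket exchange symmetry is not used; only the index
            # orderings inside each pair are expanded.
            for a, b in {(m, n), (n, m)}:
                for c, d in {(l, s), (s, l)}:
                    K[a][c] += v * dens[b][d]
    # Fill the upper triangle of J from the computed lower triangle.
    for m in range(nbf):
        for n in range(m):
            J[n][m] = J[m][n]
    return J, K
```

With `threshold=0.0` nothing is screened out, so the result can be checked against a brute-force four-index loop; a positive threshold trades accuracy for fewer integrals.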
The XC terms are calculated through a numerical procedure because of the complex form of the XC functionals. We first pack the numerical grid points into batches of 128 points each. Then the Gaussian basis shells that are nonzero on each grid batch are sifted out. The grid batches and the basis function sieving indices are duplicated in CPU and GPU memory to avoid unnecessary communication between the CPU and the GPU. The computational tasks are scheduled dynamically according to the grid batches on the GPU. All steps, namely calculating the numerical grids and their weights, the electron density and density gradient, the XC functional and its derivative, and the XC energy and the matrix elements of the XC potential, are optimized step by step on the GPU. All calculations are carried out in 64-bit double precision to achieve the same numerical precision as on the CPU. Benchmark calculations are carried out on several different GPUs from NVIDIA and AMD to assess the performance of our code. The benchmark results indicate that the algorithm implemented on the GPU can achieve up to a 148-fold speedup over a serial CPU implementation, and that the total energy calculated on the GPU is as accurate as the result calculated on the CPU.
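The batching and sieving steps of the XC procedure can be sketched on the CPU as follows. This is a simplified illustration under stated assumptions rather than the BDF code: the basis consists of hypothetical unnormalized s-type Gaussians, the sieving cutoff `1e-12` is an arbitrary illustrative value, and the Slater (LDA) exchange functional, E_x = -(3/4)(3/π)^(1/3) ∫ ρ^(4/3) dr, is used as the simplest example of an XC functional; the abstract does not say which functionals were benchmarked. Only the batch size of 128 comes from the text.

```python
from math import exp, pi

BATCH_SIZE = 128  # grid points per batch, as stated in the abstract


def make_batches(points, weights, size=BATCH_SIZE):
    """Split grid points and their quadrature weights into fixed-size batches."""
    return [(points[i:i + size], weights[i:i + size])
            for i in range(0, len(points), size)]


def gaussian(center, alpha, r):
    """Unnormalized s-type Gaussian exp(-alpha * |r - center|^2) (illustrative basis)."""
    d2 = sum((a - b) ** 2 for a, b in zip(center, r))
    return exp(-alpha * d2)


def screen_basis(batch_points, centers, alphas, cutoff=1e-12):
    """Return indices of basis functions exceeding the cutoff somewhere in the batch."""
    return [mu for mu, (c, a) in enumerate(zip(centers, alphas))
            if any(gaussian(c, a, r) > cutoff for r in batch_points)]


def lda_exchange_energy(points, weights, centers, alphas, dens, cutoff=1e-12):
    """Numerical quadrature of the Slater exchange energy, batch by batch."""
    cx = 0.75 * (3.0 / pi) ** (1.0 / 3.0)
    energy = 0.0
    for pts, wts in make_batches(points, weights):
        keep = screen_basis(pts, centers, alphas, cutoff)  # per-batch sieve
        for r, w in zip(pts, wts):
            chi = {mu: gaussian(centers[mu], alphas[mu], r) for mu in keep}
            # Density from the retained basis functions only.
            rho = sum(dens[m][n] * chi[m] * chi[n] for m in keep for n in keep)
            energy -= cx * w * max(rho, 0.0) ** (4.0 / 3.0)
    return energy
```

Because each batch carries its own list of retained basis functions, batches are independent units of work, which is what makes it possible to schedule them dynamically.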
