Abstract: Graphics processing units (GPUs) have become a promising architecture for tackling many computational bottlenecks in quantum chemistry calculations. Herein, we present our new development on using GPUs to accelerate Hartree-Fock (HF) and density functional theory (DFT) calculations in the Beijing Density Functional (BDF) package. Our program uses the OpenCL platform and can therefore run on a variety of computing devices from different vendors, such as NVIDIA and AMD. All time-consuming steps in HF/DFT, such as the calculation of electron repulsion integrals (ERIs), the formation of the Fock matrix, and the exchange-correlation (XC) functional related work, have been implemented on the GPU. In our algorithm, the Coulomb (J) and exchange (K) matrices are calculated directly on the GPU by contracting the primitive ERIs with the density matrix. The 1T1PI (1 thread 1 primitive integral) algorithm, in which each thread evaluates one primitive ERI, is used to schedule the computational tasks on the GPU. To this end, the primitive Gaussian basis shell pairs μν are first prescreened and sorted according to their values. The Gaussian product theorem (GPT) is applied to each shell pair, and the intermediate values are calculated and copied into GPU memory for further use. Then, a one-dimensional mapping is used to assign 32 work items (threads) to one workgroup to calculate the J matrix elements, and the permutation symmetry of the primitive ERIs is fully utilized. To calculate the K matrix, a two-dimensional mapping is used and every 64 work items are packed into one workgroup. The permutation symmetry of exchanging the bra pair μλ and the ket pair νσ is ignored to reduce the expensive communication between different workgroups on the GPU. After a batch of Coulomb or exchange matrix elements is computed on the GPU, the results are copied back to the CPU and accumulated into the Fock matrix.
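
The shell-pair preparation step (prescreening, sorting, and applying the Gaussian product theorem) can be illustrated with a minimal CPU-side sketch. This is not the BDF implementation: it handles s-type primitives only, and the screening key (the GPT prefactor K), the threshold value, and the names `gaussian_product` and `build_pair_list` are assumptions made for illustration, since the abstract does not state which values the pairs are sorted by.

```python
import math
from itertools import combinations_with_replacement


def gaussian_product(a, A, b, B):
    """Gaussian product theorem for two s-type primitives.

    exp(-a|r-A|^2) * exp(-b|r-B|^2) = K * exp(-p|r-P|^2)
    """
    p = a + b                                             # combined exponent
    P = tuple((a * Ai + b * Bi) / p for Ai, Bi in zip(A, B))  # weighted center
    dist2 = sum((Ai - Bi) ** 2 for Ai, Bi in zip(A, B))
    K = math.exp(-a * b / p * dist2)                      # overlap prefactor
    return p, P, K


def build_pair_list(primitives, threshold=1e-10):
    """Prescreen primitive pairs and sort them by descending prefactor.

    primitives: list of (exponent, center) tuples.
    The threshold and the use of K as the sort key are assumptions.
    """
    pairs = []
    for i, j in combinations_with_replacement(range(len(primitives)), 2):
        a, A = primitives[i]
        b, B = primitives[j]
        p, P, K = gaussian_product(a, A, b, B)
        if K >= threshold:                                # drop negligible pairs
            pairs.append({"i": i, "j": j, "p": p, "P": P, "K": K})
    pairs.sort(key=lambda pair: pair["K"], reverse=True)
    return pairs


# Toy input: three s-type primitives on the z axis (hypothetical values).
primitives = [
    (1.0, (0.0, 0.0, 0.0)),
    (0.5, (0.0, 0.0, 1.0)),
    (2.0, (0.0, 0.0, 8.0)),
]
pairs = build_pair_list(primitives)
for pair in pairs:
    print(pair["i"], pair["j"], f"{pair['K']:.3e}")
```

In a GPU code, a sorted list like `pairs` would be flattened into arrays of exponents, centers, and prefactors before being copied to device memory, so that neighboring work items handle pairs of similar magnitude.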
The XC terms are calculated numerically because of the complex form of the XC functionals. We first pack the numerical grid points into batches of 128 points each. Then the Gaussian basis shells that are nonzero on each grid batch are sifted out. The grid batches and the basis function sieving indices are duplicated in CPU and GPU memory to avoid unnecessary communication between the CPU and the GPU. The computational tasks are scheduled dynamically on the GPU according to the grid batches. All steps, such as calculating the numerical grids and their weights, the electron density and density gradient, the XC functional and its derivative, and the XC energy and the matrix elements of the XC potential, are optimized step by step on the GPU. All calculations are carried out in 64-bit double precision to achieve the same numerical precision as on the CPU. Benchmark calculations are carried out on several different GPUs from NVIDIA and AMD to assess the performance of our code. The benchmark results indicate that the algorithm implemented on the GPU can achieve up to a 148-fold speedup over a serial CPU implementation, and that the total energy calculated on the GPU is as accurate as that calculated on the CPU.
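
The grid batching and basis-shell sieving described above can be sketched as follows. This is only a sketch under stated assumptions: the batch size of 128 comes from the text, but the cutoff criterion (a shell is kept if its s-type envelope exp(-alpha r^2) exceeds `eps` at some point of the batch), the value of `eps`, and the names `make_batches`, `cutoff_radius`, and `significant_shells` are hypothetical. Angular prefactors of higher shells are ignored.

```python
import math

BATCH_SIZE = 128  # grid points per batch, as stated in the text


def make_batches(points, batch_size=BATCH_SIZE):
    """Split the list of grid points into consecutive batches."""
    return [points[k:k + batch_size] for k in range(0, len(points), batch_size)]


def cutoff_radius(alpha, eps=1e-12):
    """Radius beyond which exp(-alpha * r^2) < eps (assumed criterion)."""
    return math.sqrt(-math.log(eps) / alpha)


def significant_shells(batch, shells, eps=1e-12):
    """Return indices of shells that are non-negligible somewhere in the batch.

    shells: list of (center, smallest_exponent) tuples; the most diffuse
    primitive determines how far the shell extends.
    """
    kept = []
    for idx, (center, alpha) in enumerate(shells):
        r_cut2 = cutoff_radius(alpha, eps) ** 2
        if any(
            sum((x - c) ** 2 for x, c in zip(pt, center)) <= r_cut2
            for pt in batch
        ):
            kept.append(idx)
    return kept


# Toy data (hypothetical): 300 points along z, spaced by 0.1, and two shells.
points = [(0.0, 0.0, 0.1 * k) for k in range(300)]
shells = [((0.0, 0.0, 0.0), 1.0), ((0.0, 0.0, 29.0), 2.0)]

batches = make_batches(points)
shell_lists = [significant_shells(batch, shells) for batch in batches]
print([len(batch) for batch in batches])
print(shell_lists)
```

Each entry of `shell_lists` plays the role of the sieving index for one batch: only those shells need to be evaluated when the density and the XC matrix elements are accumulated on that batch.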
