A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations.

Hasan Metin Aktulga, Md. Afibuzzaman, Samuel Williams, Aydin Buluc, Meiyue Shao, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary
DOI: https://doi.org/10.1109/tpds.2016.2630699
IF: 5.3
2017-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. We consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We present techniques to significantly improve the SpMM and the transpose operation SpMM$^T$ by using the compressed sparse blocks (CSB) format. We achieve 3-4$\times$ speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15$\times$ speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4$\times$ to 1.8$\times$ speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.
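To make the two sparse kernels named in the abstract concrete, the following is a minimal pure-Python sketch of SpMM ($Y = AX$) and SpMM$^T$ ($Y = A^T X$) over a matrix stored in the baseline CSR format. It is not the authors' CSB implementation; the function names `spmm_csr` and `spmm_t_csr` and the small test matrix are assumptions made for illustration only. The sketch shows that the transposed product can reuse the same CSR arrays, but its writes are scattered by column index rather than streamed row by row.

```python
def spmm_csr(n_rows, indptr, indices, data, X):
    """Compute Y = A @ X, with A in CSR form and X a dense list of rows."""
    m = len(X[0])
    Y = [[0.0] * m for _ in range(n_rows)]
    for i in range(n_rows):
        yi = Y[i]  # each output row is written by exactly one matrix row
        for k in range(indptr[i], indptr[i + 1]):
            a, xj = data[k], X[indices[k]]
            for c in range(m):
                yi[c] += a * xj[c]
    return Y


def spmm_t_csr(n_cols, indptr, indices, data, X):
    """Compute Y = A^T @ X from the CSR arrays of A, without forming A^T."""
    m = len(X[0])
    Y = [[0.0] * m for _ in range(n_cols)]
    for i in range(len(indptr) - 1):
        xi = X[i]
        for k in range(indptr[i], indptr[i + 1]):
            a, yj = data[k], Y[indices[k]]  # scattered write, indexed by column
            for c in range(m):
                yj[c] += a * xi[c]
    return Y


# Hypothetical 2x3 example matrix: A = [[1, 0, 2], [0, 3, 0]]
indptr = [0, 2, 3]
indices = [0, 2, 1]
data = [1.0, 2.0, 3.0]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3x2 block of vectors
Y = spmm_csr(2, indptr, indices, data, X)

Z = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 block of vectors
Yt = spmm_t_csr(3, indptr, indices, data, Z)
```

In a threaded setting, the scattered writes in `spmm_t_csr` would need synchronization or per-thread buffers, which is one common reason to look beyond CSR when both $A$ and $A^T$ products are required from the same stored matrix.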