Abstract: A number of recently released numerical libraries, including the Automatically Tuned Linear Algebra Subroutines (ATLAS) library, the Intel Math Kernel Library (MKL), the GOTO numerical library, and the AMD Core Math Library (ACML) for AMD Opteron processors, are linked into the executables of the Gaussian 98 electronic structure calculation package, which is compiled with updated versions of Fortran compilers such as the Intel Fortran compiler (ifc/efc) 7.1 and the PGI Fortran compiler (pgf77/pgf90) 5.0. The ifc 7.1 compiler delivers an improvement of about 3% on 32-bit machines compared to the former version 6.0. The performance improvement from pgf77 3.3 to 5.0 is also around 3% when the original, unmodified compiler optimization options enclosed in the software are used. Nevertheless, if extensive compiler tuning options are used, the speedup can reach about 25%. The performance of the fully optimized numerical libraries is similar. The double-precision floating-point (FP) instruction set (SSE2) is also functional on AMD Opteron processors operated in 32-bit compilation, and the Intel Fortran compiler has performed better optimization. Hardware-level tuning can improve memory bandwidth by adjusting the DRAM timing, and efficiency in the CL2 mode is 2.6% higher than in the CL2.5 mode. The FP throughput is measured by simultaneous execution of two identical copies of each of the test jobs. The resulting performance impact suggests that the IA64 and AMD64 architectures can deliver significantly higher throughput than IA32, which is consistent with the SpecFPrate2000 benchmarks.
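The throughput measurement described in the abstract (two identical copies of a job run at the same time) can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the helper names `run_copies` and `throughput_ratio`, the stand-in job, and the definition of the ratio (2.0 for perfect scaling, 1.0 for no gain over serial execution) are all assumptions made for this example.

```python
import subprocess
import sys
import time


def run_copies(cmd, copies):
    """Start `copies` identical processes at once; return wall time until all finish."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start


def throughput_ratio(t_single, t_pair):
    """Jobs completed per unit time with two concurrent copies, relative to one copy.

    Assumed metric: 2.0 means perfect scaling, 1.0 means no gain over serial runs.
    """
    return 2.0 * t_single / t_pair


if __name__ == "__main__":
    # Hypothetical stand-in job; a real test would launch the actual benchmark input.
    job = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    t1 = run_copies(job, 1)
    t2 = run_copies(job, 2)
    print(f"throughput ratio: {throughput_ratio(t1, t2):.2f}")
```

On a machine where both copies compete for the same memory bandwidth or FP units, the ratio would be expected to fall below 2.0, which is the kind of effect a throughput benchmark like this is meant to expose.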

Providing performance portable numerics for Intel GPUs

Porting a sparse linear algebra math library to Intel GPUs

Heterogeneous Programming and Optimization of Gyrokinetic Toroidal Code and Large-Scale Performance Test on TH-1A.

Ginkgo -- A Math Library designed for Platform Portability

Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing

Porting Batched Iterative Solvers onto Intel GPUs with SYCL

Preparing Ginkgo for AMD GPUs -- A Testimonial on Porting CUDA Code to HIP

Ginkgo - A math library designed to accelerate Exascale Computing Project science applications

Taking GPU Programming Models to Task for Performance Portability

Numerical Performance and Throughput Benchmark for Electronic Structure Calculations in Pc-Linux Systems with New Architectures, Updated Compilers, and Libraries

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

dMath: A Scalable Linear Algebra and Math Library for Heterogeneous GP-GPU Architectures

Performance Portability of Sparse Block Diagonal Matrix Multiple Vector Multiplications on GPUs

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems

Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU

MILC Code Performance on High End CPU and GPU Supercomputer Clusters

Performance Portable Monte Carlo Particle Transport on Intel, NVIDIA, and AMD GPUs

A Lightweight Approach to Performance Portability with targetDP

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs.

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability.