Improving Performance of Matrix Multiplication and FFT on GPU

Xiang Cui, Yifeng Chen, Hong Mei
DOI: https://doi.org/10.1109/icpads.2009.8
2009-01-01
Abstract: In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. For FFT, better performance results are obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are also discussed.
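The contrast between a computation-intensive kernel and a bandwidth-intensive one can be illustrated with a rough arithmetic-intensity estimate (floating-point operations per byte of memory traffic). The sketch below is not taken from the paper. It assumes the standard 2n^3 operation count for an n x n SGEMM, the conventional 5 n log2(n) count for a length-n complex radix-2 FFT, and idealized memory traffic in which each operand is moved exactly once. The function names are hypothetical.

```python
import math


def sgemm_intensity(n: int) -> float:
    """Flops per byte for an n x n single-precision C = A * B (idealized)."""
    flops = 2.0 * n ** 3          # one multiply and one add per inner-product term
    bytes_moved = 3 * n * n * 4   # assumption: read A and B, write C once; 4 bytes per float
    return flops / bytes_moved


def fft_intensity(n: int) -> float:
    """Flops per byte for a length-n complex single-precision FFT (idealized)."""
    flops = 5.0 * n * math.log2(n)  # conventional radix-2 operation count
    bytes_moved = 2 * n * 8         # assumption: read and write n complex values; 8 bytes each
    return flops / bytes_moved


if __name__ == "__main__":
    n = 1024
    print(f"SGEMM, n={n}: {sgemm_intensity(n):.2f} flops/byte")
    print(f"FFT,   n={n}: {fft_intensity(n):.3f} flops/byte")
```

Under these assumptions, n = 1024 gives roughly 171 flops per byte for SGEMM and about 3 flops per byte for the FFT. SGEMM intensity grows linearly with n, while FFT intensity grows only with log2(n), which is consistent with the abstract's description of the former as computation-intensive and the latter as bound by memory bandwidth.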