Implementation and Optimization of Dense LU Decomposition on the Stream Processor

Ying Zhang, Tao Tang, Gen Li, Xuejun Yang
DOI: https://doi.org/10.1007/978-3-540-68111-3_9
2007-01-01
Abstract: Developing scientific computing applications on the stream processor has attracted considerable attention from researchers. In this paper, we implement and optimize dense LU decomposition on the stream processor. Unlike existing parallel algorithms for LU decomposition, the StreamLUD algorithm aims to exploit producer-consumer locality and to overlap off-chip memory access with kernel execution. Simulation results show that, for matrices of different sizes, our optimized StreamLUD achieves a speedup of 2.56 to 3.64 over the LUD of HPL on an Itanium 2 processor.
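For readers unfamiliar with the underlying operation, the following is a minimal sketch of textbook dense LU decomposition with partial pivoting in plain Python. It is not the StreamLUD algorithm, whose details the abstract does not give; the function name `lu_decompose` and the packed output layout are assumptions made for illustration only.

```python
def lu_decompose(a):
    """Textbook LU with partial pivoting (illustrative, not StreamLUD).

    Returns (perm, lu). lu packs the unit-lower-triangular L (below the
    diagonal) and U (on and above the diagonal). perm[i] is the index of
    the original row that ends up in row i of the factored matrix.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: choose the largest-magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in L's position
            for j in range(k + 1, n):
                # Update of the trailing submatrix.
                lu[i][j] -= lu[i][k] * lu[k][j]
    return perm, lu
```

The innermost loop, the update of the trailing submatrix, accounts for most of the arithmetic in dense LU decomposition, which is why memory access for that data matters so much for performance.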