Non-recursive Parallel Computation for Matrix Multiplication on Multi-core Computers

LU Zhong-long, ZHONG Cheng, HUANG Hua-lin
2011-01-01
Abstract: To achieve zero loss in the shared L2 cache, a latency-hiding data prefetching model that allows computation and memory access to proceed in parallel is presented. The concept of a basic block of a matrix is defined, and each matrix is divided into sub-matrices of basic-block size, which simplifies the algorithm's data structures and reduces the required storage overhead. Matrix elements are stored contiguously in basic-block order, which optimizes the algorithm at the storage level and significantly reduces Translation Lookaside Buffer (TLB) misses. A non-recursive strategy for scheduling basic blocks is proposed; it fully utilizes the shared L2 cache of multi-core computers to reduce the number of main-memory accesses. The proposed matrix multiplication algorithm is not tied to a particular storage structure and is cache-oblivious. Experiments on a multi-core computer show that the non-recursive, thread-level parallel matrix multiplication algorithm is efficient and scalable.
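The storage and scheduling ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes square n x n matrices with n a multiple of a hypothetical block size `bs`, and it uses a simple flat list of output-block tasks as one possible non-recursive schedule. The prefetching model is not modeled. The helper names (`to_blocks`, `from_blocks`, `block_mult_add`, `matmul_blocked`) are illustrative only. Because of CPython's global interpreter lock, the threads here show the structure of thread-level block scheduling rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def to_blocks(M, n, bs):
    """Pack a row-major n x n matrix (list of lists) into block-major storage:
    each bs x bs basic block occupies bs*bs consecutive slots of a flat list."""
    nb = n // bs
    flat = [0.0] * (n * n)
    for bi in range(nb):
        for bj in range(nb):
            base = (bi * nb + bj) * bs * bs  # start of block (bi, bj)
            for r in range(bs):
                for c in range(bs):
                    flat[base + r * bs + c] = M[bi * bs + r][bj * bs + c]
    return flat


def from_blocks(flat, n, bs):
    """Inverse of to_blocks: unpack block-major storage into a list of lists."""
    nb = n // bs
    M = [[0.0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):
            base = (bi * nb + bj) * bs * bs
            for r in range(bs):
                for c in range(bs):
                    M[bi * bs + r][bj * bs + c] = flat[base + r * bs + c]
    return M


def block_mult_add(A, B, C, a_off, b_off, c_off, bs):
    """C_blk += A_blk * B_blk, where each block is a contiguous bs*bs run."""
    for r in range(bs):
        for k in range(bs):
            a = A[a_off + r * bs + k]
            b_row = b_off + k * bs
            c_row = c_off + r * bs
            for c in range(bs):
                C[c_row + c] += a * B[b_row + c]


def matmul_blocked(A, B, n, bs, workers=4):
    """Blocked C = A * B with a non-recursive, flat schedule of output blocks."""
    if n % bs:
        raise ValueError("this sketch assumes n is a multiple of bs")
    nb, size = n // bs, bs * bs
    Ab, Bb = to_blocks(A, n, bs), to_blocks(B, n, bs)
    Cb = [0.0] * (n * n)
    # Non-recursive schedule: one task per output block, no quadrant recursion.
    tasks = [(bi, bj) for bi in range(nb) for bj in range(nb)]

    def compute(task):
        bi, bj = task
        c_off = (bi * nb + bj) * size
        for bk in range(nb):
            block_mult_add(Ab, Bb, Cb, (bi * nb + bk) * size,
                           (bk * nb + bj) * size, c_off, bs)

    # Each task writes a disjoint C block, so no locking is needed.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        list(ex.map(compute, tasks))  # list() waits and re-raises errors
    return from_blocks(Cb, n, bs)
```

Keeping every basic block contiguous means one block touches only a few consecutive memory pages, which is the property the abstract links to fewer TLB misses. Assigning whole output blocks to threads avoids write conflicts without locks. A production version would be written in a compiled language with real thread parallelism.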