Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

Sandra Catalán, Francisco D. Igual, Rafael Mayo, Rafael Rodríguez-Sánchez, Enrique S. Quintana-Ortí
DOI: https://doi.org/10.48550/arXiv.1506.08988
2015-06-30
Abstract: Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain significant gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
Subjects: Performance; Distributed, Parallel, and Cluster Computing; Mathematical Software; Numerical Analysis