Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

Field G. Van Zee, Tyler M. Smith
DOI: https://doi.org/10.1145/3086466
IF: 2.464
2017-07-24
ACM Transactions on Mathematical Software
Abstract: In this article, we explore the implementation of complex matrix multiplication. We begin by briefly identifying various challenges associated with the conventional approach, which calls for a carefully written kernel that implements complex arithmetic at the lowest possible level (i.e., assembly language). We then set out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether. This constraint promotes code reuse and portability within libraries such as Basic Linear Algebra Subprograms and BLAS-Like Library Instantiation Software (BLIS) and allows kernel developers to focus their efforts on fewer and simpler kernels. We develop two alternative approaches (one based on the 3m method and one that reflects the classic 4m formulation), each with multiple variants, all of which rely only on real matrix multiplication kernels. We discuss the performance characteristics of these “induced” methods and observe that the assembly-level method actually resides along the 4m spectrum of algorithmic variants. Implementations are developed within the BLIS framework, and testing on modern hardware confirms that while the less numerically stable 3m method yields the fastest runtimes, the more stable (and thus widely applicable) 4m method’s performance is somewhat limited due to implementation challenges that appear inherent in nature.
Categories: computer science, software engineering; mathematics, applied
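
To make the two formulations named in the abstract concrete, the following is a minimal sketch of the textbook 4m and 3m identities for C = AB, with A = Ar + i·Ai and B = Br + i·Bi stored as separate real and imaginary parts. It is not the BLIS implementation: the function names (real_matmul, matmul_4m, matmul_3m) are hypothetical, and real_matmul is a naive pure-Python stand-in for an optimized real kernel.

```python
def real_matmul(a, b):
    """Naive real matrix product; a stand-in for an optimized real kernel."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Elementwise sum of two equally sized real matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Elementwise difference of two equally sized real matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matmul_4m(ar, ai, br, bi):
    """Classic formulation: four real products.

    Cr = Ar*Br - Ai*Bi
    Ci = Ar*Bi + Ai*Br
    """
    cr = sub(real_matmul(ar, br), real_matmul(ai, bi))
    ci = add(real_matmul(ar, bi), real_matmul(ai, br))
    return cr, ci


def matmul_3m(ar, ai, br, bi):
    """Three real products, at the cost of extra additions.

    P1 = Ar*Br, P2 = Ai*Bi, P3 = (Ar + Ai)*(Br + Bi)
    Cr = P1 - P2
    Ci = P3 - P1 - P2
    """
    p1 = real_matmul(ar, br)
    p2 = real_matmul(ai, bi)
    p3 = real_matmul(add(ar, ai), add(br, bi))
    cr = sub(p1, p2)
    ci = sub(sub(p3, p1), p2)
    return cr, ci
```

In exact arithmetic both functions return the same result. In floating point, the imaginary part in the 3m form is obtained by subtracting P1 and P2 from P3, and that cancellation is the usual explanation for the weaker numerical stability the abstract attributes to 3m.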