Abstract: Given a sparse matrix A, the selected inversion algorithm is an efficient method for computing certain selected elements of A^{-1}. These selected elements correspond to all or some nonzero elements of the LU factors of A. In many ways, the types of matrix updates performed in the selected inversion algorithm are similar to those performed in the LU factorization, although the sequence of operations is different. In the context of LU factorization, it is known that the left-looking and right-looking algorithms exhibit different memory access and data communication patterns, and hence different behavior on shared memory and distributed memory parallel machines. Corresponding to right-looking and left-looking LU factorization, the selected inversion algorithm can be organized as a left-looking or a right-looking algorithm. The parallel right-looking version of the algorithm has been developed in [9]. The sequence of operations performed in this version of the selected inversion algorithm is similar to that performed in a left-looking LU factorization algorithm. In this paper, we describe the left-looking variant of the selected inversion algorithm, and present an efficient implementation of the algorithm for shared memory machines using a task parallel method. We demonstrate that with the task scheduling features provided by OpenMP 4.0, the left-looking selected inversion algorithm can scale well both on the Intel Haswell multicore architecture and on the Intel Knights Landing (KNL) manycore architecture, up to 16 and 64 cores, respectively. On the KNL architecture, we observe that the maximum parallel efficiency achieved by the left-looking selected inversion algorithm can be as high as 62% even when all 64 cores are used, despite the inherent asynchronous nature of the computation and communication patterns in sparse matrix operations.
Compared to the right-looking selected inversion algorithm, the left-looking formulation facilitates efficient pipelining of operations along different branches of the elimination tree, and can be a promising candidate for future development of massively parallel selected inversion algorithms on heterogeneous architectures.
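To make the notion of "selected elements" concrete, the following is a minimal sequential sketch, not the implementation described in this paper. It assumes a small dense-stored matrix that is structurally symmetric and can be factored as A = LDU without pivoting, and it uses a standard backward recurrence on the LDU factors to fill in only those entries of A^{-1} that lie on the nonzero pattern of L + U. The helper names `ldu_no_pivot` and `selected_inverse` are hypothetical, and the sketch does not reflect the left-looking or right-looking organization or any task parallelism.

```python
import numpy as np

def ldu_no_pivot(A):
    """Dense A = L * diag(d) * U with unit-triangular L and U (no pivoting).

    Illustrative helper; assumes every pivot is nonzero.
    """
    n = A.shape[0]
    L = np.eye(n)
    U = np.array(A, dtype=float)
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])
    U = np.triu(U)              # discard rounding residue below the diagonal
    d = np.diag(U).copy()
    return L, d, U / d[:, None]  # scale rows so that U has a unit diagonal

def selected_inverse(A):
    """Entries of inv(A) restricted to the (symmetrized) pattern of L + U."""
    L, d, U = ldu_no_pivot(A)
    n = d.size
    pattern = (L != 0) | (U != 0)
    pattern = pattern | pattern.T   # assumption: structurally symmetric A
    Z = np.zeros((n, n))
    # Sweep from the last row/column to the first; each step only reads
    # entries of Z that were computed in earlier steps.
    for i in range(n - 1, -1, -1):
        idx = i + 1 + np.flatnonzero(pattern[i + 1:, i])
        Zc = Z[np.ix_(idx, idx)]
        Z[i, idx] = -U[i, idx] @ Zc          # row i of Z, right of the diagonal
        Z[idx, i] = -Zc @ L[idx, i]          # column i of Z, below the diagonal
        Z[i, i] = 1.0 / d[i] - U[i, idx] @ Z[idx, i]
    return Z, pattern

if __name__ == "__main__":
    n = 6
    # Nonsymmetric values, symmetric (tridiagonal) structure, diagonally dominant.
    A = 4 * np.eye(n) - np.diag(np.ones(n - 1), 1) - 2 * np.diag(np.ones(n - 1), -1)
    Z, pattern = selected_inverse(A)
    print(np.allclose(Z[pattern], np.linalg.inv(A)[pattern]))
```

For the tridiagonal example, only the 3n - 2 entries on the three central diagonals of A^{-1} are computed, even though the full inverse is dense. A sparse implementation would store and traverse only those entries rather than masking a dense array as done here.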

A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems.

Enhancing Scalability and Load Balancing of Parallel Selected Inversion Via Tree-Based Asynchronous Communication.

PSelInv - A Distributed Memory Parallel Algorithm for Selected Inversion: the non-symmetric Case

PSelInv -- A Distributed Memory Parallel Algorithm for Selected Inversion: the Symmetric Case

Enhancing the scalability and load balancing of the parallel selected inversion algorithm via tree-based asynchronous communication.

A Fast Parallel Algorithm for Selected Inversion of Structured Sparse Matrices with Application to 2D Electronic Structure Calculations.

SelInv---An Algorithm for Selected Inversion of a Sparse Symmetric Matrix

Parallel Sparse Left-Looking Algorithm

Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures

A High Performance Parallel VLSI Design of Matrix Inversion.

Parallel Sorting by Approximate Splitting for Multi-core Processors

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization

Blockwise inversion and algorithms for inverting large partitioned matrices

Parallel computation of echelon forms

Adaptive Parallelizable Algorithms for Interpolative Decompositions via Partially Pivoted LU

A Critical Path Approach to Analyzing Parallelism of Algorithmic Variants. Application to Cholesky Inversion

Some new techniques to use in serial sparse Cholesky factorization algorithms

Parallelization and scalability analysis of inverse factorization using the Chunks and Tasks programming model

Performance Optimization for Sparse A(T)Ax in Parallel on Multicore Cpu

Parallel Sparse Matrix Multiplication for Preconditioning and SSTA on a Many-Core Architecture