Abstract: We propose efficient parallel algorithms and implementations on shared memory architectures of LU factorization over a finite field. Compared to the corresponding numerical routines, we have identified three main difficulties specific to linear algebra over finite fields. First, the arithmetic complexity could be dominated by modular reductions. Therefore, it is mandatory to delay these reductions as much as possible while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using the Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms. There, trade-offs must be made between block sizes well suited to those fast variants and block sizes well suited to load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, the block size then has to be dynamic. We propose and compare several block decompositions: tile iterative with left-looking, right-looking and Crout variants, slab recursive and tile recursive. Experiments demonstrate that the tile recursive variant performs better and matches the performance of reference numerical software when no rank deficiency occurs. Furthermore, even in the most heterogeneous case, namely when all pivot blocks are rank deficient, we show that it is possible to maintain a high efficiency.
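To illustrate the first difficulty, the following is a minimal sequential sketch of delayed modular reduction in a dot product over Z/pZ. It is not taken from the authors' implementation. The function names `max_delay` and `dot_mod_delayed` and the `mantissa_bits` parameter are hypothetical, and the sketch assumes residues are stored in [0, p-1] and that the accumulator represents integers exactly up to 2^mantissa_bits - 1 (53 bits models a double-precision float).

```python
def max_delay(p, mantissa_bits=53):
    """Largest number of products a*b, with 0 <= a, b <= p-1, that can be
    added to an already reduced accumulator (value <= p-1) without
    exceeding 2**mantissa_bits - 1. Assumes p >= 2."""
    bound = 2**mantissa_bits - 1
    return (bound - (p - 1)) // ((p - 1) ** 2)


def dot_mod_delayed(x, y, p, mantissa_bits=53):
    """Dot product of x and y modulo p with one reduction per chunk
    instead of one reduction per product.

    Returns (result, number_of_reductions)."""
    chunk = max_delay(p, mantissa_bits)
    if chunk < 1:
        raise ValueError("p is too large for the chosen accumulator width")
    n = min(len(x), len(y))
    acc = 0
    reductions = 0
    for start in range(0, n, chunk):
        # Accumulate up to `chunk` products with no modular reduction.
        for i in range(start, min(start + chunk, n)):
            acc += x[i] * y[i]
        # A single reduction brings the accumulator back below p.
        acc %= p
        reductions += 1
    return acc, reductions
```

With p = 65521 and a 53-bit accumulator, `max_delay` allows roughly two million products between consecutive reductions, so a dot product of that length needs a single reduction rather than one per multiplication.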

Parallel computation of echelon forms

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal LU Factorization

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

Adaptive Parallelizable Algorithms for Interpolative Decompositions via Partially Pivoted LU

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Communication Avoiding Block Low-Rank Parallel Multifrontal Triangular Solve with Many Right-Hand Sides

A Parallelization Technique Based on Factor Combination and Graph Partitioning for General Incomplete LU Factorization

Recursive sparse LU decomposition based on nested dissection and low rank approximations

Implementing LU and Cholesky factorizations on artificial intelligence accelerators

Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures

High Performance Block Incomplete LU Factorization

Parallel Tiled QR Factorization for Multicore Architectures

Parallelization and scalability analysis of inverse factorization using the Chunks and Tasks programming model

Parallel Sparse Left-Looking Algorithm

Finite Projective Geometry based Fast, Conflict-free Parallel Matrix Computations

Parallelization of incomplete factorization preconditioning of block tridiagonal linear systems with 1-D domain decomposition

Parallel Factorizations in Numerical Analysis

Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Exploiting nested task-parallelism in the $\mathcal{H}$-LU factorization

Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures