Abstract: The generalized Dryja--Smith--Widlund (GDSW) preconditioner is a two-level overlapping Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlapping Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial differential equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package that implements GDSW-type preconditioners for both CPU and GPU clusters. To improve the solver performance on GPUs, we use a novel decomposition to run multiple MPI processes on each GPU, reducing both the computational and storage costs of the solver and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy. The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options, including the exact or inexact solution of the local overlapping subdomain problems on a GPU. We also discuss the effect of using the iterative variant of the incomplete LU factorization and sparse-triangular solve as the approximate local solver, and of using lower precision for computing the whole FROSch preconditioner. Overall, the solve time was reduced by a factor of about $2\times$ using GPUs, while the GPU acceleration of the numerical setup time depends on the solver options and the local matrix sizes.

An Experimental Study of Two-Level Schwarz Domain Decomposition Preconditioners on GPUs
