Abstract:This thesis deals with two research problems. The first research problem is motivated by the numerical computation involved in the Time Domain Simulation (TDS) of Power Grids. Due to the ever growing size and complexity of Power Grids such as the China National Grid, accelerating TDS has become a stringent need for the online analysis of these large systems. Hence the first part of the research includes the acceleration of the TDS by means of Iterative Solvers and Preconditioners which exploit the sparsity structure of the Power Grid. The second research problem is motivated by the recent trend of using Graphics Processing Units (GPUs) in High Performance Computing (HPC). By using TDS as a sample application, the second part of research involves the design and implementation of Krylov subspace solvers and general-purpose preconditioner which can fully exploit the performance potential of GPUs. TDS is of crucial importance to the analysis and control of Power Grids. The mathematical model of the Power Grid can be represented by a series of nonlinear Differential Algebraic Equations. With numerical time integration by an implicit scheme and linearization by Newton's Method, the major computation in TDS is the solution of a series of Jacobian matrices based linear systems. With the large size of the Power Grid and the need for fast simulation for online analysis, it is desirable to use iterative solvers and preconditioners for the solution of these linear systems. This thesis tackles the numerical problem in TDS from two aspects: design of high-performance preconditioner to the specific characterization of the TDS problem, and development of multi-step techniques for the iterative solution of a series of linear systems based on the Jacobian matrices. We start with the analysis of the sparsity pattern of the Jacobian matrix and its relationship to the Power Network and admittance matrix. The parts in the Jacobian matrix which corresponds to the dynamic parts (i.e., the differential equations) of the Power Grid and has a block-diagonal form. The Schur complement of the dynamic part has virtually no fill-in in the algebraic part which corresponds to the connectivity of the buses in the Power Network. We then formulate a multilevel preconditioner for the Jacobian matrix based on the static sparsity pattern in the matrix and the analysis of the network topology. We show that multilevel structure based on independent sets (\INDSET) can serve as an efficient preconditioners for TDS in terms of both memory efficiency and convergence property. To accommodate matrix and right hand side changes during the simulation, we further transform the Jacobian matrix in an additive form of $Y+\Delta_Y$, where $Y$ is a static matrix and $\Delta_Y$ is a block-diagonal, low rank matrix which changes from linear system to linear system. Based on this transformation, effective preconditioner re-use, also called preconditioner updating, is derived, through dynamically adopting the changes of $\Delta_Y$ into the preconditioner initially constructed for $Y$ itself. This results in much more efficient iterative methods for TDS of Power Grid. Furthermore, we explore the potential of matrix spectral deflation as another multi-step technique for TDS. To accommodate dynamic changes in the linear operator, \GCRODR is used. With the additive form of $Y+\Delta_Y$, we achieve a more computationally feasible form for the updates of the deflation matrices in \GCRODR. Spectral analysis shows that eigenvalues of both large and very small magnitude appear for the preconditioned TDS matrices, hence we propose the use of a heuristics to dynamically choose among the largest and smallest eigenvalues (or Rits values) used for deflation. The experiments show that the dynamic eigenvalue choice could greatly benefit convergence due to the dynamic changes in the matrix spectra of TDS. We also show that preconditioner updates and deflation can be used together which leads to a combined effect on the reduction of the total iteration count in TDS. GPU based accelerated HPC systems are becoming popular due to the high performance potential and good efficiency of GPUs. Iterative Algorithm and Preconditioners are the two fundamental components of Krylov subspace solvers. However, porting them to GPU platform remain a challenge especially for preconditioners. In this thesis we target at porting of Krylov subspace solvers on GPU and the design of GPU-efficient preconditioners. Firstly we discuss the two major parts of computation involved in Krylov solvers: (1) the generation of Krylov subspace basis through Sparse Matrix-Vector multiplications (\SpMV), and (2) orthogonalization by modified Gram-Schmidt method. We show that both parts can be efficiently implemented on GPU with high performance. On GPU platform, incomplete factorization based preconditioners, such as Incomplete LU or Incomplete Cholesky, have not been successfully implemented due to the ``sequentialness and limited parallelism in their preconditioning process. While inverse form based preconditioner such as $A$-biconjugate allows higher parallelism and better performance on GPUs, it introduces too much fill-ins. We aim to design a preconditioner that can achieve high performance while maintaining good memory efficiency and convergence property of the incomplete factorizations. We use recursive multilevel structure retrieved from the elimination tree and $A$-biconjugate algorithm to achieve this. Multilevel structure is constructed based on \INDSET{s} by symbolic analysis of the elimination tree. For the preconditioning of the last-level reduced system, we use $A$-biconjugation. The proposed preconditioner, denoted as \MLAINV, features preconditioning operation which involve a series of matrix-vector products. Through experiments with TDS problems and test matrices from various other applications, we show that \MLAINV achieves the design goal in both aspects: (1) its good convergence property is similar to incomplete factorizations, and (2) it obtains high performance by \SpMV based preconditioning on GPUs.

Solving Dynamic Programming Problem by Pipeline Implementation on GPU

Matrix Decomposition Graphics Processing Unit Solver for Poisson Image Editing

A Parallel Dynamic Programming Algorithm on a Multi-Core Architecture

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs.

Improving Performance of Dynamic Programming Via Parallelism and Locality on Multicore Architectures

Improving Performance of Dynamic Programming via Parallelism and Locality on

Accelerating Iterative Linear Solvers Using Multiple Graphical Processing Units

A Pipeline-Based ODE Solving Framework

Efficient Modeling and Real-Time Rendering of Massive Urban Pipelines Based on GPU

High Performance Matrix Multiplication on General Purpose Graphics Processing Units

Simultaneous Solving of Batched Linear Programs on a GPU

Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

Improving Dense Linear Equation Solver on Hybrid CPU-GPU System.

Batched matrix operations on distributed GPUs with application in theoretical physics

Two-dimensional batch linear programming on the GPU

Solving Batched Linear Programs on GPU and Multicore CPU

Power Grid Simulation with GPU-Accelerated Iterative Solvers and Preconditioners

Exploiting GPU/SIMD Architectures for Solving Linear-Quadratic MPC Problems

Optimizing stencil application on multi-thread GPU architecture using stream programming model

Software Pipelining for Graphic Processing Unit Acceleration: Partition, Scheduling and Granularity