Solving Batched Linear Programs on GPU and Multicore CPU

Amit Gurung, Rajarshi Ray
DOI: https://doi.org/10.48550/arXiv.1609.08114
2016-09-27
Abstract: Linear Programs (LPs) appear in a large number of applications, and offloading them to the GPU is a viable way to gain performance. Existing work on offloading and solving an LP on GPU suggests that performance is gained only for large LPs (typically 500 constraints, 500 variables and above). In order to gain performance from the GPU for applications involving small to medium sized LPs, we propose batched solving of a large number of LPs in parallel. In this paper, we present the design and CUDA implementation of our batched LP solver library, with coalesced memory access, reduced CPU-GPU memory transfer latency and load balancing as the goals. The performance of the batched LP solver is compared against sequential solving on the CPU using the open source solver GLPK (GNU Linear Programming Kit). The performance is evaluated for three types of LPs: in the first type the initial basic solution is feasible, in the second type the initial basic solution is infeasible, and in the third type the feasible region is a hyperbox. For the first type, we show a maximum speedup of $18.3\times$ when running a batch of $50k$ LPs of size $100$ ($100$ variables, $100$ constraints). For the second type, a maximum speedup of $12\times$ is obtained with a batch of $10k$ LPs of size $200$. For the third type, we show a significant speedup of $63\times$ in solving a batch of nearly $4$ million LPs of size $5$ and $34\times$ in solving $6$ million LPs of size $28$. In addition, we show that GLPK, the open source library for solving linear programs, can be easily extended to solve many LPs in parallel with multi-threading. The thread-parallel GLPK implementation runs $9.6\times$ faster in solving a batch of $1e5$ LPs of size $100$ on a $12$-core Intel Xeon processor. We demonstrate the application of our batched LP solver in the domain of state-space exploration of mathematical models of control systems design.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
This paper addresses how to solve a large number of linear programming (LP) problems efficiently on a GPU, particularly small to medium-sized LPs. Existing work shows that offloading a single LP to the GPU improves performance only for large LPs (typically 500 constraints and 500 variables or more). For smaller LPs, the cost of transferring data outweighs the time saved by parallel solving, so sending one LP at a time to the GPU is inefficient. The paper therefore proposes batching: many LPs are solved in parallel on the GPU, so that the transfer overhead is amortized over the whole batch.

The main contributions of the paper are:

1. **Design and implementation of a batched LP solver**: The authors designed and implemented a CUDA-based batched LP solver library that handles a large number of LPs simultaneously. The design goals are coalesced memory access, reduced memory transfer latency between the CPU and GPU, and load balancing.
2. **Performance evaluation**: The batched solver is compared against sequential solving on the CPU with the open-source solver GLPK (GNU Linear Programming Kit) on three types of LPs. For LPs with a feasible initial basic solution, the maximum speedup is 18.3 times on a batch of 50,000 LPs of size 100 (100 variables, 100 constraints). For LPs with an infeasible initial basic solution, the maximum speedup is 12 times on a batch of 10,000 LPs of size 200. For LPs whose feasible region is a hyperbox, the speedup reaches 63 times on a batch of nearly 4 million LPs of size 5, and 34 times on 6 million LPs of size 28.
3. **Multi-threaded GLPK**: The authors also show that GLPK can be extended to solve many LPs in parallel with multi-threading on a multicore CPU. On a 12-core Intel Xeon processor, the thread-parallel GLPK implementation runs 9.6 times faster when solving a batch of 100,000 LPs of size 100.
4. **Application example**: The paper demonstrates the batched LP solver in state-space exploration of mathematical models of control systems.

Taken together, these results show that small to medium-sized LPs can be solved efficiently on the GPU when they are batched, and the approach may serve as a reference for other applications that need to solve many independent LPs.
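The hyperbox case shows why some batches are especially cheap to solve. When the feasible region is a box with bounds `lower[i] <= x[i] <= upper[i]`, maximizing a linear objective needs no simplex iterations: each variable is set to its upper bound if its coefficient is positive and to its lower bound otherwise. The sketch below illustrates this on the CPU in plain Python. It is not the paper's CUDA implementation; the function names are hypothetical, and it assumes that every LP in the batch shares the same box and differs only in its objective vector.

```python
from typing import List, Sequence, Tuple


def solve_hyperbox_lp(
    c: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> Tuple[float, List[float]]:
    """Maximize c.x subject to lower[i] <= x[i] <= upper[i] (hypothetical helper)."""
    # Each coordinate can be chosen independently: take the upper bound when
    # the coefficient is positive, otherwise the lower bound.
    x = [u if ci > 0 else l for ci, l, u in zip(c, lower, upper)]
    value = sum(ci * xi for ci, xi in zip(c, x))
    return value, x


def solve_hyperbox_batch(
    objectives: Sequence[Sequence[float]],
    lower: Sequence[float],
    upper: Sequence[float],
) -> List[float]:
    """Solve one LP per objective vector over the same box (assumed shared)."""
    # The LPs are independent, which is what makes the batch easy to
    # parallelize; here they are simply solved one after another.
    return [solve_hyperbox_lp(c, lower, upper)[0] for c in objectives]


lower = [0.0, -1.0, 2.0]
upper = [1.0, 1.0, 5.0]
objectives = [
    [1.0, 1.0, 1.0],
    [-1.0, 2.0, 0.0],
    [0.0, -3.0, -1.0],
]
batch_values = solve_hyperbox_batch(objectives, lower, upper)
print(batch_values)  # [7.0, 2.0, 1.0]
```

Because each LP in the loop is independent of the others, the same computation maps naturally onto one GPU thread or thread block per LP, which is the kind of parallelism a batched solver can exploit.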