A GPU-based Tabu Search for Very Large Hardware/software Partitioning with Limited Resource Usage
Neng Hou,Fazhi He,Yi Zhou,Haojun Ai
DOI: https://doi.org/10.1299/jamdsm.2017jamdsm0060
2017-01-01
Journal of Advanced Mechanical Design Systems and Manufacturing
Abstract:In hardware/software (HW/SW) co-design, HW/SW partitioning is the most important step since it determines which components are implemented in hardware and which are implemented in software. Since most of HW/SW partitioning problems are NP hard, heuristic methods have to be utilized to solve them, especially for the large size problems. GPU-based heuristic methods to accelerate HW/SW co-design are a promising way to reduce run time. However, the existing methods cannot deal with very large embedded applications because of GPU resource limitations. This paper presents a method to overcome the GPU resource limitations for very large partitioning while keeping a reasonable runtime. First, at the stage of computing the costs of the candidates, we propose a fast method of 2-flipping computing for very large HW/SW co-design. Our method is also general and can deal with both odd and even numbers of nodes. More importantly, our method avoids utilizing doubleprecision arithmetic units, which are scarce resources in GPU architecture. Second, since the GPU is constrained by memory limitations and the costs of candidates cannot be directly stored in the GPU’s global memory, we present a time-space tradeoff strategy to break memory limitations for very large HW/SW partitioning. In this way, the following steps can be run under the constraint of GPU’s memory limitations. Third, an in-place removal of infeasible solutions is proposed to reduce the overhead of global memory by half when the neighborhood is compacted. Fourth, when evaluating the tabu status of feasible candidates, we present a bitwise representation of tabu status to minimize the transfer overhead. Finally, we conduct a number of experiments. The results show that the proposed 2-flipping method of single precision data types works well. The results also demonstrate that the proposed approach expands the number of nodes of the task graph from 10,000 to 30,000 under the limitation of the GPU’s global memory of 6 GB. The correlations between compression intensity and solution quality are analyzed to ensure the fairness and soundness of our method. Our work is general and can provide guidance for other applications.