Stage-based Hyper-parameter Optimization for Deep Learning

Ahnjae Shin, Dong-Jin Shin, Sungwoo Cho, Do Yoon Kim, Eunji Jeong, Gyeong-In Yu, Byung-Gon Chun
DOI: https://doi.org/10.48550/arXiv.1911.10504
2019-11-24
Abstract: As deep learning techniques advance more than ever, hyper-parameter optimization is the new major workload in deep learning clusters. Although hyper-parameter optimization is crucial in training deep learning models for high model performance, effectively executing such a computation-heavy workload still remains a challenge. We observe that numerous trials issued from existing hyper-parameter optimization algorithms share common hyper-parameter sequence prefixes, which implies that there are redundant computations from training the same hyper-parameter sequence multiple times. We propose a stage-based execution strategy for efficient execution of hyper-parameter optimization algorithms. Our strategy removes redundancy in the training process by splitting the hyper-parameter sequences of trials into homogeneous stages, and generating a tree of stages by merging the common prefixes. Our preliminary experiment results show that applying stage-based execution to hyper-parameter optimization algorithms outperforms the original trial-based method, saving required GPU-hours and end-to-end training time by up to 6.60 times and 4.13 times, respectively.
Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is the computational redundancy and resource waste in **hyper-parameter optimization (HPO)** during deep-learning model training. Specifically, trials issued by existing hyper-parameter optimization algorithms often share the same hyper-parameter sequence prefix, and trial-based execution trains that shared prefix again in every trial, which wastes computational resources.

### Main contributions of the paper

1. **Stage-based execution strategy**:
   - The authors propose a new execution strategy that decomposes each trial into multiple homogeneous stages and reduces redundant computation by merging stages that share the same prefix.
   - The method uses resources more efficiently by constructing a stage-tree that represents the relationships between different trials.
2. **Improving computational and resource efficiency**:
   - The preliminary experimental results show that, compared with the original trial-based execution strategy, the stage-based execution strategy reduces the required GPU-hours and the end-to-end training time by up to 6.60 times and 4.13 times, respectively.
3. **Supporting multi-study optimization**:
   - The method can also be extended to multiple studies, further improving the efficiency of hyper-parameter optimization by sharing the history of previous studies.
4. **Handling continuous search spaces**:
   - For hyper-parameter sequences with discrete values, the stage-based execution strategy shows significant advantages. For hyper-parameter sequences with continuous values there is less overlap, but some efficiency improvement can still be obtained through appropriate adjustments.
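The prefix merging described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the helper names, and the three example trials are hypothetical, each trial is assumed to be a per-epoch learning-rate sequence, and the sketch assumes that training a shared prefix once and reusing its result is valid (deterministic training, no checkpointing overhead).

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class Node:
    """One training step in the merged prefix tree (hypothetical structure)."""
    children: dict = field(default_factory=dict)


def split_into_stages(sequence):
    """Run-length encode a per-step sequence into (config, steps) stages."""
    return [(config, len(list(run))) for config, run in groupby(sequence)]


def build_prefix_tree(trials):
    """Merge trials step by step so that shared prefixes are stored once."""
    root = Node()
    for sequence in trials:
        node = root
        for config in sequence:
            node = node.children.setdefault(config, Node())
    return root


def count_steps(node):
    """Count the distinct training steps below this node."""
    return sum(1 + count_steps(child) for child in node.children.values())


# Three hypothetical trials: one learning rate per epoch, 6 epochs each.
trials = [
    [0.1, 0.1, 0.1, 0.1, 0.01, 0.01],
    [0.1, 0.1, 0.1, 0.1, 0.001, 0.001],
    [0.1, 0.1, 0.01, 0.01, 0.01, 0.01],
]

# Trial-based execution trains every trial from scratch.
trial_based_steps = sum(len(t) for t in trials)
# Stage-based execution trains each shared prefix only once.
stage_based_steps = count_steps(build_prefix_tree(trials))

print(split_into_stages(trials[0]))  # [(0.1, 4), (0.01, 2)]
print(trial_based_steps, stage_based_steps)  # 18 12
```

In this toy example the merged tree needs 12 epochs of training instead of 18. The savings reported in the paper depend on the actual search algorithm and search space, so these numbers only illustrate the mechanism.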
### Formula representation

Some key concepts involved in the paper can be represented by formulas as follows:

- **Hyper-parameter configuration**: Each trial can be represented as a hyper-parameter sequence \(\mathbf{h} = [h_1, h_2, \ldots, h_T]\), where \(T\) is the length of the sequence.
- **Stage-tree**: Each node in the stage-tree represents a stage and can be represented by a triple \((\mathbf{h}_i, t_i, r_i)\), where \(\mathbf{h}_i\) is the hyper-parameter configuration of this stage, \(t_i\) is the number of iterations, and \(r_i\) is the resource requirement.

### Summary

By introducing the stage-based execution strategy, this paper effectively reduces the redundant computations in the hyper-parameter optimization process and improves the computational and resource efficiency of deep-learning model training. This is of great significance for the optimization of large-scale deep-learning tasks.