Accelerated Synchronous Model Parallelism Using Cooperative Process for Training Compute-Intensive Models
Chanhee Yu, Kyongseok Park
DOI: https://doi.org/10.1109/ACCESS.2023.3296609
IF: 3.9
IEEE Access
Abstract: As deep learning is applied to an increasingly wide variety of fields, there is a growing demand for models that can handle large input data, including high-resolution images. Model parallelism was proposed to train models whose size exceeds the memory capacity of a single accelerator, but it trains very slowly because of bubbles, the periods during which devices sit idle. To address this problem, GPipe reduced the bubbles by introducing micro-batches, which divide a mini-batch into smaller units. However, shrinking the micro-batches improves training speed only up to a point, because smaller micro-batches lower computation and input/output (I/O) efficiency. To overcome this limitation, we propose an acceleration method for training compute-intensive models that consists of prediction and synchronization steps based on process cooperation. In the prediction step, the forward-pass input data of all processes are computed concurrently using the weights shared in advance during the synchronization step, and the results are gathered into the corresponding processes via an all-to-all collective operation. This increases computational efficiency and reduces bubbles by minimizing device idle time. Additionally, the proposed method requires minimal memory because it does not have to store activations. Compared with GPipe, the proposed method achieved performance improvements of 15.3%, 34.5%, and 25.8% on the VGG16bn, ResNet50, and InceptionV3 models, respectively, using four devices, and reduced the memory used for training by up to 75.0%.
Computer Science
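To illustrate why micro-batches reduce bubbles, the sketch below computes the idle fraction of an idealized synchronous pipeline. It is not taken from the paper: it uses the common estimate (K - 1) / (M + K - 1) for K pipeline stages and M micro-batches, and it assumes that every stage takes the same time per micro-batch and that communication is free. The function name `bubble_fraction` and the chosen micro-batch counts are hypothetical.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Idle fraction of an idealized synchronous pipeline schedule.

    Assumption (not from the paper): all stages take equal time per
    micro-batch and communication cost is zero, so the schedule spans
    M + K - 1 time slots, of which K - 1 are idle on each device.
    """
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("num_stages and num_micro_batches must be >= 1")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


if __name__ == "__main__":
    stages = 4  # matches the four-device setting mentioned in the abstract
    for micro_batches in (1, 2, 4, 8, 16, 32):
        idle = bubble_fraction(stages, micro_batches)
        print(f"M={micro_batches:2d}  idle fraction={idle:.3f}")
```

Under this model the idle fraction keeps falling as M grows, but the model ignores the loss of computation and I/O efficiency with very small micro-batches, which is the limitation the abstract describes.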