Efficient Large Models Fine-tuning on Commodity Servers Via Memory-balanced Pipeline Parallelism

Yujie Liu, Zhiquan Lai, Weijie Liu, Wei Wang, Dongsheng Li
DOI: https://doi.org/10.1109/hpcc-dss-smartcity-dependsys60770.2023.00103
2024-01-01
Abstract: Large-scale models have demonstrated outstanding performance across various downstream tasks. Pipeline parallelism is essential for fine-tuning large models on commodity GPU servers, as it plays a crucial role in making their exceptional performance more accessible and widespread. Existing approaches have encountered challenges in attaining effective memory-balanced pipeline parallelism. This paper presents a novel memory load-balanced pipeline parallel solution that aims to distribute memory usage evenly across stages on commodity GPU servers by leveraging NVLink bridges. The solution presents a novel approach for transferring data from GPUs to CPUs through the PCI-e link between adjacent GPUs interconnected by the NVLink bridge. Moreover, our methodology orchestrates data transfer operations to reduce offloading latency during the fine-tuning of large models. The evaluation demonstrates that our approach enhances switching efficiency by 1.46x to 1.76x and improves throughput by 2.04x to 3.59x compared to the PyTorch offloading technique.