AFNFA: an Approach to Automate NCCL Configuration Exploration.

Zibo Wang, Yuhang Zhou, Chen Tian, Xiaoliang Wang, Xianping Chen
DOI: https://doi.org/10.1145/3600061.3600068
2023-01-01
Abstract: With the continuously increasing scale of deep neural network models, there is a clear trend towards distributed DNN model training. State-of-the-art training frameworks support this approach using collective communication libraries such as NCCL, MPI, Gloo, and Horovod. These libraries have many parameters that can be adjusted to fit different hardware environments, and these parameters can greatly impact training performance. Therefore, careful tuning of parameters for each training environment is required. However, given the large parameter space, manual exploration can be time-consuming and laborious. In this poster, we introduce AFNFA, which stands for AI For Network For AI. It is an automated program that utilizes machine learning and simulated annealing to explore NCCL parameters. Preliminary evaluation results demonstrate that compared to the default configuration, the configuration explored by AFNFA improves NCCL communication performance by 22.90%.
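To make the idea of exploring NCCL parameters with simulated annealing more concrete, here is a minimal sketch. It is not AFNFA's implementation, and the abstract does not describe the actual search space or cost model. The assumptions are: the four environment variables below are real NCCL settings, but the candidate values are picked only for illustration; `synthetic_cost` is a made-up stand-in for a measured communication time or a learned performance model; and the geometric cooling schedule and the helper names `neighbor` and `simulated_annealing` are hypothetical.

```python
import math
import random

# Assumed search space: the variable names are real NCCL environment
# variables, but the candidate values are illustrative only.
SEARCH_SPACE = {
    "NCCL_ALGO": ["Ring", "Tree"],
    "NCCL_PROTO": ["LL", "LL128", "Simple"],
    "NCCL_NTHREADS": [64, 128, 256, 512],
    "NCCL_BUFFSIZE": [1 << 20, 1 << 21, 1 << 22, 1 << 23],
}


def synthetic_cost(config):
    """Made-up stand-in for a measured communication time (lower is better).

    A real tuner would run a benchmark under `config` or query a trained
    performance model instead of using these invented numbers.
    """
    cost = 10.0
    cost -= 1.0 if config["NCCL_ALGO"] == "Ring" else 0.0
    cost -= {"LL": 0.0, "LL128": 1.5, "Simple": 0.5}[config["NCCL_PROTO"]]
    cost -= {64: 0.0, 128: 0.3, 256: 0.8, 512: 0.5}[config["NCCL_NTHREADS"]]
    cost -= {1 << 20: 0.0, 1 << 21: 0.2, 1 << 22: 0.6, 1 << 23: 0.4}[
        config["NCCL_BUFFSIZE"]
    ]
    return cost


def neighbor(config, rng):
    """Return a copy of `config` with exactly one parameter changed."""
    new_config = dict(config)
    key = rng.choice(sorted(SEARCH_SPACE))
    choices = [v for v in SEARCH_SPACE[key] if v != config[key]]
    new_config[key] = rng.choice(choices)
    return new_config


def simulated_annealing(cost_fn, steps=500, t_start=1.0, t_end=0.01, seed=0):
    """Search SEARCH_SPACE for a low-cost configuration."""
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
    current_cost = cost_fn(current)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        candidate_cost = cost_fn(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


if __name__ == "__main__":
    best_config, best_cost = simulated_annealing(synthetic_cost)
    for name, value in best_config.items():
        print(f"{name}={value}")
    print("synthetic cost:", best_cost)
```

In a real setting, the cost function would be the expensive part, since each evaluation means running a collective benchmark. A learned model that predicts performance from a configuration is one plausible way to reduce the number of measurements, though the abstract does not say how AFNFA combines the two.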