NEEBS: Nonexpert large‐scale environment building system for deep neural network

Yoshiharu Tajima, Masahiro Asaoka, Akihiro Tabuchi, Akihiko Kasagi, Tsuguchika Tabaru
DOI: https://doi.org/10.1002/cpe.7499
2022-11-22
Concurrency and Computation: Practice and Experience
Abstract: Deep neural networks (DNNs) have greatly improved accuracy on a variety of tasks in areas such as natural language processing (NLP). Obtaining a highly accurate DNN model requires repeated training on a huge dataset, which in turn requires a large‐scale cluster whose compute nodes are tightly connected by high‐speed interconnects so that large amounts of intermediate data can be exchanged with very low latency. However, fully exploiting the computational power of such a cluster for training requires knowledge of its components, such as the distributed file system, the interconnect, and optimized high‐performance libraries. We have developed the Non‐Expert large‐scale Environment Building System (NEEBS), which helps a user build a fast‐running training environment on a large‐scale cluster. It automatically installs and configures the applications and the libraries they need. It also optimally prepares tools for staging both data and executable programs, along with launcher scripts suited to both the applications and the job submission systems of the cluster. NEEBS achieves 93.91% throughput scalability in NLP pretraining. We also present an approach to reducing the pretraining time of highly accurate DNN models for NLP using a large‐scale computation environment built with NEEBS. We trained Bidirectional Encoder Representations from Transformers (BERT)‐3.9b and BERT‐xlarge models using a dense masked language model (MLM) on the Megatron‐LM framework, and evaluated the improvement in learning time and learning efficiency on a Japanese language dataset using 768 graphics processing units (GPUs) on the AI Bridging Cloud Infrastructure (ABCI). Our implementation with NEEBS improved learning efficiency per iteration by a factor of 10 and completed the pretraining of BERT‐xlarge in 4.7 h; the same pretraining takes 5 months on a single GPU.
To determine whether the BERT models were correctly pretrained, we evaluated their accuracy on two tasks: the Stanford Natural Language Inference Corpus translated into Japanese (JSNLI) and Twitter reputation analysis (TwitterRA). BERT‐3.9b achieved 94.30% accuracy on JSNLI, and BERT‐xlarge achieved 90.63% accuracy on TwitterRA. We thus constructed pretrained models with accuracy comparable to that of other Japanese BERT models in a shorter time.