Light-Weight Fault Tolerant Attention for Large Language Model Training

Yuhang Liang, Xinyi Li, Jie Ren, Ang Li, Bo Fang, Jieyang Chen
2024-10-16
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs. In this paper, we investigate the impact of faults on LLM training, focusing on INF, NaN, and near-INF values in the computation results with systematic fault injection experiments. We observe the propagation patterns of these errors, which can trigger non-trainable states in the model and disrupt training, forcing the procedure to load from checkpoints. To mitigate the impact of these faults, we propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs. ATTNChecker is designed based on the fault propagation patterns of LLMs and incorporates performance optimization to adapt to both system reliability and model vulnerability while providing lightweight protection for fast LLM training. Evaluations on four LLMs show that ATTNChecker incurs on average 7% overhead on training while detecting and correcting all extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.
Distributed, Parallel, and Cluster Computing; Machine Learning
What problem does this paper attempt to address?
The paper addresses extreme errors, namely INF (infinity), NaN (not a number), and near-INF values, that hardware faults can introduce into the attention mechanism while a large language model (LLM) is being trained. These errors can leave the model in a non-trainable state and force training to reload from a checkpoint, which costs additional time and resources. To solve this, the authors propose ATTNChecker, an Algorithm-Based Fault Tolerance (ABFT) technique designed specifically for the attention mechanism in LLMs. It aims to detect and correct anomalies in real time, stop errors from propagating further, reduce reliance on checkpoint recovery, and improve training efficiency.

The main contributions of the paper are:
1. The first comprehensive fault-injection and error-propagation study of INF, NaN, and near-INF errors in the attention mechanism, including the first analysis of how vulnerable its critical operations are to these errors.
2. The design of an extreme error correcting ABFT (EEC-ABFT) that can effectively handle INF, NaN, and near-INF errors. This is the first highly optimized ABFT technique capable of handling such errors in various scenarios, including propagated errors, unpredictable patterns, and mixed error types.
3. The development of ATTNChecker, the first comprehensive soft error protection scheme for the attention mechanism, built on EEC-ABFT. The scheme is optimized for all major operations in the attention mechanism and adapts to system reliability and model vulnerability, so existing LLMs can gain reliability with minimal modifications.
4. The integration of ATTNChecker into the PyTorch framework. Evaluation results show that ATTNChecker adds an average of 7% training overhead while detecting and correcting 100% of extreme errors.

Compared with state-of-the-art recovery techniques, ATTNChecker reduces recovery overhead by up to 49 times. Through these contributions, the paper aims to improve the reliability and efficiency of LLM training by reducing the training interruptions and wasted resources caused by hardware failures.
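To make the general ABFT idea concrete, the following is a minimal sketch of checksum-protected matrix multiplication in plain Python. It is not the paper's EEC-ABFT or ATTNChecker implementation. The function names, the NEAR_INF_THRESHOLD value, and the restriction to at most one corrupted value per column are assumptions made for illustration, and the checksum itself is assumed to be fault-free. The sketch relies on the identity that the column sums of A·B equal (column sums of A)·B, so a single INF, NaN, or near-INF entry in a column can be rebuilt from the checksum and the remaining healthy entries.

```python
import math

# Hypothetical cutoff for "near-INF"; the source does not state a value.
NEAR_INF_THRESHOLD = 1e10

def matmul(a, b):
    """Plain matrix product C = A @ B on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def column_checksums(a, b):
    """Return (e^T A) B, which equals the column sums of A @ B."""
    col_sum_a = [sum(row[k] for row in a) for k in range(len(a[0]))]
    return [sum(col_sum_a[k] * b[k][j] for k in range(len(b)))
            for j in range(len(b[0]))]

def is_extreme(x):
    """True for NaN, INF, or a value above the assumed near-INF cutoff."""
    return math.isnan(x) or math.isinf(x) or abs(x) > NEAR_INF_THRESHOLD

def check_and_correct(c, checksums):
    """Repair at most one extreme value per column of c, in place.

    Returns the list of repaired (row, col) positions.
    """
    repaired = []
    for j, expected in enumerate(checksums):
        bad_rows = [i for i in range(len(c)) if is_extreme(c[i][j])]
        if len(bad_rows) > 1:
            # One checksum per column cannot recover several unknowns.
            raise ValueError(f"column {j}: more than one extreme value")
        if bad_rows:
            i = bad_rows[0]
            healthy = sum(c[r][j] for r in range(len(c)) if r != i)
            c[i][j] = expected - healthy
            repaired.append((i, j))
    return repaired

# Usage: corrupt one element of the product, then repair it.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
checksums = column_checksums(a, b)   # computed alongside the product
c = matmul(a, b)
c[1][0] = float("inf")               # simulated fault
print(check_and_correct(c, checksums))
print(c)
```

The checksum costs one extra vector-matrix product, which is why this style of protection can be much cheaper than rolling back to a checkpoint. Handling several errors in one column, or a corrupted checksum, needs additional checksums and is outside what this sketch covers.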