Light-Weight Fault Tolerant Attention for Large Language Model Training

Yuhang Liang, Xinyi Li, Jie Ren, Ang Li, Bo Fang, Jieyang Chen

2024-10-16

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs. In this paper, we investigate the impact of faults on LLM training, focusing on INF, NaN, and near-INF values in the computation results with systematic fault injection experiments. We observe the propagation patterns of these errors, which can trigger non-trainable states in the model and disrupt training, forcing the procedure to load from checkpoints. To mitigate the impact of these faults, we propose ATTNChecker, the first Algorithm-Based Fault Tolerance (ABFT) technique tailored for the attention mechanism in LLMs. ATTNChecker is designed based on the fault propagation patterns of LLMs and incorporates performance optimization to adapt to both system reliability and model vulnerability while providing lightweight protection for fast LLM training. Evaluations on four LLMs show that ATTNChecker incurs on average 7% overhead on training while detecting and correcting all extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.
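
To illustrate the kind of propagation the abstract describes, the following minimal Python sketch shows how a single INF in one row of attention scores can turn the whole softmax output row into NaN. The function name and input values are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stabilized softmax over a list of floats."""
    m = max(scores)
    # With an INF score, m is INF and (INF - INF) evaluates to NaN.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)  # a single NaN term makes the whole sum NaN
    return [e / total for e in exps]

clean = softmax([1.0, 3.0, 2.0])
faulty = softmax([1.0, float("inf"), 2.0])  # one corrupted score

print(clean)
print(faulty)
```

One corrupted input is enough to invalidate every entry of the row, which is why detecting such values before they spread matters.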

Distributed, Parallel, and Cluster Computing; Machine Learning

What problem does this paper attempt to address?

The paper addresses extreme errors, namely INF (infinity), NaN (not a number), and near-INF values, that hardware faults introduce into the attention mechanism during the training of large language models (LLMs). These errors can push the model into a non-trainable state and force training to reload from a checkpoint, which increases time and resource costs. To solve this problem, the authors propose ATTNChecker, an Algorithm-Based Fault Tolerance (ABFT) technique designed specifically for the attention mechanism in LLMs. It aims to detect and correct anomalies in real time, prevent further error propagation, reduce reliance on checkpoint recovery, and improve training efficiency.

The main contributions of the paper are:

1. The first comprehensive fault injection and error propagation study of INF, NaN, and near-INF errors in the attention mechanism, together with the first analysis of how vulnerable its critical operations are to these errors.

2. The design of an extreme error correcting ABFT (EEC-ABFT) that can effectively handle INF, NaN, and near-INF errors. According to the authors, this is the first highly optimized ABFT technique capable of handling such errors in various scenarios, including propagated errors, unpredictable patterns, and mixed error types.

3. The development of ATTNChecker, the first comprehensive soft error protection scheme for the attention mechanism, built on EEC-ABFT. The scheme is optimized for all major operations in the attention mechanism and adapts to system reliability and model vulnerability, so existing LLMs can gain reliability with minimal modifications.

4. The integration of ATTNChecker into the PyTorch framework. Evaluation results show that ATTNChecker adds 7% training overhead on average while detecting and correcting 100% of extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.

Through these contributions, the paper aims to improve the reliability and efficiency of LLM training by reducing the training interruptions and wasted resources caused by hardware faults.
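
As a rough illustration of the checksum idea behind ABFT, the sketch below protects a plain matrix product with a classic column checksum and uses it to repair extreme values. It is not the paper's EEC-ABFT: the `NEAR_INF` threshold, the function names, and the fallback of recomputing an element when a column holds more than one bad value are all assumptions made for this example.

```python
import math

NEAR_INF = 1e10  # hypothetical "near-INF" threshold, not from the paper

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def is_extreme(x):
    """True for INF, NaN, or a finite value above the assumed threshold."""
    return not math.isfinite(x) or abs(x) > NEAR_INF

def repair_extreme_errors(a, b, c):
    """Repair extreme values in c, which should equal a @ b, in place.

    Returns the list of (row, col) positions that were repaired.
    """
    # Column checksum of a: the sum of each column of a.
    a_checksum = [sum(col) for col in zip(*a)]
    # (checksum of a) @ b equals the column sums of the fault-free product.
    expected = matmul([a_checksum], b)[0]
    fixed = []
    for j in range(len(c[0])):
        bad = [i for i in range(len(c)) if is_extreme(c[i][j])]
        if len(bad) == 1:
            # One bad value: rebuild it from the checksum and the good values.
            i = bad[0]
            others = sum(c[k][j] for k in range(len(c)) if k != i)
            c[i][j] = expected[j] - others
        else:
            # Several bad values: one checksum is not enough, so recompute.
            for i in bad:
                c[i][j] = sum(a[i][k] * b[k][j] for k in range(len(b)))
        fixed.extend((i, j) for i in bad)
    return fixed

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul(a, b)
c[1][0] = float("inf")  # inject a fault into the result
fixed = repair_extreme_errors(a, b, c)
print(fixed, c)
```

Extreme values are located by scanning rather than by subtracting checksums, because any arithmetic involving INF or NaN would itself produce INF or NaN; the checksum is used only to rebuild the missing value from the remaining clean entries.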

Evaluation and Improvement of Fault Detection for Large Language Models

Reliable and Efficient In-Memory Fault Tolerance of Large Language Model Pretraining

Large Language Models for Test-Free Fault Localization

Enhancing Fault Detection for Large Language Models via Mutation-Based Confidence Smoothing

LLMEffiChecker: Understanding and Testing Efficiency Degradation of Large Language Models

TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

TAIA: Large Language Models are Out-of-Distribution Data Learners

Characterization of Large Language Model Development in the Datacenter

Efficient and Economic Large Language Model Inference with Attention Offloading

Attention Tracker: Detecting Prompt Injection Attacks in LLMs

Impact of Large Language Models of Code on Fault Localization

TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use

Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Cross-layer Attention Sharing for Large Language Models

Empirical Study on Fine-Tuning Pre-Trained Large Language Models for Fault Diagnosis of Complex Systems

Exposing Attention Glitches with Flip-Flop Language Modeling

A Lightweight Model for Train Bearing Fault Diagnosis Based on Multiscale Attentional Feature Fusion

AttentionBreaker: Adaptive Evolutionary Optimization for Unmasking Vulnerabilities in LLMs through Bit-Flip Attacks