IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs

Yuzhen Mao, Martin Ester, Ke Li
2024-05-05
Abstract: One limitation of existing Transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out-of-the-box without requiring retraining. We experiment using our method to accelerate various long-sequence Transformers, including a leading LLaMA 2-based LLM, on various benchmarks and demonstrate a greater speedup of 2.73x - 7.63x while retaining 98.6% - 99.6% of the accuracy of the original pretrained models. The code is available on our project website at
Machine Learning
What problem does this paper attempt to address?
This paper addresses the high computational cost of deploying long-sequence Transformer models on CPUs. Both the time and space complexity of the self-attention mechanism are quadratic in the sequence length, which makes it very challenging to deploy large language models (LLMs) on CPU-only devices. To address this issue, the paper proposes a new method named IceFormer, which accelerates self-attention at inference time without retraining the pretrained model while maintaining high accuracy. IceFormer achieves this by exploiting the sparsity of the attention matrix: it computes only the highest attention weights and enumerates only the value vectors associated with those weights.

The design of IceFormer meets the following four criteria:

1. **No Retraining Required**: The method does not require retraining the model, because retraining LLMs demands huge computational resources.
2. **Generality**: The method can be applied to various LLMs, not just those trained with specific constraints.
3. **High Accuracy**: The method should not introduce large approximation errors, because LLMs have many attention layers and errors in the early layers may accumulate.
4. **Fast Inference**: The method should be fast at test time.

The paper verifies the effectiveness of IceFormer through experiments on the LRA, ZeroSCROLLS, and LongEval benchmarks. The results show that IceFormer significantly improves inference speed while keeping accuracy close to that of the original model. For example, on the LRA benchmark, IceFormer is on average 7.63 times faster than the vanilla Transformer while retaining 98.6% of its accuracy; on the ZeroSCROLLS benchmark, IceFormer is 2.73 times faster than the leading LLaMA 2-based LLM while retaining 99.6% of its accuracy.
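
The idea of computing only the highest attention weights and touching only the matching value vectors can be illustrated with a small sketch. This is not the authors' implementation: the function names `softmax` and `attend` and the parameter `k` are hypothetical, and the top-k keys are selected here by brute force after scoring every key, so the sketch shows the approximation itself rather than any speedup. The summary above does not describe how IceFormer locates the highest weights efficiently, so that step is left out.

```python
import heapq
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, k=None):
    """Attention output for a single query vector.

    k=None -> standard softmax attention over all keys.
    k=int  -> keep only the k keys with the largest scores and
              renormalize over them (illustrative assumption).
    """
    scale = math.sqrt(len(query))
    # Scaled dot-product score of the query against every key.
    # NOTE: brute force; a practical method would have to avoid this full pass.
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]

    if k is None:
        kept = list(range(len(keys)))
    else:
        # Indices of the k highest-scoring keys.
        kept = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)

    weights = softmax([scores[j] for j in kept])
    dim = len(values[0])
    # Weighted sum over only the selected value vectors.
    return [sum(w * values[j][c] for w, j in zip(weights, kept))
            for c in range(dim)]
```

Calling `attend(query, keys, values)` gives the full attention output, while `attend(query, keys, values, k=2)` uses just two value vectors per query. When the attention distribution is sharply peaked, the two outputs tend to be close, which is the sparsity property the method relies on.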