Abstract: We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through modifications to positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a technique that makes linear attention more than twice as fast at runtime and reduces memory usage by a factor of four. To further enhance the performance of TransNormer, we use a gating mechanism for smooth training and a new tensor normalization scheme that accelerates the model by over $20\%$. Furthermore, we develop a robust inference algorithm that ensures numerical stability and consistent inference speed regardless of the sequence length, so the model is efficient during both training and inference. We also implement an efficient model parallel schema for TransNormerLLM, enabling deployment on large-scale clusters and facilitating expansion to even more extensive models, i.e., LLMs with 175B parameters. We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus. Benchmark results demonstrate that our models not only match the performance of state-of-the-art Transformer-based LLMs but are also significantly faster. Code is released at: https://github.com/OpenNLPLab/TransnormerLLM

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer

Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

LoLCATs: On Low-Rank Linearizing of Large Language Models

Efficient and Economic Large Language Model Inference with Attention Offloading

Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions

LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages

The Devil in Linear Transformer

LMTuner: An user-friendly and highly-integrable Training Framework for fine-tuning Large Language Models

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models

Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention

LLM-Neo: Parameter Efficient Knowledge Distillation for Large Language Models

NVLM: Open Frontier-Class Multimodal LLMs

PolyLM: An Open Source Polyglot Large Language Model

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models