Abstract: We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through modifications to positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a technique that makes linear attention more than twice as fast at runtime and reduces its memory usage by a factor of four. To further enhance the performance of TransNormer, we leverage a gating mechanism for smooth training and a new tensor normalization scheme to accelerate the model, resulting in a speedup of over $20\%$. Furthermore, we develop a robust inference algorithm that ensures numerical stability and consistent inference speed regardless of the sequence length, making the model efficient during both training and inference. We also implement an efficient model parallel schema for TransNormerLLM, enabling deployment on large-scale clusters and facilitating expansion to even larger models, i.e., LLMs with 175B parameters. We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus. Benchmark results demonstrate that our models not only match the performance of state-of-the-art Transformer-based LLMs but are also significantly faster. Code is released at: https://github.com/OpenNLPLab/TransnormerLLM.
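
The claim of inference speed that does not depend on sequence length can be illustrated with a generic causal linear attention with exponential decay. The following is a minimal sketch under stated assumptions, not the authors' implementation: it omits LRPE, gating, normalization, and Lightning Attention, and the function names and the scalar `decay` parameter are hypothetical. It shows that the direct form, whose cost grows quadratically with length, gives the same output as a recurrent form that keeps only a fixed-size state per step.

```python
import random


def decayed_linear_attention_parallel(q, k, v, decay):
    """Direct form: o_t = sum_{s <= t} decay**(t - s) * (q_t . k_s) * v_s.

    Revisits every earlier token at each step, so cost is quadratic in length.
    """
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        o_t = [0.0] * d_v
        for s in range(t + 1):
            score = decay ** (t - s) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
        out.append(o_t)
    return out


def decayed_linear_attention_recurrent(q, k, v, decay):
    """Recurrent form: S_t = decay * S_{t-1} + k_t^T v_t, then o_t = q_t S_t.

    The state S has a fixed d_k x d_v shape, so each step costs the same
    regardless of how many tokens came before.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


if __name__ == "__main__":
    # Illustrative random inputs: 8 tokens, feature dimension 4 (assumed sizes).
    random.seed(0)
    n, d = 8, 4
    q, k, v = ([[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
               for _ in range(3))
    a = decayed_linear_attention_parallel(q, k, v, decay=0.9)
    b = decayed_linear_attention_recurrent(q, k, v, decay=0.9)
    max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print("max difference between the two forms:", max_diff)
```

The two forms agree up to floating-point rounding. With a decay below 1, older tokens contribute less to the state, which is one simple way to see how a decay term can emphasize nearby context while earlier tokens still influence the output.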

Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

A Framework to Implement 1+N Multi-task Fine-tuning Pattern in LLMs Using the CGC-LORA Algorithm

Multimodal Instruction Tuning with Conditional Mixture of LoRA

LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Label Supervised LLaMA Finetuning

Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models

Learning Global Controller in Latent Space for Parameter-Efficient Fine-Tuning

When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method

Efficiency in Focus: LayerNorm as a Catalyst for Fine-tuning Medical Visual Language Pre-trained Models

LoRA$^2$: Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer

SuperLoRA: Parameter-Efficient Unified Adaptation of Multi-Layer Attention Modules

Full Parameter Fine-tuning for Large Language Models with Limited Resources

LoFiT: Localized Fine-tuning on LLM Representations

Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

LoRA ensembles for large language model fine-tuning

Tuning Language Models by Mixture-of-Depths Ensemble

Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model