Abstract: We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through modifications to positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a technique that makes linear attention more than twice as fast at runtime and reduces its memory usage by a factor of four. To further enhance the performance of TransNormer, we use a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, yielding a speedup of over $20\%$. Furthermore, we develop a robust inference algorithm that ensures numerical stability and consistent inference speed regardless of sequence length, giving superior efficiency during both training and inference. We also implement an efficient model-parallel scheme for TransNormerLLM, enabling deployment on large-scale clusters and facilitating expansion to even larger models, i.e., LLMs with 175B parameters. We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus. Benchmark results demonstrate that our models not only match the performance of state-of-the-art Transformer-based LLMs but are also significantly faster. Code is released at https://github.com/OpenNLPLab/TransnormerLLM.
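The claim that inference speed stays constant regardless of sequence length can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption: it uses the generic form of causal linear attention with a scalar exponential decay `lam`, a single head, and no LRPE, gating, or normalization. It is not the authors' Lightning Attention kernel or their inference algorithm. The function names are hypothetical. The point is only that the quadratic "parallel" form and a recurrent form with a fixed-size state produce the same outputs, so each new token can be processed at a cost independent of how many tokens came before.

```python
import math


def decayed_linear_attention_parallel(q, k, v, lam):
    """O(n^2) reference form (assumed, not from the paper):
    o_t = sum over s <= t of lam^(t - s) * (q_t . k_s) * v_s
    """
    n, d_v = len(q), len(v[0])
    out = []
    for t in range(n):
        o_t = [0.0] * d_v
        for s in range(t + 1):
            # Decayed, unnormalized attention score between positions t and s.
            score = lam ** (t - s) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
        out.append(o_t)
    return out


def decayed_linear_attention_recurrent(q, k, v, lam):
    """Recurrent form with a fixed-size (d_k x d_v) state:
    S_t = lam * S_{t-1} + k_t v_t^T,   o_t = q_t^T S_t
    The per-token cost does not depend on the position t.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        # Decay the old state and add the outer product k_t v_t^T.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = lam * state[i][j] + k_t[i] * v_t[j]
        # Read out with the current query.
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


if __name__ == "__main__":
    # Toy inputs chosen for illustration only.
    q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0], [2.0], [3.0]]
    par = decayed_linear_attention_parallel(q, k, v, lam=0.5)
    rec = decayed_linear_attention_recurrent(q, k, v, lam=0.5)
    same = all(math.isclose(a, b) for ro, rr in zip(par, rec)
               for a, b in zip(ro, rr))
    print("parallel and recurrent forms agree:", same)
```

In this sketch the decay factor also shows how distant tokens are down-weighted without being cut off entirely, which is one plausible reading of how an exponential decay can limit attention dilution while keeping global interactions.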

Improving and Scaling Trans-dimensional Random Field Language Models.

Learning Trans-Dimensional Random Fields with Applications to Language Modeling

Model Interpolation with Trans-dimensional Random Field Language Models for Speech Recognition

Learning neural trans-dimensional random field language models with noise-contrastive estimation

Trans-dimensional Random Fields for Language Modeling.

Improved training of neural trans-dimensional random field language models with dynamic noise-contrastive estimation.

Integrating Discrete and Neural Features via Mixed-Feature Trans-dimensional Random Field Language Models

TRAWL: Tensor Reduced and Approximated Weights for Large Language Models

Language Modeling Using Tensor Trains

FLTRNN: Faithful Long-Horizon Task Planning for Robotics with Large Language Models

LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

Exploring the Limits of Language Modeling

Fourier Head: Helping Large Language Models Learn Complex Probability Distributions

Why Does the Effective Context Length of LLMs Fall Short?

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Towards Neural Scaling Laws for Time Series Foundation Models

Transformer-XL: Language Modeling with Longer-Term Dependency

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Conditional Topic Random Fields.

Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer