Abstract: We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through modifications to positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a technique that makes linear attention more than twice as fast at runtime and reduces its memory usage by a factor of four. To further enhance the performance of TransNormer, we use a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, yielding a speedup of over $20\%$. Furthermore, we develop a robust inference algorithm that ensures numerical stability and a consistent inference speed regardless of sequence length, making the model efficient during both training and inference. We also implement an efficient model-parallel scheme for TransNormerLLM, enabling deployment on large-scale clusters and facilitating expansion to even larger models, i.e., LLMs with 175B parameters. We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus. Benchmark results demonstrate that our models not only match the performance of state-of-the-art Transformer-based LLMs but are also significantly faster. Code is released at: https://github.com/OpenNLPLab/TransnormerLLM.
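
To make the claim of sequence-length-independent inference speed concrete, the following is a minimal sketch of generic causal linear attention with a scalar exponential decay, written in its recurrent form. It is not the released TransNormerLLM code or Lightning Attention: the function names, the single scalar `decay`, and the omission of LRPE, gating, and normalization are all assumptions made for illustration. The recurrent form keeps a fixed-size state, so each new token costs the same regardless of how many tokens came before; the quadratic form is included only as a reference to check against.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def decayed_linear_attention_recurrent(
    q: Matrix, k: Matrix, v: Matrix, decay: float
) -> Matrix:
    """Hypothetical sketch: o_t = q_t^T S_t with S_t = decay * S_{t-1} + k_t v_t^T."""
    d_k, d_v = len(k[0]), len(v[0])
    # Fixed-size state (d_k x d_v); it does not grow with the sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs: Matrix = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old summary, then add the new key-value outer product.
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def decayed_linear_attention_quadratic(
    q: Matrix, k: Matrix, v: Matrix, decay: float
) -> Matrix:
    """Reference form: o_t = sum over s <= t of decay^(t-s) * (q_t . k_s) * v_s."""
    d_v = len(v[0])
    outputs: Matrix = []
    for t, q_t in enumerate(q):
        out = [0.0] * d_v
        for s in range(t + 1):
            score = decay ** (t - s) * sum(a * b for a, b in zip(q_t, k[s]))
            for j in range(d_v):
                out[j] += score * v[s][j]
        outputs.append(out)
    return outputs
```

Both functions compute the same outputs; only the recurrent one does a constant amount of work per generated token, which is the property the abstract attributes to the inference algorithm.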

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer