Abstract: In recent times, research on Large Language Models (LLMs) has grown exponentially, predominantly focusing on models underpinned by the transformer architecture, as established by [1], and further developed through the decoder-only variations by [2]. Contemporary efforts in this field primarily aim to enhance model capabilities by scaling up both the architecture and the data volumes used during training. However, work on reducing these model sizes while preserving their efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt). These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes. We open-source the model weights and the complete codebase for these implementations for further research.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is: **how to reduce the parameter count and computational overhead of decoder-only Transformer models while maintaining performance**. Specifically, the paper explores modifications to the traditional decoder-only Transformer architecture (such as GPT) that yield smaller and faster models.

### Background of the Main Problem

In recent years, research on large language models (LLMs) has developed rapidly, mainly focusing on models based on the Transformer architecture, especially decoder-only models (such as GPT). Most current research improves performance by expanding the model scale and increasing the amount of training data. However, few studies have focused on how to reduce the model scale without sacrificing performance. This has led to excessive computational costs during training and inference, limiting the practical application of these models.

### Core Objectives of the Paper

To meet this challenge, the paper proposes three new variants of the decoder-only Transformer architecture:

1. **ParallelGPT (pgpt)**: By parallelizing the decoder blocks, it reduces the information sparsity of the deep-layer decoder blocks and allows parallel training.
2. **LinearGPT (lgpt)**: By gradually reducing the dimensions of the decoder blocks, it reduces the number of model parameters while maintaining performance.
3. **ConvGPT (cgpt)**: It replaces the linear compression layer with a one-dimensional convolutional layer, further exploring the potential of convolutional operations in the Transformer.

These variants aim to:

- **Reduce the number of parameters**: Reduce the complexity and storage requirements of the model.
- **Accelerate training and inference**: Improve the efficiency of the model, making it more suitable for practical applications.
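
To give a feel for why shrinking the block width reduces the parameter count, the following is a rough sketch rather than the paper's implementation. It assumes the width is halved after every `every` blocks (one reading of the formula d_m = d_0 / 2^m in the summary), and it uses the common approximation of 12·d² weights per standard decoder block. The function names and the example sizes (width 512, 8 blocks, halving every 2 blocks) are hypothetical, and embeddings and the output head are ignored, so the ratio is not comparable to the figures reported in the paper.

```python
def block_widths(d0, num_blocks, every):
    """Width of each decoder block when the width is halved after every `every` blocks.

    Assumption for this sketch: halving starts after block `every`, then repeats.
    """
    return [d0 // (2 ** (i // every)) for i in range(num_blocks)]


def approx_block_params(d):
    """Rough weight count of one standard decoder block of width d.

    Assumes 4*d*d for attention (Q, K, V and output projections) plus 8*d*d for
    a feed-forward layer with 4x expansion; biases and layer norms are ignored.
    """
    return 12 * d * d


def approx_total_params(d0, num_blocks, every, shrink=True):
    """Approximate total block weights, with or without the shrinking schedule."""
    widths = block_widths(d0, num_blocks, every) if shrink else [d0] * num_blocks
    total = sum(approx_block_params(d) for d in widths)
    if shrink:
        # One linear compression layer (d -> d/2) after every `every`-th block.
        for i in range(every - 1, num_blocks, every):
            d = widths[i]
            total += d * (d // 2)
    return total


# Hypothetical sizes chosen only for illustration.
baseline = approx_total_params(512, 8, 2, shrink=False)
shrunk = approx_total_params(512, 8, 2, shrink=True)
print(block_widths(512, 8, 2))
print(f"shrunk / baseline = {shrunk / baseline:.3f}")
```

Because the per-block cost grows with the square of the width, each halving cuts the cost of the following blocks to about a quarter, which is why even a few compression steps remove most of the block weights in this toy count.
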
### Experimental Results

Experiments show that these variants exhibit performance comparable to or even better than the traditional GPT architecture in multiple benchmark tests while significantly reducing the number of parameters. For example, lgpt outperforms GPT in four benchmark tests, and the number of parameters is less than half that of GPT.

### Conclusions

This research demonstrates the possibility of significantly reducing the model scale while maintaining performance through architectural innovation, providing a new direction for future research. Although there are some limitations (such as the inability to compare directly with state-of-the-art models due to resource limitations), these preliminary results lay the foundation for exploring more efficient Transformer architectures.

### Formula Summary

The following are some key formulas involved in the paper:

1. **ParallelGPT**:

   \[ h_i = P_{D_i}(x), \quad \text{for } i = 1, 2, \ldots, P \]

   \[ y_i = f(h_i), \quad \text{for } i = 1, 2, \ldots, P \]

   \[ \alpha_i = \frac{\exp(w_i)}{\sum_{j = 1}^P \exp(w_j)}, \quad \text{for } i = 1, 2, \ldots, P \]

   \[ y = \sum_{i = 1}^P \alpha_i \cdot y_i \]

2. **LinearGPT**:

   \[ h_i = D_i(h_{i - 1}), \quad i = 1, 2, \ldots, N \]

   \[ h_i = \begin{cases} L_j(h_{i - 1}), & \text{if } i = nj \\ h_{i - 1}, & \text{otherwise} \end{cases} \]

   \[ d_m = \frac{d_0}{2^m}, \quad m = \left\lfloor \frac{N}{n} \right\rfloor \]

These formulas describe the specific implementation methods of the different architectures and help in understanding their working principles.
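
The ParallelGPT combination step, y = Σ α_i · y_i with α = softmax(w), can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `softmax` and `combine_paths` and the toy vectors are hypothetical. In the paper the y_i would be outputs of the parallel decoder paths and the w_i would be learned; here they are fixed numbers.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of scalars."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    s = sum(exps)
    return [e / s for e in exps]


def combine_paths(path_outputs, path_weights):
    """Return y = sum_i alpha_i * y_i, where alpha = softmax(path_weights)."""
    alphas = softmax(path_weights)
    dim = len(path_outputs[0])
    return [sum(a * y[k] for a, y in zip(alphas, path_outputs)) for k in range(dim)]


# Two hypothetical path outputs y_1 and y_2 (toy 3-dimensional vectors).
y1 = [1.0, 0.0, 2.0]
y2 = [3.0, 4.0, 0.0]

# Equal weights give alpha_1 = alpha_2 = 0.5, so y is the plain average.
print(combine_paths([y1, y2], [0.0, 0.0]))  # prints [2.0, 2.0, 1.0]
```

Since the α_i are non-negative and sum to 1, the combined output is always a convex combination of the path outputs; raising one w_i relative to the others shifts y toward that path's output.
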

Towards smaller, faster decoder-only transformers: Architectural variants and their implications

MoDeGPT: Modular Decomposition for Large Language Model Compression

On The Adaptation of Unlimiformer for Decoder-Only Transformers

Transformer on a Diet

Inheritune: Training Smaller Yet More Attentive Language Models

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks

Memory-Augmenting Decoder-Only Language Models through Encoders (Student Abstract)

Learning to Grow Pretrained Models for Efficient Transformer Training

How Powerful are Decoder-Only Transformer Neural Models?

BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits

Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

Escaping the Big Data Paradigm with Compact Transformers

Transformer Layer Injection: A Novel Approach for Efficient Upscaling of Large Language Models

Speculative Decoding with Big Little Decoder

Super Tiny Language Models

TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition