Abstract: In recent times, the research on Large Language Models (LLMs) has grown exponentially, predominantly focusing on models underpinned by the transformer architecture, as established by [1], and further developed through the decoder-only variations by [2]. Contemporary efforts in this field primarily aim to enhance model capabilities by scaling up both the architecture and data volumes utilized during training. However, the exploration into reducing these model sizes while preserving their efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt). These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes. We open-source the model weights and the complete codebase for these implementations for further research.
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is: **How to reduce the number of parameters and computational overhead of the decoder-only Transformer model while maintaining performance**. Specifically, the authors explore modifications to the traditional decoder-only Transformer architecture (such as GPT) to achieve smaller and faster models.
### Background of the Main Problem
In recent years, research on large language models (LLMs) has developed rapidly, mainly focusing on models based on the Transformer architecture, especially decoder-only models (such as GPT). Most current research improves performance by expanding the model scale and increasing training data. However, few studies have focused on how to reduce the model scale without sacrificing performance. As a result, training and inference remain computationally expensive, which limits the models' practical applications.
### Core Objectives of the Paper
To meet this challenge, the authors propose three new variants of the decoder-only Transformer architecture:
1. **ParallelGPT (pgpt)**: By parallelizing the decoder blocks, it reduces the information sparsity of the deep-layer decoder blocks and allows parallel training.
2. **LinearGPT (lgpt)**: By gradually reducing the dimensions of the decoder blocks, it reduces the number of model parameters while maintaining performance.
3. **ConvGPT (cgpt)**: By replacing the linear compression layer with a one-dimensional convolutional layer, it further explores the potential of convolutional operations in the Transformer.
These variants aim to:
- **Reduce the number of parameters**: Reduce the complexity and storage requirements of the model.
- **Accelerate training and inference speeds**: Improve the efficiency of the model, making it more suitable for practical applications.
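To make the parameter saving concrete, the following is a minimal sketch of how halving the block width at regular intervals (the LinearGPT idea) shrinks a rough parameter count. Everything here is illustrative: the function names, the sizes `d0=512`, `num_blocks=4`, `n=2`, and the `12 * d * d` per-block estimate (about `4 * d^2` for the attention projections plus `8 * d^2` for a feed-forward layer with 4x expansion) are assumptions, not figures from the paper. The estimate also ignores embeddings, normalization layers, biases, and the compression layers themselves.

```python
def block_dims(d0, num_blocks, n):
    """Width of each decoder block when the width is halved after every n blocks.

    Block i (0-indexed) has been preceded by i // n compression steps,
    so its width is d0 / 2**(i // n).
    """
    return [d0 // (2 ** (i // n)) for i in range(num_blocks)]


def approx_block_params(d):
    # Assumption: ~4*d^2 (attention projections) + ~8*d^2 (4x-wide MLP).
    return 12 * d * d


def approx_total_params(dims):
    """Sum the rough per-block estimate over all block widths."""
    return sum(approx_block_params(d) for d in dims)


if __name__ == "__main__":
    uniform = [512] * 4               # hypothetical constant-width baseline
    shrinking = block_dims(512, 4, 2)  # hypothetical halving schedule
    print("uniform widths:  ", uniform, approx_total_params(uniform))
    print("shrinking widths:", shrinking, approx_total_params(shrinking))
```

Because the per-block estimate grows with the square of the width, each halving cuts the cost of the affected blocks to roughly a quarter, which is why a modest schedule can remove a large share of the parameters.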
### Experimental Results
Experiments show that these variants perform comparably to, or even better than, the traditional GPT architecture on multiple benchmarks while significantly reducing the number of parameters. For example, lgpt outperforms GPT on four benchmarks with fewer than half as many parameters.
### Conclusions
This research demonstrates the possibility of significantly reducing the model scale while maintaining performance through architectural innovation, providing a new direction for future research. Although there are some limitations (such as the inability to compare directly with state-of-the-art models due to resource limitations), these preliminary results lay the foundation for exploring more efficient Transformer architectures.
### Formula Summary
The following are some key formulas involved in the paper:
1. **ParallelGPT**:
\[
h_i = P_{D_i}(x), \quad \text{for } i = 1, 2, \ldots, P
\]
\[
y_i = f(h_i), \quad \text{for } i = 1, 2, \ldots, P
\]
\[
\alpha_i = \frac{\exp(w_i)}{\sum_{j = 1}^P \exp(w_j)}, \quad \text{for } i = 1, 2, \ldots, P
\]
\[
y = \sum_{i = 1}^P \alpha_i\cdot y_i
\]
2. **LinearGPT**:
\[
h_i = D_i(h_{i - 1}), \quad i = 1, 2, \ldots, N
\]
\[
h_i =
\begin{cases}
L_j(h_{i - 1}), & \text{if } i = nj \\
h_{i - 1}, & \text{otherwise}
\end{cases}
\]
\[
d_m=\frac{d_0}{2^m}, \quad m=\left\lfloor\frac{N}{n}\right\rfloor
\]
These formulas describe the specific implementation methods of different architectures and help understand their working principles.
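As an illustration of the ParallelGPT combination step, the sketch below computes the softmax weights \(\alpha_i\) from scalar weights \(w_i\) and forms the weighted sum \(y = \sum_i \alpha_i \cdot y_i\). It is a plain-Python sketch under stated assumptions: the path outputs \(y_i\) are represented as equal-length lists of floats, the \(w_i\) are treated as ordinary scalars, and the function names `softmax` and `combine_paths` are hypothetical rather than taken from the paper's codebase.

```python
import math


def softmax(weights):
    """alpha_i = exp(w_i) / sum_j exp(w_j), shifted by max(w) for numerical stability."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def combine_paths(path_outputs, weights):
    """y = sum_i alpha_i * y_i, computed element by element.

    path_outputs: list of P vectors (lists of floats), one per parallel path.
    weights:      list of P scalars w_i (assumed learnable in a real model).
    """
    alphas = softmax(weights)
    dim = len(path_outputs[0])
    return [
        sum(a * y[k] for a, y in zip(alphas, path_outputs))
        for k in range(dim)
    ]


if __name__ == "__main__":
    # Two hypothetical path outputs with equal weights: the result is their mean.
    y = combine_paths([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
    print(y)
```

Since the \(\alpha_i\) are non-negative and sum to 1, the combined output is a convex combination of the path outputs, so a path with a much larger \(w_i\) dominates the result.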