Abstract: Recent studies have shown that relative position encoding performs well in selective state space model scanning algorithms, that architectures balancing SSM and attention improve both efficiency and effectiveness, and that the sparse activation of a mixture of experts reduces training cost. We study the effectiveness of different position encodings in structured state space dual (SSD) algorithms, a more effective method for mixing SSD and attention as inner and outer functions, and a more efficient cross-domain mixture of experts. We find that the same matrix works remarkably well across different algorithms, which allows us to establish a new hybrid sparse architecture: Cheems. Compared with other hybrid architectures, it is more efficient and more effective on language modeling tasks.
What problem does this paper attempt to address?
The main problem that this paper attempts to solve is the inefficiency and ineffectiveness of existing language model architectures when dealing with long sequences. Specifically:
1. **Limitations of the Transformer Architecture**:
- **High Computational Complexity**: The time complexity of the self-attention mechanism is quadratic in the sequence length (\(O(n^2)\)), which makes long sequences very expensive to process.
- **Cache Size Limitation**: Because self-attention must store the entire context, the cache size becomes a bottleneck when processing long contexts.
- **Lack of a Single Summary State**: Each generated token must be computed against the entire context rather than against one summary state, which makes it hard to capture bias information effectively.
2. **Limitations of the Selective State Space Model (SSM)**:
- **Information Loss Due to Information Compression**: An SSM keeps a constant state size during generation through its linear recurrent state update. Because that state does not grow with the sequence length, compressing the context into it inevitably loses information.
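The trade-off described in the two items above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: `kv_cache_floats`, `causal_score_count`, and `toy_recurrence` are hypothetical helper names, and the recurrence uses fixed decay parameters rather than the input-dependent (selective) parameters of a real selective SSM.

```python
def kv_cache_floats(seq_len, d_model):
    """Floats held in a key/value cache after each generated token."""
    cache = []  # one key vector and one value vector per past token
    sizes = []
    for _ in range(seq_len):
        cache.append(([0.0] * d_model, [0.0] * d_model))
        sizes.append(2 * d_model * len(cache))
    return sizes


def causal_score_count(seq_len):
    """Query-key scores computed by causal attention over seq_len tokens."""
    return seq_len * (seq_len + 1) // 2


def toy_recurrence(xs, decays):
    """Toy fixed-parameter recurrence: h_j <- a_j * h_j + (1 - a_j) * x.

    Assumption: a stand-in for an SSM state update; a selective SSM would
    make the parameters depend on the input.
    """
    h = [0.0] * len(decays)
    sizes = []
    for x in xs:
        h = [a * h_j + (1.0 - a) * x for a, h_j in zip(decays, h)]
        sizes.append(len(h))  # state size after this token
    return h, sizes


print(kv_cache_floats(4, 8))                          # [16, 32, 48, 64]
print(causal_score_count(4), causal_score_count(8))   # 10 36
state, sizes = toy_recurrence([1.0, 2.0, 3.0, 4.0], [0.5, 0.9])
print(sizes)                                          # [2, 2, 2, 2]
```

The cache grows linearly with the number of tokens and the number of attention scores grows quadratically, while the recurrent state keeps the same two entries no matter how many inputs have been folded into it. That fixed size is also why earlier inputs can only be kept in compressed, lossy form.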
To overcome these limitations, the paper proposes a new hybrid architecture, Cheems. This architecture aims to combine the selective state space algorithm (SSM) with the quadratic self-attention algorithm, and to improve efficiency and effectiveness in the following ways:
- **Positional Encoding**: The paper studies how effective different forms of positional encoding, such as Rotary Position Encoding (RoPE), are when SSM and self-attention are combined.
- **Inner Function Attention**: An inner function attention mechanism is introduced, using the Selective State Space Algorithm as an inner function to enhance the expressive ability of the hidden state.
- **Cross-Domain Mixture of Million Experts (CDMoME)**: A cross-domain mixture of million experts architecture is designed to reduce parameter redundancy and improve computational efficiency.
Through these improvements, the Cheems architecture shows higher efficiency and better effectiveness on complex language tasks, especially in long-sequence processing.
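To illustrate the positional-encoding point, here is a minimal sketch of rotary position encoding in plain Python. Rotary encoding rotates each consecutive pair of features by an angle proportional to the token position, so the dot product of two encoded vectors depends only on the difference between their positions. The helper names `rope` and `dot`, the example vectors, and the base value 10000 (the common RoPE default) are assumptions for illustration, not details taken from the paper. The same function would apply to projected Q and K vectors in attention or to projected C and B vectors in the SSM.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles.

    Assumption: `base` is the usual RoPE default; the paper may use another.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for k in range(d // 2):
        theta = base ** (-2.0 * k / d)   # frequency of the k-th pair
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[2 * k], vec[2 * k + 1]
        out.extend([c * x0 - s * x1, s * x0 + c * x1])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical projected vectors (stand-ins for W_Q x and W_K x).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 apart, so the scores should match.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is rotated rather than scaled, the encoding also leaves vector norms unchanged; only the relative angle between encoded vectors carries positional information.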
### Formula Summary
1. **Rotary Position Encoding (RoPE)**:
\[
f_{Q,K}(x_i, i) = R_d^{\Theta,i} W_{Q,K} x_i
\]
\[
f_{C,B}(x_i, i) = R_d^{\Theta,i} W_{C,B} x_i
\]
where:
\[
\Theta = \left\{ \theta_i = n^{-2(i - 1)/d},\ i \in [1, 2, \ldots, d/2] \right\}
\]
\[
R_d^{\Theta,i} = \begin{bmatrix}
\cos(i\theta_1) & -\sin(i\theta_1) & 0 & 0 & \cdots & 0 & 0 \\
\sin(i\theta_1) & \cos(i\theta_1) & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos(i\theta_2) & -\sin(i\theta_2) & \cdots & 0 & 0 \\
0 & 0 & \sin(i\theta_2) & \cos(i\theta_2) & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos(i\theta_{d/2}) & -\sin(i\theta_{d/2}) \\
0 & 0 & 0 & 0 & \cdots & \sin(i\theta_{d/2}) & \cos(i\theta_{d/2})
\end{bmatrix}