Abstract: Recent studies have shown that relative position encoding performs well in selective state space model (SSM) scanning algorithms, that architectures balancing SSM and attention improve both efficiency and effectiveness, and that the sparse activation of mixture of experts reduces training cost. We study the effectiveness of different position encodings in structured state space duality (SSD) algorithms, propose a more effective method for mixing SSD and attention as inner and outer functions, and design a more efficient cross-domain mixture of experts. We find that the same matrix works remarkably well across these different algorithms, which allows us to establish a new hybrid sparse architecture: Cheems. Compared with other hybrid architectures, it is more efficient and more effective in language modeling tasks.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is the inefficiency and ineffectiveness of existing language model architectures when dealing with long sequences. Specifically:

1. **Limitations of the Transformer Architecture**:
   - **High Computational Complexity**: The time complexity of the self-attention mechanism is quadratic in the sequence length (\(O(n^2)\)), so the computational cost becomes very high for long sequences.
   - **Cache Size Limitation**: Because self-attention must store the entire context, the cache size becomes a bottleneck when processing long contexts.
   - **Lack of a Single Summary State**: Each generated token must be computed against the entire context, resulting in an inability to effectively capture bias information.
2. **Limitations of the Selective State Space Model (SSM)**:
   - **Information Loss Due to Information Compression**: An SSM keeps a constant state size during generation through a linear recurrent state update. Because the state does not grow with the sequence length, compressing the context into it inevitably loses information.

To overcome these limitations, the paper proposes a new hybrid architecture, Cheems. It aims to combine the selective state space algorithm (SSM) with the quadratic self-attention algorithm, and to improve efficiency and effectiveness in the following ways:

- **Positional Encoding**: The effectiveness of different forms of positional encoding, such as Rotary Position Encoding (RoPE), is studied for combining SSM with self-attention.
- **Inner Function Attention**: An inner function attention mechanism is introduced, using the selective state space algorithm as the inner function to enhance the expressive ability of the hidden state.
- **Cross-Domain Mixture of Million Experts (CDMoME)**: A cross-domain mixture of million experts architecture is designed to reduce parameter redundancy and improve computational efficiency.

Through these improvements, the Cheems architecture shows higher efficiency and better effectiveness in handling complex language tasks, especially in long-sequence processing.

### Formula Summary

1. **Rotary Position Encoding (RoPE)**:

\[ f_{Q,K}(x_i, i) = R_d^{\Theta,i} W_{Q,K} x_i \]

\[ f_{C,B}(x_i, i) = R_d^{\Theta,i} W_{C,B} x_i \]

where \(i\) is the token position, \(j\) indexes the rotation frequencies, and:

\[ \Theta = \left\{ \theta_j = n^{-2(j - 1)/d},\ j \in [1, 2, \ldots, d/2] \right\} \]

\[ R_d^{\Theta,i} = \begin{bmatrix} \cos(i\theta_1) & -\sin(i\theta_1) & 0 & 0 & \ldots & 0 & 0 \\ \sin(i\theta_1) & \cos(i\theta_1) & 0 & 0 & \ldots & 0 & 0 \\ 0 & 0 & \cos(i\theta_2) & -\sin(i\theta_2) & \ldots & 0 & 0 \\ 0 & 0 & \sin(i\theta_2) & \cos(i\theta_2) & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \ldots & \cos(i\theta_{d/2}) & -\sin(i\theta_{d/2}) \\ 0 & 0 & 0 & 0 & \ldots & \sin(i\theta_{d/2}) & \cos(i\theta_{d/2}) \end{bmatrix} \]
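
The rotary encoding summarized above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names `rope_angles` and `rope_rotate`, the default base of 10000 (the usual RoPE choice, standing in for the unspecified \(n\)), and the toy vectors are all assumptions made for illustration. The sketch applies the block-diagonal rotation pair by pair, then checks the property that makes the encoding relative: the dot product of two rotated vectors depends only on the offset between their positions. Per the formulas above, the same rotation would apply to the SSM projections \(C\) and \(B\) as to \(Q\) and \(K\).

```python
import math


def rope_angles(d, base=10000.0):
    """Frequencies theta_j = base^(-2(j-1)/d) for j = 1..d/2 (base is an assumed value)."""
    return [base ** (-2.0 * (j - 1) / d) for j in range(1, d // 2 + 1)]


def rope_rotate(x, pos, base=10000.0):
    """Apply the block-diagonal rotation for position `pos` to an even-length vector x."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = [0.0] * d
    for j, theta in enumerate(rope_angles(d, base)):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * j], x[2 * j + 1]
        # Each consecutive pair (a, b) is rotated by the angle pos * theta_j.
        out[2 * j] = c * a - s * b
        out[2 * j + 1] = s * a + c * b
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy projected vectors (hypothetical values, already multiplied by W_Q / W_K).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 2))
s2 = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(s1 - s2) < 1e-9)
```

Rotating pair by pair avoids materializing the \(d \times d\) matrix, which is mostly zeros; the result is the same as the full matrix product.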

Wonderful Matrices: More Efficient and Effective Architecture for Language Modeling Tasks

Hierarchical and Bidirectional Joint Multi-Task Classifiers for Natural Language Understanding

OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models

Enhanced Structured State Space Models via Grouped FIR Filtering and Attention Sink Mechanisms

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Learning Architectures from an Extended Search Space for Language Modeling

Efficient State Space Model via Fast Tensor Convolution and Block Diagonalization

Diagonal State Spaces are as Effective as Structured State Spaces

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Hymba: A Hybrid-head Architecture for Small Language Models

Longhorn: State Space Models are Amortized Online Learners

Expert-Token Resonance: Redefining MoE Routing through Affinity-Driven Active Selection

IEKM: A Model Incorporating External Keyword Matrices

Efficient Long Sequence Modeling Via State Space Augmented Transformer

A Closer Look into Mixture-of-Experts in Large Language Models

Sparse Mamba: Introducing Controllability, Observability, And Stability To Structural State Space Models

Efficient Multimodal Large Language Models: A Survey

MM-LLMs: Recent Advances in MultiModal Large Language Models

Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences