Abstract: In the realm of deep learning, the self-attention mechanism has proven pivotal across a wide range of tasks, including natural language processing and computer vision. Despite this success, the traditional self-attention mechanism primarily uses linear transformations to compute the query, key, and value (QKV), which may not always be the optimal choice. This paper explores a novel approach to QKV computation: using a specially designed neural network structure for the calculation. Using a modified Marian model, we conducted experiments on the IWSLT 2017 German-English translation task dataset and compared our method with the conventional approach. The experimental results show a significant improvement in BLEU scores with our method. Our approach also performed better when training the RoBERTa model on the WikiText-103 dataset, with a notable reduction in model perplexity compared to the original model. These results not only validate the efficacy of our method but also indicate the potential of optimizing the self-attention mechanism through neural network-based QKV computation, paving the way for future research and practical applications. The source code and implementation details for our proposed method are available at https://github.com/ocislyjrti/NeuralAttention.
What problem does this paper attempt to address?
The paper addresses a limitation of the traditional self-attention mechanism: Query (Q), Key (K), and Value (V) are computed with linear transformations, which may not fully capture complex patterns and nonlinear relationships in the input data. Specifically:
- **Problem statement**: The traditional self-attention mechanism computes QKV through a linear mapping of the input. A linear mapping cannot by itself represent nonlinear relationships, so its expressive power is limited in some scenarios.
- **Research motivation**: Nonlinear transformations, such as those implemented by neural networks, can usually capture more complex features and patterns. The paper therefore explores using a neural network for the QKV computation in self-attention, with the aim of improving expressive power and performance.
To test this idea, the authors replace the linear QKV projections with a multi-layer perceptron (MLP) and evaluate the result experimentally. According to the abstract, the method gives a significant BLEU improvement on the IWSLT 2017 German-English translation task with a modified Marian model, and a notable reduction in perplexity when training RoBERTa on WikiText-103, which supports the use of neural networks for QKV computation.
### Formula representation
The QKV calculation in the traditional self-attention mechanism is:
\[ Q = W_q X, \quad K = W_k X, \quad V = W_v X \]
where \(W_q\), \(W_k\) and \(W_v\) are weight matrices, and \(X\) is the input.
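As a minimal sketch of the linear projection above, the following pure-Python example applies \(W_q\), \(W_k\), and \(W_v\) to each token vector of \(X\). The helper names (`matvec`, `linear_qkv`) and the toy matrices are assumptions made for illustration and are not taken from the paper's code.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def linear_qkv(X, W_q, W_k, W_v):
    """Apply Q = W_q x, K = W_k x, V = W_v x to every token vector x in X."""
    Q = [matvec(W_q, x) for x in X]
    K = [matvec(W_k, x) for x in X]
    V = [matvec(W_v, x) for x in X]
    return Q, K, V


# Two tokens with model dimension 2 (toy values, chosen only for illustration).
X = [[1.0, 2.0], [0.0, -1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]  # identity
W_k = [[2.0, 0.0], [0.0, 2.0]]  # scale by 2
W_v = [[0.0, 1.0], [1.0, 0.0]]  # swap the two coordinates

Q, K, V = linear_qkv(X, W_q, W_k, W_v)
print(Q, K, V)
```

Each output is a fixed linear function of the input token, which is the limitation the proposed method targets.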
In the method proposed in this paper, the QKV calculation is implemented through a multi-layer perceptron (MLP):
\[ Q = \text{MLP}_q(X), \quad K = \text{MLP}_k(X), \quad V = \text{MLP}_v(X) \]
where the specific form of MLP is:
\[ \text{MLP}(X) = W_2 \cdot \sigma(\text{LayerNorm}(W_1 X + b_1)) + b_2 \]
- \(X\) is the input,
- \(W_1\) and \(b_1\) are the weight and bias of the first layer respectively,
- \(\sigma\) represents the ReLU activation function,
- \(\text{LayerNorm}\) represents the layer normalization operation,
- \(W_2\) and \(b_2\) are the weight and bias of the second layer respectively.
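The MLP form above can be sketched for a single token vector as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the layer normalization here has no learnable scale or shift, the `eps` value is an assumed default, and all function names and toy weights are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def layer_norm(h, eps=1e-5):
    """Normalize h to zero mean and unit variance (no learnable parameters; assumption)."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def relu(h):
    """Elementwise ReLU, the activation denoted sigma in the formula."""
    return [max(0.0, v) for v in h]


def mlp_projection(x, W1, b1, W2, b2):
    """Compute W2 * relu(LayerNorm(W1 x + b1)) + b2 for one token vector x."""
    hidden = [a + b for a, b in zip(matvec(W1, x), b1)]
    hidden = relu(layer_norm(hidden))
    return [a + b for a, b in zip(matvec(W2, hidden), b2)]


# Toy weights (illustrative only): identity layers with a small output bias.
x = [1.0, 3.0]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]

q = mlp_projection(x, W1, b1, W2, b2)
print(q)
```

In the proposed method, one such MLP with its own weights would be used for each of Q, K, and V. Even with identity weights the output differs from the input, because normalization and ReLU are applied between the two layers.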
By introducing a nonlinear transformation into the QKV computation, the new method is intended to capture complex patterns and nonlinear relationships in the input data that a single linear mapping cannot, which the paper reports improves model performance.