Abstract: In the realm of deep learning, the self-attention mechanism has substantiated its pivotal role across a myriad of tasks, encompassing natural language processing and computer vision. Despite achieving success across diverse applications, the traditional self-attention mechanism primarily leverages linear transformations for the computation of query, key, and value (QKV), which may not invariably be the optimal choice under specific circumstances. This paper probes into a novel methodology for QKV computation: implementing a specially-designed neural network structure for the calculation. Utilizing a modified Marian model, we conducted experiments on the IWSLT 2017 German-English translation task dataset and juxtaposed our method with the conventional approach. The experimental results unveil a significant enhancement in BLEU scores with our method. Furthermore, our approach also manifested superiority when training the RoBERTa model with the Wikitext-103 dataset, reflecting a notable reduction in model perplexity compared to its original counterpart. These experimental outcomes not only validate the efficacy of our method but also reveal the immense potential in optimizing the self-attention mechanism through neural network-based QKV computation, paving the way for future research and practical applications. The source code and implementation details for our proposed method can be accessed at https://github.com/ocislyjrti/NeuralAttention.

What problem does this paper attempt to address?

The paper addresses a limitation of the traditional self-attention mechanism: Query (Q), Key (K), and Value (V) are computed mainly by linear transformations, which in some cases may not fully capture the complex patterns and nonlinear relationships in the input data. Specifically:

- **Problem statement**: The traditional self-attention mechanism calculates QKV through a linear transformation. Because this is a purely linear mapping, it may lack the capacity to model complex patterns and nonlinear relationships, so its expressive power is limited in some scenarios.
- **Research motivation**: Non-linear transformations, such as those implemented by neural networks, can usually capture more complex features and patterns. The main motivation of the paper is therefore to use neural networks to enhance the QKV calculation in the self-attention mechanism and so improve its expressive power and performance.

To verify the effectiveness of this method, the authors propose a model based on the multi-layer perceptron (MLP) and evaluate it experimentally. The reported results show that the new method significantly outperforms the traditional method on BLEU score and perplexity, which supports the use of neural networks for QKV calculation.

### Formula representation

The QKV calculation in the traditional self-attention mechanism is:

\[ Q = W_q X, \quad K = W_k X, \quad V = W_v X \]

where \(W_q\), \(W_k\) and \(W_v\) are weight matrices, and \(X\) is the input.

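
As a point of reference, the linear QKV computation above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the dimensions, the random weights, and the function names `linear_qkv` and `attention` are assumptions made for the example. The scaled dot-product attention step is the standard formulation and is not spelled out in the text above. Tokens are stored as rows of `X`, so \(Q = W_q X\) is written as `X @ W_q.T`.

```python
import numpy as np

def linear_qkv(X, W_q, W_k, W_v):
    """Linear projections: each token x (a row of X) maps to W_q x, W_k x, W_v x."""
    return X @ W_q.T, X @ W_k.T, X @ W_v.T

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Standard scaled dot-product attention (assumed here, single head)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_head, d_model)) * 0.1 for _ in range(3))

Q, K, V = linear_qkv(X, W_q, W_k, W_v)
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Because the projection is linear, scaling the input scales Q, K, and V by the same factor. This is the property the paper's non-linear variant gives up.
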
In the method proposed in this paper, the QKV calculation is implemented through a multi-layer perceptron (MLP):

\[ Q = \text{MLP}_q(X), \quad K = \text{MLP}_k(X), \quad V = \text{MLP}_v(X) \]

where the specific form of the MLP is:

\[ \text{MLP}(X) = W_2 \cdot \sigma(\text{LayerNorm}(W_1 X + b_1)) + b_2 \]

- \(X\) is the input,
- \(W_1\) and \(b_1\) are the weight and bias of the first layer respectively,
- \(\sigma\) represents the ReLU activation function,
- \(\text{LayerNorm}\) represents the layer normalization operation,
- \(W_2\) and \(b_2\) are the weight and bias of the second layer respectively.

By introducing a non-linear transformation, the new method can better capture the complex patterns and non-linear relationships in the input data, thereby improving the performance of the model.
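
The MLP form above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation (which is in the linked repository): the hidden width `d_hidden`, the zero-initialised biases, the omission of LayerNorm's learnable scale and shift, and the helper names `layer_norm`, `mlp_projection`, and `init_mlp` are all assumptions made for the example.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Normalise each token's features to zero mean and unit variance.
    # Assumption: the learnable scale and shift are omitted for brevity.
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def mlp_projection(X, W1, b1, W2, b2):
    """MLP(X) = W2 . ReLU(LayerNorm(W1 X + b1)) + b2, with tokens as rows of X."""
    h = X @ W1.T + b1        # first layer: W1 X + b1
    h = layer_norm(h)        # LayerNorm
    h = np.maximum(h, 0.0)   # ReLU activation (sigma)
    return h @ W2.T + b2     # second layer: W2 (.) + b2

def init_mlp(rng, d_in, d_hidden, d_out):
    # Hypothetical initialisation: small random weights, zero biases.
    W1 = rng.normal(size=(d_hidden, d_in)) * 0.1
    W2 = rng.normal(size=(d_out, d_hidden)) * 0.1
    return W1, np.zeros(d_hidden), W2, np.zeros(d_out)

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 16
X = rng.normal(size=(seq_len, d_model))

# One independent MLP each for Q, K, and V.
params_q = init_mlp(rng, d_model, d_hidden, d_model)
params_k = init_mlp(rng, d_model, d_hidden, d_model)
params_v = init_mlp(rng, d_model, d_hidden, d_model)

Q = mlp_projection(X, *params_q)
K = mlp_projection(X, *params_k)
V = mlp_projection(X, *params_v)
print(Q.shape, K.shape, V.shape)
```

Unlike the linear projection, this mapping does not commute with scaling: `mlp_projection(2 * X, *params_q)` is not `2 * Q`, because LayerNorm and ReLU sit between the two linear layers.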

Neural Attention: Enhancing QKV Calculation in Self-Attention Mechanism with Neural Networks

Re-examining Lexical and Semantic Attention: Dual-view Graph Convolutions Enhanced BERT for Academic Paper Rating.

Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

Attention-via-Attention Neural Machine Translation

Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension

Integrating Multi-Head Convolutional Encoders with Cross-Attention for Improved SPARQL Query Translation

Neural Machine Translation with Supervised Attention

Learning When to Attend for Neural Machine Translation

Effective Approaches to Attention-based Neural Machine Translation

Self-Attention and Dynamic Convolution Hybrid Model for Neural Machine Translation

Neural Machine Translation with Attention Based on a New Syntactic Branch Distance

Neural Machine Translation with Recurrent Attention Modeling

Improving Autoregressive NLP Tasks via Modular Linearized Attention

Interactive Attention for Neural Machine Translation

Research on Intelligent English Translation Method Based on the Improved Attention Mechanism Model

Abstractive Summarization Using Attentive Neural Techniques

Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration

Neural Machine Translation with Key-Value Memory-Augmented Attention

Improved Blending Attention Mechanism in Visual Question Answering

Quantum Self-Attention Neural Networks for Text Classification