Self-attention Mechanism at the Token Level: Gradient Analysis and Algorithm Optimization

Linqing Liu, Xiaolong Xu
DOI: https://doi.org/10.1016/j.knosys.2023.110784
IF: 8.139
2023-01-01
Knowledge-Based Systems
Abstract: The self-attention mechanism is a feature processing mechanism for structured data in deep learning models. It has been widely used in transformer-based deep learning models and has demonstrated superior performance in various fields, such as machine translation, speech recognition, text-to-text conversion, and computer vision. The self-attention mechanism mainly focuses on the surface structure of structured data, but it also involves attention between basic data units and self-attention of basic data units in the deeper structure of the data. In this paper, we investigate the forward attention flow and backward gradient flow in the self-attention module of the transformer model based on the sequence-to-sequence data structure used in machine translation tasks. We found that this combination produces a “gradient distortion” phenomenon at the token level of basic data units. We consider this phenomenon a defect and propose a series of solutions to address it theoretically. Furthermore, we conduct experiments and select the most robust solution as the Unevenness-Reduced Self-Attention (URSA) module, which replaces the original self-attention module. The experimental results demonstrate that the “gradient distortion” phenomenon exists both theoretically and numerically, and the URSA module enables the self-attention mechanism to achieve consistent, stable, and effective optimization across different models, tasks, corpora, and evaluation metrics. The URSA module is both simple and highly portable.