Abstract: The self-attention mechanism (SAM) is widely used across artificial intelligence and has boosted the performance of many models. However, current explanations of the mechanism rest mainly on intuition and experience, and a direct model of how the SAM improves performance is still lacking. To address this, we build on the dynamical-system perspective of residual neural networks and first show that the intrinsic stiffness phenomenon (SP), which arises in the high-precision solution of ordinary differential equations (ODEs), is also widespread in high-performance neural networks (NNs). A network's ability to measure SP at the feature level is therefore necessary for high performance and is an important factor in how difficult the network is to train. By analogy with adaptive step-size methods, which are effective for stiff ODEs, we show that the SAM is also a stiffness-aware step-size adaptor: it enhances the model's representational ability to measure intrinsic SP by refining the estimate of stiffness information and generating adaptive attention values. This provides a new understanding of why and how the SAM benefits model performance. This perspective also explains the lottery ticket hypothesis in SAM, suggests new quantitative metrics of representational ability, and inspires a new theoretically motivated approach, StepNet. Extensive experiments on several popular benchmarks demonstrate that StepNet can extract fine-grained stiffness information and measure SP accurately, leading to significant improvements in various visual tasks.
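
The analogy in the abstract between adaptive step-size ODE solvers and attention can be illustrated with a small numerical sketch. It is not the StepNet method: it uses classical step-doubling error control as a stand-in, applied to a hypothetical stiff test equation y' = -50y. The fixed-step update plays the role of a plain residual block, x + f(x), and the adaptive update plays the role of a gated block, x + a * f(x), where the scale a is chosen from a local error estimate. All function names, the test equation, and the tolerance are assumptions made for illustration.

```python
def euler_fixed(f, y0, h, n_steps):
    """Forward Euler with a constant step h: y <- y + h * f(y)."""
    y = y0
    for _ in range(n_steps):
        y = y + h * f(y)
    return y


def euler_adaptive(f, y0, t_end, h0, tol=1e-3):
    """Forward Euler with step-doubling error control (illustrative only).

    One step of size h is compared with two steps of size h/2. A large gap
    means the local dynamics are stiff relative to h, so h is halved; a very
    small gap lets h double for the next step.
    """
    t, y, h = 0.0, y0, h0
    accepted_steps = []
    while t_end - t > 1e-12:
        h = min(h, t_end - t)                    # do not step past t_end
        full = y + h * f(y)                      # one step of size h
        half = y + 0.5 * h * f(y)
        two_half = half + 0.5 * h * f(half)      # two steps of size h/2
        err = abs(two_half - full)               # local error estimate
        if err > tol:
            h *= 0.5                             # reject and retry with a smaller step
            continue
        t += h
        y = two_half
        accepted_steps.append(h)
        if err < 0.25 * tol:
            h *= 2.0                             # dynamics look smooth: enlarge the step
    return y, accepted_steps


def stiff_rhs(y, lam=50.0):
    """Hypothetical stiff test problem y' = -lam * y; the exact solution decays to 0."""
    return -lam * y


# Integrate from t = 0 to t = 1 starting at y = 1.
y_fixed = euler_fixed(stiff_rhs, 1.0, h=0.05, n_steps=20)
y_adapt, steps = euler_adaptive(stiff_rhs, 1.0, t_end=1.0, h0=0.05)

print("fixed step   :", y_fixed)
print("adaptive step:", y_adapt)
print("smallest / largest accepted step:", min(steps), max(steps))
```

With the fixed step, each update multiplies y by 1 - 50 * 0.05 = -1.5, so the iterate grows in magnitude even though the true solution decays. The adaptive variant shrinks its step while the solution changes quickly and enlarges it afterwards, which is the behaviour the abstract attributes, by analogy, to attention values that scale a residual update.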

IDMN: A Two-Pass Attention Mechanism in Dynamic Memory Network

Deep Neural Networks Evolve Human-like Attention Distribution during Reading Comprehension

Deep Neural Networks Evolve Human-like Attention Distribution during Goal-directed Reading Comprehension

Task Optimization Leads to Human-like Top-down and Bottom-up Attention during Reading Comprehension

Ask Me Even More: Dynamic Memory Tensor Networks (Extended Model)

A generic shared attention mechanism for various backbone neural networks

PAM: Pyramid Attention Mechanism Based on Contextual Reasoning

Reinforced Mnemonic Reader for Machine Reading Comprehension

Understanding Self-attention Mechanism via Dynamical System Perspective

Modeling Intra-Relation in Math Word Problems with Different Functional Multi-Head Attentions

Pay More Attention - Neural Architectures for Question-Answering

DIANet: Dense-and-Implicit Attention Network

Dynamic Fusion Networks for Machine Reading Comprehension

Can Active Memory Replace Attention?

Attention module improves both performance and interpretability of 4D fMRI decoding neural network

A Generalized Attention Mechanism to Enhance the Accuracy Performance of Neural Networks

Attention module improves both performance and interpretability of four‐dimensional functional magnetic resonance imaging decoding neural network

Social Attentional Memory Network

Bi-Directional Attention Flow for Machine Comprehension

BAFN: Bi-Direction Attention Based Fusion Network for Multimodal Sentiment Analysis