Abstract: As the core technology of Transformers, the attention mechanism is almost indispensable. However, many experimental findings show that the models developed based on the attention mechanism are not as perfect as imagined, and there are pitfalls in their ability to capture effective information, especially in some multi-modal tasks. In this paper, we continue to delve into this issue and try to uncover the mysterious nature of the attention mechanism through powerful explainable causal inference techniques. At the theoretical level, we rigorously characterize the capacity bottleneck of the attention mechanism in multi-modal tasks and demonstrate the shortcomings of the attention model in its ability to weed out invalid features. Further, we obtain results consistent with the theoretical analysis in the experimental section. In particular, the model optimized under the guidance of our theoretical analysis achieves superiority over state-of-the-art methods in visual question-answering tasks. Excitingly, we find that the attention mechanism's defects can be repaired, and the repair method has strong generalization properties. This distinct advantage will provide a clear interpretable optimization technique for the attention-based framework.


### What problem does this paper attempt to solve?

This paper attempts to solve the performance bottleneck of the traditional attention mechanism in multi-modal tasks. Specifically, the authors focus on the following aspects:

1. **Defects of the traditional attention mechanism**:
   - Although the attention mechanism performs well in many tasks, experimental results show that it has defects in some multi-modal tasks, especially a limited ability to capture effective information.
   - The traditional attention mechanism may inevitably absorb some invalid features, and it lacks the ability to eliminate them.
2. **Causal inference perspective**:
   - The authors re-examine the working principle of the attention mechanism from the perspective of causal inference, attempting to reveal the root cause of its performance bottleneck.
   - Through causal graphs and causal reasoning techniques, the authors analyze the shortcomings of the attention mechanism when dealing with effective and invalid features.
3. **Theoretical and experimental verification**:
   - Theoretically, the authors rigorously characterize the capacity bottleneck of the attention mechanism in multi-modal tasks and demonstrate its deficiency in filtering invalid features.
   - The experimental part verifies the results of the theoretical analysis. In particular, in the visual question-answering task, the optimized model outperforms the existing state-of-the-art methods.
4. **Optimization method**:
   - The authors propose an optimization method that repairs the defects of the attention mechanism through causal reasoning techniques, giving it stronger generalization ability.
   - This optimization method is supported by the theoretical analysis and also shows advantages in experiments.

### Main contributions

1. **Innovatively formalize the working model of the attention mechanism as a disentangled representation problem**:
   - Clearly show that the traditional attention mechanism cannot distinguish between effective and invalid features.
2. **Explanatory guidance based on causal inference theory**:
   - Rely not only on quantitative experimental data, but also provide interpretable guidance based on causal inference theory to explore the performance bottleneck of the attention mechanism.
3. **Flexible and general solution**:
   - Provide a flexible and general method to improve the generalization ability of the attention mechanism and point out potential directions for future research.

### Summary

This paper explores the performance bottleneck of the attention mechanism in multi-modal tasks through causal inference techniques and proposes an optimization method aimed at improving the effectiveness and generalization ability of the attention mechanism. The approach provides new theoretical insights and is verified in experiments, offering a reference for future research.
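The point that attention absorbs invalid features without being able to eliminate them can be made concrete with a small sketch. This is an illustration only, not the paper's causal analysis or its repair method: it assumes standard scaled dot-product attention with a softmax, and the toy `query`, `keys`, and `values` below are hypothetical. Because softmax weights are strictly positive in exact arithmetic, a feature that scores poorly is down-weighted but still contributes to the output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical toy data: the first two features align with the query
# ("effective"), the third points the opposite way ("invalid").
query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.5, 0.0], [-2.0, 0.0]]
# The second output coordinate is fed only by the invalid feature.
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

weights, output = attend(query, keys, values)
print("attention weights:", weights)
print("output:", output)
```

In this toy case the invalid feature receives the smallest weight, yet the second output coordinate stays above zero, so some of its content leaks through. Removing it entirely would need an extra step such as hard masking or a sparse alternative to softmax, which is a design choice outside what the summary above describes.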

Rethinking the role of attention mechanism: a causality perspective

Task Optimization Leads to Human-like Top-down and Bottom-up Attention during Reading Comprehension

Attention: Marginal Probability is All You Need?

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Why Attentions May Not Be Interpretable?

From Cognition to Computation: A Comparative Review of Human Attention and Transformer Architectures

Towards Causal Foundation Model: on Duality between Causal Inference and Attention

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

The Attention Mechanism Demystified

Causal Attention for Vision-Language Tasks

Attention cannot be an Explanation

When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Attention Meets Post-hoc Interpretability: A Mathematical Perspective

The Costs and Benefits of Goal-Directed Attention in Deep Convolutional Neural Networks

Rethinking Attention Module Design for Point Cloud Analysis

Understanding More about Human and Machine Attention in Deep Neural Networks

Causal-Based Supervision of Attention in Graph Neural Network: A Better and Simpler Choice towards Powerful Attention

Attention in Reasoning: Dataset, Analysis, and Modeling

An Overview of the Attention Mechanisms in Computer Vision

Generic Attention-model Explainability by Weighted Relevance Accumulation