Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

Mingqi Jiang, Saeed Khorram, Li Fuxin
2024-06-24
Abstract: In order to gain insights about the decision-making of different visual recognition backbones, we propose two methodologies, sub-explanation counting and cross-testing, that systematically apply deep explanation algorithms on a dataset-wide basis and compare the statistics generated from the amount and nature of the explanations. These methodologies reveal differences among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions, whereas traditional CNNs and distilled transformers are less compositional and more disjunctive, which means that they use multiple diverse but smaller sets of parts to achieve a confident prediction. Through further experiments, we pinpoint the choice of normalization as especially important for the compositionality of a model, in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally, we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper compares the decision mechanisms of different visual recognition networks, including Transformer networks and Convolutional Neural Networks (CNNs), in image recognition tasks. Specifically, it seeks to answer the following key questions:

1. **Is there an inherent difference in the working mechanisms of Transformer networks and CNNs?**
   - Why do some Transformer networks seem to be more robust than CNNs?
2. **How do design principles affect network decision mechanisms?**
   - Recent work such as ConvNeXt has adopted design principles from Transformers to build deep convolution-based networks and achieved excellent results. Does this imply that the key factor is not the attention mechanism itself but these design principles?
   - If so, which specific design principles particularly affect the network's decision mechanisms?

To answer these questions, the authors propose a methodology that systematically analyzes the behavior of different network architectures by applying deep explanation algorithms. The analysis is not limited to interpreting a single image; it extracts statistics at the dataset level to gain a global understanding.

### Main Contributions

1. **Sub-explanation Counting**: A method for evaluating how networks handle partial evidence. It removes patches from the Minimal Sufficient Explanations (MSEs) and checks the likelihood ratio relative to the full image. This method reveals two characteristics of network behavior: **Compositionality** and **Disjunctivism**.
2. **Cross-testing**: A method for evaluating whether different networks use the same type of visual features. Explanations (image masks) are generated from one network, and the masked regions are then given as input to another network to determine whether the two models rely on similar visual features.

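
The sub-explanation counting idea above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the "network" is a toy scoring function `toy_score` with made-up per-patch weights, the MSE is taken as given rather than computed by an explanation algorithm, and a sub-explanation is taken to be a non-empty proper subset of the MSE's patches whose confidence ratio relative to the full image meets a chosen threshold.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet

Patches = FrozenSet[int]
Scorer = Callable[[Patches], float]

# Hypothetical per-patch evidence (in percent) for a toy "model"; not from the paper.
TOY_WEIGHTS: Dict[int, int] = {0: 40, 1: 30, 2: 20, 3: 5, 4: 3, 5: 2}


def toy_score(patches: Patches) -> float:
    """Toy stand-in for a network's class confidence when only `patches` are visible."""
    return sum(TOY_WEIGHTS[p] for p in patches) / 100


def count_sub_explanations(
    mse: Patches, score: Scorer, full_score: float, ratio_threshold: float
) -> int:
    """Count non-empty proper subsets of the MSE whose confidence,
    relative to the full image, is at least `ratio_threshold`."""
    count = 0
    for size in range(1, len(mse)):  # sizes 1 .. len(mse) - 1
        for subset in combinations(sorted(mse), size):
            if score(frozenset(subset)) / full_score >= ratio_threshold:
                count += 1
    return count


full_image = frozenset(TOY_WEIGHTS)   # all patches visible
mse = frozenset({0, 1, 2, 3})         # assumed output of an explanation algorithm
full_score = toy_score(full_image)

# Number of sub-explanations at several confidence-ratio thresholds.
counts = {
    t: count_sub_explanations(mse, toy_score, full_score, t)
    for t in (0.25, 0.5, 0.75)
}
print(counts)
```

In a real study the scorer would be a trained network evaluated on masked images, and the counts would be aggregated over a whole dataset rather than a single toy MSE.
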
### Key Findings

- **Differences in Compositionality and Disjunctivism**: Transformer models (especially non-distilled ones) and ConvNeXt are more compositional, meaning they jointly consider multiple parts of the image when making decisions. In contrast, traditional CNNs and distilled Transformers are less compositional and more disjunctive: they use multiple diverse but smaller sets of image parts to reach a confident prediction.
- **Impact of Normalization**: The choice of normalization layer (Batch Normalization, Group Normalization, or Layer Normalization) significantly affects a network's compositionality. Batch Normalization makes networks less compositional, while Group Normalization and Layer Normalization make them more compositional.
- **Feature Usage Landscape**: Through cross-testing, the authors map the feature-use similarity among different convolutional networks and Transformers, showing that different networks do use different visual features for classification.

In summary, this paper sheds light on the decision mechanisms of these black-box models through a systematic, explanation-based analysis of different types of visual recognition networks.
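
The cross-testing procedure described under Main Contributions can likewise be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's method: the three "models" are hypothetical scoring functions built by `make_scorer` from made-up patch weights, and `greedy_explanation` is a simple greedy stand-in for a real explanation algorithm. The sketch produces a mask from one model and measures how much of another model's full-image confidence survives on that mask.

```python
from typing import Callable, Dict, FrozenSet

Patches = FrozenSet[int]
Scorer = Callable[[Patches], float]


def make_scorer(weights: Dict[int, int]) -> Scorer:
    """Build a toy model: confidence is the summed weight (percent) of visible patches."""
    def score(patches: Patches) -> float:
        return sum(weights[p] for p in patches) / 100
    return score


def greedy_explanation(score: Scorer, all_patches: Patches, target_ratio: float = 0.9) -> Patches:
    """Greedy stand-in for an explanation algorithm: add the most helpful patch
    until the mask reaches `target_ratio` of the full-image confidence."""
    full = score(all_patches)
    chosen: Patches = frozenset()
    while score(chosen) < target_ratio * full:
        best = max(sorted(all_patches - chosen), key=lambda p: score(chosen | {p}))
        chosen = chosen | {best}
    return chosen


def cross_test(explainer: Scorer, tester: Scorer, all_patches: Patches) -> float:
    """Confidence ratio of `tester` on the mask that explains `explainer`."""
    mask = greedy_explanation(explainer, all_patches)
    return tester(mask) / tester(all_patches)


ALL_PATCHES = frozenset(range(4))
# Hypothetical models: A and B rely on patches 0 and 1, C relies on patches 2 and 3.
MODELS: Dict[str, Scorer] = {
    "A": make_scorer({0: 50, 1: 45, 2: 3, 3: 2}),
    "B": make_scorer({0: 48, 1: 47, 2: 3, 3: 2}),
    "C": make_scorer({0: 3, 1: 2, 2: 50, 3: 45}),
}

# cross[src][dst]: how well dst does on explanations generated from src.
cross = {
    src: {dst: cross_test(MODELS[src], MODELS[dst], ALL_PATCHES) for dst in MODELS}
    for src in MODELS
}
for src, row in cross.items():
    print(src, {dst: round(value, 2) for dst, value in row.items()})
```

In this toy setup, models that rely on the same patches score highly on each other's masks, while a model that relies on different patches does not. A matrix of such scores is one plausible basis for a feature-use similarity landscape.
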