Abstract: Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many prior approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we apply the approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer, showing that this approach is efficient and feasible for transformer-based vision models. Through qualitative and quantitative evaluations of reconstructed images across model stages, we demonstrate critical properties of Detection Transformers, including contextual shape preservation, inter-layer correlation, and robustness to color perturbations, illustrating how these characteristics emerge within the model's architecture. Our findings contribute to a deeper understanding of transformer-based vision models. The code for reproducing our experiments will be made available at <a class="link-external link-http" href="http://github.com/wiskott-lab/inverse-detection-transformer" rel="external noopener nofollow">github.com/wiskott-lab/inverse-detection-transformer</a>.

What problem does this paper attempt to address?

This paper addresses **the challenge of understanding the internal mechanisms of Transformer-based vision models such as the Detection Transformer (DETR)**. Although these models perform well on computer vision tasks such as object detection, semantic segmentation, and image classification, their internal workings remain opaque and difficult to interpret. This opacity hinders understanding of how the models arrive at their predictions and limits further optimization and improvement.

To address this, the authors use **feature inversion**: they train inverse models to reconstruct the input image from intermediate layers, revealing what information those intermediate representations contain. This lets researchers evaluate the information retained at each layer and gain an in-depth understanding of how Transformer-based vision models work.

### Specific Research Objectives

1. **Extend the feature inversion technique to Transformer-based vision models**: Previous feature inversion techniques were mainly applied to convolutional neural networks (CNNs); this paper extends them to more complex Transformer-based models such as DETR.
2. **Verify the effectiveness of feature inversion**: Through qualitative and quantitative evaluation of the reconstructed images, show key properties of DETR, such as contextual shape preservation, inter-layer correlation, and robustness to color perturbations.
3. **Explore the internal information processing mechanism of DETR**: Analyze the reconstructed images at different stages to reveal how DETR progressively processes and transforms the information in the input image.
### Method Overview

To achieve these objectives, the authors took the following steps:

- **Modular feature inversion**: Invert the features of each DETR component (backbone network, encoder, decoder, and prediction head) separately, training a corresponding inverse model for each.
- **Qualitative and quantitative evaluation**: Compare the reconstructed images across stages to analyze how features change and what information is lost as DETR processes an image.
- **Experimental verification**: Run experiments on grayscale images, color perturbations, and different optimization strategies to evaluate the feature inversion method.

With these methods, the authors demonstrate that feature inversion is feasible for Transformer-based vision models and provide insights for further understanding and developing these models.
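The idea of modular feature inversion can be illustrated with a toy sketch. This is not the authors' implementation. It assumes two frozen linear "stages" standing in for DETR components, tiny 3-value "images", a linear inverse model per stage, and mean squared error as the reconstruction loss (the summary does not say which loss the paper uses). All names, such as `STAGE_A`, `STAGE_B`, and `train_inverse`, are hypothetical.

```python
import random

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mse(preds, targets):
    """Mean squared error over all samples and all elements."""
    total = sum((p - t) ** 2
                for pv, tv in zip(preds, targets)
                for p, t in zip(pv, tv))
    return total / (len(targets) * len(targets[0]))

def train_inverse(features, images, steps=300, lr=0.5):
    """Fit a linear inverse model V by gradient descent so V @ feature ~ image."""
    n, d_img, d_feat = len(images), len(images[0]), len(features[0])
    v = [[0.0] * d_feat for _ in range(d_img)]
    for _ in range(steps):
        grad = [[0.0] * d_feat for _ in range(d_img)]
        for z, x in zip(features, images):
            err = [p - t for p, t in zip(matvec(v, z), x)]
            for i in range(d_img):
                for j in range(d_feat):
                    grad[i][j] += 2.0 * err[i] * z[j] / n
        for i in range(d_img):
            for j in range(d_feat):
                v[i][j] -= lr * grad[i][j]
    return v

# Frozen toy stages (assumed stand-ins, not DETR): A keeps all information,
# B discards one dimension of A's output.
STAGE_A = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.5, 0.0, 1.0]]
STAGE_B = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

rng = random.Random(0)
images = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(64)]
feats_a = [matvec(STAGE_A, x) for x in images]
feats_b = [matvec(STAGE_B, z) for z in feats_a]

# One inverse model per stage; the stages themselves are never updated.
errors = {}
for name, feats in (("stage_a", feats_a), ("stage_b", feats_b)):
    inverse = train_inverse(feats, images)
    recon = [matvec(inverse, z) for z in feats]
    errors[name] = mse(recon, images)

print(errors)
```

In this toy setting the reconstruction error after the lossy stage is clearly larger than after the invertible one. That gap is the kind of per-stage signal the quantitative evaluation looks for when it compares reconstructions across model stages.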

Inverting Visual Representations with Detection Transformers
