Inverting Visual Representations with Detection Transformers

Jan Rathjens, Shirin Reyhanian, David Kappel, Laurenz Wiskott
2024-12-09
Abstract: Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many prior approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we apply the approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer, showing that this approach is efficient and feasible for transformer-based vision models. Through qualitative and quantitative evaluations of reconstructed images across model stages, we demonstrate critical properties of Detection Transformers, including contextual shape preservation, inter-layer correlation, and robustness to color perturbations, illustrating how these characteristics emerge within the model's architecture. Our findings contribute to a deeper understanding of transformer-based vision models. The code for reproducing our experiments will be made available at http://github.com/wiskott-lab/inverse-detection-transformer.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses **the challenge of understanding the internal mechanisms of Transformer-based vision models such as the Detection Transformer (DETR)**. Although these models perform well on computer vision tasks such as object detection, semantic segmentation, and image classification, their internal workings remain opaque and difficult to interpret. This opacity makes it hard to understand how a model arrives at its predictions and limits further optimization and improvement.

To address this, the authors use **feature inversion**: they train inverse models to reconstruct the input image from intermediate layers, revealing what information those intermediate representations contain. This lets researchers assess the information retained at each layer and gain a deeper understanding of how Transformer-based vision models work.

### Specific Research Objectives

1. **Extend the feature inversion technique to Transformer-based vision models**:
   - Previous feature inversion techniques were mainly applied to convolutional neural networks (CNNs); this paper extends them to more complex Transformer-based models such as DETR.
2. **Verify the effectiveness of feature inversion**:
   - Use qualitative and quantitative evaluation of the reconstructed images to show key properties of DETR: contextual shape preservation, inter-layer correlation, and robustness to color perturbations.
3. **Explore the internal information processing mechanism of DETR**:
   - Analyze the reconstructed images at different stages to reveal how DETR gradually processes and transforms the information in the input image.

### Method Overview

To achieve these objectives, the authors took the following steps:

- **Modular feature inversion**: Perform feature inversion separately on different components of DETR (such as the backbone network, encoder, decoder, and prediction head), training a corresponding inverse model for each.
- **Qualitative and quantitative evaluation**: Compare the reconstructed images at different stages to analyze how features change and what information is lost as DETR processes an image.
- **Experimental verification**: Conduct experiments on grayscale image processing, the influence of color perturbations, and performance under different optimization strategies to evaluate the feature inversion method comprehensively.

With these methods, the authors demonstrate that feature inversion is feasible for Transformer-based vision models and provide insights for further understanding and development of these models.
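
The modular feature inversion idea can be illustrated with a deliberately tiny sketch. This is not the authors' implementation and does not use DETR. Everything in it is an assumption made for illustration: `backbone` and `encoder` are hypothetical frozen random `tanh` layers standing in for model stages, the "images" are random 4-value vectors, and each inverse model is a plain linear map fitted by full-batch gradient descent on the mean squared reconstruction error.

```python
import math
import random

random.seed(0)

def make_frozen_stage(n_in, n_out):
    """Return a fixed random layer f(x) = tanh(Wx); its weights are never updated."""
    weights = [[random.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    def stage(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]
    return stage

def reconstruction_error(inverse, features, targets):
    """Mean squared error between inverse(features) and the original inputs."""
    weights, bias = inverse
    total = 0.0
    for f, x in zip(features, targets):
        for row, b, target in zip(weights, bias, x):
            pred = sum(w * v for w, v in zip(row, f)) + b
            total += (pred - target) ** 2
    return total / (len(targets) * len(targets[0]))

def train_inverse(features, targets, steps=200, lr=0.1):
    """Fit a linear inverse model (features -> input) by full-batch gradient descent."""
    n_feat, n_out = len(features[0]), len(targets[0])
    weights = [[0.0] * n_feat for _ in range(n_out)]
    bias = [0.0] * n_out
    scale = 2.0 / (len(targets) * n_out)  # derivative factor of the mean squared error
    for _ in range(steps):
        grad_w = [[0.0] * n_feat for _ in range(n_out)]
        grad_b = [0.0] * n_out
        for f, x in zip(features, targets):
            for i in range(n_out):
                err = sum(w * v for w, v in zip(weights[i], f)) + bias[i] - x[i]
                grad_b[i] += scale * err
                for j in range(n_feat):
                    grad_w[i][j] += scale * err * f[j]
        for i in range(n_out):
            bias[i] -= lr * grad_b[i]
            for j in range(n_feat):
                weights[i][j] -= lr * grad_w[i][j]
    return weights, bias

# Toy "images": 64 random vectors with 4 values in [0, 1).
images = [[random.random() for _ in range(4)] for _ in range(64)]
backbone = make_frozen_stage(4, 4)  # hypothetical stand-in for an early stage
encoder = make_frozen_stage(4, 2)   # hypothetical stand-in for a later, narrower stage

# Features are collected per stage; the forward stages stay frozen throughout.
stage_features = {
    "backbone": [backbone(x) for x in images],
    "encoder": [encoder(backbone(x)) for x in images],
}

untrained_error = {}
trained_error = {}
for name, feats in stage_features.items():
    # Baseline: an all-zero inverse model that always predicts a black image.
    zero_inverse = ([[0.0] * len(feats[0]) for _ in images[0]], [0.0] * len(images[0]))
    untrained_error[name] = reconstruction_error(zero_inverse, feats, images)
    trained_error[name] = reconstruction_error(train_inverse(feats, images), feats, images)
    print(f"{name}: {untrained_error[name]:.4f} -> {trained_error[name]:.4f}")
```

In this toy setting, comparing the entries of `trained_error` across stages plays the role of the per-stage reconstruction comparison described above. A real experiment would instead use a deep learning framework, a pretrained DETR, actual image data, and more expressive inverse models.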