Abstract: Analyzing the similarity of internal representations within and across different models has been an important technique for understanding the behavior of deep neural networks. Most existing methods for analyzing the similarity between representations of high dimensions, such as those based on Canonical Correlation Analysis (CCA) and widely used Centered Kernel Alignment (CKA), rely on statistical properties of the representations for a set of data points. In this paper, we focus on transformer models and study the similarity of representations between the hidden layers of individual transformers. In this context, we show that a simple sample-wise cosine similarity metric is capable of capturing the similarity and aligns with the complicated CKA. Our experimental results on common transformers reveal that representations across layers are positively correlated, albeit the similarity decreases when layers are far apart. We then propose an aligned training approach to enhance the similarity between internal representations, with trained models that enjoy the following properties: (1) the last-layer classifier can be directly applied right after any hidden layers, yielding intermediate layer accuracies much higher than those under standard training, (2) the layer-wise accuracies monotonically increase and reveal the minimal depth needed for the given task, (3) when served as multi-exit models, they achieve on-par performance with standard multi-exit architectures which consist of additional classifiers designed for early exiting in shallow layers. To our knowledge, our work is the first to show that one common classifier is sufficient for multi-exit models. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
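The sample-wise cosine similarity mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes each layer yields one feature vector per sample (for example a class-token or pooled feature) and that the per-sample cosine values are averaged over the data set; the function name `layerwise_cosine_similarity` and the synthetic "random walk" hidden states are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import numpy as np


def layerwise_cosine_similarity(hidden_states):
    """Average sample-wise cosine similarity between every pair of layers.

    hidden_states: array of shape (num_layers, num_samples, dim), one
    feature vector per layer and sample (an assumption of this sketch).
    Returns a (num_layers, num_layers) matrix.
    """
    h = np.asarray(hidden_states, dtype=np.float64)
    # Normalize every feature vector to unit length (guarding against zeros).
    norms = np.linalg.norm(h, axis=-1, keepdims=True)
    unit = h / np.clip(norms, 1e-12, None)
    # For each sample n, dot product between its layer-i and layer-j features.
    per_sample = np.einsum("ind,jnd->ijn", unit, unit)
    # Average the per-sample cosine values over the samples.
    return per_sample.mean(axis=-1)


# Synthetic stand-in for real hidden states: each layer adds noise to the
# previous one, so distant layers drift apart.
rng = np.random.default_rng(0)
num_layers, num_samples, dim = 6, 32, 64
states = [rng.normal(size=(num_samples, dim))]
for _ in range(num_layers - 1):
    states.append(states[-1] + 0.5 * rng.normal(size=(num_samples, dim)))

sim = layerwise_cosine_similarity(np.stack(states))
print(np.round(sim[0], 3))  # similarity of layer 0 to every layer
```

Unlike CCA or CKA, this quantity needs no statistics across the data set beyond the final average: each sample contributes its own cosine value, which is what makes the metric cheap to compute.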

What problem does this paper attempt to address?

This paper asks whether a single classifier can serve every exit of a multi-exit model, so that accuracy is preserved while the computational cost of inference is reduced. The authors focus on Transformer models, which perform well on vision and natural language processing tasks but become harder to understand and to deploy efficiently as they grow. They propose an aligned training method that increases the similarity of internal representations across layers, so that the last-layer classifier can be applied directly after any hidden layer to exit early and save computation. Beyond improving layer-wise similarity, the method lets the model reach high accuracy with fewer layers, helps identify the minimum number of layers a given task requires, and, when the model is used as a multi-exit model, achieves with a single classifier performance comparable to standard multi-exit architectures.

### Main contributions:

1. **Measurement of similarity of representations between layers**: The authors introduce a sample-wise cosine similarity metric to measure the similarity of internal representations between different layers of a Transformer. Experiments show that this simple metric agrees with more complex methods based on statistical properties (such as CKA) and reflects the similarity of representations between layers.
2. **Aligned training method**: To improve multi-exit models, the authors propose an aligned training method that trains the model by minimizing a weighted average of the cross-entropy losses of all layers. This significantly increases the similarity of representations between layers and thereby the accuracy at each layer.
3. **Multi-exit model with a single classifier**: The authors show how to build a multi-exit model with a single classifier, which matters for the inference efficiency of large-scale models. Experimental results show that models trained with aligned training exit more often in early layers and achieve a significant speedup while maintaining high accuracy.

### Experimental verification:

- **Vision tasks**: On the ImageNet dataset, aligned training significantly increases the exit rate in early layers while maintaining high classification accuracy.
- **Natural language processing tasks**: On NLP tasks, such as text classification with BERT and text generation with GPT-2, aligned training is also effective and reduces computational cost while maintaining performance.

In conclusion, the paper addresses the use of a single classifier in multi-exit models by proposing a new aligned training method, which improves intermediate-layer performance and offers a new direction for the efficient deployment of large-scale models.
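The training objective and the single-classifier early exit described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the shared classifier is taken to be a linear layer `(W, b)`, the per-layer `weights` are a free parameter, and the exit rule (stop at the first layer whose maximum softmax probability reaches a `threshold`) is an assumption, since the summary does not specify the exit criterion. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def aligned_loss(hidden_states, labels, W, b, weights):
    """Weighted average of per-layer cross-entropy losses.

    The same classifier (W, b) is applied to the features of every layer.
    hidden_states: (num_layers, num_samples, dim); labels: (num_samples,).
    Returns (weighted average loss, array of per-layer losses).
    """
    weights = np.asarray(weights, dtype=np.float64)
    idx = np.arange(labels.shape[0])
    losses = []
    for h in hidden_states:
        probs = softmax(h @ W + b)
        # Cross-entropy of the shared classifier at this layer.
        losses.append(-np.log(probs[idx, labels] + 1e-12).mean())
    losses = np.array(losses)
    return float((weights * losses).sum() / weights.sum()), losses


def early_exit(sample_states, W, b, threshold=0.9):
    """Exit at the first layer whose confidence reaches the threshold.

    sample_states: (num_layers, dim) features of one sample (non-empty).
    The confidence-threshold rule is an assumption of this sketch.
    Returns (layer index, predicted class); falls back to the last layer.
    """
    for layer, h in enumerate(sample_states):
        probs = softmax(h @ W + b)
        if probs.max() >= threshold:
            return layer, int(probs.argmax())
    return layer, int(probs.argmax())
```

In an actual training loop the weighted loss would be minimized with respect to both the Transformer parameters and the shared classifier; the sketch only evaluates it. Because every layer is scored by the same `(W, b)`, no extra exit heads are needed at inference time.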

On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier

Representation Alignment Contrastive Regularization for Multi-Object Tracking

Single-layer vision transformers for more accurate early exits with less overhead

Representational Strengths and Limitations of Transformers

Dynamic Transformers Provide a False Sense of Efficiency

Similarity of Neural Network Representations Revisited

An Intrinsic Dimension Perspective of Transformers for Sequential Modeling

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

A Multi-View Model Fusion Network with Double Branch Structure

Contrastive Multi-View Multiplex Network Embedding with Applications to Robust Network Alignment

Reducing the Transformer Architecture to a Minimum

SIMformer: Single-Layer Vanilla Transformer Can Learn Free-Space Trajectory Similarity

BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning

Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

Bridging the Knowledge Gap via Transformer-Based Multi-Layer Correlation Learning

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Multi-head or Single-head? An Empirical Comparison for Transformer Training

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

Transformer with Layer Fusion and Interaction

Explaining Text Similarity in Transformer Models

Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning