Training objective drives the consistency of representational similarity across datasets

Laure Ciernik, Lorenz Linhardt, Marco Morik, Jonas Dippel, Simon Kornblith, Lukas Muttenthaler
2024-11-08
Abstract: The Platonic Representation Hypothesis claims that recent foundation models are converging to a shared representation space as a function of their downstream task performance, irrespective of the objectives and data modalities used to train these models. Representational similarity is generally measured for individual datasets and is not necessarily consistent across datasets. Thus, one may wonder whether this convergence of model representations is confounded by the datasets commonly used in machine learning. Here, we propose a systematic way to measure how representational similarity between models varies with the set of stimuli used to construct the representations. We find that the objective function is the most crucial factor in determining the consistency of representational similarities across datasets. Specifically, self-supervised vision models learn representations whose relative pairwise similarities generalize better from one dataset to another compared to those of image classification or image-text models. Moreover, the correspondence between representational similarities and the models' task behavior is dataset-dependent, being most strongly pronounced for single-domain datasets. Our work provides a framework for systematically measuring similarities of model representations across datasets and linking those similarities to differences in task behavior.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper asks whether representational similarity between models transfers consistently from one dataset to another. Specifically, the authors focus on two questions:

1. **Does representational similarity between models stay consistent across datasets?** The authors propose a systematic method to measure the representational similarity of model pairs on different datasets and to test whether those similarities agree from one dataset to the next.
2. **Which factors determine that consistency?** The authors examine the influence of training objective, model architecture, training data, and model size. They find that **the training objective is the most critical factor**; in particular, the representational similarities of self-supervised learning (SSL) models are more consistent across datasets.

### Research Background and Motivation

Deep learning models have made remarkable progress on computer vision tasks in recent years. Several studies suggest that, as model performance improves, the representation spaces of different models converge, meaning that their internal representations become increasingly similar. However, representational similarity is usually measured on a single dataset, so it is unclear whether it holds across datasets. The authors therefore measure and analyze the representational similarity of many models on multiple datasets to identify the factors that affect this consistency.

### Main Contributions

- **A method to measure the consistency of representational similarity**: by comparing the relative pairwise similarities of model representations on different datasets, the method evaluates whether these similarities carry over from one dataset to another.
- **The training objective is the most critical factor in determining consistency**: compared with model architecture, model size, and training data, the training objective (especially self-supervised learning) has the strongest effect on how consistent representational similarities are.
- **Differences between model categories**: the representational similarities of self-supervised models are more consistent across datasets, while those of supervised image classification models and image-text models are less stable.

### Experimental Design

The authors use a range of vision models (self-supervised, supervised, and image-text models) and run experiments on multiple datasets. They compute the representational similarity of model pairs on each dataset (using measures such as CKA) and then analyze how well these similarities correlate across datasets.

### Conclusion

This study offers a new perspective on the convergence of model representation spaces and on its consistency across datasets, and it emphasizes the role of the training objective in this process. These findings are relevant to the development of more general and robust vision models.
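
As an illustration of the measurement outlined under Experimental Design, the following is a minimal sketch, not the authors' implementation. It assumes linear CKA as the similarity measure and Pearson correlation as the cross-dataset consistency score; the summary above names CKA only as one example and does not specify the correlation measure. The function names (`linear_cka`, `pairwise_similarities`, `consistency`) and the random stand-in features are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices of shape (n_samples, dim).

    Both matrices must describe the same n_samples stimuli; dims may differ.
    """
    # Center each feature dimension over the stimuli.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def pairwise_similarities(reps):
    """CKA for every unordered model pair on one dataset.

    reps: list with one (n_samples, dim_i) array per model.
    """
    m = len(reps)
    return np.array(
        [linear_cka(reps[i], reps[j]) for i in range(m) for j in range(i + 1, m)]
    )


def consistency(reps_a, reps_b):
    """Pearson correlation of model-pair similarities between two datasets.

    Assumption: Pearson correlation stands in for the consistency measure.
    The models must appear in the same order in both lists.
    """
    sims_a = pairwise_similarities(reps_a)
    sims_b = pairwise_similarities(reps_b)
    return np.corrcoef(sims_a, sims_b)[0, 1]


# Usage with random stand-in features: 4 hypothetical models, 2 datasets.
rng = np.random.default_rng(0)
dims = (16, 32, 32, 64)
dataset_a = [rng.normal(size=(200, d)) for d in dims]
dataset_b = [rng.normal(size=(150, d)) for d in dims]
print("consistency between datasets:", consistency(dataset_a, dataset_b))
```

A high correlation would indicate that model pairs that are similar on one dataset are also similar on the other. With real models, each list would hold the features that every model extracts from the same set of images.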