On the Role of Discrete Tokenization in Visual Representation Learning

Tianqi Du, Yifei Wang, Yisen Wang
2024-07-12
Abstract: In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images using their unmasked portions. A notable subset of MIM methodologies employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. In this paper, we explore the role of these discrete tokens, aiming to unravel their benefits and limitations. Building upon the connection between MIM and contrastive learning, we provide a comprehensive theoretical understanding on how discrete tokenization affects the model's generalization capabilities. Furthermore, we propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework. Inspired by this metric, we contribute an innovative tokenizer design and propose a corresponding MIM method named ClusterMIM. It demonstrates superior performance on a variety of benchmark datasets and ViT backbones. Code is available at https://github.com/PKU-ML/ClusterMIM.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper examines the role of discrete tokenization in self-supervised learning (SSL), specifically in masked image modeling (MIM). It aims to answer two questions:

1. **What is the role of discrete tokenization in MIM?**
2. **How does it affect generalization on downstream tasks?**

#### Background and motivation

SSL learns meaningful data representations without labels and has received extensive attention in recent years. Contrastive learning and MIM are its two main families of methods. MIM trains a model by masking part of the input image and reconstructing the masked regions from the unmasked parts. Some MIM methods use discrete tokens as reconstruction targets, but the theoretical basis for this choice has not been fully explored.

#### Research questions

The paper points out that different tokenization schemes can lead to significantly different performance. For example, Table 1 of the paper reports the linear probing and fine-tuning accuracies of several MIM methods using different tokenizers on ImageNet-100. These observations raise the following questions:

- What is the specific role of discrete tokens in MIM?
- How do they affect generalization performance on downstream tasks?

To answer these questions, the paper analyzes the influence of different discrete tokenization schemes on downstream generalization from a graph-theoretic perspective and proposes a new metric, **Token-Class Alignment Similarity (TCAS)**, to evaluate the quality of tokenizers. Based on this metric, the paper designs a new tokenizer and a corresponding MIM method, **ClusterMIM**, and verifies its effectiveness on multiple benchmark datasets.

### Main contributions

1. **Identifies and theorizes the role of discrete tokens in MIM for the first time**, showing how discrete tokenization improves generalization by enhancing the alignment between unmasked views.
2. **Explores in depth the influence of discrete tokenization on downstream generalization**, especially its impact on the intra-class and inter-class dynamics in the augmentation graph.
3. **Proposes a new metric, TCAS**, which allows the quality of different tokenizers to be compared directly, without pre-training.
4. **Proposes a simple and effective MIM method, ClusterMIM**, and demonstrates its effectiveness through improved performance.

Through these contributions, the paper deepens the understanding of the role of discrete tokens in MIM and provides a practical metric and method to guide future research and applications.
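
To make the idea of a discrete reconstruction target more concrete, the following is a minimal sketch in Python with NumPy. It assumes a tokenizer built by k-means clustering of raw pixel patches, an illustrative choice suggested by the name ClusterMIM rather than a reproduction of the released code. The function names (`patchify`, `kmeans`, `assign_tokens`, `masked_targets`), the patch size, the number of clusters, and the mask ratio are all hypothetical. It shows only how token ids for masked patches could serve as classification targets; it does not train a model or compute TCAS.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)


def assign_tokens(x, centers):
    """Map each patch to the index of its nearest cluster center."""
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, dim) array of centers."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        ids = assign_tokens(x, centers)
        for j in range(k):
            members = x[ids == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def masked_targets(token_ids, mask_ratio, seed=0):
    """Pick patches to mask; their token ids are the reconstruction targets."""
    rng = np.random.default_rng(seed)
    n = len(token_ids)
    n_mask = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    return mask, token_ids[mask]


# Toy example on a random image (assumed sizes, not the paper's settings).
image = np.random.default_rng(0).random((32, 32, 3))
patches = patchify(image, patch=8)            # 16 patches of dimension 192
centers = kmeans(patches, k=4)                # hypothetical 4-token vocabulary
token_ids = assign_tokens(patches, centers)   # one discrete token per patch
mask, targets = masked_targets(token_ids, mask_ratio=0.75)
```

In an actual MIM setup, an encoder would see only the patches where `mask` is `False` and would be trained, for instance with a cross-entropy loss, to predict `targets` at the masked positions. How the tokenizer is built is exactly the design choice the paper studies.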