Abstract: In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images using their unmasked portions. A notable subset of MIM methodologies employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. In this paper, we explore the role of these discrete tokens, aiming to unravel their benefits and limitations. Building upon the connection between MIM and contrastive learning, we provide a comprehensive theoretical understanding on how discrete tokenization affects the model's generalization capabilities. Furthermore, we propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework. Inspired by this metric, we contribute an innovative tokenizer design and propose a corresponding MIM method named ClusterMIM. It demonstrates superior performance on a variety of benchmark datasets and ViT backbones. Code is available at https://github.com/PKU-ML/ClusterMIM.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper mainly explores the role of discrete tokenization in the field of self-supervised learning (SSL), especially in the masked image modeling (MIM) task. Specifically, the paper aims to answer the following questions:

1. **What is the role of discrete tokenization in MIM?**
2. **How does it affect the generalization ability of downstream tasks?**

#### Background and motivation

In recent years, self-supervised learning (SSL), as a method for learning meaningful data representations without labels, has received extensive attention. Within SSL, contrastive learning and masked image modeling (MIM) are the two main approaches. MIM trains the model by occluding part of the input image and attempting to reconstruct these regions from the unoccluded parts. Some MIM methods use discrete tokens as reconstruction targets, but the theoretical basis for this choice has not been fully explored.

#### Research questions

The paper points out that different tokenization schemes may lead to significantly different performance. For example, Table 1 shows the linear probing and fine-tuning accuracies of several MIM methods using different tokenizers on ImageNet-100. These observations raise the following questions:

- What is the specific role of discrete tokens in MIM?
- How do they affect the generalization performance of downstream tasks?

To answer these questions, the paper analyzes the influence of different discrete tokenization schemes on downstream generalization from the perspective of graph theory and proposes a new metric, **Token-Class Alignment Similarity (TCAS)**, to evaluate the quality of tokenizers. In addition, based on this metric, the paper designs a new tokenizer and its corresponding MIM method, **ClusterMIM**, and verifies its effectiveness on multiple benchmark datasets.

### Main contributions

1. **For the first time, identify and theorize the role of discrete tokens in MIM**, emphasizing how they improve generalization ability by enhancing the alignment between unoccluded views.
2. **Deeply explore the influence of discrete tokenization on downstream generalization**, especially its impact on the intra-class and inter-class dynamics in the augmented graph.
3. **Propose a new metric TCAS**, which can directly compare the quality of different tokenizers without pre-training.
4. **Propose a simple and effective MIM method ClusterMIM**, which demonstrates its effectiveness by improving performance.

Through these contributions, the paper not only deepens the understanding of the role of discrete tokens in MIM, but also provides a practical tool and method to guide future research and applications.
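To make the idea of "discrete tokens as reconstruction targets" concrete, here is a minimal NumPy sketch of a masked-token objective. It is an illustration under stated assumptions, not the paper's implementation: the nearest-codebook tokenizer, the random codebook, the 75% mask ratio, and all function names (`patchify`, `nearest_token`, `random_mask`, `masked_token_loss`) are hypothetical, and the random `logits` stand in for the output of a real encoder and prediction head.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def nearest_token(patches, codebook):
    """Hypothetical tokenizer: index of the closest codebook vector per patch."""
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def random_mask(num_patches, ratio, rng):
    """Boolean mask with round(ratio * num_patches) entries set to True (masked)."""
    n_mask = int(round(ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=n_mask, replace=False)] = True
    return mask

def masked_token_loss(logits, targets, mask):
    """Mean cross-entropy between predicted token logits and target tokens,
    computed over masked positions only."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll[mask].mean()

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                    # toy image
patches = patchify(image, patch=8)                 # 16 patches of dimension 192
codebook = rng.random((10, patches.shape[1]))      # placeholder codebook (assumption)
targets = nearest_token(patches, codebook)         # one discrete token per patch
mask = random_mask(len(patches), ratio=0.75, rng=rng)
logits = rng.normal(size=(len(patches), len(codebook)))  # stand-in for model output
loss = masked_token_loss(logits, targets, mask)
```

In an actual MIM pipeline the model would see only the unmasked patches and the loss would be backpropagated through it; the sketch only shows how discrete targets turn reconstruction into a classification problem over masked positions.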

On the Role of Discrete Tokenization in Visual Representation Learning

Morphing Tokens Draw Strong Masked Image Models

Masked Image Modeling with Denoising Contrast

Learning with Unmasked Tokens Drives Stronger Vision Learners

Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling

SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation

PR-MIM: Delving Deeper into Partial Reconstruction in Masked Image Modeling

SimMIM: A Simple Framework for Masked Image Modeling

Symmetric masking strategy enhances the performance of Masked Image Modeling

What to Hide from Your Students: Attention-Guided Masked Image Modeling

Improve Supervised Representation Learning with Masked Image Modeling

Beyond [cls]: Exploring the true potential of Masked Image Modeling representations

Kernel Masked Image Modeling Through the Lens of Theoretical Understanding

Mc-Beit: Multi-choice Discretization for Image BERT Pre-training

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

Emerging Property of Masked Token for Effective Pre-training

Remote Sensing Scene Classification with Masked Image Modeling (MIM)

PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

Attention-Guided Contrastive Masked Image Modeling for Transformer-Based Self-Supervised Learning

Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning