Abstract: Self-supervised learning (SSL) with vision transformers (ViTs) has proven effective for representation learning, as demonstrated by the impressive performance on various downstream tasks. Despite these successes, existing ViT-based SSL architectures do not fully exploit the ViT backbone, particularly the patch tokens of the ViT. In this paper, we introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively. We reconceptualize images as graphs, with image patches as nodes, and infuse relational inductive biases into the SSL framework by explicit message passing using Graph Neural Networks. Our SGC loss acts as a regularizer, leveraging the underexploited patch tokens of ViTs to construct a graph and enforcing consistency between graph features across multiple views of an image. Extensive experiments on various datasets including ImageNet, RESISC and Food-101 show that our approach significantly improves the quality of learned representations, resulting in a 5-10% increase in performance when limited labeled data is used for linear evaluation. These experiments, coupled with a comprehensive set of ablations, demonstrate the promise of our approach in various settings.
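
To make the graph view of an image concrete, here is a minimal NumPy sketch of the idea described in the abstract: patch tokens become nodes, edges connect similar patches, and one round of message passing produces a graph-level feature. The abstract does not specify how the graph is built or which GNN is used, so the k-nearest-neighbour construction over cosine similarity, the mean-aggregation layer, and all names (`knn_adjacency`, `message_passing`, `graph_feature`, `k`, `weight`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def knn_adjacency(tokens: np.ndarray, k: int) -> np.ndarray:
    """Symmetric k-NN adjacency with self-loops (assumed graph construction)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                 # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)          # a patch is not its own neighbour
    nbrs = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar patches
    n = tokens.shape[0]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)            # make the graph undirected
    np.fill_diagonal(adj, 1.0)              # add self-loops
    return adj


def message_passing(tokens: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One mean-aggregation step followed by a linear map and ReLU."""
    degree = adj.sum(axis=1, keepdims=True)
    aggregated = (adj @ tokens) / degree    # average over each node's neighbourhood
    return np.maximum(aggregated @ weight, 0.0)


def graph_feature(tokens: np.ndarray, weight: np.ndarray, k: int = 4) -> np.ndarray:
    """Mean-pool node features after message passing into one graph-level vector."""
    adj = knn_adjacency(tokens, k)
    return message_passing(tokens, adj, weight).mean(axis=0)


# Toy stand-in for ViT patch tokens: 16 patches with 8-dimensional embeddings.
rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(16, 8))
weight = rng.normal(size=(8, 8)) * 0.1
feat = graph_feature(patch_tokens, weight)
print(feat.shape)  # (8,)
```

In a real pipeline the tokens would come from the ViT backbone for each augmented view, and the layer weights would be trained jointly with the SSL objective.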

What problem does this paper attempt to address?

This paper addresses the failure of existing self-supervised learning (SSL) methods based on Vision Transformers (ViTs) to fully utilize the patch tokens of the ViT architecture. Specifically:

1. **Limitations of existing ViT-based SSL methods**:
   - Although ViTs perform well in representation learning, existing SSL architectures do not fully exploit the ViT backbone, in particular its patch tokens.
   - Most ViT-based SSL methods rely mainly on the global class token and overlook the patch tokens, which carry local, fine-grained information.
2. **Introduction of the Semantic Graph Consistency (SGC) module**:
   - The paper proposes a new Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and make effective use of patch tokens.
   - The SGC module treats the image as a graph with image patches as nodes, and performs explicit message passing with graph neural networks (GNNs) to introduce a relational inductive bias.
3. **Improvement of the quality of representation learning**:
   - The SGC loss acts as a regularization term: it improves the learned representations by enforcing consistency between graph features across multiple views of an image.
   - Experiments on various datasets (including ImageNet, RESISC, and Food-101) show a significant improvement in linear evaluation, with a 5-10% gain when limited labeled data is used.
4. **Combination of theory and practice**:
   - The paper supports the proposed framework with extensive experiments and a comprehensive set of ablations, demonstrating the promise of the approach in various settings.
In summary, this paper tackles the underuse of patch tokens in existing ViT-based SSL methods by introducing the Semantic Graph Consistency module, which improves the quality of self-supervised representations, especially when labeled data for linear evaluation is limited.
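
The regularization idea summarized above can be sketched as a base SSL loss plus a weighted consistency term between the graph features of two views. The exact form of the SGC loss is not given in this summary, so the cosine-distance penalty, the weighting factor `lam`, and the function names below are hypothetical choices for illustration only.

```python
import numpy as np


def cosine_consistency(g_view1: np.ndarray, g_view2: np.ndarray) -> float:
    """Cosine distance between two graph-level features (0 when they align).

    Assumed stand-in for the consistency term; the paper may use another objective.
    """
    a = g_view1 / np.linalg.norm(g_view1)
    b = g_view2 / np.linalg.norm(g_view2)
    return 1.0 - float(a @ b)


def regularized_loss(base_ssl_loss: float,
                     g_view1: np.ndarray,
                     g_view2: np.ndarray,
                     lam: float = 0.5) -> float:
    """Base SSL loss plus a weighted graph-consistency regularizer."""
    return base_ssl_loss + lam * cosine_consistency(g_view1, g_view2)


# Identical graph features add no penalty; orthogonal ones add the full weight.
same = np.array([1.0, 2.0, 3.0])
print(regularized_loss(2.0, same, same))                                  # ~2.0
print(regularized_loss(2.0, np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 2.5
```

Keeping the consistency term separate from the base loss reflects its role as a regularizer: setting `lam` to zero recovers the underlying SSL method unchanged.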

Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers
