Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers

Chaitanya Devaguptapu, Sumukh Aithal, Shrinivas Ramasubramanian, Moyuru Yamada, Manohar Kaul
2024-06-18
Abstract: Self-supervised learning (SSL) with vision transformers (ViTs) has proven effective for representation learning as demonstrated by the impressive performance on various downstream tasks. Despite these successes, existing ViT-based SSL architectures do not fully exploit the ViT backbone, particularly the patch tokens of the ViT. In this paper, we introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively. We reconceptualize images as graphs, with image patches as nodes and infuse relational inductive biases by explicit message passing using Graph Neural Networks into the SSL framework. Our SGC loss acts as a regularizer, leveraging the underexploited patch tokens of ViTs to construct a graph and enforcing consistency between graph features across multiple views of an image. Extensive experiments on various datasets including ImageNet, RESISC and Food-101 show that our approach significantly improves the quality of learned representations, resulting in a 5-10% increase in performance when limited labeled data is used for linear evaluation. These experiments coupled with a comprehensive set of ablations demonstrate the promise of our approach in various settings.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the failure of existing self-supervised learning (SSL) methods based on Vision Transformers (ViTs) to make full use of the patch tokens in the ViT architecture. Specifically:

1. **Limitations of existing ViT-based SSL methods**:
   - Although ViTs perform well in representation learning, existing SSL architectures do not fully exploit the ViT backbone, in particular its patch tokens.
   - Most ViT-based SSL methods rely mainly on the global class token and overlook the patch tokens, which carry local and fine-grained information.
2. **Introduction of the Semantic Graph Consistency (SGC) module**:
   - The paper proposes a new Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and use patch tokens effectively.
   - The SGC module treats the image as a graph, with image patches as nodes, and performs explicit message passing with graph neural networks (GNNs) to introduce relational inductive biases.
3. **Improvement of the quality of representation learning**:
   - The SGC loss acts as a regularization term: it enforces consistency between the graph features of multiple views of the same image.
   - Experiments on several datasets (including ImageNet, RESISC, and Food-101) show that the method significantly improves linear evaluation performance, with a 5-10% gain when limited labeled data is used.
4. **Combination of method and experiments**:
   - The paper pairs the proposed framework with extensive experiments, including ablations, to demonstrate the potential of the method in different settings.
In summary, this paper tackles the insufficient use of patch tokens in existing ViT-based SSL methods by introducing the Semantic Graph Consistency module, thereby improving the quality of self-supervised representations, especially when limited labeled data is available for linear evaluation.
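The idea described above (patch tokens as graph nodes, message passing, and a consistency term across views) can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify how the graph is built, which GNN is used, or the exact form of the loss. The k-nearest-neighbour graph on cosine similarity, the mean-aggregation step, the mean pooling, and the cosine-distance loss below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def knn_graph(tokens, k):
    """Assumed graph construction: link each patch token to its k most
    cosine-similar other tokens. Returns one neighbour-index list per node."""
    neighbors = []
    for i, t in enumerate(tokens):
        sims = [(cosine(t, u), j) for j, u in enumerate(tokens) if j != i]
        sims.sort(reverse=True)
        neighbors.append([j for _, j in sims[:k]])
    return neighbors


def message_pass(tokens, neighbors):
    """One round of mean-aggregation message passing (a stand-in for a GNN
    layer): each node is averaged with the mean of its neighbours."""
    updated = []
    for i, t in enumerate(tokens):
        nbrs = [tokens[j] for j in neighbors[i]]
        nbr_mean = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        updated.append([0.5 * a + 0.5 * m for a, m in zip(t, nbr_mean)])
    return updated


def graph_feature(tokens, k=3):
    """Build the graph, run one message-passing step, then mean-pool the
    nodes into a single graph-level feature vector."""
    nodes = message_pass(tokens, knn_graph(tokens, k))
    return [sum(col) / len(nodes) for col in zip(*nodes)]


def consistency_loss(tokens_a, tokens_b, k=3):
    """Assumed consistency term: cosine distance between the graph features
    of two views. It is near 0 when the views agree and at most 2."""
    return 1.0 - cosine(graph_feature(tokens_a, k), graph_feature(tokens_b, k))


# Toy data: 16 "patch tokens" of dimension 8 for one view, and a lightly
# perturbed copy standing in for a second augmented view of the same image.
random.seed(0)
view_a = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
view_b = [[x + 0.05 * random.gauss(0, 1) for x in row] for row in view_a]

print("loss, identical views:", consistency_loss(view_a, view_a))
print("loss, perturbed view: ", consistency_loss(view_a, view_b))
```

In a training setup, a term like this would be added to the base SSL objective with a weighting coefficient, so that it acts as a regularizer on the patch tokens rather than replacing the class-token loss.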