Abstract: Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. However, as model sizes have grown, the cost of finetuning these large models has become a challenge. This study evaluates the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks without any finetuning, with the aim of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and all other patches; a simple threshold is then applied to segment the target. A second evaluation measures intra-object and inter-object similarity to gauge the discriminatory ability of SSP ViTs. Insights from both evaluations led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling for encouraging similarity of local features, Momentum-based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance the discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
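
The prompting-patch protocol described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure or threshold is used, so this sketch assumes cosine similarity and an arbitrary threshold of 0.5. The function names `prompt_segment` and `intra_inter_similarity` are hypothetical, as is the choice to summarise intra-object and inter-object similarity by their means relative to the prompt patch. The patch features are assumed to come from a frozen SSP ViT and to be arranged on an (H, W, D) grid.

```python
import numpy as np


def prompt_segment(patch_features, prompt_index, threshold=0.5):
    """Segment a target from one prompted patch (illustrative sketch).

    patch_features: (H, W, D) array of per-patch features (assumed layout).
    prompt_index:   (row, col) of the patch containing the prompt point.
    threshold:      assumed cut-off on cosine similarity.
    Returns the (H, W) similarity map and the boolean (H, W) mask.
    """
    h, w, d = patch_features.shape
    flat = patch_features.reshape(-1, d).astype(np.float64)

    # L2-normalise so that the dot product equals cosine similarity.
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)

    # Similarity between the prompted patch and every patch in the grid.
    prompt = unit[prompt_index[0] * w + prompt_index[1]]
    similarity = (unit @ prompt).reshape(h, w)

    # Simple thresholding yields the segmentation mask.
    mask = similarity >= threshold
    return similarity, mask


def intra_inter_similarity(patch_features, object_mask, prompt_index):
    """Mean similarity to the prompt inside vs. outside a ground-truth mask.

    This summary statistic is an assumption made for illustration only.
    """
    similarity, _ = prompt_segment(patch_features, prompt_index)
    intra = similarity[object_mask].mean()
    inter = similarity[~object_mask].mean()
    return intra, inter


# Toy usage: a 4x4 grid where the top-left 2x2 block is the "object".
features = np.zeros((4, 4, 2))
features[..., 1] = 1.0
features[:2, :2] = [1.0, 0.0]
similarity_map, predicted_mask = prompt_segment(features, (0, 0))
```

A larger gap between the intra-object and inter-object values means a single threshold separates the object from the background more cleanly, which is the property the abstract says MMC is designed to improve.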

Masked Unsupervised Self-training for Label-free Image Classification

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Investigating Self-Supervised Methods for Label-Efficient Learning

Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

Masked Channel Modeling for Bootstrapping Visual Pre-training

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

Enhancing Vision-Language Model with Unmasked Token Alignment

UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models

Learning with Unmasked Tokens Drives Stronger Vision Learners

Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations

Online Zero-Shot Classification with CLIP

Learning to Teach and Learn for Semi-Supervised Few-Shot Image Classification

Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

Diverse and Tailored Image Generation for Zero-shot Multi-label Classification

Extract Free Dense Labels from CLIP

Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

CLIP meets DINO for Tuning Zero-Shot Classifier using Unlabeled Image Collections

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Single-stage Zero-Shot Object Detection Network Based on CLIP and Pseudo-Labeling