Abstract: Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations shows that there is still considerable room for building a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and masked image modeling (MIM) through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder, and the momentum branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components: pixel shifting for generating plausible positive views, and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves $85.3\%$ top-1 accuracy on ImageNet and $52.5\%$ mIoU on ADE20k, surpassing previous best results by $0.7\%$ and $1.8\%$ respectively. The source code is publicly accessible at \url{https://github.com/ZhichengHuang/CMAE}.
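
The data flow described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the momentum coefficient `m=0.996`, the mask ratio, and the shift range are all hypothetical choices made for illustration, and the encoders, decoders, and contrastive loss are omitted. It shows only three ingredients: an exponential-moving-average update of the kind commonly used for a momentum encoder, two views cut from the same image with a small pixel offset, and random selection of the patches that stay visible to the online branch.

```python
import random

def ema_update(momentum_params, online_params, m=0.996):
    """Move each momentum-branch weight toward its online counterpart.
    The coefficient m is an assumed value, not taken from the abstract."""
    return [m * t + (1.0 - m) * o for t, o in zip(momentum_params, online_params)]

def pixel_shifted_views(image, crop, max_shift, rng):
    """Cut two crop x crop windows whose origins differ by at most
    max_shift pixels per axis (hypothetical reading of 'pixel shifting')."""
    h, w = len(image), len(image[0])
    # Origin of the first view, chosen so the shifted second view still fits.
    y0 = rng.randint(0, h - crop - max_shift)
    x0 = rng.randint(0, w - crop - max_shift)
    dy = rng.randint(0, max_shift)
    dx = rng.randint(0, max_shift)

    def window(y, x):
        return [row[x:x + crop] for row in image[y:y + crop]]

    return window(y0, x0), window(y0 + dy, x0 + dx), (dy, dx)

def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of the patches kept visible for the online encoder."""
    num_keep = int(num_patches * (1.0 - mask_ratio))
    return sorted(rng.sample(range(num_patches), num_keep))

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 8x8 "image" whose pixel value encodes its position.
    image = [[r * 8 + c for c in range(8)] for r in range(8)]
    view_online, view_momentum, shift = pixel_shifted_views(image, 4, 2, rng)
    visible = random_mask(num_patches=16, mask_ratio=0.75, rng=rng)
    print("shift:", shift, "visible patches:", visible)
    print("momentum weights:", ema_update([1.0, 2.0], [0.0, 0.0]))
```

In this sketch the masked view would go to the online encoder-decoder and the shifted full view to the momentum encoder. Keeping the shift small means the two views overlap heavily, which is one plausible reading of why such pairs serve as positive views.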

Mask Guided Gated Convolution for Amodal Content Completion

When Masked Image Modeling Meets Source-free Unsupervised Domain Adaptation: Dual-Level Masked Network for Semantic Segmentation

Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion Network for Depth Completion

Amodal Ground Truth and Completion in the Wild

Look Through Masks: Towards Masked Face Recognition with De-Occlusion Distillation

Self-Supervised Visual Representations Learning by Contrastive Mask Prediction

DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion

Completing Visual Objects via Bridging Generation and Segmentation

Human De-occlusion: Invisible Perception and Recovery for Humans

Contrastive Masked Autoencoders are Stronger Vision Learners

Amodal Instance Segmentation Via Prior-Guided Expansion

Hallucinating Visual Instances in Total Absentia – Supplementary Material –

Hyper-Transformer for Amodal Completion

MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments

Image Inpainting by End-to-End Cascaded Refinement With Mask Awareness

Amodal Segmentation Based on Visible Region Segmentation and Shape Prior

Masked GAN for Unsupervised Depth and Pose Prediction with Scale Consistency

What to Hide from Your Students: Attention-Guided Masked Image Modeling

Instance-Aware Image Completion

Efficiently Detecting Plausible Locations for Object Placement Using Masked Convolutions