Abstract: Masked image modeling (MIM) has gained significant traction for its remarkable prowess in representation learning. As an alternative to the traditional approach, the reconstruction from corrupted images has recently emerged as a promising pretext task. However, the regular corrupted images are generated using generic generators, often lacking relevance to the specific reconstruction task involved in pre-training. Hence, reconstruction from regular corrupted images cannot ensure the difficulty of the pretext task, potentially leading to a performance decline. Moreover, generating corrupted images might introduce an extra generator, resulting in a notable computational burden. To address these issues, we propose to incorporate adversarial examples into masked image modeling, as the new reconstruction targets. Adversarial examples, generated online using only the trained models, can directly aim to disrupt tasks associated with pre-training. Therefore, the incorporation not only elevates the level of challenge in reconstruction but also enhances efficiency, contributing to the acquisition of superior representations by the model. In particular, we introduce a novel auxiliary pretext task that reconstructs the adversarial examples corresponding to the original images. We also devise an innovative adversarial attack to craft more suitable adversarial examples for MIM pre-training. It is noted that our method is not restricted to specific model architectures and MIM strategies, rendering it an adaptable plug-in capable of enhancing all MIM methods. Experimental findings substantiate the remarkable capability of our approach in amplifying the generalization and robustness of existing MIM methods. Notably, our method surpasses the performance of baselines on various tasks, including ImageNet, its variants, and other downstream tasks.

What problem does this paper attempt to address?

This paper aims to solve **two key problems in masked image modeling (MIM) pre-training for self-supervised learning (SSL)**:

1. **Conventional methods of generating corrupted images are not task-specific**:
   - Existing MIM methods usually generate corrupted images with general-purpose generators that are unrelated to the specific reconstruction task. As a result, the difficulty of the reconstruction task cannot be guaranteed.
   - Generating conventional corrupted images may also require an additional generator, which increases the computational cost, and the distribution shift these images introduce does not significantly improve model performance.
2. **Existing MIM methods show insufficient generalization and robustness on downstream tasks**:
   - Although current MIM methods perform well on some tasks, their generalization and robustness still need improvement on more complex downstream tasks.

To solve these problems, the authors propose a new framework, **AEMIM (Adversarial Examples Meet Masked Image Modeling)**. The framework raises the difficulty of the pre-training task and improves the generalization and robustness of the model by introducing adversarial examples as new reconstruction targets. The main innovations of AEMIM are:

- **Introducing adversarial examples as reconstruction targets**: Adversarial examples are samples crafted specifically to degrade the model's performance. They provide more challenging reconstruction targets and thereby push the model to learn better representations.
- **Generating adversarial examples online**: Adversarial examples are generated directly from the model being trained, with no additional generator, which improves efficiency.
- **Introducing adapters to handle data in different domains**: Adapters process clean data and adversarial examples separately, which prevents the adversarial examples from degrading performance on normal visual tasks.

In summary, AEMIM combines adversarial examples with masked image modeling to improve the generalization and robustness of self-supervised pre-trained models while keeping the training process efficient.
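The idea of generating adversarial examples online and using them as extra reconstruction targets can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: a linear map `W` stands in for the encoder-decoder, a single-step sign-gradient (FGSM-style) attack replaces the paper's own attack, which this text does not describe, and the names `reconstruct`, `fgsm_example`, `combined_loss`, `eps`, and `lam` are hypothetical. The adapters mentioned above are omitted, so one prediction is scored against both targets.

```python
import random

def reconstruct(W, x, mask):
    """Toy linear stand-in for an MIM model: predict all pixels from visible ones."""
    visible = [xi * mi for xi, mi in zip(x, mask)]  # mask: 1 = visible, 0 = masked
    return [sum(w * v for w, v in zip(row, visible)) for row in W]

def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def recon_loss(W, x, mask):
    """Reconstruction loss when x is both the (masked) input and the target."""
    return mse(reconstruct(W, x, mask), x)

def grad_wrt_input(W, x, mask):
    """Analytic gradient of recon_loss with respect to the input x."""
    n = len(x)
    r = [p - xi for p, xi in zip(reconstruct(W, x, mask), x)]  # residual
    Wt_r = [sum(W[i][j] * r[i] for i in range(n)) for j in range(n)]
    return [2.0 / n * (mask[j] * Wt_r[j] - r[j]) for j in range(n)]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_example(W, x, mask, eps):
    """One sign-gradient step that increases the reconstruction loss.

    Uses only the model being trained (no extra generator); pixels stay in [0, 1].
    """
    g = grad_wrt_input(W, x, mask)
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, g)]

def combined_loss(W, x, mask, eps=0.05, lam=0.5):
    """Standard MIM loss plus a weighted auxiliary loss on the adversarial target.

    Assumption: the masked clean image is the input for both terms.
    """
    x_adv = fgsm_example(W, x, mask, eps)
    pred = reconstruct(W, x, mask)
    return mse(pred, x) + lam * mse(pred, x_adv)

random.seed(0)
n = 8
W = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
x = [random.random() for _ in range(n)]
mask = [1, 0, 1, 0, 1, 1, 0, 1]

x_adv = fgsm_example(W, x, mask, eps=0.05)
print("clean loss:", recon_loss(W, x, mask))
print("adversarial loss:", recon_loss(W, x_adv, mask))
print("combined objective:", combined_loss(W, x, mask))
```

In this toy setting the reconstruction loss is convex in the input, so the sign-gradient step cannot lower it. With a real network that guarantee no longer holds, and a multi-step attack would be a natural substitute.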

AEMIM: Adversarial Examples Meet Masked Image Modeling

PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

Membership Inference Attack Against Masked Image Modeling

Beyond Pretrained Features: Noisy Image Modeling Provides Adversarial Defense

MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness

Stare at What You See: Masked Image Modeling Without Reconstruction

Masked Image Modeling: A Survey

Symmetric masking strategy enhances the performance of Masked Image Modeling

A self-supervised image aesthetic assessment combining masked image modeling and contrastive learning

Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

Contrastive Masked Autoencoders are Stronger Vision Learners

BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling

Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

DPPMask: Masked Image Modeling with Determinantal Point Processes

Adversarial Masked Image Inpainting for Robust Detection of Mpox and Non-Mpox

PR-MIM: Delving Deeper into Partial Reconstruction in Masked Image Modeling

Efficient Masked Autoencoders with Self-Consistency

Rethinking masked image modelling for medical image representation

Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining.

Adversarial Purification of Information Masking

Masked Image Modeling Advances 3D Medical Image Analysis