Abstract: Masked Image Modelling (MIM), a form of self-supervised learning, has achieved significant success in computer vision by improving image representations using unannotated data. Traditional MIM methods typically mask randomly sampled regions across the image. However, this random masking may not be well suited to medical imaging, whose characteristics differ from those of natural images. In medical images, particularly in pathology, disease-related features are often exceedingly sparse and localized, while the remaining regions appear normal and undifferentiated. Additionally, medical images are frequently accompanied by reports that directly pinpoint the location of pathological changes. Inspired by this, we propose Masked medical Image Modelling (MedIM), which is, to our knowledge, the first approach that employs radiological reports to guide the masking and restoration of the informative areas of images, encouraging the network to learn stronger semantic representations from medical images. We introduce two complementary masking strategies: knowledge-driven masking (KDM) and sentence-driven masking (SDM). KDM uses Medical Subject Headings (MeSH) words unique to radiology reports (e.g., cardiac, edema, vascular, pulmonary) to identify symptom clues and guide mask generation. Recognizing that radiological reports often comprise several sentences detailing varied findings, SDM integrates sentence-level information to identify key regions for masking. MedIM reconstructs images masked under the guidance of the KDM and SDM modules, promoting a comprehensive and enriched medical image representation. Our extensive experiments on seven downstream tasks, covering multi-label/class image classification, pneumothorax segmentation, and medical image-report analysis, demonstrate that MedIM with report-guided masking achieves competitive performance. Our method substantially outperforms ImageNet pre-training, MIM-based pre-training, and medical image-report pre-training counterparts. Code is available at https://github.com/YtongXie/MedIM.
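
To make the idea of report-guided masking concrete, the following is a minimal sketch and not the authors' implementation. It assumes that patch embeddings and report word embeddings already live in a shared feature space, scores each patch by its best cosine similarity to any report word, and masks the highest-scoring patches. The function name `report_guided_mask`, the `mask_ratio` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def report_guided_mask(patch_feats, word_feats, mask_ratio=0.5):
    """Return a boolean mask over patches (True = masked).

    Assumption: patches most similar to report words are the ones to mask.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    sim = p @ w.T                  # shape: (num_patches, num_words)
    score = sim.max(axis=1)        # best-matching word for each patch
    num_masked = int(round(mask_ratio * len(score)))
    top = np.argsort(-score)[:num_masked]
    mask = np.zeros(len(score), dtype=bool)
    mask[top] = True
    return mask

# Toy data: a hypothetical 4x4 grid of 8-dimensional patch embeddings,
# and two "word" embeddings that closely resemble patches 3 and 10.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
words = patches[[3, 10]] + 0.01 * rng.normal(size=(2, 8))
mask = report_guided_mask(patches, words, mask_ratio=0.25)
```

In this toy setting the patches that the report words resemble are selected for masking, in contrast to random masking, where every patch is equally likely to be hidden.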

Global Patch-wise Attention is Masterful Facilitator for Masked Image Modeling

Exploring the Coordination of Frequency and Attention in Masked Image Modeling

Revealing the Dark Secrets of Masked Image Modeling

Hard Patches Mining for Masked Image Modeling

Learning with Unmasked Tokens Drives Stronger Vision Learners

Stare at What You See: Masked Image Modeling Without Reconstruction

Symmetric masking strategy enhances the performance of Masked Image Modeling

Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining

SimMIM: A Simple Framework for Masked Image Modeling

Information-density Masking Strategy for Masked Image Modeling

Masked Image Modeling with Local Multi-Scale Reconstruction

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning

PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling

Gradient-coupled Cross-Patch Attention Map for Weakly Supervised Semantic Segmentation

SemanticMIM: Marring Masked Image Modeling with Semantics Compression for General Visual Representation

BIM: Block-Wise Self-Supervised Learning with Masked Image Modeling

Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation

Rethinking masked image modelling for medical image representation