Abstract: Whole slide image (WSI) classification is a critical task in computational pathology. However, the gigapixel size of such images remains a major challenge for the current state of deep learning. Current methods rely on multiple-instance learning (MIL) models with frozen feature extractors. Given the high number of instances in each image, MIL methods have long assumed independence and permutation-invariance of patches, disregarding the tissue structure and correlation between patches. Recent works have started studying this correlation between instances, but the computational workload of such a high number of tokens has remained a limiting factor. In particular, the relative position of patches remains unaddressed. We propose to apply a straightforward encoding module, namely a RoFormer layer, relying on memory-efficient exact self-attention and relative positional encoding. This module can perform full self-attention with relative position encoding on patches of large and arbitrarily shaped WSIs, solving the need for correlation between instances and spatial modeling of tissues. We demonstrate that our method outperforms state-of-the-art MIL models on three commonly used public datasets (TCGA-NSCLC, BRACS and Camelyon16) on weakly supervised classification tasks. Code is available at https://github.com/Sanofi-Public/DDS-RoFormerMIL

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to effectively model the correlation between tissue structures and instances in Whole Slide Image (WSI) classification. Specifically, existing methods usually assume that each patch is independent and ignore the spatial information of tissue structures, which limits the model's ability to understand complex tissue structures. In addition, since the size of WSI usually reaches the gigapixel level, processing these large-scale images is also a major challenge in terms of computational cost.

To solve these problems, the authors propose an encoding module based on the RoFormer layer. This module utilizes a memory-efficient exact self-attention mechanism and relative position encoding to process large-scale and arbitrarily-shaped WSI images. Through this method, the correlation between instances and the spatial structure of tissues can be effectively modeled, thereby improving the performance of WSI classification.

### Main Contributions

1. **Introduction of Relative Position Encoding**: By using the RoFormer layer, the paper proposes an encoding module that can handle large-scale WSI images and can perform full self-attention operations with relative position encoding without any approximation.
2. **Improvement of Classification Performance**: Experimental results show that this method outperforms existing MIL models in weakly-supervised classification tasks on three commonly-used public datasets (TCGA-NSCLC, BRACS, and Camelyon16).
3. **Memory-Efficiency**: This method can run on consumer-grade GPUs, and can even be implemented on 8GB GPUs, demonstrating its feasibility in practical applications.

### Method Overview

- **WSI Segmentation**: Divide the WSI into small image patches and retain the coordinate information of each patch.
- **RoFormer Layer**: Use the RoFormer layer for encoding. This layer combines Rotary Position Encoding (RoPE) and can handle large-scale and irregular-shaped WSI images.
- **Self-Attention Mechanism**: Through the full self-attention mechanism, each image patch can update its feature vector, thereby modeling the interaction between instances and tissue regions.
- **Downstream Model**: The encoded output is sent to the MIL classification head for the final classification task.

### Experimental Results

- **Ablation Study**: The ablation study conducted on the TCGA-NSCLC dataset shows that adding the RoFormer layer significantly improves the classification performance, especially in terms of AUROC and average precision.
- **Multi-Dataset Validation**: The experimental results on the three datasets of TCGA-NSCLC, Camelyon16, and BRACS further verify the effectiveness of this method, especially on the Camelyon16 and BRACS datasets.

In conclusion, by introducing relative position encoding and the full self-attention mechanism, this paper effectively solves the limitations of existing MIL models in processing large-scale WSI images and improves the classification performance.
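
The rotary position encoding mentioned in the method overview can be illustrated with a minimal, dependency-free sketch. It is not taken from the authors' repository. It assumes one common way of extending RoPE to 2D patch coordinates: the first half of each feature vector is rotated according to the patch's x coordinate and the second half according to its y coordinate. The function names (`rope_1d`, `rope_2d`, `attention_score`), the frequency base of 10000, and the toy vectors are all illustrative assumptions. The point of the sketch is the property that makes the encoding relative: the query-key score depends only on the offset between two patches, not on their absolute positions in the slide.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        # Pair starting at index i uses frequency base^(-i/d), as in standard RoPE.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def rope_2d(vec, coord):
    """Assumed 2D variant: first half encodes x, second half encodes y."""
    half = len(vec) // 2
    x, y = coord
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)


def attention_score(q, k, q_coord, k_coord):
    """Unnormalised dot-product score between a rotated query and a rotated key."""
    q_rot = rope_2d(q, q_coord)
    k_rot = rope_2d(k, k_coord)
    return sum(a * b for a, b in zip(q_rot, k_rot))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]   # toy query features of one patch
    k = [0.5, -1.0, 2.0, 1.5]  # toy key features of another patch

    # Same relative offset (+2, -5) at two different absolute locations.
    near_origin = attention_score(q, k, (3, 7), (5, 2))
    far_away = attention_score(q, k, (103, 57), (105, 52))
    print(abs(near_origin - far_away) < 1e-9)
```

Because the rotation is applied to queries and keys before the dot product, it composes with any exact attention kernel, which is consistent with the summary's claim that full self-attention with relative position encoding is obtained without approximation.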

RoFormer for Position Aware Multiple Instance Learning in Whole Slide Image Classification

Positional Encoding-Guided Transformer-Based Multiple Instance Learning for Histopathology Whole Slide Images Classification

A universal multiple instance learning framework for whole slide image analysis

Multi-Cohort Framework with Cohort-Aware Attention and Adversarial Mutual-Information Minimization for Whole Slide Image Classification

Rethinking Attention-Based Multiple Instance Learning for Whole-Slide Pathological Image Classification: An Instance Attribute Viewpoint

Iterative multiple instance learning for weakly annotated whole slide image classification

RetMIL: Retentive Multiple Instance Learning for Histopathological Whole Slide Image Classification

Finding Regions of Interest in Whole Slide Images Using Multiple Instance Learning

Dual-Attention Multiple Instance Learning Framework for Pathology Whole-Slide Image Classification

Gigapixel Whole-Slide Images Classification using Locally Supervised Learning

Iteratively Coupled Multiple Instance Learning from Instance to Bag Classifier for Whole Slide Image Classification

Bayesian Collaborative Learning for Whole-Slide Image Classification

Long-MIL: Scaling Long Contextual Multiple Instance Learning for Histopathology Whole Slide Image Analysis

Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification

FR-MIL: Distribution Re-calibration based Multiple Instance Learning with Transformer for Whole Slide Image Classification

Deep Hierarchical Multiple Instance Learning for Whole Slide Image Classification

The Whole Pathological Slide Classification via Weakly Supervised Learning

Multi-scale Multi-Instance Contrastive Learning for Whole Slide Image Classification

Rethinking Overfitting of Multiple Instance Learning for Whole Slide Image Classification

Cluster-to-Conquer: A Framework for End-to-End Multi-Instance Learning for Whole Slide Image Classification

Rethinking Pre-trained Feature Extractor Selection in Multiple Instance Learning for Whole Slide Image Classification