RoFormer for Position Aware Multiple Instance Learning in Whole Slide Image Classification

Etienne Pochet, Rami Maroun, Roger Trullo
2023-10-03
Abstract: Whole slide image (WSI) classification is a critical task in computational pathology. However, the gigapixel size of such images remains a major challenge for the current state of deep learning. Current methods rely on multiple-instance learning (MIL) models with frozen feature extractors. Given the high number of instances in each image, MIL methods have long assumed independence and permutation-invariance of patches, disregarding the tissue structure and correlation between patches. Recent works started studying this correlation between instances, but the computational workload of such a high number of tokens remained a limiting factor. In particular, the relative position of patches remains unaddressed. We propose to apply a straightforward encoding module, namely a RoFormer layer, relying on memory-efficient exact self-attention and relative positional encoding. This module can perform full self-attention with relative position encoding on patches of large and arbitrarily shaped WSIs, solving the need for correlation between instances and spatial modeling of tissues. We demonstrate that our method outperforms state-of-the-art MIL models on three commonly used public datasets (TCGA-NSCLC, BRACS and Camelyon16) on weakly supervised classification tasks. Code is available at https://github.com/Sanofi-Public/DDS-RoFormerMIL
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to model tissue structure and the correlation between instances in Whole Slide Image (WSI) classification. Existing methods usually treat each patch as independent and ignore the spatial arrangement of the tissue, which limits the model's ability to understand complex tissue structures. In addition, because a WSI usually reaches gigapixel size, processing such images is a major challenge in terms of computational cost.

To solve these problems, the authors propose an encoding module based on the RoFormer layer. This module uses a memory-efficient exact self-attention mechanism and relative position encoding to process large and arbitrarily shaped WSIs. In this way, the correlation between instances and the spatial structure of tissues can be modeled, thereby improving WSI classification performance.

### Main Contributions

1. **Introduction of Relative Position Encoding**: By using the RoFormer layer, the paper proposes an encoding module that can handle large WSIs and can perform full self-attention with relative position encoding without any approximation.
2. **Improvement of Classification Performance**: Experimental results show that this method outperforms existing MIL models in weakly supervised classification tasks on three commonly used public datasets (TCGA-NSCLC, BRACS, and Camelyon16).
3. **Memory Efficiency**: This method can run on consumer-grade GPUs, and can even be run on 8GB GPUs, demonstrating its feasibility in practical applications.

### Method Overview

- **WSI Segmentation**: Divide the WSI into small image patches and retain the coordinate information of each patch.
- **RoFormer Layer**: Use the RoFormer layer for encoding. This layer incorporates Rotary Position Encoding (RoPE) and can handle large and irregularly shaped WSIs.
- **Self-Attention Mechanism**: Through full self-attention, each image patch updates its feature vector, thereby modeling the interaction between instances and tissue regions.
- **Downstream Model**: The encoded output is sent to the MIL classification head for the final classification task.

### Experimental Results

- **Ablation Study**: The ablation study conducted on the TCGA-NSCLC dataset shows that adding the RoFormer layer significantly improves classification performance, especially in terms of AUROC and average precision.
- **Multi-Dataset Validation**: The experimental results on the three datasets (TCGA-NSCLC, Camelyon16, and BRACS) further verify the effectiveness of this method, especially on the Camelyon16 and BRACS datasets.

In conclusion, by introducing relative position encoding and full self-attention, this paper addresses the limitations of existing MIL models in processing large WSIs and improves classification performance.
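
The idea of rotary position encoding over patch coordinates can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`rope_1d`, `rope_2d`, `attention`), the frequency base of 10000, and the choice to rotate the first half of the features by the x coordinate and the second half by the y coordinate are assumptions made for illustration. The naive attention loop below computes exact softmax attention but is not the memory-efficient kernel the paper relies on.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per feature pair
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def rope_2d(vec, xy, base=10000.0):
    """Assumed 2D variant: first half encodes x, second half encodes y."""
    assert len(vec) % 4 == 0, "feature dimension must be divisible by 4"
    half = len(vec) // 2
    return rope_1d(vec[:half], xy[0], base) + rope_1d(vec[half:], xy[1], base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values, coords):
    """Exact softmax self-attention with rotary encoding on queries and keys.

    `coords` holds one (x, y) patch position per token. Values are left
    unrotated, so position only influences the attention weights.
    """
    q_rot = [rope_2d(q, c) for q, c in zip(queries, coords)]
    k_rot = [rope_2d(k, c) for k, c in zip(keys, coords)]
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    outputs = []
    for q in q_rot:
        scores = [dot(q, k) * scale for k in k_rot]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total
             for j in range(out_dim)]
        )
    return outputs
```

Because each feature pair is rotated by an angle proportional to the patch position, the dot product between a rotated query and a rotated key depends only on the difference of their coordinates. Shifting every patch by the same offset therefore leaves the attention output unchanged, which is why this kind of encoding suits slides of arbitrary shape and size.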