Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis

Honglin Li,Yunlong Zhang,Pingyi Chen,Zhongyi Shui,Chenglu Zhu,Lin Yang

2024-10-18

Abstract: Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis models for WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs an approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their use of absolute positional embedding struggles to effectively handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of the attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and local-global information interaction modelling. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at http://github.com/invoker-LL/Long-MIL.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper attempts to address the performance bottlenecks and computational complexity issues of existing Transformer models when processing long sequence data in Whole Slide Image (WSI) analysis. Specifically:

1. **Limitations of Existing Methods**:
   - **Traditional Attention Mechanisms**: Methods such as AB-MIL, DS-MIL, and CLAM use simple attention mechanisms that, while computationally efficient, cannot model contextual information and interactions between instances in WSI.
   - **Self-Attention Mechanism**: Although it can model contextual information, its computational complexity is O(n^2) when processing long sequences, leading to high computational costs.
   - **Approximate Methods**: Methods like Nyströmformer used in TransMIL and region partitioning in HIPT reduce computational burden but do not perform as well as full self-attention mechanisms.
2. **Low-Rank Bottleneck**:
   - In WSI analysis, since the sequence length n is much greater than the embedding dimension d, the rank of the attention matrix is limited by d, which restricts the representation capability.
   - Experiments have shown that even after training, the rank of the attention matrix remains low, affecting the model's performance.
3. **Locality and Sparsity**:
   - The authors found that the attention patterns of lower-layer Transformers exhibit locality and sparsity, which inspired them to design a local attention mask to improve the model.
4. **Objectives**:
   - **Improve Representation Capability**: By introducing a local attention mask, the rank of the attention matrix is increased, thereby enhancing the model's representation capability.
   - **Reduce Computational Complexity**: Using a local attention mask reduces the computational complexity from O(n^2) to linear complexity O(bnd), where b is the local bandwidth.
   - **Enhance Generalization Ability**: The local attention mask helps the model better handle unseen or underfitted positions.
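The low-rank point can be illustrated with a small NumPy sketch. This is a toy check, not the authors' code: it uses random queries and keys, a 1-D band mask rather than a mask over 2-D patch positions, and the hypothetical names `attention_ranks`, `n`, `d`, and `b`. It only shows that the pre-softmax logits `Q K^T` have rank at most `d`, and that a banded softmax matrix can have a much higher rank in this toy setting.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; exp(-inf) evaluates to 0, so masked entries vanish.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_ranks(n=64, d=8, b=2, seed=0):
    """Return (rank of Q K^T logits, rank of band-masked softmax attention).

    n: sequence length, d: embedding dimension, b: local bandwidth.
    All values are illustrative assumptions, not settings from the paper.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    logits = q @ k.T / np.sqrt(d)  # n x n product of n x d factors: rank <= d

    # 1-D local mask: position i may attend to j only when |i - j| <= b.
    idx = np.arange(n)
    in_band = np.abs(idx[:, None] - idx[None, :]) <= b
    local_attn = softmax(np.where(in_band, logits, -np.inf))

    return np.linalg.matrix_rank(logits), np.linalg.matrix_rank(local_attn)

logit_rank, local_rank = attention_ranks()
print(logit_rank, local_rank)
```

The sketch says nothing about trained models; the summary's claim that the rank stays low after training rests on the paper's own experiments.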
In summary, this paper aims to address the performance and computational efficiency issues of existing methods when processing long-sequence WSIs. It does so by designing a local-global hybrid Transformer model that improves overall performance, memory usage, and speed, and enhances generalization ability.
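The idea of applying the local mask during chunked attention calculation can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the LongMIL implementation: the band is 1-D, there is a single head, and the function names and the `chunk` parameter are hypothetical. Each query chunk only reads the keys inside its band, so the work per chunk is proportional to `(chunk + 2b) * d` rather than `n * d`.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; exp(-inf) evaluates to 0, so masked entries vanish.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dense_local_attention(q, k, v, b):
    """Reference version: build the full n x n logits, then mask |i - j| > b."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    outside = np.abs(idx[:, None] - idx[None, :]) > b
    return softmax(np.where(outside, -np.inf, logits)) @ v

def chunked_local_attention(q, k, v, b, chunk=4):
    """Same output, but never materializes the full n x n matrix."""
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Keys that any query in [start, stop) is allowed to see.
        lo, hi = max(0, start - b), min(n, stop + b)
        logits = q[start:stop] @ k[lo:hi].T / np.sqrt(d)
        qi = np.arange(start, stop)[:, None]
        kj = np.arange(lo, hi)[None, :]
        logits = np.where(np.abs(qi - kj) > b, -np.inf, logits)
        out[start:stop] = softmax(logits) @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 37, 8))
same = np.allclose(dense_local_attention(q, k, v, b=3),
                   chunked_local_attention(q, k, v, b=3, chunk=5))
print(same)
```

With `b` and `chunk` held fixed, the total cost grows linearly in `n`, which is the behaviour the summary describes as O(bnd).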

Long-MIL: Scaling Long Contextual Multiple Instance Learning for Histopathology Whole Slide Image Analysis

Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification

RetMIL: Retentive Multiple Instance Learning for Histopathological Whole Slide Image Classification

Integrative Graph-Transformer Framework for Histopathology Whole Slide Image Representation and Classification

Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images

Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics

Attention Multiple Instance Learning with Transformer Aggregation for Breast Cancer Whole Slide Image Classification

Kernel Attention Transformer for Histopathology Whole Slide Image Analysis and Assistant Cancer Diagnosis

MG-Trans: Multi-Scale Graph Transformer with Information Bottleneck for Whole Slide Image Classification

Masked pre-training of transformers for histology image analysis

Multi-Scale Prototypical Transformer for Whole Slide Image Classification

Sparse and Hierarchical Transformer for Survival Analysis on Whole Slide Images

Neighborhood attention transformer multiple instance learning for whole slide image classification

Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification

Transformer based multiple instance learning for WSI breast cancer classification

Position-Aware Masked Autoencoder for Histopathology WSI Representation Learning

AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images

Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction

Multi-class Cancer Classification of Whole Slide Images Through Transformer and Multiple Instance Learning

Positional Encoding-Guided Transformer-Based Multiple Instance Learning for Histopathology Whole Slide Images Classification