Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis

Honglin Li, Yunlong Zhang, Pingyi Chen, Zhongyi Shui, Chenglu Zhu, Lin Yang
2024-10-18
Abstract: Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis models for WSIs, previous methods typically employ Multi-Instance Learning to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed but are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs an approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their use of absolute positional embedding struggles to effectively handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of the attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and local-global information interaction modelling. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at http://github.com/invoker-LL/Long-MIL.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the performance bottlenecks and computational complexity issues of existing Transformer models when processing long sequence data in Whole Slide Image (WSI) analysis. Specifically:

1. **Limitations of Existing Methods**:
   - **Traditional Attention Mechanisms**: Methods such as AB-MIL, DS-MIL, and CLAM use simple attention mechanisms that, while computationally efficient, cannot model contextual information and interactions between instances in a WSI.
   - **Self-Attention Mechanism**: Although it can model contextual information, its computational complexity is O(n^2) when processing long sequences, leading to high computational costs.
   - **Approximate Methods**: Methods like Nyströmformer used in TransMIL and region partitioning in HIPT reduce computational burden but do not perform as well as full self-attention mechanisms.
2. **Low-Rank Bottleneck**:
   - In WSI analysis, since the sequence length n is much greater than the embedding dimension d, the rank of the attention matrix is limited by d, which restricts the representation capability.
   - Experiments have shown that even after training, the rank of the attention matrix remains low, affecting the model's performance.
3. **Locality and Sparsity**:
   - The authors found that the attention patterns of the lower Transformer layers exhibit locality and sparsity, which inspired them to design a local attention mask to improve the model.
4. **Objectives**:
   - **Improve Representation Capability**: By introducing a local attention mask, the rank of the attention matrix is increased, thereby enhancing the model's representation capability.
   - **Reduce Computational Complexity**: Using a local attention mask reduces the computational complexity from O(n^2) to linear complexity O(bnd), where b is the local bandwidth.
   - **Enhance Generalization Ability**: The local attention mask helps the model better handle unseen or underfitted positions.
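The low-rank bound and the local attention mask described above can be illustrated with a small NumPy sketch. This is not the LongMIL implementation: it assumes a 1D band mask over the sequence index (the paper works with 2D patch positions in a WSI), random inputs, and a single head. The function names `dense_local_attention` and `banded_local_attention` and the sizes `n`, `d`, `b` are hypothetical choices for illustration only. The sketch checks two things: the pre-softmax score matrix QK^T has rank at most d, and attention restricted to a band of width b can be computed row by row in roughly O(bnd) work while matching the dense masked result.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_local_attention(q, k, v, bandwidth):
    # Reference version: build the full n x n score matrix, then mask it.
    # Cost is O(n^2 d), so this is only for checking correctness.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= bandwidth
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def banded_local_attention(q, k, v, bandwidth):
    # Each query only touches keys inside its band, so the total work is
    # about n * (2 * bandwidth + 1) * d, i.e. linear in n for fixed bandwidth.
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - bandwidth), min(n, i + bandwidth + 1)
        s = k[lo:hi] @ q[i] / np.sqrt(d)
        out[i] = softmax(s) @ v[lo:hi]
    return out


# Assumed toy sizes: sequence length n much larger than embedding dim d.
rng = np.random.default_rng(0)
n, d, b = 256, 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

# The pre-softmax score matrix is a product of n x d and d x n factors,
# so its rank cannot exceed d no matter how large n is.
score_rank = np.linalg.matrix_rank(q @ k.T)
print("rank of QK^T:", score_rank, "bound d:", d)

out_dense = dense_local_attention(q, k, v, b)
out_banded = banded_local_attention(q, k, v, b)
print("banded matches dense masked attention:", np.allclose(out_dense, out_banded))
```

The per-row loop is written for clarity; a practical implementation would process queries in chunks so that each chunk attends to a contiguous slice of keys, which is the chunked calculation the abstract refers to.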
In summary, this paper aims to address the accuracy and computational efficiency issues of existing methods on long-sequence WSIs by designing a local-global hybrid Transformer, with the goal of improving predictive performance, memory usage, speed, and generalization.