Abstract: Histopathology Whole Slide Image (WSI) analysis serves as the gold standard for clinical cancer diagnosis in the daily routines of doctors. To develop computer-aided diagnosis models for WSIs, previous methods typically employ Multi-Instance Learning (MIL) to enable slide-level prediction given only slide-level labels. Among these models, vanilla attention mechanisms without pairwise interactions have traditionally been employed, but they are unable to model contextual information. More recently, self-attention models have been utilized to address this issue. To alleviate the computational complexity of long sequences in large WSIs, methods like HIPT use region-slicing, and TransMIL employs an approximation of full self-attention. Both approaches suffer from suboptimal performance due to the loss of key information. Moreover, their use of absolute positional embedding struggles to effectively handle long contextual dependencies in shape-varying WSIs. In this paper, we first analyze how the low-rank nature of the long-sequence attention matrix constrains the representation ability of WSI modelling. Then, we demonstrate that the rank of the attention matrix can be improved by focusing on local interactions via a local attention mask. Our analysis shows that the local mask aligns with the attention patterns in the lower layers of the Transformer. Furthermore, the local attention mask can be implemented during chunked attention calculation, reducing the quadratic computational complexity to linear with a small local bandwidth. Building on this, we propose a local-global hybrid Transformer for both computational acceleration and modelling of local-global information interactions. Our method, Long-contextual MIL (LongMIL), is evaluated through extensive experiments on various WSI tasks to validate its superiority. Our code will be available at http://github.com/invoker-LL/Long-MIL.
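The chunked local attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the LongMIL implementation: it assumes a 1D band mask over a flat token sequence (the paper deals with 2D slide layouts), and the function names, the `bandwidth` and `chunk` parameters, and the single-head setup are all hypothetical choices made for illustration. The sketch shows that processing queries chunk by chunk, with each chunk attending only to keys within the band, gives the same output as masking a full attention matrix, while each chunk touches at most `chunk + 2 * bandwidth` keys.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_local_attention(q, k, v, bandwidth):
    """Reference version: build the full n x n score matrix, then mask it."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    # Assumed 1D band mask: token i may attend to token j iff |i - j| <= bandwidth.
    mask = np.abs(idx[:, None] - idx[None, :]) <= bandwidth
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def chunked_local_attention(q, k, v, bandwidth, chunk=64):
    """Chunked version: each query chunk only sees keys inside its band."""
    n, d = q.shape
    out = np.empty_like(v)
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        # Keys outside [lo, hi) are masked for every query in this chunk,
        # so they never need to be loaded or multiplied.
        lo = max(0, start - bandwidth)
        hi = min(n, end + bandwidth)
        scores = q[start:end] @ k[lo:hi].T / np.sqrt(d)
        rows = np.arange(start, end)[:, None]
        cols = np.arange(lo, hi)[None, :]
        scores = np.where(np.abs(rows - cols) <= bandwidth, scores, -np.inf)
        out[start:end] = softmax(scores) @ v[lo:hi]
    return out
```

With `chunk` and `bandwidth` held fixed, the number of query-key products grows as roughly `n * (chunk + 2 * bandwidth)`, which is linear in the sequence length `n`, whereas the dense reference builds all `n * n` scores before masking.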

Sparse and Hierarchical Transformer for Survival Analysis on Whole Slide Images

Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics

Generating Hypergraph-Based High-Order Representations of Whole-Slide Histopathological Images for Survival Prediction

Transformer-Based Multimodal Fusion for Survival Prediction by Integrating Whole Slide Images, Clinical, and Genomic Data

HVTSurv: Hierarchical Vision Transformer for Patient-Level Survival Prediction from Whole Slide Image

Transformer-Based Video-Structure Multi-Instance Learning for Whole Slide Image Classification

Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images

Rethinking Transformer for Long Contextual Histopathology Whole Slide Image Analysis

HIGT: Hierarchical Interaction Graph-Transformer for Whole Slide Image Analysis

PATHS: A Hierarchical Transformer for Efficient Whole Slide Image Analysis

Explainable survival analysis with uncertainty using convolution-involved vision transformer

Big-Hypergraph Factorization Neural Network for Survival Prediction From Whole Slide Image

Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification

Multi-Scale Prototypical Transformer for Whole Slide Image Classification

What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning

ConSlide: Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis

Evaluating Transformer-based Semantic Segmentation Networks for Pathological Image Segmentation

Multi-scale Efficient Graph-Transformer for Whole Slide Image Classification

Kernel Attention Transformer for Histopathology Whole Slide Image Analysis and Assistant Cancer Diagnosis

Adaptive Transformer Modelling of Density Function for Nonparametric Survival Analysis

Shared-Specific Feature Learning with Bottleneck Fusion Transformer for Multi-Modal Whole Slide Image Analysis