Abstract: Integrating whole-slide images (WSIs) and bulk transcriptomics for predicting patient survival can improve our understanding of patient prognosis. However, this multimodal task is particularly challenging due to the different nature of these data: WSIs represent a very high-dimensional spatial description of a tumor, while bulk transcriptomics represent a global description of gene expression levels within that tumor. In this context, our work aims to address two key challenges: (1) how can we tokenize transcriptomics in a semantically meaningful and interpretable way?, and (2) how can we capture dense multimodal interactions between these two modalities? Specifically, we propose to learn biological pathway tokens from transcriptomics that can encode specific cellular functions. Together with histology patch tokens that encode the different morphological patterns in the WSI, we argue that they form appropriate reasoning units for downstream interpretability analyses. We propose fusing both modalities using a memory-efficient multimodal Transformer that can model interactions between pathway and histology patch tokens. Our proposed model, SURVPATH, achieves state-of-the-art performance when evaluated against both unimodal and multimodal baselines on five datasets from The Cancer Genome Atlas. Our interpretability framework identifies key multimodal prognostic factors, and, as such, can provide valuable insights into the interaction between genotype and phenotype, enabling a deeper understanding of the underlying biological mechanisms at play. We make our code public at: https://github.com/ajv012/SurvPath.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses two key challenges in the multimodal task of predicting patient survival from whole-slide images (WSIs) and bulk transcriptomics data:

1. **How to tokenize transcriptomics data in a semantically meaningful and interpretable way?**
   - Transcriptomics data is naturally represented as a feature vector, but directly concatenating it with other modalities limits multimodal learning to late-fusion operations. The paper proposes a tokenization method based on biological pathways: genes are grouped according to known biological pathways to produce pathway tokens, each encoding a specific cellular function. This gives a more fine-grained representation and also makes the model more interpretable.
2. **How to capture dense multimodal interactions between the two modalities?**
   - Early-fusion methods can capture pairwise similarities between all tokens with a Transformer, but the high dimensionality of WSIs and the complexity of transcriptomics data make such models very costly in computation and memory. The paper introduces a unified, memory-efficient attention mechanism that models the interactions between patch tokens and pathway tokens by sharing the query, key, and value parameters across both modalities and by simplifying the attention layer to ignore patch-to-patch interactions.

### Model overview

The proposed model is called **SURVPATH**. Its main contributions are:

1. **Transcriptomics tokenizer**: Generates biological pathway tokens using existing knowledge of cell biology.
2. **SURVPATH model**: A memory-efficient and resolution-independent multimodal Transformer that integrates pathway tokens and patch tokens to predict patient survival.
3. **Multi-level interpretability framework**: Lets users understand predictions from both unimodal and cross-modal perspectives.
4. **Experimental verification**: Experiments and ablation studies on five datasets from The Cancer Genome Atlas (TCGA) demonstrate the predictive ability of SURVPATH and benchmark it against unimodal and multimodal fusion methods.

### Method overview

1. **Pathway tokenizer**:
   - **Composing pathways**: Select appropriate reasoning units, such as biological pathways, each composed of a set of genes or sub-pathways involved in a specific biological process.
   - **Encoding pathways**: Given transcriptomics measurements \( g\in\mathbb{R}^{N_G} \) over \( N_G \) genes, construct pathway-level tokens \( X^{(P)}\in\mathbb{R}^{N_P\times d} \), where \( N_P \) is the number of pathways and \( d \) is the token dimension. Each pathway is encoded by a learnable, pathway-specific multi-layer perceptron (MLP) \( \phi_i \), that is, \( x^{(P)}_i=\phi_i(g_{P_i}) \), where \( g_{P_i} \) is the expression of the genes in pathway \( P_i \).
2. **Histology patch tokenizer**:
   - Given an input WSI, extract low-dimensional patch embeddings to define patch tokens. First identify the tissue regions, then decompose them into non-overlapping patches \( h_i \). A pre-trained feature extractor \( f(\cdot) \) maps each patch to a patch embedding \( x^{(H)}_i = f(h_i) \). Finally, a learnable linear transformation projects the patch embeddings to the token dimension \( d \), giving patch tokens \( X^{(H)}\in\mathbb{R}^{N_H\times d} \).
3. **Multimodal fusion**:
   - Design an early-fusion mechanism that captures dense multimodal interactions between pathway tokens and patch tokens through Transformer attention. Specifically, concatenate the pathway and patch tokens into a sequence \( X\in\mathbb{R}^{(N_P + N_H)\times d} \) of \( (N_P + N_H) \) tokens, and obtain queries, keys, and values through three linear projections.
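
The tokenization and fusion steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the SurvPath repository for that). The assumptions are: a single `tanh` layer stands in for each pathway MLP \( \phi_i \); the toy pathway gene indices, random weights, and the function names `tokenize_pathways` and `fused_attention` are hypothetical; and "ignoring interactions between patch tokens" is read as letting pathway queries attend to every token while patch queries attend only to pathway tokens.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def tokenize_pathways(g, pathways, weights):
    """Map gene expression g (N_G,) to pathway tokens of shape (N_P, d).

    pathways[i] holds the gene indices of pathway P_i; weights[i] is a
    (len(pathways[i]), d) matrix. One tanh layer is an assumed stand-in
    for the pathway-specific MLP phi_i.
    """
    return np.stack([np.tanh(g[idx] @ w) for idx, w in zip(pathways, weights)])


def fused_attention(x_p, x_h, w_q, w_k, w_v):
    """Attention over [pathway; patch] tokens without the patch-to-patch block.

    The same projections w_q, w_k, w_v are shared by both modalities.
    """
    n_p = x_p.shape[0]
    x = np.concatenate([x_p, x_h], axis=0)      # (N_P + N_H, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = np.sqrt(q.shape[-1])
    # Pathway queries attend to all tokens: scores are (N_P, N_P + N_H).
    out_p = softmax(q[:n_p] @ k.T / scale) @ v
    # Patch queries attend to pathway tokens only: scores are (N_H, N_P).
    out_h = softmax(q[n_p:] @ k[:n_p].T / scale) @ v[:n_p]
    return np.concatenate([out_p, out_h], axis=0)


# Toy example: 12 genes, 3 overlapping pathways, 50 patch tokens, d = 8.
rng = np.random.default_rng(0)
n_genes, d, n_patches = 12, 8, 50
pathways = [np.array([0, 1, 2, 3]),
            np.array([3, 4, 5, 6, 7]),
            np.array([8, 9, 10, 11])]
g = rng.normal(size=n_genes)
weights = [rng.normal(size=(len(idx), d)) for idx in pathways]

x_p = tokenize_pathways(g, pathways, weights)   # pathway tokens, (3, 8)
x_h = rng.normal(size=(n_patches, d))           # stand-in patch tokens, (50, 8)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused = fused_attention(x_p, x_h, w_q, w_k, w_v)
print(fused.shape)  # (53, 8)
```

In this sketch the largest score matrices have shapes \( N_P \times (N_P + N_H) \) and \( N_H \times N_P \), so memory grows linearly with the number of patches \( N_H \), whereas full self-attention would need an \( N_H \times N_H \) block.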

Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction

Generating Hypergraph-Based High-Order Representations of Whole-Slide Histopathological Images for Survival Prediction

Multimodal Prototyping for cancer survival prediction

Pathology-and-genomics Multimodal Transformer for Survival Outcome Prediction

Multimodal Survival Ensemble Network: Integrating Genomic and Histopathological Insights for Enhanced Cancer Prognosis.

Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images

Hierarchical Transformer for Survival Prediction Using Multimodality Whole Slide Images and Genomics

Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images

TransSurv: Transformer-Based Survival Analysis Model Integrating Histopathological Images and Genomic Data for Colorectal Cancer.

Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis

Transformer-Based Multimodal Fusion for Survival Prediction by Integrating Whole Slide Images, Clinical, and Genomic Data

Surformer: an interpretable pattern-perceptive survival transformer for cancer survival prediction from histopathology whole slide images

Spatial transcriptomics inferred from pathology whole-slide images links tumor heterogeneity to survival in breast and lung cancer

MGCT: Mutual-Guided Cross-Modality Transformer for Survival Outcome Prediction using Integrative Histopathology-Genomic Features

Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction

Pathology-genomic fusion via biologically informed cross-modality graph learning for survival analysis

Survival Prediction Via Hierarchical Multimodal Co-Attention Transformer: A Computational Histology-Radiology Solution.

HVTSurv: Hierarchical Vision Transformer for Patient-Level Survival Prediction from Whole Slide Image

Deep Biological Pathway Informed Pathology-Genomic Multimodal Survival Prediction

Pathformer: a Biological Pathway Informed Transformer for Disease Diagnosis and Prognosis Using Multi-Omics Data.

SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival Prediction