Transformer with convolution and graph-node co-embedding: An accurate and interpretable vision backbone for predicting gene expressions from local histopathological image

Xiao Xiao,Yan Kong,Ronghan Li,Zuoheng Wang,Hui Lu
DOI: https://doi.org/10.1016/j.media.2023.103040
Abstract:Inferring gene expressions from histopathological images has long been a fascinating yet challenging task, primarily due to the substantial disparities between the two modality. Existing strategies using local or global features of histological images are suffering model complexity, GPU consumption, low interpretability, insufficient encoding of local features, and over-smooth prediction of gene expressions among neighboring sites. In this paper, we develop TCGN (Transformer with Convolution and Graph-Node co-embedding method) for gene expression estimation from H&E-stained pathological slide images. TCGN comprises a combination of convolutional layers, transformer encoders, and graph neural networks, and is the first to integrate these blocks in a general and interpretable computer vision backbone. Notably, TCGN uniquely operates with just a single spot image as input for histopathological image analysis, simplifying the process while maintaining interpretability. We validate TCGN on three publicly available spatial transcriptomic datasets. TCGN consistently exhibited the best performance (with median PCC 0.232). TCGN offers superior accuracy while keeping parameters to a minimum (just 86.241 million), and it consumes minimal memory, allowing it to run smoothly even on personal computers. Moreover, TCGN can be extended to handle bulk RNA-seq data while providing the interpretability. Enhancing the accuracy of omics information prediction from pathological images not only establishes a connection between genotype and phenotype, enabling the prediction of costly-to-measure biomarkers from affordable histopathological images, but also lays the groundwork for future multi-modal data modeling. Our results confirm that TCGN is a powerful tool for inferring gene expressions from histopathological images in precision health applications.
What problem does this paper attempt to address?