HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis

Guillaume Jaume, Paul Doucet, Andrew H. Song, Ming Y. Lu, Cristina Almagro-Pérez, Sophia J. Wagner, Anurag J. Vaidya, Richard J. Chen, Drew F.K. Williamson, Ahrong Kim, Faisal Mahmood
2024-06-24
Abstract: Spatial transcriptomics (ST) enables interrogating the molecular composition of tissue with ever-increasing resolution, depth, and sensitivity. However, costs, rapidly evolving technology, and lack of standards have constrained computational methods in ST to narrow tasks and small cohorts. In addition, the underlying tissue morphology as reflected by H&E-stained whole slide images (WSIs) encodes rich information often overlooked in ST studies. Here, we introduce HEST-1k, a collection of 1,108 spatial transcriptomic profiles, each linked to a WSI and metadata. HEST-1k was assembled using HEST-Library from 131 public and internal cohorts encompassing 25 organs, two species (Homo sapiens and Mus musculus), and 320 cancer samples from 25 cancer types. HEST-1k processing enabled the identification of 1.5 million expression-morphology pairs and 60 million nuclei. HEST-1k is tested on three use cases: (1) benchmarking foundation models for histopathology (HEST-Benchmark), (2) biomarker identification, and (3) multimodal representation learning. HEST-1k, HEST-Library, and HEST-Benchmark can be freely accessed at https://github.com/mahmoodlab/hest.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address the following major issues:

1. **Lack of Datasets and Standardization**: Existing spatial transcriptomics (ST) datasets are limited to narrow tasks and small cohorts because of high costs, rapid technological iteration, and a lack of standards. Additionally, tissue morphology information is often overlooked.
2. **Multimodal Data Analysis**: Integrating spatial transcriptomics data with Hematoxylin and Eosin (H&E) whole slide images (WSIs) makes it possible to better analyze the relationship between gene expression and morphology, and thereby to discover new morphological biomarkers.
3. **Foundation Model Benchmarking**: Evaluating different foundation models on pathology images requires a diverse benchmark dataset. Existing tasks such as Gleason scoring have reached saturation and can no longer effectively distinguish the performance of new models.

To address these issues, the authors introduce the HEST-1k dataset, which contains 1,108 spatial transcriptomic profiles, each paired with an H&E-stained whole slide image and detailed metadata. The HEST-1k dataset can be used for the following three application scenarios:

- Foundation model benchmarking (HEST-Benchmark);
- Biomarker identification;
- Multimodal representation learning.

Through these applications, HEST-1k aims to advance the development of spatial transcriptomics and pathology image analysis.
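
To make the notion of an expression-morphology pair concrete, the following is a minimal sketch of how a spot's expression profile might be matched to an image patch on the slide. It is not HEST-Library code: the function name, the `ExpressionMorphologyPair` class, the 224-pixel default patch size, and the rule of dropping patches that cross the slide border are all assumptions made for illustration. The sketch only computes patch bounding boxes in pixel coordinates and leaves the actual cropping to whichever WSI reader is in use.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExpressionMorphologyPair:
    """Hypothetical container: one patch location plus one expression profile."""
    box: tuple[int, int, int, int]   # (left, top, right, bottom) in slide pixels
    expression: tuple[float, ...]    # expression values for the spot


def pair_spots_with_patches(
    spot_xy: Sequence[tuple[float, float]],
    expression: Sequence[Sequence[float]],
    slide_size: tuple[int, int],
    patch_size: int = 224,  # assumed value, not taken from the paper
) -> list[ExpressionMorphologyPair]:
    """Pair each spot with a square patch centred on it.

    Spots whose patch would extend past the slide border are skipped
    (an illustrative choice, not a documented HEST-1k rule).
    """
    if len(spot_xy) != len(expression):
        raise ValueError("need exactly one expression profile per spot")

    width, height = slide_size
    half = patch_size // 2
    pairs = []
    for (x, y), profile in zip(spot_xy, expression):
        left = int(round(x)) - half
        top = int(round(y)) - half
        right, bottom = left + patch_size, top + patch_size
        if left < 0 or top < 0 or right > width or bottom > height:
            continue  # patch is not fully inside the slide
        pairs.append(
            ExpressionMorphologyPair(
                box=(left, top, right, bottom),
                expression=tuple(float(v) for v in profile),
            )
        )
    return pairs


# Toy usage: three spots on a 100x100 "slide"; only the central spot
# yields a patch that lies fully inside the image.
pairs = pair_spots_with_patches(
    spot_xy=[(50, 50), (2, 2), (90, 50)],
    expression=[[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]],
    slide_size=(100, 100),
    patch_size=32,
)
```

Keeping the pairing step separate from image I/O means the same boxes can feed a patch encoder for benchmarking or a multimodal model for representation learning, the use cases named in the abstract.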