STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics

Jiawen Chen, Muqing Zhou, Wenrong Wu, Jinwei Zhang, Yun Li, Didong Li
2024-06-21
Abstract: Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis and innovative applications in computational pathology and beyond.
Computer Vision and Pattern Recognition, Computation and Language, Genomics
What problem does this paper attempt to address?
The paper addresses a limitation of existing medical image-text datasets: the text is too coarse to describe each small region of a pathology image. For example, an image may cover a large tissue area containing both cancerous and healthy regions, while the accompanying text only states that it is a cancer slide, which lacks the detail needed for in-depth analysis. To close this gap, the authors introduce STimage-1K4M, a spatial transcriptomics image-gene expression dataset that provides gene expression features for each sub-tile region.

### Specific problems

1. **Limitations of existing datasets**:
   - In existing medical image-text datasets, the text usually provides a high-level summary and does not describe sub-regions of the image in detail.
   - For example, a pathology image may contain cancerous and healthy regions, but the text may simply label it a "cancer slide", with no description of specific regions.
2. **Improving the resolution of image-gene expression data**:
   - Existing datasets that combine pathology images with gene expression data are limited and do not provide gene expression information at high spatial resolution.
   - This restricts the application and development of multimodal models in computational pathology.

### Solutions

1. **STimage-1K4M dataset**:
   - The dataset contains 1,149 images derived from spatial transcriptomics data, which captures gene expression at the level of individual spatial spots.
   - Each image is broken down into smaller sub-image tiles, and each tile is paired with a 15,000-30,000 dimensional gene expression vector.
   - In total, the dataset contains 4,293,195 pairs of sub-tile images and gene expressions, providing unprecedented granularity.
2. **Characteristics of the dataset**:
   - **Diversity**: The dataset covers 10 different species and 50 different tissue types.
   - **Detailed annotations**: In addition to the images and gene expression data, it provides pathologist annotations, which reduces the effort needed to collect data with ground-truth labels.
   - **Technical sources**: The images and gene expression data come from three leading spatial transcriptomics technologies: Spatial Transcriptomics, Visium, and VisiumHD.

### Goals

- **Promote multimodal data analysis**: By providing detailed gene expression data, STimage-1K4M can support more advanced multimodal data analysis and promote the development of computational pathology and personalized medicine.
- **Improve research efficiency**: The richness and diversity of the dataset can significantly simplify data collection, letting researchers focus on developing new methods and on understanding tissue structure and gene expression patterns.

### Conclusion

The paper addresses the overly coarse text descriptions in existing medical image-text datasets by introducing STimage-1K4M, which pairs sub-tile images with high-resolution gene expression data to support multimodal data analysis and the development of computational pathology.
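
To make the tile-expression pairing described above more concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' pipeline and does not reflect the dataset's actual file layout, which this page does not describe. The function name `pair_tiles_with_expression`, the `tile_size` parameter, the (x, y) pixel-coordinate convention, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def pair_tiles_with_expression(image, spot_xy, expression, tile_size=32):
    """Crop a square tile around each spot and pair it with that spot's expression.

    Assumptions (hypothetical, for illustration only):
      image      -- (H, W, 3) array holding the pathology image
      spot_xy    -- (n_spots, 2) integer pixel centres, ordered as (x, y)
      expression -- (n_spots, n_genes) array, one row per spot
    Spots whose tile would cross the image border are skipped in this sketch.
    """
    half = tile_size // 2
    height, width = image.shape[:2]
    pairs = []
    for (x, y), genes in zip(spot_xy, expression):
        top, left = y - half, x - half
        bottom, right = top + tile_size, left + tile_size
        if top < 0 or left < 0 or bottom > height or right > width:
            continue  # tile would fall partly outside the image
        pairs.append((image[top:bottom, left:right], genes))
    return pairs


# Synthetic stand-in data; a real slide and its spot table would replace these.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
spot_xy = np.array([[64, 64], [128, 200], [5, 5]])  # last spot is too close to the border
expression = rng.poisson(1.0, size=(3, 15000))      # 15,000 genes per spot

pairs = pair_tiles_with_expression(image, spot_xy, expression)
```

Each element of `pairs` is one (tile, expression vector) example of the kind the dataset provides at scale. Skipping border spots is a simplification; padding the image would be another reasonable choice.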