UCF-MultiOrgan-Path: A Public Benchmark Dataset of Histopathologic Images for Deep Learning Model Based Organ Classification

Md Sanzid Bin Hossain, Yelena Piazza, Jacob Braun, Anthony Bilic, Michael Hsieh, Samir Fouissi, Alexander Borowsky, Hatem Kaseb, Amoy Fraser, Britney-Ann Wray, Chen Chen, Liqiang Wang, Mujtaba Husain, Dexter Hadley
DOI: https://doi.org/10.1101/2024.11.05.24316736
2024-11-06
Abstract: A pathologist makes a diagnosis by examining glass slides containing tissue samples under a light microscope. The entire tissue specimen can be stored as a Whole Slide Image (WSI) for further analysis. However, managing and manually diagnosing hundreds of images is time-consuming and requires specific expertise. As a result, there is extensive ongoing research on computer-aided diagnosis of these digitally acquired pathology images. Deep learning has gained significant attention for its effectiveness in disease classification and in segmentation of cancer cells in histopathologic images. Building a robust and accurate deep learning model requires a large number of annotated images. However, it is challenging to find enough annotated public images to validate a model or to construct a new pre-trained model based on pathology images, because annotation is labor-intensive and time-consuming, requires expert knowledge, and is constrained by privacy concerns surrounding medical data. Current public datasets are often limited to specific organs, types of cancer, or binary classification tasks, which hinders the ability of models trained on them to generalize across diverse pathology applications. This lack of diversity makes it difficult to develop models that perform well on a wide range of diseases, organs, or multiclass classification problems, limiting their use in broader real-world diagnostic scenarios. To address this limitation, we introduce the UCF multi-organ histopathologic (UCF-MultiOrgan-Path) dataset, which provides 977 WSIs from cadavers containing tissues of multiple organs such as the lung, kidney, liver, and pancreas. We constructed the dataset by filtering from ~1,700 WSIs; it covers 15 distinct organ classes and ~2.38 million patches of 512 × 512 pixels.
For technical validation, we provide two approaches: a patch-based approach for patch-level and slide-level classification, and a slide-based approach using multiple instance learning (MIL) for slide-level classification. Our dataset can be used as a benchmark for training and validating deep learning models, especially organ classification models, because it contains a large number of WSIs with millions of extracted patches representative of diverse organ classes.
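As a rough illustration of the patch-based route from patches to a slide-level label, the sketch below tiles a slide into 512 × 512 patches and combines per-patch organ predictions by majority vote. The abstract does not state whether patches overlap or how patch predictions are aggregated, so non-overlapping tiling and majority voting are assumptions made here for illustration, and the function names `patch_coordinates` and `slide_label_by_majority_vote` are hypothetical.

```python
from collections import Counter
from typing import Iterator, List, Tuple

PATCH_SIZE = 512  # patch edge length in pixels, as stated in the abstract


def patch_coordinates(
    width: int, height: int, size: int = PATCH_SIZE
) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) corners of non-overlapping patches.

    Assumption: patches do not overlap and partial patches at the
    right and bottom edges are discarded.
    """
    for y in range(0, height - size + 1, size):
        for x in range(0, width - size + 1, size):
            yield (x, y)


def slide_label_by_majority_vote(patch_labels: List[str]) -> str:
    """Aggregate patch-level predictions into one slide-level label.

    Assumption: simple majority voting; ties go to the label that
    appears first in the input.
    """
    if not patch_labels:
        raise ValueError("a slide needs at least one patch prediction")
    return Counter(patch_labels).most_common(1)[0][0]


if __name__ == "__main__":
    # A hypothetical 2048 x 1024 pixel region holds 4 x 2 patches.
    coords = list(patch_coordinates(2048, 1024))
    print(len(coords), "patches")
    # Hypothetical per-patch predictions from some classifier.
    print(slide_label_by_majority_vote(["lung", "liver", "lung"]))
```

In practice the per-patch labels would come from a trained classifier, and background patches would normally be filtered out before voting; neither step is shown here.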
Pathology