Abstract

Background: Artificial intelligence (AI) has numerous applications in pathology, supporting diagnosis and prognostication in cancer. However, most AI models are trained on highly selected data, typically one tissue slide per patient. In reality, especially for large surgical resection specimens, dozens of slides can be available for each patient. Manually sorting and labelling whole‐slide images (WSIs) is very time‐consuming, which hinders the direct application of AI to tissue samples collected from large cohorts. In this study we addressed this issue by developing a deep‐learning (DL)‐based method for automatic curation of large pathology datasets with several slides per patient.

Methods: We collected multiple large multicentric datasets of colorectal cancer histopathological slides from the United Kingdom (FOXTROT, N = 21,384 slides; CR07, N = 7985 slides) and Germany (DACHS, N = 3606 slides). These datasets contained multiple types of tissue slides, including bowel resection specimens, endoscopic biopsies, lymph node resections, immunohistochemistry‐stained slides, and tissue microarrays. We developed, trained, and tested a deep convolutional neural network model to predict the type of slide from the slide overview (thumbnail) image. The primary statistical endpoint was the macro‐averaged area under the receiver operating characteristic curve (AUROC) for detection of the type of slide.

Results: In the primary dataset (FOXTROT), the algorithm achieved an AUROC of 0.995 (95% confidence interval [CI]: 0.994–0.996) and was able to accurately predict the type of slide from the thumbnail image alone. In the two external test cohorts (CR07, DACHS), AUROCs of 0.982 (95% CI: 0.979–0.985) and 0.875 (95% CI: 0.864–0.887) were observed, which indicates that the trained model generalizes to unseen datasets. With a confidence threshold of 0.95, the model reached an accuracy of 94.6% (7331 classified cases) in the CR07 cohort and 85.1% (2752 classified cases) in the DACHS cohort.

Conclusion: Our findings show that the low‐resolution thumbnail image is sufficient to accurately classify the type of slide in digital pathology. This can help researchers make the vast resource of existing pathology archives accessible to modern AI models with only minimal manual annotation.
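The two evaluation quantities named in the abstract, the macro‐averaged AUROC and the accuracy at a confidence threshold, can be illustrated with a minimal sketch. This is not the authors' code. It assumes that "macro‐averaged" means the unweighted mean of one‐vs‐rest AUROCs over slide types, that "confidence" is the highest predicted class probability, and that slides below the threshold are left unclassified. The function names (`binary_auroc`, `macro_auroc`, `thresholded_accuracy`) and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def binary_auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(probs: List[List[float]], y_true: List[int]) -> float:
    """Unweighted mean of one-vs-rest AUROCs over all classes (assumed definition)."""
    n_classes = len(probs[0])
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in probs]
        labels = [1 if y == c else 0 for y in y_true]
        aucs.append(binary_auroc(scores, labels))
    return sum(aucs) / n_classes


def thresholded_accuracy(
    probs: List[List[float]], y_true: List[int], threshold: float = 0.95
) -> Tuple[float, int]:
    """Accuracy over predictions whose top probability reaches the threshold.

    Returns (accuracy, number of classified cases); slides below the
    threshold are treated as unclassified (an assumption of this sketch).
    """
    kept = 0
    correct = 0
    for row, y in zip(probs, y_true):
        conf = max(row)
        if conf >= threshold:
            kept += 1
            correct += int(row.index(conf) == y)
    accuracy = correct / kept if kept else float("nan")
    return accuracy, kept


# Toy example with three hypothetical slide types (0, 1, 2); not real data.
probs = [
    [0.97, 0.02, 0.01],
    [0.10, 0.85, 0.05],
    [0.02, 0.01, 0.97],
    [0.96, 0.03, 0.01],  # confident but wrong: true class is 1
    [0.40, 0.35, 0.25],
    [0.01, 0.98, 0.01],
]
y_true = [0, 1, 2, 1, 2, 1]

print("macro AUROC:", macro_auroc(probs, y_true))
print("accuracy, classified cases:", thresholded_accuracy(probs, y_true, 0.95))
```

Under these assumptions, raising the threshold trades coverage (fewer classified cases) for accuracy on the slides that remain, which is why the abstract reports the accuracy together with the number of classified cases.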

Scalable deep learning artificial intelligence histopathology slide analysis and validation

Automated curation of large‐scale cancer histopathology image datasets using deep learning
