An interpretable machine learning system for colorectal cancer diagnosis from pathology slides

Pedro C. Neto, Diana Montezuma, Sara P. Oliveira, Domingos Oliveira, João Fraga, Ana Monteiro, João Monteiro, Liliana Ribeiro, Sofia Gonçalves, Stefan Reinhard, Inti Zlobec, Isabel M. Pinto, Jaime S. Cardoso
DOI: https://doi.org/10.1038/s41698-024-00539-4
2024-03-05
npj Precision Oncology
Abstract: Considering the profound transformation affecting pathology practice, we aimed to develop a scalable artificial intelligence (AI) system to diagnose colorectal cancer from whole-slide images (WSI). For this, we propose a deep learning (DL) system that learns from weak labels, a sampling strategy that reduces the number of training samples by a factor of six without compromising performance, an approach to leverage a small subset of fully annotated samples, and a prototype with explainable predictions, active learning features and parallelisation. Noting some problems in the literature, this study is conducted with one of the largest colorectal WSI datasets, with approximately 10,500 WSIs. Of these samples, 900 are testing samples. Furthermore, the robustness of the proposed method is assessed with two additional external datasets (TCGA and PAIP) and a dataset of samples collected directly from the proposed prototype. Our proposed method predicts, for the patch-based tiles, a class based on the severity of the dysplasia and uses that information to classify the whole slide. It is trained with an interpretable mixed-supervision scheme to leverage the domain knowledge introduced by pathologists through spatial annotations. The mixed-supervision scheme allowed for an intelligent sampling strategy effectively evaluated in several different scenarios without compromising the performance. On the internal dataset, the method shows an accuracy of 93.44% and a sensitivity between positive (low-grade and high-grade dysplasia) and non-neoplastic samples of 0.996. Performance on the external test samples varied, with TCGA being the most challenging dataset, with an overall accuracy of 84.91% and a sensitivity of 0.996.
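The tile-to-slide step and the binary sensitivity reported in the abstract can be illustrated with a minimal sketch. It assumes an ordinal encoding (0 = non-neoplastic, 1 = low-grade dysplasia, 2 = high-grade dysplasia) and that a slide takes the label of its most severe tile. That max rule and the function names `slide_label` and `binary_sensitivity` are illustrative assumptions, not necessarily the authors' implementation.

```python
from typing import Sequence

# Assumed ordinal encoding (not taken from the paper):
# 0 = non-neoplastic, 1 = low-grade dysplasia, 2 = high-grade dysplasia.


def slide_label(tile_labels: Sequence[int]) -> int:
    """Aggregate tile predictions into one slide label.

    Assumed rule: the slide inherits the most severe tile class.
    """
    if not tile_labels:
        raise ValueError("a slide needs at least one tile prediction")
    return max(tile_labels)


def binary_sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Sensitivity after collapsing low- and high-grade into 'positive'."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t > 0 and p > 0)
    false_neg = sum(1 for t, p in pairs if t > 0 and p == 0)
    if true_pos + false_neg == 0:
        raise ValueError("no positive slides in y_true")
    return true_pos / (true_pos + false_neg)
```

Under this collapse, a high-grade slide predicted as low-grade still counts as detected, which is why the binary sensitivity can be much higher than the three-class accuracy.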
oncology
What problem does this paper attempt to address?
The paper aims to address several key issues in the pathological diagnosis of colorectal cancer (CRC):

1. **Data Volume and Image Resolution Issues**: Due to the enormous data volume and extremely high resolution of whole slide images (WSI), traditional deep learning methods face bottlenecks when processing such images. The paper proposes an efficient sampling strategy that reduces the number of training samples without sacrificing classification performance, thereby accelerating model training.
2. **Weak Labeling and Interpretability Issues**: By incorporating the expertise of pathologists, the model can generate pseudo-labels and use these pseudo-labels for training. Additionally, the method employs a hybrid supervision scheme, which not only ensures high accuracy but also provides a certain level of interpretability, making it easier for clinicians to understand and accept.
3. **Application of Large-Scale Datasets**: The researchers used a large-scale dataset containing approximately 10,500 high-quality WSIs, which is one of the largest available colorectal cancer sample datasets. By validating the model's performance on multiple external datasets, they ensure its generalization ability and robustness.
4. **Feasibility of Clinical Application**: To truly apply the computer-aided diagnosis system in clinical practice, the research team developed a prototype system that provides visualized prediction results and allows pathologists to give feedback, further optimizing model performance.

In summary, the paper aims to improve the accuracy and efficiency of pathological diagnosis of colorectal cancer through innovative machine learning techniques and efficient data processing methods, and to promote the application of computer-aided diagnostic tools in clinical practice.
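The sampling strategy mentioned in the first point can be sketched as follows. The text above does not state the selection rule, so this example assumes that tiles are ranked by the expected severity of their predicted class distribution and that only the top `k` per slide are kept for training. The helpers `expected_severity` and `sample_top_tiles` are hypothetical names.

```python
from typing import List, Sequence


def expected_severity(probs: Sequence[float]) -> float:
    """Expected class index under the predicted distribution.

    Assumes classes are ordered by severity, starting at index 0.
    """
    return sum(index * p for index, p in enumerate(probs))


def sample_top_tiles(tile_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k tiles with the highest expected severity.

    Assumed selection rule for illustration; ties keep original tile order.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(
        range(len(tile_probs)),
        key=lambda i: expected_severity(tile_probs[i]),
        reverse=True,
    )
    return ranked[:k]
```

With a slide of roughly six times `k` tiles, keeping only the top `k` would yield the six-fold reduction in training samples described in the abstract, although the actual tile counts are not given here.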