External validation of an artificial intelligence model for Gleason grading of prostate cancer on prostatectomy specimens

Bogdana Schmidt, Simon John Christoph Soerensen, Hriday P. Bhambhvani, Richard E. Fan, Indrani Bhattacharya, Moon Hyung Choi, Christian A. Kunder, Chia-Sui Kao, John Higgins, Mirabela Rusu, Geoffrey A. Sonn
DOI: https://doi.org/10.1111/bju.16464
2024-07-13
BJU International
Abstract:

Objectives: To externally validate the performance of the DeepDx Prostate artificial intelligence (AI) algorithm (Deep Bio Inc., Seoul, South Korea) for Gleason grading on whole-mount prostate histopathology, considering the variations that can arise when AI models trained on biopsy samples are applied to radical prostatectomy (RP) specimens, owing to inherent differences in tissue representation and sample size.

Materials and Methods: The commercially available DeepDx Prostate AI algorithm is an automated Gleason grading system that was previously trained using 1133 prostate core biopsy images and validated on 700 biopsy images from two institutions. We assessed the performance of the AI algorithm, which outputs Gleason patterns (3, 4, or 5), on 500 tiles of 1 mm² each, created from 150 whole-mount RP specimens from a third institution. These patterns were then grouped into grade groups (GGs) for comparison with expert pathologist assessments. The reference standard was the International Society of Urological Pathology GG as established by two experienced uropathologists, with a third expert adjudicating discordant cases. We defined the main metric as the agreement with the reference standard, measured using Cohen's kappa.

Results: The agreement between the two experienced pathologists in determining GGs at the tile level had a quadratically weighted Cohen's kappa of 0.94. The agreement between the AI algorithm and the reference standard in differentiating cancerous vs non-cancerous tissue had an unweighted Cohen's kappa of 0.91. Additionally, the AI algorithm's agreement with the reference standard in classifying tiles into GGs had a quadratically weighted Cohen's kappa of 0.89. In distinguishing cancerous vs non-cancerous tissue, the AI algorithm achieved a sensitivity of 0.997 and specificity of 0.88; in classifying GG ≥2 vs GG 1 and non-cancerous tissue, it demonstrated a sensitivity of 0.98 and specificity of 0.85.

Conclusion: The DeepDx Prostate AI algorithm had excellent agreement with expert uropathologists and performance in cancer identification and grading on RP specimens, despite being trained on biopsy specimens from an entirely different patient population.
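The agreement and accuracy metrics named in the abstract (unweighted and quadratically weighted Cohen's kappa, sensitivity, specificity) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The function names, the encoding of tiles as integers (0 for non-cancerous, 1 to 5 for GG), and the toy labels are all assumptions made for illustration; none of the numbers are study data.

```python
from typing import Hashable, Sequence, Tuple


def cohen_kappa(ref: Sequence[Hashable], pred: Sequence[Hashable],
                categories: Sequence[Hashable], weights: str = "none") -> float:
    """Cohen's kappa between two label sequences.

    `categories` lists every possible label in rank order.
    weights="none" treats all disagreements equally; weights="quadratic"
    penalises a disagreement by the squared distance between ranks.
    """
    if weights not in ("none", "quadratic"):
        raise ValueError("weights must be 'none' or 'quadratic'")
    n, k = len(ref), len(categories)
    if n == 0 or n != len(pred) or k < 2:
        raise ValueError("need equal-length, non-empty inputs and >= 2 categories")

    index = {c: i for i, c in enumerate(categories)}
    # Observed joint proportions: rows = reference, columns = prediction.
    observed = [[0.0] * k for _ in range(k)]
    for r, p in zip(ref, pred):
        observed[index[r]][index[p]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def penalty(i: int, j: int) -> float:
        if weights == "quadratic":
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    disagree_obs = sum(penalty(i, j) * observed[i][j] for i, j in cells)
    # Disagreement expected by chance if the two raters were independent.
    disagree_exp = sum(penalty(i, j) * row[i] * col[j] for i, j in cells)
    if disagree_exp == 0.0:
        raise ValueError("kappa is undefined when expected disagreement is zero")
    return 1.0 - disagree_obs / disagree_exp


def sensitivity_specificity(ref_positive: Sequence[bool],
                            pred_positive: Sequence[bool]) -> Tuple[float, float]:
    """Sensitivity and specificity of binary predictions against a reference."""
    pairs = list(zip(ref_positive, pred_positive))
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain both positive and negative cases")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy tile labels (assumed encoding: 0 = non-cancerous, 1..5 = grade group).
    reference_gg = [0, 0, 1, 2, 2, 3, 5]
    ai_gg = [0, 1, 1, 2, 3, 3, 4]
    grades = [0, 1, 2, 3, 4, 5]

    print("quadratic kappa (GG):",
          cohen_kappa(reference_gg, ai_gg, grades, weights="quadratic"))
    # Cancer vs non-cancer collapses the labels to two categories.
    ref_cancer = [g >= 1 for g in reference_gg]
    ai_cancer = [g >= 1 for g in ai_gg]
    print("unweighted kappa (cancer vs not):",
          cohen_kappa(ref_cancer, ai_cancer, [False, True]))
    # GG >= 2 vs GG 1 and non-cancerous tissue.
    print("sens/spec (GG >= 2):",
          sensitivity_specificity([g >= 2 for g in reference_gg],
                                  [g >= 2 for g in ai_gg]))
```

Quadratic weighting suits ordinal grades because a one-group disagreement (GG 2 vs GG 3) is penalised far less than a distant one (GG 1 vs GG 5). For the binary cancer vs non-cancer comparison there is only one possible distance, so the weighted and unweighted kappa coincide.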
urology & nephrology