AUGMENT: a framework for robust assessment of the clinical utility of segmentation algorithms
Cathal McCague,Thomas Buddenkotte,Lorena Escudero Sanchez,David Hulse,Roxana Pintican,Leonardo Rundo,Susan Freeman,Stephanie Nougaret,Stefania Rizzo,Will Loughborough,Adrian Andreou,Caron Parsons,Pubudu Piyatissa,Tony Aloysius,Carina Mouritsen Luxhoj,Iqbal Aniq,Sujil James,Balraj Dhesi,Katja De Paepe,James Tanner,Osama Abulaban,Janice Lee,Veronika Majcher,Maeve O Sullivan,Veronica Celli,Anna Colarieti,Alex Samoshkin,Evis Carcani,Syafiq Ramlee,Mohammad S. Al Sad,Simon J. Doran,Woonchan Cho,James DArcy,James D. Brenton,Dominique Laurent Couturier,Ozan Oktem,Ramona Woitek,Carola Bibiane Schoenlieb,Evis Sala,Mireia Crispin Ortuzar
DOI: https://doi.org/10.1101/2024.09.20.24313970
2024-09-23
Abstract: Background: Evaluating AI-based segmentation models primarily relies on quantitative metrics, but it remains unclear whether this approach leads to practical, clinically applicable tools.
Purpose: To create a systematic framework for evaluating the performance of segmentation models using clinically relevant criteria.
Materials and Methods: We developed the AUGMENT framework (Assessing Utility of seGMENtation Tools), based on a structured classification of the main categories of error in segmentation tasks. To evaluate the framework, we assembled a team of 20 clinicians covering a broad range of radiological expertise and analysed the challenging task of segmenting metastatic ovarian cancer using AI. We used three evaluation methods: (i) the Dice Similarity Coefficient (DSC); (ii) a visual Turing test, assessing 429 segmented disease sites on 80 CT scans from the Cancer Imaging Atlas; and (iii) the AUGMENT framework, in which three radiologists and the AI model created segmentations of 784 separate disease sites on 27 CT scans from a multi-institution dataset.
Results: The AI model had modest technical performance (DSC = 72 ± 19 for pelvic and ovarian disease, and 64 ± 24 for omental disease), and it failed the visual Turing test. However, the AUGMENT framework revealed that (i) the AI model produced segmentations of the same quality as radiologists (p = .46), and (ii) it enabled radiologists to produce human+AI collaborative segmentations of significantly higher quality (p < .001) and in significantly less time (p < .001).
Conclusion: Quantitative performance metrics of segmentation algorithms can mask their clinical utility. The AUGMENT framework enables the systematic identification of clinically usable AI models, and highlights the importance of assessing the interaction between AI tools and radiologists.
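
For readers unfamiliar with the Dice Similarity Coefficient named in the Methods, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), applied to two binary masks. It is an illustration and not the authors' evaluation code: the function name `dice_similarity`, the use of NumPy, and the convention of returning 1.0 when both masks are empty are assumptions made here. The sketch returns a value between 0 and 1, whereas the abstract appears to report DSC on a 0 to 100 scale.

```python
import numpy as np


def dice_similarity(pred, ref):
    """Dice Similarity Coefficient between two binary masks (hypothetical helper).

    Returns a value in [0, 1]; multiply by 100 for a percentage-style score.
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    if pred.shape != ref.shape:
        raise ValueError("masks must have the same shape")

    # Total number of foreground voxels in both masks.
    denom = pred.sum() + ref.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0

    # Twice the overlap divided by the summed mask sizes.
    overlap = np.logical_and(pred, ref).sum()
    return float(2.0 * overlap / denom)
```

A call such as `dice_similarity(ai_mask, radiologist_mask)` would score one AI segmentation against one reference; the mask names are placeholders. As the Conclusion notes, a score of this kind measures voxel overlap only and does not by itself indicate whether a segmentation is clinically usable.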