Artificial intelligence (AI) systems for detection of Barrett's neoplasia: time to bridge domain gaps and explore human–AI interaction

Albert Jeroen De Groof
DOI: https://doi.org/10.1055/a-2335-1331
2024-06-20
Endoscopy
Abstract: Endoscopic recognition of early Barrett's neoplasia may be difficult; lesions often only show subtle abnormalities and most endoscopists rarely encounter early Barrett's neoplasia, making them unfamiliar with the appearance of these lesions [1] [2]. Assistance for detection of Barrett's neoplasia has therefore always been one of the most appealing artificial intelligence (AI) applications. In this issue of Endoscopy, Meinikheim et al. present an ex vivo study in which they evaluated the effect of AI assistance on the performance of endoscopists in their recognition of Barrett's neoplasia on endoscopic videos [3]. The AI system was trained with endoscopic still images and video frames from 557 patients recorded with high-definition white-light endoscopy or (optical) chromoendoscopy techniques. Histopathology labels were available for 456 patients. Neoplastic images of 304 patients were additionally delineated by expert endoscopists. In a two-phase design, 96 prospectively collected endoscopic video recordings of 72 patients with Barrett's esophagus were evaluated for the presence of neoplastic lesions by a group of six expert and 16 nonexpert endoscopists. Endoscopist assessors were randomized to begin either with or without AI assistance. Performance did not change significantly in the group that began with AI assistance, yet this was probably due to a learning effect as there was no washout in between phases. In the eight nonexpert endoscopists who started without AI assistance, sensitivity and specificity increased significantly (from 70% to 78% and 67% to 73%, respectively) when AI assistance was provided in the second phase, although it did not match the level of standalone AI performance in terms of sensitivity (sensitivity 92%; specificity 69%).
"Now that the feasibility of AI systems has been confirmed, in the coming years we will need to focus on improving the robustness of these systems for their successful integration into general endoscopic practice, while also developing a deeper understanding of human–AI interaction." These findings align with observations from other studies, where endoscopists were found not to consistently adhere to AI recommendations. In a recent publication, Fockens et al. described the development of a computer-aided detection system for Barrett's neoplasia that was trained with imagery of 2506 patients from 15 different centers and tested on multiple prospective image- and video-based test sets by 112 endoscopists. Video-based detections by endoscopists increased from 67% to 79% when AI assistance was provided, while standalone AI sensitivity was 91% [4]. First, it is essential to recognize that determination of standalone AI performance is inherently prone to bias – especially for video-based applications – and may not directly relate to actual clinical performance. Acknowledging this, the interaction between humans and AI remains a relatively unexplored area in endoscopy and warrants evaluation in the upcoming years. The initiation of this evaluation by the authors, elucidating the impact of AI on the diagnostic confidence of endoscopists, merits commendation. The employed AI system in this study was designed to detect neoplasia in overview (positive detections are labeled "region at risk") followed by detailed inspection in magnification (positive detections are labeled as Barrett's esophagus-related neoplasia or "BERN"). This is in line with the procedural workflow of a Barrett's surveillance endoscopy, where visible abnormalities are first detected in overview, followed by detailed inspection of the region of interest. Unfortunately, detailed information on the exact composition and content of the videos in the test set (e.g. 
the distribution of overview versus magnified imagery) is not provided. It is therefore not clear how many lesions were picked up by the AI system in overview, which from a clinical perspective is the essential first step. In this regard, training and testing primary detection systems with prospective imagery without specific focus on any lesion (i.e. mimicking a setting where lesions are generally overlooked) is crucial in preventing important hidden bias that is inherent in the use of retrospectively collected imagery, which is generally focused on the lesion ([Fig. 1]) [4]. Although this is not explicitly described in the manuscript, the AI system was apparently trained with retrospectively collected data from multiple centers. The test set was prospectively collected in a single center. Hence, caution is warranted when interpreting results in terms of generalizability, as the authors rightfully acknowledge. Lack of generalizability of results is a common problem in preclinical AI studies. Generally, AI systems are developed by the use of selected, high quality imagery collected at a limited number of centers, which leads to homogeneous training data. How -Abstract Truncated-
gastroenterology & hepatology,surgery