REAL-Colon: A dataset for developing real-world AI applications in colonoscopy

Carlo Biffi, Giulio Antonelli, Sebastian Bernhofer, Cesare Hassan, Daizen Hirata, Mineo Iwatate, Andreas Maieron, Pietro Salvagnini, Andrea Cherubini
DOI: https://doi.org/10.1038/s41597-024-03359-0
2024-05-25
Scientific Data
Abstract: Detection and diagnosis of colon polyps are key to preventing colorectal cancer. Recent evidence suggests that AI-based computer-aided detection (CADe) and computer-aided diagnosis (CADx) systems can enhance endoscopists' performance and boost colonoscopy effectiveness. However, most available public datasets primarily consist of still images or video clips, often at a down-sampled resolution, and do not accurately represent real-world colonoscopy procedures. We introduce the REAL-Colon (Real-world multi-center Endoscopy Annotated video Library) dataset: a compilation of 2.7 M native video frames from sixty full-resolution, real-world colonoscopy recordings across multiple centers. The dataset contains 350k bounding-box annotations, each created under the supervision of expert gastroenterologists. Comprehensive patient clinical data, colonoscopy acquisition information, and polyp histopathological information are also included in each video. With its unprecedented size, quality, and heterogeneity, the REAL-Colon dataset is a unique resource for researchers and developers aiming to advance AI research in colonoscopy. Its openness and transparency facilitate rigorous and reproducible research, fostering the development and benchmarking of more accurate and reliable colonoscopy-related algorithms and models.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper addresses polyp detection and diagnosis in colonoscopy, and aims to promote the application of artificial intelligence (AI) in this setting by introducing a new, high-quality, real-world, multi-center dataset (REAL-Colon). Specifically, the paper focuses on the following points:

1. **Background and Challenges**: Colorectal cancer (CRC) is a significant global health issue, with approximately 2 million new cases each year. More than 95% of colorectal cancers originate from precancerous adenomatous polyps, so timely detection and removal of these polyps can significantly reduce CRC incidence and mortality. However, colonoscopy quality varies with operator skill.
2. **Limitations of Existing Datasets**: Currently available public datasets mostly consist of still images or video clips, often at a down-sampled resolution, and do not accurately reflect real colonoscopy procedures. This limits the real-world performance of AI models trained on them.
3. **Features of the REAL-Colon Dataset**: The dataset comprises 60 full-resolution, real-world colonoscopy recordings from multiple medical centers, totaling 2.7 million frames, with 350k bounding-box annotations of polyps created under the supervision of expert gastroenterologists. It also includes detailed patient clinical data, colonoscopy acquisition information, and polyp histopathological information.
4. **Application Value of the Dataset**: Owing to its scale, quality, and heterogeneity, the REAL-Colon dataset is a unique resource for researchers and developers advancing AI research in colonoscopy. It supports the development and benchmarking of more accurate and reliable colonoscopy algorithms and models.

In summary, the paper aims to improve the performance and reliability of AI systems in colonoscopy by constructing and releasing the REAL-Colon dataset, thereby helping to bridge the gap between open research and privately funded research.