OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

Ming Hu, Peng Xia, Lin Wang, Siyuan Yan, Feilong Tang, Zhongxing Xu, Yimin Luo, Kaimin Song, Jurgen Leitner, Xuelian Cheng, Jun Cheng, Chi Liu, Kaijing Zhou, Zongyuan Ge
2024-07-19
Abstract: Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase categories, and absence of time-localized annotations. These limitations impede action understanding and model generalization validation in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations. 2) Sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability. 3) Time-localized annotations, facilitating temporal localization and prediction tasks within surgical workflows. With approximately 285 hours of surgical videos, OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark. Code and dataset are available at: https://minghu0830.github.io/OphNet-benchmark/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the lack of large-scale, diverse, and finely annotated datasets for ophthalmic surgical video analysis. Specifically, existing surgical video datasets have the following deficiencies:

1. **Small scale**: Most existing video datasets contain no more than 100 videos. For example, the CATARACTS [23] and CatRelDet [23] datasets contain only 50 and 21 surgical videos, respectively, which is insufficient for large-scale validation.
2. **Limited surgery and phase categories**: Almost all ophthalmic surgical video datasets cover only cataract surgery and do not distinguish specific surgery types. The number of phase categories is also limited. For example, CatRelDet [23] contains only 4 distinct phase labels, which falls short of what evaluation in a clinical setting requires.
3. **Coarse-grained annotation**: Because annotation is expensive, existing benchmarks usually define actions coarsely. For example, adhesive injection may occur in two different phases, main incision and capsulotomy, so the same action may be assigned to different phase categories. Such coarse-grained definitions can introduce annotation bias.
4. **Single-time-boundary annotation**: These datasets annotate only designated phases in each video, ignoring the continuity between successive phases of ophthalmic surgery and the hierarchical relationships among surgeries, phases, and operations. For example, LensID [22] is limited to a binary classification task that separates intraocular lens implantation from all other phases.
5. **Uniform domain**: The videos are carefully curated. This ensures video quality, but the resulting uniform style is poorly suited to testing a model's domain generalization ability.
To address these problems, the authors introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. Its main features are:

- **Large scale and diversity**: OphNet is presented as the largest and most richly annotated surgical workflow analysis dataset to date. It contains about 20 times as many videos as the largest existing ophthalmic surgery benchmark and covers 66 types of ophthalmic surgery (including cataract, glaucoma, and corneal surgeries), 102 unique surgical phases, and 150 fine-grained operations.
- **Fine-grained, sequential, and hierarchical annotation**: Each video is annotated with 22 operations on average, and annotations are provided at the surgery, phase, and operation levels so that models can be trained for specific challenges.
- **Expert-level manual annotation**: The annotation was carried out by ten experienced ophthalmologists and five professionals with ophthalmic experience, which supports the quality and professionalism of OphNet.

By constructing OphNet, the authors hope to advance intelligent systems for surgical workflow analysis, particularly in robotic surgery, telesurgery, and AI-assisted surgery.
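
One way to picture the hierarchical, time-localized annotation described above is as nested records: a video carries a surgery label, each phase is a time span within the video, and each operation is a time span within its phase. The following is a minimal sketch under that assumption. The class names, field names, `video_id`, and timestamps are all hypothetical and do not reflect OphNet's actual file format; the phase and operation labels reuse the adhesive-injection example from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """A labeled time span in seconds (hypothetical schema, not OphNet's)."""
    label: str
    start: float
    end: float

    def contains(self, t: float) -> bool:
        # Half-open interval so adjacent segments never both claim a boundary.
        return self.start <= t < self.end


@dataclass
class Phase(Segment):
    """A phase is a segment that also holds its fine-grained operations."""
    operations: List[Segment] = field(default_factory=list)


@dataclass
class SurgeryVideo:
    video_id: str
    surgery_type: str
    phases: List[Phase] = field(default_factory=list)

    def labels_at(self, t: float) -> Tuple[str, Optional[str], Optional[str]]:
        """Return the (surgery, phase, operation) labels active at time t."""
        for phase in self.phases:
            if phase.contains(t):
                op = next((o.label for o in phase.operations if o.contains(t)), None)
                return self.surgery_type, phase.label, op
        return self.surgery_type, None, None

    def is_consistent(self) -> bool:
        """Phases must be ordered and non-overlapping; operations must nest inside their phase."""
        prev_end = 0.0
        for phase in self.phases:
            if phase.start < prev_end or phase.end <= phase.start:
                return False
            for op in phase.operations:
                if op.start < phase.start or op.end > phase.end or op.end <= op.start:
                    return False
            prev_end = phase.end
        return True


# Illustrative record: the same operation label appears under two phases.
video = SurgeryVideo(
    video_id="case_0001",  # hypothetical identifier
    surgery_type="cataract surgery",
    phases=[
        Phase("main incision", 0.0, 40.0, [Segment("adhesive injection", 25.0, 40.0)]),
        Phase("capsulotomy", 40.0, 95.0, [Segment("adhesive injection", 40.0, 52.0)]),
    ],
)

print(video.labels_at(30.0))
print(video.labels_at(45.0))
print(video.is_consistent())
```

Because every operation is stored under its parent phase, the two occurrences of the same operation label stay distinguishable by phase, which is the kind of ambiguity the coarse-grained benchmarks above cannot resolve.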