Abstract: Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase categories, and absence of time-localized annotations. These limitations impede action understanding and model generalization validation in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations. 2) Sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability. 3) Time-localized annotations, facilitating temporal localization and prediction tasks within surgical workflows. With approximately 285 hours of surgical videos, OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark. Code and dataset are available at: https://minghu0830.github.io/OphNet-benchmark/

What problem does this paper attempt to address?

This paper addresses the lack of large-scale, diverse, and finely annotated datasets for ophthalmic surgical video analysis. Specifically, existing surgical video datasets have the following deficiencies:

1. **Small scale**: Most existing video datasets contain no more than 100 videos. For example, the CATARACTS [23] and CatRelDet [23] datasets contain only 50 and 21 surgical videos respectively, which is insufficient for large-scale validation.
2. **Limited surgery and phase categories**: Almost all ophthalmic surgical video datasets include only cataract surgeries and do not distinguish specific surgery types. The number of phase categories is also limited. For example, CatRelDet [23] contains only 4 different phase labels, which cannot meet the evaluation requirements of a clinical environment.
3. **Coarse-grained annotation**: Because annotation is expensive, existing benchmarks usually define actions coarsely. For example, adhesive injection may occur in two different phases, main incision and capsulotomy, so the same action may be assigned to different phase categories. Such coarse-grained definitions can introduce annotation bias.
4. **Single-time-boundary annotation**: These datasets annotate only selected phases in each video, ignoring the continuity between phases of ophthalmic surgery and the hierarchical relationships among surgeries, phases, and operations. For example, LensID [22] is limited to a binary classification task that distinguishes intraocular lens implantation from all other phases.
5. **Uniform domain**: The videos are carefully curated. This ensures video quality, but the uniform style makes the datasets poorly suited to testing a model's domain generalization ability.

To solve these problems, the authors introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. Its features are as follows:

- **Large scale and diversity**: OphNet is presented as the largest and most richly annotated surgical workflow analysis dataset to date. It is about 20 times larger than the largest existing surgical workflow analysis benchmark and covers 66 types of ophthalmic surgery (cataract, glaucoma, and corneal surgeries), with 102 unique surgical phases and 150 fine-grained operations.
- **Fine-grained, sequential, and hierarchical annotation**: Each video is annotated with 22 operations on average, with annotations at the surgery, phase, and operation levels to support training models for specific challenges.
- **Expert-level manual annotation**: The annotation was carried out by ten experienced ophthalmologists and five professionals with ophthalmic experience, which supports the quality and professionalism of OphNet.

By constructing OphNet, the authors hope to promote the development of intelligent systems for surgical workflow analysis, especially in robotic surgery, telesurgery, and AI-assisted surgery.
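To make the idea of hierarchical, time-localized annotation concrete, here is a minimal Python sketch of how one video's labels could be represented and queried. It is not OphNet's actual file format or API: the class names, field names, video ID, surgery type, and timestamps are all hypothetical, and only the phase and operation labels are taken from the example in the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """A time-localized label covering [start_s, end_s) in seconds."""
    label: str
    start_s: float
    end_s: float

    def contains(self, t: float) -> bool:
        return self.start_s <= t < self.end_s


@dataclass
class Phase(Segment):
    """A phase segment that owns the finer-grained operations inside it."""
    operations: List[Segment] = field(default_factory=list)


@dataclass
class SurgeryVideo:
    """Top level of the hierarchy: surgery -> phases -> operations."""
    video_id: str
    surgery_type: str
    phases: List[Phase] = field(default_factory=list)

    def labels_at(self, t: float) -> Tuple[Optional[str], Optional[str]]:
        """Return (phase label, operation label) active at time t, or None where unlabeled."""
        for phase in self.phases:
            if phase.contains(t):
                op = next((o.label for o in phase.operations if o.contains(t)), None)
                return phase.label, op
        return None, None


# Hypothetical example: all IDs and timestamps are made up for illustration.
video = SurgeryVideo(
    video_id="demo_0001",
    surgery_type="cataract surgery",
    phases=[
        Phase("main incision", 0.0, 40.0,
              [Segment("adhesive injection", 25.0, 40.0)]),
        Phase("capsulotomy", 40.0, 95.0,
              [Segment("adhesive injection", 40.0, 50.0)]),
    ],
)

print(video.labels_at(30.0))
print(video.labels_at(45.0))
```

Because each operation is stored under its parent phase, the same operation label can appear in two phases without ambiguity, which is the situation the coarse-grained annotation point describes.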

OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
