Abstract: Estimating precise metric depth and reconstructing the scene from monocular endoscopy is a fundamental task for surgical navigation in robotic surgery. However, traditional stereo matching relies on binocular images to perceive depth, which makes it difficult to transfer to soft-robotics-based surgical systems that use a monocular endoscope. In this paper, we present a novel framework that combines robot kinematics and monocular endoscope images with deep unsupervised learning in a single network for metric depth estimation, and then achieves 3D reconstruction of complex anatomy. Specifically, we first obtain relative depth maps of surgical scenes using a brightness-aware monocular depth estimation method. Then, the corresponding endoscope poses are computed by non-linear optimization of geometric and photometric reprojection residuals. Afterwards, we develop a Depth-driven Sliding Optimization (DDSO) algorithm to extract the scaling coefficient offline from the kinematics and the calculated poses. By coupling the metric scale with the relative depth data, we form a robust ensemble that represents metric and consistent depth. Next, we treat the ensemble as supervisory labels to train a metric depth estimation network for surgeries (i.e., MetricDepthS-Net) that distills the embeddings from the robot kinematics, endoscopic videos, and poses. With accurate metric depth estimation, we use a dense visual reconstruction method to recover the 3D structure of the whole surgical site. We have extensively evaluated the proposed framework on the public SCARED dataset and achieved performance comparable to stereo-based depth estimation methods. Our results demonstrate the feasibility of recovering metric depth and 3D structure from monocular inputs.
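The scale-recovery step can be illustrated with a minimal sketch. This is not the DDSO algorithm, whose details the abstract does not give. It is a simple stand-in that assumes per-frame translation vectors are available from both the robot kinematics (metric) and the visually estimated poses (up to scale), fits a least-squares scale in each sliding window, and takes the median across windows. The function names `window_scale`, `sliding_scale`, and `to_metric`, and the window length, are hypothetical.

```python
import numpy as np

def window_scale(kin_t, vis_t):
    """Least-squares scale s minimizing sum (|kin_i| - s * |vis_i|)^2 in one window.

    kin_t: (N, 3) metric inter-frame translations from kinematics (assumed input).
    vis_t: (N, 3) up-to-scale inter-frame translations from estimated poses (assumed input).
    """
    kin_n = np.linalg.norm(kin_t, axis=1)
    vis_n = np.linalg.norm(vis_t, axis=1)
    denom = float(vis_n @ vis_n)
    # A window with (almost) no visual motion carries no scale information.
    return float(kin_n @ vis_n) / denom if denom > 1e-12 else float("nan")

def sliding_scale(kin_t, vis_t, window=5):
    """Median of per-window scales; the median damps windows with noisy poses."""
    kin_t = np.asarray(kin_t, dtype=float)
    vis_t = np.asarray(vis_t, dtype=float)
    if kin_t.shape != vis_t.shape or len(kin_t) < window:
        raise ValueError("need matching inputs with at least `window` frames")
    scales = [
        window_scale(kin_t[i:i + window], vis_t[i:i + window])
        for i in range(len(kin_t) - window + 1)
    ]
    return float(np.nanmedian(scales))

def to_metric(rel_depth, scale):
    """Couple the scalar metric scale with a relative depth map."""
    return scale * np.asarray(rel_depth, dtype=float)
```

With a scale obtained this way, `to_metric(rel_depth, scale)` yields a depth map in the units of the kinematics, which is the kind of pseudo-label the abstract describes using as supervision.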

DARES: Depth Anything in Robotic Endoscopic Surgery with Self-supervised Vector-LoRA of the Foundation Model

Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

EndoDAC: Efficient Adapting Foundation Model for Self-Supervised Depth Estimation from Any Endoscopic Camera

Distilled Visual and Robot Kinematics Embeddings for Metric Depth Estimation in Monocular Scene Reconstruction

Learning How To Robustly Estimate Camera Pose in Endoscopic Videos

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy

Self-Supervised Siamese Learning on Stereo Image Pairs for Depth Estimation in Robotic Surgery

Self-Supervised Learning for Monocular Depth Estimation on Minimally Invasive Surgery Scenes

Surgical Depth Anything: Depth Estimation for Surgical Scenes using Foundation Models

Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

Benchmarking Robustness of Endoscopic Depth Estimation with Synthetically Corrupted Data

EndoDepth: A Benchmark for Assessing Robustness in Endoscopic Depth Prediction

Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation

Self-Supervised Monocular Depth and Ego-Motion Estimation in Endoscopy: Appearance Flow to the Rescue

Self-Supervised Monocular Depth Estimation for Endoscopic Imaging

Unsupervised-Learning-Based Continuous Depth and Motion Estimation with Monocular Endoscopy for Virtual Reality Minimally Invasive Surgery

SEDMamba: Enhancing Selective State Space Modelling with Bottleneck Mechanism and Fine-to-Coarse Temporal Fusion for Efficient Error Detection in Robot-Assisted Surgery

MonoLoT: Self-Supervised Monocular Depth Estimation in Low-Texture Scenes for Automatic Robotic Endoscopy

Augmented Reality for Depth Cues in Monocular Minimally Invasive Surgery