Abstract: Estimating precise metric depth and reconstructing the scene from monocular endoscopy are fundamental tasks for surgical navigation in robotic surgery. However, traditional stereo matching relies on binocular images to perceive depth, which makes it difficult to transfer to soft-robotics-based surgical systems that use monocular endoscopes. In this paper, we present a novel framework that combines robot kinematics and monocular endoscope images with deep unsupervised learning in a single network for metric depth estimation, and then achieves 3D reconstruction of complex anatomy. Specifically, we first obtain relative depth maps of surgical scenes by leveraging a brightness-aware monocular depth estimation method. Then, the corresponding endoscope poses are computed by non-linear optimization of geometric and photometric reprojection residuals. Afterwards, we develop a Depth-driven Sliding Optimization (DDSO) algorithm to extract the scaling coefficient offline from the kinematics and the computed poses. By coupling the metric scale with the relative depth data, we form a robust ensemble that represents metric and consistent depth. Next, we treat the ensemble as supervisory labels to train a metric depth estimation network for surgeries (i.e., MetricDepthS-Net) that distills the embeddings from the robot kinematics, endoscopic videos, and poses. With accurate metric depth estimation, we use a dense visual reconstruction method to recover the 3D structure of the whole surgical site. We have extensively evaluated the proposed framework on the public SCARED dataset and achieved performance comparable to that of stereo-based depth estimation methods. Our results demonstrate the feasibility of the proposed approach for recovering metric depth and 3D structure from monocular inputs.
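The scale-recovery idea described above can be illustrated with a minimal sketch. This is not the DDSO algorithm itself, whose details the abstract does not give. It assumes only that the robot kinematics provide camera positions in metric units, that the visually estimated poses provide positions for the same frames up to an unknown scale, and that a simple sliding-window ratio of path lengths is an acceptable stand-in. The function names `step_lengths`, `estimate_scale`, and `to_metric_depth`, and the `window` parameter, are hypothetical.

```python
import math
from statistics import median


def step_lengths(positions):
    """Euclidean distance travelled between consecutive camera positions."""
    return [math.dist(a, b) for a, b in zip(positions, positions[1:])]


def estimate_scale(kin_positions, vis_positions, window=5, eps=1e-9):
    """Estimate the metric scale of an up-to-scale visual trajectory.

    kin_positions: metric camera positions from robot kinematics (assumed input).
    vis_positions: positions from visual pose estimation, unknown scale.
    For each sliding window, the ratio of kinematic to visual path length
    is one scale candidate; the median over all windows is returned.
    """
    if len(kin_positions) != len(vis_positions):
        raise ValueError("trajectories must cover the same frames")
    kin_steps = step_lengths(kin_positions)
    vis_steps = step_lengths(vis_positions)
    if len(kin_steps) < window:
        raise ValueError("not enough frames for one window")

    ratios = []
    for start in range(len(kin_steps) - window + 1):
        kin_len = sum(kin_steps[start:start + window])
        vis_len = sum(vis_steps[start:start + window])
        if vis_len > eps:  # skip windows where the camera barely moved
            ratios.append(kin_len / vis_len)
    if not ratios:
        raise ValueError("visual trajectory has no measurable motion")
    return median(ratios)


def to_metric_depth(relative_depth, scale):
    """Apply the scale to a relative depth map given as a list of rows."""
    return [[scale * d for d in row] for row in relative_depth]


# Synthetic example: the visual trajectory is twice as long as the metric one.
vis = [(float(i), 0.0, 0.0) for i in range(10)]
kin = [(0.5 * i, 0.0, 0.0) for i in range(10)]
scale = estimate_scale(kin, vis)
metric_depth = to_metric_depth([[1.0, 2.0]], scale)
```

Path lengths are used rather than raw coordinates because they are unchanged by a rigid transform between the kinematic and visual reference frames, so the sketch needs no hand-eye calibration. The median is one simple way to limit the influence of windows with poor pose estimates.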

SVT-SDE: Spatiotemporal Vision Transformers-Based Self-Supervised Depth Estimation in Stereoscopic Surgical Videos

Self-Supervised Siamese Learning on Stereo Image Pairs for Depth Estimation in Robotic Surgery

E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception

Spatio-Temporal Segmentation with Depth-Inferred Videos of Static Scenes

Spatio-Temporal Depth Recovery of Dynamic Scenes with Multiple Handheld Cameras

High-Quality Depth Recovery Via Interactive Multi-view Stereo

Dense monocular depth estimation for stereoscopic vision based on pyramid transformer and multi-scale feature fusion

Spatio-Temporal Video Segmentation of Static Scenes and Its Applications

Self-Supervised Generative Adversarial Network for Depth Estimation in Laparoscopic Images

SPDET: Edge-Aware Self-Supervised Panoramic Depth Estimation Transformer With Spherical Geometry

Learning Stereo Depth Estimation with Bio-Inspired Spike Cameras

STS: Surround-view Temporal Stereo for Multi-view 3D Detection

Self-Supervised Depth Estimation in Laparoscopic Image using 3D Geometric Consistency

Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers

WS-SfMLearner: Self-supervised Monocular Depth and Ego-motion Estimation on Surgical Videos with Unknown Camera Parameters

Generalizable stereo depth estimation with masked image modelling

Distilled Visual and Robot Kinematics Embeddings for Metric Depth Estimation in Monocular Scene Reconstruction

SST: Real-time End-to-end Monocular 3D Reconstruction via Sparse Spatial-Temporal Guidance

Self-Supervised Monocular Depth Estimation With Positional Shift Depth Variance and Adaptive Disparity Quantization

Stereo Matching by Self-supervision of Multiscopic Vision

Details preserved unsupervised depth estimation by fusing traditional stereo knowledge from laparoscopic images