NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation

Yiran Wang, Min Shi, Jiaqi Li, Chaoyi Hong, Zihao Huang, Juewen Peng, Zhiguo Cao, Jianming Zhang, Ke Xian, Guosheng Lin

2024-10-04

Abstract: Video depth estimation aims to infer temporally consistent depth. One approach is to finetune a single-image model on each video with geometry constraints, which proves inefficient and lacks robustness. An alternative is learning to enforce consistency from data, which requires well-designed models and sufficient video depth data. To address both challenges, we introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner. We also elaborate a large-scale Video Depth in the Wild (VDW) dataset, which contains 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Additionally, a bidirectional inference strategy is designed to improve consistency by adaptively fusing forward and backward predictions. We instantiate a model family ranging from small to large scales for different applications. The method is evaluated on VDW dataset and three public benchmarks. To further prove the versatility, we extend NVDS+ to video semantic segmentation and several downstream applications like bokeh rendering, novel view synthesis, and 3D reconstruction. Experimental results show that our method achieves significant improvements in consistency, accuracy, and efficiency. Our work serves as a solid baseline and data foundation for learning-based video depth estimation. Code and dataset are available at: https://github.com/RaymondWang987/NVDS

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses two core issues in video depth estimation:

1. **Temporal Consistency**: Depth predicted independently for consecutive frames often flickers, which indicates poor temporal consistency.
2. **Efficiency and Robustness**: One line of existing methods fine-tunes a single-image depth model on each video at test time with geometry constraints. This is computationally expensive and relies on camera pose estimation, which reduces robustness. The alternative, learning consistency from data, requires well-designed models and sufficient video depth data.

To address these issues, the paper proposes a framework named NVDS+ with the following features:

- It stabilizes the inconsistent depth produced by different single-image depth models in a plug-and-play manner, without additional training.
- A large-scale natural-scene video depth dataset, Video Depth in the Wild (VDW), is introduced. It contains 14,203 videos with more than 2 million frames, making it the largest natural-scene video depth dataset to date.
- A bidirectional inference strategy is designed, with an optical flow-based fusion method that adaptively combines forward and backward predictions to further improve temporal consistency.
- A model family ranging from small to large is implemented to meet the needs of different application scenarios.
- NVDS+ is extended to video semantic segmentation, demonstrating its generality across dense prediction tasks.

With these components, NVDS+ significantly improves the temporal consistency, accuracy, and efficiency of depth estimation on the VDW dataset and three public benchmarks.
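The idea of fusing forward and backward predictions can be illustrated with a minimal sketch. This is not the NVDS+ implementation: the function names `fuse_bidirectional` and `mean_abs_temporal_diff`, the per-pixel confidence maps, and the confidence-weighted average are all assumptions made for illustration, and the sketch omits the optical-flow alignment that the summary mentions. The second function is a crude flicker proxy that only makes sense for a static camera and scene, since it applies no motion compensation.

```python
from typing import List

# One H x W depth map stored as nested lists (hypothetical representation).
Depth = List[List[float]]


def fuse_bidirectional(
    fwd: Depth,
    bwd: Depth,
    conf_fwd: Depth,
    conf_bwd: Depth,
    eps: float = 1e-8,
) -> Depth:
    """Blend forward-pass and backward-pass depth for one frame.

    Each pixel is a confidence-weighted average of the two predictions.
    The confidence maps are an assumption of this sketch, not something
    the source describes.
    """
    fused: Depth = []
    for f_row, b_row, cf_row, cb_row in zip(fwd, bwd, conf_fwd, conf_bwd):
        row = []
        for f, b, cf, cb in zip(f_row, b_row, cf_row, cb_row):
            total = cf + cb
            if total < eps:
                # No confidence in either direction: fall back to a plain mean.
                row.append(0.5 * (f + b))
            else:
                row.append((cf * f + cb * b) / total)
        fused.append(row)
    return fused


def mean_abs_temporal_diff(frames: List[Depth]) -> float:
    """Mean |d_t - d_{t-1}| over all pixels and consecutive frame pairs.

    A rough flicker indicator for static content only; it does not warp
    frames with optical flow.
    """
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for p_row, c_row in zip(prev, cur):
            for p, c in zip(p_row, c_row):
                total += abs(c - p)
                count += 1
    return total / count if count else 0.0
```

In this sketch, equal confidences reduce the fusion to a simple average of the two directions, while a higher confidence for one direction pulls the fused value toward that prediction.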


CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth

Depth Generation Network: Estimating Real World Depth From Stereo And Depth Images

Towards Practical Consistent Video Depth Estimation.

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

Boosting Monocular Depth Estimation with Sparse Guided Points

Video Depth without Video Models

Consistent depth maps recovery from a video sequence.

DRO: Deep Recurrent Optimizer for Video to Depth

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Depth Any Video with Scalable Synthetic Data

Non-parametric Depth Distribution Modelling based Depth Inference for Multi-view Stereo

Learning Temporally Consistent Video Depth from Video Diffusion Priors

Monocular Visual-Inertial Depth Estimation

Exploiting Temporal Consistency for Real-Time Video Depth Estimation

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

FutureDepth: Learning to Predict the Future Improves Video Depth Estimation

NR-MVSNet: Learning Multi-View Stereo Based on Normal Consistency and Depth Refinement

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

3DVNet: Multi-View Depth Prediction and Volumetric Refinement

Unsupervised Scale-Consistent Depth Learning from Video