NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation

Yiran Wang,Min Shi,Jiaqi Li,Chaoyi Hong,Zihao Huang,Juewen Peng,Zhiguo Cao,Jianming Zhang,Ke Xian,Guosheng Lin
2024-10-04
Abstract:Video depth estimation aims to infer temporally consistent depth. One approach is to finetune a single-image model on each video with geometry constraints, which proves inefficient and lacks robustness. An alternative is learning to enforce consistency from data, which requires well-designed models and sufficient video depth data. To address both challenges, we introduce NVDS+ that stabilizes inconsistent depth estimated by various single-image models in a plug-and-play manner. We also elaborate a large-scale Video Depth in the Wild (VDW) dataset, which contains 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Additionally, a bidirectional inference strategy is designed to improve consistency by adaptively fusing forward and backward predictions. We instantiate a model family ranging from small to large scales for different applications. The method is evaluated on VDW dataset and three public benchmarks. To further prove the versatility, we extend NVDS+ to video semantic segmentation and several downstream applications like bokeh rendering, novel view synthesis, and 3D reconstruction. Experimental results show that our method achieves significant improvements in consistency, accuracy, and efficiency. Our work serves as a solid baseline and data foundation for learning-based video depth estimation. Code and dataset are available at: <a class="link-external link-https" href="https://github.com/RaymondWang987/NVDS" rel="external noopener nofollow">this https URL</a>
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve This paper primarily aims to address two core issues in video depth estimation: 1. **Temporal Consistency**: Existing video depth estimation methods often exhibit flickering when predicting depth across consecutive frames, indicating poor temporal consistency. 2. **Efficiency and Robustness**: Current methods mostly require fine-tuning the single-frame depth model during the testing phase, which is computationally expensive and relies on camera pose estimation, leading to insufficient robustness. Specifically, the paper proposes a framework named NVDS+ with the following features: - It can be directly applied to different single-frame depth models without additional training. - A large-scale natural scene video depth dataset (VDW) is introduced, containing over 14,203 videos and more than 2 million frames, making it the largest natural scene video depth dataset to date. - A bidirectional inference strategy is designed, and an optical flow-based fusion method is introduced to further improve temporal consistency. - A model family ranging from small to large is implemented to meet the needs of different application scenarios. - NVDS+ is extended to the video semantic segmentation task, demonstrating its generality in dense prediction tasks. Through these methods, NVDS+ significantly improves the temporal consistency, accuracy, and efficiency of depth estimation in multiple benchmark tests.