Combing Transformer and CNN for Monocular Depth Estimation

Chuanwu Ling, Xiaogang Zhang, Hua Chen, Xiaoyu Zhu, Wenbin Yan
DOI: https://doi.org/10.1109/cac57257.2022.10055348
2022-01-01
Abstract: This study addresses the problem of monocular depth estimation. Because vision transformers excel at modeling long-range correlations, we propose using the Swin Transformer to model global context for accurate depth estimation. In addition, we add a CNN branch to help the network capture local information. Ablation experiments verify the effectiveness of the CNN branch in capturing local features. Experimental results on the NYUv2 dataset show that our method outperforms existing state-of-the-art monocular depth estimation methods.