TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation

Shaokang Li, Chengzhi Lyu, Bin Xia, Ziheng Chen, Lei Zhang
DOI: https://doi.org/10.1007/s00371-024-03332-3
IF: 2.835
2024-03-30
The Visual Computer
Abstract: Self-supervised monocular depth estimation has shown promising results; it trains on image sequences instead of ground-truth depth, which is difficult to obtain. Most current self-supervised depth estimation frameworks are based on fully convolutional or transformer architectures, and hybrid architectures have received little discussion. In this paper, we propose TAMDepth, a novel framework that effectively captures both the local and global features of image sequences by combining convolutional blocks and transformer blocks. TAMDepth adopts multi-scale feature fusion convolutional modules to capture local details in shallow layers, while transformer blocks build global dependencies in higher layers. Furthermore, to enhance the representational capacity of the architecture, we introduce an adapter modulation that injects a spatial prior into the transformer blocks through cross-attention, which improves the model's ability to represent the scene. Experiments demonstrate that our model achieves state-of-the-art performance on the KITTI dataset and also generalizes well to the Make3D dataset. Source code is available at https://github.com/deansaice/TAMDepth.
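The cross-attention injection described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea of transformer tokens acting as queries over spatial-prior features. The function name `cross_attention_inject`, the single-head formulation, the projection matrices `w_q`, `w_k`, `w_v`, and the scalar `gate` are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention_inject(tokens, prior, w_q, w_k, w_v, gate=0.1):
    """Inject prior features into tokens via single-head cross-attention.

    tokens: N x d transformer tokens (used as queries).
    prior:  M x d spatial-prior features (used as keys and values).
    gate:   hypothetical scalar controlling the injection strength.
    Returns (updated tokens of shape N x d, attention weights of shape N x M).
    """
    q = matmul(tokens, w_q)
    k = matmul(prior, w_k)
    v = matmul(prior, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # N x M similarity scores
    attn = [softmax([s / scale for s in row]) for row in scores]
    injected = matmul(attn, v)  # N x d prior information per token
    # Residual update: tokens keep their content and gain the gated prior.
    out = [[t + gate * u for t, u in zip(t_row, u_row)]
           for t_row, u_row in zip(tokens, injected)]
    return out, attn


# Toy usage: 3 tokens attend over 5 prior features of width 4.
random.seed(0)
dim = 4
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
prior = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
updated, weights = cross_attention_inject(tokens, prior, identity, identity, identity)
```

With `gate=0.0` the tokens pass through unchanged, so a small or learnable gate would let a model start from its original transformer features and blend in the spatial prior gradually; whether TAMDepth uses such a gate is not stated in the abstract.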
computer science, software engineering