TinyDepth: Lightweight Self-Supervised Monocular Depth Estimation Based on Transformer

Zeyu Cheng, Yi Zhang, Yang Yu, Zhe Song, Chengkai Tang
DOI: https://doi.org/10.1016/j.engappai.2024.109313
2024-01-01
Abstract: Monocular depth estimation plays an important role in autonomous driving, virtual reality, augmented reality, and other fields. Self-supervised monocular depth estimation has received much attention because it does not require hard-to-obtain depth labels during training. Convolutional neural networks (CNNs), used in earlier work, are limited in their ability to model large-scale spatial dependencies. A more recent approach replaces the CNN architecture with, or merges it into, a Vision Transformer (ViT) architecture that can model large-scale spatial dependencies in images. However, these models still have too many parameters and too high a computational cost, which makes deployment on mobile platforms difficult. To address these problems, we propose TinyDepth, a lightweight self-supervised monocular depth estimation method based on the Transformer. TinyDepth employs hierarchical representation learning suitable for dense prediction, uses mobile convolution to reduce parameters and computational overhead, and includes a novel decoder based on multi-scale fusion attention. The decoder improves the local and global inference capability of the network through scale-wise attention processing and layer-wise fusion sampling, yielding more accurate depth prediction. In experiments, TinyDepth achieved state-of-the-art results with few parameters on the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) dataset, and exhibited good generalization ability on the challenging indoor New York University (NYU) dataset. Source code is available at https://github.com/ZYCheng777/TinyDepth.
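The abstract does not say exactly what "mobile convolution" refers to. As a rough illustration only, assuming it means a MobileNet-style depthwise separable convolution (an assumption, not something the abstract confirms), the sketch below compares the weight count of such a layer with that of a standard convolution. The function names and layer sizes are hypothetical and are not taken from the TinyDepth code.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a k x k standard convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv (bias omitted)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```

For this hypothetical layer, the separable form needs roughly one eighth of the weights, which is the kind of saving that makes such layers attractive on mobile platforms.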