Abstract: Recent research in computer vision has highlighted the effectiveness of vision transformers (ViTs) across several computer vision tasks: unlike convolutions, which process an image locally, ViTs can understand and process the image globally. ViTs outperform convolutional neural networks (CNNs) in accuracy on many computer vision tasks, but their speed remains an issue because of the heavy use of transformer layers, each of which contains many fully connected layers. We therefore propose a real-time ViT-based method for monocular depth estimation (depth estimation from a single RGB image) with encoder-decoder architectures for indoor and outdoor scenes. The main architecture of the proposed method consists of a vision transformer encoder and a convolutional neural network decoder. We start by training the base vision transformer (ViT-b16) with 12 transformer layers, then reduce the number of transformer layers to six (ViT-s16, the Small ViT) and four (ViT-t16, the Tiny ViT) to obtain real-time processing. We also try four different configurations of the CNN decoder network. By taking advantage of the multi-head self-attention module, the proposed architectures learn the depth estimation task efficiently and produce more accurate depth predictions than fully convolutional methods. We train the proposed encoder-decoder architecture end-to-end on the challenging NYU-depthV2 and CITYSCAPES benchmarks, then evaluate the trained models on the validation and test sets of the same benchmarks, showing that the method outperforms many state-of-the-art depth estimation methods while running in real time (∼20 fps). We also present a fast 3D reconstruction experiment (∼17 fps) based on the depth estimated by our method, as a real-world application.
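
To make the layer-reduction idea concrete, the following minimal sketch estimates how the size of the transformer encoder stack changes when going from 12 to six and four layers. Only the layer counts and the variant names come from the abstract. The hidden width (768) and MLP width (3072) are assumptions borrowed from the commonly used ViT-Base configuration, the 224×224 input size is purely illustrative, and `EncoderConfig` and `CONFIGS` are hypothetical names. The count covers the transformer layers only; it ignores the patch embedding, position embeddings, and the CNN decoder, so it is a rough guide to relative cost rather than a description of the authors' models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical description of a ViT encoder stack (not the authors' code)."""
    name: str
    num_layers: int          # from the abstract: 12, 6, or 4
    hidden_dim: int = 768    # assumption: standard ViT-Base width
    mlp_dim: int = 3072      # assumption: standard ViT-Base MLP width
    patch_size: int = 16     # implied by the "16" in the variant names

    def num_tokens(self, height: int, width: int) -> int:
        # Number of non-overlapping patches the image is split into.
        return (height // self.patch_size) * (width // self.patch_size)

    def layer_params(self) -> int:
        d, m = self.hidden_dim, self.mlp_dim
        attention = 4 * (d * d + d)          # Q, K, V and output projections, with biases
        mlp = (d * m + m) + (m * d + d)      # two fully connected layers, with biases
        layer_norms = 2 * 2 * d              # two LayerNorms, each with scale and shift
        return attention + mlp + layer_norms

    def encoder_params(self) -> int:
        # Every layer has the same shape, so the stack scales linearly with depth.
        return self.num_layers * self.layer_params()


CONFIGS = {
    "ViT-b16": EncoderConfig("ViT-b16", num_layers=12),
    "ViT-s16": EncoderConfig("ViT-s16", num_layers=6),
    "ViT-t16": EncoderConfig("ViT-t16", num_layers=4),
}

if __name__ == "__main__":
    for cfg in CONFIGS.values():
        millions = cfg.encoder_params() / 1e6
        tokens = cfg.num_tokens(224, 224)    # illustrative input size, an assumption
        print(f"{cfg.name}: {cfg.num_layers} layers, {tokens} tokens, "
              f"{millions:.1f}M encoder-stack parameters")
```

Under these assumptions the encoder stack shrinks to one half and one third of the base size, respectively, because every transformer layer has the same shape. Actual frame rates depend on more than parameter count (input resolution, decoder configuration, hardware), so the reported ∼20 fps cannot be derived from this estimate.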

Real-Time Monocular Depth Estimation Merging Vision Transformers on Edge Devices for AIoT

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers

A Robust Monocular Depth Estimation Framework Based on Light-Weight ERF-Pspnet for Day-Night Driving Scenes

Real-Time Stereo Image Depth Estimation Network with Group-Wise L1 Distance for Edge Devices Towards Autonomous Driving

Lightweight Monocular Depth Estimation via Token-Sharing Transformer

FastDepth: Fast Monocular Depth Estimation on Embedded Systems

Lightweight Monocular Depth Estimation with an Edge Guided Network

Real-time Monocular Depth Estimation on Embedded Systems

Lightweight Monocular Depth Estimation on Edge Devices

Lightweight monocular depth estimation using a fusion-improved transformer

Edge-Enhanced Dual-Stream Perception Network for Monocular Depth Estimation

EDFIDepth: enriched multi-path vision transformer feature interaction networks for monocular depth estimation

Real-Time Monocular Human Depth Estimation and Segmentation on Embedded Systems

MobileXNet: An Efficient Convolutional Neural Network for Monocular Depth Estimation

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation

A Contour-Aware Monocular Depth Estimation Network using Swin Transformer and Cascaded Multi-scale Fusion

DTS-Depth: Real-Time Single-Image Depth Estimation Using Depth-to-Space Image Construction

Depth Estimation with Simplified Transformer