Abstract: Monocular 3-D object detection is a challenging and important task in autonomous driving. In 3-D object detection, depth information is crucial for determining the position, size, and pose of objects. However, in the 2-D images obtained by monocular cameras, depth information is compressed into a plane, which blurs the relative position and size relationships between objects. In addition, pose changes of objects in 3-D space (such as rotation and tilt) alter their projected shape, size, and position in 2-D images, which makes detecting 3-D objects from monocular 2-D images more difficult. In this article, we propose a depth-vision-decoupled transformer for monocular multiclass 3-D object detection. The proposed scheme is built from the following novel components: 1) a cascaded group-multiscale convolutional attention (CG-MSCA) module with multiscale-perception and direction-perception capabilities to focus on the local characteristics of complex scenes; 2) a decoupled-feature-aware transformer (DFTR) module, which globally decouples depth and visual features and encodes and decodes them separately, so that inaccurate monocular depth estimates do not interfere with the model's learning of visual information, and which alleviates the mismatch between the stereo information and the geometric shape of the object; and 3) a cross-attention-guided fusion module (CAFM) to rationalize the fusion of the decoupled depth and visual features before prediction. Experiments on the KITTI, DAIR-V2X-V, and DAIR-V2X-I datasets show that the proposed method is competitive with Transformer-based methods and other state-of-the-art (SOTA) methods. For the Car category, our method achieves a 3-D average precision (AP3D) of 17.04% on the moderate setting (the most important setting) of the KITTI dataset at an intersection over union (IoU) threshold of 0.7.

TinyDepth: Lightweight Self-Supervised Monocular Depth Estimation Based on Transformer

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Self-supervised Monocular Depth Estimation with Large Kernel Attention

Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Lightweight Monocular Depth Estimation via Token-Sharing Transformer

TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation

Complete contextual information extraction for self-supervised monocular depth estimation

Deep Neighbor Layer Aggregation for Lightweight Self-Supervised Monocular Depth Estimation

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

Towards Comprehensive Monocular Depth Estimation: Multiple Heads are Better Than One

Monocular Depth Estimation Algorithm Integrating Parallel Transformer and Multi-Scale Features

Depth Estimation from Monocular Images Using Dilated Convolution and Uncertainty Learning

Lightweight monocular depth estimation using a fusion-improved transformer

Depth Estimation with Simplified Transformer

Real-Time Monocular Depth Estimation Merging Vision Transformers on Edge Devices for AIoT

FA-Depth: Toward Fast and Accurate Self-supervised Monocular Depth Estimation