Abstract: Recently, transformer-based methods have shown exceptional performance in monocular 3D object detection, which can predict 3D attributes from a single 2D image. These methods typically use visual and depth representations to generate query points on objects, whose quality plays a decisive role in the detection accuracy. However, current unsupervised attention mechanisms without any geometry appearance awareness in transformers are susceptible to producing noisy features for query points, which severely limits the network performance and also makes the model have a poor ability to detect multi-category objects in a single training process. To tackle this problem, this paper proposes a novel ``Supervised Shape&Scale-perceptive Deformable Attention'' (S$^3$-DA) module for monocular 3D object detection. Concretely, S$^3$-DA utilizes visual and depth features to generate diverse local features with various shapes and scales and predict the corresponding matching distribution simultaneously to impose valuable shape&scale perception for each query. Benefiting from this, S$^3$-DA effectively estimates receptive fields for query points belonging to any category, enabling them to generate robust query features. Besides, we propose a Multi-classification-based Shape&Scale Matching (MSM) loss to supervise the above process. Extensive experiments on KITTI and Waymo Open datasets demonstrate that S$^3$-DA significantly improves the detection accuracy, yielding state-of-the-art performance of single-category and multi-category 3D object detection in a single training process compared to the existing approaches. The source code will be made publicly available at https://github.com/mikasa3lili/S3-MonoDETR.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection" targets three problems in monocular 3D object detection:

1. **Poor quality of query points**:
   - Current transformer-based methods perform well in monocular 3D object detection. They typically use visual and depth representations to generate query points, and the quality of those query points is decisive for detection accuracy.
   - The existing unsupervised attention mechanisms in transformers have no awareness of geometric appearance, so they are prone to producing noisy query features. This severely limits network performance and leaves the model poor at detecting multi-category objects in a single training process.
2. **Challenges in multi-category object detection**:
   - Current methods usually train a separate model for each object category and tune the hyperparameters carefully for each one. This can improve detection performance, but it is inefficient and impractical, especially in safety-critical applications such as autonomous driving.
3. **Difficulties in small object detection**:
   - Small objects are particularly hard to detect because existing deformable attention mechanisms only explore relative key points and do not estimate the receptive field of each query point. As a result, noisy key points are easily introduced on small objects, which seriously degrades detection.

### Solutions

To solve these problems, the paper proposes a novel "Supervised Shape&Scale-perceptive Deformable Attention" (S$^3$-DA) module, together with the following components:

1. **Supervised Shape&Scale-perceptive Deformable Attention (S$^3$-DA)**:
   - S$^3$-DA uses visual and depth features to generate diverse local features with different shapes and scales and simultaneously predicts the corresponding matching distribution, which gives each query shape and scale perception.
   - In this way, S$^3$-DA can estimate the receptive field of each query point and generate robust query features.
2. **Multi-classification-based Shape&Scale Matching (MSM) loss**:
   - The MSM loss supervises the above process. It helps S$^3$-DA allocate visual cues so that query points receive high-quality features, which in turn improves the accuracy of 3D attribute prediction.
3. **Multi-category joint training**:
   - With this design, the model can detect multi-category objects with different geometric appearances in a single training process. Experiments on the KITTI and Waymo Open datasets show improved detection accuracy.

### Main contributions

1. **The S$^3$-DA module**:
   - Compared with conventional deformable attention mechanisms, S$^3$-DA better estimates the receptive field of query points and supports accurate key point generation, and thus produces high-quality query features for 3D attribute prediction.
2. **A multi-category monocular 3D object detector**:
   - The method is presented as the first monocular 3D object detector that can detect different categories of objects in a single training process, which matters for practical deployment.
3. **Experimental results**:
   - Extensive experiments on the KITTI and Waymo Open datasets show that the method outperforms existing methods on the moderate and hard subsets, and its near-real-time inference speed makes it suitable for autonomous driving applications.

### Conclusion

By introducing the S$^3$-DA module and the MSM loss, the paper improves the performance of monocular 3D object detection, especially for multi-category detection and small object detection. These designs make the model more efficient and reliable in practical applications.
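The idea summarized above (several candidate local features of different shapes and scales per query, a predicted matching distribution over them, and a classification-style loss that supervises that distribution) can be illustrated with a small sketch. This is not the authors' implementation. The fixed rectangular pooling windows, the linear scoring vector `score_weights`, the use of plain cross-entropy for the matching loss, and all function names are assumptions made for illustration; the actual module works with learned deformable sampling on visual and depth features.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pool_window(feature_map, cy, cx, height, width):
    """Average the feature vectors inside a height x width window
    centred on (cy, cx), clipped to the feature map borders."""
    rows, cols = len(feature_map), len(feature_map[0])
    y0, y1 = max(cy - height // 2, 0), min(cy + height // 2 + 1, rows)
    x0, x1 = max(cx - width // 2, 0), min(cx + width // 2 + 1, cols)
    cells = [feature_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    channels = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(channels)]

def shape_scale_attention(feature_map, cy, cx, windows, score_weights):
    """Hypothetical stand-in for shape&scale-perceptive attention.

    windows: list of (height, width) candidates (assumed, not from the paper).
    score_weights: per-channel weights of an assumed linear scoring head.
    Returns the aggregated query feature and the matching distribution.
    """
    # One local feature per candidate shape/scale.
    local = [pool_window(feature_map, cy, cx, h, w) for h, w in windows]
    # Score each candidate, then normalise into a matching distribution.
    logits = [sum(f * s for f, s in zip(feat, score_weights)) for feat in local]
    dist = softmax(logits)
    # Query feature = distribution-weighted sum of the local features.
    channels = len(local[0])
    query = [sum(p * feat[k] for p, feat in zip(dist, local))
             for k in range(channels)]
    return query, dist

def matching_loss(dist, target_index, eps=1e-12):
    """Cross-entropy between the matching distribution and the index of the
    candidate assumed to fit the ground-truth object best."""
    return -math.log(dist[target_index] + eps)

# Toy usage: an 8x8 map with 2 channels and four candidate windows.
fmap = [[[float(y + x), float(y * y)] for x in range(8)] for y in range(8)]
windows = [(3, 3), (3, 7), (7, 3), (7, 7)]
query, dist = shape_scale_attention(fmap, 4, 4, windows, [0.1, -0.05])
loss = matching_loss(dist, target_index=1)
```

In this sketch the loss is smallest when the distribution puts most of its mass on the candidate whose shape and scale match the object, which is the sense in which a supervised matching distribution can steer each query's receptive field.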

S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection

SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for Monocular 3D Object Detection

MT-SSD: Single-Stage 3D Object Detector Based on Magnification Transformation

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

SEFormer: Structure Embedding Transformer for 3D Object Detection

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

LAM3D: Leveraging Attention for Monocular 3D Object Detection

MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

SGM3D: Stereo Guided Monocular 3D Object Detection

TSSTDet: Transformation-Based 3-D Object Detection via a Spatial Shape Transformer

MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors

Introducing Depth into Transformer-based 3D Object Detection

MDHA: Multi-Scale Deformable Transformer with Hybrid Anchors for Multi-View 3D Object Detection

Improving 3D Object Detection with Channel-wise Transformer

Monocular 3D Detection With Geometric Constraint Embedding and Semi-Supervised Training

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection