Abstract: Monocular 3D object detection has attracted great attention due to its simplicity and low cost. Existing methods typically follow conventional 2D detection paradigms, first locating object centers and then predicting 3D attributes via neighboring features. However, these methods predominantly rely on progressive cross-scale feature aggregation and focus solely on local information, which may result in a lack of global awareness and the omission of small-scale objects. In addition, due to large variation in object scales across different scenes and depths, inaccurate receptive fields often lead to background noise and degraded feature representation. To address these issues, we introduce MonoASRH, a novel monocular 3D detection framework composed of an Efficient Hybrid Feature Aggregation Module (EH-FAM) and an Adaptive Scale-Aware 3D Regression Head (ASRH). Specifically, EH-FAM employs multi-head attention with a global receptive field to extract semantic features for small-scale objects and leverages lightweight convolutional modules to efficiently aggregate visual features across different scales. The ASRH encodes 2D bounding box dimensions and then fuses scale features with the semantic features aggregated by EH-FAM through a scale-semantic feature fusion module. The scale-semantic feature fusion module guides ASRH in learning dynamic receptive field offsets, incorporating scale priors into 3D position prediction for better scale-awareness. Extensive experiments on the KITTI and Waymo datasets demonstrate that MonoASRH achieves state-of-the-art performance.
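
To make the phrase "multi-head attention with a global receptive field" concrete, here is a minimal pure-Python sketch of self-attention over a flattened feature map, in which every location attends to every other location. It is an illustration under stated assumptions, not the paper's EH-FAM: the function name `global_attention` is hypothetical, and the query, key, and value projections are taken to be identity maps, whereas a real layer would learn them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens, num_heads):
    """Multi-head self-attention with identity Q/K/V projections (assumption).

    tokens: list of N feature vectors (one per spatial location), each of length C.
    Each head works on its own slice of C // num_heads channels.
    """
    n, c = len(tokens), len(tokens[0])
    assert c % num_heads == 0, "channels must split evenly across heads"
    d = c // num_heads
    out = [[0.0] * c for _ in range(n)]
    for h in range(num_heads):
        lo, hi = h * d, (h + 1) * d
        for i in range(n):
            q = tokens[i][lo:hi]
            # Scaled dot-product scores against ALL locations: the global receptive field.
            scores = [
                sum(a * b for a, b in zip(q, tokens[j][lo:hi])) / math.sqrt(d)
                for j in range(n)
            ]
            weights = softmax(scores)
            # Output is the attention-weighted average of the value slices.
            for j in range(n):
                for k in range(d):
                    out[i][lo + k] += weights[j] * tokens[j][lo + k]
    return out
```

Because the score list covers all N locations, a small object's feature can draw on context from anywhere in the image. The cost grows quadratically with N, which is consistent with pairing attention with lightweight convolutions for the remaining scales.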

What problem does this paper attempt to address?

This paper attempts to solve two key problems in monocular 3D object detection:

1. **Insufficient global perception and omission of small-scale objects**: Existing methods usually rely on step-by-step cross-scale feature aggregation and mainly focus on local information, which may lead to a lack of global awareness and the omission of small-scale objects.
2. **Large object scale changes across scenes and depths**: The scale of objects varies greatly across scenes and depths, resulting in inaccurate receptive fields that introduce background noise and reduce the quality of feature representation.

To address these problems, the authors propose a new monocular 3D detection framework named MonoASRH, which consists of two main modules:

- **Efficient Hybrid Feature Aggregation Module (EH-FAM)**:
  - Uses multi-head attention to extract semantic features with a global receptive field, which is especially suitable for small-scale objects.
  - Uses lightweight convolutional modules to efficiently aggregate visual features at different scales.
- **Adaptive Scale-Aware 3D Regression Head (ASRH)**:
  - Encodes 2D bounding box dimensions to capture scale features, and fuses these features with the semantic features aggregated by EH-FAM through the scale-semantic feature fusion module.
  - Introduces scale prior information into 3D position prediction by learning dynamic receptive field offsets, which improves scale-awareness.

Specifically, EH-FAM combines multi-head attention and lightweight convolutional operations to aggregate features at different scales, while ASRH combines 2D bounding box size information with semantic features and, through the scale-semantic feature fusion module, guides the network to learn dynamic receptive field offsets so that it can better handle objects of different scales.

Experimental results show that MonoASRH achieves state-of-the-art performance on the KITTI and Waymo datasets. In summary, this paper aims to improve the accuracy and robustness of monocular 3D object detection by improving feature aggregation and introducing a scale-awareness mechanism.
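
The idea of letting the 2D box size steer where features are sampled can be sketched as follows. This is a simplified illustration, not the authors' ASRH: in the paper the offsets are learned from fused scale and semantic features, while here they are replaced by a fixed rule that spreads a 3x3 sampling grid across the box. The names `scale_aware_offsets` and `sample_object_feature`, and the `stride` default of 4, are assumptions made for the example.

```python
import math


def bilinear(fmap, x, y):
    """Sample a 2D feature map (list of rows) at real-valued (x, y), clamped to the border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * fmap[y0][x0] + ax * fmap[y0][x1]
    bottom = (1 - ax) * fmap[y1][x0] + ax * fmap[y1][x1]
    return (1 - ay) * top + ay * bottom


def scale_aware_offsets(box_w, box_h, stride=4.0):
    """3x3 sampling offsets, in feature-map cells, that span the 2D box.

    box_w, box_h are in image pixels; stride is the assumed downsampling factor.
    A fixed rule stands in for the learned offsets described in the paper.
    """
    half_w = box_w / stride / 2.0
    half_h = box_h / stride / 2.0
    return [(dx * half_w, dy * half_h) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def sample_object_feature(fmap, cx, cy, box_w, box_h, stride=4.0):
    """Average the features sampled at the scale-dependent locations around (cx, cy)."""
    offsets = scale_aware_offsets(box_w, box_h, stride)
    values = [bilinear(fmap, cx + dx, cy + dy) for dx, dy in offsets]
    return sum(values) / len(values)
```

With this rule a large box pulls samples from a wide neighborhood and a small box keeps them close to the center, so the effective receptive field follows object size rather than staying fixed and picking up background for small or distant objects.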

Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

Leveraging Front and Side Cues for Occlusion Handling in Monocular 3D Object Detection

MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

MonoAux: Fully Exploiting Auxiliary Information and Uncertainty for Monocular 3D Object Detection

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

Monocular 3D Object Detection With Sequential Feature Association and Depth Hint Augmentation

SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for Monocular 3D Object Detection

SGM3D: Stereo Guided Monocular 3D Object Detection

Sparse Embedded Convolution Based Dual Feature Aggregation 3D Object Detection Network

MonoGRNet: A General Framework for Monocular 3D Object Detection

Towards Model Generalization for Monocular 3D Object Detection

AGO-Net: Association-Guided 3D Point Cloud Object Detection Network

Dynamic Depth Fusion and Transformation for Monocular 3D Object Detection

Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction

S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection

Every Dataset Counts: Scaling up Monocular 3D Object Detection with Joint Datasets Training

Spatial Information Enhancement with Multi-Scale Feature Aggregation for Long-Range Object and Small Reflective Area Object Detection from Point Cloud

A Novel Adaptive Edge Aggregation and Multiscale Feature Interaction Detector for Object Detection in Remote Sensing Images

Shape-Aware Monocular 3D Object Detection

ABC: Aligning Binary Centers for Single-Stage Monocular 3D Object Detection