Abstract: Recent advances in query-based multi-camera 3D object detection are characterized by initializing object queries in 3D space and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map into sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, so that different queries project onto different areas of the image and extract distinct features. In addition, we leverage the instance information of images to supplement the uniformly initialized object queries with additional queries placed along the rays that pass through 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS. Our code will be made available.
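
The ambiguity described in the abstract follows directly from pinhole projection: every 3D point on the same camera ray maps to the same pixel, whatever its depth. The sketch below illustrates this with a plain pinhole model. The intrinsics (`fx`, `fy`, `cx`, `cy`), the ray direction, and the helper names `project` and `point_on_ray` are hypothetical values chosen for illustration and are not taken from the paper.

```python
def project(point, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0):
    """Project a 3D point in camera coordinates (x right, y down, z forward) to pixels.

    The intrinsics are made-up illustrative values, not those of any real camera.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def point_on_ray(direction, depth):
    """Return the point whose z coordinate equals `depth` on a ray from the camera center."""
    dx, dy, dz = direction
    scale = depth / dz
    return (dx * scale, dy * scale, depth)


ray = (0.2, -0.05, 1.0)                      # hypothetical ray direction
near = point_on_ray(ray, 10.0)               # query 10 m away
far = point_on_ray(ray, 40.0)                # query 40 m away on the same ray
off_ray = (near[0] + 2.0, near[1], near[2])  # query shifted 2 m sideways, off the ray

print("near   ->", project(near))
print("far    ->", project(far))
print("off ray->", project(off_ray))
```

With these assumed values, the near and far points land on the same pixel, while the point moved off the ray lands roughly 200 pixels away. Two queries stacked along one ray would therefore sample nearly identical image features, which is the situation the ray-aligned initialization is meant to avoid.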

What problem does this paper attempt to address?

The paper primarily addresses the issues present in query-based methods for multi-camera 3D object detection and proposes a novel solution. Specifically, existing query-based multi-camera 3D object detection methods often initialize query points in a grid distribution within the 3D space. A key problem with this approach is that query points located near the same camera ray are likely to extract similar features from very close pixel locations in the image, leading to feature ambiguity and thus affecting detection accuracy.

To solve the above issues, the paper proposes the RayFormer method, a query-based 3D object detector based on camera ray characteristics. The main innovations of RayFormer include:

1. **Ray-inspired query initialization**: RayFormer initializes the positions of query points by mimicking the optical characteristics of the camera. Specifically, query points are initialized sparsely and uniformly along the camera ray direction. This "radial" distribution reduces the likelihood of multiple query points falling on the same object, thereby obtaining more distinctive features.
2. **Ray sampling method**: To further address the issue of query points at different distances on the same ray extracting similar features, RayFormer designs a ray sampling method. This method extracts features not only from the image perspective but also from the bird's-eye view (BEV) perspective. Each query point selects several sampling points on its ray segment, ensuring that each query point can extract unique object-level features.
3. **Incorporating 2D prior knowledge**: In addition to the radially initialized base query points, RayFormer also utilizes 2D object detection results to supplement additional query points. By extending the height of the 2D bounding boxes and selecting rays intersecting with these bounding boxes in the image, foreground query points can be obtained.
4. **Performance validation**: The authors conducted extensive experiments on the nuScenes dataset to validate the effectiveness of RayFormer. The results show that RayFormer achieved 55.5% mean Average Precision (mAP) and 63.3% nuScenes Detection Score (NDS) on the test set, improving by 1.2% and 0.6% respectively compared to the baseline SparseBEV.

In summary, RayFormer aims to improve the accuracy of query-based multi-camera 3D object detection by enhancing the initialization of query points and the feature sampling strategy.
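
The radial initialization and per-segment sampling described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: rays fan out from a single origin in the BEV plane (the paper organizes rays per camera), and the function names, ray count, range limits, and number of sampling points are all hypothetical.

```python
import math


def init_ray_queries(num_rays=8, queries_per_ray=3, r_min=5.0, r_max=50.0):
    """Place a few queries on each BEV ray, uniformly spaced in range (assumed layout)."""
    if queries_per_ray < 2:
        raise ValueError("need at least two queries per ray")
    step = (r_max - r_min) / (queries_per_ray - 1)
    queries = []
    for ray_id in range(num_rays):
        azimuth = 2.0 * math.pi * ray_id / num_rays
        for k in range(queries_per_ray):
            rng = r_min + k * step
            # Each query owns a range interval that does not overlap its neighbours'.
            segment = (max(r_min, rng - step / 2), min(r_max, rng + step / 2))
            queries.append({"ray": ray_id, "azimuth": azimuth,
                            "range": rng, "segment": segment})
    return queries


def sample_on_segment(query, num_points=4):
    """Spread BEV sampling points evenly inside the query's own ray segment."""
    lo, hi = query["segment"]
    points = []
    for k in range(num_points):
        rng = lo + (k + 0.5) * (hi - lo) / num_points
        points.append((rng * math.cos(query["azimuth"]),
                       rng * math.sin(query["azimuth"])))
    return points


queries = init_ray_queries()
for q in queries[:3]:  # the three queries on the first ray
    print(q["range"], q["segment"], sample_on_segment(q, num_points=2))
```

Because the segments on one ray are disjoint, the sampling points of neighbouring queries never coincide in BEV, which mirrors the goal of giving each query its own object-level features even when several queries share a ray.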

RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies
