RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Yao Li, Yanyong Zhang
2024-07-27
Abstract: The recent advances in query-based multi-camera 3D object detection are featured by initializing object queries in the 3D space, and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map into sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. Besides, we leverage the instance information of images to supplement the uniformly initialized object queries by further involving additional queries along the ray from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS, respectively. Our code will be made available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets a weakness of query-based methods for multi-camera 3D object detection. Existing methods typically initialize query points on a grid in 3D space. Query points that lie near the same camera ray then tend to sample features from nearly identical pixel locations in the image, which makes their features ambiguous and lowers detection accuracy. To address this, the paper proposes RayFormer, a query-based 3D object detector built around the characteristics of camera rays. Its main contributions are:

1. **Ray-inspired query initialization**: RayFormer initializes the positions of query points by mimicking the optical characteristics of the camera. Specifically, query points are initialized sparsely and uniformly along the camera ray direction. This "radial" distribution reduces the likelihood of multiple query points falling on the same object, so each query obtains more distinctive features.
2. **Ray sampling method**: To further keep query points at different distances on the same ray from extracting similar features, RayFormer designs a ray sampling method. It extracts features from both the image view and the bird's-eye view (BEV). Each query point selects several sampling points on its own ray segment, so that each query point can extract unique object-level features.
3. **Incorporating 2D prior knowledge**: In addition to the radially initialized base query points, RayFormer uses 2D object detection results to supply additional query points. Foreground query points are obtained by extending the height of the 2D bounding boxes and selecting the rays that intersect these boxes in the image.
4. **Performance validation**: The authors conducted extensive experiments on the nuScenes dataset to validate the effectiveness of RayFormer. The results show that RayFormer achieves 55.5% mean Average Precision (mAP) and 63.3% nuScenes Detection Score (NDS) on the test set, improving by 1.2 and 0.6 percentage points, respectively, over the baseline SparseBEV.

In summary, RayFormer aims to improve the accuracy of query-based multi-camera 3D object detection by improving both the initialization of query points and the feature sampling strategy.
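The ray-centric idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `init_ray_queries` and `project_pinhole`, the single shared ray origin, the evenly spaced 360° rays, and the intrinsics (`focal`, `cx`, `cy`) are all assumptions made for illustration. The first function places query points uniformly along BEV rays; the second shows why points at different depths on one camera ray are ambiguous, since a pinhole projection maps them to the same pixel.

```python
import numpy as np


def init_ray_queries(num_rays=8, num_per_ray=4, r_min=2.0, r_max=50.0):
    """Place query points uniformly along evenly spaced BEV rays.

    Hypothetical helper: all rays start at one shared origin (0, 0),
    whereas a real multi-camera rig has one optical centre per camera.
    Returns an array of shape (num_rays, num_per_ray, 2) holding (x, y).
    """
    # Evenly spaced ray directions over the full circle (endpoint excluded
    # so that 0 and 2*pi are not both generated).
    azimuths = np.linspace(0.0, 2.0 * np.pi, num_rays, endpoint=False)
    # Uniform spacing of query points along each ray.
    radii = np.linspace(r_min, r_max, num_per_ray)
    x = radii[None, :] * np.cos(azimuths[:, None])
    y = radii[None, :] * np.sin(azimuths[:, None])
    return np.stack([x, y], axis=-1)


def project_pinhole(points_cam, focal=1000.0, cx=800.0, cy=450.0):
    """Project (N, 3) camera-frame points to pixel coordinates.

    Assumes an ideal pinhole camera (x right, y down, z forward) with
    made-up intrinsics; lens distortion is ignored.
    """
    z = points_cam[:, 2]
    u = focal * points_cam[:, 0] / z + cx
    v = focal * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=-1)


if __name__ == "__main__":
    queries = init_ray_queries()
    print("query grid shape:", queries.shape)

    # Three points at different depths on the same camera ray.
    direction = np.array([0.2, 0.05, 1.0])
    depths = np.array([10.0, 20.0, 40.0])
    pixels = project_pinhole(depths[:, None] * direction)
    # Every row is the same pixel, so image features alone cannot
    # distinguish these three candidate positions.
    print(pixels)
```

Because every point on a camera ray shares one pixel, placing many queries close together along a ray adds little information from the image alone, which is the motivation the summary gives for sparse per-ray initialization and for sampling features from BEV as well.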