Improving 3D Object Detection with Context-Aware and Dimensional Interaction Attention

Jing Zhou, Zixin Gong, Junchi Zhang
DOI: https://doi.org/10.1007/s11063-024-11447-w
IF: 2.565
2024-02-11
Neural Processing Letters
Abstract: Point-cloud-based 3D object detection has developed rapidly in recent years. However, sensors capture too few points on distant and occluded objects, which leaves these objects with insufficient features for detection and lowers detection accuracy. We therefore propose a novel 3D object detector, the Context-aware and dimensional Interaction Attention Network (CIANet), which explores vital geometric cues to enrich the feature representation of objects and thus boost overall detection performance. Specifically, in the first stage, we employ 3D sparse convolution to extract voxel features, and then construct a Channel-Spatial Hybrid Attention (CSHA) module and a Contextual Self-Attention (CSA) module to enhance the voxel features used for generating proposals. The CSHA module enhances the key information in the channel and spatial domains of 2D Bird's Eye View (BEV) features, and the CSA module supplements the enhanced BEV features with contextual information, thus generating accurate proposals. In the second stage, we construct a Dimensional Interaction Attention (DIA) module to refine Region of Interest (RoI) features within the proposals. It enhances the interactions among the channel and spatial dimensions of RoI features to learn accurate object boundaries for proposal refinement. Extensive experiments on the KITTI and Waymo benchmarks show the superior detection performance of CIANet compared to recent methods, especially for objects such as pedestrians and cyclists.
computer science, artificial intelligence
### What problem does this paper attempt to solve?

This paper addresses the lack of features for distant and occluded objects in point-cloud-based 3D object detection, which arises because too few points are scanned on these objects. Specifically:

1. **Detection challenges of distant and occluded objects**:
   - In real-world scenarios, LiDAR sensors capture too few points on distant or occluded objects to fully depict their boundaries, so these objects lack sufficient spatial features.
   - As a result, existing 3D object detection methods struggle to detect these weak objects accurately, which lowers overall detection performance.
2. **Limitations of existing methods**:
   - **View-based methods**: These project the point cloud onto a 2D view for detection. They can use mature 2D CNNs, but they lose 3D spatial information, which limits detection performance.
   - **Point-based methods**: These extract features directly from the raw point cloud, but at the cost of high computational complexity and slow inference.
   - **Voxel-based methods**: These convert the point cloud into regular voxels and use 3D sparse convolution to extract features. They are faster, but in some cases they lose spatial geometric information, which reduces detection accuracy.

### Solutions proposed in the paper

To solve these problems, the authors propose CIANet (Context-aware and Dimensional Interaction Attention Network), a new 3D object detection network built on context-aware and dimensional interaction attention mechanisms. The specific improvements are:

1. **First stage**:
   - Use 3D sparse convolution to extract voxel features.
   - Construct a **Channel-Spatial Hybrid Attention (CSHA) module** to enhance the key information of BEV features.
   - Construct a **Contextual Self-Attention (CSA) module** to supplement global spatial context information and generate high-quality candidate boxes (proposals).
2. **Second stage**:
   - Use voxel RoI pooling to capture RoI features within the candidate boxes.
   - Construct a **Dimensional Interaction Attention (DIA) module** to enhance the interaction between the spatial and channel dimensions of RoI features, learn more accurate object boundaries, and thus refine the candidate boxes.

### Main contributions

1. **Proposing the CSHA and CSA modules**: In the first stage, these two modules enhance the key channel-spatial features and aggregate rich global context information to generate more accurate candidate boxes.
2. **Designing the DIA module**: In the second stage, this module integrates the interaction between the channel and spatial dimensions, enhances the RoI features, and further refines the candidate boxes into the final detection boxes.
3. **Experimental results**: CIANet performs well on the KITTI and Waymo benchmarks, and in particular outperforms other advanced methods on small, weak objects such as pedestrians and cyclists.

By introducing these attention mechanisms, CIANet focuses better on the boundary information of weak objects in real-world scenarios, thereby improving overall detection performance.
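
To make the idea of channel and spatial attention over BEV features more concrete, here is a minimal, parameter-free sketch in plain Python. It is not the paper's CSHA module, whose internal design is not described in this summary. It only illustrates the general pattern of reweighting a feature map first per channel and then per BEV cell. The function name `channel_spatial_attention`, the `[C][H][W]` nested-list layout, and the use of plain average pooling followed by a sigmoid (with no learned layers) are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_spatial_attention(bev):
    """Reweight a BEV feature map given as a nested list [C][H][W].

    Illustrative assumption: weights come from average pooling plus a
    sigmoid, with no learned parameters. A real module would typically
    insert learned layers before the sigmoid.
    """
    C, H, W = len(bev), len(bev[0]), len(bev[0][0])

    # Channel attention: one weight per channel from global average pooling.
    channel_w = [sigmoid(sum(sum(row) for row in ch) / (H * W)) for ch in bev]
    scaled = [
        [[channel_w[c] * bev[c][y][x] for x in range(W)] for y in range(H)]
        for c in range(C)
    ]

    # Spatial attention: one weight per BEV cell from the mean over channels.
    spatial_w = [
        [sigmoid(sum(scaled[c][y][x] for c in range(C)) / C) for x in range(W)]
        for y in range(H)
    ]
    return [
        [[spatial_w[y][x] * scaled[c][y][x] for x in range(W)] for y in range(H)]
        for c in range(C)
    ]


# Hypothetical 2-channel, 2x2 BEV map with a single strong response.
bev = [
    [[4.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
out = channel_spatial_attention(bev)
print(out[0][0][0])
```

Because both sets of weights lie strictly between 0 and 1, this sketch can only attenuate features. Cells and channels with stronger average responses are attenuated less, which is the sense in which attention "enhances key information" relative to the rest of the map.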