Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles

Rui Song, Chenwei Liang, Hu Cao, Zhiran Yan, Walter Zimmer, Markus Gross, Andreas Festag, Alois Knoll
2024-04-25
Abstract: Collaborative perception in automated vehicles leverages the exchange of information between agents, aiming to elevate perception results. Previous camera-based collaborative 3D perception methods typically employ 3D bounding boxes or bird's eye views as representations of the environment. However, these approaches fall short in offering a comprehensive 3D environmental prediction. To bridge this gap, we introduce the first method for collaborative 3D semantic occupancy prediction. Particularly, it improves local 3D semantic occupancy predictions by hybrid fusion of (i) semantic and occupancy task features, and (ii) compressed orthogonal attention features shared between vehicles. Additionally, due to the lack of a collaborative perception dataset designed for semantic occupancy prediction, we augment a current collaborative perception dataset to include 3D collaborative semantic occupancy labels for a more robust evaluation. The experimental findings highlight that: (i) our collaborative semantic occupancy predictions excel above the results from single vehicles by over 30%, and (ii) models anchored on semantic occupancy outpace state-of-the-art collaborative 3D detection techniques in subsequent perception applications, showcasing enhanced accuracy and enriched semantic-awareness in road environments.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to achieve more accurate and comprehensive 3D semantic occupancy prediction in Connected Automated Vehicles (CAVs). Existing camera-based collaborative 3D perception methods usually represent the environment with 3D bounding boxes or bird's-eye views, which fall short of a comprehensive 3D environmental prediction. To bridge this gap, the authors propose the Collaborative Hybrid Feature Fusion Framework (CoHFF), which improves local 3D semantic occupancy prediction through two kinds of feature fusion:

1. **Fusion of semantic and occupancy task features**: Combining the features of the semantic segmentation and occupancy prediction tasks improves the prediction of both the occupancy state and the semantic category of each voxel in 3D space.
2. **Sharing of compressed orthogonal attention features**: Sharing compressed orthogonal attention features among vehicles further enhances prediction performance.

In addition, because no collaborative perception dataset is designed specifically for semantic occupancy prediction, the authors extend an existing collaborative perception dataset with 3D collaborative semantic occupancy labels for more robust evaluation.

### Main contributions

1. **The first camera-based collaborative semantic occupancy prediction framework**: By sharing features over the V2X communication network, it achieves more accurate and comprehensive 3D semantic occupancy segmentation than single-vehicle systems, with a performance improvement of more than 30%.
2. **A hybrid feature fusion method**: It promotes efficient collaboration among CAVs and also significantly improves the performance of models pre-trained only for occupancy prediction or semantic voxel segmentation.
3. **An enriched OPV2V collaborative perception dataset**: Voxel ground truth with 12 semantic categories is added to strengthen the evaluation of the framework.

CoHFF achieves results comparable to the current leading methods in subsequent 3D perception applications and provides more semantic details about the road environment.

### Method overview

The CoHFF framework consists of four key modules:

1. **Occupancy prediction task network**: Converts 2D image data into 3D occupancy grids and extracts occupancy task features.
2. **Semantic segmentation task network**: Processes RGB data to generate feature maps and maps them to the 3D semantic segmentation space through deformable cross-attention.
3. **V2X feature fusion**: Merges features among CAVs through a deformable self-attention mechanism.
4. **Task feature fusion**: Combines all task features to enhance semantic occupancy prediction.

### Experimental results

The experimental results show that CoHFF outperforms existing methods in both 3D object detection and BEV semantic segmentation. In 3D object detection, CoHFF reaches 48.51 on AP@0.5 and 36.39 on AP@0.7, significantly outperforming other methods. In BEV semantic segmentation, CoHFF also performs well on vehicle and road predictions and can detect a wider range of other semantic categories.

### Ablation study

The ablation study shows that semantic and occupancy features obtained independently can strengthen both of the original tasks at once. Specifically:

- **Occupancy prediction task**: Processing depth prediction through the occupancy prediction task network improves the overall prediction accuracy. Adding the features of the semantic segmentation task network significantly improves the prediction accuracy for large objects, but the mIoU decreases slightly.
- **Semantic segmentation task**: After the occupancy prediction features are integrated, the IoU increases by about 2% and the mIoU increases by more than 41%. This is attributed to the occupancy prediction features helping with the detection of smaller-scale objects.

In conclusion, the paper addresses the shortcomings of existing methods in 3D semantic occupancy prediction by proposing the CoHFF framework, and it offers a new direction for further work on collaborative perception.
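
The IoU and mIoU figures quoted in the ablation study are voxel-level overlap metrics. The following is a minimal sketch of how such metrics are commonly computed for semantic occupancy grids; it is not taken from the paper. It assumes that label `0` marks free space, that IoU measures occupied-versus-free overlap regardless of class, and that mIoU averages the per-class IoU over the semantic classes that appear in either grid. The function names `occupancy_iou` and `semantic_miou` and the flattened-list input are hypothetical choices for illustration.

```python
EMPTY = 0  # assumed label id for free space; the summary does not specify label ids


def occupancy_iou(pred, target, empty=EMPTY):
    """Geometric IoU: overlap of occupied voxels, ignoring semantic class."""
    pairs = list(zip(pred, target))
    inter = sum(1 for p, t in pairs if p != empty and t != empty)
    union = sum(1 for p, t in pairs if p != empty or t != empty)
    return inter / union if union else float("nan")


def semantic_miou(pred, target, num_classes, empty=EMPTY):
    """Mean per-class IoU over non-empty classes present in pred or target."""
    pairs = list(zip(pred, target))
    ious = []
    for c in range(num_classes):
        if c == empty:
            continue  # free space is not a semantic class in this sketch
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union:  # skip classes absent from both grids
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: a 3D voxel grid flattened to six voxels, with two semantic classes.
pred = [0, 1, 1, 2, 0, 2]
target = [0, 1, 2, 2, 1, 2]
print(occupancy_iou(pred, target))     # 4 shared occupied voxels out of 5
print(semantic_miou(pred, target, 3))  # mean of 1/3 (class 1) and 2/3 (class 2)
```

In this toy case the geometric IoU is high because most occupied voxels are found, while the mIoU is lower because two of them carry the wrong class. That is one way the two metrics can move by different amounts, as they do in the ablation results above.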