Abstract: Collaborative perception in automated vehicles leverages the exchange of information between agents, aiming to elevate perception results. Previous camera-based collaborative 3D perception methods typically employ 3D bounding boxes or bird's eye views as representations of the environment. However, these approaches fall short in offering a comprehensive 3D environmental prediction. To bridge this gap, we introduce the first method for collaborative 3D semantic occupancy prediction. Particularly, it improves local 3D semantic occupancy predictions by hybrid fusion of (i) semantic and occupancy task features, and (ii) compressed orthogonal attention features shared between vehicles. Additionally, due to the lack of a collaborative perception dataset designed for semantic occupancy prediction, we augment a current collaborative perception dataset to include 3D collaborative semantic occupancy labels for a more robust evaluation. The experimental findings highlight that: (i) our collaborative semantic occupancy predictions excel above the results from single vehicles by over 30%, and (ii) models anchored on semantic occupancy outpace state-of-the-art collaborative 3D detection techniques in subsequent perception applications, showcasing enhanced accuracy and enriched semantic-awareness in road environments.

What problem does this paper attempt to address?

This paper aims to make 3D semantic occupancy prediction in Connected Automated Vehicles (CAVs) more accurate and more comprehensive. Existing camera-based collaborative 3D perception methods usually represent the environment with 3D bounding boxes or bird's-eye views, which do not provide a comprehensive 3D prediction of the environment. To bridge this gap, the authors propose a new method, the Collaborative Hybrid Feature Fusion Framework (CoHFF), which improves local 3D semantic occupancy prediction through two kinds of feature fusion:

1. **Fusion of semantic and occupancy task features**: Combining the features of the semantic segmentation and occupancy prediction tasks improves the predicted occupancy state and semantic category of each voxel in 3D space.
2. **Sharing of compressed orthogonal attention features**: Sharing compressed orthogonal attention features among vehicles further improves prediction performance.

In addition, because no collaborative perception dataset has been designed specifically for semantic occupancy prediction, the authors extend an existing collaborative perception dataset with 3D collaborative semantic occupancy labels to allow a more robust evaluation.

### Main contributions

1. **Propose the first camera-based collaborative semantic occupancy prediction framework**: By sharing features over the V2X communication network, the framework achieves more accurate and comprehensive 3D semantic occupancy segmentation than single-vehicle systems, with a performance improvement of more than 30%.
2. **Introduce a hybrid feature fusion method**: It promotes efficient collaboration among CAVs and also significantly improves on models pre-trained only for occupancy prediction or only for semantic voxel segmentation.
3. **Enrich the collaborative perception dataset OPV2V**: The authors add voxel ground truth with 12 semantic categories to support evaluation of the framework.

CoHFF achieves results comparable to the current leading methods in subsequent 3D perception applications and provides more semantic detail about the road environment.

### Method overview

The CoHFF framework consists of four key modules:

1. **Occupancy prediction task network**: Converts 2D image data into 3D occupancy grids and extracts occupancy task features.
2. **Semantic segmentation task network**: Processes RGB data to generate feature maps and maps them into the 3D semantic segmentation space through deformable cross-attention.
3. **V2X feature fusion**: Merges features among CAVs through a deformable self-attention mechanism.
4. **Task feature fusion**: Combines all task features to enhance semantic occupancy prediction.

### Experimental results

The experimental results show that CoHFF outperforms existing methods in both 3D object detection and BEV semantic segmentation. In 3D object detection, CoHFF reaches 48.51 on AP@0.5 and 36.39 on AP@0.7, significantly outperforming other methods. In BEV semantic segmentation, CoHFF also performs well on vehicle and road predictions and can detect a wider range of other semantic categories.

### Ablation study

The ablation study shows that semantic and occupancy features obtained independently can strengthen both of the original tasks at the same time. Specifically:

- **Occupancy prediction task**: Processing depth predictions through the occupancy prediction task network improves overall prediction accuracy. Adding the features of the semantic segmentation task network significantly improves the prediction accuracy for large objects, but the mIoU decreases slightly.
- **Semantic segmentation task**: After integrating the occupancy prediction features, the IoU increases by about 2% and the mIoU increases by more than 41%. This is attributed to the occupancy prediction features helping to detect smaller-scale objects.

In conclusion, the paper addresses the shortcomings of existing methods in 3D semantic occupancy prediction by proposing the CoHFF framework, and it offers a new direction for further work on collaborative perception.
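
The summary above reports results as IoU and mIoU over voxels. As a rough illustration of how these two metrics are commonly computed for semantic occupancy grids, here is a minimal Python sketch. It is not the authors' evaluation code. The label convention (`EMPTY = 0` for free space), the flattening of the voxel grid into a plain list, the function names, and the choice to skip classes that appear in neither grid are all assumptions made for this example.

```python
from typing import Dict, Sequence

EMPTY = 0  # assumed label for free space; semantic classes are 1..num_classes


def _check(pred: Sequence[int], target: Sequence[int]) -> None:
    """Both grids must describe the same set of voxels."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")


def occupancy_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Class-agnostic IoU: a voxel is 'occupied' if its label is not EMPTY."""
    _check(pred, target)
    inter = sum(1 for p, t in zip(pred, target) if p != EMPTY and t != EMPTY)
    union = sum(1 for p, t in zip(pred, target) if p != EMPTY or t != EMPTY)
    return inter / union if union else 0.0


def per_class_iou(pred: Sequence[int], target: Sequence[int],
                  num_classes: int) -> Dict[int, float]:
    """IoU per semantic class; classes absent from both grids are skipped."""
    _check(pred, target)
    ious: Dict[int, float] = {}
    for c in range(1, num_classes + 1):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious[c] = inter / union
    return ious


def mean_iou(pred: Sequence[int], target: Sequence[int],
             num_classes: int) -> float:
    """mIoU: unweighted mean of the per-class IoUs that could be computed."""
    ious = per_class_iou(pred, target, num_classes)
    return sum(ious.values()) / len(ious) if ious else 0.0


# Toy example: 8 voxels, two semantic classes (values are made up).
target = [0, 1, 1, 2, 2, 2, 0, 0]
pred = [0, 1, 2, 2, 2, 0, 0, 1]

print(round(occupancy_iou(pred, target), 3))  # 0.667
print(round(mean_iou(pred, target, 2), 3))    # 0.417
```

In the toy example the occupancy IoU (0.667) is higher than the mIoU (0.417), because a voxel can be correctly marked occupied while carrying the wrong class. The two metrics therefore need not move together when features are added, which is the pattern the ablation study describes.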

Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles

AFOcc: Multi-Modal Semantic Occupancy Prediction with Accurate Fusion

Collaborative Joint Perception and Prediction for Autonomous Driving

Semantic Scene Completion in Autonomous Driving: A Two-Stream Multi-Vehicle Collaboration Approach

OccFusion: Multi-Sensor Fusion Framework for 3D Semantic Occupancy Prediction

Collaborative Perception in Autonomous Driving: Methods, Datasets and Challenges

Practical Collaborative Perception: A Framework for Asynchronous and Multi-Agent 3D Object Detection

Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction

Enhancing 3D object detection through multi-modal fusion for cooperative perception

Offboard Occupancy Refinement with Hybrid Propagation for Autonomous Driving

Predicting Future Spatiotemporal Occupancy Grids with Semantics for Autonomous Driving

SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving

Cooper: Cooperative Perception for Connected Autonomous Vehicles based on 3D Point Clouds

Collaborative Perception for Connected and Autonomous Driving: Challenges, Possible Solutions and Opportunities

Real-Time 3D Occupancy Prediction via Geometric-Semantic Disentanglement

Adaptive Feature Fusion for Cooperative Perception using LiDAR Point Clouds

Edge Computing-Based Collaborative Vehicles 3D Mapping in Real Time

AdaOcc: Adaptive-Resolution Occupancy Prediction

CMP: Cooperative Motion Prediction with Multi-Agent Communication

Multimedia Fusion at Semantic Level in Vehicle Cooperactive Perception